AI Style-Transfer Art

If you want to enjoy my art, click through the slideshow below! If you want to learn more about how this works, scroll down! Click on each image and then click "go to link" to see the original two images.

Starry Turtle

Vivid Shrubbery

Picasso's Beach

Geometric Bloom

Flowing Flower

Fall Valley


Christmas Painting

Fall Road

Wet Woodpile

Frog and Flowers

How Does Style Transfer Work?

As seen above, style transfer isn't a "one and done" process. In fact, it's iterative (only the first few epochs are displayed in the gif).

Style transfer uses backpropagation in an unusual way. Normally, when you backpropagate and run gradient descent in machine learning, you're optimizing a set of weight/bias matrices; in other words, you're learning a function. Those matrices have no special significance beyond the fact that they do what we want them to do, whether that's classification, prediction, or some other task.

In style transfer, however, the output image itself is the set of trained weights. Why? Consider the ideal style transfer image. It must not lose the fundamental characteristics of the content image (so the turtle doesn't turn into a bear!), but it must also take on the look of the style image. This leads us to two loss functions. If you're not familiar with machine learning, a loss function is the metric a model is optimized against, with the goal of minimizing it. Note: from now on, we'll treat all images as 2D matrices (they're technically 3D, but we'll ignore that for ease of intuition).


- content image: the image we want to apply a "style" to

- style image: the source of the style, typically a painting

- output image: the style transfer image

Loss #1: the content loss

This loss is quite intuitive: take the L2 loss (aka squared loss) element-wise between the content and output matrices.
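As a concrete sketch, here's that element-wise L2 loss in NumPy on a toy 2x2 pair of "images" (the function name and values are just my own example):

```python
import numpy as np

def content_loss(content, output):
    """Mean element-wise squared (L2) loss between two image matrices."""
    return np.mean((content - output) ** 2)

content = np.array([[1.0, 2.0], [3.0, 4.0]])
output = np.array([[1.0, 2.0], [3.0, 6.0]])
print(content_loss(content, output))  # one pixel differs by 2 -> 4/4 = 1.0
```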

Loss #2: the style loss

Now, this one is quite different and definitely non-intuitive. After all, how does one define "style"? There is actually a differentiable way to capture it, built on the Gram matrix! It's a bit like a correlation matrix: you multiply a matrix by its own transpose, giving all possible dot products between its rows. The Gram matrix itself isn't the "loss", though. You use it to compare against other Gram matrices, with the same L2 loss described in loss #1.
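In NumPy, the Gram matrix and the resulting style comparison look like this (a toy sketch with made-up matrices; notice that two different matrices can share a Gram matrix, which is exactly why it captures "style" rather than exact content):

```python
import numpy as np

def gram_matrix(m):
    """A matrix times its own transpose: all pairwise dot products of rows."""
    return m @ m.T

def style_loss(style, output):
    """L2 loss between the Gram matrices of two matrices."""
    return np.mean((gram_matrix(style) - gram_matrix(output)) ** 2)

a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
# Different matrices, but both Gram matrices are the identity -> loss 0
print(style_loss(a, b))  # 0.0
```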

Now, a naive approach would be to apply these loss functions directly to the style, content, and output images, but there is a fundamental problem: the losses are pulling the output image in two directions! The more "stylish" that output becomes, the less it will represent the content image and thus the higher loss 1 (content loss) will be. However, the closer that an image represents the content image, the less "style" it contains. How do we solve this problem?

Let's call in the help of a supermodel! No, not a catwalk supermodel. I'm talking about the mighty VGG19 convolutional neural network image classifier (image in, label out), which achieves around 91% accuracy on ImageNet, a standard benchmark image dataset. How can this supermodel help us? Our goal is to make these two losses work together while converging on our final style transfer image.

To answer this question, we must first understand how convolutional neural networks (CNNs) work. In a nutshell, a CNN applies various "filters" to an image, building broader and broader feature representations as we move through the network. On the first layer it might detect edges, but on its last few layers the "filters" respond to human faces or entire objects. That level of detection gives an optimal feature representation. (This explanation of CNNs is dreadfully short; to understand the magic of convolutions, especially in Fourier space, I recommend reading other sources, such as posts on Medium.)
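To make "filters" concrete, here's a minimal NumPy sketch of a single convolutional filter that detects vertical edges. The kernel and image are my own toy examples, not VGG19's actual learned filters:

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'valid' 2D cross-correlation: the core operation of a CNN layer."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny vertical-edge filter: responds where brightness changes left-to-right
edge_kernel = np.array([[-1.0, 1.0]])

# An image that is dark on the left half, bright on the right half
image = np.array([[0.0, 0.0, 1.0, 1.0]] * 3)
print(convolve2d(image, edge_kernel))
# nonzero only in the column where the edge sits
```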

Still, how can this help us? VGG19 represents all of an image's "features" in its last few layers. These feature representations are very generic: a picture of a car flipped, mirrored, in a snowstorm, riding on a dragon (ok, maybe not the last part) will activate roughly the same neurons as a normal car picture taken in a studio. Likewise, our ideal style transfer image, as long as it doesn't stray from the "big ideas" of the content, will produce nearly the same activations in the last few layers of VGG19 as the original content image.

So, instead of taking the content loss between the content and output images directly, we feed both images into VGG19 and take the content loss between the activations of its last few layers for each.
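Here's a toy NumPy sketch of that idea. A real implementation would feed both images through a frozen pretrained VGG19 (e.g., torchvision's `vgg19`); here I substitute a tiny fixed two-layer stand-in "network" purely to show where the loss is taken, which is my own simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for VGG19: two fixed random "layers" (a real pipeline would use
# pretrained, frozen VGG19 weights instead)
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))

def features(image):
    """Activations of the stand-in network's last layer."""
    h1 = np.maximum(0.0, W1 @ image)   # layer 1 + ReLU
    return np.maximum(0.0, W2 @ h1)    # layer 2 + ReLU

def feature_content_loss(content, output):
    """L2 content loss taken in feature space, not pixel space."""
    return np.mean((features(content) - features(output)) ** 2)

content = rng.standard_normal((4, 4))
print(feature_content_loss(content, content))  # identical images -> 0.0
```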

Style loss is very similar, but for deeper mathematical reasons, instead of comparing only the last few layers of an output-fed VGG19 and a style-fed VGG19, we compute Gram matrices at each layer of VGG19 for both the style and output images, and take the L2 loss between each pair of Gram matrices.
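Summing the Gram-matrix comparison across layers looks like this in NumPy (another toy sketch: the random matrices stand in for per-layer VGG19 activations, and the size normalization is my own choice to keep layers of different shapes comparable):

```python
import numpy as np

def gram(m):
    # normalize by size so layers of different shapes contribute comparably
    return (m @ m.T) / m.size

def multi_layer_style_loss(style_feats, output_feats):
    """Sum of L2 losses between Gram matrices, one pair per network layer.

    style_feats / output_feats: lists of per-layer activation matrices, as
    you'd get by running the style and output images through the network.
    """
    return sum(np.mean((gram(s) - gram(o)) ** 2)
               for s, o in zip(style_feats, output_feats))

rng = np.random.default_rng(1)
feats = [rng.standard_normal((3, 5)), rng.standard_normal((4, 6))]
print(multi_layer_style_loss(feats, feats))  # identical activations -> 0.0
```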

We made it! The rest is easy. With the loss functions defined, you define your output matrix and use gradient descent (the Adam optimizer usually works better) to minimize the losses with respect to that matrix, treating it as a "weight". You also assign scalar weights to the style and content losses. Usually, you weight the content loss about 10,000x more than the style loss to make sure the model doesn't simply recreate the artwork. The standard practice is also to initialize the output image as a copy of the content image.
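The whole loop can be sketched end to end in NumPy. This toy version takes both losses directly in pixel space (a real pipeline would take them on VGG19 activations and typically use Adam), with the gradients derived by hand for this simplified case; the learning rate and weights are placeholders, with the content loss weighted 10,000x as above:

```python
import numpy as np

def gram(m):
    return m @ m.T

def total_loss(x, content, style, cw=1e4, sw=1.0):
    c = np.mean((x - content) ** 2)                 # content loss
    s = np.mean((gram(x) - gram(style)) ** 2)       # style loss
    return cw * c + sw * s

def total_grad(x, content, style, cw=1e4, sw=1.0):
    # hand-derived gradients for this toy pixel-space version
    grad_c = 2.0 * (x - content) / x.size
    d = gram(x) - gram(style)       # symmetric difference of Gram matrices
    grad_s = 4.0 * (d @ x) / d.size
    return cw * grad_c + sw * grad_s

rng = np.random.default_rng(2)
content = rng.standard_normal((4, 4))
style = rng.standard_normal((4, 4))

x = content.copy()              # initialize the output as the content image
lr = 1e-4
for step in range(200):         # plain gradient descent on the *image*
    x -= lr * total_grad(x, content, style)

# the optimized image has lower total loss than the raw content image
print(total_loss(x, content, style) < total_loss(content, content, style))
```

The key point is the last loop: the thing being updated is the image itself, exactly the "output image as trained weights" idea described earlier.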

Shown below is a diagram about the loss functions, which is the trickiest part of understanding style transfer. 

Here's another gif of style transfer evolution. Hopefully, you appreciate this better now that you know how this stuff works! 

Website Last Edited 10-11-2020

Is something not working? Email me! (maxjdu [bat - b] stanford [circular point] edu)

