- These are notes on Hung-yi Lee's 2021 ML course.
Contents
- Basic Idea of GAN
- Conditional Generation by GAN
- Unsupervised Conditional Generation
- Theory behind GAN
- fGAN: General Framework of GAN
- Tips for Improving GAN
- How to Train a GAN? Tips and tricks to make GANs work
- Feature Extraction
- Photo Editing
- Sequence Generation
- Evaluation of GAN
- More Generative Models
Basic Idea of GAN
All Kinds of GAN…
Basic Idea of GAN
Unconditional Generation
- Image Generation
- Sentence Generation
Conditional Generation
- We control what to generate (e.g. generate an image from a given text, or generate another image from a given image (style transfer) …)
Generator outputs a complex distribution
- The data we want to generate has a distribution $P_{data}(x)$. A generator $G$ is a network, and the network defines a probability distribution $P_G(x)$.
- It is difficult to compute $P_G(x)$; we can only sample from it. $\Rightarrow$ It is hard to measure the closeness between $P_G(x)$ and $P_{data}(x)$.
Generator
- Generator: a neural network (NN), or a function
- Input: a vector; Each dimension of input vector may represent some characteristics
- Output: a high-dimensional vector (image / sentence) $\Rightarrow$ a complex distribution
Discriminator
- Discriminator: a neural network (NN), or a function
- Input: a high dimensional vector (image / sentence)
- Output: a scalar (Larger value means real, smaller value means fake)
Generator and Discriminator
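- First we prepare a dataset of real images. Generator v1 then produces a batch of images from input vectors, but because its parameters are randomly initialized, these images are essentially random output. We can now train Discriminator v1 to tell which images are generated and which are real. After training Discriminator v1, we switch to training the generator so that its images fool Discriminator v1 as well as possible (that is, they receive high scores from Discriminator v1). This gives Generator v2…
- Repeating this process keeps producing better generators and discriminators…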
This is where the term “adversarial” comes from.
Algorithm
- Initialize generator and discriminator
- In each training iteration:
- Step 1: Fix generator $G$, and update discriminator $D$ (the discriminator learns to assign high scores to real objects and low scores to generated objects)
- Step 2: Fix discriminator $D$, and update generator $G$ (the generator learns to "fool" the discriminator)
- How to implement? Chain the generator and the discriminator together and treat them as one large network whose final scalar output should be as large as possible. During this update, only the parameters of the first layers, the ones that belong to the generator, are adjusted.
- Mathematical description of the algorithm:
Note: the input vector is sampled from some distribution (uniform, Gaussian, …); its dimensionality is a hyperparameter that may need tuning
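The alternating updates can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the course's implementation: the data are 1-D samples from a Gaussian with mean 4, the generator is the hypothetical one-parameter map `G(z) = z + theta`, the discriminator is the hypothetical logistic model `D(x) = sigmoid(a * x + b)`, and gradients are written out by hand. The generator step labels generated samples as positive (it ascends `log D(G(z))`).

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_toy_gan(steps=3000, batch=64, lr=0.05, real_mean=4.0, seed=0):
    """Alternate discriminator and generator updates on 1-D data."""
    rng = random.Random(seed)
    theta = 0.0      # generator parameter: G(z) = z + theta
    a, b = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(a * x + b)
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]

        # Step 1: fix G, gradient ascent on
        # mean log D(real) + mean log(1 - D(fake)).
        grad_a = grad_b = 0.0
        for x in real:
            d = sigmoid(a * x + b)
            grad_a += (1.0 - d) * x
            grad_b += 1.0 - d
        for x in fake:
            d = sigmoid(a * x + b)
            grad_a -= d * x
            grad_b -= d
        a += lr * grad_a / batch
        b += lr * grad_b / batch

        # Step 2: fix D, gradient ascent on mean log D(G(z)).
        # Only the generator parameter theta is updated here.
        grad_theta = 0.0
        for x in fake:
            d = sigmoid(a * x + b)
            grad_theta += (1.0 - d) * a
        theta += lr * grad_theta / batch
    return theta, a, b


theta, a, b = train_toy_gan()
print("generator shift:", theta, "discriminator:", a, b)
```

In this toy setting the shift `theta` should drift from 0 toward the real mean, which is the sense in which the generator learns to fool the discriminator.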
GAN as structured learning
Structured Learning / Prediction
- Output is composed of components with dependency (e.g. output a sequence, a matrix, a graph, a tree …)
Why is Structured Learning Challenging?
- One-shot / Zero-shot Learning:
- In classification, each class has some examples.
- In structured learning, if you consider each possible output as a "class", the output space is so large that most "classes" have no training data. So the machine has to create new things at test time.
- The machine has to learn to plan.
- The machine generates objects component by component, but it should keep the big picture in mind. (Because the output components have dependencies, they should be considered globally.)
Structured Learning Approach
Can Generator learn by itself?
Traditional Supervised Learning
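- In conventional supervised learning, we can collect a dataset whose inputs are vectors following some distribution and whose labels are the corresponding images, then train the network on it directly. (The difficulty: how do we decide which vector goes with each image? $\rightarrow$ When assigning vectors by hand, similar images should get similar vectors; e.g. let the first component encode which digit the image shows and the second component encode its slant.)
- A more convenient way to obtain a vector for each image: the encoder in an auto-encoder provides the code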
Auto-encoder
- Encoder: Compact representation of the input object
- Decoder: Reconstruct the original object
- Train: chain the encoder and decoder together and require the output to be as similar to the input as possible. Note that the decoder is exactly the generator we want!
Decoder as a generator
- Problem: the training data is finite, so the quality of the decoder is hard to guarantee
- Solution: Variational Auto-encoder (VAE)
Variational Auto-encoder (VAE)
- The encoder produces not only a code $(m_1,m_2,m_3)$ but also a variance term for each dimension $(\sigma_1,\sigma_2,\sigma_3)$. A noise vector $(e_1,e_2,e_3)$ is then sampled from a standard normal distribution and multiplied element-wise with the variance vector to give the final noise, which is added to the code. The decoder therefore has to reconstruct the original image from a noisy code, which makes the decoder more robust
- If training only minimizes the reconstruction error, the noise $\boldsymbol e$ just gets in the decoder's way, so the network is likely to learn $\boldsymbol\sigma=\boldsymbol 0$. VAE training therefore usually adds another constraint (see the paper for the derivation):
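The noisy code and the extra constraint can be sketched as follows. The notes multiply the noise by the variance vector directly; this sketch instead assumes the common parameterization in which the encoder output $\sigma_i$ is exponentiated, $c_i = m_i + \exp(\sigma_i)\,e_i$, with the regularizer $\sum_i \left(\exp(\sigma_i) - (1+\sigma_i) + m_i^2\right)$ added to the reconstruction error. Under that assumption the noise-free solution corresponds to $\exp(\sigma_i) \to 0$, and the regularizer penalizes it. The function names are hypothetical.

```python
import math
import random


def noisy_code(m, sigma, rng):
    """c_i = m_i + exp(sigma_i) * e_i, with e_i drawn from N(0, 1)."""
    return [mi + math.exp(si) * rng.gauss(0.0, 1.0) for mi, si in zip(m, sigma)]


def vae_regularizer(m, sigma):
    """sum_i exp(sigma_i) - (1 + sigma_i) + m_i^2.

    Each term is non-negative and is zero only at m_i = 0, sigma_i = 0,
    so shrinking the noise scale exp(sigma_i) toward 0 is penalized.
    """
    return sum(math.exp(si) - (1.0 + si) + mi ** 2 for mi, si in zip(m, sigma))


rng = random.Random(0)
print(noisy_code([0.5, -1.0, 2.0], [0.0, 0.0, 0.0], rng))
print(vae_regularizer([0.0, 0.0, 0.0], [-5.0, -5.0, -5.0]))
```

The regularizer here is twice the KL divergence between $N(m, \exp(\sigma))$ and a standard normal, which is why it vanishes exactly when the code distribution is standard normal.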
What do we miss?
- With an auto-encoder we want the output to be close to the input (e.g. measuring image similarity by Euclidean distance). But it does not really try to simulate real images!
- It will be fine if the generator can truly copy the target image. But what if the generator makes some mistakes? Some mistakes are serious, while some are fine.
- The key point is that in structured learning the relations between components matter a great deal, and the approach above cannot capture the correlation between components well $\rightarrow$ a deep structure is needed to catch the relation between components. (Compared with a GAN, generating images with an auto-encoder's decoder usually needs a deeper network.)
Can Discriminator generate?
- It is easier to catch the relation between the components by top-down evaluation
How to learn the discriminator?
- I only have some real images $\Rightarrow$ the discriminator only learns to output "1" (real)
- Discriminator training needs some negative examples (Quality of negative examples is critical)
How to generate realistic negative examples? - General Algorithm
- Given a set of positive examples, randomly generate a set of negative examples.
- In each iteration
- Learn a discriminator $D$ that can discriminate positive and negative examples.
- Generate negative examples by discriminator $D$:
$$\hat x=\arg\max_{x\in\mathcal X}D(x)$$
- The key is therefore solving the $\arg\max$ problem (the simplest way is to enumerate every possible $x$, but that is far too expensive). In a GAN, the generator is in effect what solves the $\arg\max$ problem and thereby produces the negative examples
GAN: Anime Face Generation
- Source of images: http://zhuanlan.zhihu.com/p/24767059
- DCGAN (Deep CNN GAN): http://github.com/carpedm20/DCGAN-tensorflow
- In 2019, with StyleGAN ……
- Progressive GAN: Progressive Growing of GANs for Improved Quality, Stability, and Variation
- Today …… BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis
Conditional Generation by GAN
Conditional GAN
- paper:
- Conditional GAN: Conditional Generative Adversarial Nets
- Class conditional image generation: Conditional Image Synthesis With Auxiliary Classifier GANs
Text-to-Image
- Traditional supervised approach. Problem: the same description can correspond to many images, and the network tries to minimize its distance to all of them, so it may end up producing a blurry image (it is blurry because it is the average of several images).
- e.g. Text: "train"; annotation: photos of different kinds of trains taken from different angles; the network's output may be a blurry blend of several trains
Conditional GAN
- Generator: besides a vector $\boldsymbol z$, it is also given a text (the condition) and generates a matching image. Because $\boldsymbol z$ is drawn from a distribution, the output $\boldsymbol x$ also follows a distribution (the generator learns to approximate $P(x|c)$)
- Why output a distribution?
- The same input can have different outputs $\Rightarrow$ this matters especially for tasks that need "creativity" (for conditional generation)
- It avoids generating blurry images
- To keep the generator from ignoring the condition, one can also drop $z$ and add dropout to the generator instead, which still makes the output random
- Discriminator: if we reuse the earlier discriminator, the generator only learns to produce realistic images (and completely ignores the input condition), so the following changes are needed:
- Training data: $(\hat c,\hat x)$
- Positive example: $(\hat c,\hat x)$; negative examples: $(\hat c,G(\hat c))$ and $(\hat c',\hat x)$
- Training algorithm
Note that when training the discriminator, the objective being maximized covers two kinds of error (a fake image, and a condition that does not match the real image)
The last update in the algorithm should read $\theta_g\leftarrow\theta_g+\eta\nabla\tilde V(\theta_g)$ (gradient ascent)
Different discriminator architectures
The architecture below separates the two kinds of error more clearly (the generated image is not realistic enough; the condition and the image do not match)
StackGAN
- paper: StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
- idea: first generate a low-resolution image, then progressively generate higher-resolution ones
Image-to-image (Patch GAN)
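- The goal is to generate realistic buildings from geometric layouts. In the figure below,
- "close" means traditional supervised learning (make the output as close as possible to the real image); the images it generates are rather blurry
- "GAN" means a conditional GAN; its images are sharper, but they also contain some odd extra structures
- "GAN + close" means adding one more objective when training the generator: besides raising the discriminator's score, the generated image should also be as close as possible to the real image (the red arrow in the figure); the results look quite good
Patch GAN
- The image-to-image work above also proposes Patch GAN, which improves results by changing the discriminator's structure. A conventional discriminator takes the whole image and outputs one score, but for large images this can require many parameters, which is costly and prone to overfitting during training. The main idea of Patch GAN is to look at only one part (patch) of a large image at a time and output a score for that part (the patch size is a hyperparameter)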
Speech Enhancement
e.g. removing noise from speech
- The speech below is represented as a spectrogram, so image-processing network architectures can be applied directly
- Conditional GAN
Video Generation
- Generator: show the generator a video clip and have it predict what happens next
Unsupervised Conditional Generation
Unsupervised Conditional Generation
- Transform an object from one domain to another without paired data (e.g. style transfer: we only have a set of landscape photos and a set of paintings, with no one-to-one correspondence between them)
Approach 1: Direct Transformation (For texture or color change)
Direct Transformation
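- Problem: ignoring the input (the discriminator only judges whether a picture belongs to the painting domain, so the generator may learn to output certain paintings that have nothing to do with the input photo)
- The issue can be avoided by network design. A simpler generator makes the input and output more closely related. (A shallow network is less affected by this problem and can be trained directly.)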
Encoder Network
CycleGAN
- We can also learn two generators and two discriminators at the same time
Issue of Cycle Consistency
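- paper: CycleGAN, a Master of Steganography
- CycleGAN hides information about the input and reveals it again at the output (the generator hides the information where humans cannot see it) (e.g. in the figure below, the black dots on the roof disappear)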
Related Work
- Dual GAN
- Disco GAN
The same approach as CycleGAN (different people came up with it at the same time and published it at different venues…)
StarGAN (multiple domains)
- StarGAN needs only 1 generator and 1 discriminator to translate among multiple domains (it also uses cycle consistency)
Approach 2: Projection to Common Space (Larger change, only keep the semantics)
Compared with Direct Transformation, Projection to Common Space supports larger changes
Projection to Common Space
Target
Training
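- Following the auto-encoder idea, this amounts to training two auto-encoders (the red and blue arrows in the figure)
- If we only learn auto-encoders, the images output by the decoders are blurry, so discriminators can be added as well, which amounts to training two VAE-GANs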
Problem
- Because we train two auto-encoders separately, the images with the same attribute may not project to the same position in the latent space.
- Latent space: a learned representation space in which the data's features are encoded and its representation simplified, so that patterns can be found
Sharing the parameters of encoders and decoders
- Coupled GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al., NIPS, 2017]
- Let the two encoders and the two decoders share parameters (the dashed lines in the figure below): the encoders share their last few layers and the decoders share their first few layers
- In the most extreme case all parameters are shared, and the encoder additionally takes a flag indicating which domain the image comes from
Domain Discriminator
- Domain Discriminator: the domain discriminator forces the outputs of $EN_X$ and $EN_Y$ to have the same distribution. [Guillaume Lample, et al., NIPS, 2017]
- Input: a latent vector; output: which domain the latent vector came from
Cycle Consistency:
- ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
Similar to CycleGAN
Semantic Consistency
- Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
- Compute the similarity of latent vectors $\Rightarrow$ similarity at the semantic level
U-GAT-IT
To learn more…
Theory behind GAN
Maximum Likelihood Estimation
- Given a data distribution $P_{data}(x)$
- We have a distribution $P_G(x;\theta)$ parameterized by $\theta$
- We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$
Maximum Likelihood Estimation = Minimize KL Divergence
- Sample $\{x^1,x^2,...,x^m\}$ from $P_{data}(x)$. We can compute $P_G(x^i;\theta)$.
- Likelihood of generating the samples:
$$L=\prod_{i=1}^m P_G(x^i;\theta)$$
- Find $\theta^*$ maximizing the likelihood (which also minimizes the KL divergence):
$$\begin{aligned}\theta^*&=\arg\max_\theta\prod_{i=1}^m P_G(x^i;\theta)=\arg\max_\theta\sum_{i=1}^m \log P_G(x^i;\theta) \\
&\approx \arg\max_\theta E_{x\sim P_{data}}[\log P_G(x;\theta)] \\
&=\arg\max_\theta \int_x P_{data}(x)\log P_G(x;\theta)dx \\
&=\arg\max_\theta\left[\int_x P_{data}(x)\log P_G(x;\theta)dx-\int_x P_{data}(x)\log P_{data}(x)dx\right] \\
&=\arg\min_\theta \int_x P_{data}(x)\log\frac{P_{data}(x)}{P_G(x;\theta)}dx \\
&=\arg\min_\theta KL(P_{data}(x)||P_G(x;\theta)) \end{aligned}$$
Generator is a NN
- Generated distribution:
$$P_G(x)=\int_z P_{prior}(z)\,\mathbb I_{[G(z)=x]}\,dz$$
- Difficult to compute the likelihood; hard to learn by maximum likelihood $\Rightarrow$ GAN can solve it!
$P_{prior}(z)$ can be a Gaussian distribution.
Generator (defines a distribution $P_G(x)$)
- A generator $G$ is a network. The network defines a probability distribution $P_G(x;\theta)$
Discriminator (evaluates the "difference" between $P_G(x)$ and $P_{data}(x)$)
Our objective
$$G^*=\arg\min_G Div(P_G,P_{data})$$
How to compute the divergence? - Sampling is good enough ……
- Although we do not know the distributions $P_G$ and $P_{data}$, we can sample from them.
Discriminator $D$ evaluates the "difference" between $P_G(x)$ and $P_{data}(x)$
- Example objective function for $D$ ($G$ is fixed):
$$V(G,D)=E_{x\sim P_{data}}[\log D(x)]+E_{x\sim P_G}[\log(1-D(x))]$$
- A sigmoid must follow the network so that the arguments of $\log$ are valid
- Training: using this example objective function is exactly the same as training a binary classifier (i.e. minimizing the cross-entropy error)
- The maximum objective value is related to JS divergence.
- Intuition: small divergence $\Rightarrow$ hard to discriminate (cannot make the objective large); large divergence $\Rightarrow$ easy to discriminate
$\max_DV(G,D)$
- Given $G$, what is the optimal $D^*$ maximizing (assume that $D(x)$ can be any function)
$$V(G,D)=\int_x\left[P_{data}(x)\log D(x)+P_G(x)\log(1-D(x))\right]dx$$
- Given $x$, the optimal $D^*$ maximizes (since $D(x)$ can be any function)
$$P_{data}(x)\log D(x)+P_G(x)\log(1-D(x))$$
i.e. find $D^*$ maximizing $f(D)=a\log(D)+b\log(1-D)$ with $a=P_{data}(x)$ and $b=P_G(x)$ (a concave function). Setting $f'(D)=a/D-b/(1-D)=0$ gives
$$D^*(x)=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$$
Note that $D^*$ takes values between 0 and 1, which agrees with putting a sigmoid after $D(x)$
- We can now substitute $D^*$ into $V(G,D)$ to obtain $\max_DV(G,D)$ and find that it is related to the Jensen-Shannon divergence; in other words, plugging the $D^*$ obtained by training $D$ into the objective $V$ yields the JS divergence (up to a constant factor and offset)
$$\begin{aligned}
\max_D V(G,D)&=V(G,D^*) \\
&=E_{x\sim P_{data}}\left[\log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}\right]+E_{x\sim P_G}\left[\log\frac{P_G(x)}{P_{data}(x)+P_G(x)}\right] \\
&=\int_x P_{data}(x)\log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}dx+\int_x P_G(x)\log\frac{P_G(x)}{P_{data}(x)+P_G(x)}dx \\
&=\int_x P_{data}(x)\log\frac{\frac12P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx+\int_x P_G(x)\log\frac{\frac12P_G(x)}{(P_{data}(x)+P_G(x))/2}dx \\
&=\int_x P_{data}(x)\left[\log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}-\log2\right]dx+\int_x P_G(x)\left[\log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}-\log2\right]dx \\
&=-2\log2+\int_x P_{data}(x)\log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx+\int_x P_G(x)\log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}dx \\
&=-2\log2+KL\left(P_{data}\,\Big\|\,\frac{P_{data}+P_G}{2}\right)+KL\left(P_G\,\Big\|\,\frac{P_{data}+P_G}{2}\right) \\
&=-2\log2+2\,JSD(P_{data}\|P_G)\quad\text{(Jensen-Shannon divergence)}
\end{aligned}$$
JS divergence $\in[0,\log2]$
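The identity $\max_D V(G,D) = -2\log 2 + 2\,JSD(P_{data}\|P_G)$ can be verified on small discrete distributions. This is an illustrative check only; the example distributions are made up, and the optimal discriminator $D^*(x)=P_{data}(x)/(P_{data}(x)+P_G(x))$ is substituted in closed form rather than trained.

```python
import math


def kl(p, q):
    """KL(p || q); terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def v_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total


p_data = [0.5, 0.5, 0.0]
p_g = [0.0, 0.5, 0.5]
print(v_at_optimal_d(p_data, p_g), -2 * math.log(2) + 2 * jsd(p_data, p_g))
# Disjoint supports: JSD reaches its maximum log 2 and V(G, D*) is 0.
print(jsd([1.0, 0.0], [0.0, 1.0]), v_at_optimal_d([1.0, 0.0], [0.0, 1.0]))
```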
Algorithm
Our objective
$$G^*=\arg\min_G Div(P_G,P_{data})=\arg\min_G\max_D V(G,D)$$
How to find $G^*$?
- (1) Initialize generator and discriminator
- (2) In each training iteration:
- Step 1: Fix generator $G$, and update discriminator $D$ $\Rightarrow$ given a generator $G$, $\max_D V(G,D)$ evaluates the "difference" between $P_G$ and $P_{data}$
- Step 2: Fix discriminator $D$, and update generator $G$ $\Rightarrow$ pick the $G$ defining the $P_G$ most similar to $P_{data}$
Notation
- $L(G)=\max_DV(G,D)$, i.e. the loss function for the generator
Algorithm
- Given $G_0$
- Find $D_0^*$ maximizing $V(G_0,D)$ (using gradient ascent): $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$
- Obtain $G_1$: Decrease JS divergence (?)
- Find $D_1^*$ maximizing $V(G_1,D)$: $V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$
- Obtain $G_2$: Decrease JS divergence (?)
- …
Decrease JS divergence (?)
- As flagged in the algorithm above, training the generator (gradient descent on $L(G)$) does not necessarily decrease the JS divergence. The reason is that once the generator changes, $V(G,D)$ evaluated with the same discriminator no longer measures the JS divergence
- So why can gradient descent on $L(G)$ be viewed as decreasing the JS divergence? Because we add an assumption: $D_0^*\approx D_1^*$
- This assumption requires us not to update $G$ too much
In practice, how to compute $\max_DV(G,D)$
- We can use sampling to approximate the expectation
- Sample $\{x^1,x^2,...,x^m\}$ from $P_{data}(x)$, sample $\{\tilde x^1,\tilde x^2,...,\tilde x^m\}$ from generator $P_G(x)$, and maximize
$$\tilde V=\frac1m\sum_{i=1}^m\log D(x^i)+\frac1m\sum_{i=1}^m\log(1-D(\tilde x^i))$$
Cross entropy error: $-y\log(\tilde y)-(1-y)\log(1-\tilde y)$
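The link between the sampled objective and binary classification can be made concrete. The sketch below assumes the sampled objective $\tilde V=\frac1m\sum_i\log D(x^i)+\frac1m\sum_i\log(1-D(\tilde x^i))$ and shows that it equals the negated sum of per-class mean cross-entropy errors, with real samples labeled 1 and generated samples labeled 0. The discriminator outputs in the example are made-up numbers.

```python
import math


def v_tilde(d_real, d_fake):
    """Sample estimate of V(G, D) from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


def cross_entropy(y, y_hat):
    """-y log(y_hat) - (1 - y) log(1 - y_hat) for a single example."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)


def classifier_loss(d_real, d_fake):
    """Mean cross entropy on real (label 1) plus mean on generated (label 0)."""
    real_loss = sum(cross_entropy(1, d) for d in d_real) / len(d_real)
    fake_loss = sum(cross_entropy(0, d) for d in d_fake) / len(d_fake)
    return real_loss + fake_loss


d_real = [0.9, 0.8, 0.7]  # hypothetical D(x^i) on real samples
d_fake = [0.2, 0.4, 0.1]  # hypothetical D(x~^i) on generated samples
print(v_tilde(d_real, d_fake), -classifier_loss(d_real, d_fake))
```

Maximizing `v_tilde` over the discriminator is therefore the same as minimizing `classifier_loss`.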
Summary
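- The discriminator is trained in order to measure the JS divergence, so in theory it should be trained to convergence in every iteration. In practice, $k$ steps of gradient ascent that give a rough lower bound of the JS divergence are enough, and $D$ need not be trained to convergence (even training to convergence may end in a local optimum, or the limited capacity of $D$ may keep it from the global optimum). In the extreme case, updating $D$ only once per iteration can still work well
- Recall the assumption made in the discussion of "Decrease JS divergence (?)". To keep it valid, the generator must not change too much per update, so in each iteration we take only 1 gradient-descent step on the parameters of $G$
- When training $G$, the parameters of $D$ are fixed, so the first term of $\tilde V$ does not depend on $G$, and only the second term of $\tilde V$ needs to be optimized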
Objective Function for Generator in Real Implementation
- Minimax GAN (MMGAN): the generator minimizes $\log(1-D(x))$ on generated $x$. Early in training $D(x)$ is small, meaning the generated images cannot fool the discriminator, and there the gradient of $\log(1-D(x))$ is small, so training is slow
- Non-saturating GAN (NSGAN): to fix this, replace $\log(1-D(x))$ with $-\log(D(x))$. Both have the same trend, but the latter trains faster at the start (there is no theoretical guarantee)
- Real implementation: label $x$ from $P_G$ as positive
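The difference in early-training speed can be seen from the gradients with respect to the discriminator's pre-sigmoid output. This sketch assumes $D(x)=\mathrm{sigmoid}(s)$ for a logit $s$ (an assumption consistent with the sigmoid output layer, though the notes do not write it this way), in which case $\frac{d}{ds}\log(1-D)=-D$ and $\frac{d}{ds}(-\log D)=-(1-D)$.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def grad_minimax(s):
    """d/ds of log(1 - sigmoid(s)): close to 0 when D is close to 0."""
    return -sigmoid(s)


def grad_non_saturating(s):
    """d/ds of -log(sigmoid(s)): close to -1 when D is close to 0."""
    return -(1.0 - sigmoid(s))


# Early in training the discriminator rejects generated samples: D is small.
for s in (-5.0, 0.0, 5.0):
    print(s, sigmoid(s), grad_minimax(s), grad_non_saturating(s))
```

Both gradients are negative, so both losses push the logit up, but at small $D$ only the non-saturating loss gives a large signal.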
fGAN: General Framework of GAN
- paper: f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
- One sentence: you can use any f-divergence (fGAN lets us minimize many different divergences, although in practice the differences between them are fairly small)
f-divergence
f-divergence
- $P$ and $Q$ are two distributions. $p(x)$ and $q(x)$ are the probabilities of sampling $x$.
$$D_f(P||Q)=\int_x q(x)f\left(\frac{p(x)}{q(x)}\right)dx$$
- $f$ is convex; $f(1)=0$
- $D_f(P||Q)$ evaluates the difference of $P$ and $Q$
If $P$ and $Q$ are the same distribution, $D_f(P||Q)$ has its smallest value, which is 0
- If $p(x)=q(x)$ for all $x$, then $D_f(P||Q)=\int_x q(x)f(1)dx=0$; because $f$ is convex, Jensen's inequality gives $D_f(P||Q)\geq f\left(\int_x q(x)\frac{p(x)}{q(x)}dx\right)=f(1)=0$
KL divergence: $f(x)=x\log x$
Reverse KL divergence: $f(x)=-\log x$
Chi Square divergence: $f(x)=(x-1)^2$
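The three choices of $f$ can be tried on discrete distributions. The sketch assumes the definition $D_f(P||Q)=\sum_x q(x)\,f\!\left(p(x)/q(x)\right)$ with strictly positive $p$ and $q$; the example distributions are made up.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for strictly positive p, q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def f_kl(x):
    return x * math.log(x)


def f_reverse_kl(x):
    return -math.log(x)


def f_chi_square(x):
    return (x - 1.0) ** 2


p = [0.2, 0.8]
q = [0.5, 0.5]
print("KL        :", f_divergence(p, q, f_kl))
print("reverse KL:", f_divergence(p, q, f_reverse_kl))
print("chi square:", f_divergence(p, q, f_chi_square))
print("P == Q    :", f_divergence(p, p, f_kl))
```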
Fenchel Conjugate
Every convex function $f$ has a conjugate function $f^*$, with $(f^*)^*=f$:
$$f^*(t)=\max_{x\in dom(f)}\{xt-f(x)\}$$
- As the figure below shows, $f^*(t)$ is always a convex function
For a fixed $t$, $f^*(t)$ is the largest value of $xt-f(x)$ over $x$. As in the figure, first draw, for every fixed $x$, the straight line $xt-f(x)$ as a function of $t$ (say there are 3 possible values of $x$); $f^*(t_1)$ is then the maximum of the 3 lines at $t=t_1$, and $f^*$ is the red curve (the upper envelope of the lines) in the figure
- e.g. when $f(x)=x\log x$ (KL divergence), we can compute $f^*(t)=e^{t-1}$
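The closed form $f^*(t)=e^{t-1}$ for $f(x)=x\log x$ can be checked numerically by maximizing $xt-f(x)$ over a grid of $x$ values. The grid range and resolution below are arbitrary choices for illustration.

```python
import math


def conjugate_on_grid(f, t, xs):
    """Approximate f*(t) = max_x { x * t - f(x) } over the candidate points xs."""
    return max(x * t - f(x) for x in xs)


def f_kl(x):
    return x * math.log(x)


# Candidate x values on (0, 20]; the maximizer exp(t - 1) lies inside for small t.
xs = [i / 1000 for i in range(1, 20001)]

for t in (0.0, 1.0, 2.0):
    print(t, conjugate_on_grid(f_kl, t, xs), math.exp(t - 1.0))
```

Each candidate $x$ contributes one straight line in $t$, so the grid maximum is the upper envelope of those lines.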
Connection with GAN
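- $f(x)$ and $f^*(t)$ are Fenchel conjugates of each other: $f^*(t)=\max_x\{xt-f(x)\}\Leftrightarrow f(x)=\max_t\{xt-f^*(t)\}$
- So when computing an $f$-divergence, $f$ can be replaced by an expression in its Fenchel conjugate:
$$D_f(P||Q)=\int_x q(x)\left(\max_t\left\{\frac{p(x)}{q(x)}t-f^*(t)\right\}\right)dx$$
- $D$ is a function whose input is $x$ and whose output is $t$; this gives a lower bound on $D_f(P||Q)$:
$$D_f(P||Q)\geq\int_x p(x)D(x)dx-\int_x q(x)f^*(D(x))dx$$
- Taking the largest lower bound makes it approach $D_f(P||Q)$:
$$D_f(P||Q)\approx\max_D\left\{E_{x\sim P}[D(x)]-E_{x\sim Q}[f^*(D(x))]\right\}$$
- This gives the expression for $D_f(P_{data}||P_G)$:
$$D_f(P_{data}||P_G)=\max_D\left\{E_{x\sim P_{data}}[D(x)]-E_{x\sim P_G}[f^*(D(x))]\right\}$$
- and from it the expression for $G^*$ (the original GAN has a different $V(G,D)$):
$$G^*=\arg\min_G\max_D\left\{E_{x\sim P_{data}}[D(x)]-E_{x\sim P_G}[f^*(D(x))]\right\}=\arg\min_G\max_DV(G,D)$$
- Given the $f$-divergence we want to minimize, we can now find its $f^*$, obtain $V(G,D)$, and train a GAN that minimizes that $f$-divergence!
- Next: what problem are $f$-divergences actually meant to solve?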
Mode Collapse, Mode Dropping
Mode Collapse
- Mode Collapse: when training a GAN, the real data distribution is broad but the generated data distribution is narrow
- e.g. as in the figure below, in image generation the outputs keep cycling through the same few images
Mode Dropping
- Mode Dropping: the real data distribution may have several modes, but the generated data covers only some of them. On the surface the generated data looks fine and even diverse enough, yet it represents only part of the real data
Why?
- Intuitively, Mode Collapse and Mode Dropping are easy to understand: once the generator learns to produce some kind of image that always fools the discriminator, it keeps producing that kind of image
- Dive deeper: Flaw in Optimization? (just a guess…): when $P_{data}>0, P_G=0$, the KL divergence $\rightarrow\infty$, so minimizing KL pushes $P_G$ to cover all of $P_{data}$; mode collapse does not occur, but the quality of the generated images is not very high. When minimizing reverse KL, the divergence $\rightarrow\infty$ where $P_G>0, P_{data}=0$, so the generator may become very conservative, which leads to mode collapse
- Experiments show, however, that choosing a different divergence does not effectively alleviate Mode Collapse or Mode Dropping
Ensemble
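- Ensembling can effectively avoid Mode Collapse and Mode Dropping. For example, to produce 25 images we can train 25 GANs and let each generate one image. Even if every GAN suffers from Mode Collapse or Mode Dropping, the 25 resulting images will still differ from one another (to produce a single image, pick one generator at random)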
Double-loop v.s. Single-step
Tips for Improving GAN
JS divergence is not suitable
- In most cases, $P_G$ and $P_{data}$ do not overlap. Why?
- (1) The nature of data: both $P_G$ and $P_{data}$ are low-dimensional manifolds in a high-dimensional space (e.g. a 2-D sheet folded inside 3-D space). The overlap can be ignored.
- (2) Sampling: even if $P_G$ and $P_{data}$ do overlap, without enough samples the sampled points have no intersection
What is the problem of JS divergence?
- JS divergence is $\log2$ if two distributions do not overlap. (At the start of GAN training, $P_G$ and $P_{data}$ do not overlap, so the JS divergence stays at $\log2$ and the generator's loss $L(G)$ stays at 0, which makes training hard to converge.)
- A quote from SNGAN: "When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such discriminator is produced in this situation, the training of the generator comes to complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of the discriminator."
- Solution: (1) Weaken the discriminator: add dropout or update it less often so that it cannot overfit; but a discriminator that is too weak has its own problem, because it can no longer measure the JS divergence… (2) Add noise (noise decays over time): add some artificial noise to the inputs of the discriminator, or make the labels noisy for the discriminator $\Rightarrow$ $P_{data}(x)$ and $P_G(x)$ have some overlap, and the discriminator cannot perfectly separate real and generated data
- The original GAN has another problem: the final activation of $D$ is a sigmoid, which easily causes vanishing gradients $\Rightarrow$ Least Square GAN (LSGAN): replace sigmoid with linear (replace classification with regression)
Wasserstein GAN (WGAN)
Trivia: the W in Wasserstein is pronounced like a V (the name is Russian)
- One sentence for WGAN: Using Earth Mover’s Distance to evaluate two distributions
Earth Mover’s Distance (Wasserstein Distance)
Earth Mover’s Distance
- Consider one distribution $P$ as a pile of earth and another distribution $Q$ as the target. The earth mover's distance is the average distance the earth mover has to move the earth.
- There are many possible "moving plans". The "moving plan" with the smallest average distance defines the earth mover's distance.
Formal Definition
- A "moving plan" is a matrix. The value of each element is the amount of earth moved from one position to another.
- Average distance of a plan $\gamma$ ($\gamma(x_p,x_q)$: the amount of earth moved from $x_p$ to $x_q$):
$$B(\gamma)=\sum_{x_p,x_q}\gamma(x_p,x_q)\,\|x_p-x_q\|$$
- Earth Mover's Distance ($\Pi$: the set of all possible plans):
$$W(P,Q)=\min_{\gamma\in\Pi}B(\gamma)$$
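For histograms on a shared 1-D grid, the minimization over moving plans has a simple closed form: sweep left to right and accumulate the amount of earth that still has to be carried past each position. The sketch below uses that 1-D shortcut (with unit spacing between bins as an assumption); it does not search over general plans.

```python
def earth_movers_distance_1d(p, q):
    """EMD between two histograms on the same 1-D grid with unit bin spacing.

    `carried` is the surplus earth that must cross to the next bin; its
    absolute value is the work done on that unit-length step.
    """
    carried = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi
        total += abs(carried)
    return total


# Both pairs have disjoint supports, yet the distance still tells them apart.
print(earth_movers_distance_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))
print(earth_movers_distance_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Unlike the JS divergence, which is $\log 2$ for every pair of non-overlapping distributions, this distance shrinks as the two piles get closer.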
Why Earth Mover’s Distance?
- In a GAN, the Wasserstein distance has better mathematical properties than an f-divergence: it is continuous everywhere and differentiable almost everywhere, with a gradient that is not 0
Evaluate Wasserstein distance
- Evaluate the Wasserstein distance between $P_{data}$ and $P_G$ (the proof is involved and omitted here):
$$V(G,D)=\max_{D\in 1\text{-Lipschitz}}\left\{E_{x\sim P_{data}}[D(x)]-E_{x\sim P_G}[D(x)]\right\}$$
- Smooth: the function does not change fast
- Without the constraint, the training of $D$ will not converge: real and generated data rarely overlap, so the discriminator would push its value on real data toward $\infty$ and its value on generated data toward $-\infty$, and training would never converge
Lipschitz Function
- $||f(x_1)-f(x_2)||\leq K||x_1-x_2||$: the change in output is at most $K$ times the change in input
- $K=1$ for "1-Lipschitz" $\Rightarrow$ $||D(x_1)-D(x_2)||\leq||x_1-x_2||$
- How to fulfill this constraint?
Original version (weight clipping)
Weight Clipping
- Force the parameters $w$ to stay between $-c$ and $c$: after each parameter update, if $w > c$, set $w = c$; if $w < -c$, set $w = -c$
- Intuition: because $w$ is bounded, the output can only change by a limited amount when the input changes
- Limitations:
- (1) We only ensure that $||D(x_1)-D(x_2)||\leq K||x_1-x_2||$ for some $K$. The resulting Wasserstein distance is $K$ times the original
- (2) We do not truly find the function $D$ maximizing the objective: some $w$ violate the clipping condition yet still give a $D$ that satisfies the 1-Lipschitz constraint; in other words, weight clipping covers only a subspace of all the functions that satisfy the 1-Lipschitz constraint
- Without WGAN, real and generated data usually do not overlap, so the JS divergence stays at $\log2$ and the generator's loss $L(G)$ stays at 0 during training; the loss therefore says nothing about how similar the real and generated data are
- With WGAN, the Wasserstein distance can be used to measure how well GAN training is going!
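Weight clipping itself is a one-line operation applied after each parameter update. The sketch below uses a hypothetical linear critic $D(x)=w\cdot x$ to show why clipping bounds how fast the output can change: for $n$ weights clipped to $[-c, c]$, $\|w\|_2\leq c\sqrt n$, so the critic is $K$-Lipschitz with $K=c\sqrt n$ rather than exactly 1-Lipschitz.

```python
import math


def clip_weights(weights, c):
    """Force every weight into [-c, c], as done after each critic update."""
    return [max(-c, min(c, w)) for w in weights]


def linear_critic(weights, x):
    """Hypothetical critic D(x) = w . x, used only for this illustration."""
    return sum(w * xi for w, xi in zip(weights, x))


def lipschitz_bound(c, n):
    """Upper bound on the Lipschitz constant of a linear critic with n clipped weights."""
    return c * math.sqrt(n)


w = clip_weights([0.5, -3.0, 0.005], c=0.01)
x1, x2 = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
change = abs(linear_critic(w, x1) - linear_critic(w, x2))
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
print(w, change, lipschitz_bound(0.01, 3) * distance)
```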
Improved WGAN (WGAN-GP, gradient penalty)
- A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere. (The norm of the discriminator's gradient with respect to the input $x$ must be at most 1.)
- We can therefore add a penalty term to $V(G,D)$ $\Rightarrow$ prefer $||\nabla_xD(x)||\leq1$ for all $x$
$$\begin{aligned} V(G, D) \approx &\max_{D}\{E_{x \sim P_{data}}[D(x)]-E_{x \sim P_{G}}[D(x)]\\ &\quad\quad -\lambda \int_{x} \max (0,\|\nabla_{x} D(x)\|-1)\, dx\} \end{aligned}$$
- In practice we cannot integrate over the whole input space, so the integral is replaced by sampling $\Rightarrow$ prefer $||\nabla_xD(x)||\leq1$ for $x$ sampled from $P_{penalty}$
$$\begin{aligned} V(G, D) \approx& \max_{D}\{E_{x \sim P_{data}}[D(x)]-E_{x \sim P_{G}}[D(x)] \\ &\quad\quad -\lambda E_{x\sim P_{penalty}} [\max (0,\|\nabla_{x} D(x)\|-1)]\} \end{aligned}$$
- "Given that enforcing the Lipschitz constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance."
- Only give the gradient constraint to the region between $P_{data}$ and $P_G$, because that region influences how $P_G$ moves to $P_{data}$ (when training the generator, we let it follow the gradient direction indicated by the discriminator, $\nabla_xV(G,D)$, to move $P_G$ toward $P_{data}$): sample one point from $P_{data}$ and one from $P_G$, connect them, and take a random point on the segment between them as a sample from $P_{penalty}$
- In actual GAN training, we want the gradient norm to be as close to 1 as possible
$$\begin{aligned} V(G, D) \approx& \max_{D}\{E_{x \sim P_{data}}[D(x)]-E_{x \sim P_{G}}[D(x)] \\ &\quad\quad -\lambda E_{x\sim P_{penalty}} [(\|\nabla_xD(x)\|-1)^2]\} \end{aligned}$$
- "Simply penalizing overly large gradients also works in theory, but experimentally we found that this approach converged faster and to better optima."
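The penalty term can be sketched without a deep-learning framework. The example assumes a hypothetical linear critic $D(x)=w\cdot x$ and estimates $\nabla_xD(x)$ by central finite differences; a real implementation would obtain this gradient by automatic differentiation. The penalty sample is a random point on the segment between one real and one generated point, as described above.

```python
import math
import random


def critic(w, x):
    """Hypothetical linear critic D(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def interpolate(x_real, x_fake, rng):
    """Random point on the segment between a real and a generated sample."""
    eps = rng.random()
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]


def input_gradient(w, x, h=1e-5):
    """Central finite-difference estimate of the critic's gradient w.r.t. x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(w, up) - critic(w, down)) / (2.0 * h))
    return grad


def gradient_penalty(w, x_real, x_fake, rng):
    """(||grad_x D(x_hat)|| - 1)^2 at one sample x_hat from P_penalty."""
    x_hat = interpolate(x_real, x_fake, rng)
    norm = math.sqrt(sum(g * g for g in input_gradient(w, x_hat)))
    return (norm - 1.0) ** 2


rng = random.Random(0)
print(gradient_penalty([3.0, 4.0], [1.0, 1.0], [0.0, -1.0], rng))
print(gradient_penalty([0.6, 0.8], [1.0, 1.0], [0.0, -1.0], rng))
```

In the full objective this value is averaged over a batch of penalty samples and subtracted with weight $\lambda$.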
Performance
- As can be seen, WGAN and WGAN-GP are more robust than DCGAN and LSGAN: their performance is less affected by the network parameters
Algorithm
$V(G,D)$ no longer contains a $\log$, so there is no need to use a sigmoid to restrict the range of $D(x)$
Spectral Normalization (SNGAN)
- Spectral Normalization → Keep gradient norm smaller than 1 everywhere
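One way to picture spectral normalization is to divide a layer's weight matrix by its largest singular value, which can be estimated by power iteration. The sketch below does this for a single small matrix in plain Python. The helper names and the example matrix `W` are assumptions made for illustration; practical implementations typically run only one power-iteration step per training update and reuse the vector between updates.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    v = normalize([1.0] * len(W[0]))
    for _ in range(n_iter):
        u = normalize(matvec(W, v))
        v = normalize(matvec(Wt, u))
    Wv = matvec(W, v)
    return math.sqrt(sum(x * x for x in Wv))

def spectrally_normalize(W, n_iter=50):
    """Rescale W so that its largest singular value is (approximately) 1."""
    sigma = spectral_norm(W, n_iter)
    return [[wij / sigma for wij in row] for row in W]

W = [[3.0, 0.0], [0.0, 1.0]]      # hypothetical layer weights
sigma = spectral_norm(W)
W_sn = spectrally_normalize(W)
print(sigma, W_sn)
```

A linear layer with weights `W_sn` cannot stretch any input direction by more than a factor of about 1; combined with 1-Lipschitz activations, this keeps the gradient norm of the whole network bounded.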
Energy-based GAN (EBGAN)
- paper: Energy-based Generative Adversarial Network
- video: https://www.youtube.com/watch?v=gFaqKdcCdOE
- Using an autoencoder as discriminator $D$
- Using the negative reconstruction error of the auto-encoder to determine the goodness (the lower the reconstruction error, the higher the image quality is judged to be)
- Benefit: The auto-encoder can be pre-trained on real images without a generator. (By contrast, an ordinary NN-based Discriminator needs negative examples during training, so it cannot be pretrained.)
- An auto-encoder based discriminator gives large values only to a limited region.
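The idea of scoring by negative reconstruction error can be illustrated with a deliberately tiny stand-in for the auto-encoder. In this sketch the "auto-encoder" is just a projection onto one fixed direction (`DIRECTION`), which is an assumption made for the example and not how EBGAN is actually built. Points near the assumed data manifold reconstruct well and receive a score near 0, and everything else is pushed to negative values.

```python
import math

# Assumed 1-D "data manifold" in 2-D space: the line x1 = x2.
DIRECTION = [1 / math.sqrt(2), 1 / math.sqrt(2)]

def encode(x):
    """Toy encoder: project x onto DIRECTION to get a scalar code."""
    return sum(xi * di for xi, di in zip(x, DIRECTION))

def decode(code):
    """Toy decoder: map the scalar code back into input space."""
    return [code * di for di in DIRECTION]

def discriminator_score(x):
    """Negative squared reconstruction error; higher means 'more real'."""
    recon = decode(encode(x))
    return -sum((xi - ri) ** 2 for xi, ri in zip(x, recon))

on_manifold = discriminator_score([2.0, 2.0])    # lies on the line
off_manifold = discriminator_score([2.0, -2.0])  # far from the line
print(on_manifold, off_manifold)
```

Because the score is bounded above by 0 and reaches it only where reconstruction is perfect, large values are confined to a limited region of the input space.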
GAN is still challenging …
- GANs are very hard to train; getting the network to train usually requires some hyperparameter tuning (GAN training is dynamic, and sensitive to nearly every aspect of its setup (from optimization parameters to model architecture).)
- A simple structural explanation: the Generator and Discriminator need to match each other. That is, if either of them stops improving during training, the other stops improving as well
More Tips
- Tips from Soumith
- Tips in DCGAN: Guideline for network architecture design for image generation
- Improved techniques for training GANs
- Tips from BigGAN
How to Train a GAN? Tips and tricks to make GANs work
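- Ref: How to Train a GAN? Tips and tricks to make GANs work; two Chinese-language write-ups: 怎样训练一个 GAN?一些小技巧让 GAN 更好的工作 ("How to train a GAN? Some tricks to make GANs work better") and 训练不稳定、调参难度大,这里有 7 大法则带你规避 GAN 训练的坑! ("Unstable training, hard to tune: 7 rules for avoiding GAN training pitfalls")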
- (1) Normalize the inputs:
- normalize the images between -1 and 1: `img / 127.5 - 1`
- `Tanh` as the last layer of the generator output: generated images are also fed to the discriminator, so the generator output should also lie between -1 and 1 (the same range as the real images)
- (2) Avoid Sparse Gradients: ReLU, MaxPool
- the stability of the GAN game suffers if you have sparse gradients
- LeakyReLU = good (in both G and D)
- For Downsampling, use: Average Pooling, Conv2d + stride
- For Upsampling, use: PixelShuffle, ConvTranspose2d + stride
- (3) Use stability tricks from RL
- Experience Replay
- Keep a replay buffer of past generations and occasionally show them
- Keep checkpoints from the past of G and D and occasionally swap them out for a few iterations
- All stability tricks that work for deep deterministic policy gradients
- See Pfau & Vinyals (2016)
- (4) Use the ADAM Optimizer
- (5) Track failures early
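- $D$ loss goes to 0: failure mode
- check norms of gradients: if they are over 100 things are screwing up; ideally, the generator should receive large gradients early in training, because it has to learn how to generate real-looking data. The discriminator, on the other hand, should not always receive large gradients early in training, since it can easily tell real images from generated ones. Once the generator is trained well enough, the discriminator can no longer separate them so easily; it keeps making mistakes and receives larger gradients
- when things are working, $D$ loss has low variance and goes down over time vs having huge variance and spiking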
- if loss of generator steadily decreases, then it’s fooling D with garbage
- (6) Don't balance loss via statistics (unless you have a good reason to)
- Don't try to find a (number of G / number of D) schedule to uncollapse training. It’s hard and we’ve all tried it.
- If you do try it, have a principled approach to it, rather than intuition. For example:
    while lossD > A:
        train D
    while lossG > B:
        train G
- (7) Use Dropouts in G in both train and test phase
- Provide noise in the form of dropout (50%).
- Apply on several layers of our generator at both training and test time
- https://arxiv.org/pdf/1611.07004v1.pdf
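Tips (1) and (3) above can be sketched in a few lines. The normalization helpers follow the `img / 127.5 - 1` rule directly. The `ReplayBuffer` class is only one possible reading of "keep a replay buffer of past generations and occasionally show them": its name, the `capacity` and `replay_prob` parameters, and the replace-on-replay policy are assumptions made for this example.

```python
import random

def normalize_image(pixels):
    """Map 8-bit pixel values in [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def denormalize_image(values):
    """Map tanh-range generator outputs in [-1, 1] back to [0, 255]."""
    return [round((v + 1.0) * 127.5) for v in values]

class ReplayBuffer:
    """Keeps past generated samples and occasionally replays one to D."""

    def __init__(self, capacity, replay_prob=0.5, seed=0):
        self.capacity = capacity
        self.replay_prob = replay_prob
        self.items = []
        self.rng = random.Random(seed)

    def push_and_sample(self, sample):
        """Store `sample`; return either it or a randomly chosen past sample."""
        if len(self.items) < self.capacity:
            self.items.append(sample)
            return sample
        if self.rng.random() < self.replay_prob:
            idx = self.rng.randrange(self.capacity)
            old = self.items[idx]
            self.items[idx] = sample  # the new sample takes the old one's slot
            return old
        return sample

normalized = normalize_image([0, 127.5, 255])
roundtrip = denormalize_image(normalize_image([0, 128, 255]))

buffer = ReplayBuffer(capacity=2)
first = buffer.push_and_sample("fake_a")
second = buffer.push_and_sample("fake_b")
third = buffer.push_and_sample("fake_c")  # may be replaced by an older sample
print(normalized, roundtrip, first, second, third)
```

In a training loop, the value returned by `push_and_sample` (rather than the freshly generated sample) would be what the discriminator sees as a fake example.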
Feature Extraction
InfoGAN
- paper: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
- In a GAN, the input is a vector sampled from some distribution, and we hope that after training each dimension of this vector represents some characteristic
- Regular GAN: Modifying a specific dimension, no clear meaning (in the figure, the horizontal axis corresponds to changing one dimension of the input)
The colors represent the characteristics. (Taking a 2-D input vector as an example, we hope that objects with different characteristics are distributed in the latent space with some regularity)
What is InfoGAN?
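- The input $z$ is split into two parts, $c$ and $z'$: each dimension of $c$ represents some characteristic of the image, while $z'$ represents the random, unexplainable part
- Besides the GAN structure, InfoGAN adds a Classifier that must recover $c$ from $x$ (The classifier can recover $c$ from $x$. $c$ must have clear influence on $x$). Note that the Generator and the Classifier together form an Auto-encoder structure
- Since the Classifier and the Discriminator both take $x$ as input, they can share some of their parameters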
VAE-GAN
- VAE-GAN
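- (1) Using GAN to strengthen VAE: the first two parts, Encoder and Generator (Decoder), can be viewed as a VAE. Without the Discriminator, if we only minimize the reconstruction error, the generated images tend to be blurry, because it is hard to define a good loss between two images. With the Discriminator, the generated images can additionally be made more realistic by fooling the Discriminator
- (2) Using VAE to strengthen GAN: the last two parts, Generator (Decoder) and Discriminator, can be viewed as a GAN. VAE-GAN adds the Encoder, so that minimizing the reconstruction error also helps make the generated images more realistic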
BiGAN
- paper: Adversarial Feature Learning
- BiGAN also consists of three parts: Encoder, Decoder and Discriminator. However, the Encoder and Decoder are not connected as an Auto-encoder; instead, the Discriminator links them. The Discriminator takes both an image $x$ and a code $z$ and decides whether the pair $(x,z)$ comes from the Encoder or the Decoder
- Why does this work? Let the (input, output) pairs of the Encoder follow the joint distribution $P(x,z)$ and those of the Decoder follow the joint distribution $Q(x,z)$. As in a regular GAN, the Discriminator measures the difference between the two distributions, while the Encoder and Decoder both try to fool it. Iterating brings $P(x,z)$ and $Q(x,z)$ closer and closer, which finally yields the optimal Encoder and Decoder, which act as inverses of each other
Algorithm
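Here the Discriminator raises the score of $(x,z)$ pairs from the Encoder and lowers the score of $(x,z)$ pairs from the Decoder. The reverse also works (raise the score of Decoder pairs and lower the score of Encoder pairs), because the Discriminator only serves to measure the difference between $P$ and $Q$
- Note that the optimal encoder and decoder are formally equivalent to training two Auto-encoders (one reconstructing $x$ through $z$, the other reconstructing $z$ through $x$). Although the two approaches agree at the optimal solution, training never reaches the optimum, and in experiments they behave quite differently (BiGAN captures the semantic information of images more readily and generates sharp images: given a picture of a bird, it returns a somewhat different picture of a bird, whereas an Auto-encoder returns a blurrier version of the original)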
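For reference, the training procedure described under Algorithm corresponds to the minimax objective of the BiGAN paper (Adversarial Feature Learning), written here for a deterministic Encoder $E$ and Decoder $G$. The symbol $P_z$ for the prior over codes is notation introduced for this sketch and does not appear in the notes.

```latex
\min_{G, E} \max_{D} V(D, E, G)
  = E_{x \sim P_{data}}\big[\log D(x, E(x))\big]
  + E_{z \sim P_{z}}\big[\log\big(1 - D(G(z), z)\big)\big]
```

The first term raises the score of Encoder pairs $(x, E(x))$ and the second lowers the score of Decoder pairs $(G(z), z)$; swapping the two roles gives an equivalent game, as noted above.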
Triple GAN
- $D$: Discriminator, $G$: Generator, $C$: Classifier
- Ignoring $C$, $D$ and $G$ form a Conditional GAN. $G$ takes the condition $Y_g$ as input and outputs $X_g$. The pair $(X_g,Y_g)$ is then fed to $D$, which must distinguish data generated by $G$ from real samples
- Triple GAN is a form of Semi-supervised learning: a small part of the training data is labeled, while most of it is unlabeled ($x$ and $y$ are not paired). The labeled data and the data generated by $G$ are used to train $C$, so that $C$ eventually maps an input $x$ to an output $y$
See the paper for the detailed rationale behind this design
Domain-adversarial training
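- Training and testing data are in different domains: e.g. the Training data and Testing data follow somewhat different distributions, so a model trained on the Training data will not perform well when tested directly on the Testing data. We can therefore use a Generator to extract features from both the Training data and the Testing data such that the extracted features share the same distribution
The feature extractor is the Generator; the Domain classifier is the Discriminator, which measures the distribution divergence between Testing data and Training data; the Label predictor is a classifier, e.g. one that classifies digits
The three parts can be trained jointly (the approach used in the original paper), or trained separately and iteratively as in a GAN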
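The joint training of the three parts can be summarized by a single saddle-point objective. The formulation below follows the notation commonly used for domain-adversarial training, as best recalled here: $\theta_f$, $\theta_y$, $\theta_d$ are the parameters of the feature extractor, label predictor and domain classifier, $L_y$ is the label-prediction loss on labeled Training data, $L_d$ is the domain-classification loss on both domains, and $\lambda$ is a trade-off weight. None of these symbols appear in the notes, so treat them as assumptions.

```latex
\begin{aligned}
E(\theta_f, \theta_y, \theta_d) &= L_y(\theta_f, \theta_y) - \lambda\, L_d(\theta_f, \theta_d) \\
(\hat{\theta}_f, \hat{\theta}_y) &= \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d) \\
\hat{\theta}_d &= \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
\end{aligned}
```

The domain classifier minimizes $L_d$, while the feature extractor is pushed to increase it, which is what drives the features of the two domains toward the same distribution.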
Feature Disentangle
Original Seq2seq Auto-encoder
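- Suppose we use an Auto-encoder to extract the phonetic information of a speech segment. The latent representation actually contains more than phonetic information: it also includes speaker information, environment information, etc. (for example, two people saying the same word produce different audio, because the phonetic information is similar but the speaker information differs)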
Feature Disentangle
e.g. phonetic information can be used for speech recognition, and speaker information for speaker verification (voiceprint matching)
Training
- (1) Train Speaker Encoder
- (2) Train Phonetic Encoder: additionally train a Speaker Classifier that judges whether $z_i$ and $z_j$ come from utterances of the same speaker (this is essentially a GAN). If the Phonetic Encoder can fool the Speaker Classifier, it has filtered out all speaker-related information
Result: Audio segments of two different speakers
Photo Editing
Sequence Generation
- 2018 (detailed explanation): video, ppt
- 2021: video (4:06), ppt (p50)
Evaluation of GAN
That is, how to objectively evaluate the quality of the objects a GAN generates
We don’t want memory GAN
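- When training a GAN, we do not want it to memorize and output images that are already in the database. If the GAN outputs an exact copy, this can be detected by computing the Euclidean distance to the images in the database. However, the GAN may also generate a database image shifted up/down/left/right by 1 / 2 / 3… pixels, or flipped horizontally. Such images are very similar to the original, yet under Euclidean distance their nearest database image is often not the original but some other image, so it becomes hard to tell whether a generated image is a copy of one in the database
- For example, in the lecture's figure, after a truck image is shifted by 3 pixels, its most similar image turns out to be an airplane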
- Solution: Using k-nearest neighbor to check whether the generator generates new objects
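The failure of pixel-space Euclidean distance under small shifts, and the nearest-neighbor lookup itself, can be reproduced on a toy example. The two 8-pixel 1-D "images" below (`stripes` and `flat_gray`) and the circular `shift_right` helper are made up for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized pixel lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shift_right(image, k):
    """Circularly shift a 1-D 'image' to the right by k pixels."""
    k %= len(image)
    return image[-k:] + image[:-k] if k else list(image)

def k_nearest(query, database, k=1):
    """Return the names of the k database images closest to `query`."""
    ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
    return ranked[:k]

database = {
    "stripes": [0.0, 1.0] * 4,   # high-frequency pattern
    "flat_gray": [0.5] * 8,      # unrelated image
}

# A "generated" image that is just the stripes image shifted by one pixel.
generated = shift_right(database["stripes"], 1)

nearest = k_nearest(generated, database, k=1)[0]
print(nearest, euclidean(generated, database["stripes"]),
      euclidean(generated, database["flat_gray"]))
```

Here the shifted copy is closer to the unrelated gray image than to the image it was copied from, so a raw pixel-distance check would miss the memorization.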
Likelihood
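- In traditional evaluation of generative models, we sample real data $x^i$ that were not used in training and compute their log-likelihood under the model to judge how good the model is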
- But we cannot compute P G ( x i ) P_G(x^i) PG(xi) (in GAN). We can only sample from P G P_G PG.
Likelihood - Kernel Density Estimation
- Estimate the distribution of $P_G(x)$ from sampling. Each sample is the mean of a Gaussian with the same covariance. (i.e. approximate $P_G$ with a Gaussian Mixture Model)
- Now we have an approximation of $P_G$, so we can compute $P_G(x^i)$ for each real data $x^i$. Then we can compute the likelihood.
- This method has many problems, e.g. how to choose the number of samples, and the fact that a high likelihood does not necessarily mean high quality
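A one-dimensional sketch of this estimate is given below: every generated sample becomes the mean of a Gaussian with a shared standard deviation `sigma`, and the average log-likelihood of held-out real data is computed under the resulting mixture. The sample values and the bandwidth `sigma=0.2` are arbitrary choices for illustration.

```python
import math

def log_gaussian(x, mean, sigma):
    """Log density of N(mean, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)

def kde_log_density(x, samples, sigma):
    """log of (1/M) * sum_j N(x; sample_j, sigma^2), via log-sum-exp for stability."""
    logs = [log_gaussian(x, s, sigma) for s in samples]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs)) - math.log(len(samples))

def avg_log_likelihood(real_data, samples, sigma):
    """Average log-likelihood of real data under the KDE approximation of P_G."""
    return sum(kde_log_density(x, samples, sigma) for x in real_data) / len(real_data)

generated = [0.9, 1.0, 1.1, 2.9, 3.0, 3.1]  # hypothetical samples from P_G
close_real = [1.0, 3.0]                      # real data near the generated modes
far_real = [10.0, -5.0]                      # real data the generator never covers

ll_close = avg_log_likelihood(close_real, generated, sigma=0.2)
ll_far = avg_log_likelihood(far_real, generated, sigma=0.2)
print(ll_close, ll_far)
```

The result depends strongly on `sigma` and on the number of samples, which is one of the practical problems mentioned above.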
Likelihood v.s. Quality
- Low likelihood, high quality?: Considering a model generating good images (small variance)
- High likelihood, low quality?: as shown below, $G_2$ generates high-quality images with only 1/100 of the probability of $G_1$, yet its log-likelihood drops by only 4.6
$$
\begin{aligned}
L_{G_1}&=\frac{1}{N}\sum_i\log P_{G_1}(x^i)\\
L_{G_2}&=\frac{1}{N}\sum_i\log \frac{P_{G_1}(x^i)}{100}=-\log 100+L_{G_1}\approx-4.6+L_{G_1}
\end{aligned}
$$
Inception Score (IS)
A pretrained classifier is used to evaluate the generated objects
- (1) Concentrated distribution (lower entropy) means higher visual quality (the classifier output for each image can be seen as a distribution giving the probability that the image belongs to each class)
- e.g. if we generate images, an already trained image classifier can judge the generation quality: if the classifier assigns a very high probability to one class, the generated image is considered to be of good quality
- (2) Uniform distribution means higher variety: the diversity of the generated objects can be evaluated in the same way. For example, sample 3 images and let a CNN classify them, giving 3 distributions, then average them into a single distribution. If this averaged distribution is close to uniform, every class is being generated and the GAN's outputs are diverse
Inception Score
An Inception Net trained on ImageNet is used as the classifier, hence the name Inception Score
- Inception Score:
$$
\exp \left(\mathbb{E}_{x} \operatorname{KL}(p(y \mid x) \| p(y))\right)
$$
- Since only relative magnitudes matter, the $\exp$ can be dropped; in practice the expectation is replaced by $\sum_x$ (dropping the constant factor $1/N$)
$$
\begin{aligned}&\mathbb{E}_{x} \operatorname{KL}(p(y \mid x) \| p(y))\\ =&\sum_{x} \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{P(y)} \\ =&\sum_{x} \sum_{y} P(y \mid x) \log P(y \mid x)\quad\quad (\text{Negative entropy})\\ &\quad\quad- \sum_{x} \sum_{y} P(y \mid x) \log P(y)\quad\quad (\text{Cross entropy}) \end{aligned}
$$
- (1) The negative entropy term should be as large as possible $\Rightarrow$ higher visual quality; (2) the cross entropy term $-\sum_x\sum_y P(y \mid x)\log P(y)$ compares each $P(y \mid x)$ with the marginal $P(y)$. After summing over $x$ it is proportional to the entropy of $P(y)$, so it should also be as large as possible $\Rightarrow$ higher diversity
- Therefore, the larger the Inception Score, the better
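The formula can be checked on hand-made classifier outputs. In the sketch below, `probs[i][k]` plays the role of $p(y=k \mid x_i)$; the three toy prediction sets are invented for illustration, and no real Inception Net is involved.

```python
import math

def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)), with p(y) the average of p(y|x) over x."""
    n = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    mean_kl = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing (0 * log 0 is taken as 0).
        mean_kl += sum(pk * math.log(pk / marginal[k])
                       for k, pk in enumerate(p) if pk > 0)
    return math.exp(mean_kl / n)

# Confident predictions spread over all classes: high quality and high diversity.
sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Confident predictions, but always the same class: mode collapse.
sharp_but_collapsed = [[1.0, 0.0, 0.0]] * 3
# Unconfident predictions: the classifier cannot tell what the images are.
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3

is_diverse = inception_score(sharp_and_diverse)
is_collapsed = inception_score(sharp_but_collapsed)
is_blurry = inception_score(blurry)
print(is_diverse, is_collapsed, is_blurry)
```

With $K$ classes the score ranges from 1 to $K$; only the first toy set reaches the maximum, because it satisfies both the low-entropy and the uniform-marginal conditions.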
Mode collapse, Mode missing
- Mode collapse is easy to detect.
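- Mode missing: if the Discriminator gives an especially high score to some image in the database, that image may belong to a missing mode (the Generator never produces images like it)
- Limitations: the Inception Score depends on the classifier's training data. If the Generator produces realistic images that do not resemble anything in that training data, the Inception Score will not be high; or, if all generated images are anime faces and Inception Net classifies every one of them as a human face, IS cannot be used to assess the quality of the generated images
- Remedy: FID. Extract features from both the GAN outputs and the real images and compare the two (smaller is better); this compensates for some of the weaknesses of the Inception Score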
Fréchet Inception Distance (FID)
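- The output of the last hidden layer of Inception Net is taken directly as the extracted feature. Assuming the features of the generated images and of the real images each follow a Gaussian distribution, FID is the Fréchet distance between the two distributions, so the smaller the FID, the better
- Limitations: (1) the features of generated and real images do not necessarily follow a Gaussian distribution; (2) computing the Fréchet distance requires sampling a large number of images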
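Under the Gaussian assumption, the Fréchet distance has a closed form. Writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the mean and covariance of the Inception features of real and generated images (these symbols are notation introduced here, not taken from the notes), it reads:

```latex
\mathrm{FID} = \|\mu_r - \mu_g\|_2^2
  + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

Both means and covariances have to be estimated from samples, which is why a large number of images is needed for a stable value.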
To learn more about evaluation …
More Generative Models
Variational Autoencoder (VAE)
FLOW-based Model
- video: https://youtu.be/uXY18nzdSsM