
GAN (Generative Adversarial Network)

These are notes on Hung-yi Lee's 2021 ML course

Basic Idea of GAN

NIPS 2016 Tutorial: Generative Adversarial Networks
The first GAN: Generative Adversarial Networks

All Kinds of GAN…

The GAN zoo

Basic Idea of GAN

Unconditional Generation

Image Generation
Sentence Generation

Conditional Generation

We will control what to generate (e.g. generate an image that matches a given text, or generate another image from a given image (style transfer)…)

Generator outputs a complex distribution

The data we want to generate has a distribution $P_{data}(x)$; A generator $G$ is a network. The network defines a probability distribution $P_G(x)$

Generator

Generator: a neural network (NN), or a function
- Input: a vector; Each dimension of input vector may represent some characteristics
- Output: a high dimensional vector (image / sentence) $\Rightarrow$

Discriminator

Discriminator: a neural network (NN), or a function
- Input: a high dimensional vector (image / sentence)
- Output: a scalar (Larger value means real, smaller value means fake)

Generator and Discriminator

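First we prepare a dataset of real images. Generator v1 then produces a batch of images from input vectors, but since its parameters are randomly initialized, those images are essentially random output. We can now train Discriminator v1 to tell which images come from the generator and which are real. After training Discriminator v1, we switch to training Generator v1 so that its images fool Discriminator v1 as much as possible (i.e. images that Discriminator v1 scores highly), which gives Generator v2…
Repeating this process yields ever better Generators and Discriminators…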

This is where the term “adversarial” comes from.

Algorithm

Initialize generator and discriminator
In each training iteration:
- Step 1: Fix generator $G$, and update discriminator $D$ (Discriminator learns to assign high scores to real objects and low scores to generated objects)
- Step 2: Fix discriminator $D$, and update generator $G$ (Generator learns to “fool” the discriminator)
  - How to implement? Chain the Generator and the Discriminator into one large network and make its final output as large as possible. Note that during this update only the parameters of the first few hidden layers, the ones belonging to the Generator, are adjusted
Mathematical description of the algorithm:

Note: the input vectors are sampled from some distribution (Uniform distribution, Gaussian distribution…); their dimensionality is a hyperparameter that may need tuning

GAN as structured learning

Structured Learning / Prediction

Output is composed of components with dependency (e.g. output a sequence, a matrix, a graph, a tree …)

Why Is Structured Learning Challenging?

One-shot / Zero-shot Learning:
- In classification, each class has some examples.
- In structured learning, if you consider each possible output as a “class”, since the output space is huge, most “classes” do not have any training data. So the machine has to create new stuff during testing.
Machine has to learn to do planning.
- Machine generates objects component-by-component, but it should have a big picture in its mind. (Because the output components have dependency, they should be considered globally.)

Structured Learning Approach

Can Generator learn by itself?

Traditional Supervised Learning

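In conventional supervised learning we would collect a dataset whose inputs are vectors drawn from some distribution and whose labels are the corresponding images, then train the network on it directly (the difficulty: how do we decide which vector belongs to each image? $\rightarrow$
A more convenient way to label each image with a vector: Encoder in auto-encoder provides the code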

Auto-encoder

Encoder: Compact representation of the input object
Decoder: Reconstruct the original object
Train: chain the Encoder and the Decoder and make the output as close to the input as possible; note that the Decoder is exactly the Generator we want!

Decoder as a generator

Problem: the training data is finite, so the quality of the Decoder is hard to guarantee
Solution: Variational Auto-encoder (VAE)

Variational Auto-encoder (VAE)

paper: Auto-Encoding Variational Bayes

The Encoder produces more than a single code $(m_1,m_2,m_3)$
But if training only minimizes the reconstruction error, then because of $\boldsymbol e$

What do we miss?

With an auto-encoder we want the output to be as close as possible to the input (e.g. using Euclidean distance to measure the similarity of two images); But it does not really try to simulate real images!
- It will be fine if the generator can truly copy the target image. But what if the generator makes some mistakes… Some mistakes are serious, while some are fine.
- The key point is that in Structured Learning the relations between components matter a great deal, and the approach above cannot capture the dependency between components well $\rightarrow$

Can Discriminator generate?

It is easier to catch the relation between the components by top-down evaluation

How to learn the discriminator?

I only have some real images $\Rightarrow$
Discriminator training needs some negative examples (Quality of negative examples is critical)

How to generate realistic negative examples? - General Algorithm

Given a set of positive examples, randomly generate a set of negative examples.
In each iteration
- Learn a discriminator $D$ that can discriminate positive and negative examples.
- Generate negative examples by discriminator $D$: $\hat x=\argmax_{x\in\mathcal X}D(x)$
The key, therefore, is to solve the $\argmax$

GAN: Anime Character Face Generation

Source of images: http://zhuanlan.zhihu.com/p/24767059
DCGAN (Deep CNN GAN): http://github.com/carpedm20/DCGAN-tensorflow

In 2019, with StyleGAN ……
Progressive GAN: Progressive Growing of GANs for Improved Quality, Stability, and Variation
Today …… BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis

Conditional Generation by GAN

Conditional GAN

paper:
Conditional GAN: Conditional Generative Adversarial Nets
Class conditional image generation: Conditional Image Synthesis With Auxiliary Classifier GANs

Text-to-Image

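Traditional supervised approach: Problem: one description can correspond to many images, and the NN tries to minimize its distance to all of them, so it may end up producing a blurry image (It is blurry because it is the average of several images).
- e.g. Text: “train”; Annotation: photos of trains of various kinds from various angles; the network's output may be a blurry blend of several trains (A blurry image!)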

Conditional GAN

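Generator: besides a vector $\boldsymbol z$, it is also given a piece of text (condition) and generates a matching image; since $\boldsymbol z$ is drawn from a distribution, the output $\boldsymbol x$ also follows a distribution (Generator learns to approximate the distribution of $\boldsymbol x$ given the condition)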

Why output a distribution?
The same input has different outputs $\Rightarrow$
avoid generating blurry image

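To keep the Generator from ignoring the condition, one can also add dropout to the Generator and omit $\boldsymbol z$; the output still has a random component
Discriminator: if we reuse the earlier Discriminator, the Generator only learns to produce realistic images (But completely ignore the input conditions); so the following changes are needed: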
- Training data: $(\hat c,\hat x)$
- Positive example: $(\hat c,\hat x)$
Training algorithm

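Note that when training the Discriminator, the objective being maximized covers two kinds of failure (a fake image; a condition that does not match the real image)

The last update in the algorithm should read $\theta_g\leftarrow\theta_g+\eta\nabla\tilde V(\theta_g)$

Different Discriminator Architectures

The lower of the two architectures separates the two kinds of error more cleanly (the generated image is not realistic enough; the condition and the image do not match)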

StackGAN

paper: StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

idea: generate a low-resolution image first, then generate progressively higher-resolution images

Image-to-image (Patch GAN)

paper: Image-to-Image Translation with Conditional Adversarial Networks

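Our goal is to generate realistic buildings from geometric layouts. In the figure below,
- close means traditional supervised learning (make the output as close as possible to the real image); the images it produces are rather blurry
- GAN means using a GAN (Conditional GAN); the images are sharper, but some odd extra structures appear
- GAN + close means adding one more objective when training the Generator on top of GAN: the Discriminator's score should be higher, and the generated image should also be as close as possible to the real image (the red arrow in the figure); the images produced by GAN + close look quite good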

Patch GAN

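The Image-to-image paper above also proposes Patch GAN, which improves results by changing the structure of the Discriminator. A traditional Discriminator takes the whole image and outputs one final score, but for large images the network may need many parameters, which is costly and prone to overfitting during training. The main idea of Patch GAN is to look at only one part (patch) of a large image at a time and output a score for that part (the patch size is a hyperparameter)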

Speech Enhancement

e.g. removing noise from speech

The speech below is represented as a spectrum, so image-processing network architectures can be applied directly
Conditional GAN

Video Generation

Generator: show the Generator a video clip and let it predict what happens next

Unsupervised Conditional Generation

Unsupervised Conditional Generation

Transform an object from one domain to another without paired data (e.g. style transfer; we only have a pile of landscape photos and a pile of paintings, with no one-to-one correspondence between them)

Approach 1: Direct Transformation (For texture or color change)

Direct Transformation

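Problem: ignore input (the Discriminator only judges whether an image is a painting, so the Generator may learn to output a few paintings that have nothing to do with the input photo)
- The issue can be avoided by network design. Simpler generator makes the input and output more closely related. (a shallow network is less affected by this problem and can be trained directly)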

Encoder Network

CycleGAN

paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

We can also learn two Generators and two Discriminators at the same time

Issue of Cycle Consistency

paper: CycleGAN: a Master of Steganography (隐写术)
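CycleGAN hides information from the input and reveals it again at the output (the Generator hides the information where humans cannot see it) (e.g. the black dots on the roof disappear in the figure below)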

Related Work

Dual GAN
Disco GAN

Same method as CycleGAN (different people came up with it at the same time and published at different venues…)

StarGAN (multiple domains)

paper: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

StarGAN needs only 1 Generator and 1 Discriminator to translate among multiple domains (it also uses Cycle Consistency)

Approach 2: Projection to Common Space (Larger change, only keep the semantics)

Compared with Direct Transformation, Projection to Common Space supports larger changes

Projection to Common Space

Target

Training

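Following the Auto-Encoder idea, this amounts to training two Auto-Encoders (the red arrows and the blue arrows in the figure, respectively)
If we only learn the auto-encoders, the decoder output is blurry, so a Discriminator can be added as well, which amounts to training two VAE-GANs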

Problem

Because we train two auto-encoders separately, the images with the same attribute may not project to the same position in the latent space.
- latent space: a space in which the model learns features of the data and a simplified representation of it, so that patterns can be found

Sharing the parameters of encoders and decoders

Couple GAN [Ming-Yu Liu, et al., NIPS, 2016]; UNIT [Ming-Yu Liu, et al., NIPS, 2017]

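Let the two Encoders and the two Decoders share parameters (the dashed lines in the figure below): the Encoders share their last few layers, and the Decoders share their first few layers
- In the most extreme case all parameters are shared, and the Encoder then also needs a flag indicating which domain the image comes from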

Domain Discriminator

Domain Discriminator: The domain discriminator forces the output of $EN_X$ [Guillaume Lample, et al., NIPS, 2017]
- input: latent vector; output: which domain the latent vector belongs to

Cycle Consistency:

ComboGAN [Asha Anoosheh, et al., arXiv, 2017]

Similar to CycleGAN

Semantic Consistency

Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]

Compute the similarity of the latent vectors $\Rightarrow$

U-GAT-IT

Ref: U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

SELFIE2ANIME

To learn more…

Theory behind GAN

Maximum Likelihood Estimation

Given a data distribution $P_{data}(x)$
We have a distribution $P_G(x;\theta)$ parameterized by $\theta$

Maximum Likelihood Estimation = Minimize KL Divergence

Sample $\{x^1,x^2,...,x^m\}$
Likelihood of generating the samples:
$L=\prod_{i=1}^m P_G(x^i;\theta)$
Find $\theta^*$
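The equivalence named in the heading can be sketched as follows. This is the standard derivation rather than a copy of the course slides, and it assumes the samples $x^i$ are drawn i.i.d. from $P_{data}$; the term added in the third line does not depend on $\theta$, so it does not change the maximizer.

```latex
\begin{aligned}
\theta^* &= \arg\max_\theta \prod_{i=1}^m P_G(x^i;\theta)
          = \arg\max_\theta \sum_{i=1}^m \log P_G(x^i;\theta) \\
         &\approx \arg\max_\theta E_{x\sim P_{data}}[\log P_G(x;\theta)] \\
         &= \arg\max_\theta \int_x P_{data}(x)\log P_G(x;\theta)\,dx
            - \int_x P_{data}(x)\log P_{data}(x)\,dx \\
         &= \arg\min_\theta KL(P_{data}\,\|\,P_G)
\end{aligned}
```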

Generator is a NN

Generated distribution:
$P_G(x)=\int_z P_{prior}(z)\mathbb I_{G(z)=x}dz$
Difficult to compute the likelihood; Hard to learn by maximum likelihood $\Rightarrow$

$P_{prior}(z)$

Generator (Defines a distribution $P_G(x)$

A generator $G$ is a network. The network defines a probability distribution $P_G(x;\theta)$

Discriminator (Evaluates the “difference” between $P_G(x)$

Our objective
$G^*=\argmin_GDiv(P_G,P_{data})$

How to compute the divergence? - Sampling is good enough ……

Although we do not know the distributions of $P_G$

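Discriminator $D$ evaluates the “difference” between $P_G$ and $P_{data}$

Example Objective Function for $D$ ($G$ is fixed): $V(G,D)=E_{x\sim P_{data}}[\log D(x)]+E_{x\sim P_G}[\log(1-D(x))]$
- A sigmoid must be added after the NN so that the argument of each $\log$ stays in $(0,1)$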
Training: Using the example objective function is exactly the same as training a binary classifier (i.e. minimize the cross entropy error)
The maximum objective value is related to JS divergence.
- intuition: small divergence $\Rightarrow$

$\max_DV(G,D)$

Given $G$, what is the optimal $D^*$ maximizing $V(G,D)$? (Assume that $D(x)$ can be any function)
Given $x$, find the optimal $D^*$ pointwise (since $D(x)$ can be any function), i.e. find $D^*$ maximizing: $f(D)=a\log(D)+b\log(1-D)$
Next we can substitute $D^*$ back into $V(G,D)$

JS divergence $\in[0,\log2]$
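For reference, the maximization above can be worked out in a few lines. This sketch assumes the usual objective $V(G,D)=E_{x\sim P_{data}}[\log D(x)]+E_{x\sim P_G}[\log(1-D(x))]$, with $a=P_{data}(x)$ and $b=P_G(x)$; $JSD$ denotes the JS divergence.

```latex
\begin{aligned}
f(D) &= a\log D + b\log(1-D),\qquad a=P_{data}(x),\ b=P_G(x) \\
\frac{df}{dD} &= \frac{a}{D}-\frac{b}{1-D}=0
 \ \Rightarrow\ D^*(x)=\frac{a}{a+b}=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)} \\
\max_D V(G,D) &= V(G,D^*)
 = E_{x\sim P_{data}}\Big[\log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}\Big]
 + E_{x\sim P_G}\Big[\log\frac{P_G(x)}{P_{data}(x)+P_G(x)}\Big] \\
 &= -2\log 2 + 2\,JSD(P_{data}\,\|\,P_G)
\end{aligned}
```

Since $JSD\in[0,\log2]$, the maximum objective value lies in $[-2\log2,\,0]$.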

Algorithm

Our objective
$G^*=\argmin_GDiv(P_G,P_{data})=\argmin_G\ \max _{D} V(G, D)$

How to find $G^*$

(1) Initialize generator and discriminator
(2) In each training iteration:
- Step 1: Fix generator $G$, and update discriminator $D$ $\Rightarrow$ Given a generator $G$, $\max _{D} V(G, D)$
- Step 2: Fix discriminator $D$, and update generator $G$ $\Rightarrow$ Pick the $G$ defining $P_G$

Notation

$L(G)=\max_DV(G,D)$

Algorithm

Given $G_0$
Find $D_0^*$
Obtain $G_1$
Find $D_1^*$
Obtain $G_2$
…

Decrease JS divergence (?)

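Note that the Algorithm above flags that training the Generator (gradient descent on $V(G,D_0^*)$) does not necessarily decrease the JS divergence. The reason is that once the Generator changes, the value computed with the same Discriminator no longer measures the JS divergence
Then why can gradient descent on $V(G,D_0^*)$ be regarded as decreasing the JS divergence? Because we add an assumption: $D_0^*\approx D^*_1$
This assumption requires: Don’t update $G$ too much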

In practice, how to compute $\max_DV(G,D)$

We can use sampling to approximate expectation
Sample $\{x^1,x^2,...,x^m\}$

Cross entropy error: $-y\log (\tilde{y})-(1-y) \log (1-\tilde{y})$
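The link between the sampled objective and the cross entropy error can be checked numerically. The sketch below is an illustration under assumptions: `d_real` and `d_fake` are made-up discriminator outputs (already passed through a sigmoid) for real and generated samples, and the names `v_tilde` and `mean_bce` are hypothetical.

```python
import math

def v_tilde(d_real, d_fake):
    """Sample estimate of V(G, D): mean log D(x) + mean log(1 - D(x_tilde))."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def cross_entropy(y, y_tilde):
    """Binary cross entropy for one sample with label y and prediction y_tilde."""
    return -y * math.log(y_tilde) - (1 - y) * math.log(1 - y_tilde)

def mean_bce(d_real, d_fake):
    """Average cross entropy with label 1 for real and label 0 for generated."""
    real = sum(cross_entropy(1, d) for d in d_real) / len(d_real)
    fake = sum(cross_entropy(0, d) for d in d_fake) / len(d_fake)
    return real + fake

# Illustrative (made-up) discriminator outputs in (0, 1).
d_real = [0.9, 0.8, 0.7]
d_fake = [0.2, 0.1, 0.4]

# Maximizing v_tilde is the same as minimizing the cross entropy.
print(v_tilde(d_real, d_fake), -mean_bce(d_real, d_fake))
```

The two printed numbers agree up to floating-point error, which is why training $D$ reduces to training a binary classifier.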

Summary

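Training the Discriminator serves to measure the JS divergence, so in theory we would train it to convergence in every iteration. In practice a limited number of Gradient Ascent steps is enough to obtain a rough lower bound of the JS divergence, and $D$ need not be trained to convergence (even then it may converge to a local optimum, or fail to reach the global optimum because of the limited capacity of $D$) (in the extreme case, updating $D$ only once per iteration can still give good results)
Recall the assumption made in the discussion of Decrease JS divergence (?). To keep it valid, the Generator must not change too much in one update, so in each iteration we take only 1 gradient-descent step on the parameters of $G$
Note that when training $G$, the parameters of $D$ are fixed, so $\tilde V$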

Objective Function for Generator in Real Implementation

Minimax GAN (MMGAN): at the start of Generator training, $D(x)$ is small, meaning the generated images cannot fool the Discriminator, and at this point $\log(1-D(x))$
Non-saturating GAN (NSGAN): to fix this drawback, replace $\log(1-D(x))$ with $-\log(D(x))$
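To see why the replacement helps, compare the two generator losses through their gradients. The sketch assumes $D(x)=\mathrm{sigmoid}(s)$, where $s$ is the discriminator's pre-sigmoid output on a generated sample; `grad_mm` and `grad_ns` are hypothetical helper names.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_mm(s):
    """d/ds log(1 - sigmoid(s)) = -sigmoid(s): the MMGAN generator loss."""
    return -sigmoid(s)

def grad_ns(s):
    """d/ds [-log(sigmoid(s))] = -(1 - sigmoid(s)): the NSGAN generator loss."""
    return -(1.0 - sigmoid(s))

# Early in training the discriminator rejects generated samples: D(x) is small.
for s in (-5.0, 0.0, 5.0):
    print(f"D(x)={sigmoid(s):.4f}  MMGAN grad={grad_mm(s):.4f}  NSGAN grad={grad_ns(s):.4f}")
```

When $D(x)$ is close to 0 the MMGAN gradient almost vanishes, while the NSGAN gradient has magnitude close to 1, so the generator still receives a useful signal.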

fGAN: General Framework of GAN

paper: f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization
- One sentence: you can use any f-divergence (fGAN lets us minimize many different divergences; in practice the differences among them are fairly small)

f-divergence

f-divergence

$P$ and $Q$ are two distributions. $p(x)$ and $q(x)$ are the probability of sampling $x$. $f$ is convex; $D_f(P||Q)$ evaluates the difference of $P$ and $Q$

If $P$ and $Q$ are the same distributions, $D_f(P||Q)$ has the smallest value, which is 0

If $p(x)=q(x)$ for all $x$

KL divergence: $f(x)=x\log x$

Reverse KL divergence: $f(x)=-\log x$

Chi Square divergence: $f(x)=(x-1)^2$
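The three choices of $f$ listed above can be tried on small discrete distributions. The sketch assumes the standard definition $D_f(P||Q)=\sum_x q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)$ with $f(1)=0$ (the discrete form of the integral); the distributions `p` and `q` are made up for illustration.

```python
import math

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x q(x) * f(p(x) / q(x)) for discrete P, Q with q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def f_kl(u):
    """f(x) = x log x gives the KL divergence."""
    return u * math.log(u) if u > 0 else 0.0

def f_reverse_kl(u):
    """f(x) = -log x gives the reverse KL divergence."""
    return -math.log(u)

def f_chi_square(u):
    """f(x) = (x - 1)^2 gives the Chi Square divergence."""
    return (u - 1) ** 2

# Made-up example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Direct KL(P||Q) for comparison with the f-divergence form.
kl_direct = sum(px * math.log(px / qx) for px, qx in zip(p, q))

print(f_divergence(p, q, f_kl), kl_direct)
print(f_divergence(p, q, f_reverse_kl))
print(f_divergence(p, q, f_chi_square))
```

Each divergence is 0 when the two distributions are identical, because $f(1)=0$ for all three functions.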

Fenchel Conjugate

Every convex function $f$ has a conjugate function $f^*$; $(f^*)^*=f$

As the figure below shows, $f^*(t)$

$f^*(t)$

e.g. when $f(x)=x\log x$
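The example can be completed by hand. The sketch uses the standard definition of the Fenchel conjugate, $f^*(t)=\max_{x\in dom(f)}\{xt-f(x)\}$, which is assumed here rather than taken from the notes.

```latex
\begin{aligned}
f^*(t) &= \max_{x\in dom(f)}\{xt - f(x)\} \\
g(x) &= xt - x\log x,\qquad \frac{dg}{dx} = t-\log x-1 = 0 \ \Rightarrow\ x=e^{t-1} \\
f^*(t) &= e^{t-1}\,t - e^{t-1}(t-1) = e^{t-1}
\end{aligned}
```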

Connection with GAN

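$f$ and $f^*$ are each other's Fenchel Conjugate Function
So when computing an $f$-divergence, $f$ can be replaced by its Fenchel Conjugate Function
$D$ is a function whose input is $x$ and output is $t$; we can obtain a lower bound of $D_f(P||Q)$
We can approach $D_f(P||Q)$ by finding the largest lower bound
This gives us $D_f(P_{data}||P_G)$
From which we can write $G^*$
Given the $f$-divergence we want to minimize, we can now find its $f^*$, obtain $V(G,D)$, and train a GAN that minimizes that $f$-divergence!

Next, let us see what problem the $f$-divergence is meant to solve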

Mode Collapse, Mode Dropping

Mode Collapse

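Mode Collapse: when training a GAN, the distribution of the real data is broad, but the distribution of the generated data is narrow
- e.g. as in the figure below, in image generation the outputs keep cycling through the same few images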

Mode Dropping

Mode Dropping: the real data distribution may have several modes, but the generated data covers only some of them. At first glance the generated data looks fine and diverse enough, but it actually reproduces only part of the real data

Why?

Intuitively, Mode Collapse and Mode Dropping are easy to understand: once the Generator learns to produce a kind of image that always fools the Discriminator, it keeps generating that kind of image
Dive deeper: Flaw in Optimization? (just a guess…): when $P_{data}>0, P_G=0$, KL divergence $\rightarrow\infty$

Ensemble

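Ensembling can effectively sidestep Mode Collapse and Mode Dropping. For example, to produce 25 images we can train 25 GANs and let each generate 1 image. Even if every GAN suffers from Mode Collapse or Mode Dropping, the 25 images will still differ from one another (to produce a single image, pick one Generator at random)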

Double-loop v.s. Single-step

Tips for Improving GAN

paper: Wasserstein GAN, Improved Training of Wasserstein GANs

JS divergence is not suitable

In most cases, $P_G$ - Why?
- (1) The nature of data: Both $P_G$
- (2) Sampling: Even though $P_G$

What is the problem of JS divergence?

JS divergence is $\log2$ (when GAN training has just started, $P_G$
- To quote the SNGAN paper: “When the support of the model distribution and the support of the target distribution are disjoint, there exists a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017). Once such discriminator is produced in this situation, the training of the generator comes to complete stop, because the derivative of the so-produced discriminator with respect to the input turns out to be 0. This motivates us to introduce some form of restriction to the choice of the discriminator.”
- Solution: (1) Weaken the Discriminator: add Dropout or update it less often so that it cannot overfit; but a Discriminator that is too weak can no longer measure the JS divergence… (2) Add Noise (Noises decay over time): Add some artificial noise to the inputs of discriminator; Make the labels noisy for the discriminator (add noise to the input / label when training the discriminator) $\Rightarrow$
Another problem of the original GAN is that the final activation of $D$ is a sigmoid, which easily causes vanishing gradients $\Rightarrow$ Least Square GAN (LSGAN): Replace sigmoid with linear (replace classification with regression)
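The claim that JS divergence is stuck at $\log2$ whenever the two distributions do not overlap can be checked on a toy example. The histograms below are made up: `p_g_near` is intuitively much closer to `p_data` than `p_g_far`, yet JS divergence cannot tell them apart.

```python
import math

def kl(p, q):
    """KL(P||Q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def js(p, q):
    """JS divergence: average KL of P and Q to their mixture M = (P + Q) / 2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy histograms over four bins with disjoint supports.
p_data = [1.0, 0.0, 0.0, 0.0]
p_g_far = [0.0, 0.0, 0.0, 1.0]
p_g_near = [0.0, 1.0, 0.0, 0.0]

print(js(p_data, p_g_far), js(p_data, p_g_near), math.log(2))
```

Both divergences equal $\log2$, so moving the generated mass closer to the data produces no improvement in the objective and hence no training signal.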

Wasserstein GAN (WGAN)

Trivia: the W in Wasserstein is pronounced as V (because the name is Russian)

One sentence for WGAN: Using Earth Mover’s Distance to evaluate two distributions

Earth Mover’s Distance (Wasserstein Distance)

Earth Mover’s Distance

Considering one distribution $P$ as a pile of earth, and another distribution $Q$ as the target. The average distance the earth mover has to move the earth.
There are many possible “moving plans”. Using the “moving plan” with the smallest average distance to define the earth mover’s distance.

Formal Definition

A “moving plan” is a matrix. The value of the element is the amount of earth from one position to another.
Average distance of a plan $\gamma$
Earth Mover’s Distance (All possible plan $\prod$
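In one dimension the best moving plan has a simple closed form, which makes a small numerical sketch possible. The code below assumes two histograms on equally spaced bins with the same total mass; the earth that must cross each bin boundary is the running difference of the two histograms. The function name `emd_1d` and the toy histograms are hypothetical.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on bins 0, 1, ..., n-1.

    Assumes unit spacing between bins and equal total mass in p and q.
    """
    distance = 0.0
    carried = 0.0  # earth that has to be carried from bin i to bin i + 1
    for px, qx in zip(p, q):
        carried += px - qx
        distance += abs(carried)
    return distance

# Toy histograms: all of the mass sits in a single bin.
p_data = [1.0, 0.0, 0.0, 0.0]
p_g_far = [0.0, 0.0, 0.0, 1.0]
p_g_near = [0.0, 1.0, 0.0, 0.0]

print(emd_1d(p_data, p_g_far))   # mass moved across 3 bins
print(emd_1d(p_data, p_g_near))  # mass moved across 1 bin
```

Unlike a divergence that saturates on disjoint supports, the distance shrinks as the generated mass moves toward the data.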

Why Earth Mover’s Distance?

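In GAN, the Wasserstein Distance has better mathematical properties than the f divergence: it is continuous everywhere and differentiable almost everywhere, with a nonzero derivative

Evaluate Wasserstein distance

Evaluate Wasserstein distance between $P_{data}$ and $P_G$ (the proof is involved and omitted here)
- Smooth: function does not change fast
- Without the constraint, the training of $D$ will not converge: if the constraint is dropped, real data and generated data rarely overlap, so the discriminator pushes its value on real data toward $\infty$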

Lipschitz Function

$\|f(x_1)-f(x_2)\|\leq K\|x_1-x_2\|$; $K=1$ for “1-Lipschitz”

How to fulfill this constraint?

Original version (weight clipping)

Weight Clipping

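Force the parameters $w$ between $c$ and $-c$. After parameter update, if $w>c$, $w=c$; if $w<-c$, $w=-c$
- Intuition: $w$ is bounded, so when the input changes, the change in the output is always limited
- Limitation: (1) We only ensure that $D$ is $K$-Lipschitz for some $K$ (2) Do not truly find the function $D$ maximizing the function: a $D$ may violate the weight-clipping condition and still satisfy the 1-Lipschitz constraint; in other words, weight clipping covers only a subspace of all 1-Lipschitz functions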
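The clipping rule itself is a one-liner. The sketch below applies it to a flat list of weights; the threshold `c = 0.01` and the weight values are illustrative assumptions, not values given in these notes.

```python
def clip_weights(weights, c):
    """Clamp every weight into [-c, c]; meant to run after each parameter update."""
    return [max(-c, min(c, w)) for w in weights]

# Made-up weights after a hypothetical update step.
weights = [0.5, -0.02, 0.003, -1.2]
print(clip_weights(weights, c=0.01))
```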

Without WGAN, real data and generated data usually do not overlap, so the JS divergence stays at $\log2$
With WGAN, the Wasserstein distance can be used to monitor how well GAN training is going!

Improved WGAN (WGAN-GP, gradient penalty)

A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere. (The norm of the Discriminator's gradient with respect to the input $x$ must be at most 1)
So we can add a penalty term to $V(G,D)$ $\Rightarrow$ Prefer $||\nabla_xD(x)||\leq1$
$\begin{aligned} V(G, D) \approx &\max _{D}\{E_{x \sim P_{data}}[D(x)]-E_{x \sim P_{G}}[D(x)]\\ &\quad\quad -\lambda \int_{x} \max (0,\|\nabla_{x} D(x)\|-1) d x\} \end{aligned}$
But we cannot actually integrate over the whole input space, so the integral is replaced by sampling $\Rightarrow$ Prefer $||\nabla_xD(x)||\leq1$ for $x$ sampled from $P_{penalty}$
$\begin{aligned} V(G, D) \approx& \max _{D}\{E_{x \sim P_{data}}[D(x)]-E_{x \sim P_{G}}[D(x)] \\ &\quad\quad -\lambda E_{x\in P_{penalty}} [\max (0,\|\nabla_{x} D(x)\|-1)]\} \end{aligned}$
In practice, when training a GAN we want the gradient norm to be as close to 1 as possible
$\begin{aligned} V(G, D) \approx& \max _{D}\{E_{x \sim P_{data}}[D(x)]-E_{x \sim P_{G}}[D(x)] \\ &\quad\quad -\lambda E_{x\in P_{penalty}} [(||\nabla_xD(x)||-1)^2]\} \end{aligned}$
“Simply penalizing overly large gradients also works in theory, but experimentally we found that this approach converged faster and to better optima.”
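A toy version of the penalty term can be written without an autograd library. The sketch makes several assumptions: inputs are 1-D scalars, the gradient is approximated by a central finite difference, and the penalty points are random interpolations between a real and a generated sample (the sampling scheme used in the WGAN-GP paper). The critics and data are made up, and a real implementation would use automatic differentiation instead.

```python
import random

def numerical_grad(critic, x, h=1e-5):
    """Central finite-difference approximation of dD/dx for a scalar input."""
    return (critic(x + h) - critic(x - h)) / (2 * h)

def gradient_penalty(critic, real, fake, rng):
    """Mean of (|dD/dx| - 1)^2 at points sampled between real and fake samples."""
    total = 0.0
    for x_real, x_fake in zip(real, fake):
        eps = rng.random()
        x_hat = eps * x_real + (1 - eps) * x_fake  # a sample from P_penalty
        grad_norm = abs(numerical_grad(critic, x_hat))
        total += (grad_norm - 1.0) ** 2
    return total / len(real)

rng = random.Random(0)
real = [1.0, 1.5, 2.0]     # made-up real samples
fake = [-1.0, -0.5, 0.0]   # made-up generated samples

gp_unit = gradient_penalty(lambda x: x, real, fake, rng)         # slope 1
gp_steep = gradient_penalty(lambda x: 3.0 * x, real, fake, rng)  # slope 3
print(gp_unit, gp_steep)
```

A critic with slope 1 is not penalized, while a critic with slope 3 pays roughly $(3-1)^2=4$, which is then scaled by $\lambda$ in the objective.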

Performance

As the results show, WGAN and WGAN-GP are more robust than DCGAN and LSGAN and less sensitive to the network settings

Algorithm

$V(G,D)$ no longer contains a $\log$, so there is no need for a sigmoid to restrict the range of $D(x)$

Spectral Normalization (SNGAN)

paper: Spectral Normalization for Generative Adversarial Networks

Spectral Normalization → Keep gradient norm smaller than 1 everywhere

Energy-based GAN (EBGAN)

Using an autoencoder as discriminator
- Using the negative reconstruction error of auto-encoder to determine the goodness (the lower the reconstruction error, the higher the image quality is taken to be)
- Benefit: The auto-encoder can be pre-trained on real images without a generator. (By contrast, an ordinary NN-based Discriminator needs negative examples for training and therefore cannot be pre-trained)
Auto-encoder based discriminator only gives limited region large value.

GAN is still challenging …

GANs are very hard to train, and getting the network to train often requires hyperparameter tuning (GAN training is dynamic, and sensitive to nearly every aspect of its setup (from optimization parameters to model architecture).)
A simple structural explanation: Generator and Discriminator need to match each other. That is, if either the Generator or the Discriminator stops improving during training, the other stops improving too

More Tips

How to Train a GAN? Tips and tricks to make GANs work

Ref: How to Train a GAN? Tips and tricks to make GANs work、怎样训练一个 GAN？一些小技巧让 GAN 更好的工作、训练不稳定、调参难度大，这里有 7 大法则带你规避 GAN 训练的坑！

(1) Normalize the inputs:
- normalize the images between -1 and 1: img / 127.5 - 1
- Tanh as the last layer of the generator output: generated images are also fed to the discriminator, so the generator output should lie between -1 and 1 as well (the same range as the real images)
(2) Avoid Sparse Gradients: ReLU, MaxPool
- the stability of the GAN game suffers if you have sparse gradients
- LeakyReLU = good (in both G and D)
- For Downsampling, use: Average Pooling, Conv2d + stride
- For Upsampling, use: PixelShuffle, ConvTranspose2d + stride
(3) Use stability tricks from RL
- Experience Replay
  - Keep a replay buffer of past generations and occasionally show them
  - Keep checkpoints from the past of G and D and occasionally swap them out for a few iterations
- All stability tricks that work for deep deterministic policy gradients
- See Pfau & Vinyals (2016)
(4) Use the ADAM Optimizer
(5) Track failures early
- $D$ loss goes to 0: failure mode
- check norms of gradients: if they are over 100 things are screwing up; ideally the generator receives large gradients early in training, because it has to learn how to produce realistic-looking data. The discriminator, on the other hand, should not keep receiving large gradients early on, because it can easily tell real images from generated ones. Once the generator is trained well enough, the discriminator can no longer separate them so easily; it keeps making mistakes and receives larger gradients
- when things are working, $D$ loss has low variance and goes down over time vs having huge variance and spiking
- if loss of generator steadily decreases, then it’s fooling D with garbage
(6) Don’t balance loss via statistics (unless you have a good reason to)
- Don’t try to find a (number of G / number of D) schedule to uncollapse training. It’s hard and we’ve all tried it.
- If you do try it, have a principled approach to it, rather than intuition. For example

while lossD > A:
  train D
while lossG > B:
  train G

(7) Use Dropouts in G in both train and test phase
- Provide noise in the form of dropout (50%).
- Apply on several layers of our generator at both training and test time
- https://arxiv.org/pdf/1611.07004v1.pdf
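Tip (1) above is easy to get wrong by a small offset, so here is a minimal sketch of the mapping for a single 8-bit pixel value. The helper names `normalize` and `denormalize` are hypothetical; in practice the same arithmetic is applied to whole image arrays.

```python
def normalize(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1], as in img / 127.5 - 1."""
    return pixel / 127.5 - 1.0

def denormalize(value):
    """Map a Tanh-range value in [-1, 1] back to an integer pixel in [0, 255]."""
    return round((value + 1.0) * 127.5)

print(normalize(0), normalize(255))  # the two ends of the range
print(denormalize(normalize(200)))   # round trip back to the pixel value
```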

Feature Extraction

InfoGAN

paper: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

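In a GAN, the input is a vector sampled from some distribution, and we hope that after training each dimension of that vector represents some characteristic
- Regular GAN: Modifying a specific dimension, no clear meaning (in the figure below the horizontal axis corresponds to changing one dimension of the input)

The colors represent the characteristics. (Taking a two-dimensional input vector as an example, we hope that objects with different characteristics are arranged in the latent space in some regular way)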

What is InfoGAN?

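The input $z$ is split into two parts, $c$ and $z'$: each dimension of $c$ represents some characteristic of the image, and $z'$ is the random, unexplainable part
Besides the GAN structure, InfoGAN adds a Classifier that has to recover $c$ from $x$ (The classifier can recover $c$ from $x$. $c$ must have clear influence on $x$). Note that the Generator and the Classifier together form an Auto-encoder structure
Since the Classifier and the Discriminator both take $x$ as input, they can share some parameters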

VAE-GAN

paper: Autoencoding beyond pixels using a learned similarity metric

VAE-GAN
- (1) GAN strengthens VAE: the first two parts, Encoder and Generator (Decoder), form a VAE. Without a Discriminator, minimizing only the reconstruction error tends to give blurry images, because a good loss between two images is hard to define. With the Discriminator, the generated images can also be made more realistic by fooling it
- (2) VAE strengthens GAN: the last two parts, Generator (Decoder) and Discriminator, form a GAN. VAE-GAN adds an Encoder, so minimizing the reconstruction error also helps make the generated images more realistic

BiGAN

paper: Adversarial Feature Learning

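BiGAN also consists of three parts: Encoder, Decoder and Discriminator. However, the Encoder and the Decoder are not chained as an Auto-encoder; they are tied together through the Discriminator, which takes an Image $x$ and a code $z$ together and judges whether the pair comes from the Encoder or from the Decoder
What is the point? Let the (input, output) pairs of the Encoder follow a joint distribution $P(x,z)$ and those of the Decoder follow a joint distribution $Q(x,z)$. The Discriminator does the same job as in a GAN: it measures the difference between these two distributions. The Encoder and the Decoder both try to fool the Discriminator, and over the iterations $P$ and $Q$ become closer and closer, which finally yields the following optimal Encoder and Decoder:

Algorithm

Here the Discriminator raises the score of $(x,z)$ pairs from the Encoder and lowers the score of pairs from the Decoder. The reverse also works (raise the score of pairs from the Decoder and lower that of pairs from the Encoder), because the Discriminator only measures the difference between $P$ and $Q$

Note that the optimal encoder and decoder are formally equivalent to training the following two Auto-encoders. Although they coincide at the optimal solution, training never reaches the optimum, and in experiments they behave quite differently (BiGAN captures the semantics of an image more readily and generates sharp images: given a picture of a bird it returns a different-looking bird, whereas an Auto-encoder returns a blurrier copy of the original)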

Triple GAN

paper: Triple Generative Adversarial Nets

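$D$: Discriminator, $G$: Generator, $C$: Classifier
Leaving $C$ aside, $G$ and $D$ form a Conditional GAN. $G$ takes the condition $y$ as input and outputs $x$. Then
Triple GAN is a Semi-supervised learning method: a small part of the training data is labeled, but most of it is unlabeled ($x$ and $y$ are not paired). We can train $C$ on the labeled data and the generated data so that $C$ eventually maps an input $x$ to an output $y$

For the detailed rationale, see the paper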

Domain-adversarial training

paper: Domain-Adversarial Training of Neural Networks

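Training and testing data are in different domains: e.g. when the training data and the testing data follow somewhat different distributions, a model trained on the training data does not perform well when tested directly on the testing data. We can therefore use a Generator to extract features from both, such that the extracted features share the same distribution

The feature extractor is the Generator; the Domain classifier is the Discriminator, which measures the distribution divergence between testing data and training data; the Label predictor is a classifier, e.g. one that classifies digits

The three parts can be trained jointly (as in the original paper), or trained alternately in the GAN fashion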

Feature Disentangle

Original Seq2seq Auto-encoder

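Suppose we want an Auto-encoder to extract the phonetic information of a speech segment. The latent representation contains more than phonetic information: it also carries speaker information, environment information and so on (e.g. two people saying the same word produce different signals, because the phonetic content is similar but the speaker information differs)

Feature Disentangle

e.g. phonetic information can be used for speech recognition, speaker information for voiceprint matching

Training

(1) Train Speaker Encoder
(2) Train Phonetic Encoder: additionally train a Speaker Classifier that judges $z_i$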

Result: Audio segments of two different speakers

Photo Editing

video, ppt

Sequence Generation

2018 (detailed walkthrough): video, ppt
2021: video (4:06), ppt (p50)

Evaluation of GAN

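That is, how to evaluate the quality of GAN-generated objects objectively

We don’t want memory GAN

When training a GAN we do not want it to memorize and output images already in the database. If the GAN outputs an exact copy, we can detect this by computing the Euclidean distance to the images in the database. But the GAN may also output a database image shifted up / down / left / right by 1 / 2 / 3… pixels, or flipped horizontally. Such images are very similar to the original, yet by Euclidean distance their nearest database image is often not the original but some other image, so it becomes hard to tell whether a generated image is a copy from the database
- e.g. in the figure below, after the truck image is shifted by 3 pixels, its nearest image turns out to be an airplane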

Solution: Using k-nearest neighbor to check whether the generator generates new objects

Likelihood

In the traditional evaluation of generative models, we sample some real examples $x^i$ that were not used in training
But we cannot compute $P_G(x^i)$ (in GAN). We can only sample from $P_G$

Likelihood - Kernel Density Estimation

Estimate the distribution of $P_G(x)$
Now we have an approximation of $P_G$, so we can compute $P_G(x^i)$
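The estimate can be sketched in a few lines for 1-D data. The code assumes a Gaussian kernel with a hand-picked `bandwidth` centered on each generated sample; the function name and all numbers are illustrative, and real images would need a multivariate kernel.

```python
import math

def gaussian_kde_log_likelihood(generated, held_out, bandwidth):
    """Average log-density of held_out under a Gaussian KDE fitted to generated."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for x in held_out:
        # Density at x: mean of one Gaussian bump per generated sample.
        density = sum(
            norm * math.exp(-0.5 * ((x - g) / bandwidth) ** 2) for g in generated
        ) / len(generated)
        total += math.log(density)
    return total / len(held_out)

generated = [-0.1, 0.0, 0.1]  # made-up samples drawn from the generator
ll_close = gaussian_kde_log_likelihood(generated, [0.05], bandwidth=0.5)
ll_far = gaussian_kde_log_likelihood(generated, [2.0], bandwidth=0.5)
print(ll_close, ll_far)
```

Real data near the generated samples gets a higher log-likelihood than real data far from them; the result depends on the bandwidth and on how many samples are drawn.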

Likelihood v.s. Quality

Low likelihood, high quality?: Considering a model generating good images (small variance)
High likelihood, low quality?: as shown below, $G_2$

Inception Score (IS)

Ref: Improved Techniques for Training GANs

Use an already trained classifier to evaluate the generated objects

(1) Concentrated distribution (lower entropy) means higher visual quality (the classifier output for each image is a distribution giving the probability that the image belongs to each class)
- e.g. for generated images, a trained image classifier can judge the quality: if it assigns an image to some class with very high probability, the image can be considered good
(2) Uniform distribution means higher variety: we can also evaluate the diversity of the generated objects. As in the figure below, sample 3 images, let the CNN classify them to get 3 distributions, and average those into one distribution. If the averaged distribution is close to uniform, every class has been generated and the generated objects are diverse

Inception Score

The classifier is an Inception Net trained on ImageNet, hence the name Inception Score

Inception Score:
$\begin{aligned} & {\exp \left(\mathbb{E}_{{x}} \operatorname{KL}(p(y \mid {x}) \| p(y))\right)} \end{aligned}$
Since only relative magnitudes matter, $\exp$ can be ignored; in practice the expectation is also replaced by $\sum_x$
$\begin{aligned}&\mathbb{E}_{{x}} \operatorname{KL}(p(y \mid {x}) \| p(y))\\ =&\sum_{x} \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{P(y)} \\ =&\sum_{{x}} \sum_{y} P(y \mid x) \log P(y \mid x)\quad\quad (\text{Negative entropy})\\ &\quad\quad- \sum_{x} \sum_{y} P(y \mid x) \log P(y)\quad\quad (\text{Cross entropy}) \end{aligned}$
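The formula can be evaluated directly once the classifier outputs are available. In the sketch below, `probs` stands for hypothetical classifier outputs $p(y\mid x)$, one probability vector per generated sample, and $p(y)$ is estimated as their average. The vectors are made up; a real evaluation would take them from an Inception Net.

```python
import math

def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)), with p(y) the average of p(y|x)."""
    n = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(
            pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0
        )
    return math.exp(mean_kl / n)

# Confident predictions spread over all 3 classes: sharp and diverse.
sharp_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Confident predictions that all fall in one class: mode collapse.
collapsed = [[1.0, 0.0, 0.0]] * 3
# Uncertain predictions: low visual quality.
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(inception_score(sharp_diverse), inception_score(collapsed), inception_score(blurry))
```

The score reaches the number of classes only when each prediction is confident and the classes are covered evenly; it falls to 1 if either property fails.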

Mode collapse, Mode missing

Mode collapse is easy to detect.
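Mode missing: if the Discriminator gives a database image a particularly high score, that image may belong to a missing mode (the Generator never produces such images)

Limitations: Inception Score depends on the classifier's training data; if the Generator produces very realistic images that resemble nothing in that training data, the Inception Score will not be high; or if all generated images are anime faces and the Inception Net classifies them all as human faces, IS cannot assess their quality
Remedy: FID: extract features of the GAN outputs and of real images and compare the two; the smaller the better. This compensates for some weaknesses of Inception Score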

Fréchet Inception Distance (FID)

Ref: GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

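The output of the last hidden layer of the Inception Net is taken as the feature. Assuming the features of generated images and of real images each follow a Gaussian Distribution, FID is the Fréchet distance between the two distributions, so a smaller FID is better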
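For two Gaussians the Fréchet distance has a closed form. In the expression below, $(\mu_r,\Sigma_r)$ and $(\mu_g,\Sigma_g)$ are the mean and covariance of the features of real and generated images; these symbol names are chosen here for illustration.

```latex
FID = \|\mu_r-\mu_g\|^2 + Tr\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big)
```

In one dimension this reduces to $(\mu_r-\mu_g)^2+(\sigma_r-\sigma_g)^2$, which is 0 exactly when the two Gaussians coincide.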