Generative Networks

ACTL3143 & ACTL5111 Deep Learning for Actuaries

Author

Patrick Laub

Generative Adversarial Networks

GANs consist of two neural networks, a generator and a discriminator, trained simultaneously through adversarial training. The generator takes random noise as input and produces a synthetic data observation; its goal is to learn to generate synthetic data that closely resembles the real data. The discriminator takes real and synthetic observations and classifies each as ‘real’ or ‘fake’; its goal is to correctly identify whether an input is real or synthetic. Equilibrium is reached when the generator produces data that closely resembles the real data and the discriminator can no longer distinguish the two with high confidence.

GAN faces

Try out https://www.whichfaceisreal.com.

Example StyleGAN2-ADA outputs

GAN structure

A schematic of a generative adversarial network.

GAN intuition

Intuition about GANs

  • A forger creates a fake Picasso painting to sell to an art dealer.
  • The art dealer assesses the painting.

How they best each other:

  • The art dealer is shown both authentic paintings and fakes. Later, the accuracy of his assessments is evaluated, and he trains to become better at detecting fakes. Over time, he becomes increasingly expert at authenticating Picasso’s artwork.
  • The forger receives an assessment from the art dealer every time he submits a fake. Whenever the dealer detects a fake, the forger knows he must refine his craft. Over time, he becomes increasingly adept at imitating Picasso’s style.

Generative adversarial networks

  • A GAN is made up of two parts:
    • Generator network: the forger. Takes a random point in the latent space and decodes it into a synthetic data point/image.
    • Discriminator network (or adversary): the expert. Takes a data point/image and decides whether it comes from the original data set (the training set) or was created by the generator network.

Discriminator

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Leaky ReLU activation shared by the convolutional layers.
lrelu = layers.LeakyReLU(alpha=0.2)

discriminator = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, 3, strides=2, padding="same", activation=lrelu),
    layers.Conv2D(128, 3, strides=2, padding="same", activation=lrelu),
    layers.GlobalMaxPooling2D(),
    layers.Dense(1)])  # a single logit: how "real" the input looks

discriminator.summary()

Generator

latent_dim = 128
generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    # Map the latent vector to a 7x7x128 feature map.
    layers.Dense(7 * 7 * 128, activation=lrelu),
    layers.Reshape((7, 7, 128)),
    # Upsample 7x7 -> 14x14 -> 28x28.
    layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation=lrelu),
    layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation=lrelu),
    # Single-channel output image with pixel values in [0, 1].
    layers.Conv2D(1, 7, padding="same", activation="sigmoid")])
generator.summary()
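As a quick sanity check (a minimal sketch, not from the original notebook), we can push a few random latent vectors through the untrained generator and confirm the outputs have the expected 28×28×1 image shape:

import matplotlib.pyplot as plt

# Decode a few random latent vectors into (currently noise-like) fake digits.
random_latent_vectors = tf.random.normal(shape=(3, latent_dim))
fake_digits = generator(random_latent_vectors)
print(fake_digits.shape)  # (3, 28, 28, 1)

fig, axes = plt.subplots(1, 3)
for ax, digit in zip(axes, fake_digits):
    ax.imshow(digit[:, :, 0], cmap="gray")
    ax.axis("off")
plt.show()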

Training GANs

GAN cost functions

The loss function à la 3Blue1Brown.

GAN - Schematic process

First step: Training discriminator:

  • Draw random points in the latent space (random noise).
  • Use generator to generate data from this random noise.
  • Mix the generated data with real data and feed the mixture to the discriminator. The training targets are the correct labels (‘real’ or ‘fake’). Train the discriminator to minimise a loss function that measures the difference between the discriminator’s predictions and these correct labels.

GAN - Schematic process II

Second step: Training generator:

  • Draw random points in the latent space and generate data with generator.
  • Use the discriminator to classify the generated data. The generator tries to fool the discriminator into classifying all generated data as real. Train the generator to minimise a loss function that measures the difference between the discriminator’s predictions and the desired targets: “all of this data is real” (which is not true).

GAN - Schematic process III

  • When training, the discriminator may end up dominating the generator because its loss tends towards zero faster. In that case, try reducing the discriminator’s learning rate and increasing its dropout rate.
  • There are a few other tricks for implementing GANs, such as introducing stochasticity by adding random noise to the labels given to the discriminator, using strided convolutions instead of pooling in the discriminator, and using a kernel size that is divisible by the stride.

Train step

# Separate optimisers for discriminator and generator.
d_optimizer = keras.optimizers.Adam(learning_rate=0.0003)
g_optimizer = keras.optimizers.Adam(learning_rate=0.0004)

# Instantiate a loss function.
loss_fn = keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(real_images):
  # Sample random points in the latent space
  random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))
  # Decode them to fake images
  generated_images = generator(random_latent_vectors)
  # Combine them with real images
  combined_images = tf.concat([generated_images, real_images], axis=0)

  # Assemble labels discriminating real from fake images
  labels = tf.concat([
    tf.zeros((batch_size, 1)),
    tf.ones((real_images.shape[0], 1))], axis=0)

  # Add random noise to the labels - important trick!
  labels += 0.05 * tf.random.uniform(labels.shape)

  # Train the discriminator
  with tf.GradientTape() as tape:
    predictions = discriminator(combined_images)
    d_loss = loss_fn(labels, predictions)
  grads = tape.gradient(d_loss, discriminator.trainable_weights)
  d_optimizer.apply_gradients(zip(grads, discriminator.trainable_weights))

  # Sample random points in the latent space
  random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))

  # Assemble labels that say "all real images"
  misleading_labels = tf.ones((batch_size, 1))

  # Train the generator (note that we should *not* update the weights
  # of the discriminator)!
  with tf.GradientTape() as tape:
    predictions = discriminator(generator(random_latent_vectors))
    g_loss = loss_fn(misleading_labels, predictions)

  grads = tape.gradient(g_loss, generator.trainable_weights)
  g_optimizer.apply_gradients(zip(grads, generator.trainable_weights))
  return d_loss, g_loss, generated_images

Grab the data

# Prepare the dataset.
# We use both the training & test MNIST digits.
batch_size = 64
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
all_digits = np.concatenate([x_train, x_test])
all_digits = all_digits.astype("float32") / 255.0
all_digits = np.reshape(all_digits, (-1, 28, 28, 1))
dataset = tf.data.Dataset.from_tensor_slices(all_digits)
dataset = dataset.shuffle(buffer_size=1024).batch(batch_size)

# In practice you need at least 20 epochs to generate nice digits.
epochs = 1
save_dir = "./"

Train the GAN

%%time
for epoch in range(epochs):
  for step, real_images in enumerate(dataset):
    # Train the discriminator & generator on one batch of real images.
    d_loss, g_loss, generated_images = train_step(real_images)

    # Logging.
    if step % 200 == 0:
      # Print metrics
      print(f"Discriminator loss at step {step}: {d_loss:.2f}")
      print(f"Adversarial loss at step {step}: {g_loss:.2f}")
      break # Remove this if really training the GAN

Conditional GANs

Unconditional vs conditional generation

An analogy for unconditional vs conditional GANs

Hurricane example data

Original data

Hurricane example

Initial fakes

Hurricane example (after 54s)

Fakes after 1 iteration

Hurricane example (after 21m)

Fakes after 100 kimg

Hurricane example (after 47m)

Fakes after 200 kimg

Hurricane example (after 4h10m)

Fakes after 1000 kimg

Hurricane example (after 14h41m)

Fakes after 3700 kimg

Image-to-image translation

Example: Deoldify images #1

A deoldified version of the famous “Migrant Mother” photograph.

Example: Deoldify images #2

A deoldified Golden Gate Bridge under construction.

Example: Deoldify images #3

Explore the latent space

Generator can’t generate everything

Target

Projection

Problems with GANs

They are slow to train

StyleGAN2-ADA training times on V100s (1024x1024):

GPUs | 1000 kimg | 25000 kimg | sec/kimg | GPU mem | CPU mem
1    | 1d 20h    | 46d 03h    | 158      | 8.1 GB  | 5.3 GB
2    | 23h 09m   | 24d 02h    | 83       | 8.6 GB  | 11.9 GB
4    | 11h 36m   | 12d 02h    | 40       | 8.4 GB  | 21.9 GB
8    | 5h 54m    | 6d 03h     | 20       | 8.3 GB  | 44.7 GB

Uncertain convergence

Training converges to a Nash equilibrium, if it converges at all.

Analogy of minimax update failure.

Mode collapse

Example of mode collapse

Generation is harder

A schematic of a generative adversarial network.

# Separate optimisers for discriminator and generator.
d_optimizer = keras.optimizers.Adam(learning_rate=0.0003)
g_optimizer = keras.optimizers.Adam(learning_rate=0.0004)

Note that the generator is given a slightly higher learning rate (0.0004 versus 0.0003), reflecting that generating convincing data is the harder task.

Advanced image layers

Conv2D

GlobalMaxPool2D

Conv2DTranspose

Vanishing gradients (I)

When the discriminator is too good, vanishing gradients

Vanishing gradients (II)

Vanishing gradients

Wasserstein GAN

We’re comparing distributions

Trying to minimise the distance between the distribution of generated samples and the distribution of real data.

Training a vanilla GAN is equivalent to minimising the Jensen–Shannon divergence between the two.

An alternative distance between distributions is the Wasserstein distance.
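In symbols, writing \mathbb{P}_r for the distribution of real data and \mathbb{P}_g for the distribution of generated samples, the Wasserstein-1 distance and its Kantorovich–Rubinstein dual form are

W_1(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \| x - y \| \bigr] = \sup_{\| D \|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[ D(x) ] - \mathbb{E}_{y \sim \mathbb{P}_g}[ D(y) ],

where \Pi(\mathbb{P}_r, \mathbb{P}_g) is the set of joint distributions with these two marginals and the supremum runs over 1-Lipschitz functions D. It is this dual form that the critic below approximates.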

Discriminator → Critic

The critic D : \text{Input} \to \mathbb{R} scores how “authentic” the input looks; it does not classify inputs as simply real or fake.

Critic’s goal is

\max_{D \in \mathscr{D}} \mathbb{E}[ D(X) ] - \mathbb{E}[ D(G(Z)) ]

where \mathscr{D} is the space of 1-Lipschitz functions. To enforce this constraint, either clip the critic’s weights or penalise the critic when its gradient norm (evaluated at points interpolated between real and generated samples) is far from 1:

\max_{D} \mathbb{E}[ D(X) ] - \mathbb{E}[ D(G(Z)) ] - \lambda \mathbb{E} \Bigl[ ( \bigl|\bigl| \nabla D \bigr|\bigr| - 1)^2 \Bigr] .
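As a rough sketch of how this penalised critic objective could be implemented in the same style as the earlier train_step (not from the original notebook: it reuses generator, d_optimizer and latent_dim from above, treats the earlier discriminator network as the critic since its final Dense(1) layer already outputs an unbounded score, and evaluates the gradient penalty at random interpolations between real and generated images, as in WGAN-GP):

critic = discriminator  # the Dense(1) output is treated as a score, not a probability
gp_weight = 10.0        # lambda in the formula above

def critic_train_step(real_images):
    batch = tf.shape(real_images)[0]
    fake_images = generator(tf.random.normal(shape=(batch, latent_dim)))

    # Random interpolations between real and fake images.
    eps = tf.random.uniform(shape=(batch, 1, 1, 1))
    interpolated = eps * real_images + (1.0 - eps) * fake_images

    with tf.GradientTape() as tape:
        # Gradient penalty: push the critic's gradient norm towards 1.
        with tf.GradientTape() as gp_tape:
            gp_tape.watch(interpolated)
            interp_scores = critic(interpolated)
        grads = gp_tape.gradient(interp_scores, interpolated)
        grad_norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]))
        gradient_penalty = tf.reduce_mean((grad_norms - 1.0) ** 2)

        # Minimise the negative of the critic's objective above.
        c_loss = (tf.reduce_mean(critic(fake_images))
                  - tf.reduce_mean(critic(real_images))
                  + gp_weight * gradient_penalty)

    c_grads = tape.gradient(c_loss, critic.trainable_weights)
    d_optimizer.apply_gradients(zip(c_grads, critic.trainable_weights))
    return c_loss

On the generator side, the only change from the earlier train_step would be replacing the binary cross-entropy loss with -tf.reduce_mean(critic(generator(random_latent_vectors))).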

GANs with differential privacy

Generating synthetic user information with differential privacy and Wasserstein GANs.

Language Models

Generative deep learning

  • Using AI as augmented intelligence rather than artificial intelligence.
  • Use of deep learning to augment creative activities such as writing, music and art, to generate new things.
  • Some applications: text generation, deep dreaming, neural style transfer, variational autoencoders and generative adversarial networks.

Text generation

Generating sequential data is the closest computers get to dreaming.

  • Generate sequence data: Train a model to predict the next token or next few tokens in a sentence, using previous tokens as input.
  • A network that models the probability of the next tokens given the previous ones is called a language model.

GPT-3 is a 175-billion-parameter text-generation model trained by the startup OpenAI on a large text corpus of digitally available books, Wikipedia, and web-crawled text. GPT-3 made headlines in 2020 due to its capability to generate plausible-sounding paragraphs of text on virtually any topic.

Word-level language model

Diagram of a word-level language model.

A word-level language model first takes in the input text and produces a probability distribution over the next word, which tells us how likely each word in the vocabulary is to come next. The model then applies a sampling strategy to select the next word. The chosen word is appended to the input text, and the extended text is passed back into the model to predict the following word. In this way, text is generated one word at a time.
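A minimal sketch of this generation loop (the model, tokenizer, and their methods here are hypothetical, purely to illustrate the idea):

import numpy as np

def generate_text(model, tokenizer, seed_text, num_words=20):
    # Autoregressively extend seed_text one word at a time.
    text = seed_text
    for _ in range(num_words):
        tokens = tokenizer.encode(text)                      # hypothetical: text -> word indices
        probs = model.predict_next_word_probs(tokens)        # hypothetical: distribution over the vocabulary
        next_index = np.random.choice(len(probs), p=probs)   # stochastic sampling
        text += " " + tokenizer.decode(next_index)           # hypothetical: word index -> word
    return text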

Character-level language model

Diagram of a character-level language model (Char-RNN)

A character-level language model predicts the next character given the input characters so far. It captures patterns at a much more granular level and does not aim to capture the semantics of words.

Useful for speech recognition

RNN output                                    | Decoded transcription
what is the weather like in bostin right now | what is the weather like in boston right now
prime miniter nerenr modi                     | prime minister narendra modi
arther n tickets for the game                 | are there any tickets for the game

Figure 1: Examples of transcriptions directly from the RNN, with errors that are fixed by the addition of a language model.

The example above shows how the raw predictions of an RNN (processing sequential audio data) can be improved by using a language model to fix transcription errors.

Generating Shakespeare I

The following example shows how a language model trained on the works of Shakespeare continues the text after we provide a starting string. This is character-level prediction: at each step the model predicts the most likely next character, not the next word.

ROMEO:
Why, sir, what think you, sir?

AUTOLYCUS:
A dozen; shall I be deceased.
The enemy is parting with your general,
As bias should still combit them offend
That Montague is as devotions that did satisfied;
But not they are put your pleasure.

Generating Shakespeare II

DUKE OF YORK:
Peace, sing! do you must be all the law;
And overmuting Mercutio slain;
And stand betide that blows which wretched shame;
Which, I, that have been complaints me older hours.

LUCENTIO:
What, marry, may shame, the forish priest-lay estimest you, sir,
Whom I will purchase with green limits o’ the commons’ ears!

Generating Shakespeare III

ANTIGONUS:
To be by oath enjoin’d to this. Farewell!
The day frowns more and more: thou’rt like to have
A lullaby too rough: I never saw
The heavens so dim by day. A savage clamour!

[Exit, pursued by a bear]

Sampling strategy

The sampling strategy is how we pick the next word or character from the predicted distribution. Different sampling strategies offer different trade-offs between exploration and exploitation when generating text sequences.

Sampling strategy

  • Greedy sampling chooses the token with the highest probability. This makes the resulting sentence repetitive and predictable.
  • Stochastic sampling: if a word has probability 0.3 of being next in the sentence according to the model, we choose it 30% of the time (see the sketch after this list). The result is still not very interesting and remains quite predictable.
  • Use a softmax temperature to control the randomness: more randomness results in more surprising and creative sentences.
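For a concrete (made-up) next-word distribution, greedy and stochastic sampling differ only in how the index is chosen:

import numpy as np

# Hypothetical distribution over a five-word vocabulary for the next word.
vocab = ["the", "cat", "sat", "on", "mat"]
probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

greedy_choice = vocab[int(np.argmax(probs))]                       # always the most likely word
stochastic_choice = vocab[np.random.choice(len(vocab), p=probs)]   # word i chosen with probability probs[i]
print(greedy_choice, stochastic_choice)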

Softmax temperature

  • The softmax temperature is a parameter that controls the randomness of the next token.
  • The formula is: \text{softmax}_\text{temperature}(x)_i = \frac{\exp(x_i / \text{temperature})}{\sum_j \exp(x_j / \text{temperature})}
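A minimal sketch of temperature sampling, assuming we are given the model's raw scores (logits) for the next token:

import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # Rescale the logits, apply the softmax, then sample a token index.
    scaled = np.asarray(logits, dtype="float64") / temperature
    probs = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, 0.1]
print(sample_with_temperature(logits, temperature=0.2))  # nearly always index 0 (close to greedy)
print(sample_with_temperature(logits, temperature=2.0))  # much closer to uniformly random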

“I am a” …

The graphical illustration above shows how the distribution over the next word changes with the temperature. Higher temperatures give less predictable (and more interesting) outcomes. Beyond a certain point, however, the next word is effectively picked at random and the output stops being meaningful. Choosing the temperature is therefore a trade-off between predictability and creativity.

Generating Laub (temp = 0.01)

Here I have trained a neural network on the transcripts of my lecture recordings. Given the starting point “In today’s lecture we will”, I asked it to generate completions at varying temperatures. A temperature of 0.25 gives more interesting output than 0.01, and 0.5 gives more creative output than 0.25. However, as the temperature keeps increasing, the network starts producing meaningless sequences.

In today’s lecture we will be different situation. So, next one is what they rective that each commit to be able to learn some relationships from the course, and that is part of the image that it’s very clese and black problems that you’re trying to fit the neural network to do there instead of like a specific though shef series of layers mean about full of the chosen the baseline of car was in the right, but that’s an important facts and it’s a very small summary with very scrort by the beginning of the sentence.

Generating Laub (temp = 0.25)

In today’s lecture we will decreas before model that we that we have to think about it, this mightsks better, for chattely the same project, because you might use the test set because it’s to be picked up the things that I wanted to heard of things that I like that even real you and you’re using the same thing again now because we need to understand what it’s doing the same thing but instead of putting it in particular week, and we can say that’s a thing I mainly link it’s three columns.

Generating Laub (temp = 0.5)

In today’s lecture we will probably the adw n wait lots of ngobs teulagedation to calculate the gradient and then I’ll be less than one layer the next slide will br input over and over the threshow you ampaigey the one that we want to apply them quickly. So, here this is the screen here the main top kecw onct three thing to told them, and the output is a vertical variables and Marceparase of things that you’re moving the blurring and that just data set is to maybe kind of categorical variants here but there’s more efficiently not basically replace that with respect to the best and be the same thing.

Generating Laub (temp = 1)

In today’s lecture we will put it different shates to touch on last week, so I want to ask what are you object frod current. They don’t have any zero into it, things like that which mistakes. 10 claims that the average version was relden distever ditgs and Python for the whole term wo long right to really. The name of these two options. There are in that seems to be modified version. If you look at when you’re putting numbers into your, that that’s over. And I went backwards, up, if they’rina functional pricing working with.

Generating Laub (temp = 1.5)

In today’s lecture we will put it could be bedinnth. Lowerstoriage nruron. So rochain the everything that I just sGiming. If there was a large. It’s gonua draltionation. Tow many, up, would that black and 53% that’s girter thankAty will get you jast typically stickK thing. But maybe. Anyway, I’m going to work on this libry two, past, at shit citcs jast pleming to memorize overcamples like pre pysing, why wareed to smart a one in this reportbryeccuriay.

Copilot’s “Conversation Style”

This is (probably) just the ‘temperature’ knob under the hood.

Generate the most likely sequence

Beyond predicting just the next word or character, we often want to generate an entire sequence. The task is then to find the most likely sequence of tokens given the model’s predictions at each step; beam search is a standard way to approximate this.

An example sequence-to-sequence chatbot model.
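A minimal sketch of beam search, one way to search for a most-likely sequence; the next_token_log_probs function (returning log-probabilities over the vocabulary for the next token) is hypothetical:

def beam_search(next_token_log_probs, start_tokens, beam_width=3, max_len=10):
    # Each beam is a (token_sequence, cumulative_log_probability) pair.
    beams = [(list(start_tokens), 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = next_token_log_probs(seq)  # hypothetical model call
            for token, lp in enumerate(log_probs):
                candidates.append((seq + [token], score + lp))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # the best sequence found and its log-probability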

Transformers

Transformers are a type of neural network that has proven highly effective for NLP tasks. They can capture long-range dependencies in sequential data, which is useful for generating predictions with contextual meaning. They rely on the self-attention mechanism, which looks at all the inputs in a sequence together, learns the dependencies among them, and uses this information to predict the output sequence.

Transformer architecture

GPT makes use of a mechanism known as attention, which removes the need for recurrent layers (e.g., LSTMs). It works like an information retrieval system, utilizing queries, keys, and values to decide how much information it wants to extract from each input token.

Attention heads can be grouped together to form what is known as a multihead attention layer. These are then wrapped up inside a Transformer block, which includes layer normalization and skip connections around the attention layer. Transformer blocks can be stacked to create very deep neural networks.
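A rough sketch of the scaled dot-product attention computation at the heart of these layers (plain NumPy, a single head, no masking):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V have shape (seq_len, d_k); returns a weighted combination of the values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

# Toy example: a sequence of 4 tokens with 8-dimensional queries, keys and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)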

Highly recommended viewing: Iulia Turk (2021), Transfer learning and Transformer models, ML Tech Talks.

🤗 Transformers package

The following code uses the transformers library from Hugging Face to create a text-generation pipeline using GPT-2 (Generative Pre-trained Transformer 2).

import transformers
from transformers import pipeline

generator = pipeline(task="text-generation", model="gpt2", revision="6c0e608")

This imports the transformers library and the pipeline class, then creates a pipeline object named generator whose task is to generate text with the pre-trained GPT-2 model. The revision="6c0e608" argument pins the exact revision of the model to use.
Device set to use mps:0
transformers.set_seed(123)
print(generator("It's the holidays so I'm going to enjoy")[0]["generated_text"])

This sets the seed for reproducibility, then applies the generator to the prompt "It's the holidays so I'm going to enjoy". The generator returns a list of generated texts, so [0]["generated_text"] selects the first generated sequence.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
It's the holidays so I'm going to enjoy this," he says.

The three are all in the same boat. "We're so excited to see what the fans will be like," he says.

The first two days of the season are a big hit. The team still has to play without their top line and center Dwight Howard, but there's still a lot of action ahead.

"There's a lot of new players coming in and I think it's going to be a great tournament for everybody, but it's going to be a very busy time," says Howard.

The team is excited about the opportunity to see their new teammates play.

"It's going to be a great tournament for us but it's going to be a great experience for everyone," he says.

But there's no way around it.

"It's a tough game," says Howard. "We lose one of our best players and we lose one of our best players for the last five years."

We can run the same code with a different seed value, which gives a very different output.

transformers.set_seed(234)
print(generator("It's the holidays so I'm going to enjoy")[0]["generated_text"])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
It's the holidays so I'm going to enjoy this for a long time."

The full report is expected to be released to the public at the end of the year.

The report also says that the government is "working with private sector stakeholders" to develop a "new approach for addressing and combating cyber attacks on our networks".

In response to the report, Chief Secretary G.K. Chidambaram said the government would ensure that all the data it holds about the state is used in a way that is consistent with its values and priorities.

"We will take steps to ensure that this data is used in a way that is consistent with the values and priorities of the state. We will continue to provide the state with information that makes the state more secure, more efficient and more resilient. We will ensure that our people have access to the most sensitive data. We will ensure that our government provides effective services to the people of Bangladesh and to the world," he told reporters here.

"Today's report is an important step in enhancing our security. The first thing I want to tell you is the government is working with private sector stakeholders and it is important that we provide the system with the tools and the capacity that is necessary to protect our people. It is now in a position to

Reading the course profile

Another application of pipeline is question answering. The following example shows how a pre-trained model can answer questions by relating them to a body of text (the context).

context = """
StoryWall Formative Discussions: An initial StoryWall, worth 2%, is due by noon on June 3. The following StoryWalls are worth 4% each (taking the best 7 of 9) and are due at noon on the following dates:
The project will be submitted in stages: draft due at noon on July 1 (10%), recorded presentation due at noon on July 22 (15%), final report due at noon on August 1 (15%).

As a student at UNSW you are expected to display academic integrity in your work and interactions. Where a student breaches the UNSW Student Code with respect to academic integrity, the University may take disciplinary action under the Student Misconduct Procedure. To assure academic integrity, you may be required to demonstrate reasoning, research and the process of constructing work submitted for assessment.
To assist you in understanding what academic integrity means, and how to ensure that you do comply with the UNSW Student Code, it is strongly recommended that you complete the Working with Academic Integrity module before submitting your first assessment task. It is a free, online self-paced Moodle module that should take about one hour to complete.

StoryWall (30%)

The StoryWall format will be used for small weekly questions. Each week of questions will be released on a Monday, and most of them will be due the following Monday at midday (see assessment table for exact dates). Students will upload their responses to the question sets, and give comments on another student's submission. Each week will be worth 4%, and the grading is pass/fail, with the best 7 of 9 being counted. The first week's basic 'introduction' StoryWall post is counted separately and is worth 2%.

Project (40%)

Over the term, students will complete an individual project. There will be a selection of deep learning topics to choose from (this will be outlined during Week 1).

The deliverables for the project will include: a draft/progress report mid-way through the term, a presentation (recorded), a final report including a written summary of the project and the relevant Python code (Jupyter notebook).

Exam (30%)

The exam will test the concepts presented in the lectures. For example, students will be expected to: provide definitions for various deep learning terminology, suggest neural network designs to solve risk and actuarial problems, give advice to mock deep learning engineers whose projects have hit common roadblocks, find/explain common bugs in deep learning Python code.
"""

Question answering

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", revision="626af31")

This creates a question-answering pipeline object using the pre-trained DistilBERT model fine-tuned on SQuAD (the Stanford Question Answering Dataset), pinned to revision 626af31.
Device set to use mps:0
qa(question="What weight is the exam?", context=context)

This answers the question "What weight is the exam?" given the context specified above.
{'score': 0.5019664764404297, 'start': 2092, 'end': 2095, 'answer': '30%'}
qa(question="What topics are in the exam?", context=context)
{'score': 0.21276013553142548,
 'start': 1778,
 'end': 1791,
 'answer': 'deep learning'}
qa(question="When is the presentation due?", context=context)
{'score': 0.5296490788459778,
 'start': 1319,
 'end': 1335,
 'answer': 'Monday at midday'}
qa(question="How many StoryWall tasks are there?", context=context)
{'score': 0.21391083300113678, 'start': 1155, 'end': 1158, 'answer': '30%'}

ChatGPT is Transformer + RLHF

“… there is no official paper that describes how ChatGPT works in detail, but … we know that it uses a technique called reinforcement learning from human feedback (RLHF) to fine-tune the GPT-3.5 model. While ChatGPT still has many limitations (such as sometimes “hallucinating” factually incorrect information), it is a powerful example of how Transformers can be used to build generative models that can produce complex, long-ranging, and novel output that is often indistinguishable from human-generated text. The progress made thus far by models like ChatGPT serves as a testament to the potential of AI and its transformative impact on the world.”

Next Steps

Two new courses starting in 2026:

ACTL4306 “Quantitative Ethical AI for Risk & Actuarial Applications”

ACTL4307 “Generative AI for Actuaries”

Package Versions

from watermark import watermark
print(watermark(python=True, packages="keras,matplotlib,numpy,pandas,seaborn,scipy,torch,tensorflow,tf_keras"))
Python implementation: CPython
Python version       : 3.11.12
IPython version      : 9.3.0

keras     : 3.8.0
matplotlib: 3.10.0
numpy     : 1.26.4
pandas    : 2.2.2
seaborn   : 0.13.2
scipy     : 1.16.0
torch     : 2.6.0
tensorflow: 2.18.0
tf_keras  : 2.18.0

Glossary

  • beam search
  • bias
  • ChatGPT (& RLHF)
  • generative adversarial networks
  • greedy sampling
  • Hugging Face
  • language model
  • latent space
  • softmax temperature
  • stochastic sampling