How Transformers Generate Text

Decoding methods for language generation with Transformers


Basic math: auto-regressive language generation

Auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions:

$$P\left(w_{1: T} \mid W_{0}\right)=\prod_{t=1}^{T} P\left(w_{t} \mid w_{1: t-1}, W_{0}\right)$$

  • $W_0$: the initial context word sequence
  • $T$: the length of the word sequence
    • determined on-the-fly: generation stops once the EOS token is generated.

Decoding methods

Currently, the most prominent decoding methods are greedy search, beam search, Top-K sampling, and Top-p sampling.
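
All code snippets below assume that a model, a tokenizer, and an encoded prompt already exist. A minimal setup, assuming GPT-2 loaded through the Hugging Face transformers library (the TensorFlow variant, to match the snippets below), might look like this; the prompt is just an illustrative choice:

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# load tokenizer and model (the GPT-2 checkpoint is an assumption; any causal LM works)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# pad with the EOS token so open-ended generation has a defined pad token
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode the context the generation is conditioned on ($W_0$)
input_ids = tokenizer.encode("I enjoy walking with my cute dog", return_tensors="tf")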

Greedy search

Definition

Greedy search selects the word $w$ with the highest probability as its next word, given all previous words $w_{1:t-1}$:

$$w_t = \operatorname{argmax}_{w} P(w \mid w_{1:t-1})$$

Example

Starting from the word "The", the algorithm greedily chooses the next word with the highest probability, "nice", and so on, so that the final generated word sequence is ("The", "nice", "woman") with an overall probability of $0.5 \times 0.4 = 0.2$.
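
In code, greedy search is generate's default decoding mode (no sampling, a single beam); a minimal call, assuming the setup above:

# greedy search is the default: do_sample=False, num_beams=1
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))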

Feature

The major drawback is:

  • Greedy search may miss the sequence with the overall highest probability.
    • For example, the high-probability word "has" can be hidden behind the low-probability word "dog", so greedy search never reaches it.
  • Beam search alleviates this problem.

Beam search

Definition

Beam search keeps the num_beams most likely sequences at each time step and eventually chooses the sequence with the overall highest probability.

Example

  • With num_beams = 2, besides the most likely hypothesis ("The", "nice"), beam search also keeps track of the second most likely one ("The", "dog"), and then finds that ("The", "dog", "has") has a higher overall probability ($0.4 \times 0.9 = 0.36$) than ("The", "nice", "woman") ($0.2$).

Feature

  • Beam search will always find an output sequence with higher probability than greedy search,
  • but it is not guaranteed to find the most likely output.

Code


# activate beam search: set num_beams > 1
# early_stopping: generation is finished when all beam hypotheses reach the EOS token.

beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Features:

  • n-gram penalty
    • no_repeat_ngram_size=2 ensures that no 2-gram appears twice.
  • multiple returned sequences
    • num_return_sequences: the number of highest-scoring beams to return
    • num_return_sequences <= num_beams
  • the returned beams are often only marginally different from each other.
    • to increase diversity, try increasing the number of beams (see the sketch below).
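
A sketch combining both options, in the same style as the beam-search call above:

# set no_repeat_ngram_size=2 and return the 5 highest-scoring beams
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))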

Why beam search might not be good in open-ended generation

  • Beam search works well when:

    • the length of the desired generation is more or less predictable, as in machine translation or summarization.
  • Beam search does not work well when:

    • the output length can vary greatly, e.g. in dialog and story generation;
    • in open-ended generation, beam search heavily suffers from repetitive output;
    • high-probability continuations are “less surprising” than real human answers.

Sampling

Sampling means randomly picking the next word $w_t$ according to its conditional probability distribution (conditioned on the previous words $w_{1:t-1}$):

$$ w_{t} \sim P\left(w \mid w_{1: t-1}\right) $$
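
A minimal call for plain sampling, assuming the setup above; top_k=0 deactivates Top-K filtering (introduced below) so that we sample from the full distribution:

# set seed to reproduce results
tf.random.set_seed(0)

# activate sampling and deactivate Top-K filtering by setting top_k to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))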

Issue with sampling: coherence

The first problem when sampling word sequences: the models often generate incoherent text.

A trick is to make the conditional distribution sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words):

  • lower the so-called temperature of the softmax; as the temperature approaches 0, sampling becomes equivalent to greedy decoding.
  • a higher temperature means more randomness.
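
A minimal sketch of what the temperature does to the next-word distribution (illustrative only, not the library's internals):

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # divide logits by the temperature before the softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exps / exps.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, temperature=1.0))  # baseline distribution
print(softmax_with_temperature(logits, temperature=0.7))  # sharper: favors high-probability words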

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Top-K Sampling

In Top-K sampling, the K most likely next words are kept and the probability mass is redistributed among only those K words.
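
A minimal sketch of the Top-K filtering step (illustrative only, not the library's implementation):

import numpy as np

def top_k_filter(probs, k=6):
    # keep only the k most likely words and renormalize their probabilities
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    return filtered / filtered.sum()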

An example

Set K = 6: we limit our sampling pool to the 6 most likely words.


# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Good and Bad

Good of Top-K sampling:

  • can work well in story generation: it filters out implausible words.

Bad:

  • a fixed pool size K cannot adapt to the shape of the next-word distribution:
    • for sharp distributions, a fixed K may still admit very unlikely words, risking gibberish.
    • for flat distributions, a fixed K limits the model’s creativity.

Top-p (nucleus) sampling — revised top-k sampling

Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.

  • instead of a fixed K, you set a probability threshold p, so the size of the sampling pool adapts to the next-word distribution.
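
A minimal sketch of the nucleus filtering step (illustrative only, not the library's implementation):

import numpy as np

def top_p_filter(probs, p=0.92):
    # sort words by probability, descending
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    # smallest prefix whose cumulative probability exceeds p
    cutoff = np.searchsorted(cum, p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()  # renormalize over the nucleus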

An example

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Top-K + Top-P

Top-p can also be used in combination with Top-K, which can:

  • avoid very low-ranked words,
  • while allowing for some dynamic selection.

# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
# to get multiple independently sampled outputs, we can again set the parameter num_return_sequences > 1:
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))



