references/generation.md

Text Generation

Overview

Generate text with language models using the generate() method. Control output quality and style through generation strategies and parameters.

Basic Generation

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")

# Generate
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Generation Strategies

Greedy Decoding

Select the highest-probability token at each step (deterministic):

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False  # Greedy decoding (default)
)

Use for: Factual text, translation, and cases where determinism is needed.

Sampling

Sample randomly from the model's probability distribution:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)

Use for: Creative writing, diverse outputs, open-ended generation.

Beam Search

Explore multiple hypotheses in parallel and return the most probable candidate:

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True
)

Use for: Translation, summarization, and cases where quality is critical.

Contrastive Search

Balance quality against repetition (enabled by setting penalty_alpha):

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    penalty_alpha=0.6,
    top_k=4
)

Use for: Long-form generation, reducing repetition.

Key Parameters

Length Control

max_new_tokens: Maximum tokens to generate

max_new_tokens=100  # Generate up to 100 new tokens

max_length: Maximum total length (input + output)

max_length=512  # Total sequence length

min_new_tokens: Minimum tokens to generate

min_new_tokens=50  # Force at least 50 tokens

min_length: Minimum total length

min_length=100

Temperature

Controls randomness (only applies when do_sample=True):

temperature=1.0   # Default, balanced
temperature=0.7   # More focused, less random
temperature=1.5   # More creative, more random

Lower temperature → more deterministic; higher temperature → more random.
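
To see the effect, sample the same prompt at several temperatures (this reuses model, tokenizer, and inputs from the basic example above; outputs vary between runs):

for temp in (0.7, 1.0, 1.5):
    out = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=temp)
    print(f"temperature={temp}:", tokenizer.decode(out[0], skip_special_tokens=True))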

Top-K Sampling

Consider only top K most likely tokens:

do_sample=True
top_k=50  # Sample from top 50 tokens

Common values: 40-100 for balanced output, 10-20 for focused output.

Top-P (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability reaches P:

do_sample=True
top_p=0.95  # Sample from smallest set with 95% cumulative probability

Common values: 0.9-0.95 for balanced, 0.7-0.85 for focused.
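
Conceptually, nucleus filtering keeps the smallest set of tokens whose cumulative probability reaches top_p and masks everything else. A rough sketch of the filtering step for one step's logits (an illustration, not the library's internal code):

import torch

def nucleus_filter(logits, top_p):
    # sort tokens from most to least likely
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep tokens until the cumulative mass reaches top_p (the top token is always kept)
    keep = cumulative - sorted_probs < top_p
    filtered = torch.full_like(logits, float("-inf"))
    filtered[sorted_idx[keep]] = logits[sorted_idx[keep]]
    return filtered  # sample from softmax(filtered)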

Repetition Penalty

Discourage repetition:

repetition_penalty=1.2  # Penalize repeated tokens

Values: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
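
Mechanically (following the CTRL paper's formulation, which transformers implements), the penalty divides a previously seen token's logit when it is positive and multiplies it when negative, so the token loses probability either way. A rough sketch, where generated_ids holds the ids produced so far:

import torch

def apply_repetition_penalty(logits, generated_ids, penalty):
    scores = logits[generated_ids]
    # positive logits shrink, negative logits grow more negative
    logits[generated_ids] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits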

Beam Search Parameters

num_beams: Number of beams

num_beams=5  # Keep 5 hypotheses

early_stopping: Stop as soon as num_beams complete candidates are found

early_stopping=True

no_repeat_ngram_size: Prevent n-gram repetition

no_repeat_ngram_size=3  # Don't repeat any 3-gram

Output Control

num_return_sequences: Generate multiple outputs

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,
    num_return_sequences=3  # Return 3 different sequences
)

pad_token_id: Specify padding token

pad_token_id=tokenizer.eos_token_id

eos_token_id: Stop generation at specific token

eos_token_id=tokenizer.eos_token_id
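
In recent transformers versions, eos_token_id also accepts a list, stopping at whichever token appears first. For example, halting GPT-2 at a newline as well as its EOS token (the newline id is looked up, not hard-coded):

newline_id = tokenizer.encode("\n", add_special_tokens=False)[0]

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    eos_token_id=[tokenizer.eos_token_id, newline_id]
)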

Advanced Features

Batch Generation

Generate for multiple prompts:

# GPT-2 has no pad token, and decoder-only models need left padding for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Hello, my name is", "Once upon a time"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(**inputs, max_new_tokens=50)

for i, text in enumerate(tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"Prompt {i}: {text}\n")

Streaming Generation

Stream tokens as they are generated:

from transformers import TextIteratorStreamer
from threading import Thread

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

generation_kwargs = dict(
    inputs,  # unpacks input_ids and attention_mask as keyword arguments
    streamer=streamer,
    max_new_tokens=100
)

# run generation in a background thread; consume tokens on the main thread
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for text in streamer:
    print(text, end="", flush=True)

thread.join()
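
If you only need console output, TextStreamer skips the thread entirely and prints each chunk to stdout as it arrives:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_special_tokens=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=100)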

Constrained Generation

Force specific token sequences:

# Force these words to appear somewhere in the output (requires beam search)
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]

outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=5
)
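
An alternative interface for the same idea is the constraints argument with PhrasalConstraint, which forces an exact token sequence to appear in the output (beam search is again required):

from transformers import PhrasalConstraint

constraints = [
    PhrasalConstraint(tokenizer("Paris", add_special_tokens=False).input_ids)
]

outputs = model.generate(
    **inputs,
    constraints=constraints,
    num_beams=5
)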

Guidance and Control

Prevent bad words:

bad_words = ["offensive", "inappropriate"]
# include leading-space variants: GPT-2 tokenizes " word" differently from "word"
bad_words_ids = [
    tokenizer.encode(w, add_special_tokens=False)
    for word in bad_words
    for w in (word, " " + word)
]

outputs = model.generate(
    **inputs,
    bad_words_ids=bad_words_ids
)

Generation Config

Save and reuse generation parameters:

from transformers import GenerationConfig

# Create config
generation_config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

# Save
generation_config.save_pretrained("./my_generation_config")

# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
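
A config can also be attached to the model so that every generate() call picks it up by default:

model.generation_config = generation_config
outputs = model.generate(**inputs)  # uses the attached defaults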

Model-Specific Generation

Chat Models

Use chat templates:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
# decode only the new tokens, skipping the echoed prompt
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

Encoder-Decoder Models

For T5, BART, etc.:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

Optimization

Caching

Enable KV cache for faster generation:

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True  # Default, faster generation
)

Static Cache

Pre-allocate the cache to a fixed maximum size (useful for fixed-length workloads and torch.compile):

from transformers import StaticCache

cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    past_key_values=cache
)
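
In recent transformers versions, generate() can build the static cache itself via cache_implementation, avoiding the manual StaticCache construction:

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="static"
)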

Attention Implementation

Use Flash Attention for speed:

import torch

model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16 or bf16
    attn_implementation="flash_attention_2"
)

Generation Recipes

Creative Writing

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2
)

Factual Generation

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False,  # Greedy
    repetition_penalty=1.1
)

Diverse Outputs

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    num_return_sequences=5,
    temperature=1.5,
    do_sample=True
)
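
A related option is diverse beam search, which splits the beams into groups and penalizes overlap between them (do_sample must stay False here):

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=6,
    num_beam_groups=3,      # 3 groups of 2 beams each
    diversity_penalty=1.0,  # penalize tokens already chosen by other groups
    num_return_sequences=3
)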

Long-Form Generation

outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
    penalty_alpha=0.6,  # Contrastive search
    top_k=4,
    repetition_penalty=1.2
)

Translation/Summarization

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=3
)

Common Issues

Repetitive output:
- Increase repetition_penalty (1.2-1.5)
- Use no_repeat_ngram_size (2-3)
- Try contrastive search
- Lower temperature

Poor quality:
- Use beam search (num_beams=5)
- Lower temperature
- Adjust top_k/top_p

Too deterministic:
- Enable sampling (do_sample=True)
- Increase temperature (0.7-1.0)
- Adjust top_k/top_p

Slow generation:
- Reduce batch size
- Enable use_cache=True
- Use Flash Attention
- Reduce max_new_tokens

Best Practices

  1. Start with defaults: Then tune based on output
  2. Use appropriate strategy: Greedy for factual, sampling for creative
  3. Set max_new_tokens: Avoid unnecessarily long generation
  4. Enable caching: For faster sequential generation
  5. Tune temperature: Most impactful parameter for sampling
  6. Use beam search carefully: Slower but higher quality
  7. Test different seeds: For reproducibility with sampling (see the snippet after this list)
  8. Monitor memory: Large beams use significant memory
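
For item 7, transformers ships a helper that seeds Python, NumPy, and PyTorch in one call, making sampling runs repeatable:

from transformers import set_seed

set_seed(42)  # identical sampling results from here on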