Text Generation
Overview
Generate text with language models using the generate() method. Control output quality and style through generation strategies and parameters.
Basic Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Generate
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
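If a GPU is available, generation is much faster when the model and inputs live on it. A minimal sketch, assuming a CUDA-capable PyTorch install:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)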
Generation Strategies
Greedy Decoding
Select the highest-probability token at each step (deterministic):
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False # Greedy decoding (default)
)
Use for: factual text, translation, and other cases where deterministic output is needed.
Sampling
Randomly sample from probability distribution:
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95
)
Use for: Creative writing, diverse outputs, open-ended generation.
Beam Search
Explore multiple hypotheses in parallel:
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
early_stopping=True
)
Use for: translation, summarization, and other tasks where output quality is critical.
Contrastive Search
Balance quality and diversity:
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6,
top_k=4
)
Use for: Long-form generation, reducing repetition.
Key Parameters
Length Control
max_new_tokens: Maximum tokens to generate
max_new_tokens=100 # Generate up to 100 new tokens
max_length: Maximum total length (input + output)
max_length=512 # Total sequence length
min_new_tokens: Minimum tokens to generate
min_new_tokens=50 # Force at least 50 tokens
min_length: Minimum total length
min_length=100
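A minimal sketch combining these (if both max_new_tokens and max_length are set, recent versions of transformers warn and let max_new_tokens take precedence):
# Generate between 20 and 100 new tokens, regardless of prompt length
outputs = model.generate(
    **inputs,
    min_new_tokens=20,
    max_new_tokens=100
)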
Temperature
Controls randomness (only takes effect when do_sample=True):
temperature=1.0 # Default, balanced
temperature=0.7 # More focused, less random
temperature=1.5 # More creative, more random
Lower temperature → more deterministic; higher temperature → more random.
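Conceptually, the logits are divided by the temperature before the softmax. An illustrative sketch of the effect (not the library's internal code):
import torch
logits = torch.tensor([2.0, 1.0, 0.5])            # raw scores for three candidate tokens
for t in (0.7, 1.0, 1.5):
    probs = torch.softmax(logits / t, dim=-1)     # lower t sharpens, higher t flattens
    print(t, probs.tolist())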
Top-K Sampling
Consider only top K most likely tokens:
do_sample=True
top_k=50 # Sample from top 50 tokens
Common values: 40-100 for balanced output, 10-20 for focused output.
Top-P (Nucleus) Sampling
Sample only from the smallest set of tokens whose cumulative probability reaches P:
do_sample=True
top_p=0.95 # Sample from smallest set with 95% cumulative probability
Common values: 0.9-0.95 for balanced, 0.7-0.85 for focused.
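Under the hood, the token probabilities are sorted and the smallest prefix reaching P is kept. An illustrative sketch (not the library's exact implementation):
import torch
sorted_probs = torch.tensor([0.5, 0.3, 0.1, 0.06, 0.04])    # token probabilities, sorted descending
cumulative = torch.cumsum(sorted_probs, dim=-1)
# keep the smallest prefix whose cumulative probability reaches 0.95
cutoff = int(torch.searchsorted(cumulative, torch.tensor(0.95))) + 1
print(sorted_probs[:cutoff])                                  # tokens eligible for sampling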
Repetition Penalty
Discourage repetition:
repetition_penalty=1.2 # Penalize repeated tokens
Values: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
Beam Search Parameters
num_beams: Number of beams
num_beams=5 # Keep 5 hypotheses
early_stopping: Stop when num_beams sentences are finished
early_stopping=True
no_repeat_ngram_size: Prevent n-gram repetition
no_repeat_ngram_size=3 # Don't repeat any 3-gram
Output Control
num_return_sequences: Generate multiple outputs
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
num_return_sequences=3 # Return 3 different sequences
)
pad_token_id: Specify padding token
pad_token_id=tokenizer.eos_token_id
eos_token_id: Stop generation at specific token
eos_token_id=tokenizer.eos_token_id
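GPT-2 and many other causal LMs do not define a pad token, so a common pattern is to reuse the EOS token (a minimal sketch; this also silences the "pad_token_id not set" warning):
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)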
Advanced Features
Batch Generation
Generate for multiple prompts:
prompts = ["Hello, my name is", "Once upon a time"]
# GPT-2 has no pad token; reuse EOS and pad on the left for decoder-only generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
for i, output in enumerate(outputs):
text = tokenizer.decode(output, skip_special_tokens=True)
print(f"Prompt {i}: {text}\n")
Streaming Generation
Stream tokens as they are generated:
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(
inputs,
streamer=streamer,
max_new_tokens=100
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
print(text, end="", flush=True)
thread.join()
Constrained Generation
Force specific words to appear in the output (requires beam search):
# force_words_ids makes these words appear somewhere in the generated text
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]
outputs = model.generate(
**inputs,
force_words_ids=force_words_ids,
num_beams=5
)
Guidance and Control
Block specific words from being generated. BPE tokenizers may encode a word differently with a leading space, so you may need to include both variants:
bad_words = ["offensive", "inappropriate"]
bad_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]
outputs = model.generate(
**inputs,
bad_words_ids=bad_words_ids
)
Generation Config
Save and reuse generation parameters:
from transformers import GenerationConfig
# Create config
generation_config = GenerationConfig(
max_new_tokens=100,
temperature=0.7,
top_k=50,
top_p=0.95,
do_sample=True
)
# Save
generation_config.save_pretrained("./my_generation_config")
# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
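The config can also be attached to the model so that subsequent generate() calls use it by default (a minimal sketch; in recent transformers versions model.generation_config supplies the defaults):
# Make this config the model-wide default for generate()
model.generation_config = generation_config
outputs = model.generate(**inputs)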
Model-Specific Generation
Chat Models
Use chat templates:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
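The decoded string above includes the prompt, since decoder-only models echo their input. To keep only the assistant's reply, slice off the prompt tokens before decoding (a minimal sketch):
# Decode only the newly generated tokens
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)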
Encoder-Decoder Models
For T5, BART, etc.:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
Optimization
Caching
Enable KV cache for faster generation:
outputs = model.generate(
**inputs,
max_new_tokens=100,
use_cache=True # Default, faster generation
)
Static Cache
Pre-allocate the cache to a fixed maximum length (useful with torch.compile):
from transformers import StaticCache
cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")
outputs = model.generate(
**inputs,
max_new_tokens=100,
past_key_values=cache
)
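Recent transformers versions can also allocate the static cache internally via the cache_implementation argument (a hedged alternative; check that your installed version supports it):
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="static"  # let generate() build the static cache itself
)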
Attention Implementation
Use FlashAttention-2 for faster attention (requires the flash-attn package and a supported GPU):
import torch
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16,  # FlashAttention-2 requires fp16 or bf16 weights
    attn_implementation="flash_attention_2"
)
Generation Recipes
Creative Writing
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
top_k=50,
top_p=0.95,
repetition_penalty=1.2
)
Factual Generation
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=False, # Greedy
repetition_penalty=1.1
)
Diverse Outputs
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
num_return_sequences=5,
temperature=1.5,
do_sample=True
)
Long-Form Generation
outputs = model.generate(
**inputs,
max_new_tokens=1000,
penalty_alpha=0.6, # Contrastive search
top_k=4,
repetition_penalty=1.2
)
Translation/Summarization
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
early_stopping=True,
no_repeat_ngram_size=3
)
Common Issues
Repetitive output:
- Increase repetition_penalty (1.2-1.5)
- Use no_repeat_ngram_size (2-3)
- Try contrastive search
- Lower temperature

Poor quality:
- Use beam search (num_beams=5)
- Lower temperature
- Adjust top_k/top_p

Too deterministic:
- Enable sampling (do_sample=True)
- Increase temperature (0.7-1.0)
- Adjust top_k/top_p

Slow generation:
- Reduce batch size
- Enable use_cache=True
- Use Flash Attention
- Reduce max_new_tokens
Best Practices
- Start with defaults: Then tune based on output
- Use appropriate strategy: Greedy for factual, sampling for creative
- Set max_new_tokens: Avoid unnecessarily long generation
- Enable caching: For faster sequential generation
- Tune temperature: Most impactful parameter for sampling
- Use beam search carefully: Slower but higher quality
- Test different seeds: For reproducibility with sampling (see the sketch after this list)
- Monitor memory: Large beams use significant memory
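For reproducible sampling runs, fix the random seed before generating. A minimal sketch using transformers' set_seed helper:
from transformers import set_seed
set_seed(42)  # fixes the Python, NumPy, and PyTorch seeds
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)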