Generation Control: Mastering AI Output for Better Results
In today’s fast-moving world of AI and large language models (LLMs), I’ve learned that one of the most valuable skills is not just understanding what these models can do but knowing how to guide them effectively. As I’ve spent time building applications, conducting research, and experimenting with different prompts, I’ve realized that real progress comes from learning how to control the generation process.
In this blog, I want to share seven generation control techniques that have made a real difference in how I work with AI and that every practitioner, researcher, or enthusiast can benefit from.
- Temperature
- Top-p/Top-k Sampling
- Prompt Engineering Techniques
- Few-shot Learning
- In-context Learning
- Chain-of-Thought Prompting
- Hallucination Prevention
1. Temperature
Understanding Temperature
Temperature is perhaps the most fundamental parameter for controlling AI generation. It controls the randomness of the model’s output by scaling the probability distribution over possible tokens.
How Temperature Works
Behind the scenes, language models output logits, which are unnormalized log probabilities, for each possible next token. Temperature rescales these logits before the softmax turns them into a probability distribution:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
Where:
- z_i is the logit for token i
- T is the temperature parameter
- p_i is the final probability of selecting token i
What’s Really Happening?
Think of temperature as a “confidence dial”:
- Low Temperature (T < 1): Sharpens the distribution, making high-probability tokens even more dominant
- Temperature = 1: Uses the model’s natural probability distribution
- High Temperature (T > 1): Flattens the distribution, giving more chance to unlikely tokens
- Temperature → 0: Becomes deterministic (always picks the most likely token)
- Temperature → ∞: Approaches uniform randomness
The Sampling Algorithm
Here’s what happens under the hood:
import numpy as np

def temperature_sample(logits, temperature=1.0):
    # Step 1: Scale logits by temperature
    scaled_logits = logits / temperature

    # Step 2: Apply softmax (with numerical stability)
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probs = exp_logits / np.sum(exp_logits)

    # Step 3: Sample from the distribution
    next_token = np.random.choice(len(probs), p=probs)
    return next_token
The numerical stability trick (subtracting max before exp) prevents overflow when dealing with large logit values.
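To see the "confidence dial" in action, here is a quick sanity check that calls temperature_sample repeatedly on a small, made-up set of logits and counts which token indices come back (exact counts will differ from run to run):

from collections import Counter
import numpy as np

toy_logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up logits for four candidate tokens

for t in (0.2, 1.0, 2.0):
    counts = Counter(int(temperature_sample(toy_logits, temperature=t)) for _ in range(1000))
    print(f"T={t}: {dict(sorted(counts.items()))}")

# Roughly: at T=0.2 token 0 wins almost every time; at T=1.0 the split is
# about 61/22/14/3 percent; at T=2.0 the distribution is noticeably flatter.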

Practical Examples
1) Low Temperature (0.1–0.3)
Perfect for tasks requiring consistency and precision:
import openai

# Example with low temperature
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.1
)
# Output: "The capital of France is Paris."
Use cases:
- Factual question answering
- Code generation
- Mathematical calculations
- Data extraction
- Classification tasks
The model becomes highly deterministic, consistently choosing the most probable tokens.
2) High Temperature (0.7–1.0+)
Unleashes creativity and diverse outputs:
# Example with high temperature
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Describe a sunset"}],
    temperature=0.9
)
# Output might vary each time:
# "The crimson orb melted into the horizon..."
# "Golden light spilled across the darkening sky..."
# "Fire painted the clouds as day surrendered to night..."
Use cases:
- Creative writing
- Brainstorming sessions
- Poetry and artistic content
- Marketing copy variations
- Story generation
Each run produces notably different outputs as the model explores less probable but potentially more interesting token choices.
2. Top-k and Top-p Sampling
Overview
While temperature scales the entire probability distribution, top-p and top-k are truncation methods that eliminate low-probability tokens before sampling. They provide different ways to control output quality and diversity.
Top-k Sampling
Top-k sampling keeps only the k most probable tokens and redistributes their probability mass.
How it works
- Get probability distribution: P = softmax(logits / temperature)
- Sort tokens by probability: P_sorted
- Keep only top-k tokens, set others to 0
- Renormalize: P’_i = P_i / Σ(top-k probabilities)
- Sample from P’
import torch
import torch.nn.functional as F

def top_k_sampling(logits, k=50, temperature=1.0):
    """
    Top-k sampling implementation

    Args:
        logits: [vocab_size] tensor of unnormalized scores
        k: number of top tokens to keep
        temperature: temperature scaling factor

    Returns:
        sampled token index
    """
    # Step 1: Apply temperature
    logits = logits / temperature

    # Step 2: Get top-k logits and their indices
    top_k_logits, top_k_indices = torch.topk(logits, k)

    # Step 3: Apply softmax to top-k logits only
    top_k_probs = F.softmax(top_k_logits, dim=-1)

    # Step 4: Sample from top-k distribution
    sampled_index = torch.multinomial(top_k_probs, num_samples=1)

    # Step 5: Map back to original vocabulary index
    token = top_k_indices[sampled_index]
    return token
Example
Let's say we have a vocabulary of 8 tokens:
tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
logits = [5.0, 4.5, 3.2, 2.8, 1.5, 0.8, 0.3, -0.5]
# After softmax (temperature = 1.0)
probs = [0.515, 0.312, 0.085, 0.057, 0.016, 0.008, 0.005, 0.002]
With top-k = 3:
# Step 1: Keep only the top-3 tokens
top_k_tokens = ['the', 'a', 'is']
top_k_probs = [0.515, 0.312, 0.085]
# Step 2: Renormalize so the kept probabilities sum to 1
renormalized = [0.565, 0.342, 0.093]
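As a quick check, we can feed these toy logits through the top_k_sampling function above; with k = 3, only 'the', 'a', or 'is' can ever be returned (the exact token varies because it samples):

import torch

tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
logits = torch.tensor([5.0, 4.5, 3.2, 2.8, 1.5, 0.8, 0.3, -0.5])

token_id = top_k_sampling(logits, k=3, temperature=1.0)
print(tokens[token_id.item()])  # always one of 'the', 'a', 'is'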
Top-p (Nucleus Sampling)
Top-p (also called nucleus sampling) keeps the smallest set of tokens whose cumulative probability ≥ p.
How it works
- Get probability distribution: P = softmax(logits / temperature)
- Sort tokens by probability (descending)
- Calculate cumulative sum: CDF_i = Σ P_j for j ≤ i
- Find the nucleus N: the smallest set of top tokens whose cumulative probability reaches p
- Renormalize and sample from N
def top_p_sampling(logits, p=0.9, temperature=1.0):
    """
    Top-p (nucleus) sampling implementation

    Args:
        logits: [vocab_size] tensor of unnormalized scores
        p: cumulative probability threshold (0 < p ≤ 1)
        temperature: temperature scaling factor

    Returns:
        sampled token index
    """
    # Step 1: Apply temperature and softmax
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)

    # Step 2: Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Step 3: Calculate cumulative probabilities
    cumsum_probs = torch.cumsum(sorted_probs, dim=-1)

    # Step 4: Find the nucleus (tokens to keep)
    # Remove tokens where cumsum > p (but keep the first token that exceeds p)
    sorted_indices_to_remove = cumsum_probs > p
    # Shift right so the first token that crosses p stays in the nucleus
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False

    # Step 5: Set removed token probabilities to 0
    sorted_probs[sorted_indices_to_remove] = 0.0

    # Step 6: Renormalize
    sorted_probs = sorted_probs / sorted_probs.sum()

    # Step 7: Sample from the nucleus
    sampled_sorted_index = torch.multinomial(sorted_probs, num_samples=1)

    # Step 8: Map back to original vocabulary
    token = sorted_indices[sampled_sorted_index]
    return token
Same Example
tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
probs = [0.515, 0.312, 0.085, 0.057, 0.016, 0.008, 0.005, 0.002]
# Step 1: Sort by probability (already sorted)
# Step 2: Calculate cumulative sum
cumulative = [0.515, 0.827, 0.912, 0.969, 0.985, 0.993, 0.998, 1.000]
With top-p = 0.9:
# cumulative[2] = 0.912 ≥ 0.9 ← Stop here!
# Nucleus = ['the', 'a', 'is']
With top-p = 0.75:
# cumulative[1] = 0.827 ≥ 0.75 ← Stop here!
# Nucleus = ['the', 'a']
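And the same check with top_p_sampling, reusing the tokens list and logits tensor from the top-k example: with p = 0.75 only 'the' or 'a' should ever come back.

token_id = top_p_sampling(logits, p=0.75, temperature=1.0)
print(tokens[token_id.item()])  # always 'the' or 'a'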
Visual Comparison
Top-k = 4 (Fixed):
███████████████ the (40%) ← Keep
██████████ a (25%) ← Keep
████ is (10%) ← Keep
███ very (8%) ← Keep
-- (7%) ← Discard (not in top-4)
-- (5%) ← Discard
-- (3%) ← Discard
-- (2%) ← Discard
Top-p = 0.9 (Adaptive):
███████████████ the (40%) ← Keep
██████████ a (25%) ← Keep
████ is (10%) ← Keep
███ very (8%) ← Keep
-- (7%) ← Keep (needed to reach the 90% cumulative threshold)
-- (5%) ← Discard (90% already covered)
-- (3%) ← Discard
-- (2%) ← Discard
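In practice you rarely implement these samplers yourself; most libraries expose them as generation parameters. For instance, a minimal sketch with Hugging Face transformers, which lets you combine temperature, top-k, and top-p in a single generate call (gpt2 is used here purely as a small stand-in model):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    max_new_tokens=30,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))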
3. Prompt Engineering Techniques
Effective prompts are the foundation of controlled generation. The way you structure your prompts directly impacts the quality and relevance of outputs.
Clear Instructions
Bad: "Tell me about dogs"
Good: "Write a 200-word informative paragraph about dog training techniques for puppies, focusing on positive reinforcement methods."
Role-Based Prompting
Prompt: "You are an expert data scientist with 10 years of experience.
Explain gradient descent in simple terms for a beginner."
Format Specification
Prompt: "List the top 5 programming languages for beginners.
Format your response as:
1. [Language]: [Brief description]
2. [Language]: [Brief description]
..."
Constraint Setting
Prompt: "Write a product review for a smartphone. Requirements:
- Exactly 150 words
- Include both pros and cons
- Mention battery life, camera, and performance
- Use a neutral tone"
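Putting role-based prompting, format specification, and constraint setting together in code, here is a minimal sketch using the same (legacy) openai client interface as the earlier examples; the prompt strings are placeholders you would swap for your own role, format, and constraints:

system_msg = "You are an expert data scientist with 10 years of experience."
user_msg = (
    "Explain gradient descent in simple terms for a beginner.\n"
    "Format your response as:\n"
    "1. [Concept]: [One-sentence explanation]\n"
    "2. [Concept]: [One-sentence explanation]\n"
    "Keep the whole answer under 150 words and use a friendly tone."
)

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0.3,  # low temperature keeps the output close to the requested format
)
print(response["choices"][0]["message"]["content"])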
4. Few-shot Learning
Few-shot learning involves providing examples within your prompt to guide the model’s behavior. This technique is incredibly powerful for establishing patterns and desired output formats.
Example: Sentiment Classification
Prompt: "Classify the sentiment of these reviews:
Review: 'This product exceeded my expectations!'
Sentiment: Positive
Review: 'Terrible quality, waste of money.'
Sentiment: Negative
Review: 'It's okay, nothing special.'
Sentiment: Neutral
Review: 'I love this new feature update!'
Sentiment: ?"
Example: Code Generation
Prompt: "Convert natural language to Python functions:
Input: 'Create a function that adds two numbers'
Output:
def add_numbers(a, b):
    return a + b
Input: 'Create a function that finds the maximum in a list'
Output:
def find_maximum(numbers):
    return max(numbers)
Input: 'Create a function that reverses a string'
Output: ?"
Benefits of Few-shot Learning:
- Establishes clear patterns
- Reduces ambiguity
- Improves consistency across outputs
- Minimizes need for fine-tuning
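When the examples live in your own data rather than in a hand-written prompt, it helps to assemble the few-shot block programmatically. A small sketch (the labeled pairs below are just the sentiment examples from above; any pairs from your domain work):

# Labeled examples that show the model the pattern we want it to follow
examples = [
    ("This product exceeded my expectations!", "Positive"),
    ("Terrible quality, waste of money.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
]

def build_few_shot_prompt(examples, new_review):
    lines = ["Classify the sentiment of these reviews:", ""]
    for review, label in examples:
        lines.append(f"Review: '{review}'")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: '{new_review}'")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "I love this new feature update!")
# Send `prompt` as the user message with a low temperature for consistent labels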
5. In-context Learning
In-context learning leverages the model’s ability to understand and apply new information provided within the conversation context, without updating the model’s parameters.
Dynamic Adaptation Example
Prompt: "I'm working with a specific dataset format:
{
'customer_id': 12345,
'purchase_date': '2024-01-15',
'items': ['laptop', 'mouse'],
'total': 899.99
}
Based on this format, generate 3 sample customer records for an electronics store."
Context-Aware Responses
Conversation Context:
User: "I'm building a React application for a food delivery service."
AI: "Great! What specific functionality are you looking to implement?"
User: "I need help with the cart component."
AI: [Provides React-specific cart component code tailored to food delivery]
Best Practices for In-context Learning:
- Provide clear, relevant context early in the conversation
- Reference previous context when building on discussions
- Use specific examples from your domain
- Maintain consistency with established patterns
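In chat APIs, in-context learning mostly comes down to sending the full message history with every request so the model can build on what was already established. A minimal sketch, again assuming the legacy openai client used earlier:

messages = [
    {"role": "user", "content": "I'm building a React application for a food delivery service."},
    {"role": "assistant", "content": "Great! What specific functionality are you looking to implement?"},
    {"role": "user", "content": "I need help with the cart component."},
]

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=messages,  # the full history is what makes the reply context-aware
    temperature=0.4,
)

# Keep appending turns so later answers stay consistent with earlier context
messages.append({"role": "assistant",
                 "content": response["choices"][0]["message"]["content"]})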
6. Chain-of-Thought Prompting
Chain-of-Thought (CoT) prompting encourages the model to show its reasoning process, leading to more accurate and explainable outputs.
Basic Chain-of-Thought
Prompt: "Solve this step by step:
A store has 24 apples. They sell 8 apples in the morning and 6 apples in the afternoon. How many apples are left?
Let me work through this step by step:
1) Starting apples: 24
2) Sold in morning: 8
3) Sold in afternoon: 6
4) Total sold: 8 + 6 = 14
5) Remaining: 24 - 14 = 10
Therefore, 10 apples are left."
Zero-Shot Chain-of-Thought
Prompt: "A company's revenue increased by 20% in Q1 and decreased by 10% in Q2. If they started with $100,000, what's their revenue at the end of Q2? Let's think step by step."
Complex Reasoning Example
Prompt: "Analyze whether this business model is sustainable:
Business: Subscription-based meal delivery service
- Monthly fee: $50
- Food cost per meal: $8
- Delivery cost per meal: $3
- 20 meals per month per subscriber
Let's break this down step by step:"
When to Use Chain-of-Thought:
- Mathematical calculations
- Logic problems
- Decision-making scenarios
- Complex analysis tasks
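Zero-shot chain-of-thought is easy to apply programmatically: append the trigger phrase to the question and keep the temperature low so the reasoning stays focused. A sketch with the revenue example from above (again assuming the legacy openai client):

question = (
    "A company's revenue increased by 20% in Q1 and decreased by 10% in Q2. "
    "If they started with $100,000, what's their revenue at the end of Q2?"
)

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
    temperature=0.2,  # low temperature keeps the reasoning chain focused
)
print(response["choices"][0]["message"]["content"])
# Expected shape of the answer: $100,000 -> $120,000 after Q1 -> $108,000 after Q2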
7. Hallucination Prevention
Hallucinations, cases where AI models generate false or nonsensical information, are a significant challenge. Here are strategies to minimize them:
Grounding Techniques
Prompt: "Based ONLY on the following text, answer the question:
Text: [Insert specific source material]
Question: [Your question]
If the answer cannot be found in the provided text, respond with 'Information not available in the source.'"
Confidence Indicators
Prompt: "Answer the following question and indicate your confidence level (High/Medium/Low):
Question: What is the population of Tokyo in 2024?
Answer: [Response]
Confidence: [Level]
Reasoning: [Why this confidence level]"
Fact-Checking Prompts
Prompt: "Claim: 'Python was created in 1995 by Guido van Rossum'
Please verify this claim step by step:
1. Check the creation year
2. Verify the creator
3. Provide the correct information if any part is wrong
4. Rate the accuracy: Correct/Partially Correct/Incorrect"
Source Citation Requirements
Prompt: "Write a summary about renewable energy trends. For each major claim, indicate what type of source would be needed to verify it (e.g., 'government report', 'academic study', 'industry survey')."
Hallucination Prevention Best Practices:
- Request sources and citations
- Use specific, factual prompts
- Ask for confidence levels
- Provide authoritative source material when possible
(You can also use retrieval-augmented generation, RAG, for this 😃)
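To make grounding repeatable, you can wrap the pattern above in a small helper that always injects the source text and the fallback instruction. A sketch (the example source text is made up; in a RAG setup it would come from your retriever):

def grounded_question(source_text, question):
    return (
        "Based ONLY on the following text, answer the question.\n\n"
        f"Text: {source_text}\n\n"
        f"Question: {question}\n\n"
        "If the answer cannot be found in the provided text, respond with "
        "'Information not available in the source.'"
    )

prompt = grounded_question(
    source_text="The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    question="When was the Eiffel Tower completed?",
)
# Send `prompt` with a low temperature; hallucination risk drops because the model
# is told to refuse rather than guess when the source doesn't contain the answer.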
Combining Techniques for Maximum Control
The real power comes from combining these techniques strategically:
Example: Research Assistant
Prompt: "You are a research assistant helping with academic writing.
Temperature: 0.3 (for accuracy)
Task: Summarize the key findings about machine learning bias from the following paper excerpt.
Follow this format:
1. Main Finding: [One sentence]
2. Supporting Evidence: [Key statistics or examples]
3. Implications: [What this means for practitioners]
4. Confidence: [High/Medium/Low based on source quality]
Paper Excerpt: [Insert text]
Think through this step by step, and only include information directly supported by the text."
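Translated into an actual API call, the same research-assistant setup might look like the sketch below: the role goes in a system message, the temperature becomes a real parameter rather than prompt text, and the format, grounding, and step-by-step instructions stay in the user message (paper_excerpt is a placeholder for your own source).

paper_excerpt = "..."  # placeholder: paste the excerpt you want summarized

response = openai.ChatCompletion.create(
    model="gpt-4o",
    temperature=0.3,  # favour accuracy over creativity
    messages=[
        {"role": "system",
         "content": "You are a research assistant helping with academic writing."},
        {"role": "user", "content": (
            "Summarize the key findings about machine learning bias from the following paper excerpt.\n"
            "Follow this format:\n"
            "1. Main Finding: [One sentence]\n"
            "2. Supporting Evidence: [Key statistics or examples]\n"
            "3. Implications: [What this means for practitioners]\n"
            "4. Confidence: [High/Medium/Low based on source quality]\n\n"
            f"Paper Excerpt: {paper_excerpt}\n\n"
            "Think through this step by step, and only include information "
            "directly supported by the text."
        )},
    ],
)
print(response["choices"][0]["message"]["content"])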
Conclusion
Mastering generation control is essential for anyone working with AI models. By understanding and applying these seven techniques (temperature, top-k and top-p sampling, prompt engineering, few-shot learning, in-context learning, chain-of-thought prompting, and hallucination prevention), you can dramatically improve the quality, reliability, and usefulness of AI-generated content.
Thank you for reading! 🤗 I hope you found this article both informative and enjoyable to read. (Have you built any async agent applications lately? Comment below; I'd love to hear about them 🙂)
For more information like this, follow me on LinkedIn.