Generation Control: Mastering AI Output for Better Results
In today’s fast-moving world of AI and large language models (LLMs), I’ve learned that one of the most valuable skills is not just understanding what these models can do but knowing how to guide them effectively. As I’ve spent time building applications, conducting research, and experimenting with different prompts, I’ve realized that real progress comes from learning how to control the generation process.
In this blog, I want to share seven generation control techniques that have made a real difference in how I work with AI and that every practitioner, researcher, or enthusiast can benefit from.
- Temperature
- Top-p/Top-k Sampling
- Prompt Engineering Techniques
- Few-shot Learning
- In-context Learning
- Chain-of-Thought Prompting
- Hallucination Prevention
1. Temperature
Understanding Temperature
Temperature is perhaps the most fundamental parameter for controlling AI generation. It controls the randomness of the model’s output by scaling the probability distribution over possible tokens.
How Temperature Works
Behind the scenes, language models output logits, which are unnormalized log probabilities, for each possible next token. Temperature rescales these logits before the softmax turns them into a probability distribution:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
Where:
- z_i is the logit for token i
- T is the temperature parameter
- p_i is the final probability of selecting token i
What’s Really Happening?
Think of temperature as a “confidence dial”:
- Low Temperature (T < 1): Sharpens the distribution, making high-probability tokens even more dominant
- Temperature = 1: Uses the model’s natural probability distribution
- High Temperature (T > 1): Flattens the distribution, giving more chance to unlikely tokens
- Temperature → 0: Becomes deterministic (always picks the most likely token)
- Temperature → ∞: Approaches uniform randomness
The Sampling Algorithm
Here’s what happens under the hood:
import numpy as np

def temperature_sample(logits, temperature=1.0):
    # Step 1: Scale logits by temperature
    scaled_logits = logits / temperature

    # Step 2: Apply softmax (with numerical stability)
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probs = exp_logits / np.sum(exp_logits)

    # Step 3: Sample from the distribution
    next_token = np.random.choice(len(probs), p=probs)
    return next_token
The numerical stability trick (subtracting max before exp) prevents overflow when dealing with large logit values.
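To see the "confidence dial" in action, here is a quick sanity check that calls temperature_sample repeatedly on a small, made-up set of logits and counts which token indices come back (exact counts will differ from run to run):

from collections import Counter
import numpy as np

toy_logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up logits for four candidate tokens

for t in (0.2, 1.0, 2.0):
    counts = Counter(int(temperature_sample(toy_logits, temperature=t)) for _ in range(1000))
    print(f"T={t}: {dict(sorted(counts.items()))}")

# Roughly: at T=0.2 token 0 wins almost every time; at T=1.0 the split is
# about 61/22/14/3 percent; at T=2.0 the distribution is noticeably flatter.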

Practical Examples
1) Low Temperature (0.1–0.3)
Perfect for tasks requiring consistency and precision:
import openai

# Example with low temperature
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.1
)
# Output: "The capital of France is Paris."
Use cases:
- Factual question answering
- Code generation
- Mathematical calculations
- Data extraction
- Classification tasks
The model becomes highly deterministic, consistently choosing the most probable tokens.
2) High Temperature (0.7–1.0+)
Unleashes creativity and diverse outputs:
# Example with high temperature
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Describe a sunset"}],
    temperature=0.9
)
# Output might vary each time:
# "The crimson orb melted into the horizon..."
# "Golden light spilled across the darkening sky..."
# "Fire painted the clouds as day surrendered to night..."
Use cases:
- Creative writing
- Brainstorming sessions
- Poetry and artistic content
- Marketing copy variations
- Story generation
Each run produces notably different outputs as the model explores less probable but potentially more interesting token choices.
2. Top-k and Top-p Sampling
Overview
While temperature scales the entire probability distribution, top-p and top-k are truncation methods that eliminate low-probability tokens before sampling. They provide different ways to control output quality and diversity.
Top-k Sampling
Top-k sampling keeps only the k most probable tokens and redistributes their probability mass.
How it works
- Get probability distribution: P = softmax(logits / temperature)
- Sort tokens by probability: P_sorted
- Keep only top-k tokens, set others to 0
- Renormalize: P’_i = P_i / Σ(top-k probabilities)
- Sample from P’
import torch
import torch.nn.functional as F

def top_k_sampling(logits, k=50, temperature=1.0):
    """
    Top-k sampling implementation

    Args:
        logits: [vocab_size] tensor of unnormalized scores
        k: number of top tokens to keep
        temperature: temperature scaling factor

    Returns:
        sampled token index
    """
    # Step 1: Apply temperature
    logits = logits / temperature

    # Step 2: Get top-k logits and their indices
    top_k_logits, top_k_indices = torch.topk(logits, k)

    # Step 3: Apply softmax to top-k logits only
    top_k_probs = F.softmax(top_k_logits, dim=-1)

    # Step 4: Sample from top-k distribution
    sampled_index = torch.multinomial(top_k_probs, num_samples=1)

    # Step 5: Map back to original vocabulary index
    token = top_k_indices[sampled_index]
    return token
Example
Let's say we have a vocabulary of 8 tokens:
tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
logits = [5.0, 4.5, 3.2, 2.8, 1.5, 0.8, 0.3, -0.5]
# After softmax (temperature = 1.0)
probs = [0.515, 0.312, 0.085, 0.057, 0.016, 0.008, 0.005, 0.002]
With top-k = 3:
# Step 1: Keep only the top-3 tokens
top_k_tokens = ['the', 'a', 'is']
top_k_probs = [0.515, 0.312, 0.085]
# Step 2: Renormalize so the kept probabilities sum to 1
renormalized = [0.565, 0.342, 0.093]
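As a quick check, we can feed these toy logits through the top_k_sampling function above; with k = 3, only 'the', 'a', or 'is' can ever be returned (the exact token varies because it samples):

import torch

tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
logits = torch.tensor([5.0, 4.5, 3.2, 2.8, 1.5, 0.8, 0.3, -0.5])

token_id = top_k_sampling(logits, k=3, temperature=1.0)
print(tokens[token_id.item()])  # always one of 'the', 'a', 'is'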
Top-p (Nucleus Sampling)
Top-p (also called nucleus sampling) keeps the smallest set of tokens whose cumulative probability ≥ p.
How it works
- Get probability distribution: P = softmax(logits / temperature)
- Sort tokens by probability (descending)
- Calculate cumulative sum: CDF_i = Σ P_j for j ≤ i
- Find the nucleus N: the smallest set of top tokens whose cumulative probability reaches p
- Renormalize and sample from N
def top_p_sampling(logits, p=0.9, temperature=1.0):
    """
    Top-p (nucleus) sampling implementation

    Args:
        logits: [vocab_size] tensor of unnormalized scores
        p: cumulative probability threshold (0 < p ≤ 1)
        temperature: temperature scaling factor

    Returns:
        sampled token index
    """
    # Step 1: Apply temperature and softmax
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)

    # Step 2: Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Step 3: Calculate cumulative probabilities
    cumsum_probs = torch.cumsum(sorted_probs, dim=-1)

    # Step 4: Find the nucleus (tokens to keep)
    # Remove tokens where cumsum > p (but keep the first token that exceeds p)
    sorted_indices_to_remove = cumsum_probs > p
    # Shift right so the first token that crosses p stays in the nucleus
    sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
    sorted_indices_to_remove[0] = False

    # Step 5: Set removed token probabilities to 0
    sorted_probs[sorted_indices_to_remove] = 0.0

    # Step 6: Renormalize
    sorted_probs = sorted_probs / sorted_probs.sum()

    # Step 7: Sample from the nucleus
    sampled_sorted_index = torch.multinomial(sorted_probs, num_samples=1)

    # Step 8: Map back to original vocabulary
    token = sorted_indices[sampled_sorted_index]
    return token
Same Example
tokens = ['the', 'a', 'is', 'very', 'quite', 'extremely', 'somewhat', 'rather']
probs = [0.515, 0.312, 0.085, 0.057, 0.016, 0.008, 0.005, 0.002]
# Step 1: Sort by probability (already sorted)
# Step 2: Calculate cumulative sum
cumulative = [0.515, 0.827, 0.912, 0.969, 0.985, 0.993, 0.998, 1.000]
With top-p = 0.9:
# cumulative[2] = 0.912 ≥ 0.9 ← Stop here!
# Nucleus = ['the', 'a', 'is']
With top-p = 0.75:
# cumulative[1] = 0.827 ≥ 0.75 ← Stop here!
# Nucleus = ['the', 'a']
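And the same check with top_p_sampling, reusing the tokens list and logits tensor from the top-k example: with p = 0.75 only 'the' or 'a' should ever come back.

token_id = top_p_sampling(logits, p=0.75, temperature=1.0)
print(tokens[token_id.item()])  # always 'the' or 'a'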
Visual Comparison
Top-k = 4 (Fixed):
███████████████ the (40%) ← Keep
██████████ a (25%) ← Keep
████ is (10%) ← Keep
███ very (8%) ← Keep
-- (7%) ← Discard (not in top-4)
-- (5%) ← Discard
-- (3%) ← Discard
-- (2%) ← Discard
Top-p = 0.9 (Adaptive):
███████████████ the (40%) ← Keep
██████████ a (25%) ← Keep
████ is (10%) ← Keep
███ very (8%) ← Keep
-- (7%) ← Keep (needed to reach the 90% cumulative threshold)
-- (5%) ← Discard (90% already covered)
-- (3%) ← Discard
-- (2%) ← Discard
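In practice you rarely implement these samplers yourself; most libraries expose them as generation parameters. For instance, a minimal sketch with Hugging Face transformers, which lets you combine temperature, top-k, and top-p in a single generate call (gpt2 is used here purely as a small stand-in model):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    max_new_tokens=30,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))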
3. Prompt Engineering Techniques
Effective prompts are the foundation of controlled generation. The way you structure your prompts directly impacts the quality and relevance of outputs.
Clear Instructions
Bad: "Tell me about dogs"
Good: "Write a 200-word informative paragraph about dog training techniques for puppies, focusing on positive reinforcement methods."
Role-Based Prompting
Prompt: "You are an expert data scientist with 10 years of experience.
Explain gradient descent in simple terms for a beginner."
Format Specification
Prompt: "List the top 5 programming languages for beginners.
Format your response as:
1. [Language]: [Brief description]
2. [Language]: [Brief description]
..."
Constraint Setting
Prompt: "Write a product review for a smartphone. Requirements:
- Exactly 150 words
- Include both pros and cons
- Mention battery life, camera, and performance
- Use a neutral tone"
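Putting role-based prompting, format specification, and constraint setting together in code, here is a minimal sketch using the same (legacy) openai client interface as the earlier examples; the prompt strings are placeholders you would swap for your own role, format, and constraints:

system_msg = "You are an expert data scientist with 10 years of experience."
user_msg = (
    "Explain gradient descent in simple terms for a beginner.\n"
    "Format your response as:\n"
    "1. [Concept]: [One-sentence explanation]\n"
    "2. [Concept]: [One-sentence explanation]\n"
    "Keep the whole answer under 150 words and use a friendly tone."
)

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0.3,  # low temperature keeps the output close to the requested format
)
print(response["choices"][0]["message"]["content"])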
4. Few-shot Learning
Few-shot learning involves providing examples within your prompt to guide the model’s behavior. This technique is incredibly powerful for establishing patterns and desired output formats.
Example: Sentiment Classification
Prompt: "Classify the sentiment of these reviews:
Review: 'This product exceeded my expectations!'
Sentiment: Positive
Review: 'Terrible quality, waste of money.'
Sentiment: Negative
Review: 'It's okay, nothing special.'
Sentiment: Neutral
Review: 'I love this new feature update!'
Sentiment: ?"
Example: Code Generation
Prompt: "Convert natural language to Python functions:
Input: 'Create a function that adds two numbers'
Output:
def add_numbers(a, b):
    return a + b
Input: 'Create a function that finds the maximum in a list'
Output:
def find_maximum(numbers):
    return max(numbers)
Input: 'Create a function that reverses a string'
Output: ?"
Benefits of Few-shot Learning:
- Establishes clear patterns
- Reduces ambiguity
- Improves consistency across outputs
- Minimizes need for fine-tuning
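When the examples live in your own data rather than in a hand-written prompt, it helps to assemble the few-shot block programmatically. A small sketch (the labeled pairs below are just the sentiment examples from above; any pairs from your domain work):

# Labeled examples that show the model the pattern we want it to follow
examples = [
    ("This product exceeded my expectations!", "Positive"),
    ("Terrible quality, waste of money.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
]

def build_few_shot_prompt(examples, new_review):
    lines = ["Classify the sentiment of these reviews:", ""]
    for review, label in examples:
        lines.append(f"Review: '{review}'")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: '{new_review}'")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "I love this new feature update!")
# Send `prompt` as the user message with a low temperature for consistent labels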
5. In-context Learning
In-context learning leverages the model’s ability to understand and apply new information provided within the conversation context, without updating the model’s parameters.
Dynamic Adaptation Example
Prompt: "I'm working with a specific dataset format:
{
'customer_id': 12345,
'purchase_date': '2024-01-15',
'items': ['laptop', 'mouse'],
'total': 899.99
}
Based on this format, generate 3 sample customer records for an electronics store."
Context-Aware Responses
Conversation Context:
User: "I'm building a React application for a food delivery service."
AI: "Great! What specific functionality are you looking to implement?"
User: "I need help with the cart component."
AI: [Provides React-specific cart component code tailored to food delivery]
Best Practices for In-context Learning:
- Provide clear, relevant context early in the conversation
- Reference previous context when building on discussions
- Use specific examples from your domain
- Maintain consistency with established patterns
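In chat APIs, in-context learning mostly comes down to sending the full message history with every request so the model can build on what was already established. A minimal sketch, again assuming the legacy openai client used earlier:

messages = [
    {"role": "user", "content": "I'm building a React application for a food delivery service."},
    {"role": "assistant", "content": "Great! What specific functionality are you looking to implement?"},
    {"role": "user", "content": "I need help with the cart component."},
]

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=messages,  # the full history is what makes the reply context-aware
    temperature=0.4,
)

# Keep appending turns so later answers stay consistent with earlier context
messages.append({"role": "assistant",
                 "content": response["choices"][0]["message"]["content"]})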
6. Chain-of-Thought Prompting
Chain-of-Thought (CoT) prompting encourages the model to show its reasoning process, leading to more accurate and explainable outputs.
Basic Chain-of-Thought
Prompt: "Solve this step by step:
A store has 24 apples. They sell 8 apples in the morning and 6 apples in the afternoon. How many apples are left?
Let me work through this step by step:
1) Starting apples: 24
2) Sold in morning: 8
3) Sold in afternoon: 6
4) Total sold: 8 + 6 = 14
5) Remaining: 24 - 14 = 10
Therefore, 10 apples are left."
Zero-Shot Chain-of-Thought
Prompt: "A company's revenue increased by 20% in Q1 and decreased by 10% in Q2. If they started with $100,000, what's their revenue at the end of Q2? Let's think step by step."
Complex Reasoning Example
Prompt: "Analyze whether this business model is sustainable:
Business: Subscription-based meal delivery service
- Monthly fee: $50
- Food cost per meal: $8
- Delivery cost per meal: $3
- 20 meals per month per subscriber
Let's break this down step by step:"
When to Use Chain-of-Thought:
- Mathematical calculations
- Logic problems
- Decision-making scenarios
- Complex analysis tasks
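Zero-shot chain-of-thought is easy to apply programmatically: append the trigger phrase to the question and keep the temperature low so the reasoning stays focused. A sketch with the revenue example from above (again assuming the legacy openai client):

question = (
    "A company's revenue increased by 20% in Q1 and decreased by 10% in Q2. "
    "If they started with $100,000, what's their revenue at the end of Q2?"
)

response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
    temperature=0.2,  # low temperature keeps the reasoning chain focused
)
print(response["choices"][0]["message"]["content"])
# Expected shape of the answer: $100,000 -> $120,000 after Q1 -> $108,000 after Q2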
7. Hallucination Prevention
Hallucinations, cases where AI models generate false or nonsensical information, are a significant challenge. Here are strategies to minimize them:
Grounding Techniques
Prompt: "Based ONLY on the following text, answer the question:
Text: [Insert specific source material]
Question: [Your question]
If the answer cannot be found in the provided text, respond with 'Information not available in the source.'"
Confidence Indicators
Prompt: "Answer the following question and indicate your confidence level (High/Medium/Low):
Question: What is the population of Tokyo in 2024?
Answer: [Response]
Confidence: [Level]
Reasoning: [Why this confidence level]"
Fact-Checking Prompts
Prompt: "Claim: 'Python was created in 1995 by Guido van Rossum'
Please verify this claim step by step:
1. Check the creation year
2. Verify the creator
3. Provide the correct information if any part is wrong
4. Rate the accuracy: Correct/Partially Correct/Incorrect"
Source Citation Requirements
Prompt: "Write a summary about renewable energy trends. For each major claim, indicate what type of source would be needed to verify it (e.g., 'government report', 'academic study', 'industry survey')."
Hallucination Prevention Best Practices:
- Request sources and citations
- Use specific, factual prompts
- Ask for confidence levels
- Provide authoritative source material when possible
(You can also use retrieval-augmented generation, RAG, for this 😃)
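To make grounding repeatable, you can wrap the pattern above in a small helper that always injects the source text and the fallback instruction. A sketch (the example source text is made up; in a RAG setup it would come from your retriever):

def grounded_question(source_text, question):
    return (
        "Based ONLY on the following text, answer the question.\n\n"
        f"Text: {source_text}\n\n"
        f"Question: {question}\n\n"
        "If the answer cannot be found in the provided text, respond with "
        "'Information not available in the source.'"
    )

prompt = grounded_question(
    source_text="The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    question="When was the Eiffel Tower completed?",
)
# Send `prompt` with a low temperature; hallucination risk drops because the model
# is told to refuse rather than guess when the source doesn't contain the answer.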
Combining Techniques for Maximum Control
The real power comes from combining these techniques strategically:
Example: Research Assistant
Prompt: "You are a research assistant helping with academic writing.
Temperature: 0.3 (for accuracy)
Task: Summarize the key findings about machine learning bias from the following paper excerpt.
Follow this format:
1. Main Finding: [One sentence]
2. Supporting Evidence: [Key statistics or examples]
3. Implications: [What this means for practitioners]
4. Confidence: [High/Medium/Low based on source quality]
Paper Excerpt: [Insert text]
Think through this step by step, and only include information directly supported by the text."
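Translated into an actual API call, the same research-assistant setup might look like the sketch below: the role goes in a system message, the temperature becomes a real parameter rather than prompt text, and the format, grounding, and step-by-step instructions stay in the user message (paper_excerpt is a placeholder for your own source).

paper_excerpt = "..."  # placeholder: paste the excerpt you want summarized

response = openai.ChatCompletion.create(
    model="gpt-4o",
    temperature=0.3,  # favour accuracy over creativity
    messages=[
        {"role": "system",
         "content": "You are a research assistant helping with academic writing."},
        {"role": "user", "content": (
            "Summarize the key findings about machine learning bias from the following paper excerpt.\n"
            "Follow this format:\n"
            "1. Main Finding: [One sentence]\n"
            "2. Supporting Evidence: [Key statistics or examples]\n"
            "3. Implications: [What this means for practitioners]\n"
            "4. Confidence: [High/Medium/Low based on source quality]\n\n"
            f"Paper Excerpt: {paper_excerpt}\n\n"
            "Think through this step by step, and only include information "
            "directly supported by the text."
        )},
    ],
)
print(response["choices"][0]["message"]["content"])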
Conclusion
Mastering generation control is essential for anyone working with AI models. By understanding and applying these seven techniques (temperature, top-k and top-p sampling, prompt engineering, few-shot learning, in-context learning, chain-of-thought prompting, and hallucination prevention), you can dramatically improve the quality, reliability, and usefulness of AI-generated content.
Thank you for reading! 🤗 I hope you found this article both informative and enjoyable to read. (Have you built any async agent applications lately? Comment below; I'd love to hear about them 🙂)
For more information like this, follow me on LinkedIn.