⚙️ Ollama Parameters Guide

Understanding Context Windows, Memory Management & Performance Tuning

Master the critical parameters that control how Ollama processes context, manages memory, and generates responses. Learn the difference between application limits and model constraints.

🎯 Parameter Overview

Ollama parameters control how your AI models process input, manage memory, and generate responses. Understanding these parameters is crucial for optimizing performance and getting the best results from your local AI models.

Parameter Processing Flow

Input text (your prompt + context)
→ Tokenization (text becomes tokens; roughly 4 characters ≈ 1 token)
→ Context window (limited by num_ctx)
→ Generation (controlled by the generation parameters)
💡 Key Concept
Parameters work in two layers: application limits (such as a maxContextSize setting in your PHP application) control how much data reaches the model, while model parameters (like num_ctx) control how the model processes that data.
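To make the two layers concrete, here is a minimal sketch: head -c stands in for an application-side maxContextSize limit (your own code would do this trimming), and num_ctx rides along in the API request's options field. The file research.txt, the 150KB cut-off, and the use of jq for safe JSON quoting are all assumptions for the example.

# Application layer: trim the raw input before it is sent (stand-in for maxContextSize)
head -c 150000 research.txt > trimmed.txt

# Model layer: num_ctx is passed through the API "options" field
curl http://localhost:11434/api/generate -d "$(jq -n --rawfile doc trimmed.txt '{
  model: "llama2",
  prompt: ("Summarize these research notes:\n" + $doc),
  options: { num_ctx: 8192 },
  stream: false
}')"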

🧠 Context Window Parameters

num_ctx
Model Parameter
Maximum number of tokens the model can process at once. This is the model's "working memory" including both input and output.
Default: 2048 tokens
maxContextSize
Application Limit
Maximum bytes of text your application reads from files before sending to the model. Controls input size before tokenization.
Example: 150KB - 200KB
num_predict
Generation Control
Maximum number of tokens the model will generate in a single response. It caps response length; it does not guarantee a minimum.
Default: 128 tokens
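To make num_ctx and num_predict stick across sessions, they can be baked into a custom model with a Modelfile. A minimal sketch; the name research-llama2 is just an example:

# Persist context and response limits in a custom model
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_ctx 8192
PARAMETER num_predict 1024
EOF

ollama create research-llama2 -f Modelfile
ollama run research-llama2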

Context Window Calculation

# Token estimation (rough)
1 token ≈ 4 characters of English text
1 token ≈ 0.75 words on average

# Example breakdown for num_ctx=8192:
Research context: ~20KB of text ≈ 4,500-5,000 tokens
Your prompt: ≈ 500 tokens
Available for the response: ≈ 2,500-3,000 tokens

# Setting a larger context window from inside the ollama run session
ollama run llama2
>>> /set parameter num_ctx 16384

ollama run codellama
>>> /set parameter num_ctx 32768
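The same 4-characters-per-token rule can be applied to a real file with standard shell tools to see whether it fits the window; research.txt and the 500-token prompt allowance are placeholders:

# Rough token estimate for an input file (assumes ~4 characters per token)
chars=$(wc -c < research.txt)
tokens=$(( chars / 4 ))
prompt_tokens=500
echo "input ≈ ${tokens} tokens"
echo "room left for the response at num_ctx=8192: $(( 8192 - tokens - prompt_tokens )) tokens"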
| Context Size | Tokens | Approximate Text | Memory Impact | Use Case |
|---|---|---|---|---|
| 2048 | 2K | ~1,500 words | Low | Simple Q&A |
| 4096 | 4K | ~3,000 words | Medium | Document summaries |
| 8192 | 8K | ~6,000 words | High | Research analysis |
| 16384 | 16K | ~12,000 words | Very high | Large document analysis |
| 32768 | 32K | ~24,000 words | Extreme | Massive context tasks |
⚠️ Memory Warning
Larger context windows significantly increase memory usage: the key/value cache grows with every token of context, and large models multiply the cost. A 70B model with a 32K context can require 80GB+ of RAM. Monitor your system resources carefully.
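Most of that growth is the key/value cache, which can be estimated from the model's shape. A back-of-envelope sketch; the layer count, head sizes, and fp16 cache below are illustrative figures for a 7B-class model, and real usage is lower with grouped-query attention or a quantized cache:

# kv_bytes ≈ 2 (K and V) x layers x kv_heads x head_dim x context_length x bytes_per_value
layers=32; kv_heads=32; head_dim=128; ctx=8192; bytes=2
echo "KV cache ≈ $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 )) MiB"
# 4096 MiB for these illustrative numbers, on top of the model weights themselves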

🎛️ Generation Parameters

temperature
Creativity Control
Controls randomness in responses. Lower values = more focused, higher values = more creative and diverse.
Default: 0.8 (Range: 0.0-2.0)
top_p
Nucleus Sampling
Only consider the smallest set of tokens whose cumulative probability reaches P. Adapts more dynamically than a fixed top_k cutoff.
Default: 0.9 (Range: 0.0-1.0)
top_k
Token Limiting
Only consider the top K most likely next tokens. Lower values = more focused responses.
Default: 40 (Range: 1-100+)
repeat_penalty
Repetition Control
Penalizes repetitive text. Values > 1.0 reduce repetition, < 1.0 allow more repetition.
Default: 1.1 (Range: 0.8-1.3)
presence_penalty
Token Diversity
Penalizes tokens that have already appeared in the text so far. Encourages topic diversity.
Default: 0.0 (Range: -2.0 to 2.0)
frequency_penalty
Usage Frequency
Penalizes tokens in proportion to how often they have already appeared. Reduces repetitive phrases.
Default: 0.0 (Range: -2.0 to 2.0)

Parameter Usage Examples

# Generation parameters are set with /set parameter inside an ollama run session,
# or through the API "options" field.

# Creative writing (high temperature, diverse sampling)
ollama run llama2
>>> /set parameter temperature 1.2
>>> /set parameter top_p 0.95
>>> /set parameter top_k 50
>>> Write a creative story about space exploration

# Technical documentation (low temperature, focused)
ollama run codellama
>>> /set parameter temperature 0.2
>>> /set parameter top_p 0.8
>>> /set parameter top_k 20
>>> /set parameter repeat_penalty 1.2
>>> Explain how to implement OAuth 2.0

# Balanced conversation (presence_penalty is passed via the API options)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Discuss the pros and cons of renewable energy",
  "options": { "temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.6 },
  "stream": false
}'
🚀 Performance Tip
Lower temperature (0.1-0.3) and smaller top_k values (10-20) produce focused, often shorter responses, which also finish sooner. For more creative output, increase temperature (0.8-1.5) and raise top_p toward 0.95.

💾 Memory Management

OLLAMA_NUM_PARALLEL
Environment Variable
Maximum number of parallel requests Ollama can handle simultaneously.
Default: 1
OLLAMA_MAX_LOADED_MODELS
Environment Variable
Maximum number of models to keep loaded in memory at once.
Default: 3
OLLAMA_FLASH_ATTENTION
Optimization
Enable flash attention for better memory efficiency with long contexts.
Default: false
use_mmap
Model Loading
Use memory mapping for model loading. Can reduce RAM usage but may be slower.
Default: true
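These variables must be visible to the Ollama server process, not just your interactive shell. On a Linux install that runs Ollama as a systemd service, one way to set them persistently looks like this (the specific values are only examples):

# Open an override file for the service
systemctl edit ollama.service

# Add under [Service], then save:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"

# Apply the changes
systemctl daemon-reload
systemctl restart ollama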

Memory Optimization Strategies

# Set memory limits
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1

# Enable memory optimizations
export OLLAMA_FLASH_ATTENTION=1

# Check memory usage
ollama ps
NAME            ID        SIZE    PROCESSOR  UNTIL
llama2:latest   e38ae474  4.1 GB  100% GPU   4 minutes from now

# Unload a model from memory without deleting it from disk
ollama stop llama2
# (alternatively, send an API request with "keep_alive": 0)
| Model Size | Base RAM | 4K Context | 8K Context | 32K Context |
|---|---|---|---|---|
| 7B | 4 GB | 5 GB | 6 GB | 10 GB |
| 13B | 7 GB | 8 GB | 10 GB | 16 GB |
| 34B | 20 GB | 22 GB | 26 GB | 40 GB |
| 70B | 40 GB | 45 GB | 55 GB | 80 GB |
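Before committing to a larger context, it helps to compare the table with what the machine actually has free. These are standard tools; nvidia-smi applies only to NVIDIA GPUs:

# System RAM currently available
free -h

# VRAM total and usage on NVIDIA GPUs
nvidia-smi --query-gpu=memory.total,memory.used --format=csv

# Models Ollama currently holds in memory
ollama ps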

⚡ Performance Tuning

num_thread
CPU Threading
Number of CPU threads to use for inference. Set to your CPU core count for optimal performance.
Default: Auto-detected
num_gpu
GPU Layers
Number of model layers to run on GPU. Higher values = more GPU usage, faster inference.
Default: Auto-detected
num_batch
Processing Batch
Number of tokens processed in parallel. Higher values can improve throughput.
Default: 512
rope_frequency_base
Context Extension
Base frequency for RoPE (Rotary Position Embedding). Can extend context length beyond training.
Default: 10000.0

Performance Optimization

# CPU optimization: threading and batch size go through the API options
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Your prompt",
  "options": { "num_thread": 8, "num_batch": 1024 }
}'

# GPU optimization: offload 35 layers to the GPU
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Your prompt",
  "options": { "num_gpu": 35, "num_batch": 2048 }
}'

# Extended context with RoPE scaling
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Analyze this large document...",
  "options": { "num_ctx": 16384, "rope_frequency_base": 20000 }
}'

# Check GPU utilization
nvidia-smi

# Monitor CPU usage
htop
💡 Optimization Tips
• Use GPU acceleration when available (often an order of magnitude faster)
• Set num_thread to match your CPU core count
• Increase num_batch for better throughput
• Use smaller models for faster responses
• Enable use_mlock to keep model weights pinned in RAM and avoid swapping (see the sketch below)
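use_mlock travels the same way as the other runtime options. A minimal sketch via the REST API; whether the lock succeeds also depends on the operating system's memory-locking limits, and the prompt is a placeholder:

# Pin model weights in RAM so the OS cannot swap them out
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Your prompt",
  "options": { "use_mlock": true }
}'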

📋 Practical Examples

Research Analysis Setup

# Large document analysis with extended context
ollama run llama2
>>> /set parameter num_ctx 32768
>>> /set parameter num_predict 2048
>>> /set parameter temperature 0.3
>>> /set parameter top_p 0.8
>>> /set parameter repeat_penalty 1.15
>>> Analyze the research papers and provide insights

# Application-level settings to pair with it
maxContextSize = 100KB   # ~25,000 tokens of input, leaving room for the response
num_ctx = 32768          # context window large enough for that input
num_predict = 2048       # cap (not minimum) on response length

Code Generation Setup

# Precise code generation (presence_penalty is passed via the API options)
curl http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write a Python function to process CSV files",
  "options": {
    "num_ctx": 8192,
    "temperature": 0.1,
    "top_k": 20,
    "repeat_penalty": 1.2,
    "presence_penalty": 0.5
  }
}'

Creative Writing Setup

# Creative and diverse output (frequency_penalty is passed via the API options)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write an engaging short story about time travel",
  "options": {
    "temperature": 1.1,
    "top_p": 0.95,
    "top_k": 60,
    "frequency_penalty": 0.8
  }
}'

Memory-Constrained Setup

# Optimize for limited RAM
# (these must be set in the environment of the Ollama server process)
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1

curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b",
  "prompt": "Your prompt",
  "options": { "num_ctx": 4096, "use_mmap": true, "num_batch": 256 }
}'

Troubleshooting Cut-off Responses

# Problem: responses are getting cut off

# Solution 1: raise the generation cap
ollama run llama2
>>> /set parameter num_predict 2048
>>> Your prompt

# Solution 2: expand the context window
ollama run llama2
>>> /set parameter num_ctx 16384
>>> Your prompt

# Solution 3: reduce input context (application level)
maxContextSize = 80KB   # leave more room for the response

# Solution 4: check roughly how many tokens you are sending
echo "Your text" | wc -c   # character count
echo "Your text" | wc -w   # word count
python3 -c "print(len('Your text')/4)"   # rough token estimate
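When going through the REST API, the final response also reports why generation stopped, which makes cut-offs easy to diagnose. A quick check, assuming jq is installed; a done_reason of "length" means the num_predict cap was hit rather than the context window:

# Deliberately small num_predict to show what a truncated response reports
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "List 100 uses for a paperclip",
  "options": { "num_predict": 64 },
  "stream": false
}' | jq '{done_reason, eval_count}'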
⚠️ Common Pitfall
Don't just increase num_ctx without considering memory usage. A 70B model with 32K context can require 80GB+ RAM and may cause system instability if you don't have enough memory.