🎯
Parameter Overview
Ollama parameters control how your AI models process input, manage memory, and generate responses. Understanding these parameters is crucial for optimizing performance and getting the best results from your local AI models.
Parameter Processing Flow
Input text (your prompt + context) → Tokenization (~4 characters = 1 token) → Context window (limited by num_ctx) → Generation (controlled by the generation parameters)
💡 Key Concept
Parameters work in two layers: application limits (like the PHP maxContextSize setting) control what data reaches the AI, while model parameters (like num_ctx) control how the AI processes that data.
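A minimal sketch of the two layers in practice, assuming a local Ollama server on the default port; the file name research_notes.txt, the 150KB cap, and the prompt are illustrative:

# Layer 1 - application limit: cap the raw input (in bytes) before it is tokenized
head -c 153600 research_notes.txt > context.txt

# Layer 2 - model parameter: num_ctx caps how many tokens the model processes per request
# (a real application would embed the contents of context.txt in the prompt)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Summarize the key findings in the supplied notes.",
  "stream": false,
  "options": { "num_ctx": 16384 }
}'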
🧠
Context Window Parameters
num_ctx
Model Parameter
Maximum number of tokens the model can process at once. This is the model's "working memory" including both input and output.
Default: 2048 tokens
maxContextSize
Application Limit
Maximum bytes of text your application reads from files before sending to the model. Controls input size before tokenization.
Example: 150KB - 200KB
num_predict
Generation Control
Maximum number of tokens the model will generate in its response. Raising it prevents answers from being cut off; it is a cap, not a minimum length.
Default: 128 tokens
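If you reuse the same limits often, they can be persisted in a custom model via a Modelfile instead of being set on every request; a short sketch, assuming the llama2 base model is already pulled (the name llama2-research and the values are illustrative):

cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_ctx 16384
PARAMETER num_predict 1024
EOF
ollama create llama2-research -f Modelfile
ollama run llama2-research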
Context Window Calculation
1 token ≈ 4 characters of English text
1 token ≈ 0.75 words on average
Worked example with num_ctx = 8192: a ~20KB research excerpt ≈ 5,000 tokens
Your prompt: ~500 tokens
Remaining for the response: roughly 2,500-3,000 tokens
(For comparison, a full 150KB file is roughly 37,000-38,000 tokens, which is why the application limit and num_ctx have to be tuned together.)
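The same budget can be estimated from an actual file with basic shell arithmetic; a rough sketch (context.txt is a placeholder, and every figure is an approximation):

bytes=$(wc -c < context.txt)              # input size in bytes
tokens=$((bytes / 4))                     # ~4 characters per token
prompt_tokens=500                         # rough size of your instructions
num_ctx=8192                              # context window you plan to use
echo "tokens left for the response: $((num_ctx - tokens - prompt_tokens))"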
num_ctx is raised per session with the /set parameter command inside the interactive prompt (ollama run has no dedicated flag for it):
ollama run llama2
>>> /set parameter num_ctx 16384
ollama run codellama
>>> /set parameter num_ctx 32768
| Context Size | Tokens | Approximate Text | Memory Impact | Use Case |
|--------------|--------|------------------|---------------|----------|
| 2048 | 2K | ~1,500 words | Low | Simple Q&A |
| 4096 | 4K | ~3,000 words | Medium | Document summaries |
| 8192 | 8K | ~6,000 words | High | Research analysis |
| 16384 | 16K | ~12,000 words | Very High | Large document analysis |
| 32768 | 32K | ~24,000 words | Extreme | Massive context tasks |
⚠️ Memory Warning
Larger context windows substantially increase memory usage, because the model's KV cache grows with the number of tokens it holds. A 70B model with a 32K context can require 80GB+ RAM. Monitor your system resources carefully.
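To see what a loaded model actually consumes, the standard CLI and OS tools are enough (free -h is Linux-specific):

ollama ps      # lists loaded models with their memory footprint
free -h        # overall system memory on Linux (use Activity Monitor on macOS)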
📋
Practical Examples
Research Analysis Setup
ollama run llama2
>>> /set parameter num_ctx 32768
>>> /set parameter num_predict 2048
>>> /set parameter temperature 0.3
>>> /set parameter top_p 0.8
>>> /set parameter repeat_penalty 1.15
>>> Analyze the research papers and provide insights
# Alternative, more memory-friendly values for the same task:
maxContextSize = 100KB   # application limit: trim the input before it reaches the model
num_ctx = 16384          # model parameter: moderate context window
num_predict = 1500       # model parameter: generous cap on response length
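The same research setup can also be sent as a single non-interactive request through the HTTP API, where every parameter goes into the options object; a sketch assuming the default local endpoint:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Analyze the research papers and provide insights",
  "stream": false,
  "options": {
    "num_ctx": 32768,
    "num_predict": 2048,
    "temperature": 0.3,
    "top_p": 0.8,
    "repeat_penalty": 1.15
  }
}'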
Code Generation Setup
ollama run codellama
>>> /set parameter num_ctx 8192
>>> /set parameter temperature 0.1
>>> /set parameter top_k 20
>>> /set parameter repeat_penalty 1.2
>>> /set parameter presence_penalty 0.5
>>> Write a Python function to process CSV files
Creative Writing Setup
ollama run mistral
>>> /set parameter temperature 1.1
>>> /set parameter top_p 0.95
>>> /set parameter top_k 60
>>> /set parameter frequency_penalty 0.8
>>> Write an engaging short story about time travel
Memory-Constrained Setup
export OLLAMA_MAX_LOADED_MODELS=1   # server-side: set where ollama serve runs
export OLLAMA_NUM_PARALLEL=1
ollama run llama2:7b
>>> /set parameter num_ctx 4096
>>> /set parameter use_mmap true
>>> /set parameter num_batch 256
Troubleshooting Cut-off Responses
ollama run llama2
>>> /set parameter num_predict 2048   # raise the cap on response length
>>> /set parameter num_ctx 16384      # or enlarge the whole context window
>>> Your prompt
maxContextSize = 80KB # application setting: leave more room for the response
# Estimate how many tokens your input actually uses:
echo "Your text" | wc -c # Character count
echo "Your text" | wc -w # Word count
python -c "print(len('Your text')/4)" # Rough token estimate
⚠️ Common Pitfall
Don't just increase num_ctx without considering memory usage. A 70B model with 32K context can require 80GB+ RAM and may cause system instability if you don't have enough memory.