⚙️ Ollama Parameters Guide

Understanding Context Windows, Memory Management & Performance Tuning

Master the critical parameters that control how Ollama processes context, manages memory, and generates responses. Learn the difference between application limits and model constraints.

🎯 Parameter Overview

Ollama parameters control how your AI models process input, manage memory, and generate responses. Understanding these parameters is crucial for optimizing performance and getting the best results from your local AI models.

Parameter Processing Flow

Input text (your prompt + context)
→ Tokenization (text becomes tokens; roughly 4 characters ≈ 1 token)
→ Context window (limited by num_ctx)
→ Generation (controlled by the generation parameters)
💡 Key Concept
Parameters work in two layers: application limits (such as a maxContextSize setting in your PHP application) control how much data reaches the model, while model parameters (like num_ctx) control how the model processes that data.
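To make the two layers concrete, here is a minimal sketch: head -c stands in for an application-side maxContextSize limit (your own code would do this trimming), and num_ctx rides along in the API request's options field. The file research.txt, the 150KB cut-off, and the use of jq for safe JSON quoting are all assumptions for the example.

# Application layer: trim the raw input before it is sent (stand-in for maxContextSize)
head -c 150000 research.txt > trimmed.txt

# Model layer: num_ctx is passed through the API "options" field
curl http://localhost:11434/api/generate -d "$(jq -n --rawfile doc trimmed.txt '{
  model: "llama2",
  prompt: ("Summarize these research notes:\n" + $doc),
  options: { num_ctx: 8192 },
  stream: false
}')"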

🧠 Context Window Parameters

num_ctx
Model Parameter
Maximum number of tokens the model can process at once. This is the model's "working memory" including both input and output.
Default: 2048 tokens
maxContextSize
Application Limit
Maximum bytes of text your application reads from files before sending to the model. Controls input size before tokenization.
Example: 150KB - 200KB
num_predict
Generation Control
Maximum number of tokens the model will generate in a single response. It caps response length; it does not guarantee a minimum.
Default: 128 tokens
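To make num_ctx and num_predict stick across sessions, they can be baked into a custom model with a Modelfile. A minimal sketch; the name research-llama2 is just an example:

# Persist context and response limits in a custom model
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_ctx 8192
PARAMETER num_predict 1024
EOF

ollama create research-llama2 -f Modelfile
ollama run research-llama2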

Context Window Calculation

# Token estimation (rough)
1 token ≈ 4 characters of English text
1 token ≈ 0.75 words on average

# Example breakdown for num_ctx=8192:
Research context: ~20KB of text ≈ 4,500-5,000 tokens
Your prompt: ≈ 500 tokens
Available for the response: ≈ 2,500-3,000 tokens

# Setting a larger context window from inside the ollama run session
ollama run llama2
>>> /set parameter num_ctx 16384

ollama run codellama
>>> /set parameter num_ctx 32768
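The same 4-characters-per-token rule can be applied to a real file with standard shell tools to see whether it fits the window; research.txt and the 500-token prompt allowance are placeholders:

# Rough token estimate for an input file (assumes ~4 characters per token)
chars=$(wc -c < research.txt)
tokens=$(( chars / 4 ))
prompt_tokens=500
echo "input ≈ ${tokens} tokens"
echo "room left for the response at num_ctx=8192: $(( 8192 - tokens - prompt_tokens )) tokens"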
| Context Size | Tokens | Approximate Text | Memory Impact | Use Case |
|---|---|---|---|---|
| 2048 | 2K | ~1,500 words | Low | Simple Q&A |
| 4096 | 4K | ~3,000 words | Medium | Document summaries |
| 8192 | 8K | ~6,000 words | High | Research analysis |
| 16384 | 16K | ~12,000 words | Very high | Large document analysis |
| 32768 | 32K | ~24,000 words | Extreme | Massive context tasks |
⚠️ Memory Warning
Larger context windows significantly increase memory usage: the key/value cache grows with every token of context, and large models multiply the cost. A 70B model with a 32K context can require 80GB+ of RAM. Monitor your system resources carefully.
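Most of that growth is the key/value cache, which can be estimated from the model's shape. A back-of-envelope sketch; the layer count, head sizes, and fp16 cache below are illustrative figures for a 7B-class model, and real usage is lower with grouped-query attention or a quantized cache:

# kv_bytes ≈ 2 (K and V) x layers x kv_heads x head_dim x context_length x bytes_per_value
layers=32; kv_heads=32; head_dim=128; ctx=8192; bytes=2
echo "KV cache ≈ $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 )) MiB"
# 4096 MiB for these illustrative numbers, on top of the model weights themselves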

🎛️ Generation Parameters

temperature
Creativity Control
Controls randomness in responses. Lower values = more focused, higher values = more creative and diverse.
Default: 0.8 (Range: 0.0-2.0)
top_p
Nucleus Sampling
Only consider the smallest set of tokens whose cumulative probability reaches P. Adapts more dynamically than a fixed top_k cutoff.
Default: 0.9 (Range: 0.0-1.0)
top_k
Token Limiting
Only consider the top K most likely next tokens. Lower values = more focused responses.
Default: 40 (Range: 1-100+)
repeat_penalty
Repetition Control
Penalizes repetitive text. Values > 1.0 reduce repetition, < 1.0 allow more repetition.
Default: 1.1 (Range: 0.8-1.3)
presence_penalty
Token Diversity
Penalizes tokens that have already appeared in the text so far. Encourages topic diversity.
Default: 0.0 (Range: -2.0 to 2.0)
frequency_penalty
Usage Frequency
Penalizes tokens in proportion to how often they have already appeared. Reduces repetitive phrases.
Default: 0.0 (Range: -2.0 to 2.0)

Parameter Usage Examples

# Generation parameters are set with /set parameter inside an ollama run session,
# or through the API "options" field.

# Creative writing (high temperature, diverse sampling)
ollama run llama2
>>> /set parameter temperature 1.2
>>> /set parameter top_p 0.95
>>> /set parameter top_k 50
>>> Write a creative story about space exploration

# Technical documentation (low temperature, focused)
ollama run codellama
>>> /set parameter temperature 0.2
>>> /set parameter top_p 0.8
>>> /set parameter top_k 20
>>> /set parameter repeat_penalty 1.2
>>> Explain how to implement OAuth 2.0

# Balanced conversation (presence_penalty is passed via the API options)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Discuss the pros and cons of renewable energy",
  "options": { "temperature": 0.7, "top_p": 0.9, "presence_penalty": 0.6 },
  "stream": false
}'
🚀 Performance Tip
Lower temperature (0.1-0.3) and smaller top_k values (10-20) produce focused, often shorter responses, which also finish sooner. For more creative output, increase temperature (0.8-1.5) and raise top_p toward 0.95.

💾 Memory Management

OLLAMA_NUM_PARALLEL
Environment Variable
Maximum number of parallel requests Ollama can handle simultaneously.
Default: 1
OLLAMA_MAX_LOADED_MODELS
Environment Variable
Maximum number of models to keep loaded in memory at once.
Default: 3
OLLAMA_FLASH_ATTENTION
Optimization
Enable flash attention for better memory efficiency with long contexts.
Default: false
use_mmap
Model Loading
Use memory mapping for model loading. Can reduce RAM usage but may be slower.
Default: true
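These variables must be visible to the Ollama server process, not just your interactive shell. On a Linux install that runs Ollama as a systemd service, one way to set them persistently looks like this (the specific values are only examples):

# Open an override file for the service
systemctl edit ollama.service

# Add under [Service], then save:
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_FLASH_ATTENTION=1"

# Apply the changes
systemctl daemon-reload
systemctl restart ollama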

Memory Optimization Strategies

# Set memory limits
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1

# Enable memory optimizations
export OLLAMA_FLASH_ATTENTION=1

# Check memory usage
ollama ps
NAME            ID        SIZE    PROCESSOR  UNTIL
llama2:latest   e38ae474  4.1 GB  100% GPU   4 minutes from now

# Unload a model from memory without deleting it from disk
ollama stop llama2
# (alternatively, send an API request with "keep_alive": 0)
| Model Size | Base RAM | 4K Context | 8K Context | 32K Context |
|---|---|---|---|---|
| 7B | 4 GB | 5 GB | 6 GB | 10 GB |
| 13B | 7 GB | 8 GB | 10 GB | 16 GB |
| 34B | 20 GB | 22 GB | 26 GB | 40 GB |
| 70B | 40 GB | 45 GB | 55 GB | 80 GB |
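Before committing to a larger context, it helps to compare the table with what the machine actually has free. These are standard tools; nvidia-smi applies only to NVIDIA GPUs:

# System RAM currently available
free -h

# VRAM total and usage on NVIDIA GPUs
nvidia-smi --query-gpu=memory.total,memory.used --format=csv

# Models Ollama currently holds in memory
ollama ps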

⚡ Performance Tuning

num_thread
CPU Threading
Number of CPU threads to use for inference. Set to your CPU core count for optimal performance.
Default: Auto-detected
num_gpu
GPU Layers
Number of model layers to run on GPU. Higher values = more GPU usage, faster inference.
Default: Auto-detected
num_batch
Processing Batch
Number of tokens processed in parallel. Higher values can improve throughput.
Default: 512
rope_frequency_base
Context Extension
Base frequency for RoPE (Rotary Position Embedding). Can extend context length beyond training.
Default: 10000.0

Performance Optimization

# CPU optimization: threading and batch size go through the API options
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Your prompt",
  "options": { "num_thread": 8, "num_batch": 1024 }
}'

# GPU optimization: offload 35 layers to the GPU
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Your prompt",
  "options": { "num_gpu": 35, "num_batch": 2048 }
}'

# Extended context with RoPE scaling
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Analyze this large document...",
  "options": { "num_ctx": 16384, "rope_frequency_base": 20000 }
}'

# Check GPU utilization
nvidia-smi

# Monitor CPU usage
htop
💡 Optimization Tips
• Use GPU acceleration when available (often an order of magnitude faster)
• Set num_thread to match your CPU core count
• Increase num_batch for better throughput
• Use smaller models for faster responses
• Enable use_mlock to keep model weights pinned in RAM and avoid swapping (see the sketch below)
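use_mlock travels the same way as the other runtime options. A minimal sketch via the REST API; whether the lock succeeds also depends on the operating system's memory-locking limits, and the prompt is a placeholder:

# Pin model weights in RAM so the OS cannot swap them out
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Your prompt",
  "options": { "use_mlock": true }
}'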

📋 Practical Examples

Research Analysis Setup

# Large document analysis with extended context
ollama run llama2
>>> /set parameter num_ctx 32768
>>> /set parameter num_predict 2048
>>> /set parameter temperature 0.3
>>> /set parameter top_p 0.8
>>> /set parameter repeat_penalty 1.15
>>> Analyze the research papers and provide insights

# Application-level settings to pair with it
maxContextSize = 100KB   # ~25,000 tokens of input, leaving room for the response
num_ctx = 32768          # context window large enough for that input
num_predict = 2048       # cap (not minimum) on response length

Code Generation Setup

# Precise code generation (presence_penalty is passed via the API options)
curl http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write a Python function to process CSV files",
  "options": {
    "num_ctx": 8192,
    "temperature": 0.1,
    "top_k": 20,
    "repeat_penalty": 1.2,
    "presence_penalty": 0.5
  }
}'

Creative Writing Setup

# Creative and diverse output (frequency_penalty is passed via the API options)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write an engaging short story about time travel",
  "options": {
    "temperature": 1.1,
    "top_p": 0.95,
    "top_k": 60,
    "frequency_penalty": 0.8
  }
}'

Memory-Constrained Setup

# Optimize for limited RAM
# (these must be set in the environment of the Ollama server process)
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1

curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b",
  "prompt": "Your prompt",
  "options": { "num_ctx": 4096, "use_mmap": true, "num_batch": 256 }
}'

Troubleshooting Cut-off Responses

# Problem: responses are getting cut off

# Solution 1: raise the generation cap
ollama run llama2
>>> /set parameter num_predict 2048
>>> Your prompt

# Solution 2: expand the context window
ollama run llama2
>>> /set parameter num_ctx 16384
>>> Your prompt

# Solution 3: reduce input context (application level)
maxContextSize = 80KB   # leave more room for the response

# Solution 4: check roughly how many tokens you are sending
echo "Your text" | wc -c   # character count
echo "Your text" | wc -w   # word count
python3 -c "print(len('Your text')/4)"   # rough token estimate
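When going through the REST API, the final response also reports why generation stopped, which makes cut-offs easy to diagnose. A quick check, assuming jq is installed; a done_reason of "length" means the num_predict cap was hit rather than the context window:

# Deliberately small num_predict to show what a truncated response reports
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "List 100 uses for a paperclip",
  "options": { "num_predict": 64 },
  "stream": false
}' | jq '{done_reason, eval_count}'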
⚠️ Common Pitfall
Don't just increase num_ctx without considering memory usage. A 70B model with 32K context can require 80GB+ RAM and may cause system instability if you don't have enough memory.