What is Llama 4 Scout?
Llama 4 Scout is Meta's open-weight AI model, released on April 5, 2025. At launch it offered the largest context window of any publicly released model: 10 million tokens. Meta positions Scout for long-context document analysis, codebase understanding, and research tasks that require sustained attention across very large amounts of text. It is a Mixture-of-Experts (MoE) model with 109 billion total parameters, of which 17 billion are active per token, and it fits on a single H100 GPU with Int4 quantization.
Llama 4 Scout Technical Specs
Scout uses a Mixture-of-Experts (MoE) architecture with 16 experts, activating 17 billion parameters per token out of 109 billion total. Because only a fraction of the weights is used for each token, per-token compute is closer to that of a 17B dense model, although all 109B parameters must still be held in memory. The 10 million token context allows loading entire codebases, complete legal documents, or full research libraries into a single context window. Meta trained Scout on 40 trillion tokens of diverse text and image data.
The model is natively multimodal: it accepts both text and image inputs, making it useful for document processing tasks involving scanned files, diagrams, or mixed media. Its pre-training data covered 200 languages, though Meta lists only 12 as officially supported, so quality outside that set can vary and is strongest in English.
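To gauge whether a codebase would fit in the 10 million token window, a rough character-based estimate is often enough for a first pass. The sketch below assumes about 4 characters per token, which is a common heuristic and not Scout's actual tokenizer; the function names, file extensions, and output reserve are illustrative choices.

```python
import os

CONTEXT_WINDOW_TOKENS = 10_000_000  # Scout's advertised context window
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(root, extensions=(".py", ".md", ".txt")):
    """Walk a directory and estimate its token count from character counts."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(token_count, reserve_for_output=8_000):
    """Return True if the prompt plus a reserved output budget fits the window."""
    return token_count + reserve_for_output <= CONTEXT_WINDOW_TOKENS
```

For a precise count, tokenize the files with the model's own tokenizer instead of relying on the heuristic.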
Llama 4 Scout Pricing and Access
As an open-weight model, Scout's weights are free to download from Meta's website and Hugging Face. Commercial use is permitted for businesses with fewer than 700 million monthly active users; those above this threshold must obtain a separate license from Meta. Via API providers like Together AI, Groq, and Fireworks, Scout costs approximately $0.15 per million input tokens and $0.50 per million output tokens, though exact rates vary by provider.
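As a worked example of the API pricing, the sketch below estimates the cost of a single request from the approximate per-million-token rates quoted above. The function name is illustrative, and the constants should be replaced with the current rates from whichever provider you use.

```python
INPUT_PRICE_PER_MILLION = 0.15   # USD, approximate figure quoted above
OUTPUT_PRICE_PER_MILLION = 0.50  # USD, approximate figure quoted above


def estimate_cost(input_tokens, output_tokens):
    """Return the estimated USD cost of one request."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost
```

At these rates, a request with 2 million input tokens and 4,000 output tokens comes to about $0.30, and filling the entire 10 million token window costs about $1.50 in input tokens alone.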
Llama 4 Scout vs Alternatives
Scout's 10M context window is the largest among the models compared here: Llama 4 Maverick caps at 1M, Claude Opus 4.6 at 1M, and GPT-5.2 at 400K (the latter two are proprietary, not open-weight). If your application needs to process entire repositories or very long documents in a single pass, Scout is the only practical open-weight option. The tradeoff is hardware: processing 1.4 million tokens of context requires 8 H100 GPUs. Early user reports also suggest that effective recall degrades beyond roughly 32,000 tokens, so test your specific use case before relying on the full window.
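One common way to test long-context recall for your own use case is a "needle in a haystack" check: hide a known fact at varying depths in filler text of varying length, then ask the model to retrieve it. The sketch below is a minimal harness under stated assumptions: `ask_model` is a hypothetical callable you supply (for example, a wrapper around your provider's client) that takes a prompt string and returns the model's reply as a string. The filler sentence and needle are placeholders.

```python
def build_haystack_prompt(filler_sentence, needle, total_sentences, depth):
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler_sentence] * total_sentences
    position = int(depth * total_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def run_needle_test(ask_model, question, expected, sizes, depths,
                    filler_sentence="The sky was clear and the road was quiet.",
                    needle="The secret code is 7421."):
    """Return {(size, depth): bool} showing where the needle was retrieved.

    `ask_model` is a hypothetical user-supplied function: prompt -> reply text.
    """
    results = {}
    for size in sizes:
        for depth in depths:
            context = build_haystack_prompt(filler_sentence, needle, size, depth)
            answer = ask_model(context + "\n\n" + question)
            results[(size, depth)] = expected in answer
    return results
```

Sweeping `sizes` from a few thousand sentences upward and `depths` across 0.0 to 1.0 shows where retrieval starts to fail for your data. Real documents from your domain make better filler than a repeated sentence.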
Frequently Asked Questions
Can I run Llama 4 Scout locally?
Yes, with a single H100 GPU using Int4 quantization. For longer contexts (above 100K tokens), you'll need multiple H100s. Many developers use cloud inference providers instead of self-hosting for cost efficiency.
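The single-GPU claim follows from simple arithmetic on the weights. The sketch below estimates weight memory at different precisions; it assumes the 80 GB H100 variant and counts weights only, ignoring the KV cache and activations, which grow with context length and explain why longer contexts need more GPUs. The function names are illustrative.

```python
BITS_PER_PARAM = {"bf16": 16, "int8": 8, "int4": 4}
H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant


def weight_memory_gb(total_params_billions, precision):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = BITS_PER_PARAM[precision] / 8
    return total_params_billions * bytes_per_param


def fits_on_single_h100(total_params_billions, precision):
    """True if the weights alone fit; ignores KV cache and activations."""
    return weight_memory_gb(total_params_billions, precision) <= H100_MEMORY_GB
```

For Scout's 109 billion total parameters, this gives about 54.5 GB at Int4, which fits on one card, versus about 218 GB at bf16, which does not. The total parameter count is what matters here, not the 17 billion active parameters, because every expert must be resident in memory.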
Is Llama 4 Scout better than Llama 3?
Significantly. Scout uses an MoE architecture (versus Llama 3's dense architecture), is natively multimodal, and offers roughly 80 times the context of Llama 3.1 (10M versus 128K tokens). The two are not directly comparable because they target different use cases, but in capability terms Scout is a major step forward.