Introducing Context Autopilot
At Context, we're excited to unveil Autopilot—an AI productivity suite that learns like you, thinks like you, and uses tools like you. Powered by the world's first context engine, Autopilot integrates seamlessly with your existing workflows and is capable of handling most information work today.
Rethinking Tools for Intelligent AI
As large language models (LLMs) become increasingly intelligent, the tools we use need to evolve alongside them. Traditional software is built for human input—a legacy stretching back to the 1970s. This paradigm is shifting, and the future is generative at its very core. Current solutions are often incremental, non-interpretable, or require a change in workflow, limiting their adoption and utility.
An LLM-Based Operating System
Autopilot addresses these challenges by providing an LLM-based operating system where models become the primary orchestrator and reasoner, working in tandem with our context engine. It unhobbles models by providing purpose-built tools and scarce situational context, enabling them to parse organizations and think more like humans.
Autopilot lives in its own workspace, connecting directly to storage services like Drive and SharePoint, communication channels such as Slack and email, as well as client documents, personal notes, and external databases.
Seamless Integration with Existing Workflows
Autopilot builds projects with the same tools as you. It has access to a complete office suite, browser, and code editor. Autopilot's apps are designed for the model itself, enabling direct state manipulation and complex multi-step workflows.
This gives the AI the knowledge a human would need for meaningful understanding and interaction. Autopilot actively collaborates with the user: requesting preferences and information, taking feedback, and parallelizing tasks while you focus on what matters most.
By reporting on progress in real time, it enables constant human-in-the-loop collaboration. When faced with complex challenges, Autopilot can self-replicate, spinning up swarms of collaborative agents focused on a common goal. This allows for efficient task delegation and execution, just like a well-coordinated team.
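The fan-out pattern described above—one goal split into subtasks handled by parallel agents—can be illustrated with standard concurrency primitives. This is a minimal sketch, not Autopilot's actual implementation; the `run_swarm` helper and its arguments are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_swarm(subtasks, worker, max_agents=4):
    # Fan the subtasks out to a pool of parallel "agents" (here, threads),
    # then gather their results in the original order.
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        return list(pool.map(worker, subtasks))

# Illustrative usage: each agent handles one subtask of a shared goal.
results = run_swarm(
    ["draft the intro", "collect the figures", "outline next steps"],
    lambda task: f"done: {task}",
)
```

In a real system the worker would be an LLM call rather than a lambda, but the coordination shape—split, delegate, collect—is the same.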
The Memory Stack: Powering the Context Engine
All of this is made possible by Autopilot's memory stack, which extends beyond a shared workspace to ensure consistency across file systems and inputs. It enables continuous reflection and output iteration—this is the context engine.
The Context Engine: A New Paradigm
The context engine allows models to reason over large bodies of knowledge with true understanding. This is also what enables Autopilot to plan, reason over, and execute on tasks requiring hundreds of steps.
Retrieval-Augmented Generation (RAG) is fundamentally constrained by the architecture of search, extending only to a few semantically similar pieces of data. In contrast, Autopilot's context engine enables swarms of agents to relentlessly traverse your knowledge base, tracing new paths, discovering connections, and uncovering insights. By distilling thousands of interactions, we deliver cutting-edge intelligence over huge contexts—without the degradation associated with long-context models.
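To make the contrast concrete: classic RAG retrieval reduces to ranking chunks by embedding similarity and keeping only the top k, so anything outside those k chunks never reaches the model. A minimal sketch, with toy two-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    # corpus: list of (text, embedding) pairs.
    # Returns the k texts most similar to the query — everything else
    # is simply never shown to the model, which is RAG's core constraint.
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

Agentic traversal, by contrast, can follow links and issue new queries beyond this one-shot similarity cutoff.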
Context is dynamic: it can learn over time and fix mistakes. Autopilot constantly monitors incoming information and autonomously refines itself with queries to external data sources. This enables deep task understanding and skill acquisition—Autopilot can be taught like any employee with an instruction set of your choosing.
Technical Evaluation: Benchmarking the Context Engine
To evaluate the effectiveness of our context engine, we benchmarked it against other frontier models and RAG implementations using two comprehensive benchmarks:
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly (Yen et al., 2024)
- LOFT: Long Context Frontiers Benchmark introduced in Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? (Lee et al., 2024)
Limitations of Traditional Benchmarks
The popular "Needle in a Haystack" test evaluates a model's ability to locate specific information within a long context window. However, it is saturated for almost all models and shows little correlation with real-world performance. HELMET significantly improves upon existing long-context benchmarks and addresses shortcomings with other popular benchmarks like RULER.
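For reference, a needle-in-a-haystack trial just plants one fact at a chosen depth in filler text and checks whether the model's answer recovers it—which is why the test saturates once models handle simple lookup. A minimal sketch (function names and the pass/fail check are illustrative):

```python
def build_haystack(filler, needle, depth, total_chars):
    # Repeat the filler text up to ~total_chars, then splice the needle
    # in at the given fractional depth (0.0 = start, 1.0 = end).
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + needle + body[pos:]

def needle_recovered(model_answer, needle_fact):
    # Pass/fail: does the model's answer contain the planted fact verbatim?
    return needle_fact in model_answer
```

Sweeping `depth` and `total_chars` produces the familiar recall-by-position heatmaps, but success only demonstrates lookup, not the multi-step reasoning HELMET targets.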
HELMET Benchmark Results
Figure 1 illustrates the long-context benchmark results of frontier LCLMs (Llama-3.1 8B/70B, GPT-4o mini, GPT-4o (2024-08-06), and Gemini-1.5 Flash/Pro) at 128k tokens input length. Unexpected trends emerge: on RULER, Llama 8B performs better than Llama 70B, and Gemini 1.5 Flash outperforms Gemini 1.5 Pro. Similarly, on InfiniteBench, Llama 8B outperforms Llama 70B, and on "Needle in a Haystack," Gemini 1.5 Flash surpasses Gemini 1.5 Pro. HELMET, on the other hand, ranks these frontier models more consistently.
Figure 2 compares long-context benchmarks like ZeroSCROLLS, LongBench, L-Eval, RULER, ∞BENCH, and HELMET. HELMET features seven diverse task categories with low correlation between them. Its design supports context windows beyond 128k tokens; however, the official repository currently supports evaluations only up to 128k tokens. We therefore use LOFT to evaluate performance at longer context sizes, specifically 1 million tokens.
Evaluation Methodology
We conducted evaluations using the following parameters:
- HELMET: Ran on a random 15% subset of the entire benchmark.
  - Task Types and Metrics:
    - RAG: Substring exact match
    - Passage Re-ranking: NDCG@10 (Normalized Discounted Cumulative Gain)
    - Generation with Citations: Recall/Cite
    - Long Document QA: Model-based/ROUGE F1/Accuracy
    - Summarization: Model-based
    - Many-shot In-Context Learning: Accuracy
    - Synthetic Recall: Substring exact match
- LOFT: Ran on a random 30% subset of three tasks.
  - Task Types and Metrics:
    - RAG: Subspan exact match
    - Text Retrieval: Recall@1
    - SQL: Accuracy
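Two of the metrics above are simple to state precisely. A minimal sketch of substring exact match and NDCG@k, assuming graded relevance labels listed in the ranked order a system returned them (the helper names are ours, not from either benchmark's codebase):

```python
import math

def substring_exact_match(prediction, answers):
    # True if any gold answer appears verbatim (case-insensitive)
    # anywhere in the model's prediction.
    pred = prediction.lower()
    return any(ans.lower() in pred for ans in answers)

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance at rank i (0-indexed)
    # is down-weighted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Note that some formulations use the gain `2**rel - 1` instead of raw `rel`; for binary relevance the two coincide.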
We omitted the Many-shot In-Context Learning task due to the unavailability of datasets for testing on 1 million token context sizes in the official repository. We also skipped the Audio Retrieval and Visual Retrieval tasks. All models used in these evaluations are the latest versions available at the time of writing.
Benchmarking Outcomes
Our evaluations demonstrate that Autopilot's context engine is state-of-the-art on benchmarks like HELMET and outperforms GraphRAG with frontier models. By reasoning over the entire body of knowledge with true understanding, Autopilot extends beyond the limitations of traditional RAG architectures.
Our Mission: Redefining Work
Our mission is to dramatically reduce the time the world spends on rote work—whether it's writing that email you've been meaning to get around to or assembling a crucial presentation. Today, we're bringing the workspace together so humans and AI can collaborate seamlessly. Tomorrow, we'll be organizing autonomous and human workers in AI-native companies.
Join Us in Building the Future
We're assembling a team of world-class engineers, researchers, and product builders to shape the future of the workspace. If you're passionate about pushing the boundaries of what's possible with AI, we'd love to hear from you.
*This blog post was generated with the help of Autopilot.*