


Z.ai releases GLM-5.1, a 754B-parameter open-weight model under MIT license that can run autonomously for up to 8 hours on a single task.
The AI industry has spent the past two years optimizing for intelligence. Better benchmarks, higher scores, more reasoning steps. But Z.ai, the Beijing-based company formerly known as Zhipu AI, is asking a different question entirely: can an AI agent stay productive for an entire workday?
GLM-5.1, released in early April 2026 under the MIT license, is designed to do exactly that. It is a 754-billion-parameter open-weight model that can work autonomously on a single task for up to eight hours, completing planning, execution, iterative optimization, and delivery cycles across hundreds of rounds and thousands of tool calls. Where most models peak early and plateau, GLM-5.1 is engineered to keep improving over extended agentic loops.
This is not just a bigger model. It represents a fundamental shift in how we evaluate AI: from single-turn cleverness to sustained, multi-hour productivity.
Z.ai spun out of Tsinghua University research and has grown into one of China's most prominent AI companies. The founding team includes Tsinghua professors Tang Jie (Jie Tang) and Li Juanzi, with CEO Zhang Peng leading the company. The organization has built the GLM model family over several generations and operates both research-focused and commercial AI products.
The rebrand from Zhipu AI to Z.ai signals global ambitions, and the MIT-licensed release of GLM-5.1 is the clearest indication yet that Z.ai intends to compete directly in the international open-weight model market alongside Meta's Llama, Google's Gemma, and other open alternatives.
GLM-5.1 is a massive model. At 754 billion parameters, it dwarfs most open-weight alternatives. The architecture uses a Mixture-of-Experts design with DeepSeek-style Multi-head Latent Attention (MLA) and Dynamic Sparse Attention, according to NVIDIA's NeMo AutoModel coverage. The expert configuration includes 256 routed experts with 8 active per token, providing efficient inference despite the total parameter count.
The model supports a 200K token context length and can generate up to 128K output tokens, numbers that enable feeding substantial codebases or specifications and producing large patches or complete artifacts in a single run.
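Those limits translate into a simple planning question: will a given prompt plus the response you want fit in one run? The sketch below checks a request against the advertised 200K-context/128K-output limits; the 4-characters-per-token heuristic is a crude assumption, so use a real tokenizer for production estimates.

```python
# Rough budget check against GLM-5.1's advertised limits (200K context,
# 128K max output). The chars-per-token ratio is an assumption.
MAX_CONTEXT_TOKENS = 200_000
MAX_OUTPUT_TOKENS = 128_000

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

def fits_budget(prompt: str, reserved_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output fit the context window."""
    if reserved_output_tokens > MAX_OUTPUT_TOKENS:
        return False
    return estimate_tokens(prompt) + reserved_output_tokens <= MAX_CONTEXT_TOKENS

# A ~100K-token codebase dump plus a 32K-token patch fits comfortably:
print(fits_budget("x" * 400_000, 32_000))  # -> True
```

A check like this is worth running before every agent round, since long loops accumulate history that eventually crowds out the output budget.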
Z.ai provides an OpenAI-compatible chat completions API endpoint with parameters for "thinking mode," streaming, function calls, tool use, structured outputs, and context caching. A notable feature is "tool_stream," which enables streaming of tool-call arguments during function calling, reducing perceived latency in agent workflows.
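A request to an endpoint like this might be shaped as follows. This is a sketch only: the URL path, model id, and the exact `thinking` and `tool_stream` parameter shapes are assumptions based on the description above and on OpenAI-compatible conventions, not verified API documentation, and the `run_sql` tool is hypothetical.

```python
import json

# Hypothetical request body for Z.ai's OpenAI-compatible chat completions
# endpoint. Field names follow OpenAI conventions; "thinking" and
# "tool_stream" shapes are assumptions from the article.
API_URL = "https://api.z.ai/api/paas/v4/chat/completions"  # assumed path

payload = {
    "model": "glm-5.1",  # assumed model id
    "messages": [
        {"role": "user", "content": "Profile this query and optimize it."}
    ],
    "stream": True,       # stream response tokens as they arrive
    "tool_stream": True,  # also stream tool-call arguments (reduces latency)
    "thinking": {"type": "enabled"},  # assumed "thinking mode" switch
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_sql",  # hypothetical tool
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
}

body = json.dumps(payload)  # POST this with an Authorization header
```

Enabling `tool_stream` matters most in agent loops, where the client can begin validating or pre-executing a tool call before the full argument payload has finished generating.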
For local deployment, GLM-5.1 supports vLLM and SGLang, with additional serving frameworks including xLLM, Transformers, and KTransformers listed on the Hugging Face model card. API access is available through api.z.ai and BigModel.cn, and the model is compatible with coding agent platforms like Claude Code and OpenClaw.
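A vLLM-based deployment might look like the sketch below. The model id, GPU count, and context flag value are assumptions; check the Hugging Face model card for the recommended serving configuration for a model of this size.

```shell
# Sketch: launching GLM-5.1 with vLLM's OpenAI-compatible server.
# Model id and parallelism settings are assumptions, not verified defaults.
MODEL="zai-org/GLM-5.1"
ARGS="--tensor-parallel-size 8 --max-model-len 200000"
# On a node with sufficient GPU memory, run:
#   vllm serve "$MODEL" $ARGS
echo "vllm serve $MODEL $ARGS"
```

At 754B total parameters, even an MoE layout with 8 active experts per token implies a multi-GPU node at minimum, which is why the API route remains the practical default for most teams.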
The headline capability of GLM-5.1 is its claim to work autonomously on a single task for up to eight hours. This is not a theoretical maximum. Z.ai designed the model specifically for long "agent loops" where it iterates through experiment-analyze-optimize cycles across hundreds of rounds and thousands of tool calls.
To illustrate this, Z.ai published two long-horizon demonstration results.
On VectorDBBench (SIFT-1M dataset, requiring 95% or higher recall), GLM-5.1 achieved 21,500 queries per second after more than 600 iterations and 6,000 tool calls. For context, the best prior 50-turn result was 3,547 QPS by Claude Opus 4.6. The model did not just match the competition; it continued optimizing far beyond the point where other models would have stopped improving.
On KernelBench Level 3, covering 50 optimization problems, GLM-5.1 achieved a geometric mean speedup of 3.6 times versus the PyTorch eager baseline. The standard torch.compile default managed 1.15 times and max-autotune reached 1.49 times. Claude Opus 4.6 scored 4.2 times on the same benchmark, showing that GLM-5.1 is competitive though not dominant on every task.
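Aggregate figures like "3.6x geometric mean speedup" are computed by averaging per-problem speedups in log space, which keeps one outlier kernel from dominating the score. A minimal illustration (the per-problem numbers below are hypothetical, not KernelBench data):

```python
import math

def geomean(speedups):
    """Geometric mean of per-problem speedups, the aggregation behind
    KernelBench-style headline numbers like 3.6x."""
    assert speedups and all(s > 0 for s in speedups)
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Two hypothetical kernels, 2x and 8x faster than the eager baseline:
print(round(geomean([2.0, 8.0]), 2))  # -> 4.0 (not the arithmetic mean, 5.0)
```

The log-space average is why a model must be consistently fast across all 50 problems to post a high score; a handful of 100x wins cannot paper over regressions elsewhere.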
These results matter because they demonstrate something benchmarks rarely capture: the ability to grind through a complex optimization problem over hours, making incremental progress with each iteration, rather than producing a single best-effort answer.
Beyond long-horizon tasks, GLM-5.1 posts competitive numbers on standard benchmarks.
On SWE-Bench Pro, GLM-5.1 scores 58.4, leading the pack ahead of GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2. This benchmark measures real-world software engineering tasks and is widely regarded as one of the most practical coding evaluations.
On Terminal-Bench 2.0, GLM-5.1 scores 63.5, behind Claude Opus 4.6 at 65.4 and Gemini 3.1 Pro at 68.5. On NL2Repo, which measures the ability to generate entire repositories from natural language descriptions, GLM-5.1 scores 42.7 versus Claude Opus 4.6's 49.8.
On the reasoning-focused HLE benchmark, GLM-5.1 scores 31.0, notably behind Gemini 3.1 Pro at 45.0 and GPT-5.4 at 39.8. The model's strengths clearly skew toward sustained coding and agentic tasks rather than pure abstract reasoning.
Z.ai's own benchmark table positions GLM-5.1 against a broad competitive set including Qwen3.6-Plus, MiniMax M2.7, DeepSeek-V3.2, Kimi K2.5, and the major proprietary models. The honest presentation of results where GLM-5.1 trails competitors on certain tasks adds credibility to the overall claims.
The traditional way to evaluate AI models focuses on single-turn or short-interaction quality. How well does the model answer this question? How accurately does it complete this coding task? These measurements are important, but they miss a critical dimension of real-world utility.
In practice, many valuable tasks require sustained attention over extended periods. Optimizing a database query plan, debugging a complex distributed system, refactoring a legacy codebase, or running a thorough security audit all involve iterative cycles of hypothesis, testing, analysis, and refinement. A model that produces a brilliant first answer but cannot iterate productively is of limited use for these workflows.
GLM-5.1's design philosophy directly targets this gap. The model is trained to maintain effectiveness across hundreds of iterations rather than optimizing for first-response quality alone. This aligns with the emerging "agentic AI" paradigm where models operate as persistent workers rather than instant oracles.
VentureBeat captured this shift with the headline "AI joins the 8-hour work day," framing it as a transition from evaluating AI on intelligence to evaluating it on sustained productivity. This reframing has significant implications for how enterprises should think about deploying AI agents.
GLM-5.1 is released under the MIT license, one of the most permissive open-source licenses available. This means developers and businesses can use, modify, and distribute the model for any purpose, including commercial applications, with minimal restrictions.
In the context of the intensifying open-model race involving Qwen, Kimi, DeepSeek, and Western alternatives, the MIT license is a strategic choice. It removes friction for enterprise adoption and makes GLM-5.1 accessible to the broadest possible developer community. Analysts at Constellation Research have noted this as part of a broader trend in Chinese AI labs releasing increasingly capable models under permissive licenses.
The open weights are available on Hugging Face under the "zai-org" organization and on ModelScope, ensuring wide accessibility regardless of geographic restrictions.
For teams building coding agents, GLM-5.1's differentiator is its ability to iterate productively over hundreds of cycles. This pairs naturally with agent frameworks that orchestrate terminal commands, test suites, benchmark runs, and code formatting loops. If your workflow involves an agent that needs to try many approaches to a problem, GLM-5.1 is designed precisely for that use case.
The 200K context length and 128K output capacity enable processing substantial repositories and generating large patches or complete files in a single run. However, this scale of input and output increases latency and cost, making robust streaming and tool-stream handling important for production deployments.
For enterprises evaluating self-hosted options, the MIT license and local deployment support through vLLM and SGLang enable running the model in code-sensitive or air-gapped environments. This is particularly relevant for organizations in defense, finance, and healthcare where data cannot leave internal infrastructure.
For cost-conscious teams, Z.ai's pricing structure introduces peak and off-peak quota multipliers for the Coding Plan. Peak hours (14:00 to 18:00 UTC+8) carry a 3x multiplier, while off-peak hours use a 2x multiplier, with a limited-time promotion reducing the off-peak rate to 1x through the end of April 2026. Scheduling batch agent runs during off-peak hours can materially reduce costs.
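The multiplier schedule above can be encoded in a few lines. The exact billing semantics are an assumption; this sketch only applies the stated multipliers based on the hour in UTC+8.

```python
from datetime import datetime, timezone, timedelta

UTC8 = timezone(timedelta(hours=8))

def quota_multiplier(ts: datetime, promo: bool = True) -> int:
    """Quota multiplier per the stated schedule: 3x during peak hours
    (14:00-18:00 UTC+8), otherwise 2x off-peak, or 1x while the
    promotional off-peak rate applies. Billing details are assumed."""
    hour = ts.astimezone(UTC8).hour
    if 14 <= hour < 18:
        return 3
    return 1 if promo else 2

# A batch run scheduled at 03:00 UTC+8 costs a third of a peak-hour run:
run = datetime(2026, 4, 10, 3, 0, tzinfo=UTC8)
print(quota_multiplier(run))  # -> 1
```

For multi-hour agent jobs the difference compounds: an eight-hour run started at 14:00 UTC+8 burns peak quota for half its duration, while the same run started at midnight stays entirely off-peak.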
GLM-5.1 is part of a larger trend in which AI systems evolve from tools that respond to prompts into autonomous workers that execute extended workflows. This shift has profound implications for software development, data analysis, and any knowledge work that involves iterative problem-solving.
The question for businesses is not whether AI can produce a good first draft. It is whether AI can handle the full cycle: plan an approach, execute it, evaluate the results, identify what went wrong, adjust the strategy, and repeat until the job is done. This is the capability that GLM-5.1 targets, and it is the capability that will ultimately determine how much human oversight AI agents require.
For productivity-focused applications, this evolution means AI can handle increasingly complex workflows end to end. Email management is a clear example: rather than simply suggesting a reply, an AI agent can classify incoming messages, prioritize them by urgency and context, draft personalized responses that match the sender's tone, and learn from corrections over time. Maylee's Auto-Reply feature, which uses confidence scores to determine when an AI-drafted response is good enough to send automatically, illustrates how this kind of sustained, autonomous behavior is already entering daily workflows.
GLM-5.1's benchmark claims are impressive, but independent verification is still ongoing. The model appeared on community leaderboards and evaluation arenas soon after release, and those venues will provide more objective data on real-world performance over the coming weeks.
The 754-billion-parameter scale makes local deployment accessible only to well-resourced teams with substantial GPU infrastructure. For most developers, the API route through Z.ai will be more practical, which somewhat undercuts the "open-weight" narrative for all but the largest organizations.
Still, GLM-5.1 represents a meaningful evolution in what open-weight models can do. By shifting the evaluation framework from "how smart is this model?" to "how long can this model stay productive?", Z.ai is staking out a position that may prove more important than any single benchmark score.
GLM-5.1 is a 754-billion-parameter open-weight large language model developed by Z.ai (formerly Zhipu AI), a Beijing-based AI company spun out of Tsinghua University. It is released under the MIT license and designed for long-horizon autonomous coding and agent tasks.
GLM-5.1 can work on a single task for up to 8 hours, iterating through experiment-analyze-optimize cycles across hundreds of rounds and thousands of tool calls. Unlike models that peak early, it is designed to keep improving throughout extended agent loops.
GLM-5.1 leads on SWE-Bench Pro with a score of 58.4, ahead of GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3. However, it trails Claude Opus 4.6 on NL2Repo (42.7 vs 49.8) and on Terminal-Bench 2.0 (63.5 vs 65.4).
GLM-5.1 supports local deployment via vLLM, SGLang, xLLM, Transformers, and KTransformers. At 754 billion parameters, however, self-hosting requires substantial GPU infrastructure; API access through Z.ai is the more practical option for most teams.
GLM-5.1 is released under the MIT license, one of the most permissive open-source licenses. It allows commercial use, modification, and redistribution with minimal restrictions.
GLM-5.1 supports a 200K token context length and can generate up to 128K output tokens, enabling it to process large codebases and produce substantial output in a single run.
Z.ai uses a quota multiplier pricing model for the Coding Plan with 3x during peak hours (14:00-18:00 UTC+8) and 2x during off-peak, with a promotional 1x off-peak rate through end of April 2026. No simple per-token price table has been published.
Model weights are available on Hugging Face (under zai-org) and ModelScope. API access is available through api.z.ai and BigModel.cn. The model is also compatible with Claude Code and OpenClaw.