Xiaomi's MiMo-V2.5-Pro-UltraSpeed achieves record AI inference speed on regular GPUs, outpacing custom silicon solutions dramatically.
Most of us know Xiaomi as the company behind affordable smartphones, electric scooters, or perhaps smart home gadgets. It’s not typically the name that leaps to mind when you think of cutting-edge AI innovation, especially not the kind that blows past speed records set by dedicated custom silicon. Yet, that's precisely what just happened, and it stands to dramatically reshape how companies, particularly startups, approach deploying advanced AI.
Here's what happened: Xiaomi has unveiled MiMo-V2.5-Pro-UltraSpeed, a serving mode for its flagship trillion-parameter language model that can process over 1,000 tokens per second, hitting peaks near 1,200 in demonstrations. To put that in perspective, the AI you're likely chatting with, like GPT-5.5, typically runs at around 68 tokens per second, with Claude Opus 4.6 at 71, and even Google's faster Gemini Flash only reaching 192 tokens per second. Xiaomi's new offering is not an incremental gain: at 1,000 tokens per second it is about 15 times as fast as the GPT-5.5 figure and about 5 times as fast as Gemini Flash.
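For readers who want to check the multiples, here is a small Python sketch that recomputes them from the throughput figures quoted above. The 1,000 tokens-per-second value is the article's headline number rather than a measured result, and `speedup` is just an illustrative helper name.

```python
# Throughput figures (tokens per second) exactly as quoted in the article.
QUOTED_TPS = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}

# Headline UltraSpeed figure; demonstrations reportedly peaked near 1,200.
ULTRASPEED_TPS = 1000


def speedup(baseline_tps, fast_tps=ULTRASPEED_TPS):
    """Return how many times as fast fast_tps is compared with baseline_tps."""
    return fast_tps / baseline_tps


for name, tps in QUOTED_TPS.items():
    print(f"{name}: {speedup(tps):.1f}x")
```

The "15 times" claim holds only against the slowest quoted models; against Gemini Flash the gap is closer to 5 times.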
What makes this truly remarkable is that Xiaomi didn't achieve this feat using bespoke, multi-million-dollar custom hardware. They did it on a single, standard 8-GPU commodity node, the kind you can readily rent from major cloud providers like AWS or Google Cloud right now. This is a crucial distinction that fundamentally alters the economics and accessibility of high-speed AI.
Companies like Cerebras and Groq have poured years and billions into designing specialized chips to solve the very problem of AI inference speed. Cerebras, for example, developed a wafer-scale chip the size of a dinner plate, featuring 44GB of on-chip memory to bypass the bandwidth bottlenecks that plague traditional GPUs. Their impressive work enabled Llama 3.1 405B to hit 969 tokens per second. Groq, with its custom Language Processing Unit (LPU) architecture, reaches speeds between 300 and 750 tokens per second, depending on the model. While these are engineering marvels, their proprietary hardware is not something you can simply spin up in the cloud tonight.
Xiaomi, conversely, achieved its breakthrough purely through software. This involved a clever combination of model-level optimizations and a purpose-built inference engine they call TileRT. This signals a significant pivot in the AI infrastructure race. It’s no longer just about who can build the biggest, most specialized silicon; it’s increasingly about who can extract the most performance from ubiquitous, off-the-shelf hardware. This is a game-changer for startups and enterprises alike in North America, as it democratizes access to what was previously an elite tier of AI performance.
Why This Changes Everything
The core of Xiaomi's speed lies in two sophisticated techniques working in concert, driven by their TileRT inference engine. The first is FP4 Quantization. Normally, AI models store their weights at 8-bit or 16-bit numerical precision. Xiaomi's innovation is to shrink the "expert layers" of its trillion-parameter model, which hold the bulk of its parameters, down to 4-bit precision. This drastically reduces the memory footprint and the bandwidth pressure on the GPUs, directly translating to higher speeds. The typical trade-off with such compression is a degradation in quality, but Xiaomi's surgical approach, compressing only specific expert layers while keeping everything else at full precision, reportedly results in near-zero quality loss. That selectivity is what makes the optimization effective rather than merely aggressive.
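To make the FP4 idea concrete, here is a minimal Python sketch of 4-bit floating-point quantization with one shared scale per block of weights. It assumes the widely used E2M1 layout (one sign bit, two exponent bits, one mantissa bit), whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6. The article does not describe Xiaomi's actual block size, scale format, or rounding rule, so treat `quantize_block`, `dequantize_block`, and `weight_bytes` as illustrative names, not as TileRT's API.

```python
# Magnitudes representable in the E2M1 4-bit float layout.
# Assumption: the article does not say which FP4 variant Xiaomi uses.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(weights):
    """Quantize one block of weights to FP4 values sharing a single scale.

    Returns (scale, codes), where each code is a signed FP4 magnitude.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest weight in the block onto the largest FP4 magnitude.
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        # Round to the nearest representable magnitude.
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if w < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the shared scale and FP4 codes."""
    return [scale * c for c in codes]


def weight_bytes(n_params, bits_per_weight):
    """Raw storage for n_params weights, ignoring per-block scale overhead."""
    return n_params * bits_per_weight / 8


scale, codes = quantize_block([3.0, -1.5, 0.25, 0.7])
print(scale, codes)                    # shared scale and 4-bit codes
print(dequantize_block(scale, codes))  # 0.7 comes back as 0.75: rounding error
```

On paper, a trillion weights take about 2 TB at 16 bits and about 0.5 TB at 4 bits, before scale overhead. Because only the expert layers are compressed, the real saving is smaller than that fourfold ratio, and the article does not say by how much.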
The second technique is DFlash speculative decoding, an advanced iteration of a method gaining traction in AI. Standard speculative decoding uses a smaller, faster "draft" model to predict the next few tokens, which the larger model then verifies in parallel. DFlash takes this a step further by skipping the sequential drafting entirely. It fills a whole block of masked positions in a single forward pass. For coding tasks, the large model accepts an average of 6.3 out of 8 proposed tokens per verification round. Imagine confirming six tokens in one fell swoop instead of just one. This dramatically accelerates the output generation process, especially for tasks with predictable structures like code.
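The verification step described above can be sketched in a few lines of Python. This is a toy model of greedy speculative decoding in general, not DFlash itself: the drafted block is simply passed in as a list, so the single-pass masked drafting is not modelled, and `accept_prefix`, `decode_round`, and `rounds_needed` are hypothetical helper names. It also assumes the large model supplies one more prediction than the draft has tokens, which is the usual arrangement but is not confirmed by the article.

```python
import math


def accept_prefix(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the large model's own choices."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


def decode_round(draft_tokens, target_tokens):
    """Run one verification round under greedy decoding.

    Keeps the matching prefix, then appends the large model's token at the
    first mismatch, so every round emits at least one token.
    """
    n = accept_prefix(draft_tokens, target_tokens)
    emitted = list(draft_tokens[:n])
    if n < len(target_tokens):
        emitted.append(target_tokens[n])
    return emitted


def rounds_needed(total_tokens, tokens_per_round):
    """Large-model passes needed to emit total_tokens (draft cost ignored)."""
    return math.ceil(total_tokens / tokens_per_round)


draft = ["def", "add", "(", "a", ",", "b", ")", "->"]
target = ["def", "add", "(", "a", ",", "b", ")", ":", "\n"]
print(accept_prefix(draft, target))  # 7 of the 8 drafted tokens accepted
print(rounds_needed(1000, 6.3))      # 159 passes, versus 1000 one at a time
```

The second number is where the speedup comes from: at the quoted 6.3 accepted tokens per round, a 1,000-token completion needs about 159 large-model passes instead of 1,000, ignoring the cost of producing each draft.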
TileRT is the glue that binds these innovations. It keeps the entire compute pipeline continuously resident on the GPU, eliminating the per-operator launch overheads and execution gaps that often slow down traditional inference systems. Xiaomi calls this approach "extreme model-system codesign," and the label fits: neither technique alone would deliver such a leap. The synergy is everything.
For the North American tech scene, this development has profound implications. For years, the AI infrastructure narrative has centered on NVIDIA's GPU dominance and the rise of custom silicon aimed at dethroning it. Xiaomi's move suggests a third path: maximizing commodity hardware through software innovation. This could free up significant capital for startups, allowing them to invest more in model development and application layers rather than being shackled by astronomical hardware costs or long lead times for custom chips. My read is that this accelerates the democratization of advanced AI, making it more accessible to a wider array of founders and developers who might not have had a seat at the table before.
What This Means for the Future of AI Applications
The ability to perform inference at 1,000 tokens per second fundamentally redefines what's possible with large language models. Think about the current limitations: a 60-token-per-second model, while impressive for conversational AI, struggles with applications demanding real-time responsiveness. Consider fraud detection, where milliseconds matter; generating trading signals, where market opportunities vanish in an instant; or complex real-time agent loops that need to make rapid, iterative decisions. These are all use cases with hard latency constraints that 60 tokens per second simply cannot meet.
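A quick back-of-the-envelope calculation shows why throughput matters for latency-bound work. The Python sketch below counts only steady-state generation and ignores time to first token, network overhead, and queuing, none of which the article quantifies. The 200 ms budget and 500-token response are hypothetical values chosen for illustration.

```python
def tokens_within_budget(tokens_per_second, budget_ms):
    """Whole tokens that can be generated inside a latency budget."""
    return tokens_per_second * budget_ms // 1000


def generation_seconds(n_tokens, tokens_per_second):
    """Steady-state time to generate n_tokens at a given throughput."""
    return n_tokens / tokens_per_second


BUDGET_MS = 200        # hypothetical real-time budget
RESPONSE_TOKENS = 500  # hypothetical response length

for tps in (60, 1000):
    fit = tokens_within_budget(tps, BUDGET_MS)
    secs = generation_seconds(RESPONSE_TOKENS, tps)
    print(f"{tps} tok/s: {fit} tokens in {BUDGET_MS} ms, "
          f"{RESPONSE_TOKENS} tokens in {secs:.2f} s")
```

Under these assumptions a 60-token-per-second model produces only a dozen tokens inside the budget, too few for most structured decisions, while 1,000 tokens per second leaves room for a short reasoning trace.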
At 1,000 tokens per second, these constraints loosen dramatically. You can run dozens of reasoning paths in parallel, quickly evaluating multiple scenarios or generating comprehensive responses almost instantaneously. This opens the door for a new generation of agentic AI systems that can operate with human-like speed in complex, dynamic environments. For startups developing AI-powered financial tools, cybersecurity solutions, or intelligent automation platforms, this isn't just an incremental improvement; it's an enablement of entirely new product categories that were previously science fiction. The MiMo-V2.5-Pro model itself is a frontier-level offering, matching Claude Opus on most coding benchmarks at a fraction of the cost: roughly $0.43 input / $0.87 output per million tokens, compared to Opus's $5 input / $25 output. Even with UltraSpeed priced at three times the standard MiMo rate, it delivers roughly ten times the output speed, representing a significant economic advantage.
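The pricing comparison is easy to work through with the per-million-token rates quoted above. This Python sketch assumes the threefold UltraSpeed multiplier applies equally to input and output tokens, which the article does not spell out, and it uses a hypothetical workload of 2 million input and 1 million output tokens.

```python
# Prices in USD per million tokens, as quoted in the article.
MIMO_STANDARD = {"input": 0.43, "output": 0.87}
CLAUDE_OPUS = {"input": 5.00, "output": 25.00}

# Assumption: the 3x UltraSpeed rate applies to both input and output.
ULTRASPEED_MULTIPLIER = 3
MIMO_ULTRASPEED = {k: v * ULTRASPEED_MULTIPLIER for k, v in MIMO_STANDARD.items()}


def job_cost(input_tokens, output_tokens, prices):
    """Cost in USD of a job at the given per-million-token prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 1M output tokens.
for name, prices in [
    ("MiMo standard", MIMO_STANDARD),
    ("MiMo UltraSpeed", MIMO_ULTRASPEED),
    ("Claude Opus", CLAUDE_OPUS),
]:
    print(f"{name}: ${job_cost(2_000_000, 1_000_000, prices):.2f}")
```

For this workload the totals come to about $1.73, $5.19, and $35.00, so under these assumptions even the premium UltraSpeed tier stays well below the Opus price.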
This development also strengthens Xiaomi’s position as a serious contender in the global AI race, moving beyond its traditional hardware manufacturing roots to become a key innovator in core AI technology. While many still associate China's tech giants with massive scale and data, this move highlights their increasing prowess in fundamental research and optimization that can rival, and in this case, surpass, Western counterparts in specific metrics.
The strategic release, with an API trial running from June 9-23 and an open-sourced FP4-DFlash checkpoint on Hugging Face, further signals Xiaomi's intent to engage the developer community and accelerate adoption. This aligns with a broader trend observed in the startup ecosystem: the increasing importance of open-source contributions and accessible APIs for driving innovation. By making key parts of their breakthrough available, Xiaomi is not just selling a service; it's fostering an ecosystem around its technology, much like other major players have done.
Ultimately, this isn't just about a faster chatbot. It's about shifting the goalposts for AI development. For too long, the narrative has focused on ever-larger models and ever-more-powerful, specialized hardware. Xiaomi's MiMo-V2.5-Pro-UltraSpeed reminds us that ingenuity in software, combined with a deep understanding of hardware-software co-design, can unlock performance that was once thought to require bespoke, multi-billion-dollar investments. This is a win for efficiency, a win for accessibility, and a clear signal that the next frontier in AI innovation will be defined not just by raw power, but by the cleverness with which we harness it.
Frequently asked questions
What is Xiaomi MiMo-V2.5-Pro-UltraSpeed?
It's a serving mode for Xiaomi's trillion-parameter flagship AI model, MiMo-V2.5-Pro, designed to achieve extremely high inference speeds, reaching over 1,000 tokens per second on standard GPUs. It represents a significant leap in AI model performance.
How much faster is Xiaomi MiMo-V2.5-Pro-UltraSpeed compared to ChatGPT?
Xiaomi MiMo-V2.5-Pro-UltraSpeed is significantly faster. It exceeds 1,000 tokens per second, roughly 15 times the speed of GPT-5.5 (ChatGPT), which runs at around 68 tokens per second. Among the systems compared here, that makes it the leader in inference speed.
What hardware does Xiaomi use to achieve this speed?
Xiaomi achieved this breakthrough on a single 8-GPU commodity node, utilizing standard hardware rather than specialized custom chips like those from Cerebras or Groq. This approach makes high-speed AI more accessible.
What technical innovations enable MiMo-V2.5-Pro-UltraSpeed's speed?
Its speed is primarily due to two techniques: FP4 Quantization, which compresses expert layers to 4-bit precision with minimal quality loss, and DFlash speculative decoding, which accelerates token generation by verifying multiple proposed tokens in one step. These are combined with the TileRT inference engine.
Does UltraSpeed degrade the quality of the MiMo-V2.5-Pro model?
Reportedly not in any meaningful way. UltraSpeed serves the same MiMo-V2.5-Pro model, which matches Claude Opus on most coding benchmarks. FP4 Quantization is applied only to the expert layers, which reportedly results in near-zero quality degradation.
How does this speed impact AI applications?
Faster inference speeds allow for new applications with hard latency constraints, such as parallel reasoning paths, real-time fraud detection, trading signal generation, and more efficient AI agent loops. These applications were previously not feasible at slower inference rates.







