Published
November 2, 2025
Methods and Diagnostics
Two research papers from Google and DeepSeek landed in October from completely different domains. One processes speech, the other processes documents. Neither bothers converting anything to text first. This exposes something fundamental about how we have been training perception systems for decades.
Reading time
6 min read
Author
Fadi Labib

Two research papers landed this month from completely different domains. Google's Speech-to-Retrieval (S2R) processes voice queries. DeepSeek-OCR processes document images.
At first, I assumed they had nothing in common. But the deeper I dug, the more obvious the pattern became: both skip the traditional step of converting information into text.
This isn't just an optimization trick. It exposes something fundamental about how we have been training perception systems.
Most AI systems we use today still depend on one old assumption: "To understand something, we must first turn it into text."
The traditional pipelines look like this:
Voice: speech → ASR transcription → text query → text search.
Documents: page image → OCR → extracted text → language processing.
We built these systems this way because natural language processing dominated AI research for decades. When we needed machines to handle images or audio, we simply added conversion layers. Text became the universal interface because that's where our tools worked.
But humans don't transcribe the world before understanding it. We perceive meaning directly.
Natural language processing had working solutions early. We could analyze text, extract information, and build search engines. When speech recognition and computer vision matured enough to be useful, the logical approach seemed obvious: convert everything to text, then use our existing text tools.
This created cascade architectures everywhere: every modality gets funneled through text before anything downstream sees the signal.
Each conversion step introduces errors that compound downstream. A misrecognized word corrupts the entire query. A mis-segmented character breaks document parsing. The symbolic text bottleneck loses information that was present in the original signal.
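The compounding effect can be made concrete with a back-of-envelope sketch. The stage names and error rates below are hypothetical, not measured values from either paper; the point is only that independent per-stage failure rates multiply.

```python
# Illustration: how independent per-stage error rates compound in a
# cascade pipeline. The rates here are hypothetical.
def cascade_success(stage_error_rates):
    """Probability that a query survives every stage unscathed,
    assuming stage errors are independent."""
    p = 1.0
    for err in stage_error_rates:
        p *= (1.0 - err)
    return p

# e.g. 5% ASR damage, 3% query-parsing damage, 2% retrieval loss
p_ok = cascade_success([0.05, 0.03, 0.02])
print(f"end-to-end success: {p_ok:.3f}")  # end-to-end success: 0.903
```

Even modest per-stage error rates leave nearly one query in ten damaged by the time results come back, which is the cost an end-to-end system avoids.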
Google announced S2R on October 7, 2025. The system is already serving voice search users in multiple languages.
The architecture uses dual encoders:
Audio encoder: Converts speech waveforms directly into semantic embeddings that capture what the person wants, not what words they said.
Document encoder: Generates corresponding vectors for documents in the same semantic space.
During training, both encoders learn simultaneously. The goal is simple: make query vectors sit close to relevant document vectors in the shared space. When you speak a query, the audio encoder generates a vector, the system searches it against the indexed document vectors, and the best matches come back.
The ASR transcription step simply doesn't exist.
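Retrieval in a shared embedding space reduces to nearest-neighbour search. The vectors below are made-up stand-ins for what the trained audio and document encoders would produce; real S2R embeddings are learned, high-dimensional, and trained jointly.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings: in S2R, the audio and document encoders are
# trained so that a spoken query lands near its relevant documents.
query_vec = np.array([0.9, 0.1, 0.3])              # from the audio encoder
doc_vecs = {
    "relevant_doc":  np.array([0.8, 0.2, 0.4]),    # from the document encoder
    "unrelated_doc": np.array([-0.5, 0.9, 0.1]),
}

# Retrieval = nearest neighbour in the shared embedding space.
best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
print(best)  # relevant_doc
```

No transcript exists at any point: the comparison happens between vectors, so a word the ASR would have misheard never gets the chance to corrupt the query.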
Google released the Simple Voice Questions (SVQ) dataset, covering 17 languages, to enable reproducible testing. They compared three approaches: a cascade baseline (ASR transcription followed by text retrieval), S2R itself, and a cascade fed perfect human transcriptions as an upper bound.
S2R significantly outperforms the cascade baseline and approaches the theoretical upper bound set by perfect transcription. More striking: lower transcription error rates don't reliably correlate with better retrieval across languages.
This shows that transcription accuracy alone doesn't guarantee good search results. The specific nature and location of errors matter more than their count.
Released October 21, 2025, DeepSeek-OCR reframes optical character recognition entirely. Instead of extracting text character by character, it treats the document image as a highly efficient compression medium for text.
The architecture has two main components:
DeepEncoder: a vision encoder whose key innovation is to compress the patch tokens before applying expensive global attention. This keeps memory requirements minimal even at high resolution.
DeepSeek-3B-MoE decoder: a mixture-of-experts language model that generates text output directly from the compressed vision tokens.
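The compress-before-attention idea can be sketched numerically. The average pooling below is a crude stand-in for DeepSeek-OCR's convolutional compressor, not the actual architecture; the 16× factor matches the paper's token downsampling, and the payoff follows from attention's quadratic cost in token count.

```python
import numpy as np

def attention_flops(n_tokens, d):
    # Global self-attention cost is dominated by the n^2 * d score matrix.
    return n_tokens ** 2 * d

def compress(tokens, factor):
    """Average-pool groups of tokens: a crude stand-in for the
    convolutional compressor that runs before global attention."""
    n, d = tokens.shape
    n_out = n // factor
    return tokens[: n_out * factor].reshape(n_out, factor, d).mean(axis=1)

d = 64
patches = np.random.rand(4096, d)       # high-resolution patch tokens
compressed = compress(patches, 16)      # 16x fewer tokens
print(compressed.shape)                                     # (256, 64)
print(attention_flops(4096, d) / attention_flops(256, d))   # 256.0
```

Cutting tokens 16× before global attention cuts the attention cost by 256×, which is why high-resolution pages stay affordable.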
Testing on 100 pages containing 600-1,300 text tokens each showed that below roughly 10× compression, decoding precision stays around 97%; pushed to about 20× compression, precision falls to roughly 60%.
Traditional OCR approaches scale linearly with document length. A 10-page document needs 10× the processing of a 1-page document. DeepSeek-OCR processes the entire page as a single compressed image, often using constant tokens regardless of text density.
On real-world benchmarks, DeepSeek-OCR's "small" mode using just 100 tokens per page outperforms systems using 6,000+ tokens. The "base" mode with 256 tokens achieves even better accuracy. A single A100 GPU can process 200,000+ pages per day.
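Using the figures quoted above, a quick calculation shows what the smaller token budget buys. The 128k context window is an assumed, illustrative size, not a parameter from the paper.

```python
# Back-of-envelope token budgets from the figures quoted above.
page_text_tokens = 6000   # tokens per page for a text-token baseline
small_mode = 100          # DeepSeek-OCR "small" vision tokens per page
base_mode = 256           # DeepSeek-OCR "base" vision tokens per page

context_window = 128_000  # hypothetical LLM context size

for name, per_page in [("text", page_text_tokens),
                       ("small", small_mode),
                       ("base", base_mode)]:
    print(f"{name:>5}: {context_window // per_page} pages per context window")
```

A context window that fits about 21 pages of raw text tokens fits hundreds of pages of compressed vision tokens, which is what makes whole-corpus processing economically feasible.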
Both systems make the same fundamental choice: they refuse to use text as an intermediate representation.
What S2R eliminates: the ASR transcription step, and with it the risk of a misrecognized word corrupting the query.
What DeepSeek-OCR eliminates: character-by-character text extraction, and with it the segmentation and recognition errors that break document parsing.
Error reduction: Eliminating intermediate symbolic steps removes points where mistakes compound. S2R doesn't suffer from transcription errors affecting retrieval. DeepSeek-OCR avoids character recognition failures cascading through document understanding.
Information preservation: Acoustic prosody, tone, and emphasis remain encoded in S2R's audio embeddings. Spatial relationships, typography, layout, and visual hierarchy remain encoded in DeepSeek-OCR's vision tokens. Text conversion loses these cues irreversibly.
Efficiency gains: S2R eliminates the transcription step entirely, reducing latency. DeepSeek-OCR achieves 7-20× token reduction. A document requiring 10,000 text tokens can be represented with just 500-1,000 vision tokens, with proportional reductions in computational cost.
Task-specific optimization: Both models train end-to-end for their actual objectives (retrieval quality, document understanding) rather than intermediate proxy tasks (transcription accuracy, character recognition). The entire system optimizes for what matters.
Traditional approach: audio → ASR transcription → text query → text retrieval → results.
The error happens at transcription. A misheard word becomes a wrong query, and everything downstream inherits that mistake.
S2R approach: audio → audio encoder → semantic embedding → vector retrieval → results.
The system optimizes for meaning, not words. It learns that certain acoustic patterns correspond to specific information needs, even when the exact words are ambiguous.
Traditional OCR approach: page image → character segmentation → character recognition → extracted text → language processing.
DeepSeek-OCR approach: page image → DeepEncoder → compressed vision tokens → MoE decoder → structured output.
The difference isn't incremental. It changes what becomes economically feasible at scale.
If text-as-interface was inherited bias rather than technical necessity, what other "required" steps in our architectures are just legacy thinking from previous paradigm constraints?
Worth auditing our own pipelines with that lens.
Both systems are already deployed or production-ready. S2R serves billions of Google voice searches. DeepSeek-OCR is open-sourced under MIT license and processing millions of pages daily.
The viability is proven. Text remains essential for human communication and many AI tasks. But it's no longer the mandatory intermediate representation for all AI processing.
The paradigm shift: Moving from "perception → symbolic text → understanding" to "perception → semantic embedding → understanding" eliminates bottlenecks we didn't realize we had accepted.
When systems can pass understanding directly rather than through written symbols, AI begins to operate on its own semantic bandwidth. That may be the true beginning of multimodal intelligence.
Primary publication: Variani, E., & Riley, M. (2025, October 7). Speech-to-Retrieval (S2R): A new approach to voice search. Google Research Blog.
Open-source dataset: Simple Voice Questions (SVQ) Dataset, 17 languages, 26 locales, CC-BY-4.0 license.
Benchmark framework: Massive Sound Embedding Benchmark (MSEB).
Status: Live in production as of October 2025, serving multiple languages globally in Google Voice Search.
Academic paper: Wei, H., Sun, Y., & Li, Y. (2025). DeepSeek-OCR: Contexts Optical Compression. arXiv preprint arXiv:2510.18234.
GitHub repository: Full training and inference code (4,000+ stars within 24 hours)
Model weights: Hugging Face model card with complete model (6.67 GB), MIT license (100,000+ downloads first week)