Two research papers landed this month from completely different domains. Google's Speech-to-Retrieval (S2R) processes voice queries. DeepSeek-OCR processes document images.
At first, I assumed they had nothing in common. But the deeper I went, the more obvious the pattern became: both skip the traditional step of converting information into text.
This isn't just an optimization trick. It exposes something fundamental about how we have been training perception systems.
The Invisible Assumption
Most AI systems we use today still depend on one old assumption: "To understand something, we must first turn it into text."
The traditional pipelines look like this:
- Voice search: Speech → ASR (Automatic Speech Recognition) → Text (transcription) → Search → Results
- Document understanding: Image → OCR (Optical Character Recognition) → Text (character extraction) → Processing → Understanding
We built these systems this way because natural language processing dominated AI research for decades. When we needed machines to handle images or audio, we simply added conversion layers. Text became the universal interface because that's where our tools worked.
But humans don't transcribe the world before understanding it. We perceive meaning directly.
Where This Bias Came From
Natural language processing had working solutions early. We could analyze text, extract information, and build search engines. When speech recognition and computer vision matured enough to be useful, the logical approach seemed obvious: convert everything to text, then use our existing text tools.
This created a cascade architecture everywhere:
- Voice assistants transcribe your words, then process the transcript
- OCR systems extract every character, then try to understand the document
- Video analysis systems generate text descriptions, then analyze those descriptions
Each conversion step introduces errors that compound downstream. A misrecognized word corrupts the entire query. A mis-segmented character breaks document parsing. The symbolic text bottleneck loses information that was present in the original signal.
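A toy calculation makes "compound" concrete. The per-stage accuracies below are made-up numbers, not figures from either paper; the point is only that independent per-stage error rates multiply, so even modestly lossy stages erode end-to-end reliability.

```python
# Toy illustration (hypothetical numbers): if each conversion stage in a
# cascade is independently right 95% of the time, end-to-end reliability
# drops multiplicatively with every stage added.
stage_accuracies = [0.95, 0.95, 0.95]  # e.g. ASR, parsing, retrieval

end_to_end = 1.0
for acc in stage_accuracies:
    end_to_end *= acc

print(f"End-to-end accuracy: {end_to_end:.3f}")  # ~0.857
```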
What Changes When Text Disappears
Google S2R: Already Live in Production
Google announced S2R on October 7, 2025. The system is already serving voice search users in multiple languages.
The architecture uses dual encoders:
- Audio encoder: Converts speech waveforms directly into semantic embeddings that capture what the person wants, not what words they said.
- Document encoder: Generates corresponding vectors for documents in the same semantic space.
During training, both encoders learn simultaneously. The goal is simple: bring query vectors close to the vectors of relevant documents in this shared space. When you speak a query, the audio encoder produces a vector, the system matches it against indexed document vectors, and it retrieves what you need.
The ASR transcription step simply doesn't exist.
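A minimal sketch of the dual-encoder idea follows. The modules, dimensions, and in-batch contrastive loss below are placeholders I chose to illustrate the mechanism, not Google's actual architecture or training objective; the point is that both towers map into one space and learn jointly so that a spoken query lands near the documents it should retrieve.

```python
import torch
import torch.nn.functional as F

class AudioEncoder(torch.nn.Module):
    """Stand-in for the speech tower: pooled audio features -> embedding."""
    def __init__(self, in_dim=80, emb_dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, emb_dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class DocEncoder(torch.nn.Module):
    """Stand-in for the document tower: document features -> embedding."""
    def __init__(self, in_dim=768, emb_dim=256):
        super().__init__()
        self.net = torch.nn.Linear(in_dim, emb_dim)
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

audio_enc, doc_enc = AudioEncoder(), DocEncoder()
audio_feats = torch.randn(8, 80)     # a batch of 8 (query, document) pairs
doc_feats = torch.randn(8, 768)

q = audio_enc(audio_feats)           # (8, 256) query embeddings
d = doc_enc(doc_feats)               # (8, 256) document embeddings

# In-batch contrastive objective: each query should score highest against
# its own paired document (the diagonal of the similarity matrix).
sim = q @ d.T / 0.05                 # temperature-scaled cosine similarities
labels = torch.arange(q.size(0))
loss = F.cross_entropy(sim, labels)
loss.backward()                      # both towers learn jointly

# At serving time: embed the spoken query, take the nearest document vectors.
print(sim.argmax(dim=-1))            # index of best-matching document per query
```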
The performance metrics surprised even the researchers
Google released the Simple Voice Questions (SVQ) dataset with 17 languages to enable reproducible testing. They compared three approaches:
- Traditional cascade: Speech → ASR → Text → Retrieval
- Theoretical upper bound: Speech → Perfect human transcription → Retrieval
- S2R: Speech → Audio embedding → Retrieval
S2R significantly outperforms the baseline cascade and approaches the theoretical upper bound set by perfect human transcription. A more striking result: lower transcription error rates don't reliably correlate with better retrieval across languages.
This shows that transcription accuracy alone doesn't guarantee good search results. The specific nature and location of errors matter more than how many there are.
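The decoupling is easy to see once the two quantities are measured separately: word error rate (WER) scores the transcript, while a retrieval metric such as mean reciprocal rank (MRR) scores what the user actually got back. The per-language numbers below are invented for illustration, not the SVQ results.

```python
# Hypothetical per-language numbers (illustrative only, not SVQ figures):
# a language can have a lower word error rate yet weaker retrieval quality.
langs = {
    # language: (word_error_rate, mean_reciprocal_rank)
    "lang_a": (0.08, 0.71),
    "lang_b": (0.12, 0.78),   # higher WER, but better retrieval
    "lang_c": (0.05, 0.69),   # lowest WER, weakest retrieval
}

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the first relevant document per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: first relevant result appears at ranks 1, 2, and 4 for three queries.
print(mean_reciprocal_rank([1, 2, 4]))  # 0.583...

for lang, (wer, mrr) in langs.items():
    print(f"{lang}: WER={wer:.2f}  MRR={mrr:.2f}")
```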
DeepSeek-OCR: 10× Compression at 97% Precision
Released October 21, 2025, DeepSeek-OCR reframes optical character recognition entirely. Instead of extracting text character by character, it treats the document image as a highly efficient compression medium for the text it contains.
The architecture has two main components:
- DeepEncoder (vision encoder): A serial design. A local-perception stage operates on small image patches, a convolutional compressor reduces thousands of patch tokens down to a few hundred, and a CLIP-large stage provides global understanding of layout and structure.
- DeepSeek-3B-MoE decoder: A mixture-of-experts language model that generates text output directly from the compressed vision tokens.
The key innovation: compress before applying expensive global attention. This keeps memory requirements minimal even at high resolution.
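The ordering is the interesting part, so here is a rough sketch of it. The shapes and layers are illustrative stand-ins, not DeepSeek-OCR's exact configuration (the real local stage uses windowed attention, for example): cheap local processing runs on the full patch grid, a convolutional compressor shrinks the token count, and only then does global attention see the sequence.

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 256, 64, 64                    # 64x64 = 4096 patch tokens
patch_grid = torch.randn(B, C, H, W)

# 1) Local perception: small receptive field over neighbouring patches
#    (a 3x3 conv stands in for the real model's windowed attention).
local = nn.Conv2d(C, C, kernel_size=3, padding=1)
x = local(patch_grid)

# 2) Convolutional compressor: 16x fewer tokens (4096 -> 256 here).
compressor = nn.Conv2d(C, C, kernel_size=4, stride=4)
z = compressor(x)                               # (B, C, 16, 16)

# 3) Global attention over the *compressed* sequence only.
tokens = z.flatten(2).transpose(1, 2)           # (B, 256 tokens, C)
global_attn = nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)
vision_tokens = global_attn(tokens)

# A language-model decoder (a mixture-of-experts LLM in DeepSeek-OCR) would
# then generate text conditioned on these few hundred vision tokens.
print(H * W, "patch tokens ->", vision_tokens.shape[1], "vision tokens")
```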
The compression performance is remarkable
Testing on 100 pages with 600-1,300 text tokens each:
- At 10× compression (100 vision tokens representing 1,000 text tokens): 97% precision
- At 7× compression: 98% precision
- At 20× compression: Still maintains 60% precision
Traditional OCR approaches scale linearly with document length. A 10-page document needs 10× the processing of a 1-page document. DeepSeek-OCR processes each page as a single compressed image, often using a roughly constant number of vision tokens regardless of text density.
On real-world benchmarks, DeepSeek-OCR's "small" mode using just 100 tokens per page outperforms systems using 6,000+ tokens. The "base" mode with 256 tokens achieves even better accuracy. A single A100 GPU can process 200,000+ pages per day.
The Core Similarity: Semantic Instead of Symbolic
Both systems make the same fundamental choice: they refuse to use text as an intermediate representation.
What they eliminated
What S2R eliminates:
- The ASR transcription module
- Text query intermediate representation
- Speech-to-text conversion step
- Error propagation from transcription mistakes
What DeepSeek-OCR eliminates:
- Character-by-character recognition
- Text tokenization of full document content
- Linear scaling with document length
- Layout information loss during text extraction
Gains
Error reduction: Eliminating intermediate symbolic steps removes points where mistakes compound. S2R doesn't suffer from transcription errors affecting retrieval. DeepSeek-OCR avoids character recognition failures cascading through document understanding.
Information preservation: Acoustic prosody, tone, and emphasis remain encoded in S2R's audio embeddings. Spatial relationships, typography, layout, and visual hierarchy remain encoded in DeepSeek-OCR's vision tokens. Text conversion loses these cues irreversibly.
Efficiency gains: S2R eliminates the transcription step entirely, reducing latency. DeepSeek-OCR achieves 7-20× token reduction. A document requiring 10,000 text tokens can be represented with just 500-1,000 vision tokens, with proportional reductions in computational cost.
Task-specific optimization: Both models train end-to-end for their actual objectives (retrieval quality, document understanding) rather than intermediate proxy tasks (transcription accuracy, character recognition). The entire system optimizes for what matters.
Concrete Examples
Voice Search Gone Wrong
Traditional approach:
- You say: "The Scream painting"
- ASR hears: "The Screen painting"
- Search retrieves: Results about painted screens, screen printing, display panels
- You get: Completely wrong results
The error happens at transcription. Everything downstream inherits that mistake.
S2R approach:
- You say: "The Scream painting"
- Audio encoder captures: Semantic intent (famous artwork query) + acoustic patterns
- Search retrieves: Edvard Munch's "The Scream" and related artworks
- You get: Correct results despite potential transcription ambiguity
The system optimizes for meaning, not words. It learns that certain acoustic patterns correspond to specific information needs, even when the exact words are ambiguous.
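A toy retrieval over made-up embeddings shows the behaviour. The three-dimensional vectors below are invented for illustration and not produced by any real encoder; the point is that a query embedded near "famous artwork" documents wins on similarity even when a transcript would have read "screen".

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document embeddings in a shared semantic space.
docs = {
    "Edvard Munch - The Scream": np.array([0.9, 0.1, 0.0]),
    "Screen printing tutorial":  np.array([0.1, 0.9, 0.1]),
    "Best 4K display panels":    np.array([0.0, 0.2, 0.9]),
}

# Embedding of the spoken query "The Scream painting": a trained audio
# encoder places it near the documents users actually wanted.
query = np.array([0.85, 0.15, 0.05])

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])   # "Edvard Munch - The Scream"
```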
Document Processing Bottleneck
Traditional OCR approach:
- Input: 500-page technical manual with tables, diagrams, and formulas
- OCR extracts: 500,000 text tokens + attempts to parse structure
- LLM receives: Massive token stream, often hitting context limits
- Processing cost: Very high, requires chunking strategies
- Result: Layout information lost, formulas often corrupted
DeepSeek-OCR approach:
- Input: Same 500-page manual
- Vision encoder generates: 32,000-128,000 vision tokens (depending on mode)
- LLM receives: Compressed semantic representation with preserved layout
- Processing cost: 4-15× lower
- Result: Maintains spatial relationships, handles tables and formulas natively
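Back-of-the-envelope arithmetic makes the gap concrete. The per-page counts below are assumptions consistent with the figures quoted in this post (roughly 1,000 text tokens per dense page; 100 or 256 vision tokens per page depending on mode), not measurements.

```python
pages = 500
text_tokens_per_page = 1000                    # dense OCR output
vision_modes = {"small": 100, "base": 256}     # vision tokens per page

text_total = pages * text_tokens_per_page      # 500,000 text tokens
for mode, per_page in vision_modes.items():
    vision_total = pages * per_page
    ratio = text_total / vision_total
    print(f"{mode}: {vision_total:,} vision tokens vs "
          f"{text_total:,} text tokens ({ratio:.1f}x reduction)")
# small: 50,000 vs 500,000 (10.0x reduction)
# base: 128,000 vs 500,000 (3.9x reduction)
```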
The difference isn't incremental. It changes what becomes economically feasible at scale.
The Real Question
If text-as-interface was inherited bias rather than technical necessity, what other "required" steps in our architectures are just legacy thinking from previous paradigm constraints?
Worth auditing our own pipelines with that lens.
Where This Leads
Both systems are already deployed or production-ready. S2R serves billions of Google voice searches. DeepSeek-OCR is open-sourced under the MIT license and already processing millions of pages daily.
The viability is proven. Text remains essential for human communication and many AI tasks. But it's no longer the mandatory intermediate representation for all AI processing.
The paradigm shift: Moving from "perception → symbolic text → understanding" to "perception → semantic embedding → understanding" eliminates bottlenecks we didn't realize we had accepted.
When systems can pass understanding directly rather than through written symbols, AI begins to operate on its own semantic bandwidth. That may be the true beginning of multimodal intelligence.
Research provenance and verification
Google S2R official sources
- Primary publication: Variani, E., & Riley, M. (2025, October 7). Speech-to-Retrieval (S2R): A new approach to voice search. Google Research Blog.
- Open-source dataset: Simple Voice Questions (SVQ) dataset, 17 languages, 26 locales, CC-BY-4.0 license.
- Benchmark framework: Massive Sound Embedding Benchmark (MSEB).
- Status: Live in production as of October 2025, serving multiple languages globally in Google Voice Search.
DeepSeek-OCR official sources
- Academic paper: Wei, H., Sun, Y., & Li, Y. (2025). DeepSeek-OCR: Contexts Optical Compression. arXiv preprint arXiv:2510.18234.
- GitHub repository: Full training and inference code (4,000+ stars within 24 hours).
- Model weights: Hugging Face model card with complete model (6.67 GB), MIT license (100,000+ downloads in the first week).