Two research papers landed this month from completely different domains. Google's Speech-to-Retrieval (S2R) processes voice queries. DeepSeek-OCR processes document images.
At first, I assumed they had nothing in common. But the deeper I went, the more obvious the pattern became: both skip the traditional step of converting information into text.
This isn't just an optimization trick. It exposes something fundamental about how we have been training perception systems.
The Invisible Assumption
Most AI systems we use today still depend on one old assumption: "To understand something, we must first turn it into text."
The traditional pipelines look like this:
- Voice search: Speech → ASR (Automatic Speech Recognition) → Text (transcription) → Search → Results
- Document understanding: Image → OCR (Optical Character Recognition) → Text (character extraction) → Processing → Understanding
We built these systems this way because natural language processing dominated AI research for decades. When we needed machines to handle images or audio, we simply added conversion layers. Text became the universal interface because that's where our tools worked.
But humans don't transcribe the world before understanding it. We perceive meaning directly.
Where This Bias Came From
Natural language processing had working solutions early. We could analyze text, extract information, and build search engines. When speech recognition and computer vision matured enough to be useful, the logical approach seemed obvious: convert everything to text, then use our existing text tools.
This created a cascade architecture everywhere:
- Voice assistants transcribe your words, then process the transcript
- OCR systems extract every character, then try to understand the document
- Video analysis systems generate text descriptions, then analyze those descriptions
Each conversion step introduces errors that compound downstream. A misrecognized word corrupts the entire query. A mis-segmented character breaks document parsing. The symbolic text bottleneck loses information that was present in the original signal.
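A toy calculation makes "compound" concrete. The per-stage accuracies below are made-up numbers, not figures from either paper; the point is only that independent per-stage error rates multiply, so even modestly lossy stages erode end-to-end reliability.

```python
# Toy illustration (hypothetical numbers): if each conversion stage in a
# cascade is independently right 95% of the time, end-to-end reliability
# drops multiplicatively with every stage added.
stage_accuracies = [0.95, 0.95, 0.95]  # e.g. ASR, parsing, retrieval

end_to_end = 1.0
for acc in stage_accuracies:
    end_to_end *= acc

print(f"End-to-end accuracy: {end_to_end:.3f}")  # ~0.857
```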
What Changes When Text Disappears
Google S2R: Already Live in Production
Google announced S2R on October 7, 2025. The system is already serving voice search users in multiple languages.
The architecture uses dual encoders:
- Audio encoder: Converts speech waveforms directly into semantic embeddings that capture what the person wants, not what words they said.
- Document encoder: Generates corresponding vectors for documents in the same semantic space.
During training, both encoders learn simultaneously. The goal is simple: bring query vectors close to the vectors of relevant documents in this shared space. When you speak a query, the audio encoder produces a vector, the system matches it against indexed document vectors, and it retrieves what you need.
The ASR transcription step simply doesn't exist.
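A minimal sketch of the dual-encoder idea follows. The modules, dimensions, and in-batch contrastive loss below are placeholders I chose to illustrate the mechanism, not Google's actual architecture or training objective; the point is that both towers map into one space and learn jointly so that a spoken query lands near the documents it should retrieve.

```python
import torch
import torch.nn.functional as F

class AudioEncoder(torch.nn.Module):
    """Stand-in for the speech tower: pooled audio features -> embedding."""
    def __init__(self, in_dim=80, emb_dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, emb_dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class DocEncoder(torch.nn.Module):
    """Stand-in for the document tower: document features -> embedding."""
    def __init__(self, in_dim=768, emb_dim=256):
        super().__init__()
        self.net = torch.nn.Linear(in_dim, emb_dim)
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

audio_enc, doc_enc = AudioEncoder(), DocEncoder()
audio_feats = torch.randn(8, 80)     # a batch of 8 (query, document) pairs
doc_feats = torch.randn(8, 768)

q = audio_enc(audio_feats)           # (8, 256) query embeddings
d = doc_enc(doc_feats)               # (8, 256) document embeddings

# In-batch contrastive objective: each query should score highest against
# its own paired document (the diagonal of the similarity matrix).
sim = q @ d.T / 0.05                 # temperature-scaled cosine similarities
labels = torch.arange(q.size(0))
loss = F.cross_entropy(sim, labels)
loss.backward()                      # both towers learn jointly

# At serving time: embed the spoken query, take the nearest document vectors.
print(sim.argmax(dim=-1))            # index of best-matching document per query
```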
The performance metrics surprised even the researchers
Google released the Simple Voice Questions (SVQ) dataset with 17 languages to enable reproducible testing. They compared three approaches:
- Traditional cascade: Speech → ASR → Text → Retrieval
- Theoretical upper bound: Speech → Perfect human transcription → Retrieval
- S2R: Speech → Audio embedding → Retrieval
S2R significantly outperforms the baseline cascade and approaches the theoretical upper bound set by perfect human transcription. A more striking result: lower transcription error rates don't reliably correlate with better retrieval across languages.
This shows that transcription accuracy alone doesn't guarantee good search results. The specific nature and location of errors matter more than how many there are.
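The decoupling is easy to see once the two quantities are measured separately: word error rate (WER) scores the transcript, while a retrieval metric such as mean reciprocal rank (MRR) scores what the user actually got back. The per-language numbers below are invented for illustration, not the SVQ results.

```python
# Hypothetical per-language numbers (illustrative only, not SVQ figures):
# a language can have a lower word error rate yet weaker retrieval quality.
langs = {
    # language: (word_error_rate, mean_reciprocal_rank)
    "lang_a": (0.08, 0.71),
    "lang_b": (0.12, 0.78),   # higher WER, but better retrieval
    "lang_c": (0.05, 0.69),   # lowest WER, weakest retrieval
}

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the first relevant document per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: first relevant result appears at ranks 1, 2, and 4 for three queries.
print(mean_reciprocal_rank([1, 2, 4]))  # 0.583...

for lang, (wer, mrr) in langs.items():
    print(f"{lang}: WER={wer:.2f}  MRR={mrr:.2f}")
```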
DeepSeek-OCR: 10× Compression at 97% Precision
Released October 21, 2025, DeepSeek-OCR reframes optical character recognition entirely. Instead of extracting text character by character, it treats the document image as a highly efficient compression medium for the text it contains.
The architecture has two main components:
- DeepEncoder (vision encoder): A serial design. A local-perception stage operates on small image patches, a convolutional compressor reduces thousands of patch tokens down to a few hundred, and a CLIP-large stage provides global understanding of layout and structure.
- DeepSeek-3B-MoE decoder: A mixture-of-experts language model that generates text output directly from the compressed vision tokens.
The key innovation: compress before applying expensive global attention. This keeps memory requirements minimal even at high resolution.
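The ordering is the interesting part, so here is a rough sketch of it. The shapes and layers are illustrative stand-ins, not DeepSeek-OCR's exact configuration (the real local stage uses windowed attention, for example): cheap local processing runs on the full patch grid, a convolutional compressor shrinks the token count, and only then does global attention see the sequence.

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 256, 64, 64                    # 64x64 = 4096 patch tokens
patch_grid = torch.randn(B, C, H, W)

# 1) Local perception: small receptive field over neighbouring patches
#    (a 3x3 conv stands in for the real model's windowed attention).
local = nn.Conv2d(C, C, kernel_size=3, padding=1)
x = local(patch_grid)

# 2) Convolutional compressor: 16x fewer tokens (4096 -> 256 here).
compressor = nn.Conv2d(C, C, kernel_size=4, stride=4)
z = compressor(x)                               # (B, C, 16, 16)

# 3) Global attention over the *compressed* sequence only.
tokens = z.flatten(2).transpose(1, 2)           # (B, 256 tokens, C)
global_attn = nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)
vision_tokens = global_attn(tokens)

# A language-model decoder (a mixture-of-experts LLM in DeepSeek-OCR) would
# then generate text conditioned on these few hundred vision tokens.
print(H * W, "patch tokens ->", vision_tokens.shape[1], "vision tokens")
```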
The compression performance is remarkable
Testing on 100 pages with 600-1,300 text tokens each:
- At 10× compression (100 vision tokens representing 1,000 text tokens): 97% precision
- At 7× compression: 98% precision
- At 20× compression: Still maintains 60% precision
Traditional OCR approaches scale linearly with document length. A 10-page document needs 10× the processing of a 1-page document. DeepSeek-OCR processes each page as a single compressed image, often using a roughly constant number of vision tokens regardless of text density.
On real-world benchmarks, DeepSeek-OCR's "small" mode using just 100 tokens per page outperforms systems using 6,000+ tokens. The "base" mode with 256 tokens achieves even better accuracy. A single A100 GPU can process 200,000+ pages per day.
The Core Similarity: Semantic Instead of Symbolic
Both systems make the same fundamental choice: they refuse to use text as an intermediate representation.
What they eliminated
What S2R eliminates:
- The ASR transcription module
- Text query intermediate representation
- Speech-to-text conversion step
- Error propagation from transcription mistakes
What DeepSeek-OCR eliminates:
- Character-by-character recognition
- Text tokenization of full document content
- Linear scaling with document length
- Layout information loss during text extraction
Gains
Error reduction: Eliminating intermediate symbolic steps removes points where mistakes compound. S2R doesn't suffer from transcription errors affecting retrieval. DeepSeek-OCR avoids character recognition failures cascading through document understanding.
Information preservation: Acoustic prosody, tone, and emphasis remain encoded in S2R's audio embeddings. Spatial relationships, typography, layout, and visual hierarchy remain encoded in DeepSeek-OCR's vision tokens. Text conversion loses these cues irreversibly.
Efficiency gains: S2R eliminates the transcription step entirely, reducing latency. DeepSeek-OCR achieves 7-20× token reduction. A document requiring 10,000 text tokens can be represented with just 500-1,000 vision tokens, with proportional reductions in computational cost.
Task-specific optimization: Both models train end-to-end for their actual objectives (retrieval quality, document understanding) rather than intermediate proxy tasks (transcription accuracy, character recognition). The entire system optimizes for what matters.
Concrete Examples
Voice Search Gone Wrong
Traditional approach:
- You say: "The Scream painting"
- ASR hears: "The Screen painting"
- Search retrieves: Results about painted screens, screen printing, display panels
- You get: Completely wrong results
The error happens at transcription. Everything downstream inherits that mistake.
S2R approach:
- You say: "The Scream painting"
- Audio encoder captures: Semantic intent (famous artwork query) + acoustic patterns
- Search retrieves: Edvard Munch's "The Scream" and related artworks
- You get: Correct results despite potential transcription ambiguity
The system optimizes for meaning, not words. It learns that certain acoustic patterns correspond to specific information needs, even when the exact words are ambiguous.
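A toy retrieval over made-up embeddings shows the behaviour. The three-dimensional vectors below are invented for illustration and not produced by any real encoder; the point is that a query embedded near "famous artwork" documents wins on similarity even when a transcript would have read "screen".

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document embeddings in a shared semantic space.
docs = {
    "Edvard Munch - The Scream": np.array([0.9, 0.1, 0.0]),
    "Screen printing tutorial":  np.array([0.1, 0.9, 0.1]),
    "Best 4K display panels":    np.array([0.0, 0.2, 0.9]),
}

# Embedding of the spoken query "The Scream painting": a trained audio
# encoder places it near the documents users actually wanted.
query = np.array([0.85, 0.15, 0.05])

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])   # "Edvard Munch - The Scream"
```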
Document Processing Bottleneck
Traditional OCR approach:
- Input: 500-page technical manual with tables, diagrams, and formulas
- OCR extracts: 500,000 text tokens + attempts to parse structure
- LLM receives: Massive token stream, often hitting context limits
- Processing cost: Very high, requires chunking strategies
- Result: Layout information lost, formulas often corrupted
DeepSeek-OCR approach:
- Input: Same 500-page manual
- Vision encoder generates: 32,000-128,000 vision tokens (depending on mode)
- LLM receives: Compressed semantic representation with preserved layout
- Processing cost: 4-15× lower
- Result: Maintains spatial relationships, handles tables and formulas natively
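Back-of-the-envelope arithmetic makes the gap concrete. The per-page counts below are assumptions consistent with the figures quoted in this post (roughly 1,000 text tokens per dense page; 100 or 256 vision tokens per page depending on mode), not measurements.

```python
pages = 500
text_tokens_per_page = 1000                    # dense OCR output
vision_modes = {"small": 100, "base": 256}     # vision tokens per page

text_total = pages * text_tokens_per_page      # 500,000 text tokens
for mode, per_page in vision_modes.items():
    vision_total = pages * per_page
    ratio = text_total / vision_total
    print(f"{mode}: {vision_total:,} vision tokens vs "
          f"{text_total:,} text tokens ({ratio:.1f}x reduction)")
# small: 50,000 vs 500,000 (10.0x reduction)
# base: 128,000 vs 500,000 (3.9x reduction)
```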
The difference isn't incremental. It changes what becomes economically feasible at scale.
The Real Question
If text-as-interface was inherited bias rather than technical necessity, what other "required" steps in our architectures are just legacy thinking from previous paradigm constraints?
Worth auditing our own pipelines with that lens.
Where This Leads
Both systems are already deployed or production-ready. S2R serves billions of Google voice searches. DeepSeek-OCR is open-sourced under the MIT license and already processing millions of pages daily.
The viability is proven. Text remains essential for human communication and many AI tasks. But it's no longer the mandatory intermediate representation for all AI processing.
The paradigm shift: Moving from "perception → symbolic text → understanding" to "perception → semantic embedding → understanding" eliminates bottlenecks we didn't realize we had accepted.
When systems can pass understanding directly rather than through written symbols, AI begins to operate on its own semantic bandwidth. That may be the true beginning of multimodal intelligence.
Research provenance and verification
Google S2R official sources
- Primary publication: Variani, E., & Riley, M. (2025, October 7). Speech-to-Retrieval (S2R): A new approach to voice search. Google Research Blog.
- Open-source dataset: Simple Voice Questions (SVQ) dataset, 17 languages, 26 locales, CC-BY-4.0 license.
- Benchmark framework: Massive Sound Embedding Benchmark (MSEB).
- Status: Live in production as of October 2025, serving multiple languages globally in Google Voice Search.
DeepSeek-OCR official sources
- Academic paper: Wei, H., Sun, Y., & Li, Y. (2025). DeepSeek-OCR: Contexts Optical Compression. arXiv preprint arXiv:2510.18234.
- GitHub repository: Full training and inference code (4,000+ stars within 24 hours).
- Model weights: Hugging Face model card with complete model (6.67 GB), MIT license (100,000+ downloads in the first week).