Testing DeepSeek-OCR: Vision Text Compression for LLMs

·466 words·3 mins

One of my recurring use cases is running OCR on an ever-growing collection of screenshots and then using the extracted metadata to organize them. My system isn’t great, but it works well enough, and its quality depends heavily on the OCR model I use. Traditional vision models often perform poorly on odd layouts or compressed text regions while also being computationally intensive, and classical OCR tools like Tesseract tend to struggle on screenshots with unusual webpage formatting. So I’ve been searching for a solid local solution: something that can process screenshots efficiently and export the results directly to Markdown. Consider the following statement: “A single image of a document can carry as much meaning as thousands of text tokens. This visual channel could offer 10–20× compression while maintaining readable precision after decoding.” Now think of your own vision system and its link to the brain, and how it handles information processing and compression. That principle is worth keeping in mind, because it forms the basis of a new model released today called DeepSeek-OCR. It made me rethink how I use vision models.

Modern LLMs struggle with compute scaling as context grows. This is well documented, and it is one of the reasons million-token contexts demand so much VRAM. DeepSeek-OCR addresses this through optical compression: representing textual information visually rather than tokenizing every character. Instead of feeding massive text sequences directly into an LLM, it compresses the document into visual tokens, essentially optical summaries that can later be decoded back into text. Benchmarks report around 97% OCR decoding precision at 9–10× compression, which is remarkable for a model this compact.

DeepSeek-OCR has two main parts that together form a full vision-to-text pipeline, optimized for OCR decoding precision even under aggressive compression (a rough usage sketch follows the list):

  • DeepEncoder handles high-resolution visual inputs efficiently, maintaining low activation memory while achieving high compression ratios.
  • DeepSeek3B-MoE-A570M serves as the decoder, reconstructing the compressed visual tokens back into text (OCR).

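Loading and running it looks roughly like the snippet below. This is a sketch based on the usage example on the Hugging Face model card: the `infer()` method, the prompt string, and the size settings are taken from there and may change between releases, so treat them as assumptions rather than a stable API.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-OCR"

# The repo ships custom modeling code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # requires flash-attn; otherwise drop this argument
)
model = model.eval().cuda().to(torch.bfloat16)

# Prompt and infer() arguments follow the model card example.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="screenshot.png",  # path to the input image
    output_path="ocr_out",        # directory where the decoded results are written
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```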
As usual, it was a nightmare to get flash attention working properly. After that, it ran remarkably well on my screenshots folder, far exceeding my last round of vision model testing. It handled Dutch, German, and English well, even on messy webpage captures that normally confuse OCR systems. I am running the 16-bit weights, which nearly fill my GPU, but the performance is excellent. DeepSeek has another hit on its hands with this paper, just like the release of DeepSeek R1 a year ago, which pushed every other model maker to add thinking to their toolkit. The company is full of intelligent and innovative researchers whose contributions so far have been a net positive for the AI field.
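For the screenshots-to-Markdown workflow, a small loop over the folder is enough. This is only a sketch of my setup, not anything official: the folder paths are placeholders, and it reuses the `model`, `tokenizer`, and `prompt` from the loading sketch above.

```python
from pathlib import Path

SCREENSHOT_DIR = Path("~/Pictures/screenshots").expanduser()  # placeholder location
OUTPUT_DIR = Path("ocr_markdown")
OUTPUT_DIR.mkdir(exist_ok=True)

for image_path in sorted(SCREENSHOT_DIR.glob("*.png")):
    out_dir = OUTPUT_DIR / image_path.stem
    out_dir.mkdir(exist_ok=True)
    # save_results=True asks infer() to write its output files into out_dir.
    model.infer(
        tokenizer,
        prompt=prompt,
        image_file=str(image_path),
        output_path=str(out_dir),
        base_size=1024,
        image_size=640,
        crop_mode=True,
        save_results=True,
    )
```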

Try it out: