Tencent recently open-sourced POINTS-Reader, an end-to-end vision-language model that performs well on OCR tasks. By now there are quite a few open-source VLMs from Chinese teams trained specifically for OCR. Benchmark scores are one thing; how well these models actually work in practice can be quite another. This post therefore compares several currently popular VLMs, testing their text-recognition quality on a PDF page with several different prompts.

Models compared

Name                | Type        | Architecture                       | Size | Multilingual      | Output format
--------------------|-------------|------------------------------------|------|-------------------|--------------
Qwen2.5VL-3B        | General VLM | Qwen2_5_VLForConditionalGeneration | 3B   |                   | -
MinerU2.0-2505-0.9B | Expert VLM  | Mineru2QwenForCausalLM             | 0.9B |                   | -
MonkeyOCR-3B        | Expert VLM  | Qwen2_5_VLForConditionalGeneration | 3B   |                   | -
OCRflux-3B          | Expert VLM  | Qwen2_5_VLForConditionalGeneration | 3B   |                   | Fixed
POINTS-Reader       | Expert VLM  | POINTSV15ChatModel                 | 3B   | Chinese & English | -

Evaluation

Test image

A screenshot of a typical academic-paper PDF page was chosen, containing tables, figures, and paragraphs. (Test image: a page from a paper.)

Test prompts

To check how sensitive each model is to the prompt and how well it follows it, the following typical prompts were tested, ranging from simple text extraction to full layout extraction.

1. Simple extraction of all text

prompt_1 = """extract all the text from the image"""
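For reference, the sketch below shows one way prompt_1 can be sent to Qwen2.5VL-3B with the transformers library; the Hugging Face model id Qwen/Qwen2.5-VL-3B-Instruct, the local image path page.png, and the generation length are my assumptions, and as noted in the conclusion the other models each need their own loading code.

# Minimal sketch (assumed model id and paths): send prompt_1 plus one page image to Qwen2.5-VL.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL examples

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page.png"},  # assumed local screenshot of the PDF page
        {"type": "text", "text": prompt_1},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])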

2. Extract text and tables

prompt_2 = """
Please extract all the text from the image with the following requirements:
1. Return tables in HTML format.
2. Return all other text in Markdown format.
"""

3. Extract layout position, category, and content

prompt_3 = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.

1. Bbox format: [x1, y1, x2, y2]

2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].

3. Text Extraction & Formatting Rules:
    - Picture: For the 'Picture' category, the text field should be omitted.
    - Formula: Format its text as LaTeX.
    - Table: Format its text as HTML.
    - All Others (Text, Title, etc.): Format their text as Markdown.

4. Constraints:
    - The output text must be the original text from the image, with no translation.
    - All layout elements must be sorted according to human reading order.

5. Final Output: The entire output must be a single JSON object.
"""

4. Extract tables and figures while preserving heading format

prompt_4 = """
Below is the image of one page of a document. "
Just return the plain text representation of this document as if you were reading it naturally.n"
ALL tables should be presented in HTML format.n"
If there are images or figures in the page, present them as "<Image>(left,top),(right,bottom)</Image>", (left,top,right,bottom) are the coordinates of the top-left and bottom-right corners of the image or figure.n"
Present all titles and headings as H1 headings.n"
Do not hallucinate.n"
"""

Recognition results

1. Simple extraction of all text

● Qwen2.5VL-3B (output screenshot)

● MinerU2.0-2505-0.9B (output screenshot)

● MonkeyOCR (output screenshot)

● OCRflux-3B

Table 2: SFT Data Comparison on GAIA benchmarks. The best results among all backbones are in bolded.\n\n4.3.3 RL Stimulation\n\nWe compare GAIA performances between models trained after SFT and reinforcement learning. RL models are trained based on the SFT results. As illustrated in Figure 6a and 6b, our experimental results demonstrate significant performance improvements across both Qwen2.5-32B and Qwen2.5-72B models after RL training on both GAIA and WebWalkerQA. The Pass@1 metric shows notable enhancements of +7.8 points for the 32B model and an even more pronounced +13.5 points increase for the 72B variant on GAIA. On WebWalkerQA, WebShaper also improves IS capability on a large scale. This substantial gain highlights the critical role of RL in activating advanced information-seeking capabilities within LLM.\n\nThe breadth and complexity of tasks introduced by our task formalization stimulate dynamic IS strategies during RL. Unlike generic datasets, our carefully curated scenarios require the model to iteratively query relevant information, effectively \"training\" it to prioritize contextually aligned knowledge fragments.\n\n4.3.4 Formalization\n\nIn this part, we validate whether our formalization truly improves the dataset. We compare our dataset to a variation that uses natural language during the data synthesis. This variation takes the current question in each iteration and also uses the Expander agent to expand it to a new question. The Expander process in natural language as well. We SFT Qwen2.5-32B, Qwen2.5-72B, and QwQ on both datasets. The other training setting remains the same. We compare the training results with the variation as shown in Figure 7a.\n\nFL excels NL in all base model backbones. These results indicate that our formalization language can

● POINTS-Reader (output screenshot)

2. Extract text and tables

● Qwen2.5VL-3B (output screenshot)

● MinerU2.0-2505-0.9B (output screenshot)

● MonkeyOCR (output screenshot)

● OCRflux-3B (output screenshot)

● POINTS-Reader (output screenshot)

3. Extract layout position, category, and content

● Qwen2.5VL-3B (output screenshot)

● MinerU2.0-2505-0.9B (output screenshot)

● MonkeyOCR (output screenshot)

● OCRflux-3B (output screenshot)

● POINTS-Reader (output screenshot)

4. Extract tables and figures while preserving heading format

● Qwen2.5VL-3B (output screenshot)

● MinerU2.0-2505-0.9B (output screenshot)

● MonkeyOCR (output screenshot)

● OCRflux-3B (output screenshot)

● POINTS-Reader (output screenshot)

Conclusion

Based on the comparison above, my subjective ranking is as follows (it does not reflect the models' actual benchmark performance): POINTS-Reader = OCRflux-3B > MonkeyOCR >= MinerU2.0-2505-0.9B > Qwen2.5VL-3B

1. A general-purpose VLM is already good enough

If you only need the document text, without figures or tables, Qwen2.5VL-3B is sufficient, and the prompt does not need any special tuning.

2. Sub-heading recognition

Both OCRflux-3B and MinerU2.0-2505-0.9B are able to recognize sub-headings.

3. Model output format

OCRflux-3B strictly constrains its output format, so it is best used together with the official pipeline, which unlocks extra features such as cross-page merging. Its fixed output schema looks like this:

{
    "primary_language": 
    "is_rotation_valid": 
    "rotation_correction": 
    "is_table": 
    "is_diagram": 
    "natural_text":
}
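If OCRflux-3B is used without the official pipeline, this fixed JSON still has to be decoded by hand; a minimal sketch that relies only on the field names shown above:

# Minimal sketch: pull the recognised text out of OCRflux-3B's fixed JSON output.
import json

def extract_text(raw_response: str) -> str:
    page = json.loads(raw_response)
    if not page.get("is_rotation_valid", True):
        # The page may need to be rotated by `rotation_correction` degrees and re-run.
        return ""
    return page.get("natural_text") or ""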

4. Compatibility

Honestly, compatibility across these models is poor. Although their overall architectures are close to Qwen2.5VL, each has its own vision processor and slightly different calling conventions, so the same code cannot be reused across all of them, and the required transformers versions vary widely. VLMs will be a crucial building block for practical agents, but there is still no unified interface; hopefully a major vendor proposes a common standard soon.

5. Ease of use

A VLM is essentially still an LLM and needs an inference framework for acceleration, otherwise generation is very slow. Acceleration frameworks such as vLLM and SGLang support LLMs broadly but VLMs far less so, and each VLM needs to be adapted separately by the model's team or the open-source community. POINTS-Reader in particular hard-codes flash attention, so it simply cannot run without a CUDA environment.
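For the models that vLLM does support (Qwen2.5-VL, for example), serving them behind the OpenAI-compatible API sidesteps most of the per-model loading code. A minimal sketch, assuming the server was started with "vllm serve Qwen/Qwen2.5-VL-3B-Instruct" on the default port:

# Minimal sketch: query a vLLM OpenAI-compatible endpoint with a page image (assumed setup).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "extract all the text from the image"},
        ],
    }],
)
print(resp.choices[0].message.content)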
