NLP (91): Question Answering over PDF Tables

This article describes how to answer questions about tables in a PDF using a multimodal large language model such as GPT-4V. The overall approach is:

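  • Obtain images of the tables, using a table detection tool or a layout analysis tool
  • Use a multimodal large language model to answer questions about them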

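For text-based PDFs, of course, table data can be extracted directly with the fitz module (PyMuPDF). Here we instead convert the tables in the PDF into images, which is more general: the same approach works for both text-based and scanned PDF files.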
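
For a text-based PDF, the direct route mentioned above might look like the following minimal sketch. It assumes PyMuPDF 1.23 or later, where `Page.find_tables()` is available. The names `extract_tables` and `rows_to_markdown` are illustrative and do not come from the original code.

```python
from typing import List, Optional

Row = List[Optional[str]]


def extract_tables(pdf_path: str) -> List[List[Row]]:
    """Return every table found in a text-based PDF as a list of rows."""
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    tables = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # find_tables() relies on the page having a real text layer,
            # so it will not find anything useful on a scanned page
            for table in page.find_tables().tables:
                tables.append(table.extract())
    return tables


def rows_to_markdown(rows: List[Row]) -> str:
    """Render extracted rows as a Markdown table (first row is the header)."""
    if not rows:
        return ""
    # Empty cells come back as None; newlines inside cells would break the table
    cells = [["" if c is None else str(c).replace("\n", " ") for c in row] for row in rows]
    header, body = cells[0], cells[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Calling `rows_to_markdown(table)` on each item returned by `extract_tables("./pdf/demo2.pdf")` gives a plain-text view of the tables for inspection. This route does not apply to scanned PDFs, which is why the rest of the article works with images.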

Getting the Table Images

The example PDF file used in this article (demo2.pdf) is shown below:

Example PDF file

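In an earlier article, "A First Attempt at Table Detection and Recognition", I described how to use Microsoft's open-source table detection model to detect tables in PDFs.

In this example, we use that open-source model to detect the tables in the PDF and save each table region as an image. The code is as follows: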

import os
from typing import List

import fitz  # PyMuPDF
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

OUTPUT_DIR = "./output"

# Load the Table Transformer detection model from a local directory
print("load table transformer model...")
image_processor = AutoImageProcessor.from_pretrained("./models/table-transformer-detection")
detect_model = TableTransformerForObjectDetection.from_pretrained("./models/table-transformer-detection")


def convert_pdf_2_img(pdf_file: str, pages: int) -> List[str]:
    """Render the first `pages` pages of a PDF to PNG files and return their paths."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    image_list = []
    pdf_document = fitz.open(pdf_file)
    real_pages = min(pages, pdf_document.page_count)
    for page_number in range(real_pages):
        page = pdf_document[page_number]
        # Render the page to a pixmap (RGB without alpha by default)
        pix = page.get_pixmap()
        # Create a Pillow Image object from the pixmap
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        save_img_path = f"{OUTPUT_DIR}/{file_name}_{page_number + 1}.png"
        image.save(save_img_path)
        image_list.append(save_img_path)

    pdf_document.close()
    return image_list


def table_detect(image_path: str) -> List[str]:
    """Detect tables in a page image, crop each one, and return the saved paths."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    image = Image.open(image_path).convert("RGB")
    file_name = os.path.splitext(os.path.basename(image_path))[0]
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = detect_model(**inputs)
    # Rescale the predicted boxes to the original image size (height, width)
    target_sizes = torch.tensor([image.size[::-1]])
    results = image_processor.post_process_object_detection(
        outputs, threshold=0.9, target_sizes=target_sizes
    )[0]

    output_images = []
    detections = zip(results["scores"], results["labels"], results["boxes"])
    for i, (score, label, box) in enumerate(detections):
        box = [round(coord, 2) for coord in box.tolist()]
        print(
            f"Detected {detect_model.config.id2label[label.item()]} with confidence "
            f"{round(score.item(), 3)} at location {box}"
        )
        # Crop the detected table region and save it as its own image
        region = image.crop(box)
        output_image_path = f"{OUTPUT_DIR}/{file_name}_table_{i}.jpg"
        region.save(output_image_path)
        output_images.append(output_image_path)
    return output_images


if __name__ == '__main__':
    test_pdf_file = "./pdf/demo2.pdf"
    page_image_list = convert_pdf_2_img(pdf_file=test_pdf_file, pages=2)
    for page_image in page_image_list:
        table_detect(page_image)

The extracted table images are shown below:

Table 1 image
Table 2 image
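
Detected bounding boxes can be tight, which risks clipping table borders or edge cells. As an optional refinement that is not part of the original code, the box could be padded before cropping. The sketch below is pure Python; the helper name `pad_box` and the default padding of 10 pixels are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def pad_box(
    box: Sequence[float],
    image_size: Tuple[int, int],
    padding: int = 10,
) -> Tuple[int, int, int, int]:
    """Expand an (x0, y0, x1, y1) box by `padding` pixels, clamped to the image.

    `image_size` is (width, height), the same order as PIL's `Image.size`.
    """
    width, height = image_size
    x0, y0, x1, y1 = box
    return (
        max(0, math.floor(x0) - padding),       # left edge, not below 0
        max(0, math.floor(y0) - padding),       # top edge, not below 0
        min(width, math.ceil(x1) + padding),    # right edge, not past the image
        min(height, math.ceil(y1) + padding),   # bottom edge, not past the image
    )
```

With this helper, `image.crop(box)` in `table_detect` would become `image.crop(pad_box(box, image.size))`.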

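Alternatively, a PDF layout analysis tool can be used to locate the table regions in a PDF file. Common open-source layout analysis tools include Baidu's PP-StructureV2.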

Table Question Answering

We encode each extracted table image as base64 and send it to the GPT-4V model, which can then answer questions about the tables.
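
The encoding step on its own can be sketched as follows. This variant guesses the MIME type from the file extension, so that PNG and JPEG files are both labeled correctly in the resulting data URL. The function name `image_to_data_url` is illustrative and is not part of the original code.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    # guess_type only looks at the file name, not the file contents
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `url` value of an `image_url` message part.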

The complete example code in Python is as follows:

# -*- coding: utf-8 -*-
import base64
import os
from pprint import pprint
from typing import List

import fitz  # PyMuPDF
import requests


def get_pdf_content(pdf_path: str) -> str:
    """Return the full text of the PDF, concatenated page by page."""
    with fitz.open(pdf_path) as doc:
        return ''.join(page.get_text() for page in doc)


def encode_image(image_path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


def make_request(pdf_content: str, images: List[str], query: str) -> None:
    # The user message holds the question followed by each table image
    image_content = [{"type": "text", "text": query}]
    for image in images:
        base64_image = encode_image(image)
        image_content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
        })

    # The OpenAI API key is read from the environment
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

    payload = {
        # Vision-capable model used at the time of writing
        "model": "gpt-4-vision-preview",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"The full text of PDF file is: {pdf_content}"},
            {"role": "user", "content": image_content},
        ],
        "max_tokens": 300,
    }

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers=headers,
        json=payload,
        timeout=120,
    )
    response.raise_for_status()  # fail loudly on HTTP errors

    pprint(response.json()['choices'])


if __name__ == '__main__':
    pdf_file_path = '../data/demo2.pdf'
    table_images_list = [
        '../output/demo2_1_table_0.jpg',
        '../output/demo2_1_table_1.jpg']
    test_query = "what's the rank of Alex's city?"
    test_pdf_content = get_pdf_content(pdf_path=pdf_file_path)
    make_request(pdf_content=test_pdf_content, images=table_images_list, query=test_query)

The test question is:

what's the rank of Alex's city?

Answering this question requires information from both table images, and GPT-4V answered it correctly. The output is:

[{'finish_reason': 'stop',
  'index': 0,
  'message': {'content': "The rank of Alex's city, Shanghai, is 2.",
              'role': 'assistant'}}]
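
When a request fails (for example, because of an invalid key or a rate limit), the response body carries an `error` object instead of `choices`, so it can help to unwrap the answer defensively. The following is a minimal sketch that assumes the response shape shown above; `extract_answer` is an illustrative helper, not part of the original code.

```python
from typing import Any, Dict


def extract_answer(response_json: Dict[str, Any]) -> str:
    """Return the assistant's text from a chat completions response."""
    if "error" in response_json:
        # Error payloads carry a human-readable message
        message = response_json["error"].get("message", "unknown error")
        raise RuntimeError(f"API request failed: {message}")
    choices = response_json.get("choices") or []
    if not choices:
        raise RuntimeError("Response contained no choices")
    # Only the first choice is used, matching the single answer shown above
    return choices[0]["message"]["content"]
```

In `make_request`, this would be called as `extract_answer(response.json())` to obtain just the answer string.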

Summary

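This article used a table detection tool to locate the table regions in a PDF and convert them to images, then used a multimodal large language model (such as GPT-4V) to answer questions about the tables. It turned out to be a worthwhile experiment.

The code for this article is open source and available on GitHub at https://github.com/percent4/pdf-llm_series.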

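Recommended Reading

  1. NLP (89): Getting Started with Intelligent Question Answering over PDF Documents
  2. A First Attempt at Table Detection and Recognition
  3. NLP (81): Improvements to the Intelligent Document Question-Answering Assistant Project
  4. NLP (69): Upgrading the Intelligent Document Assistant
  5. NLP (61): Building Intelligent Documents with the Baichuan-13B-Chat Model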



https://percent4.github.io/NLP(九十一)PDF表格问答/
Author: Jclian91
Published: April 3, 2024