pyPdf for IndirectObject extraction
Problem description
Following this example, I can list all the elements in a PDF file:
import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))  # open in binary mode
list(pdf.pages)  # Process all the objects.
print pdf.resolvedObjects
Now I need to extract a non-standard object from the PDF file.
My object is the one named MYOBJECT, and it is a string.
The part of the script's output that concerns me is:
{'/MYOBJECT': IndirectObject(584, 0)}
The PDF file looks like this:
558 0 obj
<</Contents 583 0 R/CropBox[0 0 595.22 842]/MediaBox[0 0 595.22 842]/Parent 29 0 R/Resources
<</ColorSpace <</CS0 563 0 R>>
/ExtGState <</GS0 568 0 R>>
/Font<</TT0 559 0 R/TT1 560 0 R/TT2 561 0 R/TT3 562 0 R>>
/ProcSet[/PDF/Text/ImageC]
/Properties<</MC0<</MYOBJECT 584 0 R>>/MC1<</SubKey 582 0 R>> >>
/XObject<</Im0 578 0 R>>>>
/Rotate 0/StructParents 0/Type/Page>>
endobj
...
...
...
584 0 obj
<</Length 8>>stream
1_22_4_1 --->>>> this is the string I need to extract from the object
endstream
endobj
How can I follow the 584 reference to get at my string (using pyPdf, of course)?
Recommended answer
Each element in pdf.pages is a dictionary, so assuming it's on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.
You can try printing it on its own, or poke at it with help and dir at a Python prompt, to learn more about how to get the string you want.
After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and the value can be retrieved via getData().
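To make the lookup path above easier to follow, here is a small toy model of how an indirect reference such as 584 0 R points into an object table. The Ref and Stream classes, the objects table, and the resolve helper are all hypothetical stand-ins written for this illustration; they are not pyPdf's own classes, and only the getData name is borrowed from the answer.

```python
# Toy model only: these classes are illustrative stand-ins, NOT pyPdf's API.

class Ref(object):
    """Stand-in for an indirect reference such as `584 0 R`."""
    def __init__(self, idnum, generation):
        self.idnum = idnum
        self.generation = generation


class Stream(object):
    """Stand-in for a stream object that carries raw data."""
    def __init__(self, data):
        self._data = data

    def getData(self):
        return self._data


# Object table indexed as [generation][object number], mirroring the
# resolvedObjects[0][558] indexing used in the answer.
objects = {
    0: {
        558: {"/Resources": {"/Properties": {"/MC0": {"/MYOBJECT": Ref(584, 0)}}}},
        584: Stream("1_22_4_1"),
    }
}


def resolve(value, table):
    """Follow indirect references until a direct object is reached."""
    while isinstance(value, Ref):
        value = table[value.generation][value.idnum]
    return value


# Walk the same key path as in the answer, then follow the reference.
ref = objects[0][558]["/Resources"]["/Properties"]["/MC0"]["/MYOBJECT"]
result = resolve(ref, objects).getData()
print(result)
```

In this sketch the page dictionary only stores a pointer to object 584; the string itself lives in the stream object that the pointer resolves to.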
The following function gives a more generic way to solve this by recursively looking for the key in question:
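import types
import pyPdf

# Open in binary mode; a PDF is not a text file.
pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
pages = list(pdf.pages)  # Touch every page so its objects get resolved.

def findInDict(needle, haystack):
    # Depth-first search through nested dictionaries for the key `needle`.
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries that cannot be looked up.
            continue
        if key == needle:
            return value
        if type(value) == types.DictType or isinstance(value, pyPdf.generic.DictionaryObject):
            x = findInDict(needle, value)
            if x is not None:
                return x

answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()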
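For readers who want to try the recursive search without pyPdf or a PDF file at hand, the same idea can be sketched on plain nested dictionaries. The find_key function and the sample data below are assumptions made for this illustration: the dictionary only imitates the shape of the page object from the question, and the reference is stored as a plain string rather than a real indirect object.

```python
def find_key(needle, haystack):
    """Depth-first search for `needle` in nested dicts; None if absent."""
    for key, value in haystack.items():
        if key == needle:
            return value
        if isinstance(value, dict):
            found = find_key(needle, value)
            if found is not None:
                return found
    return None


# Hypothetical data shaped like the page dictionary in the question.
sample = {
    0: {
        558: {
            "/Type": "/Page",
            "/Resources": {
                "/Properties": {
                    "/MC0": {"/MYOBJECT": "584 0 R"},
                    "/MC1": {"/SubKey": "582 0 R"},
                },
            },
        },
    },
}

hit = find_key("/MYOBJECT", sample)
miss = find_key("/NoSuchKey", sample)
print(hit)
print(miss)
```

The search returns the first match it encounters, so if the same key appears in several places the result depends on iteration order.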