Parsing Pdf File Using Regular Expressions In Python
Solution 1:
If you use only regular expressions, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regular expressions cannot handle recursive structures, at least not with Python's re module.
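A small sketch of the nesting problem, using a made-up dictionary fragment rather than a real file. The non-greedy pattern stops at the first closing delimiter, and the greedy one runs past the end of the first dictionary:

```python
import re

# Hypothetical PDF fragment: a dictionary that contains another dictionary.
nested = "<< /Outer << /Inner 1 >> /After 2 >>"

# The non-greedy pattern stops at the first ">>", which closes the inner
# dictionary, so the outer dictionary is cut short.
truncated = re.search(r"<<.*?>>", nested).group(0)
print(truncated)

# The greedy pattern handles that case by accident, but overshoots as soon
# as two dictionaries follow each other.
two = "<< /A 1 >> << /B 2 >>"
overshoot = re.search(r"<<.*>>", two).group(0)
print(overshoot)
```

Neither variant can count how many dictionaries are currently open, which is what matching nested delimiters requires.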
A PDF file is a tree of objects and streams:
- Dictionaries: << (name value)* >>
- Lists: [ (value)* ]
- Names: / (regular char)*
- Strings: ( (char)* )
- Hex strings: < (hexchar)* >
- Numbers: (-)? ( (digit)+ | (digit)+ . (digit)* | . (digit)+ )
- Booleans: true | false
- References: (digit)+ (whitespace)+ (digit)+ (whitespace)+ R
Whitespace and comments are ignored in most places.
Comments start with % and run until the end of the line.
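One way to handle the nesting is to use regular expressions only for the tokens and let a small recursive function deal with dictionaries and lists. The following is a rough sketch for the simplified grammar above, not a complete PDF parser. The names TOKEN, tokenize and parse are made up for this example, strings are assumed to contain no nested parentheses or escapes, and the input is a str although a real file should be read as bytes:

```python
import re

# Regex for single tokens only; nesting is handled by parse() below.
TOKEN = re.compile(r"""
    (?P<ws>\s+|%[^\r\n]*)                  # whitespace and comments
  | (?P<ref>\d+\s+\d+\s+R\b)               # indirect reference
  | (?P<num>[+-]?(?:\d+\.?\d*|\.\d+))      # integer or real number
  | (?P<open><<|\[)                        # start of dictionary or list
  | (?P<close>>>|\])                       # end of dictionary or list
  | (?P<name>/[^\s()<>\[\]{}/%]*)          # name
  | (?P<str>\([^()]*\))                    # string (no nesting, no escapes)
  | (?P<hex><[0-9A-Fa-f\s]*>)              # hex string
  | (?P<word>true|false|null)              # keywords
""", re.VERBOSE)


class Close(str):
    """Marker returned by parse() when it reads a closing delimiter."""


def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()


def parse(tokens):
    """Parse one value from a token iterator, recursing into containers."""
    kind, text = next(tokens)
    if kind == "close":
        return Close(text)
    if kind == "open":
        items = []
        while True:
            item = parse(tokens)
            if isinstance(item, Close):  # sketch: delimiter type not checked
                break
            items.append(item)
        if text == "[":
            return items
        return dict(zip(items[0::2], items[1::2]))
    if kind == "num":
        return float(text) if "." in text else int(text)
    if kind == "ref":
        num, gen, _ = text.split()
        return ("ref", int(num), int(gen))
    if kind == "word":
        return {"true": True, "false": False, "null": None}[text]
    return text  # names, strings and hex strings are kept as raw text


sample = "<< /Type /Catalog % comment\n /Kids [1 0 R 2 0 R] /Count 2 /Open true /Scale -1.5 >>"
result = parse(tokenize(sample))
print(result)
```

The recursion in parse is what a single regular expression cannot express: each opening delimiter starts a new call, and that call returns when its own closing delimiter is reached.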
Indirect objects are specified as:
1 0 obj
(any object)
endobj
This object can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached:
1 0 obj
<<
/Length 22
>>
stream
(22 bytes of raw data)
endstream
endobj
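Because the stream is raw data, it may itself contain the text endstream, so searching for that keyword is unreliable. Below is a sketch that reads exactly /Length bytes instead. The helper name read_stream and the sample bytes are invented for illustration, and it assumes that /Length is a direct number (in real files it can be an indirect reference) and that the stream dictionary contains no nested dictionary:

```python
import re

# Hypothetical 22-byte payload that happens to contain "endstream".
payload = b"raw\nendstream\nmore raw"
pdf = b"1 0 obj\n<<\n/Length 22\n>>\nstream\n" + payload + b"\nendstream\nendobj\n"

# Object header, flat stream dictionary, then the "stream" keyword and EOL.
HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*<<(.*?)>>\s*stream\r?\n", re.DOTALL)


def read_stream(data, pos=0):
    """Return the raw bytes of the first stream object found at or after pos."""
    m = HEADER.search(data, pos)
    if m is None:
        return None
    # Accept only a direct integer; "/Length 5 0 R" is rejected by the lookahead.
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", m.group(3))
    if length is None:
        raise ValueError("stream without a direct /Length entry")
    n = int(length.group(1))
    return data[m.end():m.end() + n]


by_length = read_stream(pdf)

# For comparison: a keyword-based regex stops at the "endstream" inside the data.
by_keyword = re.search(rb"stream\r?\n(.*?)\r?\nendstream", pdf, re.DOTALL).group(1)

print(len(by_length), by_keyword)
```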
A PDF file looks something like this:
%PDF-1.4
%ÿÿÿÿ
1 0 obj
<< /Author (MizardX) >>
endobj
2 0 obj
<<
/Type /Catalog
% more required keys
>>
endobj
% many more indirect objects, one after another
xref
0 3
0000000000 65535 f
0000000015 00000 n
0000000054 00000 n
trailer
<<
/Info 1 0 R
/Root 2 0 R
% ... more required keys
>>
startxref
225
%%EOF
The root of the object tree is the trailer dictionary. Every object is referenced directly or indirectly from this dictionary.
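Since everything hangs off the trailer, a reader would normally start at the end of the file. The sketch below locates the last startxref value and the references stored in the trailer. The function name find_trailer and the sample bytes are made up, and the offsets in the sample are illustrative rather than real. It also assumes a classic trailer keyword; files from PDF 1.5 onward may use a cross-reference stream instead, which this does not handle:

```python
import re

# Shortened, hypothetical file; offsets are not accurate.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Author (MizardX) >>\nendobj\n"
    b"2 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 3\n"
    b"0000000000 65535 f\n0000000015 00000 n\n0000000054 00000 n\n"
    b"trailer\n<<\n/Info 1 0 R\n/Root 2 0 R\n>>\n"
    b"startxref\n225\n%%EOF\n"
)


def find_trailer(data):
    """Return (xref_offset, {key: (object_number, generation)})."""
    starts = re.findall(rb"startxref\s+(\d+)", data)
    if not starts:
        raise ValueError("no startxref found")
    xref_offset = int(starts[-1])  # the last one wins in updated files

    pos = data.rfind(b"trailer")
    if pos < 0:
        raise ValueError("no trailer keyword found")
    # Collect only entries whose value is an indirect reference "n g R".
    refs = {
        key.decode(): (int(num), int(gen))
        for key, num, gen in re.findall(rb"/(\w+)\s+(\d+)\s+(\d+)\s+R", data[pos:])
    }
    return xref_offset, refs


offset, refs = find_trailer(sample)
print(offset, refs)
```

From refs["Root"] a parser would then look up object 2 0 and continue walking the tree.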
There is a lot more complexity hidden inside the streams, but that does not affect the file structure.
The full specification can be found at Adobe's website.
Solution 2:
You need to use *? as the non-greedy version of *; see the documentation of Python's re module.
Also, note that the PDF format is very complex, especially once it contains binary streams, but if you know the PDFs you are looking at are simple, then this should work.
Solution 3:
A question mark after the repeated part makes it match the minimal number of characters. Also, the comma is not necessary because \S already takes it into account.
\d+\s\d+\sobj[\s\S]*?endobj
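A quick check of the pattern above against a made-up two-object fragment, compared with the greedy form of the same pattern:

```python
import re

# Hypothetical text-only fragment with two indirect objects.
text = (
    "1 0 obj\n<< /Author (MizardX) >>\nendobj\n"
    "2 0 obj\n<< /Type /Catalog >>\nendobj\n"
)

# Non-greedy: each match ends at the nearest "endobj".
lazy = re.findall(r"\d+\s\d+\sobj[\s\S]*?endobj", text)

# Greedy: a single match running from the first "obj" to the last "endobj".
greedy = re.findall(r"\d+\s\d+\sobj[\s\S]*endobj", text)

print(len(lazy), len(greedy))
```

This only holds for simple files; a stream whose data contains the text endobj would still end the non-greedy match too early.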