Now, whenever the xref table points to a compressed object,
parse_object_with_index will look it up in the corresponding object
stream as if it were a regular object.
With this, our parser gains the bare minimum support for xref streams.
The Parser class is now a generic PDF object parser, of which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.
This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.
We now try to parse the first indirect value and see
if it's the `Linearization Parameter Dictionary`. if it's not, we
fallback to reading the xref table from the end of the document
This code isn't _actually_ used as of right now, but I wrote it at the
same time as all of the code in the previous commit. I realized after
I wrote it that these hint tables aren't super useful if the parser
already has access to the full file. However, this will be useful if
we ever want to stream PDFs from the web (and possibly view them in
the browser).
This is a big step, as most PDFs which are downloaded online will be
linearized. Pretty much the only difference is that the xref structure
is slightly different.
The Parser should hold information relevant for parsing, whereas the
Document should hold information relevant for displaying pages.
With this in mind, there is no reason for the Document to hold the
xref table and trailer. These objects have been moved to the Parser,
which allows the Parser to expose less public methods (which will be
even more evident once linearized PDFs are supported).
We now follow nested page tree nodes to find all of the actual
page dicts, whereas previously we just assumed the root level
page tree node contained all of the page children directly.
This commit introduces the ability to parse the document catalog dict,
as well as the page tree and individual pages. Pages obviously aren't
fully parsed, as we won't care about most of the fields until we
start actually rendering PDFs.
One of the primary benefits of the PDF format is laziness. PDFs are
not meant to be parsed all at once, and the same is true for pages.
When a Document is constructed, it builds a map of page number to
object index, but it does not fetch and parse any of the pages. A page
is only parsed when a caller requests that particular page (and is
cached going forwards).
Additionally, this commit also adds an object_cast function which
logs bad casts if DEBUG_PDF is set. Additionally, utility functions
were added to ArrayObject and DictObject to get all types of objects
from the collections to avoid having to manually cast.
This commit adds a parser as well as the Reader class, which serves
as a utility to aid in reading the PDF both forwards and in reverse.
The parser currently is capable of reading xref tables, as well as
all values. We don't really do anything with any of this information,
however.