For now, this uses UTF-16BE and UTF-8 marked strings in page body text.
These markings should be ignored in body text.
Hand-written, with `set fenc=latin1` and `set binary` in vim, and
xref etc fixed up by running
mutool clean Tests/LibPDF/encoding.pdf Tests/LibPDF/encoding.pdf
as usual.
A manual test, but better than nothing.
I hand-wrote the file, and used mutool to fix up xref and stream
lengths:
mutool clean Tests/LibPDF/type3.pdf Tests/LibPDF/type3.pdf
The file contains one `d1` character which per spec shouldn't
contain color statements, and if it does it should be ignored,
and one `d0` character which can contain color.
The text then sets a color before rendering the text.
Per spec, the text color should affect the `d1` character but
not the `d0` one. We get this wrong, but so does Preview.app.
(PDFium gets it right.)
But independent of the colors, just rendering the glyphs at all
at the right position is already good :^)
Hand-written, based on the text example in Appendix G.2 in
the PDF 1.7 spec, with the xref table fixed up by `mutool clean`:
mutool clean -dggg Tests/LibPDF/text.pdf Tests/LibPDF/text.pdf
I didn't find example code for this and the AI assistant did very
poorly on this as well. So I had to write it all by myself!
It can be much more efficient I think, but I think the overall
shape is maybe roughly fine.
* SampledFunction now keeps the StreamObject it gets data from alive
(doesn't matter too much in practice, but does matter in the test,
where nothing else keeps the stream alive).
* If a sample is an integer, we would previously sample that value
twice and then divide by zero when interpolating. Make sure to
sample 1 unit apart.
Covers DeviceGray, CalRGB, DeviceRGB, DeviceCMYK, Lab, CalGray for now.
Does not yet cover Indexed, Pattern, Separation, DeviceN, ICCBased.
Lovingly hand-written, with the xref table fixed up by mutool.
There were two problems:
1. parse_compressed_object_with_index() parses indirect objects
without going through Parser::parse_indirect_value(), so
push_reference() / pop_reference() weren't called.
Manually call them, both for the indirect object containing
the object stream and for the indirect object within the
object stream.
2. The indirect object within the object stream got decrypted
twice: Once when the object stream data itself got decrypted,
and then incorrectly a second time when the object data within
the stream was read. To fix, disable encryption while parsing
object stream data (since it's already decrypted).
The test is from http://opf-labs.org/format-corpus/pdfCabinetOfHorrors/
which according to readme.md at the same location is CC0.
I created this by typing "sup" into TextEdit.app on macOS 13.4,
hitting Cmd-P to bring up the print dialog, clicked the PDF button
at the bottom, changed Title and Author to "sup", clicked
"Security Options…", and checked "Require password to open document"
(with password "sup").
This file tests several things:
- It has a compressed stream as first object. This used to make the
linearization dict detection logic assert.
- It uses AES as encryption key using version 4 of the encryption
dict. This used to not be implemented.
Note that in some cases (in particular SQL::Result and PDFErrorOr),
there is no Formatter defined for the error type, hence TRY_OR_FAIL
cannot work as-is. Furthermore, this commit leaves untouched the places
where MUST could be replaced by TRY_OR_FAIL.
Inspired by:
https://github.com/SerenityOS/serenity/pull/18710#discussion_r1186892445
Let's put test files with the tests themselves, instead of a random user
directory. (But still copy them so they appear in the user directory
for convenience.)
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
Security handlers manage encryption and decription of PDF files. The
standard security handler uses RC4/MD5 to perform its crypto (AES as
well, but that is not yet implemented).
Add a unit test for each sample pdf file that currently exists in the
anon user's `~/Document/pdf` directory.
- linear.pdf
- non-linearized.pdf
- complex.pdf
Each test ensures that the pdf document is parsed and that the page
count is the expected one.