If `Document::resolve()` was called during parsing, it'd change the
reader's current position, so the parsing code that called it would
then end up at an unexpected position in the file.
Parser.cpp already had special-case recovery when a stream's length
was stored in an indirect reference.
Commit ead02da98ac70c ("/JBIG2Globals") in #23503 added another case
where we could resolve an indirect reference during parsing, but it
didn't save and restore the reader position for that.
Put the save/restore code in `DocumentParser::parse_object_with_index`
instead, right before the place that ultimately changes the reader's
position during `Document::resolve`. This fixes `/JBIG2Globals` and
lets us remove the special-case code for `/Length` handling.
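The save/restore idea can be sketched roughly like this. This is only an illustration: `Reader`, `OffsetRestorer` and `parse_object_at` are made-up names, not the actual LibPDF classes, and the "parse" just pretends the object is 10 bytes long.

```cpp
#include <cstddef>

// Hypothetical stand-in for the PDF reader; it only tracks an offset.
struct Reader {
    std::size_t offset { 0 };
    void move_to(std::size_t new_offset) { offset = new_offset; }
};

// RAII guard: remembers the reader's offset and restores it on scope exit.
class OffsetRestorer {
public:
    explicit OffsetRestorer(Reader& reader)
        : m_reader(reader)
        , m_saved_offset(reader.offset)
    {
    }
    ~OffsetRestorer() { m_reader.move_to(m_saved_offset); }

    OffsetRestorer(OffsetRestorer const&) = delete;
    OffsetRestorer& operator=(OffsetRestorer const&) = delete;

private:
    Reader& m_reader;
    std::size_t m_saved_offset;
};

// Hypothetical analogue of resolving an indirect object: it seeks elsewhere
// in the file, but the guard puts the reader back before the caller resumes.
inline std::size_t parse_object_at(Reader& reader, std::size_t object_offset)
{
    OffsetRestorer restorer(reader);
    reader.move_to(object_offset);
    // The object would be parsed here; pretend it spans 10 bytes.
    reader.move_to(reader.offset + 10);
    return reader.offset; // offset where the nested parse ended
}
```

Putting the guard at the one place that moves the reader means callers (like the `/Length` or `/JBIG2Globals` handling) don't each have to remember to do it.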
Since this is kind of subtle, include a test.
This test uses a JBIG2Globals with an indirect reference,
and contains an indirect reference for a stream length.
When we parse the main JBIG2 image's stream, we unfilter
its data, which causes these two indirect references to
be resolved during parsing.
I started with the output of
    Meta/jbig2_to_pdf.py -o foo.pdf \
        Tests/LibGfx/test-inputs/jbig2/bitmap.jbig2
and then manually added a
    /DecodeParms <</JBIG2Globals 6 0 R>>
entry pointing to an empty stream, and made that new stream
object's length an indirect reference too for good measure.
I used `mutool clean` to fix up offsets a bit. But that also
removes the indirect reference for a stream's length, so I
manually put that back in and adjusted the offset to the last
object in the xref table and the startxref value.
CMYK data describes which inks a printer should use to print a color.
A color shown on screen that's supposed to look similar to what the
printer produces is very different from what Color::from_cmyk()
produces. (It's also printer-dependent.)
There are many ICC profiles describing printing processes. It doesn't
matter too much which one we use -- most of them look somewhat
similar, and they all look dramatically better than Color::from_cmyk().
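For comparison, here is a sketch of the usual naive CMYK-to-RGB formula. That Color::from_cmyk() works along these lines is an assumption on my part, and `naive_cmyk_to_rgb` is a made-up name. All inputs are expected to be in the range [0, 1].

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Naive conversion: treat each ink as simply subtracting light, with no
// knowledge of how real inks on real paper behave.
inline std::array<std::uint8_t, 3> naive_cmyk_to_rgb(double c, double m, double y, double k)
{
    auto channel = [k](double ink) {
        return static_cast<std::uint8_t>(std::lround(255.0 * (1.0 - ink) * (1.0 - k)));
    };
    return { channel(c), channel(m), channel(y) };
}
```

With this formula, pure cyan ink (1, 0, 0, 0) comes out as RGB (0, 255, 255), which is a far more saturated color than cyan ink typically looks on paper. An ICC profile replaces the formula with measured data for an actual printing process.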
This patch adds a function to download a zip file that Adobe offers
on their web site. They even have a page for redistribution:
https://www.adobe.com/support/downloads/iccprofiles/icc_eula_win_dist.html
(That one leads to a broken download though, so this downloads the
end-user version.)
In case we have to move off this download at some point, there are also
a whole bunch of profiles at https://www.color.org/registry/index.xalter
that "may be used, embedded, exchanged, and shared without restriction".
The Adobe zip contains a whole bunch of other useful and fun profiles,
so I went with it.
For now, this only unzips the USWebCoatedSWOP.icc file though, and
installs it in ${CMAKE_BINARY_DIR}/Root/res/icc/Adobe/CMYK/. In
Serenity builds, this will make it to /res/icc/Adobe/CMYK in the
disk image. And in lagom builds, after #23016 this is the
lagom res staging directory that tools can install via
Core::ResourceImplementation. `pdf` and `MacPDF` already do that,
`TestPDF` now does it too.
The final piece is that LibPDF then loads the profile from there
and uses it for DeviceCMYK color conversions.
(Doing file access from the bowels of a library is a bit weird,
especially in a system that has sandboxing built in. But LibGfx does
that in FontDatabase too already, and LibPDF uses that, so it's not a
new problem.)
These set the horizontal scale factor, character spacing, word
spacing, and text rise respectively.
Also add a global scale transform, and set a text transform matrix
with a scale for some of the text.
Mostly because I audited all places that assigned to `m_text_matrix`
after #22760.
This one is very difficult to trigger in practice.
`show_text()` marks the text rendering matrix dirty already,
so this only has an effect if the `TJ` array starts with a
number, and the matrix isn't marked dirty going in.
`Tm` caches the text rendering matrix, so I changed text.pdf
to contain:
```
1 0 0 1 45 130 Tm
[ 200 (Hello) -2000 (World) ] TJ T*
```
This first sets an x offset of 5 (on top of the normal 40), and
then undoes it (`200` is multiplied by font size (25) / -1000,
and `200 * 25 / -1000` is -5). Before this change, the topmost
"Hello World" ended up slightly indented.
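The arithmetic for a number in a `TJ` array can be sketched as below. `tj_adjustment` is a made-up helper, not a LibPDF function, and the horizontal scaling parameter follows my reading of the spec's text-positioning formula (it defaults to 1, i.e. 100%).

```cpp
#include <cmath>

// Horizontal displacement, in text space units, caused by one number in a
// TJ array: the number is in thousandths of a unit and is subtracted.
constexpr double tj_adjustment(double number, double font_size, double horizontal_scaling = 1.0)
{
    return -number / 1000.0 * font_size * horizontal_scaling;
}
```

For the test file, `tj_adjustment(200, 25)` is -5, which takes the x position from the 45 set by `Tm` back to 40. By the same formula, the `-2000` between the two strings moves the pen 50 units to the right.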
Likely no behavior change in practice, but makes the code easier
to understand, and maybe it helps in the wild somewhere.
I opened Tests/LibGfx/test-inputs/png/wide-gamut-only.png in
Preview.app and used File->Export as PDF... to convert it to a PDF.
I then ran
    mutool clean -d Tests/LibPDF/wide-gamut-only.pdf \
        Tests/LibPDF/wide-gamut-only.pdf
to decompress it, edited by hand to remove padding around the image
and shrunk the page's MediaBox to be as big as the image, and ran the
command above again to fix up binary offsets in the xref table.
Hand-written (with offsets fixed up by `mutool clean`).
Uses the default encoding for each font. Manual test for now.
Byte strings generated with:
    python3 -c "for i in range(4):
        print('<' +
              ''.join('%02x' % r for r in range(i * 64, (i + 1) * 64)) +
              '>')"
A local (non-public) PDF I have lying around contains this in
a page's operator stream:
```
[<00b4003e> 3 <002600480051> 3 <005700550044004f0003> -29
<00330044> 3 <0055> -3 <004e0040> 4 <0003> -29 <004c00560003> -31
<0057004b> 4 <00480003> -37 <0050
>] TJ
```
That is, there's a newline in a hexstring after a character.
This led to `Parser error at offset 5184: Unexpected character`.
The spec says in 3.2.3 String Objects, Hexadecimal Strings:
"""Each pair of hexadecimal digits defines one byte of the string.
White-space characters (such as space, tab, carriage return, line feed,
and form feed) are ignored."""
But we didn't ignore whitespace before or after a character, only
in between the bytes.
The spec also says:
"""If the final digit of a hexadecimal string is missing—that is, if
there is an odd number of digits—the final digit is assumed to be 0."""
In that case, we were skipping the closing `>` twice -- or, more
accurately, we ignored the character after it too. This has been
wrong all the way back in #6974.
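A standalone sketch of the two rules, for illustration only: `parse_hex_string` and `is_pdf_whitespace` are made-up names and this is not the actual Parser.cpp code. It expects `pos` to point just past the opening `<`.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// The six white-space characters the PDF spec lists.
inline bool is_pdf_whitespace(char c)
{
    return c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// Decodes a hex string body. On success, returns the bytes and leaves `pos`
// exactly one past the closing '>'.
inline std::optional<std::string> parse_hex_string(std::string_view input, std::size_t& pos)
{
    auto hex_value = [](char c) -> int {
        if (c >= '0' && c <= '9')
            return c - '0';
        if (c >= 'a' && c <= 'f')
            return c - 'a' + 10;
        if (c >= 'A' && c <= 'F')
            return c - 'A' + 10;
        return -1;
    };

    std::string bytes;
    int high_nibble = -1; // -1 means there is no pending first digit
    while (pos < input.size()) {
        char c = input[pos++];
        if (c == '>') {
            // Odd number of digits: the missing final digit counts as 0.
            if (high_nibble >= 0)
                bytes.push_back(static_cast<char>(high_nibble << 4));
            return bytes;
        }
        if (is_pdf_whitespace(c))
            continue; // ignored anywhere, not just between bytes
        int value = hex_value(c);
        if (value < 0)
            return std::nullopt; // unexpected character
        if (high_nibble < 0) {
            high_nibble = value;
        } else {
            bytes.push_back(static_cast<char>((high_nibble << 4) | value));
            high_nibble = -1;
        }
    }
    return std::nullopt; // ran out of input before '>'
}
```

Handling `>` in a single place makes it hard to consume it twice, and skipping white space per character instead of per byte covers the newline-after-a-byte case from the PDF above.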
Add a test that fails if either of the two changes isn't present.
Having some rendering test coverage is motivated by #22362, but this
test wouldn't have found the crashes over there (since colorspaces.pdf
does not contain pattern color spaces). Still, good to have some
in-repo test coverage of PDF rendering.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
    $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
        Meta Ports Ladybird Tests Kernel)
    $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
        s/deprecated_string/byte_string/g' $xs
    $ clang-format --style=file -i \
        $(git diff --name-only | grep \.cpp\|\.h)
    $ gn format $(git ls-files '*.gn' '*.gni')
Manually added an Outlines dict with three items, one each for
every text string encoding in its title.
(Preview.app apparently can't handle UTF-8 in outlines either.)
For now, this uses UTF-16BE and UTF-8 marked strings in page body text.
These markings should be ignored in body text.
Hand-written, with `set fenc=latin1` and `set binary` in vim, and
xref etc fixed up by running
    mutool clean Tests/LibPDF/encoding.pdf Tests/LibPDF/encoding.pdf
as usual.
A manual test, but better than nothing.
I hand-wrote the file, and used mutool to fix up xref and stream
lengths:
    mutool clean Tests/LibPDF/type3.pdf Tests/LibPDF/type3.pdf
The file contains one `d1` character which per spec shouldn't
contain color statements, and if it does it should be ignored,
and one `d0` character which can contain color.
The text then sets a color before rendering the text.
Per spec, the text color should affect the `d1` character but
not the `d0` one. We get this wrong, but so does Preview.app.
(PDFium gets it right.)
But independent of the colors, just rendering the glyphs at all
at the right position is already good :^)
Hand-written, based on the text example in Appendix G.2 in
the PDF 1.7 spec, with the xref table fixed up by `mutool clean`:
    mutool clean -dggg Tests/LibPDF/text.pdf Tests/LibPDF/text.pdf
I didn't find example code for this, and the AI assistant did very
poorly on it as well, so I had to write it all by myself!
It could probably be much more efficient, but I think the overall
shape is roughly fine.
* SampledFunction now keeps the StreamObject it gets data from alive
(doesn't matter too much in practice, but does matter in the test,
where nothing else keeps the stream alive).
* If a sample is an integer, we would previously sample that value
twice and then divide by zero when interpolating. Make sure to
sample 1 unit apart.
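A 1-D sketch of the divide-by-zero, assuming "a sample is an integer" means the input lands exactly on a sample position. `interpolate` and `sample_at` are made-up names, not the SampledFunction code, and the clamping at the last sample is my own choice.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

inline double interpolate(double x, double x0, double x1, double y0, double y1)
{
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0);
}

// Assumes `samples` has at least two entries and 0 <= x <= samples.size() - 1.
inline double sample_at(std::vector<double> const& samples, double x)
{
    // Using floor(x) and ceil(x) as neighbors gives x0 == x1 for integer x,
    // which divides by zero in interpolate(). Always pick neighbors that
    // are exactly one unit apart instead.
    double last = static_cast<double>(samples.size() - 1);
    double x0 = std::min(std::floor(x), last - 1.0);
    double x1 = x0 + 1.0;
    return interpolate(x, x0, x1,
        samples[static_cast<std::size_t>(x0)],
        samples[static_cast<std::size_t>(x1)]);
}
```

With samples `{ 0, 10, 40 }`, an input of exactly 1 now yields 10, and the last position (2) yields 40 by interpolating from its left neighbor.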
Covers DeviceGray, CalRGB, DeviceRGB, DeviceCMYK, Lab, CalGray for now.
Does not yet cover Indexed, Pattern, Separation, DeviceN, ICCBased.
Lovingly hand-written, with the xref table fixed up by mutool.
There were two problems:
1. parse_compressed_object_with_index() parses indirect objects
without going through Parser::parse_indirect_value(), so
push_reference() / pop_reference() weren't called.
Manually call them, both for the indirect object containing
the object stream and for the indirect object within the
object stream.
2. The indirect object within the object stream got decrypted
twice: Once when the object stream data itself got decrypted,
and then incorrectly a second time when the object data within
the stream was read. To fix, disable encryption while parsing
object stream data (since it's already decrypted).
The test is from http://opf-labs.org/format-corpus/pdfCabinetOfHorrors/
which according to readme.md at the same location is CC0.
I created this by typing "sup" into TextEdit.app on macOS 13.4,
hitting Cmd-P to bring up the print dialog, clicking the PDF button
at the bottom, changing Title and Author to "sup", clicking
"Security Options…", and checking "Require password to open document"
(with password "sup").
This file tests several things:
- It has a compressed stream as first object. This used to make the
linearization dict detection logic assert.
- It uses AES encryption with version 4 of the encryption
dict. This used to not be implemented.
Note that in some cases (in particular SQL::Result and PDFErrorOr),
there is no Formatter defined for the error type, hence TRY_OR_FAIL
cannot work as-is. Furthermore, this commit leaves untouched the places
where MUST could be replaced by TRY_OR_FAIL.
Inspired by:
https://github.com/SerenityOS/serenity/pull/18710#discussion_r1186892445
Let's put test files with the tests themselves, instead of a random user
directory. (But still copy them so they appear in the user directory
for convenience.)
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
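The difference can be illustrated with the standard library's equivalent. This uses `std::string_view` and its `sv` literal as a stand-in, on the assumption that AK's `operator ""sv` behaves analogously.

```cpp
#include <string_view>

using namespace std::literals::string_view_literals;

// Constructing from a char const* has to scan for the terminating NUL.
inline std::string_view from_c_string(char const* s)
{
    return std::string_view(s);
}

// The literal operator receives the length from the compiler, so no
// strlen-style scan is needed.
constexpr std::string_view greeting = "Hello"sv;
static_assert(greeting.size() == 5);

// One visible consequence: embedded NULs survive with the literal but
// cut the string short when going through char const*.
constexpr std::string_view with_nul = "a\0b"sv;
static_assert(with_nul.size() == 3);
```

Here `from_c_string("a\0b")` would have a size of 1, since the scan stops at the first NUL.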
Security handlers manage encryption and decryption of PDF files. The
standard security handler uses RC4/MD5 to perform its crypto (AES as
well, but that is not yet implemented).
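For reference, RC4 itself is tiny. Below is a textbook sketch of the cipher, not the LibPDF or LibCrypto code, and it leaves out how the standard security handler derives its keys with MD5. `rc4` is a made-up free function and the key must not be empty.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>

// Encrypts or decrypts `data` with `key`. RC4 is symmetric, so the same
// call does both. Precondition: `key` is not empty.
inline std::string rc4(std::string_view key, std::string_view data)
{
    // Key scheduling: start from the identity permutation and shuffle it
    // based on the key bytes.
    std::array<std::uint8_t, 256> s {};
    for (std::size_t n = 0; n < 256; ++n)
        s[n] = static_cast<std::uint8_t>(n);
    std::uint8_t j = 0;
    for (std::size_t n = 0; n < 256; ++n) {
        j = static_cast<std::uint8_t>(j + s[n] + static_cast<std::uint8_t>(key[n % key.size()]));
        std::swap(s[n], s[j]);
    }

    // Keystream generation: XOR each data byte with the next keystream byte.
    std::string out(data);
    std::uint8_t i = 0;
    j = 0;
    for (char& c : out) {
        i = static_cast<std::uint8_t>(i + 1);
        j = static_cast<std::uint8_t>(j + s[i]);
        std::swap(s[i], s[j]);
        std::uint8_t k = s[static_cast<std::uint8_t>(s[i] + s[j])];
        c = static_cast<char>(static_cast<std::uint8_t>(c) ^ k);
    }
    return out;
}
```

Since the cipher is symmetric, `rc4(key, rc4(key, text))` gives back `text`.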
Add a unit test for each sample pdf file that currently exists in the
anon user's `~/Documents/pdf` directory.
- linear.pdf
- non-linearized.pdf
- complex.pdf
Each test ensures that the pdf document is parsed and that the page
count is the expected one.