683 Commits

Author SHA1 Message Date
Nico Weber
32f601f9a4 LibPDF: Fix small bug from #21452
I implemented CFF charset format 2 in 6f783929dd with the note
"I haven't seen this being used in the wild". Now that I have
seen it (0000658.pdf), I can say that this has never worked,
despite me claiming "it's easy to implement".

But now it works!
2024-02-08 13:48:56 +00:00
Nico Weber
9fc47345ce LibGfx+LibPDF: Make sample() functions take ReadonlySpan<>
...instead of Vector<>.

No behavior (or performance) change.
2024-02-06 08:44:53 +01:00
Nico Weber
92a628c07c LibPDF: Always treat /Subtype /Image as binary data when dumping
Sometimes, the "is mostly text" heuristic fails for images.

Before:

    Build/lagom/bin/pdf --render out.png ~/Downloads/0000/0000521.pdf \
        --page 10 --dump-contents 2>&1 | wc -l
       25709

After:

    Build/lagom/bin/pdf --render out.png ~/Downloads/0000/0000521.pdf \
         --page 10 --dump-contents 2>&1 | wc -l
       11376
2024-02-05 21:18:19 -05:00
Nico Weber
f562c470e2 LibGfx+LibPDF: Simpler and faster N-D linear sampling
Previously, if we wanted to e.g. do linear interpolation in 2-D,
we'd get a sample point like (1.3, 4.4), then get 4 samples around
it at (1, 4), (2, 4), (1, 5), (2, 5), then reduce the 4 samples
to 2 samples by computing the combined samples
`0.7 * f(1, 4) + 0.3 * f(2, 4)` and `0.7 * f(1, 5) + 0.3 * f(2, 5)`,
and then 1-D linearly blending between these two samples with the
factor 0.4. In the end we'd multiply the first value by 0.7 * 0.6,
the second by 0.3 * 0.6, the third by 0.7 * 0.4, and the fourth by
0.3 * 0.4, and then sum them all up.

This requires computing and storing 2**N samples, followed by
another 2**N iterations to combine the 2**N samples to a single value.
(N is in practice either 4 or 3, so 2**N isn't super huge.)

Instead, for every sample we can directly compute the product of
weights and sum them up directly. This lets us omit the second loop
and storing 2**N values, in exchange for doing an additional O(N)
work to compute the product.
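
As a rough illustration of the single-loop idea (this is not the LibGfx/LibPDF code; `sample_linear_nd` and its callback signature are made up for this sketch), each of the 2**N corners gets its weight computed directly as a product of N per-dimension factors:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: multilinear interpolation of `f` at `point`,
// visiting each of the 2**N surrounding grid corners exactly once.
double sample_linear_nd(std::vector<double> const& point,
    std::function<double(std::vector<int> const&)> const& f)
{
    size_t n = point.size();
    assert(n > 0 && n < 31);

    std::vector<int> lower(n);
    std::vector<double> fraction(n);
    for (size_t i = 0; i < n; ++i) {
        lower[i] = static_cast<int>(std::floor(point[i]));
        fraction[i] = point[i] - lower[i];
    }

    double sum = 0;
    std::vector<int> corner(n);
    for (unsigned mask = 0; mask < (1u << n); ++mask) {
        // Bit i of `mask` picks the lower (0) or upper (1) neighbor in dimension i.
        double weight = 1;
        for (size_t i = 0; i < n; ++i) {
            bool upper = (mask >> i) & 1;
            corner[i] = lower[i] + (upper ? 1 : 0);
            weight *= upper ? fraction[i] : 1 - fraction[i];
        }
        // No intermediate array of 2**N samples and no second reduction loop.
        sum += weight * f(corner);
    }
    return sum;
}
```

For a function that is itself linear, such as f(x, y) = 2x + 3y, the result at (1.3, 4.4) matches the exact value 15.8 up to rounding.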

Takes

    Build/lagom/bin/image --no-output --invert-cmyk \
        --assign-color-profile \
            Build/lagom/Root/res/icc/Adobe/CMYK/USWebCoatedSWOP.icc \
        --convert-to-color-profile serenity-sRGB.icc \
        cmyk.jpg

from 3.42s to 3.08s on my machine, almost 10% faster (and less code).

Here cmyk.jpg is a 2253x3080 cmyk jpeg, and USWebCoatedSWOP.icc is an
mft2 profile with input tables with 256 samples and a 9x9x9x9 CLUT.

The LibPDF change is covered by TEST_CASE(sampled) in LibPDF.cpp,
and the LibGfx change is basically the same change as the one in
LibPDF (where the test results don't change) and the output
subjectively looks identical. So hopefully this causes indeed no
behavior change :^)
2024-02-04 21:49:23 +01:00
Nico Weber
955d73657e LibPDF: Make pdf --dump-contents dump less binary data
For pages containing images or embedded fonts, --dump-contents
used to dump a ton of binary data. That isn't very useful, so
stop doing it.

Before:

    % time Build/lagom/bin/pdf --render out.png \
        ~/Downloads/0000/0000711.pdf --dump-contents | wc -l
      937972

Now:

    % time Build/lagom/bin/pdf --render out.png \
        ~/Downloads/0000/0000711.pdf --dump-contents | wc -l
        6566

Printing 7k lines is also much faster than printing 940k,
0.15s instead of 2s.
2024-02-03 08:26:29 +00:00
Nico Weber
9c762b9650 LibPDF+Meta: Use a CMYK ICC profile to convert CMYK to RGB
CMYK data describes which inks a printer should use to print a color.
If a screen should display a color that's supposed to look similar
to what the printer produces, it results in a color very different
to what Color::from_cmyk() produces. (It's also printer-dependent.)

There are many ICC profiles describing printing processes. It doesn't
matter too much which one we use -- most of them look somewhat
similar, and they all look dramatically better than Color::from_cmyk().

This patch adds a function to download a zip file that Adobe offers
on their web site. They even have a page for redistribution:
https://www.adobe.com/support/downloads/iccprofiles/icc_eula_win_dist.html

(That one leads to a broken download though, so this downloads the
end-user version.)

In case we have to move off this download at some point, there are also
a whole bunch of profiles at https://www.color.org/registry/index.xalter
that "may be used, embedded, exchanged, and shared without restriction".

The Adobe zip contains a whole bunch of other useful and fun profiles,
so I went with it.

For now, this only unzips the USWebCoatedSWOP.icc file though, and
installs it in ${CMAKE_BINARY_DIR}/Root/res/icc/Adobe/CMYK/. In
Serenity builds, this will make it to /res/icc/Adobe/CMYK in the
disk image. And in lagom builds, after #23016 this is the
lagom res staging directory that tools can install via
Core::ResourceImplementation. `pdf` and `MacPDF` already do that,
`TestPDF` now does it too.

The final piece is that LibPDF then loads the profile from there
and uses it for DeviceCMYK color conversions.

(Doing file access from the bowels of a library is a bit weird,
especially in a system that has sandboxing built in. But LibGfx does
that in FontDatabase too already, and LibPDF uses that, so it's not a
new problem.)
2024-02-01 13:42:04 -07:00
Nico Weber
f840fb6b4e LibPDF: Make DeviceCMYKColorSpace::the() fallible
No behavior change.
2024-02-01 13:42:04 -07:00
Nico Weber
384c6cf0f9 LibPDF: Tweak vertical position of truetype fonts again
See #22821 for a previous attempt. This attempt should settle
things once and for all.

The opentype render path adjusts by `-font_ascender * -y_scale` in
Glyf::Glyph::append_simple_path(), so that's what we need to undo
to draw at the font's baseline.

(OpenType::Font::metrics() returns ascender scaled by y_scale already,
so no need to have the scale here where we undo the shift.)

Previously, we called `baseline()` which just returns the font's
font size, which is pretty meaningless:

https://tonsky.me/blog/font-size/
https://simoncozens.github.io/fonts-and-layout/opentype.html#vertical-metrics-hhea-and-os2

Also, conceptually it makes sense to translate up by the ascender
to get from the upper edge of the glyph to the baseline.
2024-02-01 10:05:40 +01:00
Nico Weber
87112dcbdc LibPDF: Return null for invalid refs, tolerate null objects as outline
https://llvm.org/devmtg/2022-11/slides/TechTalk5-WhatDoesItTakeToRunLLVMBuildbots.pdf
has an xref table that starts like so:

```
xref
0 214
0000000002 65535 f
0000924663 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000263 00000 n
```

This is a list of objects in the PDF file. The lines ending with 'f'
mean that this object is "free", that is it's not stored in the file.
In this file, objects 0, 2, 3 are free. For free objects, the first
number is the offset of the next free object: Object 0 refers to object
2, 2 to 3, and 3 back to 0 (since it's the last free object).
The lines ending with "n" are actual objects; here the first number is
a byte offset to where that object is stored in the file.

Furthermore, the file contains

```
/Outlines
2
0
R
```

in its root object, meaning that object 2 stores the page outlines.

Since object 2 is set as free, there is no object 2. But the spec
says that an invalid object reference is just the null object.

This patch makes us return null objects for references to free
objects, and it also makes us treat a null object as /Outlines value
the same as not having /Outlines in the first place.

Fixes #23023 -- we can now open that file. (We don't render it super
well, but only for already-known reasons.)

Since I found it a bit confusing: XRefTable has two related methods
here:

1. has_object() returns if an object was explicitly listed in an
   xref table. The first number right after `xref` is the start
   index. So if an xref table were to start with `10`, we'd implicitly
   create 10 leading objects for which has_object() would return false.
2. is_object_in_use() returns true if an object that was in a table
   (i.e. one where has_object() returns true) was listed with 'n' and
   false if it was listed with 'f'.
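
To make the distinction concrete, here is a simplified standalone sketch. It is not the real LibPDF `XRefTable`: the class name, the `listed` flag, and `byte_offset_for()` are assumptions for illustration. It also returns "null" for objects that were never listed, which is the behavior the last paragraph below only suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One line of an xref subsection.
struct XRefEntry {
    uint64_t offset_or_next_free { 0 }; // byte offset for 'n', next free object for 'f'
    uint16_t generation { 0 };
    bool in_use { false };              // true for 'n', false for 'f'
    bool listed { false };              // false for implicit placeholder entries
};

class XRefTableSketch {
public:
    // Adds a subsection whose first entry describes object number `start`.
    void add_section(size_t start, std::vector<XRefEntry> const& entries)
    {
        if (m_entries.size() < start + entries.size())
            m_entries.resize(start + entries.size());
        for (size_t i = 0; i < entries.size(); ++i) {
            m_entries[start + i] = entries[i];
            m_entries[start + i].listed = true;
        }
    }

    bool has_object(size_t index) const
    {
        return index < m_entries.size() && m_entries[index].listed;
    }

    bool is_object_in_use(size_t index) const
    {
        assert(has_object(index));
        return m_entries[index].in_use;
    }

    // Empty result means: resolve the reference to the null object.
    std::optional<uint64_t> byte_offset_for(size_t index) const
    {
        if (!has_object(index) || !is_object_in_use(index))
            return std::nullopt;
        return m_entries[index].offset_or_next_free;
    }

private:
    std::vector<XRefEntry> m_entries;
};
```

Fed with the first lines of the table quoted above, `byte_offset_for(2)` comes back empty (object 2 is free), while `byte_offset_for(1)` yields 924663.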

DocumentParser::parse_object_with_index() should probably return a null
object for the `!has_object()` case as well instead of VERIFY()ing
that has_object() is true. But I haven't seen this in the wild yet,
so keeping as-is for now.
2024-01-31 12:10:19 -05:00
Timothy Flynn
aa0a6d58b2 Userland: Remove LibCore dependency from libraries that do not use it 2024-01-22 08:48:34 -05:00
Nico Weber
a0462f495c LibPDF+MacPDF: Clip text, and add a debug option for disabling it 2024-01-20 08:56:03 +01:00
Nico Weber
90fdf738a1 LibPDF: Alphabetize clip_ fields in RenderingPreferences
No behavior change.
2024-01-20 08:56:03 +01:00
Nico Weber
66f8259a0b LibPDF: Move ClipRAII to .h file
No behavior change.
2024-01-20 08:56:03 +01:00
Tim Ledbetter
459fa8b840 LibPDF: Ensure that xref subsection numbers are u32
Previously, parsing an xref entry with a floating point subsection
number would cause a crash.
2024-01-18 15:11:42 +01:00
Nico Weber
d2f3288666 LibPDF: Apply text matrix to each glyph's position
We still don't apply it to the glyph itself, so they don't show up
scaled or rotated, but they're at the right spot now.

One big thing this here has going for it is that the final glyph
position is now calculated with just
`ext_rendering_matrix.map(glyph_position)`.

Also, character_spacing and word_spacing are now used unmodified
in the SimpleFont::draw_string() loop. This also means we no longer
have to undo a scale when updating the position in
`Renderer::show_text()`.

Most of the rest stays pretty yucky though. The root cause of many
problems is that ScaledFont has its rendering size baked into the
object. We want to render fonts at size font_size times scale from
text matrix times scale from current transformation matrix (but
not size from horizontal_scaling). So we have to make that the
font_size, but then we have to undo that in a bunch of places to
get the actual font size.

This will eventually get better when LibPDF moves off ScaledFont.
2024-01-18 14:01:30 +01:00
Nico Weber
f54b0e7c22 LibPDF: Don't accidentally put horizontal_scaling in places
Fonts should have size font_size times total scaling. We tried to
get that by computing text_rendering_matrix.x_scale() * font_size,
but text_rendering_matrix.x_scale() also includes
horizontal_scaling, which shouldn't be part of font size.

Same for character_spacing and word_spacing.

This is all a big mess that's caused by LibPDF using ScaledFont,
which requires scaling to be part of the text type. I have an
in-progress local branch that moves LibPDF to directly use VectorFont,
which will hopefully make this (and other things) nicer. But first,
let's get this right, and then make sure we don't regress it when
things change :^)
2024-01-18 14:01:30 +01:00
Nico Weber
abda5e66f6 LibPDF: Scale delta_x by horizontal_scaling in Renderer::show_text()
While PDFFont::draw_string() already returns a position scaled by
horizontal_scaling, the division by text_rendering_matrix.x_scale()
(which also contains the scaling factor) undid it. Reapply it.

Fixes the horizontal layout of the line
"should be the same on all lines: super" in Tests/LibPDF/text.pdf.
2024-01-18 14:01:30 +01:00
Nico Weber
470d1d8dcf LibPDF: Fix order of parameter, text, and current transform matrix
PDF spec 1.7 5.3.3 Text Space Details gives the correct multiplication
order: parameters * textmatrix * ctm.

We used to do text * ctm * parameters
(AffineTransform::multiply() does left-multiplication).

This only matters if `text_state().rise` is non-zero. In practice,
it's almost always zero, in which case the parameter matrix is a
diagonal matrix that commutes.

Fixes the horizontal offset of "super" in Tests/LibPDF/text.pdf.
2024-01-18 14:01:30 +01:00
Nico Weber
6c65c18c40 LibPDF: Add spec ref to Renderer::calculate_text_rendering_matrix() 2024-01-18 14:01:30 +01:00
Nico Weber
13f007aadb LibPDF: Tweak vertical position of truetype fonts
The vertical coordinates for truetype fonts are different somehow.
We compensated a bit for that; now we compensate some more.

This is still not 100% perfect, but much better than before.
2024-01-17 08:44:07 +00:00
Nico Weber
1845a406ea LibPDF: Add debug settings for clipping paths and images 2024-01-17 08:42:56 +00:00
Nico Weber
2d8a22f4b4 LibPDF: Clip images too
Since we can't clip against a general path yet, this clips images
against the bounding box of the current clip path as well.

Clips for images are often rectangular, so this works out well.

(We wastefully still decode and color-convert the entire image.
In a follow-up, we could consider only converting the unclipped
part.)
2024-01-17 08:42:56 +00:00
Nico Weber
5615a2691a LibPDF: Extract activate_clip() / deactivate_clip() functions
No behavior change.
2024-01-17 08:42:56 +00:00
MacDue
d55867e563 LibPDF: Fix paths with negatively sized re (rect) commands
Turns out the width/height in a `re` command can be negative. This
results in rectangles with different winding orders. For example, a
negative width results in a reversed winding order.

Previously, this was lost by passing the rect through an
`AffineTransform` before constructing the path. So instead, this
constructs the rect path, and then transforms the resulting path.
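
A small standalone check of the winding claim, assuming the usual expansion of `re` into a move, three lines, and a close (the `Point` type and both function names are made up for this sketch):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Point {
    double x;
    double y;
};

// Corner order used for `x y width height re`:
// (x, y), (x + width, y), (x + width, y + height), (x, y + height).
std::array<Point, 4> rect_path(double x, double y, double width, double height)
{
    return { { { x, y }, { x + width, y }, { x + width, y + height }, { x, y + height } } };
}

// Shoelace formula. The sign of the result encodes the winding order.
double signed_area(std::array<Point, 4> const& points)
{
    double twice_area = 0;
    for (size_t i = 0; i < points.size(); ++i) {
        Point const& a = points[i];
        Point const& b = points[(i + 1) % points.size()];
        twice_area += a.x * b.y - b.x * a.y;
    }
    return twice_area / 2;
}
```

`signed_area(rect_path(0, 0, 2, 3))` is 6, while `signed_area(rect_path(0, 0, -2, 3))` is -6: same size, opposite winding. Normalizing the rect to positive width and height before building the path would lose that sign.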
2024-01-16 21:31:20 +00:00
Nico Weber
0e91682283 LibPDF: Be more forgiving about trailing image data
The predictor code assumed that all stream data is image data
(...which would make sense: trailing data there is wasted space).

But some PDFs have trailing data there, e.g. 0000257.pdf, so be
forgiving about it.
2024-01-16 09:55:11 -05:00
Nico Weber
b34509edd2 LibPDF: Make pdf --dump-contents handle \r line endings better
Previously, all page contents ended up overprinting a single line
over and over for PDFs that used only `\r` as line ending.

This is for example useful for 0000364.pdf.
2024-01-15 23:16:45 -07:00
Nico Weber
9f9dbb325b LibPDF: Make prediction filters error on user-controlled alloc OOM 2024-01-15 23:06:06 -07:00
Nico Weber
93f5420282 LibPDF: Start implementing the TIFF predictor
This codepath is separate from the predictor in the TIFF decoder.
The TIFF decoder currently does bits->Color conversion before
processing the predictor. That doesn't fit the PDF model where
filters are processed before converting streams into bitmaps.

If this code here ever grows to handle all cases, maybe we can move
it over to the TIFF decoder and then make it do predictions before
decoding to colors, to share this code.

(TIFF prediction is pretty messy since it's bits-per-pixel-dependent.
PNG prediction is always byte-based, which makes things easier.)
2024-01-15 23:06:06 -07:00
Nico Weber
9a93f677f4 LibPDF: Mark text rendering matrix as dirty after TJ numbers
Mostly because I audited all places that assigned to `m_text_matrix`
after #22760.

This one is very difficult to trigger in practice.

`show_text()` marks the text rendering matrix dirty already,
so this only has an effect if the `TJ` array starts with a
number, and the matrix isn't marked dirty going in.

`Tm` caches the text rendering matrix, so I changed text.pdf
to contain:

```
1 0 0 1 45 130 Tm
[ 200 (Hello) -2000 (World) ] TJ T*
```

This first sets an x offset of 5 (on top of the normal 40), and
then undoes it (`200` is multiplied by font size (25) / -1000,
and `200 * 25 / -1000` is -5). Before this change, the topmost
"Hello World" ended up slightly indented.

Likely no behavior change in practice, but makes the code easier
to understand, and maybe it helps in the wild somewhere.
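
The arithmetic above as a tiny standalone function (the name and the optional `horizontal_scaling` parameter are assumptions for this sketch, not LibPDF API):

```cpp
#include <cassert>

// Horizontal shift caused by one number in a TJ array. TJ numbers are in
// thousandths of a text space unit and are subtracted from the position.
double tj_number_to_offset(double tj_number, double font_size, double horizontal_scaling = 1.0)
{
    assert(font_size > 0);
    return -tj_number / 1000.0 * font_size * horizontal_scaling;
}
```

With the values from the test file, `tj_number_to_offset(200, 25)` gives -5, which cancels the extra 5 units that `Tm` added.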
2024-01-15 08:39:04 +00:00
Nico Weber
f23f5dcd62 LibPDF: Mark text rendering matrix dirty for Td operator
0000342.pdf page 5 contains this snippet:

```
/T1_1 10.976 Tf
0 -31.643 TD
(This)Tj

1 0 0 1 54 745.563 Tm
22.181 -31.643 Td
[(vehicle)-270.926(uses)...
```

The `Tm` marked the text rendering matrix as dirty at the start,
but it then calls calculate_text_rendering_matrix() almost in the
next line, which recalculates the text rendering matrix and caches
the new matrix. The `Td` used to not mark it as dirty, and we'd
draw "vehicle" with an incorrect matrix.
2024-01-15 08:37:55 +00:00
Nico Weber
f4ee9a2333 LibPDF: Support drawing images with 16 bits per channel
This uses the tried-and-true "throw away the lower 8 bits" technique
for now. This lets us render Tests/LibPDF/wide-gamut-only.pdf.
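
A minimal sketch of that technique, assuming 16-bit samples are stored high-order byte first (the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reduces 16-bit samples to 8 bits by keeping only the high byte of each pair.
std::vector<uint8_t> keep_high_bytes(std::vector<uint8_t> const& data)
{
    assert(data.size() % 2 == 0);
    std::vector<uint8_t> out;
    out.reserve(data.size() / 2);
    for (size_t i = 0; i < data.size(); i += 2)
        out.push_back(data[i]); // data[i + 1] holds the discarded low 8 bits
    return out;
}
```

The bytes `12 34 ff 00` (two samples, 0x1234 and 0xff00) become `12 ff`.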
2024-01-12 16:20:46 -07:00
Nico Weber
5f85aff036 LibPDF: Move ColorSpace::style() to take ReadonlySpan<float>
All ColorSpace subclasses converted to float anyways, and this
allows us to save lots of float->Value->float conversions during
image color space processing.

A bit faster:

```
    N           Min           Max        Median         Avg       Stddev
x  50    0.99054313     1.0412271    0.99933481   1.0052408  0.012931916
+  50    0.97073889     1.0075941    0.97849107  0.98184034 0.0090329046
Difference at 95.0% confidence
	-0.0234004 +/- 0.00442595
	-2.32785% +/- 0.440287%
	(Student's t, pooled s = 0.0111541)
```
2024-01-12 12:37:56 +00:00
Nico Weber
56a4af8d03 LibPDF: Don't reallocate Vectors in ICCBasedColorSpace all the time
Microoptimization; according to ministat a bit faster:

```
    N           Min           Max        Median         Avg       Stddev
x  50     1.0179932     1.0561159     1.0315337   1.0333617 0.0094757426
+  50      1.000875     1.0427601     1.0208509   1.0201902   0.01066116
Difference at 95.0% confidence
	-0.0131715 +/- 0.00400208
	-1.27463% +/- 0.387287%
	(Student's t, pooled s = 0.0100859)
```
2024-01-12 12:37:56 +00:00
Nico Weber
cfd05b1a55 LibPDF: Use MatrixMatrixConversion when possible
Reduces time spent rendering page 3 of 0000849.pdf from 1.32s to 1.13s
on my machine.

Also reduces the time to run Meta/test_pdf.py on 0000.zip
(without 0000849.pdf) from 56s to 54s.
2024-01-12 09:09:56 +01:00
Nico Weber
c161b2d2f9 LibPDF: Extract ICCBasedColorSpace::sRGB() helper 2024-01-12 09:09:56 +01:00
Nico Weber
f7fc2df8ac LibPDF: Simplify load_image() a tiny bit
Images can't use Pattern color spaces, so we'll always have a Color.

No behavior (or perf) change.
2024-01-10 23:26:57 +01:00
Nico Weber
df5451a889 LibPDF: Mark text rendering matrix dirty after changing it in text_begin
A certain PDF that was drawing some text used `9 0 0 9 474.54 700.6801 Tm`
to set the text matrix to a matrix that scaled by 9 in one text object.

Then, after ending that text object, it had the following new text
object which contained nothing that invalidated the text matrix:

```
BT
/F1 7 Tf
/DeviceRGB CS
0 0 0 SC
10 TL
86.37849 21.908 Td
(Authorized licensed use limited to: ...) Tj
ET
```

`BT` did reset it as required, but since we didn't mark the matrix
as dirty, we never recomputed it and drew the additional text scaled
up 9x.
2024-01-10 19:42:08 +01:00
Nico Weber
4fd5d450be LibPDF: Add support for image masks
An image mask is a 1-bit-per-pixel bitmap that's black where the
current color should be painted, and white where it should be
transparent (think: like ink).

load_image() already converts images like this into 8-bit-per-pixel
images that have 0xff, 0xff, 0xff in rgb for opaque (originally 0 bit)
pixels and 0, 0, 0 in rgb for transparent pixels.

So we just copy the image mask's image data into the alpha
channel and replace rgb with the current color, and then draw
it like a regular bitmap.
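
Roughly, in standalone form (the `RGBA` struct and `apply_image_mask` are made-up names; the input is assumed to already be one byte per pixel, 0xff where the mask paints and 0 where it is transparent):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct RGBA {
    uint8_t r;
    uint8_t g;
    uint8_t b;
    uint8_t a;
};

// Turns per-pixel mask values into pixels of the current color whose alpha
// channel is the mask value.
std::vector<RGBA> apply_image_mask(std::vector<uint8_t> const& mask_values, RGBA current_color)
{
    std::vector<RGBA> pixels;
    pixels.reserve(mask_values.size());
    for (uint8_t value : mask_values)
        pixels.push_back({ current_color.r, current_color.g, current_color.b, value });
    assert(pixels.size() == mask_values.size());
    return pixels;
}
```

The result can then go through the same drawing path as any other bitmap with an alpha channel.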
2024-01-10 09:10:11 +00:00
Nico Weber
e770cf06b0 LibPDF: Send jpeg data down the same path as all other data
JPEG images now honor decode arrays and color spaces.
2024-01-10 09:39:00 +01:00
Nico Weber
f157cd50a1 LibPDF: Use mix() in SampledFunction::evaluate()
No behavior change.
2024-01-04 21:12:23 +01:00
Nico Weber
e16345555b LibPDF: Port 59b50fa43f8c2 to xref and object streams
0000440.pdf contains an xref stream object (at offset 3643676) starting:

```
294 0 obj <<
/Type /XRef
/Index [0 295]
/Size 295
```

and an object stream object (at offset 3640121) starting:

```
230 0 obj <<
/Type /ObjStm
/N 73
/First 614
```

In both cases, the `obj` and the `<<` are separated by non-newline
whitespace.

633e1632d0 made parse_indirect_value() tolerate this, but it
updated neither parse_xref_stream() (which parses xref streams) nor
parse_compressed_object_with_index() (which parses object streams),
despite all three changes being part of #14873.

Make parse_xref_stream() and parse_compressed_object_with_index()
call parse_indirect_value() to pick up the fix over there. It's a bit
less code too.

(0000440.pdf is the only PDF in my 1000 test PDFs that this helps,
somewhat surprisingly.)
2024-01-04 11:27:24 +01:00
Nico Weber
9d69c5d434 LibPDF: Tolerate trailing whitespace after %%EOF marker
At first I tried implementing the quirk from PDF 1.7 Appendix H,
3.4.4, "File Trailer": """Acrobat viewers require only that the %%EOF
marker appear somewhere within the last 1024 bytes of the file."""
This would've been like #22548 but at end-of-file instead of at
start-of-file.

This helped a bunch of files, but also broke a bunch of files that
had more than 1024 bytes of stuff at the end, and it wouldn't have
helped 0000059.pdf, which has over 40k of \0 bytes after the %%EOF.
So just tolerate whitespace after the %%EOF line, and keep ignoring
an arbitrary amount of other stuff after that like before.

This helps:
* 0000599.pdf
  One trailing \0 byte after %%EOF. Due to that byte, the
  is_linearized() check fails and we go down the non-linearized
  codepath. But with this fix, that code path succeeds.
* 0000937.pdf
  Same.
* 0000055.pdf
  Has one space followed by a \n after %%EOF
* 0000059.pdf
  Has over 40kB of trailing \0 bytes

The following files keep working with it:
* 0000242.pdf
  5586 bytes of trailing HTML
* 0000336.pdf
  5586 bytes of trailing HTML fragment
* 0000136.pdf
  2054 bytes of trailing space characters
  This one kind of only worked by accident before since it found
  the %%EOF block before the final %%EOF block. Maybe this is
  even an intentional XRefStm compat hack? Anyways, now it
  finds the final block instead.
* 0000327.pdf
  11044 bytes of trailing HTML
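
A sketch of just the whitespace-tolerant part (function names are assumptions; the existing fallback that skips arbitrary non-whitespace junk after the marker is not shown):

```cpp
#include <cassert>
#include <string_view>

// PDF white-space characters: NUL, tab, line feed, form feed, carriage return, space.
bool is_pdf_whitespace(char c)
{
    return c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// True if `data` ends with %%EOF once trailing white-space is ignored.
bool ends_with_eof_marker(std::string_view data)
{
    while (!data.empty() && is_pdf_whitespace(data.back()))
        data.remove_suffix(1);
    constexpr std::string_view marker = "%%EOF";
    return data.size() >= marker.size()
        && data.substr(data.size() - marker.size()) == marker;
}
```

This accepts a file ending in `%%EOF` plus a space and newline, or plus any number of \0 bytes, and rejects one that ends in trailing HTML.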
2024-01-04 11:19:15 +01:00
Nico Weber
2d12647e29 LibPDF: Add FIXME for "was linearized PDF incrementally updated" check
It's pretty tricky to do, and also tricky with respect to skipping
trailing bytes after %%EOF: The check requires knowing the full size of
the PDF (which means web servers not sending content lengths are out),
but that size has to be after stripping trailing bytes, which normal
static file servers won't do. So PDF viewers would have to download the
last couple bytes of the PDF unconditionally, then strip trailing bytes
and use the count to figure out the final actual PDF size.

Luckily, we don't incrementally download PDFs from the net but
instead require all data to be available in one chunk, so it's
not currently a problem.
2024-01-04 11:19:15 +01:00
Nico Weber
1b45c3e127 LibPDF: Tolerate whitespace after xref and startxref
The spec isn't super clear on if this is allowed:

"""Each cross-reference section shall begin with a line containing the
keyword xref. Following this line..."""

"""The two preceding lines shall contain, one per line and in order, the
keyword startxref and..."""

It kind of sounds like anything goes on both lines as long as they
contain `xref` and `startxref`.

In practice, both seem to always occur at the start of their line,
but in 0000780.pdf (and nowhere else), there's one space after each
keyword before the following linebreak, and this makes that file load.
2024-01-04 10:14:30 +01:00
Nico Weber
efb37f7252 LibPDF: Add Reader::consume_non_eol_whitespace() 2024-01-04 10:14:30 +01:00
Nico Weber
c59e08123b LibPDF: Add a FIXME and a spec comment to Encoding::from_object() 2024-01-04 10:12:11 +01:00
Nico Weber
ad5fc0eda1 LibPDF: An Encoding's /Differences entry is optional
Per "TABLE 5.11 Entries in an encoding dictionary", /Differences is
optional.

(Per "Encodings for TrueType Fonts" in 5.5.5 Character Encoding,
nonsymbolic truetype fonts are even recommended to have "no Differences
array." But in practice, most seem to have it.)

Fixes crashes on:
* 0000001.pdf
* 0000574.pdf
* 0000337.pdf

All three don't render super great, but at least they no longer crash.
2024-01-04 10:12:11 +01:00
Nico Weber
0bb0c7dac2 LibPDF: Scan for PDF file start in first 1024 bytes
Other readers do this too, and files depend on this.

Fixes opening these four files from the PDFA 0000.zip dataset:

* 0000015.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` before header
* 0000408.pdf
  Starts with UTF-8 BOM
* 0000524.pdf
  Starts with 867 bytes of HTML containing a PHP backtrace
* 0000680.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` too
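
In standalone form the scan could look like this (hypothetical function name; this version assumes the whole `%PDF-` marker has to lie within the first 1024 bytes):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the offset of the "%PDF-" header if it is found in the first
// 1024 bytes of `data`, or an empty optional otherwise.
std::optional<size_t> find_pdf_header(std::string_view data)
{
    constexpr std::string_view marker = "%PDF-";
    size_t position = data.substr(0, 1024).find(marker);
    if (position == std::string_view::npos)
        return std::nullopt;
    return position;
}
```

For a file that starts with a UTF-8 BOM, the returned offset is 3; everything before that offset gets skipped.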
2024-01-03 10:12:35 +01:00
Nico Weber
9495f64f91 LibPDF: Improve hex string parsing
A local (non-public) PDF I have lying around contains this in
a page's operator stream:

```
[<00b4003e> 3 <002600480051> 3 <005700550044004f0003> -29
<00330044> 3 <0055> -3 <004e0040> 4 <0003> -29 <004c00560003> -31
<0057004b> 4 <00480003> -37 <0050
>] TJ
```

That is, there's a newline in a hexstring after a character.

This led to `Parser error at offset 5184: Unexpected character`.

The spec says in 3.2.3 String Objects, Hexadecimal Strings:
"""Each pair of hexadecimal digits defines one byte of the string.
White-space characters (such as space, tab, carriage return, line feed,
and form feed) are ignored."""

But we didn't ignore whitespace before or after a character, only
in between the bytes.

The spec also says:
"""If the final digit of a hexadecimal string is missing—that is, if
there is an odd number of digits—the final digit is assumed to be 0."""

In that case, we were skipping the closing `>` twice -- or, more
accurately, we ignored the character after it too. This has been
wrong all the way back in #6974.

Add a test that fails if either of the two changes isn't present.
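
Both rules from the spec quotes, as a self-contained sketch (this is not the LibPDF parser; `parse_hex_string_body` is a made-up helper that takes only the text between `<` and `>`):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Value of one hex digit, or -1 if `c` is not a hex digit.
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

std::optional<std::vector<uint8_t>> parse_hex_string_body(std::string_view body)
{
    std::vector<uint8_t> bytes;
    int pending = -1; // high nibble waiting for its partner, or -1
    for (char c : body) {
        // White-space is ignored anywhere, including after the last digit.
        if (c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ')
            continue;
        int digit = hex_digit_value(c);
        if (digit < 0)
            return std::nullopt;
        if (pending < 0) {
            pending = digit;
        } else {
            bytes.push_back(static_cast<uint8_t>(pending * 16 + digit));
            pending = -1;
        }
    }
    if (pending >= 0)
        bytes.push_back(static_cast<uint8_t>(pending * 16)); // missing final digit counts as 0
    assert(bytes.size() <= (body.size() + 1) / 2);
    return bytes;
}
```

The body `0050` followed by a newline parses to the two bytes 00 50, and the odd-length body `901FA` parses to 90 1F A0.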
2024-01-02 22:13:21 +01:00
Lucas CHOLLET
f389c1cdba LibGfx+LibPDF: Use LibCompress' implementation of the PackBits decoder
No need to have these three copies :^)
2023-12-27 17:40:11 +01:00
Shannon Booth
e2e7c4d574 Everywhere: Use to_number<T> instead of to_{int,uint,float,double}
In a bunch of cases, this actually ends up simplifying the code as
to_number will handle something such as:

```
Optional<I> opt;
if constexpr (IsSigned<I>)
    opt = view.to_int<I>();
else
    opt = view.to_uint<I>();
```

For us.

The main goal here however is to have a single generic number conversion
API between all of the String classes.
2023-12-23 20:41:07 +01:00
Nico Weber
b63eb4a4dd LibPDF: Implement /Mask support with stream object argument 2023-12-23 20:39:11 +01:00
Nico Weber
a3507ef65b LibPDF: Move error for /ImageMask out of load_image()
...and tweak load_image() to support loading mask images
(which don't have a color space and are always 1 bit per pixel).
2023-12-23 20:39:11 +01:00
Nico Weber
3ad9782e25 LibPDF: Extract an apply_alpha_channel() function
No behavior change.
2023-12-23 20:39:11 +01:00
Nico Weber
4bd11c8eb4 LibPDF: Show a 'rendering unsupported' error for images with /Mask key 2023-12-23 20:39:11 +01:00
Nico Weber
387fecea7f LibPDF: Fix typo in a variable name
No behavior change.
2023-12-23 10:10:24 +01:00
Nico Weber
6723552e95 LibPDF: Add a spec comment and remove a FIXME
I think the ASCIIHexDecode / ASCII85Decode unfilter functions handle
what this FIXME was about already.
2023-12-22 10:58:54 +01:00
Nico Weber
3d07684891 LibPDF: Extract Parser::parse_inline_image()
Pure code move, no intended behavior change.

The motivation is just to make Parser::parse_operators() less nested
and more focused.
2023-12-22 10:58:54 +01:00
Nico Weber
6032c06f6b Revert "LibPDF: Add basic tiled, coloured pattern rendering"
This reverts commit 8ff87911a3.
2023-12-21 19:24:56 +01:00
Nico Weber
7cb216c95b Revert "LibPDF: Offset PaintStyle when painting so pattern overlaps..."
This reverts commit 8c7fc4fe6c.
2023-12-21 19:24:56 +01:00
Nico Weber
6de32e5359 LibPDF: Draw inline images
The idea is to massage the inline image data into something that
looks like a regular image, and then use the normal image drawing code:
We translate the inline image abbreviations to the expanded version at
rendering time, then unfilter (i.e. uncompress) the image data at
rendering time, and then go down the usual image drawing path.

Normal streams are unfiltered when they're first accessed, but
inline image streams live in a page's drawing operators, and this
fits the current approach of parsing a page's operators anew
every time the page is rendered.

(We also need to add some special-case handling for color spaces
of inline images: Inline images can use named color spaces, while
regular images always use direct color space objects.)
2023-12-20 12:45:16 -07:00
Nico Weber
d577d181e3 LibPDF: Clamp linear_srgb values in convert_to_srgb()
This is very crude gamut mapping, but it's better than producing
NaNs when passing negative values to powf(x, 1/2.2).
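
In isolation the clamp looks like this (a sketch with an invented function name, using the 1/2.2 exponent mentioned above):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Clamps a linear value to [0, 1] before applying the gamma curve, so that
// out-of-gamut negative inputs yield 0 instead of NaN.
float linear_to_gamma_2_2(float linear)
{
    float clamped = std::clamp(linear, 0.0f, 1.0f);
    return std::pow(clamped, 1.0f / 2.2f);
}
```

Without the clamp, `std::pow(-0.1f, 1.0f / 2.2f)` is NaN, since a negative base with a non-integer exponent has no real result.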
2023-12-20 12:45:07 +01:00
Nico Weber
022fce75a6 LibPDF: Get inline image data from parser to renderer
We create an inline_image_end operator that has all the relevant data
in a synthetic StreamObject.

inline_image_end is still a RENDERER_TODO(), so no real behavior
change. (Previously we'd call only inline_image_begin, so the string the
todo message is about is now a bit different. But no interesting
behavior change.)
2023-12-20 12:19:08 +01:00
Nico Weber
3285502ec6 LibPDF: Extract a Parser::unfilter_stream() method
No behavior change.
2023-12-20 12:19:08 +01:00
Nico Weber
b21f867e88 LibPDF: Don't crash on images with empty filter arrays
0000967.pdf page 2 contains a bunch of inline images with empty
filter arrays.
2023-12-20 12:19:08 +01:00
Nico Weber
13641693cb LibPDF: Use make_object<>() to make objects
No behavior change.
2023-12-20 12:19:08 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Nico Weber
f2f07c3a80 LibPDF: Replace if (a) VERIFY(0) with VERIFY(!a)
No behavior change.
2023-12-16 12:39:56 +01:00
Nico Weber
ee74bc2538 LibPDF: Tolerate 0-sized Subrs in PS1 font subprograms
This regressed in 2b3a41be74 in #18031.

Fixes a crash rendering page 2 and onward of
https://pyx-project.org/presentation_dantemv35_en.pdf
2023-12-16 12:39:56 +01:00
Nico Weber
11354dbf9e LibPDF: Remember inline image stream bytes
We still don't process inline images, but now we have the pieces we need
for doing it (`map` and `stream_bytes`).
2023-12-11 10:50:39 +01:00
Nico Weber
cabc6a9d80 LibPDF: Add a comment that PDF 2.0 added a length key for inline images
In practice, basically no file has it, since it was only added in 2.0,
and 1.7 explicitly said "in particular, the Type, Subtype, and Length
entries normally found in a stream or image dictionary are unnecessary."
2023-12-11 10:50:39 +01:00
Nico Weber
071f890847 LibPDF: Require whitespace in front of inline image marker EI
Fixes a crash on page 3 of 0000450.pdf of 0000.zip, where we previously
started interpreting the middle of an inline image content stream as
operators, since it contained `EI` in its pixel data.
2023-12-11 10:50:39 +01:00
Nico Weber
27aae7e2b1 LibPDF: Parse inline image key-value pairs
Not used for anything yet.
2023-12-11 10:50:39 +01:00
Nico Weber
0912896ae0 LibPDF: Extract Parser::parse_dict_contents_until()
No behavior change.
2023-12-11 10:50:39 +01:00
Kyle Pereira
8c7fc4fe6c LibPDF: Offset PaintStyle when painting so pattern overlaps properly 2023-12-10 16:44:24 +01:00
Kyle Pereira
8ff87911a3 LibPDF: Add basic tiled, coloured pattern rendering 2023-12-10 16:44:24 +01:00
Kyle Pereira
8191f2b47a LibPDF: Add parameter for background color of render 2023-12-10 16:44:24 +01:00
Kyle Pereira
60c4803dd3 LibPDF: Pass Renderer to ColorSpace 2023-12-10 16:44:24 +01:00
Kyle Pereira
082a4197b6 LibPDF: Use Variant<Color, PaintStyle> instead of Color for ColorSpaces
This is in anticipation of Pattern color space support which does not
yield a simple color.
2023-12-10 16:44:24 +01:00
Kyle Pereira
e4b8d68039 LibPDF: Permit comments at the end of a stream 2023-12-10 16:44:24 +01:00
Nico Weber
8b50b689f9 LibPDF: Reject invalid "hival" values
Doesn't fire on any of the PDFs I have, and seems like a good thing
to check.
2023-12-07 08:10:40 +00:00
Nico Weber
43cd3d7dbd LibPDF: Tolerate palettes that are one byte too long
Fixes these errors from `Meta/test_pdf.py path/to/0000`, with
0000 being the unzipped 0000.zip from the PDF/A corpus:

    Malformed PDF file: Indexed color space lookup table doesn't
                        match size, in 4 files, on 8 pages, 73 times
      path/to/0000/0000206.pdf 2 4 (2x) 5 (3x) 6 (4x)
      path/to/0000/0000364.pdf 5 6
      path/to/0000/0000918.pdf 5
      path/to/0000/0000683.pdf 8
2023-12-07 08:10:40 +00:00
Nico Weber
832a065687 LibPDF: For low-bpp images, start scanlines on byte boundaries
Required per spec, and we get slanted images without it. Fixes e.g.
page 1 of 0000749.pdf.
2023-12-07 08:10:40 +00:00
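A small sketch of the byte-boundary rule from 832a065687, assuming samples are packed most significant bits first and that every scanline is padded to a whole number of bytes. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes per scanline when every row starts on a byte boundary.
size_t bytes_per_row(size_t width, unsigned bits_per_sample)
{
    return (width * bits_per_sample + 7) / 8;
}

// Reads sample (x, y) from packed image data, most significant bits first.
uint8_t read_sample(std::vector<uint8_t> const& data, size_t width,
    unsigned bits_per_sample, size_t x, size_t y)
{
    assert(bits_per_sample == 1 || bits_per_sample == 2 || bits_per_sample == 4);
    size_t bit_offset = x * bits_per_sample;
    uint8_t byte = data[y * bytes_per_row(width, bits_per_sample) + bit_offset / 8];
    unsigned shift = 8 - bits_per_sample - static_cast<unsigned>(bit_offset % 8);
    return static_cast<uint8_t>((byte >> shift) & ((1u << bits_per_sample) - 1));
}
```

For a 3-pixel-wide 4 bpp image, each row takes 2 bytes and the low nibble of the second byte is padding. Indexing rows by `width * bits_per_sample` bits instead would shift every following row by half a byte, which is the slant mentioned above.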
Nico Weber
06b9633da5 LibPDF: For indexed images with 1, 2 or 4 bpp, do not repeat bit pattern
When upsampling e.g. the 4-bit value 0b1101 to 8-bit, we used to repeat
the value to fill the full 8-bits, e.g. 0b11011101. This maps RGB colors
to 8-bit nicely, but is the wrong thing to do for palette indices.
Stop doing this for palette indices.

Fixes "Indexed color space index out of range" for 11 files in the
PDF/A 0000.zip test set now that we correctly handle palette indices
as of the previous commit:

    Malformed PDF file: Indexed color space lookup table doesn't match
                        size, in 4 files, on 8 pages, 73 times
      path/to/0000/0000206.pdf 2 4 (2x) 5 (3x) 6 (4x)
      path/to/0000/0000364.pdf 5 6
      path/to/0000/0000918.pdf 5
      path/to/0000/0000683.pdf 8
2023-12-07 08:10:40 +00:00
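To illustrate the difference described in 06b9633da5, here is a hedged sketch with two hypothetical helpers: one repeats the bit pattern (fine for color components), the other leaves a palette index untouched.

```cpp
#include <cassert>
#include <cstdint>

// Expands a 1-, 2- or 4-bit color component to 8 bits by repeating its bits,
// e.g. 0b1101 becomes 0b11011101.
uint8_t replicate_to_8_bits(uint8_t value, unsigned bits)
{
    assert(bits == 1 || bits == 2 || bits == 4);
    uint8_t result = 0;
    for (unsigned filled = 0; filled < 8; filled += bits)
        result = static_cast<uint8_t>((result << bits) | value);
    return result;
}

// A palette index must keep its numeric value when widened to 8 bits.
uint8_t widen_palette_index(uint8_t value)
{
    return value;
}
```

Index 13 of a 16-entry palette stays 13; replicating it would yield 221, which is out of range for that palette.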
Nico Weber
8733ba2734 LibPDF: Fix decoding of IndexedColorSpace for palette sizes != 255
Previously, we were scaling palette indices from 0..(palette_size - 1)
to 0..255 before using them as index into the palette. Instead, do not
scale palette indices before using them as indices.

(Renderer::load_image() uses `component_value_decoders.empend(
.0f, 255.0f, dmin, dmax)`, so to get an identity mapping, we have to
return `0, 255` from IndexedColorSpace::default_decode()).

Fixes rendering of the gradient on page 5 of 0000277.pdf.
2023-12-06 15:32:13 +01:00
Nico Weber
4cb0593daf LibPDF: Convert LAB values to bytes differently
Gfx::ICC::Profile's current API takes bytes, so we need to do some
contortions for LAB values to go through.

This will probably become nicer once we implement all the backward
transforms in Gfx::ICC::Profile, but for now let's hack it in
on the LibPDF side.

Makes colors in 0000651.pdf look good, especially on pages 1 and 7-12.
2023-12-05 11:36:44 -05:00
Nico Weber
b2a1130556 LibGfx/ICC: Implement conversion between different connection spaces
If one profile uses PCSXYZ and the other PCSLAB as connection space,
we now do the necessary XYZ/LAB conversion.

With this and the previous commits, we can now convert from profiles
that use PCSLAB with mAB, such as stress.jpeg from
https://littlecms.com/blog/2020/09/09/browser-check/ :

    % Build/lagom/icc --name sRGB --reencode-to serenity-sRGB.icc
    % Build/lagom/bin/image -o out.png \
        --convert-to-color-profile serenity-sRGB.icc \
        ~/src/jpegfiles/stress.jpeg
2023-12-04 08:02:36 +00:00
Nico Weber
1c88b82dfc LibPDF: Do less work in SampledFunction::evaluate()'s inner loop
Instead of recomputing the left index and the float amount in that
interval for each coordinate all the time, do it once when we
preprocess the input coordinates.

One line less, faster, and arguably easier to read.

No behavior change.
2023-12-02 22:26:13 +01:00
Nico Weber
54883b7d41 LibPDF: Remove get_bounds lambda in SampledFunction::evaluate()
Using `min()` to guarantee the left index is never == `size() - 1`,
even for an interpolation value of 1.0, is less code, and arguably
easier to understand as well.

No behavior change.
2023-12-02 22:26:13 +01:00
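The `min()` trick from 54883b7d41 can be sketched for the 1-D case as follows. This is a simplified stand-in, not the code in SampledFunction::evaluate(); it assumes the input coordinate has already been mapped into the range 0..size() - 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear interpolation in a 1-D sample table.
float sample_linear(std::vector<float> const& samples, float x)
{
    assert(samples.size() >= 2);
    assert(x >= 0 && x <= static_cast<float>(samples.size() - 1));
    // Clamp the left index so that left + 1 is always a valid index,
    // even when x sits exactly on the last sample.
    size_t left = std::min(static_cast<size_t>(x), samples.size() - 2);
    float t = x - static_cast<float>(left); // 0..1, and exactly 1 at the right edge
    return samples[left] * (1 - t) + samples[left + 1] * t;
}
```

At the right edge the interpolation factor becomes 1.0 and the result is the last sample, so no separate bounds lambda is needed.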
Nico Weber
d9fd72007e LibPDF: Add a spec comment to SampledFunction::sample() 2023-12-02 22:26:13 +01:00
Idan Horowitz
aad5c58996 LibPDF: Eliminate reference cycle between OutlineItem parent/children
Since all parents held a reference pointer to their children, and all
children held reference pointers to their parents, both objects would
never get freed once the document was no longer being used.

Fixes ossfuzz-63833.
2023-12-02 22:23:53 +01:00
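A generic illustration of breaking such a parent/child cycle, written with the C++ standard library smart pointers rather than AK's RefPtr types. The struct below is a hypothetical stand-in and does not claim to mirror how aad5c58996 changed OutlineItem.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for an outline tree node.
struct OutlineNode {
    std::weak_ptr<OutlineNode> parent;                  // non-owning back-reference
    std::vector<std::shared_ptr<OutlineNode>> children; // owning references
};

// Appends a new child to `parent` and returns it.
std::shared_ptr<OutlineNode> add_child(std::shared_ptr<OutlineNode> const& parent)
{
    auto child = std::make_shared<OutlineNode>();
    child->parent = parent;
    parent->children.push_back(child);
    return child;
}
```

Because only the parent-to-child direction owns, dropping the last reference to the root releases the whole tree.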
Lucas CHOLLET
2a5cb5becb LibCompress: Add LZWDecoder::decode_all()
This method takes bytes as input and decompresses everything to a
ByteBuffer. It uses two control codes (clear and end of data) as
described in the GIF, TIFF and PDF specifications.
2023-12-01 12:58:14 +01:00
Nico Weber
f34da6396f LibPDF: Update font size after getting font from cache
Page 1 of 0000277.pdf does:

    BT 22 0 0 22  59  28 Tm /TT2 1 Tf
        (Presented at Photonics West OPTO, February 17, 2016) Tj ET
    BT 32 0 0 32 269 426 Tm /TT1 1 Tf
        (Robert W. Boyd) Tj ET
    BT 22 0 0 22 253 357 Tm /TT2 1 Tf
        (Department of Physics and) Tj ET
    BT 22 0 0 22 105 326 Tm /TT2 1 Tf
        (Max-Planck Centre for Extreme and Quantum Photonics) Tj ET

Every line begins a text operation, then updates the font matrix,
selects a font (TT2, TT1, TT2, TT2), draws some text and ends the text
operation.

`Tm` (which sets the font matrix) contains a scale, and uses that
to update the font size of the currently-active font (cf #20084).
But in this file, we `Tm` first and `Tf` (font selection) second,
so this updates the size of the old font. So when we pull it out
of the cache again on line 3, it would still have the old size
from the `Tm` on line 2.

(The whole text scaling logic in LibPDF imho needs a rethink; the
current approach also causes issues with zero-width glyphs which
currently lead to divisions by zero. But that's for another PR.)

Fixes another regression from c8510b58a3 (which I've accidentally
referred to by 2340e834cd in another commit).
2023-11-26 19:05:13 -05:00
Nico Weber
eb1c99bd72 LibPDF+LibGfx: Make SMasks on jpeg images work
SMasks are greyscale images that get used as alpha channel for a
different image.

JPEGs in PDFs are stored as streams with /DCTDecode filters, and
we have a separate code path for loading those in the PDF renderer.
That code path just calls our JPEG decoder, which creates bitmaps
with format BGRx8888.

So when we process an SMask for such a bitmap, we have to change
the bitmap's format to BGRA8888 in addition to setting alpha values
on all pixels.
2023-11-23 12:13:03 +01:00
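A sketch of the per-pixel part of applying an SMask, assuming each pixel is a 32-bit value with the alpha (or unused x) byte in the top 8 bits and that mask and image have the same dimensions. The function is hypothetical; switching the bitmap's format tag to BGRA8888 is a separate step that is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Overwrites the top byte of every pixel with the matching greyscale mask value.
void apply_smask(std::vector<uint32_t>& pixels, std::vector<uint8_t> const& mask)
{
    assert(pixels.size() == mask.size());
    for (size_t i = 0; i < pixels.size(); ++i)
        pixels[i] = (pixels[i] & 0x00FFFFFFu) | (static_cast<uint32_t>(mask[i]) << 24);
}
```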
Nico Weber
57e2b5ef59 LibPDF+Tests: Correctly decode text strings without explicit encoding 2023-11-22 09:08:06 -07:00
Nico Weber
e39a790c82 LibPDF: Stop converting encodings in object parser
Per 1.7 spec 3.8.1, there are multiple logical text string types:
* text strings
* ASCII strings
* byte strings

Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0)
UTF-8.

But byte strings shouldn't be converted but treated as binary
data.

This makes us no longer convert strings used for drawing page text.
TABLE 5.6 "Text-showing operators" lists the operands for text-showing
operators as just "string", not "text string" (even though these strings
confusingly are called "text strings" in the body text), so not doing
this there is correct (and matches other viewers).

We also no longer incorrectly convert strings used for crypto data
(such as passwords), if they start with a UTF-16BE or UTF-8 marker.

No behavior change for outlines and info dict entries.

https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of
this.

(ASCII strings only contain ASCII characters and behave the same
anyways.)
2023-11-22 09:08:06 -07:00
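For reference, a hedged sketch of how the encoding of a text string can be told apart by its leading byte order mark. The enum and function are hypothetical names. The point of e39a790c82 is that a check like this belongs where a text string is expected (outlines, info dict), not in the object parser where byte strings pass through too.

```cpp
#include <cassert>
#include <string_view>

enum class TextStringEncoding {
    UTF16BE,
    UTF8,
    PDFDocEncoding,
};

// Classifies a text string by its byte order mark, if any.
TextStringEncoding detect_text_string_encoding(std::string_view bytes)
{
    if (bytes.size() >= 2 && bytes.substr(0, 2) == "\xFE\xFF")
        return TextStringEncoding::UTF16BE;
    if (bytes.size() >= 3 && bytes.substr(0, 3) == "\xEF\xBB\xBF")
        return TextStringEncoding::UTF8; // only valid since PDF 2.0
    return TextStringEncoding::PDFDocEncoding;
}
```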
Nico Weber
14bcb5219d LibPDF: Tolerate comments before drawing operators
Necessary to be able to render
https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf
2023-11-22 08:56:43 +00:00
Nico Weber
9e8cf4fc1a LibPDF: Tolerate comment after last dict item
Necessary to be able to open
https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf
2023-11-22 08:56:43 +00:00
Nico Weber
4440452f92 LibPDF: Support images with 1, 2, 4 bits per pixel
They just get upsampled to 8 bits per pixel images.
2023-11-18 07:33:15 +00:00
Nico Weber
bfe27228a3 LibPDF+LibGfx: Don't invert CMYK channels in JPEG data in PDFs
This is a hack: Ideally we'd have a CMYK Bitmap pixel format,
and we'd convert to rgb at blit time. Then we could also apply color
profiles (which for CMYK images are CMYK-based).

Also, the colors for our CMYK->RGB conversion are off for PDFs,
and we have distinct codepaths for this in Gfx::Color (for paths)
and JPEGs. So when we fix that, we'll have to fix it in two places.

But this doesn't require a lot of code and it's a huge visual
progression, so let's go with it for now.
2023-11-17 22:32:40 +00:00
Nico Weber
bd7ae7f91e LibPDF: Consistently asciibetize CommonNames.h
The file wasn't quite decided if it wanted to sort by ascii value
or by case folding. Now it uses ascii value, thanks to vim's
`:'<,'>sort`.

No behavior change.
2023-11-17 20:27:42 +00:00
Nico Weber
29396415d5 LibPDF: Add an initial implementation of type 3 glyph rendering
This is a very inefficient implementation: Every time a type 3 font
glyph is drawn, we parse its operator stream and execute all the
operators therein.

We'll want to instead cache the glyphs in bitmaps (at least in most
cases), like we do for other fonts. But it's a good first step, and
all the coordinate math seems to work in the files I've tested.

Good test files from pdfa dataset 0000.zip:

- 0000559.pdf page 1 (and 2): Has a non-default font matrix;
  text appears mirrored if the font matrix isn't handled correctly

- 0000425.pdf, page 1: Draws several glyphs in a single run;
  glyphs overlap if Renderer::render_type3_glyph() ignores the
  passed-in point

- 0000211.pdf, any page: Uses type 3 glyphs for all text.
  Good perf test (already "reasonably fast")

- 0000521.pdf, page 5 (or 7 or 16): The little red flag in the
  purple box is a type 3 font glyph, and it's colored (which in part
  means the first operator is `d0`, while all the other documents above
  use `d1`)
2023-11-17 19:47:53 +00:00
Nico Weber
14ddab5519 LibPDF: Stub out type3_font_set_glyph_width*
Type 3 font glyphs begin with either `d0` or `d1`. If we bail out
with an "unsupported" error on the very first operator in a glyph,
we'll never paint the glyph.

Just stub these out for now. We probably want to do more in here in
the future (see "TABLE 5.10 Type 3 font operators" in the 1.7 spec).
2023-11-17 19:47:53 +00:00
Nico Weber
54c98a46d8 LibPDF: Correctly parse the d0 and d1 operators
They are the first operator in a type 3 charproc.
Operator.h already knew about them, but we didn't manage to parse
them, since they're the only two operators that contain a digit.
2023-11-17 19:47:53 +00:00
Nico Weber
5513f8bbe3 LibPDF: Move ScopedState from a function on Renderer into Renderer
No behavior change.
2023-11-17 19:47:53 +00:00
Nico Weber
126a0be595 LibPDF: Pass Renderer to SimpleFont::draw_glyph()
This makes it available in Type3Font::draw_glyph().

No behavior change.
2023-11-17 19:47:53 +00:00
Nico Weber
bcc6439b5f LibPDF: Pass Renderer to PDFFont::draw_string()
It's a bit unfortunate that fonts need to know about the renderer,
but type 3 fonts contain PDF drawing operators, so it's necessary.

On the bright side, it makes it possible to pass fewer parameters
around and compute things locally as needed.

(As we implement more fonts, we'll probably want to create some
functions to do these computations in a central place, eventually.)

No behavior change.
2023-11-17 19:47:53 +00:00
Nico Weber
e0c0864ddf LibPDF: Load a few values off a type 3 font dictionary 2023-11-17 19:47:53 +00:00
Nico Weber
9632d8ee49 LibPDF: Make SimpleFont font matrix configurable
Type 3 fonts can set it to a custom value.
2023-11-17 19:47:53 +00:00
Nico Weber
4cd1a2d319 LibPDF: Add some scaffolding for type 3 fonts 2023-11-17 19:47:53 +00:00
Nico Weber
7f999b1ff5 LibPDF: Sink m_base_font_name from PDFFont into subclasses
/BaseFont is a required key for type 0, type 1, and truetype
font dictionaries, but not for type 3 font dictionaries.

This is mechanical; type 0 fonts don't even use this yet
(but probably should).

PDFFont::initialize() is now empty and could be removed,
but maybe we'll put stuff there again later, so I'm leaving
it around for a bit longer.
2023-11-17 19:47:53 +00:00
Nico Weber
6c1da5db54 LibPDF: Make SimpleFont::draw_glyph() fallible 2023-11-17 19:47:53 +00:00
Nico Weber
843e9daa8c LibPDF: Remove unused PDFFont::type()
This got added in #15270, but its one use then got removed again
in #16150.

No behavior change.
2023-11-17 19:47:53 +00:00
Nico Weber
26fd29baf8 LibPDF: Give Type3 fonts a dedicated error message
They're described in "5.5.4 Type 3 Fonts" in the PDF 1.7 spec, so we
shouldn't `internal_error()` on them. They're just not implemented yet.
2023-11-17 19:47:53 +00:00
Nico Weber
5eaa403ddf LibPDF: Use font dictionary object as cache key, not resource name
In the main page contents, /T0 might refer to a different font than
it might refer to in an XObject. So don't use the `Tf` argument as
font cache key. Instead, use the address of the font dictionary object.

Fixes false cache sharing, and also allows us to share cache entries
if the same font dict is referred to by two different names.

Fixes a regression from 2340e834cd (but keeps the speed-up intact).
2023-11-17 19:14:39 +01:00
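The cache-key idea from 5eaa403ddf, sketched with standard containers. All types here are hypothetical placeholders, not LibPDF classes. The cache keeps each dictionary alive so its address cannot be reused by another object while the entry exists; that is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

struct FontDictionary { std::string base_font; }; // placeholder for a parsed font dict
struct LoadedFont { std::string name; };          // placeholder for an expensive-to-build font

class FontCache {
    struct Entry {
        std::shared_ptr<FontDictionary> keep_alive;
        LoadedFont font;
    };
    std::unordered_map<FontDictionary const*, Entry> m_fonts;
    size_t m_loads = 0;

public:
    // Looks the font up by the identity of its dictionary, not by resource name.
    LoadedFont const& get_or_load(std::shared_ptr<FontDictionary> const& dict)
    {
        auto it = m_fonts.find(dict.get());
        if (it == m_fonts.end()) {
            ++m_loads;
            it = m_fonts.emplace(dict.get(), Entry { dict, LoadedFont { dict->base_font } }).first;
        }
        return it->second.font;
    }

    size_t load_count() const { return m_loads; }
};
```

Two different dictionaries that are both named /T0 in their own resource scopes get separate entries, and one dictionary reached through two names is loaded once.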
Nico Weber
443b3eac77 LibPDF: Let decode_png_prediction() call LibGfx's unfilter_scanline()
It's less code, but it also fixes a bug: The implementation in
Filter.cpp used to use the previous byte as reference value, while
we're supposed to use the value of the previous channel as reference
(at least when a pixel is larger than one byte).
2023-11-17 19:09:50 +01:00
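A sketch of the reference-byte rule using the PNG "Sub" filter, where each byte is predicted from the corresponding byte of the previous pixel, i.e. `bytes_per_pixel` positions to the left. This is a standalone illustration, not LibGfx's unfilter_scanline().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Undoes the PNG "Sub" filter in place for one scanline (filter byte already removed).
void unfilter_sub(std::vector<uint8_t>& row, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel >= 1);
    // The first pixel has no left neighbor, so its bytes stay unchanged.
    for (size_t i = bytes_per_pixel; i < row.size(); ++i)
        row[i] = static_cast<uint8_t>(row[i] + row[i - bytes_per_pixel]);
}
```

Using `row[i - 1]` as reference gives the same result only when a pixel is one byte wide.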
Nico Weber
145ade3a86 LibPDF: Remove a needless AK:: qualification
No behavior change.
2023-11-17 19:09:50 +01:00
Nico Weber
0416a07d56 LibPDF: Make filter byte not part of row in decode_png_prediction()
No behavior change.
2023-11-17 19:09:50 +01:00
Nico Weber
b763960fc2 LibPDF: Convert decode_png_prediction to use spans
No behavior change.
2023-11-17 19:09:50 +01:00
Nico Weber
588d6fab22 LibGfx+LibPDF: Create filter_type() for converting u8 to FilterType
...and use it in LibPDF.

No behavior change.
2023-11-17 19:09:50 +01:00
Nico Weber
7e4fe8e610 LibPDF: Use PNG::paeth_predictor() in png decoding path
No behavior change.

Ideally, the PDF code would just call a function PNGLoader to do the
PNG unfiltering, but let's first try to make the implementations look
more similar.
2023-11-17 19:09:50 +01:00
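For context, the Paeth predictor as defined by the PNG specification, written as a standalone function. The signature of LibGfx's PNG::paeth_predictor() may differ; `a` is the byte to the left, `b` the byte above and `c` the byte above-left.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Returns whichever of a, b, c is closest to the initial estimate a + b - c,
// preferring a, then b, on ties.
uint8_t paeth_predictor(uint8_t a, uint8_t b, uint8_t c)
{
    int p = a + b - c;
    int pa = std::abs(p - a);
    int pb = std::abs(p - b);
    int pc = std::abs(p - c);
    if (pa <= pb && pa <= pc)
        return a;
    if (pb <= pc)
        return b;
    return c;
}
```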
Lucas CHOLLET
1e8004734f LibPDF: Don't consider the End of Data code as normal ASCII85 input
Data encoded with ASCII85 is terminated with the EOD code 0x7E3E. This
should not be considered as normal input but rather discarded.
2023-11-14 10:15:15 +01:00
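A compact ASCII85 decoder sketch showing where the EOD marker `~>` is handled: decoding stops at `~` and the marker itself is never fed into a group. This is not Filter::decode_ascii85(); it does no validation of malformed input, which is an intentional simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

std::vector<uint8_t> decode_ascii85(std::string_view input)
{
    std::vector<uint8_t> output;
    uint32_t group = 0;
    int count = 0;
    // Emits the first `bytes` bytes of the current 32-bit group, big-endian.
    auto flush = [&](int bytes) {
        for (int i = 0; i < bytes; ++i)
            output.push_back(static_cast<uint8_t>(group >> (24 - 8 * i)));
        group = 0;
        count = 0;
    };
    for (char c : input) {
        if (c == '~')
            break; // start of the EOD marker "~>": discard it, do not decode it
        if (c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ')
            continue; // PDF white space is ignored
        if (c == 'z' && count == 0) {
            output.insert(output.end(), 4, 0); // shorthand for four zero bytes
            continue;
        }
        group = group * 85 + static_cast<uint32_t>(c - '!');
        if (++count == 5)
            flush(4);
    }
    if (count > 1) {
        // A final group of n characters encodes n - 1 bytes; pad with 'u' (84).
        int bytes = count - 1;
        for (; count < 5; ++count)
            group = group * 85 + 84;
        flush(bytes);
    }
    return output;
}
```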
Lucas CHOLLET
59a6d4b7bc LibPDF: Factorize duplicated code in Filter::decode_ascii85() 2023-11-14 10:15:15 +01:00
Lucas CHOLLET
2fe0647c68 LibPDF: Handle pdf-specific white spaces correctly in ASCII85
We were previously only looking for the space character, but PDF white
space is a superset of ASCII spaces.
2023-11-14 10:15:15 +01:00
Lucas CHOLLET
db08fe12ec LibPDF: Implement Reader::is_[eol, whitespace](char)
These two static members are now used to implement respective `matches_`
methods but will also be useful to provide a global implementation of
the specified concept of whitespace.
2023-11-14 10:15:15 +01:00
Lucas CHOLLET
dac703a0b8 LibPDF: Avoid an unnecessary copy in Filter::decode_ascii85() 2023-11-14 10:15:15 +01:00
Nico Weber
9b022239c3 LibPDF: Apply all offsets of TJ operator
TJ acts on a list of either strings or numbers.
The strings are drawn, and the numbers are treated as offsets.

Previously, we'd only apply the last-seen number as offset when
we saw a string. That had the effect of us ignoring all but the
last number in front of a string, and ignoring numbers at the
end of the list.

Now, we apply all numbers as offsets.
Our rendering of Tests/LibPDF/text.pdf now matches other PDF viewers.
2023-11-14 10:11:09 +01:00
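A simplified model of the TJ behavior after 9b022239c3, assuming horizontal writing, no horizontal scaling, and a fixed advance per character. The type alias, function name and fixed advance are illustration-only assumptions. Per the PDF spec, each number is in thousandths of a text space unit and is subtracted from the current position.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

using TJElement = std::variant<std::string, double>;

// Returns the x position at which each string of a TJ array starts.
std::vector<double> string_start_positions(std::vector<TJElement> const& elements,
    double font_size, double advance_per_char)
{
    std::vector<double> positions;
    double x = 0;
    for (auto const& element : elements) {
        if (auto const* number = std::get_if<double>(&element)) {
            // Every number adjusts the position, not just the last one before a string.
            x -= *number / 1000.0 * font_size;
            continue;
        }
        positions.push_back(x);
        x += static_cast<double>(std::get<std::string>(element).size()) * advance_per_char;
    }
    return positions;
}
```

With two consecutive offsets of -500 in front of the first string, both are applied, so the string starts at 10 rather than 5 for a font size of 10.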
Nico Weber
1c2b0feb7b LibPDF: Change how CFF optional width prefix is stored
Per 5177.Type2.pdf 3.1 "Type 2 Charstring Organization",
a glyph's charstring looks like:

    w? {hs* vs* cm* hm* mt subpath}? {mt subpath}* endchar

The `w?` is the width of the glyph, but it's optional. So all
possible commands after it (hstem* vstem* cntrmask hintmask
moveto endchar) check if there's an extra number at the start
and interpret it as a width, for the very first command we read.

This was done by having an `is_first_command` local bool that
got set to false after the first command. That didn't work with
subrs: If the first command was a call to a subr that just pushed
a bunch of numbers, then the second command after it is the actual
first command.

Instead, move that bool into the state. Set it to false the
first time we try to read a width, since that means we just read
a command that could've been prefixed by a width.
2023-11-14 10:10:34 +01:00
Lucas CHOLLET
9e4d697d23 LibPDF: Detect DCT images correctly
Images can have multiple filters, each one of them is processed
sequentially. Only the last one will be relevant for the image format
(DCT or JPXDecode), so use the last filter instead of the first one to
detect that property.
2023-11-13 10:30:34 -05:00
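The check from 9e4d697d23 reduces to looking at the end of the filter list. A tiny sketch, with the filter chain modeled as a list of filter names (hypothetical representation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Filters are applied in order, so the last one determines the image format.
bool is_dct_image(std::vector<std::string> const& filters)
{
    return !filters.empty() && filters.back() == "DCTDecode";
}
```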
Nico Weber
f882a3ae37 LibPDF: In ColorSpace creation code, use resolve_to() more
For valid PDFs, this makes no difference.

For invalid PDFs, we now assert during the cast in resolve_to() instead
of returning a PDFError. However, most PDFs are valid, and even for
invalid PDFs, we'd previously keep the old color space around when
getting the PDF error and then usually assert later when the old
color space got passed a color with an unexpected number of components
(since the components were for the new color space).

Doesn't affect any of the > 2000 PDFs I use for testing locally,
is less code, and should make for less surprising asserts when it
does happen.
2023-11-13 10:29:26 -05:00
Lucas CHOLLET
9bc25db9a3 LibPDF: Add support for the LZW filter
This allows us to decode the first page of ThinkingInPostScript.pdf :^)
2023-11-13 14:23:23 +01:00
Lucas CHOLLET
048ef11136 LibPDF: Factorize flate parameters handling to its own function
This part will be shared with the LZW filter, so let's factorize it.
2023-11-13 14:23:23 +01:00
Nico Weber
bbde3cbc90 LibPDF: Tolerate an indirect object as dict for CIE-based color spaces
Namely, for CalGrayColorSpace, CalRGBColorSpace, LabColorSpace.

Fixes a crash rendering any page of Adobe's 5014.CIDFont_Spec.pdf
(which uses CalRGBColorSpace with an indirect dict: The dict is
object `92 0`, and many color spaces are inline objects referring
to it).
2023-11-13 07:12:05 -05:00
Nico Weber
f4a847894f LibPDF: Make SampledFunction::evaluate() work for n-dimensional input
I didn't find example code for this and the AI assistant did very
poorly on this as well. So I had to write it all by myself!

It can be much more efficient I think, but I think the overall
shape is maybe roughly fine.
2023-11-12 07:55:04 +01:00
Nico Weber
a9ef65e64a LibPDF: For multi-output SampledFunctions, fix output colors
For N outputs, the outputs aren't stored in N independent planes.
Instead, N output values are stored right next to each other in
the stream data.
2023-11-11 08:55:37 +01:00
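The interleaved layout from a9ef65e64a, sketched for 8 bits per sample and a 1-D input. The helper name is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The N outputs of one sample are adjacent: index = sample_index * N + output_index.
// A planar layout would instead use output_index * sample_count + sample_index.
uint8_t sample_at(std::vector<uint8_t> const& data, size_t output_count,
    size_t sample_index, size_t output_index)
{
    assert(output_index < output_count);
    return data[sample_index * output_count + output_index];
}
```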
Nico Weber
ec739460e0 LibPDF: Add test for SampledFunction and fix bugs found by it
* SampledFunction now keeps the StreamObject it gets data from alive
  (doesn't matter too much in practice, but does matter in the test,
  where nothing else keeps the stream alive).

* If a sample is an integer, we would previously sample that value
  twice and then divide by zero when interpolating. Make sure to
  sample 1 unit apart.
2023-11-11 08:55:37 +01:00
Nico Weber
323ba7404c LibPDF: Implement SampledFunction::evaluate() for some sampled functions
Things now work for functions that are all of:
* linear
* 1-D input
* 8 bits per sample
2023-11-10 15:03:30 +00:00
Nico Weber
fd1876441a LibPDF: Implement SampledFunction::create() 2023-11-10 15:03:30 +00:00
Nico Weber
cd9f4655ec LibPDF: Tweak implementation of postscript roll op
Since positive offsets roll to the right, it makes more sense
to do the big reverse first. Gets rid of an awkward minus sign.

No behavior change.
2023-11-10 14:45:38 +01:00
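A sketch of PostScript's `n j roll` done with three reversals, big reverse first, as the message of cd9f4655ec describes. This is an independent illustration, not the LibPDF implementation; the stack is a vector whose back is the top of the stack.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rolls the top n elements by j positions; positive j moves elements toward the top.
void roll(std::vector<double>& stack, int n, int j)
{
    assert(n >= 0 && static_cast<size_t>(n) <= stack.size());
    if (n == 0)
        return;
    j = ((j % n) + n) % n; // normalize negative and oversized offsets to 0..n-1
    auto first = stack.end() - n;
    std::reverse(first, stack.end());     // big reverse of all n elements
    std::reverse(first, first + j);       // restore order of the j wrapped elements
    std::reverse(first + j, stack.end()); // restore order of the remaining n - j
}
```

For the stack `a b c` (top on the right), `3 1 roll` gives `c a b` and `3 -1 roll` gives `b c a`.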
Nico Weber
b23ed86889 LibPDF: Implement StitchingFunction::evaluate() 2023-11-10 14:45:16 +01:00
Nico Weber
ba34ddeb21 LibPDF: Implement StitchingFunction creation 2023-11-10 14:45:16 +01:00
Nico Weber
5af6e1c042 LibPDF: Implement DeviceNColorSpace 2023-11-09 23:33:49 +01:00
Nico Weber
0f07049935 LibPDF: Add ColorSpaceFamily::operator==
No behavior change.
2023-11-09 23:33:49 +01:00
Nico Weber
80eec1e16b LibPDF: Implement PostScriptCalculatorFunction
Includes a tokenizer and interpreter for the subset of PostScript
supported in PDF type 4 functions.
2023-11-09 16:06:25 +01:00
Tim Schumacher
a2f60911fe AK: Rename GenericTraits to DefaultTraits
This feels like a more fitting name for something that provides the
default values for Traits.
2023-11-09 10:05:51 -05:00
Nico Weber
bbd86ee4f3 LibPDF: Implement ExponentialInterpolationFunction 2023-11-06 10:01:05 +01:00
Nico Weber
1aed465efe LibPDF: Implement Function::create() 2023-11-06 10:01:05 +01:00
Nico Weber
b78ea81de5 LibPDF: Implement SeparationColorSpace
Requires PDF::Function, which isn't implemented yet, so this has
no visual effect yet.
2023-11-06 10:01:05 +01:00
Nico Weber
9204252d02 LibPDF: Add scaffolding for function objects
See PDF 1.7 Spec, "3.9 Functions".
2023-11-06 10:01:05 +01:00
Nico Weber
21894f1cde LibPDF: Fix typos in DeviceN colorspace scaffolding
* Compare array size to 3 and 4, not 4 and 5
* Fix literal typo in error message

Fixes crash processing 0000906.pdf from 0000.zip from the pdf/a dataset.
2023-11-06 09:54:01 +01:00