2021-02-18 13:41:42 +03:00
|
|
|
|
Going Further
|
|
|
|
|
=============
|
|
|
|
|
|
2021-02-18 23:03:40 +03:00
|
|
|
|
.. currentmodule:: weasyprint
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Why WeasyPrint?
|
|
|
|
|
---------------
|
|
|
|
|
|
|
|
|
|
Automatic document generation is a common need of many applications, even if a
|
|
|
|
|
lot of operations do not require printed paper anymore.
|
|
|
|
|
|
|
|
|
|
Invoices, tickets, leaflets, diplomas, documentation, books… All these
|
|
|
|
|
documents are read and used on paper, but also on electronical readers, on
|
|
|
|
|
smartphones, on computers. PDF is a great format to store and display them in
|
|
|
|
|
a reliable way, with pagination.
|
|
|
|
|
|
|
|
|
|
Using HTML and CSS to generate static and paged content can be strange at first
|
|
|
|
|
glance: browsers display only one page, with variable dimensions, often in a
|
|
|
|
|
very dynamic way. But paged media layout is actually included in CSS2_, which
|
|
|
|
|
was already a W3C recommendation in 1998.
|
|
|
|
|
|
|
|
|
|
Other well-known tools can be used to automatically generate PDF documents,
|
|
|
|
|
like LaTeX and LibreOffice, but they miss many advantages that HTML and CSS
|
|
|
|
|
offer. HTML and CSS are very widely known, by developers but also by
|
|
|
|
|
webdesigners. They are specified in a backwards-compatible way, and regularly
|
|
|
|
|
adapted to please the use of billions of people. They are really easy to write
|
|
|
|
|
and generate, with a ridiculous amount of tools that are finely adapted to the
|
|
|
|
|
needs and taste of their users.
|
|
|
|
|
|
|
|
|
|
However, the web engines that are used for browsers were very limited for
|
|
|
|
|
pagination when WeasyPrint was created in 2011. Even now, they lack a lot of
|
|
|
|
|
basic features. That’s why projects such as wkhtmltopdf_ and PagedJS_ have been
|
|
|
|
|
created: they add some of these features to existing browsers.
|
|
|
|
|
|
|
|
|
|
Other solutions have beed developed, including web engine dedicated to paged
|
|
|
|
|
media. Prince_, Antennahouse_ or `Typeset.sh`_ created original renderers
|
|
|
|
|
supporting many features related to pagination. These tools are very powerful,
|
|
|
|
|
but they are not open source.
|
|
|
|
|
|
|
|
|
|
Building a free and open source web renderer generating high-quality documents
|
|
|
|
|
is the main goal of WeasyPrint. Do you think that it was a little bit crazy to
|
|
|
|
|
create such a big project from scratch? Here is what `Simon Sapin`_ wrote
|
|
|
|
|
in WeasyPrint’s documentation one month after the beginning:
|
|
|
|
|
|
|
|
|
|
Are we crazy? Yes. But not that much. Each modern web browser did take many
|
|
|
|
|
developers’ many years of work to get where they are now, but WeasyPrint’s
|
|
|
|
|
scope is much smaller: there is no user-interaction, no JavaScript, no live
|
|
|
|
|
rendering (the document doesn’t changed after it was first parsed) and no
|
|
|
|
|
quirks mode (we don’t need to support every broken page of the web.)
|
|
|
|
|
|
|
|
|
|
We still need however to implement the whole CSS box model and visual
|
|
|
|
|
rendering. This is a lot of work, but we feel we can get something useful
|
|
|
|
|
much quicker than “Let’s build a rendering engine!” may seem.
|
|
|
|
|
|
|
|
|
|
Simon is often right.
|
|
|
|
|
|
|
|
|
|
.. _CSS2: https://www.w3.org/TR/1998/REC-CSS2-19980512/
|
|
|
|
|
.. _wkhtmltopdf: https://wkhtmltopdf.org/
|
|
|
|
|
.. _PagedJS: https://www.pagedjs.org/
|
|
|
|
|
.. _Prince: https://www.princexml.com/
|
|
|
|
|
.. _Antennahouse: https://www.antennahouse.com/
|
|
|
|
|
.. _Typeset.sh: https://typeset.sh/
|
|
|
|
|
.. _Simon Sapin: https://exyr.org/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Why Python?
|
|
|
|
|
-----------
|
|
|
|
|
|
|
|
|
|
Python is a really good language to design a small, OS-agnostic parser. As it
|
|
|
|
|
is object-oriented, it gives the possibility to follow the specification with
|
|
|
|
|
high-level classes and a small amount of very simple code.
|
|
|
|
|
|
|
|
|
|
Speed is not WeasyPrint’s main goal. Web rendering is a very complex task, and
|
|
|
|
|
following :pep:`the Zen of Python <20>` helped a lot to keep our sanity (both in our
|
|
|
|
|
code and in our heads): code simplicity, maintainability and flexibility are
|
|
|
|
|
the most important goals for this library, as they give the ability to stay
|
|
|
|
|
really close to the specification and to fix bugs easily.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Dive into the Source
|
|
|
|
|
--------------------
|
|
|
|
|
|
|
|
|
|
This chapter is a high-level overview of WeasyPrint’s source code. For more
|
|
|
|
|
details, see the various docstrings or even the code itself. When in doubt,
|
|
|
|
|
feel free to :ref:`ask <Support>`!
|
|
|
|
|
|
|
|
|
|
Much `like in web browsers`_, the rendering of a document in WeasyPrint goes
|
|
|
|
|
like this:
|
|
|
|
|
|
|
|
|
|
1. The HTML document is fetched and parsed into a tree of elements (like DOM).
|
|
|
|
|
2. CSS stylesheets (either found in the HTML or supplied by the user) are
|
|
|
|
|
fetched and parsed.
|
|
|
|
|
3. The stylesheets are applied to the DOM-like tree.
|
|
|
|
|
4. The DOM-like tree with styles is transformed into a *formatting structure*
|
|
|
|
|
made of rectangular boxes.
|
|
|
|
|
5. These boxes are *laid-out* with fixed dimensions and position onto pages.
|
|
|
|
|
6. For each page, the boxes are re-ordered to observe stacking rules, and are
|
|
|
|
|
drawn on a PDF page.
|
|
|
|
|
7. Metadata −such as document information, attachments, embedded files,
|
|
|
|
|
hyperlinks, and PDF trim and bleed boxes− are added to the PDF.
|
|
|
|
|
|
|
|
|
|
.. _like in web browsers: http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#The_main_flow
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Parsing HTML
|
|
|
|
|
............
|
|
|
|
|
|
|
|
|
|
Not much to see here. The :class:`HTML` class handles step 1 and
|
|
|
|
|
gives a tree of HTML *elements*. Although the actual API is different, this
|
|
|
|
|
tree is conceptually the same as what web browsers call *the DOM*.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Parsing CSS
|
|
|
|
|
...........
|
|
|
|
|
|
|
|
|
|
As with HTML, CSS stylesheets are parsed in the :class:`CSS` class
|
|
|
|
|
with an external library, tinycss2_.
|
|
|
|
|
|
|
|
|
|
In addition to the actual parsing, the ``css`` and ``css.validation``
|
|
|
|
|
modules do some pre-processing:
|
|
|
|
|
|
|
|
|
|
* Unknown and unsupported declarations are ignored with warnings.
|
|
|
|
|
Remaining property values are parsed in a property-specific way
|
|
|
|
|
from raw tinycss2 tokens into a higher-level form.
|
|
|
|
|
* Shorthand properties are expanded. For example, ``margin`` becomes
|
|
|
|
|
``margin-top``, ``margin-right``, ``margin-bottom`` and ``margin-left``.
|
|
|
|
|
* Hyphens in property names are replaced by underscores (``margin-top`` becomes
|
|
|
|
|
``margin_top``). This transformation is safe since none of the known (not
|
|
|
|
|
ignored) properties have an underscore character.
|
|
|
|
|
* Selectors are pre-compiled with cssselect2_.
|
|
|
|
|
|
|
|
|
|
.. _tinycss2: https://pypi.python.org/pypi/tinycss2
|
|
|
|
|
.. _cssselect2: https://pypi.python.org/pypi/cssselect2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The Cascade
|
|
|
|
|
...........
|
|
|
|
|
|
|
|
|
|
After that and still in the ``css`` package, the cascade_
|
|
|
|
|
(that’s the C in CSS!) applies the stylesheets to the element tree.
|
|
|
|
|
Selectors associate property declarations to elements. In case of conflicting
|
|
|
|
|
declarations (different values for the same property on the same element),
|
|
|
|
|
the one with the highest *weight* wins. Weights are based on the stylesheet’s
|
|
|
|
|
:ref:`origin <Stylesheet Origins>`, ``!important`` markers, selector
|
|
|
|
|
specificity and source order. Missing values are filled in through
|
|
|
|
|
*inheritance* (from the parent element) or the property’s *initial value*,
|
|
|
|
|
so that every element has a *specified value* for every property.
|
|
|
|
|
|
|
|
|
|
.. _cascade: http://www.w3.org/TR/CSS21/cascade.html
|
|
|
|
|
|
|
|
|
|
These *specified values* are turned into *computed values* in the
|
|
|
|
|
``css.computed_values`` module. Keywords and lengths in various units are
|
|
|
|
|
converted to pixels, etc. At this point the value for some properties can be
|
|
|
|
|
represented by a single number or string, but some require more complex
|
|
|
|
|
objects. For example, a ``Dimension`` object can be either an absolute length
|
|
|
|
|
or a percentage.
|
|
|
|
|
|
|
|
|
|
The final result of the ``css.get_all_computed_styles`` function is a big dict
|
|
|
|
|
where keys are ``(element, pseudo_element_type)`` tuples, and keys are style
|
|
|
|
|
dict objects. Elements are ElementTree elements, while the type of
|
|
|
|
|
pseudo-element is a string for eg. ``::first-line`` selectors, or :obj:`None`
|
|
|
|
|
for “normal” elements. Style dict objects are dicts mapping property names to
|
|
|
|
|
the computed values. (The return value is not the dict itself, but a
|
|
|
|
|
convenience ``style_for`` function for accessing it.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Formatting Structure
|
|
|
|
|
....................
|
|
|
|
|
|
|
|
|
|
The `visual formatting model`_ explains how *elements* (from the ElementTree
|
|
|
|
|
tree) generate *boxes* (in the formatting structure). This is step 4 above.
|
|
|
|
|
Boxes may have children and thus form a tree, much like elements. This tree is
|
|
|
|
|
generally close but not identical to the ElementTree tree: some elements
|
|
|
|
|
generate more than one box or none.
|
|
|
|
|
|
|
|
|
|
.. _visual formatting model: http://www.w3.org/TR/CSS21/visuren.html
|
|
|
|
|
|
|
|
|
|
Boxes are of a lot of different kinds. For example you should not confuse
|
|
|
|
|
*block-level boxes* and *block containers*, though *block boxes* are both. The
|
|
|
|
|
``formatting_structure.boxes`` module has a whole hierarchy of classes to
|
|
|
|
|
represent all these boxes. We won’t go into the details here, see the module
|
|
|
|
|
and class docstrings.
|
|
|
|
|
|
|
|
|
|
The ``formatting_structure.build`` module takes an ElementTree tree with
|
|
|
|
|
associated computed styles, and builds a formatting structure. It generates the
|
|
|
|
|
right boxes for each element and ensures they conform to the models rules
|
|
|
|
|
(eg. an inline box can not contain a block). Each box has a ``style``
|
|
|
|
|
attribute containing the style dict of computed values.
|
|
|
|
|
|
|
|
|
|
The main logic is based on the ``display`` property, but it can be overridden
|
|
|
|
|
for some elements by adding a handler in the ``html`` module.
|
|
|
|
|
This is how ``<img>`` and ``<td colspan=3>`` are currently implemented,
|
|
|
|
|
for example.
|
|
|
|
|
|
|
|
|
|
This module is rather short as most of HTML is defined in CSS rather than
|
|
|
|
|
in Python, in the `user agent stylesheet`_.
|
|
|
|
|
|
|
|
|
|
The ``formatting_structure.build.build_formatting_structure`` function returns
|
|
|
|
|
the box for the root element (and, through its ``children`` attribute, the
|
|
|
|
|
whole tree).
|
|
|
|
|
|
|
|
|
|
.. _user agent stylesheet: https://github.com/Kozea/WeasyPrint/blob/master/weasyprint/css/html5_ua.css
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Layout
|
|
|
|
|
......
|
|
|
|
|
|
|
|
|
|
Step 5 is the layout. You could say the everything else is glue code and
|
|
|
|
|
this is where the magic happens.
|
|
|
|
|
|
|
|
|
|
During the layout the document’s content is, well, laid out on pages.
|
|
|
|
|
This is when we decide where to do line breaks and page breaks. If a break
|
|
|
|
|
happens inside of a box, that box is split into two (or more) boxes in the
|
|
|
|
|
layout result.
|
|
|
|
|
|
|
|
|
|
According to the `box model`_, each box has rectangular margin, border,
|
|
|
|
|
padding and content areas:
|
|
|
|
|
|
|
|
|
|
.. _box model: http://www.w3.org/TR/CSS21/box.html
|
|
|
|
|
|
|
|
|
|
.. image:: https://www.w3.org/TR/CSS21/images/boxdim.png
|
|
|
|
|
:alt: CSS Box Model
|
|
|
|
|
|
|
|
|
|
While ``box.style`` contains computed values, the `used values`_ are set as
|
|
|
|
|
attributes of the ``Box`` object itself during the layout. This include
|
|
|
|
|
resolving percentages and especially ``auto`` values into absolute, pixel
|
|
|
|
|
lengths. Once the layout done, each box has used values for margins, border
|
|
|
|
|
width, padding of each four sides, as well as the ``width`` and ``height`` of
|
|
|
|
|
the content area. They also have ``position_x`` and ``position_y``, the
|
|
|
|
|
absolute coordinates of the top-left corner of the margin box (**not** the
|
|
|
|
|
content box) from the top-left corner of the page.\ [#]_
|
|
|
|
|
|
|
|
|
|
Boxes also have helpers methods such as ``content_box_y`` and ``margin_width``
|
|
|
|
|
that give other metrics that can be useful in various parts of the code.
|
|
|
|
|
|
|
|
|
|
The final result of the layout is a list of ``PageBox`` objects.
|
|
|
|
|
|
|
|
|
|
.. [#] These are the coordinates *if* no `CSS transform`_ applies.
|
|
|
|
|
Transforms change the actual location of boxes, but they are applied
|
|
|
|
|
later during drawing and do not affect layout.
|
|
|
|
|
.. _used values: http://www.w3.org/TR/CSS21/cascade.html#used-value
|
|
|
|
|
.. _CSS transform: http://www.w3.org/TR/css3-transforms/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Stacking & Drawing
|
|
|
|
|
..................
|
|
|
|
|
|
|
|
|
|
In step 6, the boxes are reordered by the ``stacking`` module to observe
|
|
|
|
|
`stacking rules`_ such as the ``z-index`` property. The result is a tree of
|
|
|
|
|
*stacking contexts*.
|
|
|
|
|
|
|
|
|
|
Next, each laid-out page is *drawn* onto a PDF page. Since each box has
|
|
|
|
|
absolute coordinates on the page from the layout step, the logic here should be
|
|
|
|
|
minimal. If you find yourself adding a lot of logic here, maybe it should go in
|
|
|
|
|
the layout or stacking instead.
|
|
|
|
|
|
|
|
|
|
The code lives in the ``draw`` module.
|
|
|
|
|
|
|
|
|
|
.. _stacking rules: http://www.w3.org/TR/CSS21/zindex.html
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Metadata
|
|
|
|
|
........
|
|
|
|
|
|
|
|
|
|
Finally (step 7), the ``pdf`` adds metadata to the PDF file: document
|
|
|
|
|
information, attachments, hyperlinks, embedded files, trim box and bleed box.
|