2011-10-28 19:14:31 +04:00
|
|
|
|
Hacking WeasyPrint
|
2012-09-19 17:20:34 +04:00
|
|
|
|
==================
|
2011-10-28 19:14:31 +04:00
|
|
|
|
|
2012-03-15 19:01:09 +04:00
|
|
|
|
Assuming you already have the dependencies_, install the `development
|
|
|
|
|
version`_ of WeasyPrint:
|
|
|
|
|
|
|
|
|
|
.. _dependencies: /install/
|
|
|
|
|
.. _development version: https://github.com/Kozea/WeasyPrint
|
2011-10-28 19:14:31 +04:00
|
|
|
|
|
|
|
|
|
.. code-block:: sh
|
|
|
|
|
|
|
|
|
|
git clone git://github.com/Kozea/WeasyPrint.git
|
|
|
|
|
cd WeasyPrint
|
2012-03-15 19:01:09 +04:00
|
|
|
|
virtualenv --system-site-packages env
|
|
|
|
|
. env/bin/activate
|
|
|
|
|
pip install pytest
|
|
|
|
|
pip install -e .
|
|
|
|
|
weasyprint --help
|
2011-10-28 19:14:31 +04:00
|
|
|
|
|
|
|
|
|
This will install WeasyPrint in “editable” mode (which means that you don’t
|
|
|
|
|
need to re-install it every time you make a change in the source code) as
|
2012-06-04 20:52:33 +04:00
|
|
|
|
well as `py.test`_.
|
2011-10-28 19:14:31 +04:00
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
Use the ``py.test`` command from the ``WeasyPrint`` directory to run the
|
2011-10-28 19:14:31 +04:00
|
|
|
|
test suite.
|
|
|
|
|
|
2012-02-29 22:13:10 +04:00
|
|
|
|
Please report any bug or feature request on Redmine_ and submit
|
|
|
|
|
patches/pull requests on Github_.
|
|
|
|
|
|
|
|
|
|
.. _py.test: http://pytest.org/
|
|
|
|
|
.. _Redmine: http://redmine.kozea.fr/projects/weasyprint/issues
|
|
|
|
|
.. _Github: https://github.com/Kozea/WeasyPrint
|
2011-10-28 19:14:31 +04:00
|
|
|
|
|
|
|
|
|
Dive into the source
|
|
|
|
|
--------------------
|
|
|
|
|
|
2011-10-31 19:45:22 +04:00
|
|
|
|
The rest of this document is a high-level overview of WeasyPrint’s source
|
|
|
|
|
code. For more details, see the various docstrings or even the code itself.
|
|
|
|
|
When in doubt, don’t hesitate to `ask </community>`_!
|
|
|
|
|
|
2011-10-28 19:14:31 +04:00
|
|
|
|
Much like `in web browsers
|
|
|
|
|
<http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#The_main_flow>`_,
|
|
|
|
|
the rendering of a document in WeasyPrint goes like this:
|
|
|
|
|
|
|
|
|
|
1. The HTML document is fetched and parsed into a DOM tree
|
|
|
|
|
2. CSS stylesheets (either found in the HTML or supplied by the user) are
|
|
|
|
|
fetched and parsed
|
|
|
|
|
3. The stylesheets are applied to the DOM tree
|
2012-06-04 20:52:33 +04:00
|
|
|
|
4. The DOM tree with styles is transformed into a *formatting structure* made of rectangular boxes.
|
|
|
|
|
5. These boxes are *laid-out* with fixed dimensions and position onto pages.
|
|
|
|
|
6. The boxes are re-ordered to observe stacking rules.
|
|
|
|
|
7. The pages are drawn in a PDF file through a cairo surface.
|
|
|
|
|
8. Cairo’s PDF is modified to add metadata such as bookmarks and hyperlinks.
|
2011-10-28 19:14:31 +04:00
|
|
|
|
|
2011-10-31 19:45:22 +04:00
|
|
|
|
Documents
|
|
|
|
|
.........
|
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
WeasyPrint’s “entry point” is the ``Document`` class. An instance handles
|
|
|
|
|
a document for all of its lifetime. It is responsible of calling other parts
|
|
|
|
|
of the code for each step listed above.
|
2011-10-31 19:45:22 +04:00
|
|
|
|
|
|
|
|
|
The document is lazy: the various steps of the rendering are only done
|
2012-06-04 20:52:33 +04:00
|
|
|
|
when required (ie. when relevant attributes are accessed.)
|
2011-10-31 19:45:22 +04:00
|
|
|
|
|
|
|
|
|
HTML
|
|
|
|
|
....
|
|
|
|
|
|
|
|
|
|
Not much to see here. lxml.html_ handles step 1 and gives a *DOM tree*.
|
|
|
|
|
The lxml API is not actually DOM, but we’ll call it that anyway. The lxml
|
|
|
|
|
object for the root element (usually ``<html>``) is stored as the ``dom``
|
|
|
|
|
attribute of the document.
|
|
|
|
|
|
|
|
|
|
.. _lxml.html: http://lxml.de/lxmlhtml.html
|
|
|
|
|
|
|
|
|
|
CSS
|
|
|
|
|
...
|
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
Steps 2 and 3 happen in the ``weasyprint.css`` package. CSS stylesheets are
|
|
|
|
|
parsed with cssutils_. ``@import`` and ``@media`` rules are resolved to find
|
|
|
|
|
applicable rule sets. Then the ``weasyprint.css.validation`` module filters out
|
2011-10-31 19:45:22 +04:00
|
|
|
|
declarations with properties unknown to or unsupported by WeasyPrint or with
|
|
|
|
|
illegal or unsupported values.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. _cssutils: http://cthedot.de/cssutils/
|
|
|
|
|
.. _lxml.cssselect: http://lxml.de/cssselect.html
|
|
|
|
|
|
|
|
|
|
As well as validation, several transformations are made at or around
|
|
|
|
|
this point:
|
|
|
|
|
|
|
|
|
|
* Shorthand properties are expanded. For example, ``margin`` is replaced by
|
|
|
|
|
``margin-top``, ``margin-right``, ``margin-bottom`` and ``margin-left``.
|
|
|
|
|
* Some values are simplified. They come as a list as a list of cssutils
|
2012-02-29 22:13:10 +04:00
|
|
|
|
`Value objects`_. For example, keyword values are replaced by simple
|
2011-10-31 19:45:22 +04:00
|
|
|
|
strings and the list is dropped when its length is always one for a given
|
|
|
|
|
property.
|
2012-02-29 22:13:10 +04:00
|
|
|
|
* Hyphens in property names are replaced by underscores (``margin-top``
|
2011-10-31 19:45:22 +04:00
|
|
|
|
becomes ``margin_top``) so that they can be used as Python attribute names
|
|
|
|
|
later on.
|
|
|
|
|
|
|
|
|
|
.. _Value objects: http://packages.python.org/cssutils/docs/css.html#values
|
|
|
|
|
|
|
|
|
|
After that, the cascade_ (that’s the C in CSS!), together with inhertance
|
|
|
|
|
and initial values, assigns a value for each property to each DOM element.
|
2012-06-04 20:52:33 +04:00
|
|
|
|
These values are *computed* in ``weasyprint.css.computed_values``: lengths
|
2011-10-31 19:45:22 +04:00
|
|
|
|
are converted to pixels, etc. The objects representing the values are
|
|
|
|
|
simplified further: pixel length are simple floating points numbers.
|
|
|
|
|
Some values however are still cssutils objects to avoid ambiguities (eg.
|
|
|
|
|
percentages.)
|
|
|
|
|
|
|
|
|
|
.. _cascade: http://www.w3.org/TR/CSS21/cascade.html
|
|
|
|
|
|
|
|
|
|
Finally, the computed style for all DOM elements is stored in the
|
|
|
|
|
``computed_styles`` attribute of the document object.
|
|
|
|
|
|
|
|
|
|
Formatting structure
|
|
|
|
|
....................
|
|
|
|
|
|
|
|
|
|
The `visual formatting model`_ explains how *elements* (from the DOM tree)
|
|
|
|
|
generate *boxes* (in the formatting structure). This is step 4 above.
|
|
|
|
|
Boxes may have children and thus form a tree, much like elements. This tree
|
|
|
|
|
is generally close but not identical to the DOM tree: some elements generate
|
|
|
|
|
no or more than one box.
|
|
|
|
|
|
|
|
|
|
.. _visual formatting model: http://www.w3.org/TR/CSS21/visuren.html
|
|
|
|
|
|
|
|
|
|
Boxes are of a lot of different kinds. For example you should not confuse
|
|
|
|
|
*block-level boxes* and *block containers*, though *block boxes* are both.
|
2012-06-04 20:52:33 +04:00
|
|
|
|
The ``weasyprint.formatting_structure.boxes`` module has a whole hierarchy of
|
2011-10-31 19:45:22 +04:00
|
|
|
|
classes to represent all these boxes. We won’t go into the details here, see
|
|
|
|
|
the module and class docstrings.
|
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
The ``weasyprint.formatting_structure.build`` module takes a DOM tree with
|
2012-02-29 22:13:10 +04:00
|
|
|
|
associated computed styles, and builds a formatting structure. It generates
|
2011-10-31 19:45:22 +04:00
|
|
|
|
the right boxes for each element and ensures they conform to the models rules.
|
2012-02-29 22:13:10 +04:00
|
|
|
|
(Eg. an inline box can not contain a block.) Each box has a ``some_box.style``
|
2011-10-31 19:45:22 +04:00
|
|
|
|
attribute containing computed values for each known CSS property.
|
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
The main logic is based on the ``display`` property, but it can be overridden for some elements by adding a handler in the ``weasyprint.html`` module.
|
|
|
|
|
This is how ``<img>`` and ``<td colspan=3>`` are currently implemented,
|
|
|
|
|
for example.
|
2012-02-29 22:13:10 +04:00
|
|
|
|
This module is rather short as most of HTML is defined in CSS rather than
|
|
|
|
|
in Python, in the `user agent stylesheet`_.
|
2011-10-31 19:45:22 +04:00
|
|
|
|
|
|
|
|
|
The box for the root element (and, through its ``children`` attribute, the
|
|
|
|
|
whole tree) is set to the ``formatting_structure`` attribute of the document.
|
|
|
|
|
|
2012-02-29 22:13:10 +04:00
|
|
|
|
.. _user agent stylesheet: https://github.com/Kozea/WeasyPrint/blob/master/weasyprint/css/html5_ua.css
|
|
|
|
|
|
2011-10-31 19:45:22 +04:00
|
|
|
|
Layout
|
|
|
|
|
......
|
|
|
|
|
|
|
|
|
|
Step 5 is the layout. You could say the everything else is glue code and
|
|
|
|
|
this is where the magic happens.
|
|
|
|
|
|
|
|
|
|
During the layout the document’s content is … laid out on pages. This is when
|
|
|
|
|
we decide where to do line breaks and page breaks. If a break happens inside
|
|
|
|
|
of a box, that box is split into two (or more) boxes in the layout result.
|
|
|
|
|
|
|
|
|
|
According to the `box model`_, each box has rectangular margin, border,
|
|
|
|
|
padding and content areas:
|
|
|
|
|
|
|
|
|
|
.. _box model: http://www.w3.org/TR/CSS21/box.html
|
|
|
|
|
|
|
|
|
|
.. image:: http://www.w3.org/TR/CSS21/images/boxdim.png
|
|
|
|
|
:align: center
|
|
|
|
|
|
|
|
|
|
While ``box.style`` contains computed values, the `used values`_ are set
|
|
|
|
|
as attributes of the ``Box`` object itself during the layout. This
|
|
|
|
|
include resolving percentages and especially ``auto`` values into absolute,
|
|
|
|
|
pixel lengths. Once the layout done, each box has used values for
|
|
|
|
|
margins, border width, padding of each four sides, as well as the ``width``
|
|
|
|
|
and ``height`` of the content area. They also have ``position_x`` and
|
|
|
|
|
``position_y``, the absolute coordinates of the top-left corner of the
|
|
|
|
|
margin box (**not** the content box) from the top-left corner of the page.
|
|
|
|
|
|
|
|
|
|
.. _used values: http://www.w3.org/TR/CSS21/cascade.html#used-value
|
|
|
|
|
|
|
|
|
|
Boxes also have helpers methods such as ``content_box_y()`` and
|
|
|
|
|
``margin_width()`` that give other metrics that can be useful in various
|
|
|
|
|
parts of the code.
|
|
|
|
|
|
|
|
|
|
When the layout is done, a list of ``PageBox`` objects is set to the
|
|
|
|
|
``pages`` attribute of the document.
|
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
Stacking
|
|
|
|
|
........
|
|
|
|
|
|
|
|
|
|
In step 6, the boxes are reorder by the ``weasyprint.stacking`` module
|
|
|
|
|
to observe `stacking rules`_ such as the ``z-index`` property.
|
|
|
|
|
The result is a tree of `stacking contexts`.
|
|
|
|
|
|
|
|
|
|
.. _stacking rules: http://www.w3.org/TR/CSS21/zindex.html
|
|
|
|
|
|
2011-10-31 19:45:22 +04:00
|
|
|
|
Drawing
|
|
|
|
|
.......
|
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
Next, in step 7, each laid-out page is *drawn* onto a cairo_ surface.
|
2011-10-31 19:45:22 +04:00
|
|
|
|
Since each box has absolute coordinates on the page from the layout step,
|
|
|
|
|
the logic here should be minimal. If you find yourself adding a lot of logic
|
2012-06-04 20:52:33 +04:00
|
|
|
|
here, maybe it should go in the layout or stacking instead.
|
2011-10-31 19:45:22 +04:00
|
|
|
|
|
2012-06-04 20:52:33 +04:00
|
|
|
|
The code lives in the ``weasyprint.draw`` module and is called by the
|
|
|
|
|
``write_to`` method of the document.
|
2011-10-31 19:45:22 +04:00
|
|
|
|
|
|
|
|
|
.. _cairo: http://cairographics.org/pycairo/
|
2012-06-04 20:52:33 +04:00
|
|
|
|
|
|
|
|
|
Metadata
|
|
|
|
|
........
|
|
|
|
|
|
|
|
|
|
Finally (step 8), the ``weasyprint.pdf`` parses the PDF file produced by cairo
|
|
|
|
|
and makes an *incremental update* to add internal and external hyperlinks,
|
|
|
|
|
as well as outlines / bookmarks.
|