mirror of https://github.com/Kozea/WeasyPrint.git synced 2024-10-05 08:27:22 +03:00
WeasyPrint/docs/hacking.rst
2012-09-19 15:20:34 +02:00

Hacking WeasyPrint
==================

Assuming you already have the dependencies_, install the `development
version`_ of WeasyPrint:

.. _dependencies: /install/
.. _development version: https://github.com/Kozea/WeasyPrint

.. code-block:: sh

    git clone https://github.com/Kozea/WeasyPrint.git
    cd WeasyPrint
    virtualenv --system-site-packages env
    . env/bin/activate
    pip install pytest
    pip install -e .
    weasyprint --help


This will install WeasyPrint in “editable” mode (which means that you don't
need to re-install it every time you make a change in the source code) as
well as `py.test`_.

Use the ``py.test`` command from the ``WeasyPrint`` directory to run the
test suite.

Please report any bugs or feature requests on Redmine_ and submit
patches/pull requests on GitHub_.

.. _py.test: http://pytest.org/
.. _Redmine: http://redmine.kozea.fr/projects/weasyprint/issues
.. _GitHub: https://github.com/Kozea/WeasyPrint

Dive into the source
--------------------

The rest of this document is a high-level overview of WeasyPrint's source
code. For more details, see the various docstrings or even the code itself.
When in doubt, don't hesitate to `ask </community>`_!

Much like `in web browsers
<http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/#The_main_flow>`_,
the rendering of a document in WeasyPrint goes like this:


1. The HTML document is fetched and parsed into a DOM tree.
2. CSS stylesheets (either found in the HTML or supplied by the user) are
   fetched and parsed.
3. The stylesheets are applied to the DOM tree.
4. The DOM tree with styles is transformed into a *formatting structure*
   made of rectangular boxes.
5. These boxes are *laid out* with fixed dimensions and positions onto pages.
6. The boxes are reordered to observe stacking rules.
7. The pages are drawn in a PDF file through a cairo surface.
8. Cairo's PDF is modified to add metadata such as bookmarks and hyperlinks.

Documents
.........

WeasyPrint's “entry point” is the ``Document`` class. An instance handles
a document for all of its lifetime. It is responsible for calling other parts
of the code for each step listed above.

The document is lazy: the various steps of the rendering are only done
when required (i.e. when the relevant attributes are accessed).

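
The lazy behaviour described above can be sketched with cached properties.
This is only an illustration under stated assumptions, not WeasyPrint's
actual code: the ``LazyDocument`` class, its ``log`` attribute and the toy
"parsing" are hypothetical, and only the attribute names ``dom`` and
``computed_styles`` come from this document.

.. code-block:: python

    from functools import cached_property


    class LazyDocument:
        """Hypothetical document whose rendering steps run on first access."""

        def __init__(self, source):
            self.source = source
            self.log = []  # records which steps actually ran (illustration only)

        @cached_property
        def dom(self):
            # Stand-in for step 1: "parse" the source into a list of tag names.
            self.log.append("parse")
            return self.source.split()

        @cached_property
        def computed_styles(self):
            # Stand-in for steps 2 and 3. Accessing self.dom triggers the
            # parsing step first if it has not happened yet.
            dom = self.dom
            self.log.append("style")
            return {tag: {"display": "block"} for tag in dom}


    document = LazyDocument("html body p")
    # Nothing has run yet: document.log is still empty at this point.
    styles = document.computed_styles  # runs "parse", then "style", once each

Each step runs at most once because ``cached_property`` stores the result on
the instance after the first access.
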
HTML
....

Not much to see here. lxml.html_ handles step 1 and gives a *DOM tree*.
The lxml API is not actually DOM, but we'll call it that anyway. The lxml
object for the root element (usually ``<html>``) is stored as the ``dom``
attribute of the document.

.. _lxml.html: http://lxml.de/lxmlhtml.html

CSS
...

Steps 2 and 3 happen in the ``weasyprint.css`` package. CSS stylesheets are
parsed with cssutils_. ``@import`` and ``@media`` rules are resolved to find
applicable rule sets. Then the ``weasyprint.css.validation`` module filters out
declarations with properties unknown to or unsupported by WeasyPrint or with
illegal or unsupported values.

.. _cssutils: http://cthedot.de/cssutils/
.. _lxml.cssselect: http://lxml.de/cssselect.html


As well as validation, several transformations are made at or around
this point:

* Shorthand properties are expanded. For example, ``margin`` is replaced by
  ``margin-top``, ``margin-right``, ``margin-bottom`` and ``margin-left``.
* Some values are simplified. They come as a list of cssutils
  `Value objects`_. For example, keyword values are replaced by simple
  strings and the list is dropped when its length is always one for a given
  property.
* Hyphens in property names are replaced by underscores (``margin-top``
  becomes ``margin_top``) so that they can be used as Python attribute names
  later on.

.. _Value objects: http://packages.python.org/cssutils/docs/css.html#values

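
As a rough sketch of the first and last transformations listed above, the
following expands a ``margin`` shorthand according to the CSS 2.1 rules for
one to four values, then renames the properties. The function names
``expand_margin`` and ``to_attribute_names`` are hypothetical and values are
plain strings here; WeasyPrint's real code works on cssutils objects.

.. code-block:: python

    def expand_margin(values):
        """Expand a ``margin`` shorthand with 1 to 4 values into longhands."""
        if len(values) == 1:
            top = right = bottom = left = values[0]
        elif len(values) == 2:
            top = bottom = values[0]
            right = left = values[1]
        elif len(values) == 3:
            top, right, bottom = values
            left = right
        elif len(values) == 4:
            top, right, bottom, left = values
        else:
            raise ValueError("margin takes 1 to 4 values, got %d" % len(values))
        return {
            "margin-top": top,
            "margin-right": right,
            "margin-bottom": bottom,
            "margin-left": left,
        }


    def to_attribute_names(declarations):
        """Replace hyphens so that names are valid Python attribute names."""
        return {
            name.replace("-", "_"): value
            for name, value in declarations.items()
        }


    style = to_attribute_names(expand_margin(["1em", "auto"]))
    # style["margin_top"] is "1em" and style["margin_left"] is "auto"
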

After that, the cascade_ (that's the C in CSS!), together with inheritance
and initial values, assigns a value for each property to each DOM element.
These values are *computed* in ``weasyprint.css.computed_values``: lengths
are converted to pixels, etc. The objects representing the values are
simplified further: pixel lengths are simple floating-point numbers.
Some values however are still cssutils objects to avoid ambiguities (e.g.
percentages).

.. _cascade: http://www.w3.org/TR/CSS21/cascade.html

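
To make "lengths are converted to pixels" concrete, here is a minimal sketch
using the absolute unit ratios from CSS 2.1 (1in = 96px = 72pt = 6pc =
2.54cm). The ``compute_length`` function and the ``PIXELS_PER_UNIT`` table
are hypothetical names, not the API of ``weasyprint.css.computed_values``,
and ``em`` is simply taken relative to a font size passed in by the caller.

.. code-block:: python

    # Pixels per unit for absolute CSS lengths (CSS 2.1, section 4.3.2).
    PIXELS_PER_UNIT = {
        "px": 1.0,
        "in": 96.0,
        "pt": 96.0 / 72,
        "pc": 96.0 / 6,
        "cm": 96.0 / 2.54,
        "mm": 96.0 / 25.4,
    }


    def compute_length(value, unit, font_size=16.0):
        """Return a length as a plain float number of pixels.

        ``font_size`` is the element's own font size in pixels. It is only
        used for the relative ``em`` unit.
        """
        if unit == "em":
            return value * font_size
        return value * PIXELS_PER_UNIT[unit]


    half_inch = compute_length(0.5, "in")  # 48.0
    indent = compute_length(2, "em", font_size=10.0)  # 20.0

Percentages are left out on purpose: they cannot be resolved before the
layout knows what they refer to, which is why they are kept as objects.
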

Finally, the computed style for all DOM elements is stored in the
``computed_styles`` attribute of the document object.

Formatting structure
....................

The `visual formatting model`_ explains how *elements* (from the DOM tree)
generate *boxes* (in the formatting structure). This is step 4 above.
Boxes may have children and thus form a tree, much like elements. This tree
is generally close but not identical to the DOM tree: some elements generate
no box at all, and others generate more than one.

.. _visual formatting model: http://www.w3.org/TR/CSS21/visuren.html

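
The following toy builder illustrates why the box tree differs from the DOM
tree. It relies on two CSS 2.1 rules: ``display: none`` generates no box, and
``display: list-item`` generates a marker box in addition to the principal
block box. The ``Element`` and ``Box`` classes and the ``build_boxes``
function are hypothetical and much simpler than the real
``weasyprint.formatting_structure`` modules.

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class Element:
        tag: str
        display: str
        children: list = field(default_factory=list)


    @dataclass
    class Box:
        kind: str
        tag: str
        children: list = field(default_factory=list)


    def build_boxes(element):
        """Return the list of boxes generated by ``element``, maybe empty."""
        if element.display == "none":
            return []  # no box for this element or its descendants
        children = [
            box for child in element.children for box in build_boxes(child)
        ]
        if element.display == "list-item":
            # Two boxes for a single element: the marker and the principal box.
            return [
                Box("marker", element.tag),
                Box("block", element.tag, children),
            ]
        return [Box(element.display, element.tag, children)]


    tree = Element("ul", "block", [
        Element("li", "list-item"),
        Element("script", "none"),
    ])
    boxes = build_boxes(tree)
    # One block box for <ul>, holding a marker box and a block box for <li>.
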

Boxes come in many different kinds. For example you should not confuse
*block-level boxes* and *block containers*, though *block boxes* are both.
The ``weasyprint.formatting_structure.boxes`` module has a whole hierarchy of
classes to represent all these boxes. We won't go into the details here, see
the module and class docstrings.

The ``weasyprint.formatting_structure.build`` module takes a DOM tree with
associated computed styles, and builds a formatting structure. It generates
the right boxes for each element and ensures they conform to the model's rules
(e.g. an inline box cannot contain a block). Each box has a ``some_box.style``
attribute containing computed values for each known CSS property.


The main logic is based on the ``display`` property, but it can be overridden
for some elements by adding a handler in the ``weasyprint.html`` module.
This is how ``<img>`` and ``<td colspan=3>`` are currently implemented,
for example.
This module is rather short as most of HTML is defined in CSS rather than
in Python, in the `user agent stylesheet`_.

The box for the root element (and, through its ``children`` attribute, the
whole tree) is set to the ``formatting_structure`` attribute of the document.

.. _user agent stylesheet: https://github.com/Kozea/WeasyPrint/blob/master/weasyprint/css/html5_ua.css

Layout
......

Step 5 is the layout. You could say that everything else is glue code and
this is where the magic happens.

During the layout the document's content is … laid out on pages. This is when
we decide where to do line breaks and page breaks. If a break happens inside
of a box, that box is split into two (or more) boxes in the layout result.

According to the `box model`_, each box has rectangular margin, border,
padding and content areas:

.. _box model: http://www.w3.org/TR/CSS21/box.html

.. image:: http://www.w3.org/TR/CSS21/images/boxdim.png
    :align: center


While ``box.style`` contains computed values, the `used values`_ are set
as attributes of the ``Box`` object itself during the layout. This
includes resolving percentages and especially ``auto`` values into absolute
pixel lengths. Once the layout is done, each box has used values for
margins, border widths and padding on each of the four sides, as well as the
``width`` and ``height`` of the content area. Each box also has ``position_x``
and ``position_y``, the absolute coordinates of the top-left corner of the
margin box (**not** the content box) from the top-left corner of the page.

.. _used values: http://www.w3.org/TR/CSS21/cascade.html#used-value

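
As a simplified illustration of how a percentage or ``auto`` width becomes a
used value, consider a block-level box in normal flow. CSS 2.1 requires the
horizontal margins, borders, paddings and ``width`` to add up to the width of
the containing block, so an ``auto`` width is whatever remains. The
``resolve_percentage`` and ``used_width`` functions are hypothetical, and
this sketch ignores ``auto`` margins, ``min-width``/``max-width`` and
over-constrained cases.

.. code-block:: python

    def resolve_percentage(value, refer_to):
        """Resolve ``value`` (pixels as a number, or a string like '50%')."""
        if isinstance(value, str) and value.endswith("%"):
            return float(value[:-1]) / 100 * refer_to
        return value


    def used_width(containing_width, width, margins=0.0, borders=0.0,
                   paddings=0.0):
        """Used content width of a block-level box in normal flow.

        ``margins``, ``borders`` and ``paddings`` are the left + right sums,
        already in pixels.
        """
        if width == "auto":
            # The content area takes the space left by everything else.
            return containing_width - margins - borders - paddings
        return resolve_percentage(width, containing_width)


    half = used_width(800, "50%")  # 400.0
    rest = used_width(800, "auto", margins=20, borders=2, paddings=8)  # 770
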

Boxes also have helper methods such as ``content_box_y()`` and
``margin_width()`` that give other metrics that can be useful in various
parts of the code.

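
A sketch of what such helpers could compute from the used values is shown
below. The attribute names follow the description in this section, but the
``LaidOutBox`` class and the method bodies are assumptions made for
illustration (including ``content_box_x()``, which is not named in this
document); see the ``Box`` class for the real definitions.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class LaidOutBox:
        """Hypothetical box holding used values, all in pixels."""

        position_x: float  # left edge of the margin box
        position_y: float  # top edge of the margin box
        width: float  # content area width
        height: float  # content area height
        margin_top: float = 0.0
        margin_right: float = 0.0
        margin_left: float = 0.0
        border_top_width: float = 0.0
        border_right_width: float = 0.0
        border_left_width: float = 0.0
        padding_top: float = 0.0
        padding_right: float = 0.0
        padding_left: float = 0.0

        def content_box_x(self):
            """Absolute horizontal position of the content box."""
            return (self.position_x + self.margin_left
                    + self.border_left_width + self.padding_left)

        def content_box_y(self):
            """Absolute vertical position of the content box."""
            return (self.position_y + self.margin_top
                    + self.border_top_width + self.padding_top)

        def margin_width(self):
            """Width of the margin box: content, padding, border and margin."""
            return (self.width
                    + self.padding_left + self.padding_right
                    + self.border_left_width + self.border_right_width
                    + self.margin_left + self.margin_right)


    box = LaidOutBox(position_x=10, position_y=20, width=100, height=50,
                     margin_top=5, margin_left=4, margin_right=4,
                     border_top_width=1, border_left_width=1,
                     border_right_width=1, padding_top=2,
                     padding_left=3, padding_right=3)
    # box.content_box_y() is 28 and box.margin_width() is 116
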

When the layout is done, a list of ``PageBox`` objects is set to the
``pages`` attribute of the document.

Stacking
........

In step 6, the boxes are reordered by the ``weasyprint.stacking`` module
to observe `stacking rules`_ such as the ``z-index`` property.
The result is a tree of *stacking contexts*.

.. _stacking rules: http://www.w3.org/TR/CSS21/zindex.html

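
The reordering can be pictured with a heavily simplified version of the
CSS 2.1 painting order inside one stacking context: children with a negative
``z-index`` are painted first, then in-flow content, then children with a
``z-index`` of zero or more, with ties kept in tree order. The
``paint_order`` function and its ``(name, z_index)`` tuples are hypothetical
and ignore floats, inline content and nested contexts, which the
``weasyprint.stacking`` module has to handle.

.. code-block:: python

    def paint_order(boxes):
        """Return box names in painting order, back to front.

        ``boxes`` is a list of ``(name, z_index)`` tuples in tree order, where
        ``z_index`` is ``None`` for in-flow, non-positioned content.
        """
        negative = sorted(
            (box for box in boxes if box[1] is not None and box[1] < 0),
            key=lambda box: box[1])
        in_flow = [box for box in boxes if box[1] is None]
        # sorted() is stable, so equal z-index values keep their tree order.
        non_negative = sorted(
            (box for box in boxes if box[1] is not None and box[1] >= 0),
            key=lambda box: box[1])
        return [name for name, _ in negative + in_flow + non_negative]


    order = paint_order([
        ("header", 2), ("text", None), ("shadow", -1),
        ("menu", 2), ("overlay", 0),
    ])
    # order is ["shadow", "text", "overlay", "header", "menu"]
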
Drawing
.......

Next, in step 7, each laid-out page is *drawn* onto a cairo_ surface.
Since each box has absolute coordinates on the page from the layout step,
the logic here should be minimal. If you find yourself adding a lot of logic
here, maybe it should go in the layout or stacking instead.

The code lives in the ``weasyprint.draw`` module and is called by the
``write_to`` method of the document.

.. _cairo: http://cairographics.org/pycairo/

Metadata
........

Finally (step 8), the ``weasyprint.pdf`` module parses the PDF file produced
by cairo and makes an *incremental update* to add internal and external
hyperlinks, as well as outlines / bookmarks.