elinks/src/document/html/README

This is "Mikulas' HTML Processor": the parser and renderer we actually use.
It's scary, yes.


Parser - renderer interaction
=============================

(See also doc/hacking.txt. Incoherent (and possibly mistaken on some points;
don't trust it to the letter) rambling follows.)

The parser is written modularly, so it is separate from the renderer and can
actually be used with a different renderer as well (Links2 does that, using a
graphics renderer with the current parser). The entry point from the rest of
ELinks is inside the renderer, which sets up a couple of callbacks (like "put
this text on screen" or "special element indicator", where a special element
might be a <hr>, a link or a table) and calls the parser.
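
As a rough illustration of that arrangement, here is a toy sketch. All the
names in it (parser_callbacks, toy_parse, toy_render and friends) are made up
for this README and are not the actual ELinks interfaces; the "parser" only
knows about <hr>. The point is just the direction of the calls: the renderer
is the entry point, hands the parser a set of callbacks, and the parser calls
back with text and special elements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; the real interfaces differ. */
enum special_kind { SP_HR, SP_LINK, SP_TABLE };

struct parser_callbacks {
	void (*put_text)(void *renderer, const char *text, size_t len);
	void (*special)(void *renderer, enum special_kind kind);
};

/* Toy "parser": passes text through and reports every "<hr>" it sees. */
static void
toy_parse(const char *src, const struct parser_callbacks *cb, void *renderer)
{
	const char *p = src;

	assert(cb && cb->put_text && cb->special);
	while (*p) {
		const char *hr = strstr(p, "<hr>");

		if (!hr) {
			cb->put_text(renderer, p, strlen(p));
			break;
		}
		if (hr > p)
			cb->put_text(renderer, p, (size_t) (hr - p));
		cb->special(renderer, SP_HR);
		p = hr + 4;	/* skip over "<hr>" */
	}
}

/* Toy "renderer" state: just counts what it was told to draw. */
struct toy_renderer {
	size_t chars;
	int specials;
};

static void
toy_put_text(void *renderer, const char *text, size_t len)
{
	struct toy_renderer *r = renderer;

	(void) text;
	r->chars += len;
}

static void
toy_special(void *renderer, enum special_kind kind)
{
	struct toy_renderer *r = renderer;

	(void) kind;
	r->specials++;
}

/* The entry point lives in the renderer: set up callbacks, call the parser. */
static struct toy_renderer
toy_render(const char *src)
{
	struct toy_renderer r = { 0, 0 };
	struct parser_callbacks cb = { toy_put_text, toy_special };

	toy_parse(src, &cb, &r);
	return r;
}
```

Because the parser only ever talks to the callbacks, a different renderer can
be plugged in by supplying a different parser_callbacks table.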

Usually, the renderer just lets the parser chew through the document on its
own and only processes the callbacks. Sometimes it kicks in, though: at that
point it does a "management override", skips a chunk of the source and resumes
the parser after that chunk. Most commonly, it does this with tables: when you
hit a table, html_table() does basically nothing and the renderer skips to
</table>. What happens with the table, you ask? The renderer calls itself
recursively on just the table; that means a separate parser instance is run
for the table, and a new renderer instance is called for each distinct cell.
The table is (parsed and) rendered twice: the first pass just finds out the
wanted cell sizes; the renderer then optimizes those, figures out the layout,
and does a second rendering pass that uses the calculated cell sizes to
actually write the cells to the canvas.
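
The two passes can be sketched in miniature as follows. This is only a toy
under loud assumptions: cells are plain strings, the "wanted size" is simply
the string length, and the width negotiation (shrink the widest column until
the row fits) is invented for the example; the real table code is far more
involved. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COLS 3

/* Pass 1: a dry render that only reports the width a cell would like. */
static int
wanted_width(const char *cell)
{
	return (int) strlen(cell);
}

/* Between the passes: start from the wanted widths, then shrink the
 * widest column one step at a time until the row fits into avail. */
static void
layout_columns(const char *cells[][COLS], int rows, int avail, int widths[COLS])
{
	int total = 0;
	int r, c;

	for (c = 0; c < COLS; c++)
		widths[c] = 0;
	for (r = 0; r < rows; r++)
		for (c = 0; c < COLS; c++) {
			int w = wanted_width(cells[r][c]);

			if (w > widths[c])
				widths[c] = w;
		}
	for (c = 0; c < COLS; c++)
		total += widths[c];

	while (total > avail) {
		int widest = 0;

		for (c = 1; c < COLS; c++)
			if (widths[c] > widths[widest])
				widest = c;
		if (widths[widest] <= 1)
			break;	/* nothing left to squeeze */
		widths[widest]--;
		total--;
	}
}

/* Pass 2: the real render; each cell is padded or cut to its column. */
static void
render_row(const char *row[COLS], const int widths[COLS], char *out, size_t outsize)
{
	size_t pos = 0;
	int c;

	assert(outsize > 0);
	out[0] = '\0';
	for (c = 0; c < COLS; c++) {
		int n = snprintf(out + pos, outsize - pos, "%-*.*s",
				 widths[c], widths[c], row[c]);

		if (n < 0 || (size_t) n >= outsize - pos)
			break;	/* out of room; keep what fits */
		pos += (size_t) n;
	}
}
```

Note that nothing touches the output buffer until the widths are settled;
that is the whole reason for rendering twice.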


Box model plans
===============

The design described above - calling the renderer recursively on the table and
each cell - is a cheap substitute for the box model. Apart from those cases,
the renderer basically just writes on the canvas sequentially as the text
comes in, moving the pen only rightwards and down; only a few parameters, like
indentation and border sizes, affect the rightwards and down motion.

It needs to be changed so that the renderer maintains a *box model* - at each
moment the text being written out is inside a stack of boxes. This is what
the table renderer achieves by just making each box a separate renderer
instance - you can't get away without boxes when rendering tables. But boxes
are essential for all block elements the moment you go CSS, and we went CSS
and would like to support floating elements.

Even without support for the float property, boxes will have an immediate
advantage, since you will be able to e.g. set their background. Right now,
Slashdot for example looks really ugly: it sets a background on some block
elements, but since we have no box model we set the background only on the
rendered text itself, not on the rectangular canvas around it.

Boxes can probably be implemented by just transmitting information "here a new
box (block element) begins" and "here a box ends" from parser to the renderer,
and maintaining a box stack with some geometry information in the renderer (and
now, you've just got DOM for block elements if you turn this from a temporary
stack to a persistent tree; but we might not want to do that, at least not
unconditionally).
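
A minimal sketch of such a temporary box stack might look like this. It is
only an assumption of how it could be done: the names (box_geom, box_stack,
box_begin, box_end), the fixed depth limit and the choice of geometry fields
are all hypothetical, and indentation is applied symmetrically just to keep
the example short.

```c
#include <assert.h>

#define MAX_BOX_DEPTH 32

struct box_geom {
	int x;		/* left edge on the canvas */
	int width;	/* width available to the content */
	int bg;		/* background attribute of this box */
};

struct box_stack {
	struct box_geom boxes[MAX_BOX_DEPTH];
	int depth;	/* number of open boxes, root included */
};

/* The root box covers the whole canvas width. */
static void
box_stack_init(struct box_stack *st, int canvas_width)
{
	st->boxes[0].x = 0;
	st->boxes[0].width = canvas_width;
	st->boxes[0].bg = 0;
	st->depth = 1;
}

static const struct box_geom *
box_top(const struct box_stack *st)
{
	assert(st->depth > 0);
	return &st->boxes[st->depth - 1];
}

/* Parser says "here a new box begins". Returns 0, or -1 if nested too
 * deep. The new box sits inside its parent, indented on both sides. */
static int
box_begin(struct box_stack *st, int indent, int bg)
{
	const struct box_geom *parent = box_top(st);
	struct box_geom *b;

	if (st->depth >= MAX_BOX_DEPTH)
		return -1;
	b = &st->boxes[st->depth++];
	b->x = parent->x + indent;
	b->width = parent->width - 2 * indent;
	if (b->width < 0)
		b->width = 0;
	b->bg = bg;
	return 0;
}

/* Parser says "here a box ends". The root box is never popped. */
static int
box_end(struct box_stack *st)
{
	if (st->depth <= 1)
		return -1;
	st->depth--;
	return 0;
}
```

Turning this into a persistent tree would mostly mean not throwing the popped
entries away in box_end().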


Rendering boxes
~~~~~~~~~~~~~~~

When you are actually rendering, you must not apply the attributes on the whole
box after rendering; inline elements might have custom attributes and we won't
have boxes for those. Also, you don't know the box width until you've already
rendered it. Several possibilities occur to me:

(i) Dry-run each box to find out the dimensions, prefill the rectangle with
attributes and then hard-render. Awfully quadratic, would kill performance.

(ii) Post-fill: for each line and each box in the stack, remember the "content
width". After the box is over, look at each of its lines and post-fill it from
the content width to the box width (on both sides).

(iii) Maximalistic: assume maximal spread and reduce afterwards. Basically fill
the whole line (from starting X coordinate) with the box attributes and when
the box is done rendering, fill the rest of all its lines (from ending X
coordinate) with the parent's box attributes.

I like (iii) most.
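
To make (iii) concrete, here is a toy sketch on a tiny canvas that stores only
one attribute character per cell. Everything in it is an assumption made for
the example: the names (canvas, open_box, box_add_line, box_finish), the fixed
canvas size, and the shortcut of taking the widest content line as the final
box width.

```c
#include <assert.h>
#include <string.h>

#define CANVAS_W 12
#define CANVAS_H 4

/* Toy canvas that stores only one attribute character per cell. */
struct canvas {
	char attr[CANVAS_H][CANVAS_W];
};

/* Hypothetical bookkeeping for the box currently being rendered. */
struct open_box {
	int x0, y0;		/* starting X coordinate and first line */
	int lines;		/* lines rendered so far */
	int max_len;		/* widest content seen; used as box width */
	char attr;		/* this box's attribute */
	char parent_attr;	/* the parent box's attribute */
};

static void
canvas_init(struct canvas *cv, char attr)
{
	memset(cv->attr, attr, sizeof(cv->attr));
}

/* Fill columns [from, to) of line y, clamped to the canvas. */
static void
fill(struct canvas *cv, int y, int from, int to, char attr)
{
	int x;

	if (y < 0 || y >= CANVAS_H)
		return;
	if (from < 0)
		from = 0;
	if (to > CANVAS_W)
		to = CANVAS_W;
	for (x = from; x < to; x++)
		cv->attr[y][x] = attr;
}

/* Start a line: optimistically give the whole rest of it to the box. */
static void
box_add_line(struct canvas *cv, struct open_box *b, int content_len)
{
	assert(content_len >= 0);
	fill(cv, b->y0 + b->lines, b->x0, CANVAS_W, b->attr);
	if (content_len > b->max_len)
		b->max_len = content_len;
	b->lines++;
}

/* Box is done: hand everything right of its ending X back to the parent. */
static void
box_finish(struct canvas *cv, const struct open_box *b)
{
	int i;

	for (i = 0; i < b->lines; i++)
		fill(cv, b->y0 + i, b->x0 + b->max_len, CANVAS_W,
		     b->parent_attr);
}
```

Each line is touched once when it is started and once more when the box ends,
so the extra work is linear in the number of lines in the box, with no dry run.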


Boxes and floats
~~~~~~~~~~~~~~~~

In order to be able to support floats, you want to be able to revisit previous
boxes and resize them based on the new box. To do that, you need to remember
parser contexts for all the boxes in the stack (you should have only a few of
those). If you hit a floating box, you:

	* dry-render it to find out the dimensions and then do the dimensions
	  negotiation like you'd do with table cells (or maybe a different one,
	  I didn't read the specs on it); don't put anything on canvas

	* pop the floating box

	* duplicate the parent box, so you have a "sibling box" that contains
	  all the content hit so far; limit its parser context to reach
	  only to the start of your floating box

	  Note that you don't want to merge multiple float boxes together.
	  So, an implementation might have something like this instead of the
	  "duplication":

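	  struct box {
		/* Children of this box. An array of pointers, since a
		 * struct cannot contain an array of its own type. */
		struct box **floaters;
		int nfloaters;	/* hypothetical: entries in floaters[] */
	  };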

	  where floaters are children of this box; normally you have one
	  which is mostly read-only and contains everything rendered so far,
	  and one which points to the next box in your stack. When you process
	  a floating box, it gets a separate entry in the floaters[] array
	  and won't get merged to floaters[0].

	* wipe the sibling box (floaters[*]) from canvas

	* rerender the sibling box with calculated geometry constraints

	* rerender the floating box, ditto

(This might be modified based on how floats are actually supposed to behave;
I'm not sure of that. ;-) ) This algorithm is careful not to keep a list of
all sibling boxes in memory, since in a simple long document a mere sequence
of paragraphs would then cost us a huge amount of memory. There's probably no
way around keeping a list of sibling floating boxes in memory, though.
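
The floaters[] bookkeeping could be sketched like this. It is a guess at one
possible shape, not a design decision: struct fbox and its helpers are
hypothetical names, content_len merely stands in for "content rendered so
far", and no rendering or geometry negotiation is done at all. It only shows
that normal content keeps merging into floaters[0] while every floating box
gets its own entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical structure, loosely following the struct box idea above. */
struct fbox {
	int is_float;		/* nonzero for a floating box */
	int content_len;	/* stand-in for "content rendered so far" */
	struct fbox **floaters;	/* children; [0] collects the normal flow */
	int nfloaters;
};

static struct fbox *
fbox_new(int is_float, int content_len)
{
	struct fbox *b = calloc(1, sizeof(*b));

	if (b) {
		b->is_float = is_float;
		b->content_len = content_len;
	}
	return b;
}

/* Append a child; returns 0 on success, -1 on allocation failure. */
static int
fbox_append(struct fbox *parent, struct fbox *child)
{
	struct fbox **grown;

	assert(parent);
	if (!child)
		return -1;
	grown = realloc(parent->floaters,
			(size_t) (parent->nfloaters + 1) * sizeof(*grown));
	if (!grown) {
		free(child);
		return -1;
	}
	parent->floaters = grown;
	parent->floaters[parent->nfloaters++] = child;
	return 0;
}

/* Make sure floaters[0], the merged normal-flow sibling, exists. */
static int
fbox_ensure_flow(struct fbox *parent)
{
	if (parent->nfloaters > 0)
		return 0;
	return fbox_append(parent, fbox_new(0, 0));
}

/* Normal content is merged into floaters[0]. */
static int
fbox_add_content(struct fbox *parent, int len)
{
	if (fbox_ensure_flow(parent) < 0)
		return -1;
	parent->floaters[0]->content_len += len;
	return 0;
}

/* A floating box always gets its own entry and is never merged. */
static int
fbox_add_float(struct fbox *parent, int len)
{
	if (fbox_ensure_flow(parent) < 0)
		return -1;
	return fbox_append(parent, fbox_new(1, len));
}

static void
fbox_free(struct fbox *b)
{
	int i;

	if (!b)
		return;
	for (i = 0; i < b->nfloaters; i++)
		fbox_free(b->floaters[i]);
	free(b->floaters);
	free(b);
}
```

Memory then grows with the number of floats, not with the number of
paragraphs, which matches the concern above.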