Use check_vs instead of set_pos_x and set_pos_y in fixup_typeahead_match.
This saves us a line of code, and in addition, check_vs does not needlessly
scroll when the link is already in view.
Make set_kbd_repeat_count update the status bar and link highlighting
iff the repeat count is changed to a different value.
Delete code to do the same updates from do_action and try_prefix_key.
Besides simplifying the code, this change also fixes some issues with
the status bar and link highlighting not being properly updated in some
situations.
Introduce and use ses_kbd_repeat_count to change
ses->kbdprefix.repeat_count instead of setting it directly.
This change should not cause any change in behaviour.
Add an option to specify the number of overlapping lines when scrolling
page by page (0 by default because this is ELinks' current behaviour).
Signed-off-by: Fabienne Ducroquet <fabiduc@gmail.com>
Bring the code closer to ELinks 0.13.GIT commit
71ccbe0f8d made on 2007-11-07.
Don't define a separate download_flags_T typedef though,
because Doxygen generates nicer links if the enum is used directly.
Doxygen isn't too good at documenting the parameters of a callback
within the documentation of a parameter that points to the callback.
A typedef provides a better place to document the parameters.
If the user chose File -> Save formatted document and typed the name
of an existing file, ELinks offered to resume downloading the file.
There are a few problems with that:
* save_formatted_finish does not actually support resuming. It would
instead overwrite the beginning of the file and not truncate it.
* When save_formatted calls create_download_file, cdf_hop->data
ends up pointing to struct document. If the user then chooses to
resume, lun_resume would read *(int *)cdf_hop->data, hoping to
get cmdw_hop.magic or codw_hop.magic. struct document does not
begin with any such magic value.
* Because ELinks already has the formatted document in memory,
resuming saves neither time nor I/O.
So don't show the "Resume download of the original file" button in
this situation.
After the recent ecmascript_get_interpreter change, I got an assertion
failure in render_document, which calls ecmascript_reset_state and
then asserts that it has set vs->ecmascript != NULL.
ecmascript_reset_state cannot guarantee that because there might not
even be enough free memory for mem_calloc(1, sizeof(struct
ecmascript_interpreter). So, replace the assertion in render_document
with error handling, and likewise in call_onsubmit_and_submit.
The old code failed to write pending spaces before changing the
background color. That seems hard to fix without duplicating code,
and ELinks pads dumped lines to the requested width in these color
modes anyway, so this commit just makes ELinks write all spaces
immediately when colors are being used.
Try the following command before and after this commit:
elinks --no-home --eval "set document.colors.use_document_colors = 2" \
--dump-color-mode 1 --dump test/color.html
In DUMP_FUNCTION_SPECIALIZED, use isscreensafe_ucs (for UTF-8) or
isscreensafe (for unibyte) to detect control characters, and replace
them with spaces. add_document_to_string already did the same.
In DUMP_FUNCTION_SPECIALIZED (used by elinks --dump), detect the
second cell of double-cell (aka fullwidth) characters by comparing to
UCS_NO_CHAR, like add_document_to_string does. Don't use
unicode_to_cell for this any more.
Also, ignore the colors and attributes of the second cell; don't
output any escape sequences for them.
This is especially useful for showing that neither dump_truecolor_utf8
nor dump_truecolor_unibyte modifies its static color[] variable and it
therefore does not matter whether those functions use the same array
or not.
With all the comments and macros needed for this, the source files
don't become much shorter, but anyway I hope they'll be easier to
maintain this way.
This code was included in four variants of dump_to_file().
Move it to a new function dump_references() and make dump_to_file()
then call that. This makes the code size a little smaller.
The time cost will be negligible.
Instead of having four separate function definitions, have just one
sprinkled with #ifdefs, and #include that four times. The purpose
being to make it clearer which parts of these functions are identical
and which ones differ.
As a side effect, this change makes ELinks ignore --dump-color-mode
when dumping in UTF-8. Colourful UTF-8 dumping has not been
implemented and the fallback is now different from before.
Simplify the end-of-line check in get_search_region_from_search_nodes by
relying on the fact that the n member of an instance of struct search
that marks the end of a line will be 0.
Allow searching on the last character of the document. Plain-text searches
already match on the last character as long as it isn't the first character
of a match, and regular-expression searches match on the last character if
the search pattern is longer than 1 character, so the problem addressed by
this commit is very much a corner case.
This commit reverts a portion of commit
fd15049622594d151104d43917984c7ce10993e6 (CVS revision 1.17).
text_typeahead_handler: Document that passing -2 for action_id will cause
a search without error reporting. This behaviour is unintentionally the
current behaviour of text_typeahead_handler, but now it is documented so
that it can be used.
input_line_event_handler: When rewinding, pass -2 for the action_id
parameter to the handler instead of passing again whatever action led to
the rewinding.
The old behavior of input_line_event_handler was particularly problematic
with the search-toggle-regex action and the text_typeahead_handler handler:
input_line_event_handler would call the handler with
ACT_EDIT_SEARCH_TOGGLE_REGEX, and the handler would toggle the setting and
perform the search again; then if the search string no longer matched
anything, the handler would return INPUT_LINE_REWIND to
input_line_event_handler, which would rewind and call the handler with
ACT_EDIT_SEARCH_TOGGLE_REGEX again, thus toggling the option back to the
original setting.
With the new behaviour, input_line_event_handler will not repeat the same
action when re-invoking the handler; in the above example with
search-toggle-regex, the search string will simply be rewound until it
matches with the new setting.
When a link had an onClick event handler that changed the current
document and that link was clicked, ELinks would follow the current link
of the document displayed after executing the handler instead of the
link that was clicked.
Factor goto_link out of goto_current_link.
Use goto_link instead of goto_current_link in activate_link to ensure that
the link that is passed in by enter() is followed.
This check used to be in src/elinks.h. Move it to configure.in so
that (1) the result can be logged and (2) ELinks won't even link with
TRE if wchar_t prevents its use.
Also, rename HAVE_TRE_REGEX_H to CONFIG_TRE, to reflect that it is not
always defined if the header exists.
Check the return value of get_opt_rec on "document.browse.search.regex"
before dereferencing it. The option is not there if regular expression
support is disabled at build time.
This commit fixes a bug introduced in commit
b2d51c75ff0d6c52a4f6a2761801beb641cba3a2.
Check the return value of get_opt_rec on "document.browse.search.regex"
before dereferencing it. The option is not there if regular expression
support is disabled at build time.
This commit fixes a bug introduced in commit
b2d51c75ff0d6c52a4f6a2761801beb641cba3a2.
When the user tells ELinks to search for a regexp, ELinks 0.11.0
passes the regexp to regcomp() and the formatted document to
regexec(), both in the terminal charset. This works OK for unibyte
ASCII-compatible charsets because the regexp metacharacters are all in
the ASCII range. And ELinks 0.11.0 doesn't support multibyte or
ASCII-incompatible (e.g. EBCDIC) charsets in terminals, so it is no
big deal if regexp searches fail in such locales.
ELinks 0.12pre1 attempts to support UTF-8 as the terminal charset if
CONFIG_UTF8 is defined. Then, struct search contains unicode_val_T c
rather than unsigned char c, and get_srch() and add_srch_chr()
together save UTF-32 values there if the terminal charset is UTF-8.
In plain-text searches, is_in_range_plain() compares those values
directly if the search is case sensitive, or folds them to lower case
if the search is case insensitive: with towlower() if the terminal
charset is UTF-8, or with tolower() otherwise. In regexp searches
however, get_search_region_from_search_nodes() still truncates all
values to 8 bits in order to generate the string that
search_for_pattern() then passes to regexec(). In UTF-8 locales,
regexec() expects this string to be in UTF-8 and can't make sense of
the truncated characters. There is also a possible conflict in
regcomp() if the locale is UTF-8 but the terminal charset is not, or
vice versa.
Rejected ways of fixing the charset mismatches:
* When the terminal charset is UTF-8, recode the formatted document
from UTF-32 to UTF-8 for regexp searching. This would work if the
terminal and the locale both use UTF-8, or if both use unibyte
ASCII-compatible charsets, but not if only one of them uses UTF-8.
* Convert both the regexp and the formatted document to the charset of
the locale, as that is what regcomp() and regexec() expect. ELinks
would have to somehow keep track of which bytes in the converted
string correspond to which characters in the document; not entirely
trivial because convert_string() can replace a single unconvertible
character with a string of ASCII characters. If ELinks were
eventually changed to use iconv() for unrecognized charsets, such
tracking would become even harder.
* Temporarily switch to a locale that uses the charset of the
terminal. Unfortunately, it seems there is no portable way to
construct a name for such a locale. It is also possible that no
suitable locale is available; especially on Windows, whose C library
defines MB_LEN_MAX as 2 and thus cannot support UTF-8 locales.
Instead, this commit makes ELinks do the regexp matching with regwcomp
and regwexec from the TRE library. This way, ELinks can losslessly
recode both the pattern and the document to Unicode and rely on the
regexp code in TRE decoding them properly, regardless of locale.
There are some possible problems though:
1. ELinks stores strings as UTF-32 in arrays of unicode_val_T, but TRE
uses wchar_t instead. If wchar_t is UTF-16, as it is on Microsoft
Windows, then TRE will misdecode the strings. It wouldn't be too
hard to make ELinks convert to UTF-16 in this case, but (a) TRE
doesn't currently support UTF-16 either, and it seems possible that
wchar_t-independent UTF-32 interfaces will be added to TRE; and (b)
there seems to be little interest on using ELinks on Windows anyway.
2. The Citrus Project apparently wanted BSD to use a locale-dependent
wchar_t: e.g. UTF-32 in some locales and an ISO 2022 derivative in
others. Regexp searches in ELinks now do not support the latter.
[ Adapted to elinks-0.12 from bug 1060 attachment 506.
Commit message by me. --KON ]
This simplifies the callers a little and may help implement
simultaneous support for different charsets on different terminals
of the same type (bug 1064).
Before this patch, if you first moved the cursor to link X with
move-cursor-up and similar actions, and then clicked link Y with the
mouse, ELinks would activate link X, i.e. not the one you clicked.
This happened because the NAVIGATE_CURSOR_ROUTING mode was left
enabled and made ELinks ignore the doc_view->vs->current_link
member that ELinks had updated according to the click.
Make ELinks return the session to NAVIGATE_LINKWISE mode, so that
the update takes effect.
Reported by Paul B. Mahol.
(cherry picked from commit 4086418069)
In get_entity_string and point_intersect, do not initialise arrays with
static storage duration to zero; the C standard states that such objects
are automatically initialised to zero.
This patch fixes an issue whereby a newline character appearing within
a hidden input field is incorrectly reinterpreted as a space character.
The patch handles almost all cases, and includes a test case.
15/18 tests pass, but the remainder currently fail due to the fact
that ELinks does not currently support textarea scripting.
cache_entry.id => cache_entry.cache_id
document.id => document.cache_id
ecmascript_interpreter.onload_snippets_owner => .onload_snippets_cache_id
This is a combination of:
commit 232c07aa7f
bug 1009: id variables renamed, added document_id to the document.
commit 6007043458bf8f14abfc18b9db60785bdc0279f6
Revert addition of document.document_id
Replace almost all uses of enum connection_state with struct
connection_status. This removes the assumption that errno values used
by the system are between 0 and 100000. The GNU Hurd uses values like
ENOENT = 0x40000002 and EMIG_SERVER_DIED = -308.
This commit is derived from my attachments 450 and 467 to bug 1013.
Check in refresh_view() whether the tab is still current; if not, skip
the draw_doc() and draw_frames() calls because draw_current_link()
called within them asserts that the tab is current. However, do
always call print_screen_status(), because that handles non-current
tabs correctly too.
I think it was not yet possible to trigger the assertion failure with
setTimeout, because input.value modifications by ECMAScript do not
trigger a redraw (bug 1035).
Anything that frees struct form_view must now call the new function
ecmascript_detach_form_view. This function should then clear out any
dangling pointers, but that has not yet been implemented.
Anything that frees or reallocates struct form_state must now call the
new functions ecmascript_detach_form_state or ecmascript_moved_form_state.
These functions should then clear out any dangling pointers, but that has
not yet been implemented.
Commit 0b99fa70ca "Bug 620: Reset form
fields to default values on reload" made render_document() decrement
vs->form_info_len to 0 while vs->form_info remained non-NULL.
copy_vs() then copied the whole structure with copy_struct and did not
change form_info because form_info_len was 0. Both view_state
structures had form_info pointing to the same memory block, causing a
segfault when destroy_vs() tried to free that block a second time.
Reported by أحمد المحمودي.
This change avoids the following error:
gcc -DHAVE_CONFIG_H -I../../.. -I/home/Kalle/src/elinks-0.11/src -I/home/Kalle/prefix/include -I/usr/include/smjs -I/usr/include -I/usr/include/lua50 -I/usr/include -I/usr/include -O0 -ggdb -Wall -Wall -Werror -fno-strict-aliasing -Wno-pointer-sign -Wno-address -fno-strict-overflow -o search.o -c /home/Kalle/src/elinks-0.11/src/viewer/text/search.c
cc1: warnings being treated as errors
/home/Kalle/src/elinks-0.11/src/viewer/text/search.c:257: warning: 'get_search_region_from_search_nodes' defined but not used
make[3]: *** [search.o] Error 1
make[3]: Leaving directory `/home/Kalle/build/i686-pc-linux-gnu/elinks-0.11/src/viewer/text'
get_search_region_from_search_nodes is called only from
search_for_pattern, which already was inside #ifdef HAVE_REGEX_H.
(cherry picked from commit 2aec302d47)
Conflicts:
NEWS
configure.in
The following files also conflicted, but they had not been manually
edited in the elinks-0.12 branch after the previous merge, so I just
kept the 0.13.GIT versions:
doc/man/man1/elinks.1.in
doc/man/man5/elinks.conf.5
doc/man/man5/elinkskeys.5
po/fr.po
po/pl.po
In uri.post, each file name begins and ends with FILE_CHAR.
Previously, file names were not encoded, and names containing
FILE_CHAR could not be used. Because FILE_CHAR is a control
character, the user cannot directly type it in a file input field,
so ELinks asserted that the field did not contain FILE_CHAR.
However, it is possible to get FILE_CHAR in a file input field
with file name completion (ACT_EDIT_AUTO_COMPLETE), causing the
assertion to fail. Now, ELinks encodes FILE_CHAR as "%02", so it
is no longer ambiguous and the assertion is not needed.
The following is in the HTML 4 standard
(<http://www.w3.org/TR/html401/interact/forms.html#push-button>):
push buttons: Push buttons have no default behavior. Each push
button may have client-side scripts associated with the element's
event attributes. When an event occurs (e.g., the user presses the
button, releases it, etc.), the associated script is triggered.
Currently, a button such created by such HTML as "<input type="button"
value="foo" />" submits the form by default in ELinks. According to
the above, it shouldn't.
If ELinks is being linked with SSL library, use its random number
generator.
Otherwise, try /dev/urandom and /dev/prandom. If they do not work,
fall back to rand(), calling srand() only once. This fallback is
mostly interesting for the Hurd and Microsoft Windows.
BitTorrent piece selection and dom/test/html-mangle.c still use rand()
(but not srand()) directly. Those would not benefit from being
unpredictable, I think.
cached->id => cached->cache_id
document->id => document->cache_id
onload_snippets_owner => onload_snippets_document_id
Added the distinct document->document_id.
Always reset ecmascript when a document changes for example a next chunk
of it is loaded.
In the previous version, the first event that had KBD_MOD_PASTE
entered insert mode and was consumed for that; ELinks then inserted
the characters from the remaining events. Now, to make ELinks insert
the first character too, I'm changing things so that KBD_MOD_PASTE
does not cause insert mode to be entered; instead, ELinks inserts
those characters regardless of whether insert mode is on.
in move-link-left-line and others, so move-link-left-line ans others
do not use the keyboard prefix.
(cherry picked from commit 8b281e1404)
(cherry picked from commit 4f2a9eadfc)
Go to the page with a few lines. Follow a link to a page with more lines.
Move cursor down, do not stay on a link.
Go back and do move-link-prev-line. This caused a segmentation fault.
(cherry picked from commit 888ba87516)
(cherry picked from commit 1cbd02c141)
Change mode to NAVIGATE_LINKWISE to preserve the link position when
going back.
(cherry picked from commit 14b37d0362)
(cherry picked from commit a594b2a002)
move-link-down-line moves the cursor down to the line with a link.
move-link-up-line moves the cursor up to the line with a link.
move-link-prev-line moves to the previous link horizontally.
move-link-next-line moves to the next link horizontally.
(cherry picked from commit 8259a56e99)
(cherry picked from commit 2eb3532416)
Go to the page with a few lines. Follow a link to a page with more lines.
Move cursor down, do not stay on a link.
Go back and do move-link-prev-line. This caused a segmentation fault.
(cherry picked from commit 888ba87516)
move-link-down-line moves the cursor down to the line with a link.
move-link-up-line moves the cursor up to the line with a link.
move-link-prev-line moves to the previous link horizontally.
move-link-next-line moves to the next link horizontally.
(cherry picked from commit 8259a56e99)
start_document_refreshes() performs the NULL-pointer checks that
previously all callers to start_document_refresh() must perform
and then calls start_document_refresh().
start_document_refreshes performs the NULL-pointer checks that previously all callers to start_document_refresh must perform and then calls start_document_refresh.
Added EVENT_TEXTAREA used to notify the master terminal
about end of execution of an external program on a slave terminal.
The format of data sent to the master terminal by exec_on_slave_terminal
has changed. Now after 0, fg the value of term is sent.
Therfore this release of ELinks is incompatible with previous releases.
Patch by Witold Filipczyk, taken from his witekfl branch.
Conflicts:
src/viewer/text/textarea.c
Pass the session with some get_opt_* calls. These are the low-hanging fruit. Some places will be difficult because we don't have the session or for other reasons.
Update a comment in encode_multipart, which refers to html_form_control, which has since been renamed to html_special_form_control.
The comment was added with this commit:
commit b4dee890a61a6c8a27a8e4cd1dc3b3b93f1cdb08
Author: Petr Baudis <pasky@ucw.cz>
Date: Fri May 10 13:26:55 2002 +0000
Don't decode and back encode hidden form items (by mikulas, from 0.97).
The function was renamed with this commit:
commit c9d72739c715b3b0c7c6fec582780c1e8f444fc4
Author: Petr Baudis <pasky@ucw.cz>
Date: Sat Dec 18 02:22:28 2004 +0000
html_(tag|form*) -> html_special_\1, to naming prevent conflicts with HTML element handlers. As suggested by Jonas.
I don't remember why I cleared "returns", but it doesn't work
with www.hypermedia.pl/altkom/ and probably with many more sites.
[ From commit e887efc611 on the witekfl
branch. --KON ]
Use it for the actual I/O only. Previously, defining CONFIG_UTF8 and
enabling UTF-8 used to force many strings to the UTF-8 charset
regardless of the terminal charset option. Now, those strings always
follow the terminal charset. This fixes bug 914 which was caused
because _() returned strings in the terminal charset and functions
then assumed they were in UTF-8. This reduction in the effects of
UTF-8 I/O may also simplify future testing.
Previously, html_special_form_control converted
form_control.default_value to the terminal charset, and init_form_state
then copied the value to form_state.value. However, when CONFIG_UTF8
is defined and UTF-8 I/O is enabled, form_state.value is supposed to
be in UTF-8, rather than in the terminal charset.
This mismatch could not be conveniently fixed in
html_special_form_control because that does not know which terminal is
being used and whether UTF-8 I/O is enabled there. Also, constructing
a conversion table from the document charset to form_state.value could
have ruined renderer_context.convert_table, because src/intl/charsets.c
does not support multiple concurrent conversion tables.
So instead, we now keep form_control.default_value in the document
charset, and convert it in the viewer each time it is needed. Because
the result of the conversion is kept in form_state.value between
incremental renderings, this shouldn't even slow things down too much.
I am not implementing the proper charset conversions for the DOM
defaultValue property yet, because the current code doesn't have
them for other string properties either, and bug 805 is already open
for that.
I am going make fc->default_value use the charset of the document, and
recoding the string from the form history to that might lose characters.
This change also affects what ECMAScript sees in the defaultValue property.
<http://www.w3.org/TR/2003/REC-DOM-Level-2-HTML-20030109/html.html#ID-26091157>
says it should represent the HTML "value" attribute, so changing it
based on form history is not appropriate.
straconcat reads the args with va_arg(ap, const unsigned char *),
and the NULL macro may have the wrong type (e.g. int).
Many places pass string literals of type char * to straconcat. This
is in principle also a violation, but I'm ignoring it for now because
if it becomes a problem with some C implementation, then so will the
use of unsigned char * with printf "%s", which is so widespread in
ELinks that I'm not going to try fixing it now.
In goto_mark, copy the current_link of the old view state to the
old_current_link of the new view state so that clear_link will properly
clear the highlight for that link.
This fixes a bug introduced with the removal of link_bg in commit
c91c763d49.
* Recompute the pos variable for each cell, rather than just once per line.
This fixes the bug that only the first cell was being examined.
* Moved the bulk of the code outside the "if (frame && data >= 176 &&
data < 224)" conditional. This fixes the bug that only frame
characters were being added to the string.
* If the cell has UCS_NO_CHAR in it, don't add that to the string.
* Call encode_utf8 even for characters that originated from a frame.
This does not matter yet but will be correct if the function is
later changed to use the Unicode line-drawing characters for frames.
This allows code to use document->cached instead of
find_in_cache(document->uri), thereby increasing the likelihood
of getting the correct cache entry.
This should fix Bug 756 - "assertion (cached)->object.refcount >= 0 failed"
after HTTP proxy was changed.
Patches for this were written by me and then later by Jonas.
This commit combines our independent implementations.
As draw_textarea_utf8 loops over each character of the textarea content, it
checks whether the character is on the screen; draws it if so; increments
the screen co-ordinate; and updates the position in the textarea text.
The last step was being skipped when the character was not on the line,
so a line would be drawn from the beginning, even if the left edge of the
textarea is off the screen.
Closes: Bug 835 - Text in textarea is unaffected by horizontal scrolling of
document in UTF-8 mode
If utf8_char2cells isn't told where the string that contains
the given UTF-8 character ends, it computes that itself. Two users
of utf8_char2cells, format_textutf8 and split_line, were calling
utf8_char2cells in a loop without providing the end of the string,
resulting in numerous calls by utf8_char2cells to strlen.
With this patch, format_textutf8 and split_line each find the end
of the string once and provide it to utf8_char2cells.
This particularly improves performance with textareas, since
format_textutf8 is called multiple times each time the user interacts
with the textarea and when it must be redrawn.
Closes: Bug 823 - Big textarea is too slow with CONFIG_UTF8
UCS_ORPHAN_CELL is currently defined as U+0020 SPACE, which was
already used before this macro, so the behaviour does not change,
but the code seems clearer now.
I searched for ' ' and 32 and 0x20 and \x20, and replaced with
UCS_ORPHAN_CELL wherever UCS_NO_CHAR was involved. However,
some BFU widgets first draw spaces and then overwrite with text;
those will require a more complex fix if UCS_ORPHAN_CELL is ever
changed to some other character.
The configure script no longer recognizes "CONFIG_UTF_8=yes" lines
in custom features.conf files. They will have to be changed to
"CONFIG_UTF8=yes". This incompatibility was deemed acceptable
because no released version of ELinks supports CONFIG_UTF_8.
The --enable-utf-8 option was not renamed.
Suggested by Miciah on #elinks.
What was renamed:
add_utf_8 => add_utf8
cp2utf_8 => cp2utf8
encode_utf_8 => encode_utf8
get_translation_table_to_utf_8 => get_translation_table_to_utf8
goto invalid_utf_8_start_byte => goto invalid_utf8_start_byte
goto utf_8 => goto utf8
goto utf_8_select => goto utf8_select
terminal_interlink.utf_8 => terminal_interlink.utf8
utf_8_to_unicode => utf8_to_unicode
What was not renamed:
terminal._template_.utf_8_io option, TERM_OPT_UTF_8_IO
Compatibility with existing elinks.conf files would require an alias.
--enable-utf-8
Because the name of the charset is UTF-8, --enable-utf-8 looks better
than --enable-utf8.
CONFIG_UTF_8
Will be renamed in a later commit.
Unicode/utf_8.cp, table_utf_8, aliases_utf_8
Will be renamed in a later commit.
form_state.state_cell is no longer used for FC_TEXT, FC_PASSWORD, nor FC_FILE.
Instead, get_link_cursor_offset() computes the cell with utf8_ptr2chars
(a new function) or utf8_ptr2cells. This shouldn't slow down ELinks too
much, as it's done only for the selected link and only once per redraw.
The left side of a scrolled input field is always aligned at a
character boundary. The right side might not be.
The new comments describe how the members were apparently intended to
be used. However, the implementation does not actually work when
CONFIG_UTF_8 is defined, and the current semantics do not even allow
an efficient implementation of long (mostly scrolled out) strings.
Restring the spaghetti in dump_to_file to fix a bug that was introduced
in commit 2a6125e3d0 whereby when
document.dump.codepage != "utf-8", the document itself was not output,
only the references list.