Yes, yes, it probably made sense 30 years ago as a way to save a tiny
amount of memory, but especially when interspersed in structures that
have pointers (aligned to 64 bits these days), it's not even saving
memory today. And it makes us fail in nasty ways when looking at files
with long lines.
So just make them 'int'. And if you have a line that is longer than
2GB, you only have yourself to blame. I no longer care.
In case anybody cares, the "test-case" for this was a lovely UDDF file
with a binary divecomputer dump encoded as an XML element. Resulting in
a lovely 41kB single line. Not what poor micro-emacs was designed for,
I'm afraid.
I really should just learn another editor, rather than continue to
polish this turd.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
llength() is currently a 'short', which can overflow and go negative if
a line is longer than 32k. We'll fix the overflow separately, but before
we do that, just use a signed int to hold the value so that we don't
overrun memory allocations when we convert that negative number to a
large positive unsigned integer.
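A rough standalone sketch of the failure mode, not the actual uemacs
code: stored_length() and as_alloc_size() are made-up helpers, and
41000 is just an example length in the range of the test file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: what a 16-bit signed field reads back for a given length. */
static int stored_length(int real_length)
{
	unsigned int low16;

	assert(real_length >= 0);
	low16 = (unsigned int)real_length & 0xffffu;

	/* Values with bit 15 set read back as negative in a signed short. */
	return low16 >= 0x8000u ? (int)low16 - 0x10000 : (int)low16;
}

/* Hypothetical: a negative length handed to something that takes a size_t. */
static size_t as_alloc_size(int length)
{
	return (size_t)length;
}
```

With these, stored_length(41000) comes back as -24536, and
as_alloc_size(-24536) is an enormous positive size.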
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some reason I had limited things to 0xffff, it really should be 0x10ffff.
We don't actually support a full 32-bit unicode model anyway, since we
use the high bits for the control/meta/^X/special bits, but there was no
reason to limit things to 16 bits when we had 28 bits available. And
the real limit for real Unicode characters is 0x10ffff.
Add a silly example character past the 16-bit range to the UTF8 demo
file:
'SMILING FACE WITH HALO' (U+1F607)
from the 'emoticons' block.
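For illustration only, a minimal UTF-8 encoder with the 0x10ffff limit.
This is a sketch, not the editor's own conversion code; the name
utf8_encode_sketch() is an assumption, and it does not bother to
reject surrogates.

```c
#include <assert.h>

/*
 * Encode 'c' into 'out' (room for 4 bytes). Returns the number of
 * bytes written, or 0 if 'c' is past the Unicode limit.
 */
static int utf8_encode_sketch(unsigned int c, unsigned char *out)
{
	assert(out != 0);
	if (c > 0x10ffff)
		return 0;
	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xc0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3f));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (unsigned char)(0xe0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (c & 0x3f));
		return 3;
	}
	out[0] = (unsigned char)(0xf0 | (c >> 18));
	out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
	out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
	out[3] = (unsigned char)(0x80 | (c & 0x3f));
	return 4;
}
```

U+1F607 needs four bytes (f0 9f 98 87), which is exactly the case a
16-bit limit cannot represent.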
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GCC spotted the following unused variable:
  CC file.o
  file.c: In function ‘readin’:
  file.c:225:6: warning: variable ‘lflag’ set but not used [-Wunused-but-set-variable]
  file.c: In function ‘ifile’:
  file.c:553:6: warning: variable ‘lflag’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions convert the byte offset into the column number
(getccol()) and vice versa (getgoal()).
Getting this right means that moving up and down the text gets us the
right columns, rather than moving randomly left and right when you move
up and down. We also won't end up in the middle of a utf-8 character,
because we're not just moving into some random byte offset, we're moving
into a proper column.
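The idea, roughly, as a standalone sketch rather than the real
getccol()/getgoal(): the function names are made up, and it assumes
every character is one column wide, tabs stop every 8 columns, and
control characters get no special treatment.

```c
#include <assert.h>

/* True for UTF-8 continuation bytes (10xxxxxx). */
static int is_continuation(unsigned char b)
{
	return (b & 0xc0) == 0x80;
}

/* Byte offset -> display column. */
static int column_of_offset(const char *line, int len, int offset)
{
	int col = 0, i = 0;

	assert(line != 0 && len >= 0);
	while (i < offset && i < len) {
		if (line[i] == '\t')
			col = (col | 7) + 1;	/* next tab stop */
		else
			col++;
		i++;
		/* Step over the rest of a multi-byte character. */
		while (i < len && is_continuation((unsigned char)line[i]))
			i++;
	}
	return col;
}

/* Goal column -> byte offset, always on a character boundary. */
static int offset_of_column(const char *line, int len, int goal)
{
	int col = 0, i = 0;

	assert(line != 0 && len >= 0);
	while (i < len) {
		int next = (line[i] == '\t') ? (col | 7) + 1 : col + 1;

		if (next > goal)
			break;
		col = next;
		i++;
		while (i < len && is_continuation((unsigned char)line[i]))
			i++;
	}
	return i;
}
```

For the five bytes 'a', c3 a9 (e-acute), TAB, 'b', the only offsets
offset_of_column() can return are 0, 1, 3, 4 and 5, never 2, which
would be the middle of the two-byte character.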
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This re-introduces vtputc() as the way to show characters, which
reinstates the control character handling, and simplifies show_line() in
the process.
vtputc now takes an "int" that is either a unicode character or a signed
char (so negative values in the range [-1, -128] are considered to be
the same as [128, 255]). This allows us to use it regardless of what
the source of data is.
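The normalization amounts to something like the following sketch.
normalize_display_char() is a made-up name for illustration, not a
function in the tree, and the real vtputc() does more than this.

```c
#include <assert.h>

/*
 * Map an 'int' that may have come from a signed char into the
 * unsigned value to display: [-128, -1] becomes [128, 255], and
 * everything else (including real Unicode values) passes through.
 */
static unsigned int normalize_display_char(int c)
{
	assert(c >= -128);
	if (c < 0)
		c += 256;
	return (unsigned int)c;
}
```

So a caller reading from a plain 'char' buffer and a caller that has
already decoded a Unicode character can both pass their value along
unchanged.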
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes actual basic editing work. Including things like
justify-paragraph etc, so lines get justified by number of UTF8
characters rather than bytes.
There is probably tons of broken stuff left, but this actually seems to
get the basics working right.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes it possible to cut-and-paste the UTF8 testfile into a new
buffer, and the end result looks correct.
NOTE! We still do various things wrong while editing. For example,
while the cursor movements were fixed, simple things like deleting a
character still work on single bytes, rather than utf8 characters.
So while this is getting much closer to actually editing UTF-8 data,
it's not there yet.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The TAB handling got broken by commit cee00b0efb ("Show UTF-8 input as
UTF-8 output") when it stopped doing things one byte at a time.
I'm sure the other special character cases are broken too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This uses the four high bits for the meta and control key sequences.
This means that we will be limiting our Unicode space to 28 bits, but
that's more than we really need.
It *would* be nicer if we just used the sign bit to mark "we have meta
character information", but that would require bigger changes. And we
really don't need to worry about 30-bit unicode. Small steps, remember.
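A sketch of that kind of layout. The KEY_* names, the assignment of
individual bits, and the helper functions are assumptions made for
illustration; only the split into 4 flag bits and 28 character bits
comes from the description above.

```c
#include <assert.h>

/* Hypothetical flag bits in the top four bits of a 32-bit key code. */
#define KEY_CONTROL	0x10000000u
#define KEY_META	0x20000000u
#define KEY_CTLX	0x40000000u
#define KEY_SPEC	0x80000000u

#define KEY_FLAG_MASK	0xf0000000u
#define KEY_CHAR_MASK	0x0fffffffu	/* the remaining 28 bits */

/* Combine flag bits with a character value. */
static unsigned int make_key(unsigned int flags, unsigned int ch)
{
	assert((flags & KEY_CHAR_MASK) == 0);
	assert(ch <= KEY_CHAR_MASK);
	return flags | ch;
}

/* The character part of a key code. */
static unsigned int key_char(unsigned int key)
{
	return key & KEY_CHAR_MASK;
}

/* The control/meta/^X/special part of a key code. */
static unsigned int key_flags(unsigned int key)
{
	return key & KEY_FLAG_MASK;
}
```

Since the largest real Unicode character is 0x10ffff, it fits in the
28-bit character part with room to spare.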
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. but we do have that 0.1s delay, so if somebody feeds us non-utf8
sequences, we won't delay forever.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now the input side can give partial utf8 input, and that showed
that we didn't properly handle that case.
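One way to handle it, as a sketch under assumptions and not the code in
this commit: a decoder that tells the caller "not enough bytes yet" so
the caller can wait (up to whatever timeout it uses) for the rest.
utf8_decode_partial() is a made-up name, bad sequences are simply
passed through one byte at a time, and overlong encodings are not
checked.

```c
#include <assert.h>

/*
 * Decode one character from buf[0..len). Returns the number of bytes
 * consumed and stores the character in *res. Returns 0 if the buffer
 * holds only the start of a multi-byte sequence; *res is meaningful
 * only when the return value is positive.
 */
static int utf8_decode_partial(const unsigned char *buf, int len,
			       unsigned int *res)
{
	unsigned int value;
	unsigned char c;
	int need, i;

	assert(buf != 0 && res != 0);
	if (len <= 0)
		return 0;
	c = buf[0];
	*res = c;
	if (c < 0x80)
		return 1;		/* plain ASCII */
	if (c < 0xc0 || c >= 0xf8)
		return 1;		/* not a valid lead byte: pass through */
	if (c < 0xe0) {
		need = 1;
		value = c & 0x1f;
	} else if (c < 0xf0) {
		need = 2;
		value = c & 0x0f;
	} else {
		need = 3;
		value = c & 0x07;
	}
	for (i = 1; i <= need; i++) {
		if (i >= len)
			return 0;	/* incomplete: wait for more input */
		if ((buf[i] & 0xc0) != 0x80)
			return 1;	/* broken sequence: pass lead byte through */
		value = (value << 6) | (buf[i] & 0x3f);
	}
	*res = value;
	return need + 1;
}
```

Given only the first two bytes of f0 9f 98 87 this returns 0, and
given all four it returns 4 with U+1F607.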
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>