Heh. My new UHD monitor makes it easy to have more than 127 lines of
text. I guess the 'char' could be an unsigned char, but quite frankly,
trying to save a couple of bytes per open editor window seems a bit
excessive these days. So just make it 'int'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yes, yes, it probably made sense 30 years ago as a way to save a tiny
amount of memory, but especially when interspersed in structures that
have pointers (aligned to 64 bits these days), it's not even saving
memory today. And it makes us fail in nasty ways when looking at files
with long lines.
So just make them 'int'. And if you have a line that is longer than
2GB, you only have yourself to blame. I no longer care.
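To illustrate the padding point, here is a minimal sketch with two made-up
structures (with_short and with_int are hypothetical, not the editor's real
layouts). On common ABIs the pointer that follows the small field forces
padding, so the narrower type buys nothing:

```c
/* Hypothetical layouts for illustration only. */
struct with_short {
	void *next;
	short used;	/* 2 bytes, then padding up to pointer alignment */
	void *text;
};

struct with_int {
	void *next;
	int used;	/* 4 bytes, then less padding */
	void *text;
};
```

On a typical 64-bit Linux ABI both structures come out the same size,
though the C standard itself makes no such promise.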
In case anybody cares, the "test-case" for this was a lovely UDDF file
with a binary divecomputer dump encoded as an XML element. Resulting in
a lovely 41kB single line. Not what poor micro-emacs was designed for,
I'm afraid.
I really should just learn another editor, rather than continue to
polish this turd.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
llength() is currently a 'short' which can overflow and result in
negative numbers if line lengths are larger than 32k. We'll fix the
overflow separately, but before we do that, just use a signed int to
hold the value so that we don't overrun memory allocations when we
convert that negative number to a large positive unsigned integer.
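A rough sketch of the failure mode, assuming a 16-bit two's complement
'short'. Both helpers (wrap_to_short and as_alloc_size) are hypothetical
names made up for this example and only emulate what happens to the value:

```c
#include <stddef.h>

/* Emulate storing a non-negative length in a 16-bit signed field. */
static int wrap_to_short(long length)
{
	long low = length & 0xffff;

	return (int)(low >= 0x8000 ? low - 0x10000 : low);
}

/* What an allocator sees when that value is converted to size_t. */
static size_t as_alloc_size(int length)
{
	return (size_t)length;
}
```

A 40000-byte line wraps to -25536, and converting that to size_t yields a
value close to the maximum size_t rather than anything resembling the real
length.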
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some reason I had limited things to 0xffff, it really should be 0x10ffff.
We don't actually support a full 32-bit unicode model anyway, since we
use the high bits for the control/meta/^X/special bits, but there was no
reason to limit things to 16 bits when we had 28 bits available. And
the real limit for real Unicode characters is 0x10ffff.
Add a silly example character past the 16-bit range to the UTF8 demo
file:
'SMILING FACE WITH HALO' (U+1F607)
from the 'emoticons' block.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GCC spotted the following unused variable:
CC file.o
file.c: In function ‘readin’:
file.c:225:6: warning: variable ‘lflag’ set but not used [-Wunused-but-set-variable]
file.c: In function ‘ifile’:
file.c:553:6: warning: variable ‘lflag’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions convert the byte offset into the column number
(getccol()) and vice versa (getgoal()).
Getting this right means that moving up and down the text gets us the
right columns, rather than moving randomly left and right when you move
up and down. We also won't end up in the middle of a utf-8 character,
because we're not just moving into some random byte offset, we're moving
into a proper column.
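As a sketch of the idea only (not the actual getccol()/getgoal() code), the
two conversions could look like this. The names utf8_seq_len,
offset_to_column and column_to_offset are hypothetical, and the sketch
assumes one column per character, ignoring tabs, control characters and
wide characters, and does not validate continuation bytes:

```c
#include <stddef.h>

/* Byte length of the sequence starting at 'lead'; 1 for anything odd. */
static size_t utf8_seq_len(unsigned char lead)
{
	if (lead < 0x80)
		return 1;
	if ((lead & 0xe0) == 0xc0)
		return 2;
	if ((lead & 0xf0) == 0xe0)
		return 3;
	if ((lead & 0xf8) == 0xf0)
		return 4;
	return 1;
}

/* Byte offset -> column number. */
static size_t offset_to_column(const char *line, size_t len, size_t offset)
{
	size_t i = 0, col = 0;

	while (i < offset && i < len) {
		i += utf8_seq_len((unsigned char)line[i]);
		col++;
	}
	return col;
}

/* Column number -> byte offset of that character, clamped to len. */
static size_t column_to_offset(const char *line, size_t len, size_t goal)
{
	size_t i = 0, col = 0;

	while (i < len && col < goal) {
		size_t n = utf8_seq_len((unsigned char)line[i]);

		if (n > len - i)
			n = len - i;	/* truncated sequence at end of line */
		i += n;
		col++;
	}
	return i;
}
```

Moving to another line then means converting the current offset to a
column on the old line and converting that column back to an offset on the
new one, which always lands on the first byte of a character.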
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This re-introduces vtputc() as the way to show characters, which
reinstates the control character handling, and simplifies show_line() in
the process.
vtputc now takes an "int" that is either a unicode character or a signed
char (so negative values in the range [-1, -128] are considered to be
the same as [128, 255]). This allows us to use it regardless of what
the source of data is.
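A minimal sketch of that value normalization, using a hypothetical helper
normalize_display_char() (this is not the real vtputc() body, just the
mapping described above):

```c
/*
 * Map the "signed char" range [-128, -1] onto [128, 255] and pass
 * everything else through unchanged as a unicode value.
 */
static unsigned int normalize_display_char(int c)
{
	if (c < 0 && c >= -128)
		return (unsigned int)(c + 256);
	return (unsigned int)c;
}
```

With this, a byte read through a plain (signed) 'char' and the same byte
read as an unsigned value end up as the same character.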
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes actual basic editing work. Including things like
justify-paragraph etc, so lines get justified by number of UTF8
characters rather than bytes.
There is probably tons of broken stuff left, but this actually seems to
get the basics working right.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes it possible to cut-and-paste the UTF8 testfile into a new
buffer, and the end result looks correct.
NOTE! We still do various things wrong while editing. For example,
while the cursor movements were fixed, simple things like deleting a
character still work on single bytes, rather than utf8 characters.
So while this is getting much closer to actually editing UTF-8 data,
it's not there yet.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The TAB handling got broken by commit cee00b0efb ("Show UTF-8 input as
UTF-8 output") when it stopped doing things one byte at a time.
I'm sure the other special character cases are broken too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This uses the four high bits for the meta and control key sequences.
This means that we will be limiting our Unicode space to 28 bits, but
that's more than we really need.
It *would* be nicer if we just used the sign bit to mark "we have meta
character information", but that would require bigger changes. And we
really don't need to worry about 30-bit unicode. Small steps, remember.
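For illustration, a sketch of what such a layout could look like. The
macro names and the assignment of individual bits below are assumptions
made for this example; only the "four flag bits on top, 28 bits of
character below" split comes from the text:

```c
/* Assumed bit assignment, for illustration only. */
#define KEY_CONTROL	0x10000000u
#define KEY_META	0x20000000u
#define KEY_CTLX	0x40000000u
#define KEY_SPEC	0x80000000u

#define KEY_FLAGS	0xf0000000u	/* the four high bits */
#define KEY_CHAR	0x0fffffffu	/* the remaining 28 bits */
#define UNICODE_MAX	0x10ffffu	/* highest real Unicode code point */

/* Strip the flag bits, leaving the character value. */
static unsigned int key_char(unsigned int key)
{
	return key & KEY_CHAR;
}

/* Keep only the flag bits. */
static unsigned int key_flags(unsigned int key)
{
	return key & KEY_FLAGS;
}

/* True if 'c' is within the real Unicode range. */
static int is_unicode(unsigned int c)
{
	return c <= UNICODE_MAX;
}
```

Since 0x10ffff needs only 21 bits, every real Unicode character fits in
the 28-bit part with room to spare.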
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. but we do have that 0.1s delay, so if somebody feeds us non-utf8
sequences, we won't delay forever.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now the input side can give partial utf8 input, and that showed
that we didn't properly handle that case.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm starting to expand the input value from 'short' (with flags in the
upper eight bits) to 'int' (with negative values having flags).
Small baby steps.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ttgetc() used some homebrew utf8 to unicode translation, limited to just
the normal latin1 characters. Use the utf8 helper functions to get it
right for the more complex cases.
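A minimal sketch of what such a helper might do. utf8_decode() here is a
hypothetical stand-in, not the editor's actual helper, and falling back to
"treat the byte as Latin1" on malformed input is an assumption of this
sketch. It also does not reject overlong encodings or surrogates:

```c
#include <stddef.h>

/*
 * Decode one character from s[0..len). Stores the value in *out and
 * returns the number of bytes consumed (0 only if len is 0).
 */
static size_t utf8_decode(const char *s, size_t len, unsigned int *out)
{
	unsigned char lead;
	unsigned int value;
	size_t need, i;

	if (len == 0)
		return 0;
	lead = (unsigned char)s[0];
	*out = lead;			/* default: single byte as-is */
	if (lead < 0x80)
		return 1;
	if ((lead & 0xe0) == 0xc0) {
		need = 2;
		value = lead & 0x1f;
	} else if ((lead & 0xf0) == 0xe0) {
		need = 3;
		value = lead & 0x0f;
	} else if ((lead & 0xf8) == 0xf0) {
		need = 4;
		value = lead & 0x07;
	} else {
		return 1;		/* stray continuation or bad lead byte */
	}
	if (need > len)
		return 1;		/* truncated sequence */
	for (i = 1; i < need; i++) {
		unsigned char c = (unsigned char)s[i];

		if ((c & 0xc0) != 0x80)
			return 1;	/* not a continuation byte */
		value = (value << 6) | (c & 0x3f);
	}
	*out = value;
	return need;
}
```

The two-byte sequence c3 a9 decodes to 0xe9, while a lone e9 byte is
passed through as 0xe9 as well, which is how the Latin1 fallback shows up.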
NOTE! We don't actually handle characters > 0xff right anyway. And we
still end up doing Latin1 in the buffers on input. One small step at a
time.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ok, so it may do odd things if it's not truly utf-8, and when moving up
and down lines that have utf-8 the cursor moves oddly (because the byte
offset within the line stays constant, rather than the character
offset), but with this you can actually open the UTF8 example file and
move around it, and at least some of the movement makes sense.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's just plan on being fully utf8 some day. We're not there yet, and
maybe we'll never be, but having the halfway mode is not useful either.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. by doing the stupid "convert to unicode value and back" model.
This actually populates the 'struct video' array with the unicode
values, so UTF8 input actually shows correctly. In particular, the nice
test-file (UTF-8-demo.txt) shows up not as garbage, but as the UTF-8 it
is.
HOWEVER!
Since the *editing* doesn't know about UTF-8, and considers it just a
stream of bytes, the end result is not actually a usable utf-8 editor.
So don't get too excited yet: this is just a partial step to "actually
edit utf8 data".
NOTE NOTE NOTE! If the character buffer contains Latin1, we will
transform that Latin1 to unicode, and then output it as UTF8. And we
will edit it correctly as the character-by-character data. Also, we
still do the "UTF8 to Latin1" translation on *input*, so with this
commit we can actually continue to *edit* Latin1 text.
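The "and back" half can be sketched as follows. utf8_encode() is a
hypothetical name for this example and is not taken from the source; it
simply shows how a unicode value (including one that started life as a
Latin1 byte) becomes UTF-8 on output:

```c
#include <stddef.h>

/*
 * Encode unicode value 'c' into out[0..3]. Returns the number of bytes
 * written, or 0 if 'c' is above 0x10ffff.
 */
static size_t utf8_encode(unsigned int c, unsigned char *out)
{
	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xc0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3f));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (unsigned char)(0xe0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (c & 0x3f));
		return 3;
	}
	if (c <= 0x10ffff) {
		out[0] = (unsigned char)(0xf0 | (c >> 18));
		out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
		out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
		out[3] = (unsigned char)(0x80 | (c & 0x3f));
		return 4;
	}
	return 0;
}
```

A Latin1 byte 0xe9 in the buffer is taken as unicode 0xe9 and comes out as
the two bytes c3 a9.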
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>