1
0
forked from aniani/vim
Commit Graph

35 Commits

Author SHA1 Message Date
Christian Brabandt
22e8e12d9f patch 9.1.0645: regex: wrong match when searching multi-byte char case-insensitive
Problem:  regex: wrong match when searching multi-byte char
          case-insensitive (diffsetter)
Solution: Apply proper case-folding for characters and search-string

This patch does the following 4 things:

1) When the regexp engine compares two utf-8 codepoints case
   insensitive it may match an adjacent character, because it assumes
   it can step over as many bytes as the pattern contains.

   This however is not necessarily true because of case-folding, a
   multi-byte UTF-8 character can be considered equal to some
   single-byte value.

   Let's consider the pattern 'ſ' and the string 's'. When comparing and
   ignoring case, the single character 's' matches, and since it matches
   Vim will try to step over the match (by the amount of bytes of the
   pattern), assuming that since it matches, the length of both strings is
   the same.

   However in that case, it should only step over the single byte value
   's' by 1 byte and try to start matching after it again. So for the
   backtracking engine we need to ensure:
   * we try to match the correct length for the pattern and the text
   * in case of a match, we step over it correctly

   There is one tricky thing for the backtracing engine. We also need to
   calculate correctly the number of bytes to compare the 2 different
   utf-8 strings s1 and s2. So we will count the number of characters in
   s1 that the byte len specified. Then we count the number of bytes to
   step over the same number of characters in string s2 and then we can
   correctly compare the 2 utf-8 strings.

2) A similar thing can happen for the NFA engine, when skipping to the
   next character to test for a match. We are skipping over the regstart
   pointer, however we do not consider the case that because of
   case-folding we may need to adjust the number of bytes to skip over.
   So this needs to be adjusted in find_match_text() as well.

3) A related issue turned out, when prog->match_text is actually empty.
   In that case we should try to find the next match and skip this
   condition.

4) When comparing characters using collections, we must also apply case
   folding to each character in the collection and not just to the
   current character from the search string.  This doesn't apply to the
   NFA engine, because internally it converts collections to branches
   [abc] -> a\|b\|c

fixes: #14294
closes: #14756

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-07-30 20:39:18 +02:00
zeertzjq
a59e031aa0 patch 9.1.0334: No test for highlight behavior with 'ambiwidth'
Problem:  No test for highlight behavior with 'ambiwidth'.
Solution: Add a screendump test for 'ambiwidth' with 'cursorline'.
          (zeertzjq)

closes: #14554

Signed-off-by: zeertzjq <zeertzjq@outlook.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-15 19:14:38 +02:00
Christian Brabandt
c97f4d61cd patch 9.1.0297: Patch 9.1.0296 causes too many issues
Problem:  Patch 9.1.0296 causes too many issues
          (Tony Mechelynck, @chdiza, CI)
Solution: Back out the change for now

Revert "patch 9.1.0296: regexp: engines do not handle case-folding well"

This reverts commit 7a27c108e0 it causes
issues with syntax highlighting and breaks the FreeBSD and MacOS CI. It
needs more work.

fixes: #14487

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-10 16:22:17 +02:00
Christian Brabandt
7a27c108e0 patch 9.1.0296: regexp: engines do not handle case-folding well
Problem:  Regex engines do not handle case-folding well
Solution: Correctly calculate byte length of characters to skip

When the regexp engine compares two utf-8 codepoints case insensitively
it may match an adjacent character, because it assumes it can step over
as many bytes as the pattern contains.

This however is not necessarily true because of case-folding, a
multi-byte UTF-8 character can be considered equal to some single-byte
value.

Let's consider the pattern 'ſ' and the string 's'. When comparing and
ignoring case, the single character 's' matches, and since it matches
Vim will try to step over the match (by the amount of bytes of the
pattern), assuming that since it matches, the length of both strings is
the same.

However in that case, it should only step over the single byte
value 's' so by 1 byte and try to start matching after it again. So for the
backtracking engine we need to ensure:
- we try to match the correct length for the pattern and the text
- in case of a match, we step over it correctly

The same thing can happen for the NFA engine, when skipping to the next
character to test for a match. We are skipping over the regstart
pointer, however we do not consider the case that because of
case-folding we may need to adjust the number of bytes to skip over. So
this needs to be adjusted in find_match_text() as well.

A related issue turned out, when prog->match_text is actually empty. In
that case we should try to find the next match and skip this condition.

fixes: #14294
closes: #14433

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-04-09 22:53:19 +02:00
Christian Brabandt
d2cc51f9a1 patch 9.1.0011: regexp cannot match combining chars in collection
Problem:  regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the
          NFA and BT Regex Engine

Also, while at it, make debug mode work again.

fixes #10286
closes: #12871

Signed-off-by: Christian Brabandt <cb@256bit.org>
2024-01-04 22:54:08 +01:00
Christian Brabandt
be07caa071 patch 9.0.1777: patch 9.0.1771 causes problems
Problem:  patch 9.0.1771 causes problems
Solution: revert it

Revert "patch 9.0.1771: regex: combining chars in collections not handled"
This reverts commit ca22fc36a4.

Signed-off-by: Christian Brabandt <cb@256bit.org>
2023-08-20 22:28:28 +02:00
Christian Brabandt
ca22fc36a4 patch 9.0.1771: regex: combining chars in collections not handled
Problem:  regex: combining chars in collections not handled
Solution: Check for following combining characters for NFA and BT engine

closes: #10459
closes: #10286

Signed-off-by: Christian Brabandt <cb@256bit.org>
2023-08-20 20:38:56 +02:00
Martin Tournoij
25f3a146a0 patch 9.0.0700: there is no real need for a "big" build
Problem:    There is no real need for a "big" build.
Solution:   Move common features to "normal" build, less often used features
            to the "huge" build. (Martin Tournoij, closes #11283)
2022-10-08 19:26:41 +01:00
Bram Moolenaar
db77cb3c08 patch 9.0.0669: too many delete() calls in tests
Problem:    Too many delete() calls in tests.
Solution:   Use deferred delete where possible.
2022-10-05 21:45:30 +01:00
Bram Moolenaar
cb36c2a3cd patch 9.0.0106: illegal byte regexp test doesn't fail when fix is reversed
Problem:    Illegal byte regexp test doesn't fail when fix is reversed.
Solution:   Make sure illegal bytes end up in sourced script file.
2022-07-29 18:32:20 +01:00
Bram Moolenaar
f50940531d patch 9.0.0105: illegal memory access when pattern starts with illegal byte
Problem:    Illegal memory access when pattern starts with illegal byte.
Solution:   Do not match a character with an illegal byte.
2022-07-29 16:22:25 +01:00
Bram Moolenaar
2457b2bbc2 patch 8.2.4443: regexp pattern test fails on Mac
Problem:    Regexp pattern test fails on Mac.
Solution:   Do not use a swapfile for the buffer.
2022-02-22 16:19:37 +00:00
Bram Moolenaar
6456fae9ba patch 8.2.4440: crash with specific regexp pattern and string
Problem:    Crash with specific regexp pattern and string.
Solution:   Stop at the start of the string.
2022-02-22 13:37:31 +00:00
Bram Moolenaar
424bcae1fb patch 8.2.4273: the EBCDIC support is outdated
Problem:    The EBCDIC support is outdated.
Solution:   Remove the EBCDIC support.
2022-01-31 14:59:41 +00:00
Bram Moolenaar
65b6056659 patch 8.2.3409: reading beyond end of line with invalid utf-8 character
Problem:    Reading beyond end of line with invalid utf-8 character.
Solution:   Check for NUL when advancing.
2021-09-07 19:26:53 +02:00
Bram Moolenaar
0b94e297af patch 8.2.2716: the equivalent class regexp is missing some characters
Problem:    The equivalent class regexp is missing some characters.
Solution:   Update the list of equivalent characters. (Dominique Pellé,
            closes #8029)
2021-04-05 13:59:53 +02:00
Bram Moolenaar
66c50c5653 patch 8.2.2278: falling back to old regexp engine can some patterns
Problem:    Falling back to old regexp engine can some patterns.
Solution:   Do not fall back once [[:lower:]] or [[:upper:]] is used.
            (Christian Brabandt, closes #7572)
2021-01-02 17:43:49 +01:00
Bram Moolenaar
ef2dff52de patch 8.2.2177: pattern "^" does not match if first character is combining
Problem:    Pattern "^" does not match if the first character in the line is
            combining. (Rene Kita)
Solution:   Do accept a match at the start of the line. (closes #6963)
2020-12-21 14:54:32 +01:00
Bram Moolenaar
8a9bc95eae patch 8.2.1786: various Normal mode commands not fully tested
Problem:    Various Normal mode commands not fully tested.
Solution:   Add more tests. (Yegappan Lakshmanan, closes #7059)
2020-10-02 18:48:07 +02:00
Bram Moolenaar
7d40b8a532 patch 8.2.1295: tests 44 and 99 are old style
Problem:    Tests 44 and 99 are old style.
Solution:   Convert to new style tests. (Yegappan Lakshmanan, closes #6536)
2020-07-26 12:52:59 +02:00
Bram Moolenaar
470adb827f patch 8.2.1254: MS-Windows: regexp test may fail if 'iskeyword' set wrongly
Problem:    MS-Windows: regexp test may fail if 'iskeyword' set wrongly.
Solution:   Override the 'iskeyword' value. (Taro Muraoka, closes #6502)
2020-07-20 21:21:30 +02:00
Bram Moolenaar
59de417b90 patch 8.2.0938: NFA regexp uses tolower ()to compare ignore-case
Problem:    NFA regexp uses tolower() to compare ignore-case. (Thayne McCombs)
Solution:   Use utf_fold() when possible. (ref. neovim #12456)
2020-06-09 19:34:54 +02:00
Bram Moolenaar
afc13bd827 patch 8.2.0014: test69 and test95 are old style
Problem:    Test69 and test95 are old style.
Solution:   Convert to new style tests. (Yegappan Lakshmanan, closes #5365)
2019-12-16 22:43:31 +01:00
Bram Moolenaar
2a5b52758b patch 8.1.1720: crash with very long %[] pattern
Problem:    Crash with very long %[] pattern. (Reza Mirzazade farkhani)
Solution:   Check for reg_toolong. (closes #4703)
2019-07-20 18:56:06 +02:00
Bram Moolenaar
221cd9f4dd patch 8.1.0862: no verbose version of character classes
Problem:    No verbose version of character classes.
Solution:   Add [:ident:], [:keyword:] and [:fname:]. (Ozaki Kiichi,
            closes #1373)
2019-01-31 15:34:40 +01:00
Bram Moolenaar
30276f2beb patch 8.1.0811: too many #ifdefs
Problem:    Too many #ifdefs.
Solution:   Graduate FEAT_MBYTE, the final chapter.
2019-01-24 17:59:39 +01:00
Bram Moolenaar
966e58e413 patch 8.0.0623: error for invalid regexp is not very informative
Problem:    The message "Invalid range" is used for multiple errors.
Solution:   Add two more specific error messages. (Itchyny, Ken Hamada)
2017-06-05 16:54:08 +02:00
Bram Moolenaar
13489b9c41 patch 8.0.0529: line in test commented out
Problem:    Line in test commented out.
Solution:   Uncomment the lines for character classes that were failing before
            8.0.0519. (Dominique Pelle, closes #1599)
2017-03-30 22:20:29 +02:00
Bram Moolenaar
0c078fc7db patch 8.0.0519: character classes are not well tested
Problem:    Character classes are not well tested. They can differ between
            platforms.
Solution:   Add tests.  In the documentation make clear which classes depend
            on what library function.  Only use :cntrl: and :graph: for ASCII.
            (Kazunobu Kuriyama, Dominique Pelle, closes #1560)
            Update the documentation.
2017-03-29 15:31:20 +02:00
Bram Moolenaar
d3c907b5d2 patch 7.4.2223
Problem:    Buffer overflow when using latin1 character with feedkeys().
Solution:   Check for an illegal character.  Add a test.
2016-08-17 21:32:09 +02:00
Bram Moolenaar
6bff02eb53 patch 7.4.2222
Problem:    Sourcing a script where a character has 0x80 as a second byte does
            not work. (Filipe L B Correia)
Solution:   Turn 0x80 into K_SPECIAL KS_SPECIAL KE_FILLER. (Christian
            Brabandt, closes #728)  Add a test case.
2016-08-16 22:50:55 +02:00
Bram Moolenaar
ac105ed3c4 patch 7.4.2086
Problem:    Using the system default encoding makes tests unpredictable.
Solution:   Always use utf-8 or latin1 in the new style tests.  Remove setting
            encoding and scriptencoding where it is not needed.
2016-07-21 20:33:32 +02:00
Bram Moolenaar
490465bda6 patch 7.4.1785
Problem:    Regexp test fails on windows.
Solution:   set 'isprint' to the right value for testing.
2016-04-24 15:11:02 +02:00
Bram Moolenaar
af98a49dd0 patch 7.4.1783
Problem:    The old regexp engine doesn't handle character classes correctly.
            (Manuel Ortega)
Solution:   Use regmbc() instead of regc().  Add a test.
2016-04-24 14:40:12 +02:00
Bram Moolenaar
22e421549d patch 7.4.1700
Problem:    Equivalence classes are not properly tested.
Solution:   Add tests for multi-byte and latin1. Fix an error. (Owen Leibman)
2016-04-03 14:02:02 +02:00