[no subject]

texinfo-commits

From:	Patrice Dumas
Date:	Sat, 29 Jul 2023 11:15:37 -0400 (EDT)
branch: master
commit 4ddf044c0a5d935ba5319c8485c946d897e4416c
Author: Patrice Dumas <pertusus@free.fr>
AuthorDate: Sat Jul 29 17:14:14 2023 +0200

    * tp/Texinfo/ParserNonXS.pm (_next_text): do not try to modify the
    output when there was an encoding error.  Do not rely on
    Encode::decode to emit a warning on a decoding error, as it may not
    emit one, and the message would differ from the XS parser's.
    Instead call Encode::decode with the check argument set to
    Encode::FB_CROAK in an eval to get the perl message and to learn
    that there was an error.  Get the erroneous byte using
    Encode::decode with Encode::FB_QUIET.  Then call Encode::decode
    again with the default check argument to get the decoded string.
    Based on Gavin's input.
    
    * tp/tests/encoded/Makefile.am (EXTRA_DIST),
    tp/tests/encoded/list-of-tests (test_latin1_no_documentencoding):
    prepare a test with a letter in latin1 but no documentencoding to
    test for manuals that assumed latin1, especially for accented letters
    in person names.  Do not enable it because the XS and perl parsers
    lead to different outputs.
---
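Note: the three-step decoding described in the log above can be tried in
isolation.  What follows is a minimal sketch, not the parser code: the
helper name decode_reporting_first_bad_byte and the sample input are
made up for illustration, and it assumes the stock Encode module with
its FB_CROAK and FB_QUIET check values.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Texinfo: decode $bytes as $encoding and
# return the decoded string plus the first offending byte (undef if none).
sub decode_reporting_first_bad_byte {
  my ($encoding, $bytes) = @_;
  my $error_byte;

  # Encode::decode may modify its input when a check value is given,
  # so each checked call works on a fresh copy.
  my $copy = $bytes;
  # Step 1: FB_CROAK dies on malformed input, which tells us there is an error.
  eval { Encode::decode($encoding, $copy, Encode::FB_CROAK); };
  if ($@) {
    # Step 2: FB_QUIET stops at the first malformed byte and leaves the
    # unprocessed remainder in $copy.
    $copy = $bytes;
    Encode::decode($encoding, $copy, Encode::FB_QUIET);
    $error_byte = ord(substr($copy, 0, 1));
  }
  # Step 3: default check argument, malformed bytes are substituted.
  my $decoded = Encode::decode($encoding, $bytes);
  return ($decoded, $error_byte);
}

# A latin1 byte (0xFC) in input that is decoded as UTF-8.
my ($text, $bad) = decode_reporting_first_bad_byte('UTF-8', "M\xFCller\n");
printf("encoding error at byte 0x%02x\n", $bad) if defined($bad);
```

The sketch decodes up to three times only when there is an error; valid
input costs two calls, the same trade-off as in the patch below.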
 ChangeLog                                          | 20 ++++++++
 tp/Texinfo/ParserNonXS.pm                          | 56 +++++++++++++++-------
 tp/tests/encoded/Makefile.am                       |  1 +
 tp/tests/encoded/list-of-tests                     |  7 +++
 .../encoded/test_latin1_no_documentencoding.texi   |  8 ++++
 5 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 8cfe7ac32b..2f568e1652 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -10,6 +10,26 @@
        * doc/texinfo-tex-test.texi (Index and paragraphs): Remove section
        as change making index command end paragraph was reverted.
 
+2023-07-29  Patrice Dumas  <pertusus@free.fr>
+
+       * tp/Texinfo/ParserNonXS.pm (_next_text): do not try to modify the
+       output when there was an encoding error.  Do not use Encode::decode
+       to emit a warning in case of decoding error, as it may not, and it
+       would not be the same message as the XS parser.
+       Instead call Encode::decode with check argument set to
+       Encode::FB_CROAK in an eval to get the perl message and the
+       information that there was an error.  Get the erroneous byte using
+       Encode::decode with Encode::FB_QUIET.  Then call Encode::decode
+       again with the default check argument to get the decoded string.
+       Based on Gavin input.
+
+       * tp/tests/encoded/Makefile.am (EXTRA_DIST),
+       tp/tests/encoded/list-of-tests (test_latin1_no_documentencoding):
+       prepare a test with a letter in latin1 but no documentencoding to
+       test for manuals that assumed latin1, especially for accented letters
+       in person names.  Do not set it because the XS and perl parsers lead
+       to different outputs.
+
 2023-07-29  Patrice Dumas  <pertusus@free.fr>
 
        * NEWS: mention stricter checks of input encoding, with more errors
diff --git a/tp/Texinfo/ParserNonXS.pm b/tp/Texinfo/ParserNonXS.pm
index 98c115d5b7..0c5e0f8121 100644
--- a/tp/Texinfo/ParserNonXS.pm
+++ b/tp/Texinfo/ParserNonXS.pm
@@ -2371,24 +2371,33 @@ sub _next_text($;$)
         return ($next_line, { %{$input->{'input_source_info'}} });
       }
     } elsif ($input->{'fh'}) {
-      my $input_error = 0;
-      local $SIG{__WARN__} = sub {
-        my $message = shift;
-        print STDERR "$input->{'input_source_info'}->{'file_name'}" . ":"
-               . ($input->{'input_source_info'}->{'line_nr'} + 1)
-               . ": input error: $message";
-        $input_error = 1;
-      };
       my $fh = $input->{'fh'};
       my $input_line = <$fh>;
+      # Encode::decode tends to consume the input line, so duplicate it
+      my $duplicate_input_line = $input_line;
+      # Encode::decode with default check argument does not give a
+      # warning on incorrect input, contrary to what the documentation says.
+      # So we call it with FB_CROAK in an eval to get the message first
+      # before calling it again to get the result.
+      # This suits us as we try to output the same message as the XS parser
+      eval { Encode::decode($input->{'file_input_encoding'},
+                            $duplicate_input_line, Encode::FB_CROAK); };
+      if ($@) {
+        # determine the first problematic byte to show it in the error
+        # message, like the XS parser
+        $duplicate_input_line = $input_line;
+        my $partially_decoded = Encode::decode($input->{'file_input_encoding'},
+                                      $duplicate_input_line, Encode::FB_QUIET);
+        my $error_byte = substr($duplicate_input_line, 0, 1);
+        warn("$input->{'input_source_info'}->{'file_name'}:"
+            . ($input->{'input_source_info'}->{'line_nr'} + 1).
+               sprintf(": encoding error at byte 0x%2x\n", ord($error_byte)));
+        # show perl message but only with debugging
+        print STDERR "input error: $@\n" if ($self->{'DEBUG'});
+      }
+      # do the decoding
       my $line = Encode::decode($input->{'file_input_encoding'}, $input_line);
       if (defined($line)) {
-        if ($input_error) {
-          # possible encoding error.  attempt to recover by stripping out
-          # non-ASCII bytes.  there may not be that many in the file.
-          Encode::_utf8_off($line);
-          $line =~ s/[\x80-\xFF]//g;
-        }
         # add an end of line if there is none at the end of file
         if (eof($fh) and $line !~ /\n/) {
           $line .= "\n";
@@ -2768,15 +2777,26 @@ sub _expand_linemacro_arguments($$$$$)
       delete $argument_content->{'extra'};
       # FIXME relocate source marks
       if ($toplevel_braces_nr == 1 and $argument_content->{'text'} =~ /^\{(.*)\}$/s) {
+        #if ($argument_content->{'source_marks'}) {
+        #  print STDERR "TODO: relocate source mark?\n";
+        #}
         print STDERR "TURN to bracketed $arg_idx "
           .Texinfo::Common::debug_print_element($argument_content)."\n"
             if ($self->{'DEBUG'});
         $argument_content->{'text'} = $1;
         $argument_content->{'type'} = 'bracketed_arg';
-      }
+      # this message could be added to see all the arguments
+      #} else {
+      #  print STDERR "NOT bracketed with bracket $arg_idx "
+      #    .Texinfo::Common::debug_print_element($argument_content)."\n"
+      #      if ($self->{'DEBUG'});
+      }
+    # this message could be added to see all the arguments
+    #} else {
+    #  print STDERR "LVL0 no brace $arg_idx "
+    #     .Texinfo::Common::debug_print_element($argument_content)."\n"
+    #        if ($self->{'DEBUG'});
     }
-    # do that?
-    #_remove_empty_content($self, $argument);
     $arg_idx++;
   }
   print STDERR "END LINEMACRO ARGS EXPANSION\n" if ($self->{'DEBUG'});
@@ -4922,7 +4942,7 @@ sub _handle_macro($$$$$)
     if ($self->{'DEBUG'});
 
   my $error;
-  # FIXME same stack for linemacro?
+  # FIXME use a different counter for linemacro?
   if ($self->{'MAX_MACRO_CALL_NESTING'}
       and $self->{'macro_expansion_nr'} > $self->{'MAX_MACRO_CALL_NESTING'}) {
     $self->_line_warn(sprintf(__(
diff --git a/tp/tests/encoded/Makefile.am b/tp/tests/encoded/Makefile.am
index 3109ead145..a1b972cc9f 100644
--- a/tp/tests/encoded/Makefile.am
+++ b/tp/tests/encoded/Makefile.am
@@ -1,6 +1,7 @@
 EXTRA_DIST = \
  osé_utf8.texi osé_utf8_no_setfilename.texi \
  manual_include_accented_file_name_latin1.texi \
+ test_latin1_no_documentencoding.texi \
  çss.css cêss.css an_ïmage.png txt_çimage.txt list-of-tests  res_parser
 
 DISTCLEANFILES = tests.log tests.out
diff --git a/tp/tests/encoded/list-of-tests b/tp/tests/encoded/list-of-tests
index 422757cf3d..f6802da4b0 100644
--- a/tp/tests/encoded/list-of-tests
+++ b/tp/tests/encoded/list-of-tests
@@ -29,3 +29,10 @@ manual_include_accented_file_name_latin1_explicit_encoding manual_include_accent
 # fails to find the latin1 encoded include file as the locale encoding
 # of the test suite is utf8
 manual_include_accented_file_name_latin1_use_locale_encoding manual_include_accented_file_name_latin1.texi --info -D 'needrecodedfilenames Need recoded file names' -c MESSAGE_ENCODING=UTF-8 -c INPUT_FILE_NAME_ENCODING=UTF-8
+
+# test that a file with some latin1 characters, typically used for person names
+# but no declared encoding does not give a result that is too bad, and lead
+# to a warning.  This corresponds to an actual practice when latin1 was a
+# de-facto default encoding for Texinfo manuals, before UTF-8.
+# Not enabled because perl parser and XS parser lead to different output
+#test_latin1_no_documentencoding test_latin1_no_documentencoding.texi
diff --git a/tp/tests/encoded/test_latin1_no_documentencoding.texi b/tp/tests/encoded/test_latin1_no_documentencoding.texi
new file mode 100644
index 0000000000..0164b80a2f
--- /dev/null
+++ b/tp/tests/encoded/test_latin1_no_documentencoding.texi
@@ -0,0 +1,8 @@
+\include texinfo
+
+@node Top
+@top
+
+This manual is by Tommy M�ller.
+
+@bye