From: Patrice Dumas
Subject: branch master updated: Simpler more consistent UTF-8 and unicode handling, stricter UTF-8 conversion
Date: Wed, 26 Jul 2023 17:22:06 -0400
This is an automated email from the git hooks/post-receive script.
pertusus pushed a commit to branch master
in repository texinfo.
The following commit(s) were added to refs/heads/master by this push:
new 6557161e7c Simpler more consistent UTF-8 and unicode handling, stricter UTF-8 conversion
6557161e7c is described below
commit 6557161e7c4ad6d1ad2e919ea022e3aab3f8ff8e
Author: Patrice Dumas <pertusus@free.fr>
AuthorDate: Wed Jul 26 23:20:19 2023 +0200
Simpler more consistent UTF-8 and unicode handling, stricter UTF-8 conversion
* tp/Texinfo/XS/parsetexi/end_line.c (end_line_misc_line): map utf8 to
utf-8 for input_encoding, to get the same output as the perl parser
with mime_name and also because it is better.
* tp/Texinfo/Common.pm (%encoding_name_conversion_map): map utf8 to
utf-8 to always use the same conversion in perl, and prefer the strict
conversion.
* tp/Texinfo/Common.pm (encode_file_name, count_bytes): do not use
utf-8 specific conversion, always use Encode encode and also use the
strict conversion.
* tp/Texinfo/Convert/Unicode.pm (_format_eight_bit_accents_stack),
tp/Texinfo/ParserNonXS.pm (_new_text_input, _next_text): use the
utf-8 encoding not utf8 for Encode encode strict conversion.
* tp/Texinfo/Convert/HTML.pm (converter_initialize),
tp/Texinfo/Convert/Unicode.pm: use charnames::vianame to obtain
characters based on a string representation of unicode codepoints, as
it is simple and this is what is described in the documentation.
* tp/Texinfo/Convert/Unicode.pm (unicode_point_decoded_in_encoding):
handle hex strings in the ascii range for 8bit encodings.
* tp/Makefile.tres, tp/t/08misc_commands.t (documentencoding_utf8):
new test with documentencoding utf8.
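The strictness distinction driving this change can be demonstrated directly with Encode. A minimal sketch, not part of the patch: the lax 'utf8' encoding accepts a UTF-16 surrogate (an invalid Unicode scalar value), while strict 'utf-8' with a croaking check refuses it.

```perl
use strict;
use Encode;

# chr(0xD800) is a UTF-16 surrogate, not a valid Unicode scalar value.
my $surrogate = chr(0xD800);

# The lax "utf8" encoding encodes it anyway (Perl's internal extended UTF-8).
my $lax_bytes = Encode::encode('utf8', $surrogate);

# The strict "utf-8" encoding croaks when asked to check its input.
my $strict_ok = eval {
  Encode::encode('utf-8', $surrogate, Encode::FB_CROAK);
  1;
};

printf "lax: %s, strict: %s\n",
       defined $lax_bytes ? 'accepted' : 'rejected',
       $strict_ok ? 'accepted' : 'rejected';
```

This is the behaviour difference described in perlunifaq under "What's the difference between UTF-8 and utf8?".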
---
ChangeLog | 31 ++++
tp/Makefile.tres | 1 +
tp/Texinfo/Common.pm | 38 ++---
tp/Texinfo/Convert/HTML.pm | 7 +-
tp/Texinfo/Convert/Unicode.pm | 58 +++----
tp/Texinfo/ParserNonXS.pm | 4 +-
tp/Texinfo/XS/parsetexi/end_line.c | 1 +
tp/t/08misc_commands.t | 8 +
.../results/misc_commands/documentencoding_utf8.pl | 166 +++++++++++++++++++++
9 files changed, 257 insertions(+), 57 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 3df4e615bc..cb5decfc66 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,34 @@
+2023-07-26 Patrice Dumas <pertusus@free.fr>
+
+ Simpler more consistent UTF-8 and unicode handling, stricter UTF-8 conversion
+
+ * tp/Texinfo/XS/parsetexi/end_line.c (end_line_misc_line): map utf8 to
+ utf-8 for input_encoding, to get the same output as the perl parser
+ with mime_name and also because it is better.
+
+ * tp/Texinfo/Common.pm (%encoding_name_conversion_map): map utf8 to
+ utf-8 to always use the same conversion in perl, and prefer the strict
+ conversion.
+
+ * tp/Texinfo/Common.pm (encode_file_name, count_bytes): do not use
+ utf-8 specific conversion, always use Encode encode and also use the
+ strict conversion.
+
+ * tp/Texinfo/Convert/Unicode.pm (_format_eight_bit_accents_stack),
+ tp/Texinfo/ParserNonXS.pm (_new_text_input, _next_text): use the
+ utf-8 encoding not utf8 for Encode encode strict conversion.
+
+ * tp/Texinfo/Convert/HTML.pm (converter_initialize),
+ tp/Texinfo/Convert/Unicode.pm: use charnames::vianame to obtain
+ characters based on a string representation of unicode codepoints, as
+ it is simple and this is what is described in the documentation.
+
+ * tp/Texinfo/Convert/Unicode.pm (unicode_point_decoded_in_encoding):
+ handle hex strings in the ascii range for 8bit encodings.
+
+ * tp/Makefile.tres, tp/t/08misc_commands.t (documentencoding_utf8):
+ new test with documentencoding utf8.
+
2023-07-26 Gavin Smith <gavinsmith0123@gmail.com>
* doc/texinfo.tex (\summarycontents): Set \extrasecnoskip to
diff --git a/tp/Makefile.tres b/tp/Makefile.tres
index 6efcfb59f4..d3ab682858 100644
--- a/tp/Makefile.tres
+++ b/tp/Makefile.tres
@@ -1437,6 +1437,7 @@ test_files_generated_list =
$(test_tap_files_generated_list) \
t/results/misc_commands/definfoenclose.pl \
t/results/misc_commands/definfoenclose_nestings.pl \
t/results/misc_commands/definfoenclose_with_empty_arg.pl \
+ t/results/misc_commands/documentencoding_utf8.pl \
t/results/misc_commands/documentencoding_zero.pl \
t/results/misc_commands/double_exdent.pl \
t/results/misc_commands/empty_center.pl \
diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
index 80bec30501..6727566083 100644
--- a/tp/Texinfo/Common.pm
+++ b/tp/Texinfo/Common.pm
@@ -32,7 +32,7 @@ use 5.008001;
# to determine the null file
use Config;
use File::Spec;
-# for find_encoding, resolve_alias and maybe utf8 related functions
+# for find_encoding, resolve_alias
use Encode;
# debugging
@@ -569,6 +569,16 @@ sub valid_tree_transformation ($)
our %encoding_name_conversion_map;
%encoding_name_conversion_map = (
'us-ascii' => 'iso-8859-1',
+ # The mapping to utf-8 is important for perl code, as it means using a strict
+ # conversion to utf-8 and not a lax conversion:
+ #
+ # https://perldoc.perl.org/perlunifaq#What's-the-difference-between-UTF-8-and-utf8?
+ # In more detail, we want to use utf-8 only for two different reasons
+ # 1) if input is malformed it is better to error out as soon as possible
+ # 2) we do not want to have different behaviour and hard to find bugs
+ # depending on whether the user used @documentencoding utf-8
+ # or @documentencoding utf8. There is a warning with utf8, but
+ # we want to be clear in any case.
+ 'utf8' => 'utf-8',
);
@@ -1318,12 +1328,11 @@ sub encode_file_name($$)
if (not defined($input_encoding));
if ($input_encoding eq 'utf-8' or $input_encoding eq 'utf-8-strict') {
- utf8::encode($file_name);
$encoding = 'utf-8';
} else {
- $file_name = Encode::encode($input_encoding, $file_name);
$encoding = $input_encoding;
}
+ $file_name = Encode::encode($encoding, $file_name);
return ($file_name, $encoding);
}
@@ -1752,23 +1761,7 @@ sub count_bytes($$;$)
$encoding = $self->get_conf('OUTPUT_PERL_ENCODING');
}
- if ($encoding eq 'utf-8'
- or $encoding eq 'utf-8-strict') {
- if (Encode::is_utf8($string)) {
- # Get the number of bytes in the underlying storage. This may
- # be slightly faster than calling Encode::encode_utf8.
- use bytes;
- return length($string);
-
- # Here's another way of doing it.
- #Encode::_utf8_off($string);
- #my $length = length($string);
- #Encode::_utf8_on($string);
- #return $length
- } else {
- return length(Encode::encode_utf8($string));
- }
- } elsif ($encoding and $encoding ne 'ascii') {
+ if ($encoding and $encoding ne 'ascii') {
if (!defined($last_encoding) or $last_encoding ne $encoding) {
# Look up and save encoding object for next time. This is
# slightly faster than calling Encode::encode.
@@ -1781,11 +1774,6 @@ sub count_bytes($$;$)
return length($Encode_encoding_object->encode($string));
} else {
return length($string);
- #my $length = length($string);
- #$string =~ s/\n/\\n/g;
- #$string =~ s/\f/\\f/g;
- #print STDERR "Count($length): $string\n";
- #return $length;
}
}
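The simplified count_bytes above reduces to "encode, then take the length". A small illustration, not from the patch, of the character/byte distinction it is counting, using the same find_encoding object caching the function relies on:

```perl
use strict;
use Encode;

my $string = "caf\x{E9}";   # "café": 4 characters

# Cache the encoding object once, as count_bytes does, then measure the
# byte length of the encoded string.
my $encoding_object = Encode::find_encoding('utf-8');
my $byte_count = length($encoding_object->encode($string));

# length() on the decoded string counts characters, not bytes.
printf "%d characters, %d bytes\n", length($string), $byte_count;
```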
diff --git a/tp/Texinfo/Convert/HTML.pm b/tp/Texinfo/Convert/HTML.pm
index 27d4bfd759..81e975fa37 100644
--- a/tp/Texinfo/Convert/HTML.pm
+++ b/tp/Texinfo/Convert/HTML.pm
@@ -31,7 +31,8 @@
package Texinfo::Convert::HTML;
-use 5.00405;
+# charnames::vianame is not documented in 5.6.0.
+use 5.008;
# See 'The "Unicode Bug"' under 'perlunicode' man page. This means
# that regular expressions will treat characters 128-255 in a Perl string
@@ -54,6 +55,7 @@ use File::Copy qw(copy);
use Storable;
use Encode qw(find_encoding decode encode);
+use charnames ();
use Texinfo::Commands;
use Texinfo::Common;
@@ -7765,7 +7767,8 @@ sub converter_initialize($)
if ($self->get_conf('OUTPUT_CHARACTERS')
and Texinfo::Convert::Unicode::unicode_point_decoded_in_encoding(
$output_encoding, $unicode_point)) {
- $special_characters_set{$special_character} = chr(hex($unicode_point));
+ $special_characters_set{$special_character}
+ = charnames::vianame("U+$unicode_point");
} elsif ($self->get_conf('USE_NUMERIC_ENTITY')) {
$special_characters_set{$special_character} =
'&#'.hex($unicode_point).';';
} else {
diff --git a/tp/Texinfo/Convert/Unicode.pm b/tp/Texinfo/Convert/Unicode.pm
index 2dfb7511ec..03d15f5433 100644
--- a/tp/Texinfo/Convert/Unicode.pm
+++ b/tp/Texinfo/Convert/Unicode.pm
@@ -19,10 +19,9 @@
package Texinfo::Convert::Unicode;
-# Seems to be the Perl version required for Encode:
-# http://cpansearch.perl.org/src/DANKOGAI/Encode-2.47/Encode/README.e2x
-# http://coding.derkeiler.com/Archive/Perl/comp.lang.perl.misc/2005-12/msg00833.html
-use 5.007_003;
+# Documentation of earlier releases for perluniintro is missing.
+# charnames::vianame is not documented in 5.6.0.
+use 5.008;
use strict;
# To check if there is no erroneous autovivification
@@ -33,6 +32,9 @@ use Carp qw(cluck);
use Encode;
use Unicode::Normalize;
use Unicode::EastAsianWidth;
+# To obtain unicode characters based on code points represented as
+# strings
+use charnames ();
use Texinfo::MiscXS;
@@ -563,19 +565,12 @@ our %extra_unicode_map = (
%unicode_map = (%unicode_map, %extra_unicode_map);
# set the %unicode_character_brace_no_arg_commands value to the character
-# corresponding to the hex value in %unicode_map.
+# corresponding to the textual hex value in %unicode_map.
our %unicode_character_brace_no_arg_commands;
foreach my $command (keys(%unicode_map)) {
if ($unicode_map{$command} ne '') {
- my $char_nr = hex($unicode_map{$command});
- if ($char_nr > 126 and $char_nr < 255) {
- # this is very strange, indeed. The reason lies certainly in the
- # magic backward compatibility support in Perl for 8bit encodings.
- $unicode_character_brace_no_arg_commands{$command} =
- Encode::decode("iso-8859-1", chr($char_nr));
- } else {
- $unicode_character_brace_no_arg_commands{$command} = chr($char_nr);
- }
+ $unicode_character_brace_no_arg_commands{$command}
+ = charnames::vianame("U+$unicode_map{$command}");
}
}
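For reference, a short sketch of the charnames interface used here, not part of the patch. Given a "U+" string, vianame() returns the ordinal code point, while string_vianame() returns the character itself, a distinction that matters whenever a character rather than a number is wanted:

```perl
use strict;
use charnames ();   # no imports needed for fully qualified calls

# Look up U+00E9 (LATIN SMALL LETTER E WITH ACUTE) both ways.
my $code_point = charnames::vianame('U+00E9');          # ordinal: 0xE9
my $character  = charnames::string_vianame('U+00E9');   # one-character string

printf "%04X %04X %d\n", $code_point, ord($character), length($character);
```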
@@ -697,6 +692,12 @@ foreach my $command (keys(%unicode_accented_letters)) {
}
}
+# Note that the values are not actually used anywhere; they are there
+# to mark unicode codepoints that exist in the encoding. It is important
+# to get them right, though, as the values are shown when debugging.
+# Also note that code points below A0, which correspond to the ascii
+# range, are not in the hash and therefore need to be handled separately
+# by the code using the hash.
my %unicode_to_eight_bit = (
'iso-8859-1' => {
'00A0' => 'A0',
@@ -1332,7 +1333,7 @@ sub unicode_text {
return $text;
}
-# return the 8 bit, if it exists, and the unicode codepoint
+# return the hexadecimal 8 bit string, if it exists, and the unicode codepoint
sub _eight_bit_and_unicode_point($$)
{
my $char = shift;
@@ -1428,36 +1429,36 @@ sub _format_eight_bit_accents_stack($$$$$;$)
my $command = 'TEXT';
$command = $partial_result->[1]->{'cmdname'} if ($partial_result->[1]);
if (defined($partial_result->[0])) {
- print STDERR " -> ".Encode::encode('utf8', $partial_result->[0])
+ print STDERR " -> ".Encode::encode('utf-8', $partial_result->[0])
."|$command\n";
} else {
- print STDERR " -> NO UTF8 |$command\n";
+ print STDERR " -> NO accented character |$command\n";
}
}
}
- # At this point we have the utf8 encoded results for the accent
+ # At this point we have the unicode character results for the accent
# commands stack, with all the intermediate results.
# For each one we'll check if it is possible to encode it in the
# current eight bit output encoding table and, if so set the result
# to the character.
- my $eight_bit = '';
+ my $prev_eight_bit = '';
while (@results_stack) {
my $char = $results_stack[0]->[0];
last if (!defined($char));
- my ($new_eight_bit, $new_codepoint)
+ my ($new_eight_bit, $codepoint)
= _eight_bit_and_unicode_point($char, $encoding);
if ($debug) {
my $command = 'TEXT';
$command = $results_stack[0]->[1]->{'cmdname'}
if ($results_stack[0]->[1]);
- my $new_eight_bit_txt = 'UNDEF';
- $new_eight_bit_txt = $new_eight_bit if (defined($new_eight_bit));
- print STDERR "" . Encode::encode('utf8', $char)
- . " ($command) new_codepoint: $new_codepoint 8bit: $new_eight_bit_txt old: $eight_bit\n";
+ print STDERR "" . Encode::encode('utf-8', $char) . " ($command) "
+ . "codepoint: $codepoint "
+ ."8bit: ". (defined($new_eight_bit) ? $new_eight_bit : 'UNDEF')
+ . " prev: $prev_eight_bit\n";
}
# no corresponding eight bit character found for a composed character
@@ -1472,7 +1473,7 @@ sub _format_eight_bit_accents_stack($$$$$;$)
# appending or prepending a character. For example this happens for
# @={@,{@~{n}}}, where @,{@~{n}} is expanded to a 2 character:
# n with a tilde, followed by a ,
- # In that case, the additional utf8 diacritic is appended, which
+ # In that case, the additional diacritic is appended, which
# means that it is composed with the , and leaves n with a tilde
# untouched.
# -> the diacritic is appended but the normal form doesn't lead
@@ -1480,11 +1481,11 @@ sub _format_eight_bit_accents_stack($$$$$;$)
# of the string is unchanged. This, for example, happens for
# @ubaraccent{a} since there is no composed accent with a and an
# underbar.
- last if ($new_eight_bit eq $eight_bit
+ last if ($new_eight_bit eq $prev_eight_bit
and !($results_stack[0]->[1]->{'cmdname'} eq 'dotless'
and $char eq 'i'));
$result = $results_stack[0]->[0];
- $eight_bit = $new_eight_bit;
+ $prev_eight_bit = $new_eight_bit;
shift @results_stack;
}
@@ -1545,7 +1546,8 @@ sub unicode_point_decoded_in_encoding($$) {
return 1 if ($encoding eq 'utf-8'
or ($unicode_to_eight_bit{$encoding}
- and $unicode_to_eight_bit{$encoding}->{$unicode_point}));
+ and ($unicode_to_eight_bit{$encoding}->{$unicode_point}
+ or hex($unicode_point) < 128)));
}
return 0;
}
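The new ascii-range fallback can be sketched in isolation. The table below is a hypothetical one-entry stand-in for %unicode_to_eight_bit, whose real tables only start at 00A0; anything below 0x80 is valid in every ASCII-superset 8-bit encoding, so it is accepted without a table entry.

```perl
use strict;

# Hypothetical excerpt standing in for %unicode_to_eight_bit{$encoding}.
my %eight_bit_table = ('00A0' => 'A0');

# Mirror of the updated test in unicode_point_decoded_in_encoding():
# accept a code point if the table maps it, or if it is plain ascii.
sub point_in_eight_bit_encoding {
  my ($unicode_point) = @_;
  return 1 if ($eight_bit_table{$unicode_point}
               or hex($unicode_point) < 128);
  return 0;
}

print point_in_eight_bit_encoding('0041'), "\n";   # ascii range
print point_in_eight_bit_encoding('00A0'), "\n";   # mapped by the table
print point_in_eight_bit_encoding('0100'), "\n";   # neither
```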
diff --git a/tp/Texinfo/ParserNonXS.pm b/tp/Texinfo/ParserNonXS.pm
index cbab8ae95a..8eb5fd1b91 100644
--- a/tp/Texinfo/ParserNonXS.pm
+++ b/tp/Texinfo/ParserNonXS.pm
@@ -678,7 +678,7 @@ sub _new_text_input($$)
my $texthandle = do { local *FH };
# In-memory scalar strings are considered a stream of bytes, so need
# to encode/decode.
- $text = Encode::encode("utf8", $text);
+ $text = Encode::encode('utf-8', $text);
# Could fail with error like
# Strings with code points over 0xFF may not be mapped into in-memory file handles
if (!open ($texthandle, '<', \$text)) {
@@ -2364,7 +2364,7 @@ sub _next_text($;$)
my $next_line = <$texthandle>;
if (defined($next_line)) {
# need to decode to characters
- $next_line = Encode::decode('utf8', $next_line);
+ $next_line = Encode::decode('utf-8', $next_line);
$input->{'input_source_info'}->{'line_nr'} += 1
unless ($input->{'input_source_info'}->{'macro'} ne ''
or defined($input->{'value_flag'}));
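The pattern used by _new_text_input and _next_text, encode to bytes before opening an in-memory handle, then decode each line after reading, can be sketched on its own; as in the patch, the strict 'utf-8' encoding is used on both sides.

```perl
use strict;
use Encode;

# In-memory file handles operate on bytes, so encode the text first.
my $text = Encode::encode('utf-8', "premi\x{E8}re ligne\n");
open(my $texthandle, '<', \$text)
  or die "cannot open in-memory handle: $!";

# Each line read back is a byte string and must be decoded to characters.
my $next_line = <$texthandle>;
$next_line = Encode::decode('utf-8', $next_line);
close($texthandle);
```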
diff --git a/tp/Texinfo/XS/parsetexi/end_line.c b/tp/Texinfo/XS/parsetexi/end_line.c
index 26f5cc5fb7..bd44533fe7 100644
--- a/tp/Texinfo/XS/parsetexi/end_line.c
+++ b/tp/Texinfo/XS/parsetexi/end_line.c
@@ -1397,6 +1397,7 @@ end_line_misc_line (ELEMENT *current)
*/
static struct encoding_map map[] = {
"utf-8", "utf-8",
+ "utf8", "utf-8",
"ascii", "us-ascii",
"shiftjis", "shift_jis",
"latin1", "iso-8859-1",
diff --git a/tp/t/08misc_commands.t b/tp/t/08misc_commands.t
index eb1a5e8d13..4bef76cfe4 100644
--- a/tp/t/08misc_commands.t
+++ b/tp/t/08misc_commands.t
@@ -228,6 +228,13 @@ my @converted_test_cases = (
@setfilename @ @verb{: name :}@
', {'full_document' => 1}],
+# this test seems somewhat pointless, but it is not, as in perl
+# utf8 may mean a lax handling of UTF-8. We want to avoid using
+# that lax handling of UTF-8; better to get errors early.
+['documentencoding_utf8',
+'@documentencoding utf8
+
+'],
['definfoenclose',
'
definfoenclose phoo,//,\\ @definfoenclose phoo,//,\\
@@ -597,6 +604,7 @@ in example
my %info_tests = (
'comment_space_command_on_line' => 1,
'setfilename' => 1,
+ 'documentencoding_utf8' => 1,
);
my %xml_tests = (
diff --git a/tp/t/results/misc_commands/documentencoding_utf8.pl b/tp/t/results/misc_commands/documentencoding_utf8.pl
new file mode 100644
index 0000000000..019f09de4a
--- /dev/null
+++ b/tp/t/results/misc_commands/documentencoding_utf8.pl
@@ -0,0 +1,166 @@
+use vars qw(%result_texis %result_texts %result_trees %result_errors
+ %result_indices %result_sectioning %result_nodes %result_menus
+ %result_floats %result_converted %result_converted_errors
+ %result_elements %result_directions_text %result_indices_sort_strings);
+
+use utf8;
+
+$result_trees{'documentencoding_utf8'} = {
+ 'contents' => [
+ {
+ 'contents' => [
+ {
+ 'args' => [
+ {
+ 'contents' => [
+ {
+ 'text' => 'utf8'
+ }
+ ],
+ 'info' => {
+ 'spaces_after_argument' => {
+ 'text' => '
+'
+ }
+ },
+ 'type' => 'line_arg'
+ }
+ ],
+ 'cmdname' => 'documentencoding',
+ 'extra' => {
+ 'input_encoding_name' => 'utf-8',
+ 'text_arg' => 'utf8'
+ },
+ 'info' => {
+ 'spaces_before_argument' => {
+ 'text' => ' '
+ }
+ },
+ 'source_info' => {
+ 'file_name' => '',
+ 'line_nr' => 1,
+ 'macro' => ''
+ }
+ },
+ {
+ 'text' => '
+',
+ 'type' => 'empty_line'
+ }
+ ],
+ 'type' => 'before_node_section'
+ }
+ ],
+ 'type' => 'document_root'
+};
+
+$result_texis{'documentencoding_utf8'} = '@documentencoding utf8
+
+';
+
+
+$result_texts{'documentencoding_utf8'} = '
+';
+
+$result_errors{'documentencoding_utf8'} = [
+ {
+ 'error_line' => 'warning: encoding `utf8\' is not a canonical texinfo encoding
+',
+ 'file_name' => '',
+ 'line_nr' => 1,
+ 'macro' => '',
+ 'text' => 'encoding `utf8\' is not a canonical texinfo encoding',
+ 'type' => 'warning'
+ }
+];
+
+
+$result_floats{'documentencoding_utf8'} = {};
+
+
+
+$result_converted{'plaintext'}->{'documentencoding_utf8'} = '';
+
+
+$result_converted{'html_text'}->{'documentencoding_utf8'} = '
+';
+
+
+$result_converted{'latex'}->{'documentencoding_utf8'} = '\\documentclass{book}
+\\usepackage{amsfonts}
+\\usepackage{amsmath}
+\\usepackage[gen]{eurosym}
+\\usepackage{textcomp}
+\\usepackage{graphicx}
+\\usepackage{etoolbox}
+\\usepackage{titleps}
+\\usepackage[utf8]{inputenc}
+\\usepackage[T1]{fontenc}
+\\usepackage{float}
+% use hidelinks to remove boxes around links to be similar to Texinfo TeX
+\\usepackage[hidelinks]{hyperref}
+
+\\makeatletter
+\\newcommand{\\Texinfosettitle}{No Title}%
+
+% redefine the \\mainmatter command such that it does not clear page
+% as if in double page
+\\renewcommand\\mainmatter{\\clearpage\\@mainmattertrue\\pagenumbering{arabic}}
+\\newenvironment{Texinfopreformatted}{%
+\par\GNUTobeylines\obeyspaces\frenchspacing\parskip=\z@\parindent=\z@}{}
+{\catcode`\^^M=13 \gdef\GNUTobeylines{\catcode`\^^M=13 \def^^M{\null\par}}}
+\\newenvironment{Texinfoindented}{\\begin{list}{}{}\\item\\relax}{\\end{list}}
+
+% used for substitutions in commands
+\\newcommand{\\Texinfoplaceholder}[1]{}
+
+\newpagestyle{single}{\sethead[\chaptername{} \thechapter{} \chaptertitle{}][][\thepage]
+ {\chaptername{} \thechapter{} \chaptertitle{}}{}{\thepage}}
+
+% allow line breaking at underscore
+\\let\\Texinfounderscore\\_
+\\renewcommand{\\_}{\\Texinfounderscore\\discretionary{}{}{}}
+\\renewcommand{\\includegraphics}[1]{\\fbox{FIG \\detokenize{#1}}}
+
+\\makeatother
+% set default for @setchapternewpage
+\\makeatletter
+\patchcmd{\chapter}{\if@openright\cleardoublepage\else\clearpage\fi}{\Texinfoplaceholder{setchapternewpage placeholder}\clearpage}{}{}
+\\makeatother
+\\pagestyle{single}%
+
+
+\\end{document}
+';
+
+
+$result_converted{'info'}->{'documentencoding_utf8'} = 'This is , produced from .
+
+
+
+Tag Table:
+
+End Tag Table
+
+
+Local Variables:
+coding: utf-8
+End:
+';
+
+$result_converted_errors{'info'}->{'documentencoding_utf8'} = [
+ {
+ 'error_line' => 'warning: document without nodes
+',
+ 'text' => 'document without nodes',
+ 'type' => 'warning'
+ }
+];
+
+
+
+$result_converted{'xml'}->{'documentencoding_utf8'} = '<documentencoding encoding="utf8" spaces=" ">utf8</documentencoding>
+
+';
+
+1;