Re: [lmi] UTF-16 output from automated GUI test?


From: Greg Chicares
Subject: Re: [lmi] UTF-16 output from automated GUI test?
Date: Thu, 20 Oct 2016 19:06:00 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.3.0

On 2016-10-20 11:23, Vadim Zeitlin wrote:
> On Thu, 20 Oct 2016 01:14:29 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> commit b47c9d49177f6a57863184929d63e69fba735bb7
> GC> Author: Gregory W. Chicares <address@hidden>
> GC> Date:   Mon Aug 22 21:30:56 2016 +0000
> GC> 
> GC>     Force standard output streams to binary mode
> GC>     
> GC>     See:
> GC>       http://lists.nongnu.org/archive/html/lmi/2016-08/msg00015.html
> GC> 
> GC> That change adds this code to a couple of main() functions:
> GC> 
> GC> +#if defined LMI_MSW
> GC> +    // Force standard output streams to binary mode.
> GC> +    setmode(fileno(stdout), O_BINARY);
> GC> +    setmode(fileno(stderr), O_BINARY);
> GC> +#endif // defined LMI_MSW
> GC> 
> GC> ...which contains no _O_WTEXT or _O_U16TEXT.

BTW, I tried this change:

-        setmode(fileno(stdout), O_BINARY);
-        setmode(fileno(stderr), O_BINARY);
+        setmode(fileno(stdout), O_BINARY | _O_U8TEXT);
+        setmode(fileno(stderr), O_BINARY | _O_U8TEXT);

and found it to be pessimal: much (though not all) of the output is UTF-16,
and line delimiters are uniformly CR-LF. It seems surprising that I ask for
binary and UTF-8 but get neither, yet I guess that's doomed to fail anyway:

http://mingw.5.n7.nabble.com/O-U8TEXT-and-mingw-td20737.html
|
| this _O_U8TEXT thing
| apparently is only implemented in the msvcrt.dll version on Vista.
|
| Anyway, as far as I understand from the MSDN documentation, and as far
| as light testing indicates, _O_U8TEXT affects only reading and
| recognition of a BOM, it does not affect other I/O to the file.
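
As far as the MSDN description goes, the second argument of _setmode() is
a single translation mode (_O_TEXT, _O_BINARY, _O_WTEXT, _O_U16TEXT, or
_O_U8TEXT) rather than a set of flags to be OR'd together, and the call
returns -1 on failure. A minimal sketch that requests only binary mode and
checks the result follows. The helper name force_binary_mode() is
hypothetical, and _WIN32 stands in for lmi's own LMI_MSW macro so that the
fragment compiles elsewhere as a no-op:

```cpp
#include <cstdio>

#if defined _WIN32
#   include <fcntl.h> // _O_BINARY
#   include <io.h>    // _setmode()
#endif

// Hypothetical helper, not part of lmi: request untranslated (binary)
// mode for 'stream'. Returns false if the runtime reports failure.
// On other platforms there is no text/binary distinction to change,
// so the function does nothing and returns true.
bool force_binary_mode(std::FILE* stream)
{
#if defined _WIN32
    std::fflush(stream); // Avoid changing mode with output still buffered.
    return -1 != _setmode(_fileno(stream), _O_BINARY);
#else
    (void)stream;
    return true;
#endif
}
```

Calling it once for stdout and once for stderr at the top of main() would
mirror the committed change, with the added check on the return value.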

>  Nevertheless, O_BINARY does result in UTF-16 output as I've just tested in
> my simple example.

Your example is basically:
    _setmode(_fileno(stdout), _O_WTEXT);
    return fputws(L"Hello, world!\n", stdout);
which deliberately uses "wide" strings....

> Thinking about it, it's not really that surprising: the
> string we pass to fputws() is a Unicode (wchar_t, which is UTF-16 under
> MSW) string and O_BINARY apparently disables all conversions, including
> those to the current code page (O_TEXT) or UTF-8 (O_U8TEXT), and not just
> the end-of-line marker replacements.
> 
>  So, as it stands, we can either have ASCII (or UTF-8) output with CR LF in
> it or UTF-16 output with only LF. And unfortunately I don't see any simple
> way to make it work as you'd like, i.e. output ASCII without CRs. Of
> course, it could be done by explicitly converting the strings in the code,
> but this is not very nice and error-prone. The only global solution I see
> would be to build in UTF-8 mode which would do the conversions in wx
> itself, but this is a big change with a lot of ramifications and I just
> don't see you agreeing to or even considering it in the near future.
> 
>  Hence the only resolution I can see is to revert the commit above and live
> with CRs in the output under MSW (which are anyhow "natural" there, as I
> tried to argue before), as it's almost certainly preferable to having to
> deal with UTF-16 instead of simple ASCII.
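
For reference, the explicit conversion mentioned in the quoted text could
look roughly like the sketch below. It is only an illustration: the
function name utf16_to_utf8() is hypothetical, it takes std::u16string
instead of a wchar_t string so that it behaves the same on every platform,
and replacing unpaired surrogates with U+FFFD is an assumed policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of lmi or wx: encode UTF-16 as UTF-8.
// Unpaired surrogates are replaced with U+FFFD (an assumed policy).
std::string utf16_to_utf8(std::u16string const& in)
{
    std::string out;
    for(std::size_t i = 0; i < in.size(); ++i)
    {
        std::uint32_t cp = in[i];
        if(0xD800 <= cp && cp <= 0xDBFF) // High surrogate.
        {
            if(i + 1 < in.size() && 0xDC00 <= in[i + 1] && in[i + 1] <= 0xDFFF)
            {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            else
            {
                cp = 0xFFFD;
            }
        }
        else if(0xDC00 <= cp && cp <= 0xDFFF) // Stray low surrogate.
        {
            cp = 0xFFFD;
        }

        if(cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if(cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if(cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting narrow string could then be written with fputs() to a
binary-mode stream, which is the "not very nice" per-call-site work the
quoted text warns about.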

Reverting that change restores this breakage:

  https://lists.gnu.org/archive/html/lmi/2016-08/msg00015.html
| Because of that, 'test_coding_rules_test.sh' failed with 'wine'.

which that change had successfully fixed. This differs from your example
above in that 'test_coding_rules.exe' apparently emits ASCII only.

AFAICS, lmi never deliberately emits any wide string to stdout or stderr,
but I'm guessing that 'wx_test.exe' implicitly emits wide strings simply
because it uses wx. If that's right, then I guess we have these options:

1) Revert that change, and deal with the consequences some other way:
   i.e., decide to live with CR-LF terminators. But I anathematized CR
   in the last millennium, and don't want to go back to a world where
   e.g. fseek() has undefined behavior and pasting from a console
   into email introduces spurious blank lines.

2) Don't revert. Whenever we would do:
     ./wx_test.exe --data_path=/opt/lmi/data >eraseme 2>&1
   instead do this:
     ./wx_test.exe --data_path=/opt/lmi/data 2>&1 |tr --delete '\r' >eraseme

3) Don't revert; build wx with 'enable-utf8'. Perhaps I'll try that,
   because the pitfalls described here:
     http://docs.wxwidgets.org/3.1/overview_unicode.html
   don't sound very important for lmi.
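
If the filtering in option (2) ever had to be done inside a program
instead of with 'tr', a sketch under that assumption might be as follows.
The name strip_cr() is hypothetical. Like tr --delete '\r', it removes
every CR, not only those that precede LF:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper, not part of lmi: remove every '\r' from a string,
// as "tr --delete '\r'" does for a stream.
std::string strip_cr(std::string s)
{
    s.erase(std::remove(s.begin(), s.end(), '\r'), s.end());
    return s;
}
```

On MSW, a program applying this to its own output would still have to
write to a binary-mode stream, or text-mode translation would turn each
LF back into CR-LF.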

[BTW, that 'docs.wxwidgets.org' page has en-dash in three places where
two dashes "--" really are intended, with 'configure' options. That's
probably a doxygen artifact; AIUI, you can escape it as "\--" or "`--`"
to avoid that markdown.]

If we decide to choose between (1) and (2), then we're filtering '\r'
out of the output of either 'test_coding_rules' or 'wx_test'. That's
six of one vs. half a dozen of the other, and I'll prefer (2) so that
I don't have to de-anathematize CR. I think (3) sounds best, and lmi
generally prefers std::string to wxString, so I wouldn't expect much
trouble...except that you said:

> but this is a big change with a lot of ramifications and I just
> don't see you agreeing to or even considering it in the near future.

What am I missing? I don't think we ever pass enum values to wxPrintf().
We use FromUTF8() only in 'group_quote_pdf_gen_wx.cpp', where in one
place escape_for_html_elem() does switch on wxString elements, but that
should be easy to rewrite.
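
One possible shape for such a rewrite is sketched below, purely as an
illustration: it is not the actual escape_for_html_elem(), the name
escape_html_utf8() is hypothetical, and the set of escaped characters is
an assumption. It switches on plain char elements of a UTF-8 std::string,
which is safe because every byte of a multibyte UTF-8 sequence is 0x80 or
above and so can never be mistaken for '<', '>', or '&':

```cpp
#include <string>

// Hypothetical sketch, not lmi's escape_for_html_elem(): escape HTML
// metacharacters in a UTF-8 encoded std::string. Bytes belonging to
// multibyte sequences fall through to the default case unchanged.
std::string escape_html_utf8(std::string const& s)
{
    std::string out;
    out.reserve(s.size());
    for(char const c : s)
    {
        switch(c)
        {
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            case '&': out += "&amp;"; break;
            default:  out += c;
        }
    }
    return out;
}
```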


