New implementation of pseudo console support (experimental)

Thomas Wolff towo@towo.net
Mon Aug 31 21:07:03 GMT 2020


On 31.08.2020 at 21:17, Johannes Schindelin wrote:
> Hi Thomas,
>
> On Mon, 31 Aug 2020, Thomas Wolff wrote:
>
>> On 31.08.2020 at 18:12, Thomas Wolff wrote:
>>> On 31.08.2020 at 17:56, Johannes Schindelin wrote:
>>>
>>>> On Mon, 31 Aug 2020, Takashi Yano wrote:
>>>>
>>>>> On Mon, 31 Aug 2020 16:22:20 +0200 (CEST)
>>>>> Johannes Schindelin wrote:
>>>>>> On Mon, 31 Aug 2020, Takashi Yano wrote:
>>>>>>
>>>>>>> On Mon, 31 Aug 2020 14:49:04 +0200 (CEST)
>>>>>>> Johannes Schindelin wrote:
>>>>>>>
>>>>>>>> Sorry to latch onto this thread with something slightly
>>>>>>>> different, but we do see pretty serious encoding problems
>>>>>>>> (both with and without `CYGWIN=disable_pcon`) in the Git for
>>>>>>>> Windows and the MSYS2 projects. For example, in
>>>>>>>> https://github.com/msys2/MSYS2-packages/issues/1974 the
>>>>>>>> following issue was reported. If you compile a _MINGW_
>>>>>>>> program from this source code:
>>>>>>>>
>>>>>>>> -- snip --
>>>>>>>> #include <stdio.h>
>>>>>>>>
>>>>>>>> int main(){
>>>>>>>>     puts("Привет мир! Hello world!");
>>>>>>>>     return 0;
>>>>>>>> }
>>>>>>>> -- snap --
>>>>>>>>
>>>>>>>> and then execute it, you will see this output:
>>>>>>>>
>>>>>>>> -- snip --
>>>>>>>> Привет мир! Hello world!
>>>>>>>> -- snap --
>>>>>>> I guess your program (binary exe) does not work as you expect
>>>>>>> in command prompt as well. If you want to use UTF-8 coding in
>>>>>>> output, you should add SetConsoleOutputCP(CP_UTF8) call before
>>>>>>> puts().
>>>>>> That may be, but I would like to point out that the very same
>>>>>> executable worked quite well in a MinTTY using v3.0.7...
>>> Assuming the test program source file is encoded in UTF-8 when
>>> compiling with x86_64-w64-mingw32-gcc, the string would be output byte
>>> by byte, which happened to be interpreted in UTF-8 when run in a
>>> terminal on cygwin 3.0.7, although the program was not set up to use
>>> UTF-8. The "correct" output was actually buggy behaviour, so current
>>> cygwin has "fixed" it, to your disadvantage in this case. With ConPTY
>>> support, matching encoding on Windows and terminal side needs to be
>>> taken care of.
>> My wording was misleading. Maybe it's proper to say it this way:
>> Matching encoding on each side between application and respective system
>> is needed, as ConPTY transforms encoding properly on system level.
> Well, I just wonder how your wording (misleading or not) relates to the
> issue at hand: there are programs out there that simply do not take care
> of calling `SetConsoleOutputCP()`.
Those would use the pre-set system codepage. Unlike POSIX functions, 
which need an initial dummy call to setlocale to work, the Windows 
API always has a codepage set, typically 850 in European Windows 
installations.
>
> What you are telling me is that those programs are wrong, which I can kind
> of get behind.
No, but ConsoleOutput functions would involve the current codepage, 
which is usually *not* 65001 (the UTF-8 codepage). So if those programs 
output UTF-8 strings, they would actually be byte strings in the 
respective codepage (e.g. 850) by definition of the Windows API. 
Ignoring that in previous cygwin versions and just sending the bytes to 
a UTF-8 terminal would have given you the expected result, but it's 
unfortunately not really correct.
> However, what I do not understand is what you argue should happen with the
> output of such programs (if you address that concern at all, which I am
> not really sure of).
I'm afraid I think the proper way is to show the respective CP850 (or 
whichever) interpretation that you saw;
I'm puzzled though that the output is changed by piping through cat.
Note that you can set previous/expected behaviour consistently with chcp 
as follows:
 > chcp.com
Active code page: 850.
 > ./conming
ðƒÐÇð©ð▓ðÁÐé ð╝ð©ÐÇ! Hello world!
 > ./conming | cat
?????? ???! Hello world!
 > chcp.com 65001
Active code page: 65001.
 > ./conming
Привет мир! Hello world!
 > ./conming | cat
Привет мир! Hello world!
 >
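
The garbled CP850 line in the transcript above can be traced back to the raw bytes. A minimal sketch, assuming the string is stored as UTF-8; the helper name `hex_dump` is hypothetical and only serves the illustration:

```c
#include <stdio.h>
#include <stddef.h>

/* Write the bytes of s into out as space-separated uppercase hex.
   out_size must be at least 1; output is truncated if it does not fit. */
static void hex_dump(const char *s, char *out, size_t out_size)
{
    size_t pos = 0;
    out[0] = '\0';
    /* Each byte needs at most 3 characters plus the terminating NUL. */
    for (size_t i = 0; s[i] != '\0' && pos + 4 <= out_size; i++)
        pos += (size_t) snprintf(out + pos, out_size - pos, "%s%02X",
                                 i ? " " : "",
                                 (unsigned) (unsigned char) s[i]);
}
```

Called on the first two letters written as explicit UTF-8 bytes, `hex_dump("\xD0\x9F\xD1\x80", buf, sizeof buf)` yields `D0 9F D1 80`. Read as CP850, those four bytes are the characters "ðƒÐÇ", which is how the first line of output starts under codepage 850.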

>
> Previously, we assumed the output to be in UTF-8 (although I frankly have
> no idea how that worked).
Just by chance, as I described above.
> Starting with v3.1.0 (or at least v3.1.4, I have
> not _really_ verified with earlier versions), the output is assumed to use
> code page 437.
Or whatever the system / you have set.
> With seemingly everybody and their sister switching to UTF-8, I wonder
> whether that even makes sense.
When using the Windows API, the modern way would be to use UTF-16, i.e. 
all functions ending with "W", like WriteConsoleW. If you prefer 8-bit 
functions and want to support Unicode, set the codepage to 65001.
You may still use 8-bit codepages if desired, like CP1252 for Windows 
European ANSI.
A "DOS mode" program using 8 bit output and not setting a codepage is 
really doing something undefined and cannot expect specific output 
beyond ASCII.
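
A minimal sketch of the 8-bit variant, assuming the source file is saved as UTF-8 and the program is built as a native Windows (e.g. MinGW) executable. The function names `use_utf8_console` and `print_greeting` are hypothetical; the `_WIN32` guard is only there so the same file also compiles elsewhere:

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Switch the console output codepage to 65001 (UTF-8) on Windows.
   On other systems this is a no-op. Returns 1 on success, 0 on failure. */
static int use_utf8_console(void)
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return 1;
#endif
}

/* Print the test string from this thread after selecting the codepage.
   Returns the result of puts() (non-negative on success). */
static int print_greeting(void)
{
    if (!use_utf8_console())
        fputs("warning: could not set UTF-8 codepage\n", stderr);
    return puts("Привет мир! Hello world!");
}
```

Calling `print_greeting()` from `main` gives the test program an explicit codepage instead of relying on whatever the system or the terminal happens to have set.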
>
> So I had a look at the code, and it seems that
> `fhandler_pty_slave::setup_locale()` forces the output encoding to
> C.ASCII if Pseudo Console support is enabled:
>
>    char locale[ENCODING_LEN + 1] = "C";
>    char charset[ENCODING_LEN + 1] = "ASCII";
>    LCID lcid = get_langinfo (locale, charset);
>
>    /* Set console code page from locale */
>    if (get_pseudo_console ())
>      {
>        UINT code_page;
>        if (lcid == 0 || lcid == (LCID) -1)
>          code_page = 20127; /* ASCII */
>        else if (!GetLocaleInfo (lcid,
>                                 LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
>                                 (char *) &code_page, sizeof (code_page)))
>          code_page = 20127; /* ASCII */
>        SetConsoleCP (code_page);
>        SetConsoleOutputCP (code_page);
>      }
>
> Please note that this essentially forces the console output code page to
> ASCII (in my case, the fall-back to 20127 seems not to kick in, but 437 is
> used instead, as LCID 0x0409 is used).
As seen, the output "ðƒÐÇð©ð▓ðÁÐé ð╝ð©ÐÇ" is not confined to ASCII. I 
doubt the branch with 20127 is taken in the test case, as lcid is likely 
to be something other than 0 or -1.
> However, there is no overriding call to `SetConsoleOutputCP()` later in
> that method, not even when the `charset` is correctly identified as
> `UTF-8` (because my `LANG=en_US.UTF-8`).
I don't know how the ConPTY support code works, but I'd say 
SetConsoleOutputCP is rather to be called on the client side of the pty, 
in the Windows program, if it wants. It might have been an alternative 
way to support Windows codepages from cygwin, before the age of ConPTY, 
as I had once considered.
> Now, what I _really_ do not understand is why Cygwin insists on using the
> console output code page when running in `CYGWIN=disable_pcon` mode...
Because it is proper to interpret output in the way it would be intended 
by the original program if that was correct.
Writing Windows programs so that they could nicely output to UTF-8 
terminals was a neat trick but unfortunately not correct.
>
> Otherwise, this patch would be enough to fix it for me:
>
> -- snip --
> diff --git a/winsup/cygwin/fhandler_tty.cc b/winsup/cygwin/fhandler_tty.cc
> index 43eebc174..2ce8dae9a 100644
> --- a/winsup/cygwin/fhandler_tty.cc
> +++ b/winsup/cygwin/fhandler_tty.cc
> @@ -2867,11 +2867,13 @@ fhandler_pty_slave::setup_locale (void)
>     char charset[ENCODING_LEN + 1] = "ASCII";
>     LCID lcid = get_langinfo (locale, charset);
>
> -  /* Set console code page form locale */
> +  /* Set console code page from locale */
>     if (get_pseudo_console ())
>       {
>         UINT code_page;
> -      if (lcid == 0 || lcid == (LCID) -1)
> +      if (!strcasecmp (charset, "utf-8"))
> +	code_page = CP_UTF8;
> +      else if (lcid == 0 || lcid == (LCID) -1)
>   	code_page = 20127; /* ASCII */
>         else if (!GetLocaleInfo (lcid,
>   			       LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
> -- snap --
>
> But that does _not_ reinstate the previous behavior when Pseudo Console
> support is disabled.
>
> Now, I would call that a regression (the entire idea of `disable_pcon` was
> to fall back to the previous behavior, no?). And I do not really
> understand where it comes from, that regression.
I wouldn't quite call it a regression as it disables buggy behaviour 
which was used as a workaround for a buggy system environment. But 
arguably you could expect such an option to fall back to previous buggy 
behaviour.
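
For reference, the codepage choice discussed above, including the UTF-8 check from the proposed patch, can be modelled outside Cygwin as a plain function. This is only a sketch under stated assumptions: `pick_code_page` and its parameters are hypothetical, `lcid_valid` stands in for the `lcid == 0 || lcid == (LCID) -1` test, and `locale_code_page == 0` stands in for a failed `GetLocaleInfo` call. It relies on POSIX `strcasecmp`, as the patch does.

```c
#include <strings.h>  /* strcasecmp (POSIX) */

enum {
    CODE_PAGE_ASCII = 20127,  /* fallback used in the quoted code */
    CODE_PAGE_UTF8  = 65001   /* value of CP_UTF8 in the Windows API */
};

/* Model of the selection order with the proposed patch applied:
   1. charset "utf-8" (any case) selects the UTF-8 codepage,
   2. an invalid LCID falls back to ASCII,
   3. a failed locale lookup (modelled as 0) falls back to ASCII,
   4. otherwise the locale's default codepage is used. */
static unsigned pick_code_page(const char *charset, int lcid_valid,
                               unsigned locale_code_page)
{
    if (strcasecmp(charset, "utf-8") == 0)
        return CODE_PAGE_UTF8;
    if (!lcid_valid)
        return CODE_PAGE_ASCII;
    if (locale_code_page == 0)
        return CODE_PAGE_ASCII;
    return locale_code_page;
}
```

With `LANG=en_US.UTF-8` this model returns 65001 regardless of the LCID; without the first check, a valid LCID with default codepage 437 returns 437, which matches the behaviour reported in the quoted message.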

Thomas

>   Where does the code path
> differ from the previous one when Pseudo Console support is disabled, and
> how does that relate to the current console output code page?
>
> Ciao,
> Johannes
>
>>> Thomas
>>>
>>>>> at the expense of garbled output for apps which use native
>>>>> code page of the system in the correct manner.
>>>> Are you referring to apps that call the SetConsoleOutputCP() function? If
>>>> so, I am asking myself what would be broken. Because apps that do _not_
>>>> call that function (expecting UTF-8 to be active) would be fixed, while
>>>> apps that _do_ call that function would not care if the Cygwin runtime
>>>> changed it.
>>>>
>>>> Ciao,
>>>> Johannes
>>


