Yasuhiro Matsumoto

Please stop using the "chcp 65001" hack

On GitHub, a lot of code uses chcp 65001. This is a hack to display UTF-8 in the Windows command prompt. It may well work on non-multi-byte locales, but it does not work correctly on multi-byte locales. For example, jq used this hack: https://github.com/stedolan/jq/pull/824
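For readers who have not met the pattern, here is a rough sketch of what the hack tends to look like inside a C program. The function name `print_utf8_by_switching_codepage` is hypothetical and the code is not taken from jq or curl; it only assumes the documented Win32 calls GetConsoleOutputCP and SetConsoleOutputCP, which cover the output side of what chcp does. On non-Windows builds it simply writes the bytes.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper showing the pattern this post argues AGAINST:
 * switch the console output code page to UTF-8, print, switch back. */
int print_utf8_by_switching_codepage(const char *s)
{
    assert(s != NULL);
#ifdef _WIN32
    /* Remember the current code page, e.g. 932 on a Japanese system.
     * Returns 0 if there is no console attached. */
    UINT saved = GetConsoleOutputCP();
    /* Roughly the output half of "chcp 65001". */
    SetConsoleOutputCP(CP_UTF8);
#endif
    int rc = (fputs(s, stdout) == EOF) ? -1 : 0;
    fflush(stdout);
#ifdef _WIN32
    /* The code page number is put back, but the console font that was
     * swapped along with it is not necessarily the original one. */
    if (saved != 0)
        SetConsoleOutputCP(saved);
#endif
    return rc;
}
```

The restore step is the part that looks safe but is not: it restores the code page number, not the state of the console.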

This hack breaks the Windows command prompt for users of multi-byte locales.

You can see the font is changed and not restored.

On East Asian locales, the glyph shown for the backslash is not the same as on non-East Asian locales. On non-East Asian locales, the backslash is displayed as below.

But on East Asian locales, the backslash is displayed as follows.

FYI, most Windows users normally use a non-UTF-8 code page. For example, Japanese users use code page 932.

I wrote a patch to fix this: https://github.com/stedolan/jq/issues/1121

Furthermore, curl used this hack: https://github.com/curl/curl/issues/3008

This also breaks cmd.exe.

I wrote a similar patch for this as well: https://github.com/curl/curl/pull/3212

As I wrote in the description of this PR, changing the code page of the Windows console may also change the font to one that suits the new code page. Restoring the code page likewise sets a font that suits the restored code page, so the original font is not restored correctly. To write UTF-8 without changing the console code page, you should use the wide-string APIs.
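A minimal sketch of the wide-string approach, assuming a C program. The helper name `write_utf8` is hypothetical and this is not the actual patch from either PR. It converts the UTF-8 string to UTF-16 with MultiByteToWideChar and passes it to WriteConsoleW when stdout is a real console. When output is redirected, or on a non-Windows build, it writes the bytes unchanged. The console code page is never touched.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: write a NUL-terminated UTF-8 string to stdout
 * without changing the console code page.
 * Returns 0 on success, -1 on failure. */
int write_utf8(const char *s)
{
    assert(s != NULL);
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    /* GetConsoleMode succeeds only when the handle is a real console,
     * not a file or a pipe. */
    if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        /* First call: ask for the required length in UTF-16 code units,
         * including the terminating NUL (because the input length is -1). */
        int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
        if (n <= 0)
            return -1;
        wchar_t *w = malloc((size_t)n * sizeof(wchar_t));
        if (w == NULL)
            return -1;
        MultiByteToWideChar(CP_UTF8, 0, s, -1, w, n);

        /* Keep ordering with anything already buffered by stdio. */
        fflush(stdout);

        DWORD written;
        BOOL ok = WriteConsoleW(h, w, (DWORD)(n - 1), &written, NULL);
        free(w);
        return ok ? 0 : -1;
    }
#endif
    /* Redirected output or non-Windows build: pass the bytes through. */
    return (fputs(s, stdout) == EOF) ? -1 : 0;
}
```

Called as `write_utf8("こんにちは\n");`, this leaves both the code page and the font of the console as the user had them.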

Please stop using the chcp 65001 hack.

P.S.

If you want to run a batch file written in UTF-8, run it as below to avoid breaking the command prompt:

start /wait /min foo.bat

Top comments (4)

Rich Turner

Hi. Thanks for reaching out on Twitter.

I think what you're seeing here is similar to my slightly naïve approach to supporting UTF-8 in a PR I submitted to the Curl tool to enable VT processing on Windows: github.com/curl/curl/pull/3011/fil...

Specifically, in line #265 of src/tool_main.c, you'll see I force the Console to the UTF-8 codepage.

This was far from the best way to do this, as can be seen in the commit mattn submitted: github.com/curl/curl/commit/5bfaa8...

Mattn's PR above fixes the behavior clearly articulated in issue 3211 here: github.com/curl/curl/issues/3211 which looks very similar to the issue you're seeing.

HTH.

Xu Tan

Thanks for the explanation. Just a small correction on the issue itself.

The reason the backslash appears as a yen sign is that Japan decided to override the backslash code point in ASCII with the yen sign. Similar things happened for Korean, German, Danish, French and Spanish in their variants of ISO 646: en.wikipedia.org/wiki/Backslash#Co....

Therefore it has nothing to do with East Asian locales. The Simplified and Traditional Chinese regions both display the backslash as the original backslash.

However, this does reveal how messed up Windows localization is because of backward compatibility.

mintty

The purpose of the chcp utility is that you can use it if you need it, provided you know what you're doing. It is not a hack.

Frank Dana

It's there so the end user can use it if they need it. I would argue that, if software is swapping the code page out from underneath the user's feet, especially if it isn't properly restored afterwards (and from what Yasuhiro says, it basically CAN'T be properly restored in at least some cases), then using it that way is most definitely a hack.