
knut


Java and Console Character Encodings

So I got nerd sniped by my buddy Snoopy the other daaaaay...

He's studying CS in Europe and is writing a program for an assignment where he has to input some characters from the command line on Windows and process them. The relevant part of the program is pretty simple. It's like this:

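import java.util.Scanner;

public class PrintBytes {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String input = sc.next();
        // Print each char (UTF-16 code unit) of the token as hex, one per line.
        for (int i = 0; i < input.length(); i++) {
            System.out.printf("%02x%n", (int) input.charAt(i));
        }
    }
}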

So he runs it and enters a non-ASCII character: š (that's U+0161). The output he gave me is this:

>java PrintBytes
š
00

Now that's weird. I am pretty sure this is not a null character. I expected to see either the Unicode code point (0161) or its UTF-8 bytes (c5 a1). This was about the time I felt the uncontrollable urge to get involved.
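For reference, here is a small sketch of the two representations one might expect for š, computed in Java itself rather than read from the console. The class and method names are made up for illustration; only the standard library is used.

```java
import java.nio.charset.StandardCharsets;

public class ExpectedEncodings {
    // Hex of each Unicode code point in s, four digits per code point.
    static String codePointHex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("%04x", cp)));
        return sb.toString();
    }

    // Hex of the UTF-8 bytes of s, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0161, written as an escape so the source file encoding doesn't matter.
        String s = "\u0161";
        System.out.println(codePointHex(s)); // prints 0161
        System.out.println(utf8Hex(s));      // prints c5 a1
    }
}
```

Either of those would have been a believable answer. A lone 00 is neither.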

Default Codepage Issues

I downloaded the JDK and tried it on my machine.

>java PrintBytes
š
73

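Well, that's weird too: 73 is a plain ASCII s, so the š lost its caron somewhere along the way. Oh, my console code page is set to a legacy Windows one, whereas his was set to UTF-8. I used chcp to change mine to 65001, which is UTF-8, and got the same odd zero result he did.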

Redirected input from a file

Next test: what if I read the same input from a file instead?

>java PrintBytes < input.txt
c5
a1

Hey, that's correct. That's the UTF-8 representation of it. So something is weird with how Java is reading from an interactive command line compared to file input, even when both come through stdin.
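One thing worth keeping in mind: the Java snippet prints chars, which are what Scanner hands back after decoding the incoming bytes with some charset. Seeing c5 and a1 on separate lines suggests each byte was decoded as its own character. To take that decoding step out of the picture, a byte-level version might look like the sketch below. It's only an illustration (the class name RawStdinBytes and the helper hexBytes are made up), not part of Snoopy's assignment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class RawStdinBytes {
    // Read the stream to EOF and return each byte as two hex digits,
    // skipping CR and LF so only the typed characters remain.
    static List<String> hexBytes(InputStream in) {
        List<String> out = new ArrayList<>();
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\r' || b == '\n') continue;
                out.add(String.format("%02x", b));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // No Scanner and no charset: just the bytes as the JVM receives them.
        for (String hex : hexBytes(System.in)) {
            System.out.println(hex);
        }
    }
}
```

Fed a UTF-8 input.txt through redirection, this should print c5 and a1. Run interactively, it would show whatever bytes the console read actually delivers to the JVM, which makes it a handy way to tell a decoding problem apart from a reading problem.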

How does Rust do it?

Next test, let's see how it does in Rust.

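use std::io::Read;

fn main() {
    // Walk stdin one byte at a time.
    for b in std::io::stdin().bytes() {
        let val = b.unwrap();
        match val {
            0xd => println!(), // CR: end the current line of output
            0xa => (),         // LF: ignored, the CR already ended the line
            // Width 4 covers the "0x" prefix plus two hex digits.
            _ => println!("{:#04x}", val),
        }
    }
}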

The output is good:

>target\debug\printbytes.exe
š
0xc5
0xa1

So Rust is doing it right interactively. Rust's standard library actually checks whether stdin is currently a console handle and, if so, calls ReadConsoleW; otherwise it calls ReadFile, which handles regular file I/O just fine.

Snoopy also tried writing the equivalent program in Python, and it also did it right. So Java seems to be doing something wrong under certain conditions... but what's the reason?

Finding the answer

A good starting point might be to check the Rust source. My first guess was that somewhere I'd see a call to ReadFile on the stdin handle, but instead the lowest-level Windows call it makes for a console turned out to be a function I wasn't familiar with, ReadConsoleW.

Reading the docs, it references something about ANSI compatibility:

ReadConsole reads keyboard input from a console's input buffer. It behaves like the ReadFile function, except that it can read in either Unicode (wide-character) or ANSI mode.

I found another link that gives a good comparison between ReadFile and ReadConsole. It confirms that ReadConsoleA (the ANSI version) only reads characters in the console's ANSI code page, but ReadConsoleW can read Unicode characters. Rust is reading wide characters (UTF-16 code units, as far as I can tell), then translating them internally into UTF-8, since its string type is natively UTF-8.
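That translation step is easy to sketch in Java terms. Assuming the console hands back UTF-16 code units laid out low byte first (the usual layout of a wide-character buffer on Windows), turning them into UTF-8 is a decode followed by an encode. The class and method names here are hypothetical, and this is only a model of the idea, not Rust's actual implementation.

```java
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    // Decode little-endian UTF-16 bytes and re-encode the text as UTF-8.
    static byte[] transcode(byte[] utf16le) {
        String text = new String(utf16le, StandardCharsets.UTF_16LE);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The single code unit 0x0161, low byte first.
        byte[] fromConsole = {0x61, 0x01};
        for (byte b : transcode(fromConsole)) {
            System.out.println(String.format("%02x", b & 0xff));
        }
    }
}
```

The one 16-bit unit becomes the two bytes c5 and a1, matching what the Rust program printed.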

Confirming with C++

The easiest way to confirm was to write a little C++ program, going straight to the source. Depending on its arguments, it tries either ReadFile or ReadConsoleW.

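uint16_t c = 0;  // zero-initialized so the unused high byte doesn't hold garbage
DWORD numRead = 0;
HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
if (argc == 1) {
    // Byte-oriented read: fetch a single byte from stdin.
    ReadFile(in, &c, 1, &numRead, nullptr);
} else {
    // Wide-character read: fetch a single UTF-16 code unit from the console.
    ReadConsoleW(in, &c, 1, &numRead, nullptr);
}

printf("%04x\n", c);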

First here's ReadFile mode:

>printbytes_c.exe
š
0000

And then ReadConsoleW mode:

>printbytes_c.exe -c
š
0161

0161 is the single UTF-16 code unit for U+0161, so ReadConsoleW is clearly giving us real Unicode. Interesting to note that ReadConsoleA showed the same behavior as ReadFile.

Conclusion

The behavior is a little unfortunate in Windows, but it seems to be fairly well documented. Most languages seem to be doing a proper job of handling this, but Java isn't. We can even see it in the debugger. I don't have proper symbols, but at least the top of the stack seems to resolve pretty clearly.

0:004> k
 # Child-SP          RetAddr               Call Site
00 00000016`03ffce28 00007fff`7157c7f4     KERNEL32!ReadFile
01 00000016`03ffce30 00007fff`7157bd76     java!handleRead+0x20
02 00000016`03ffce70 00007fff`71572641     java!JNI_OnLoad+0x196
03 00000016`03ffef00 00000171`9146a02e     java!Java_java_io_FileInputStream_readBytes+0x1d

So Java... do better. Have a way to properly handle Unicode interactive console input. Maybe it does...? A Java expert would probably know, but I can't find it on the Internet with any obvious searches. But also this problem is Windows-specific, so Windows... why you gotta be this way? In conclusion, computers are bad.

Top comments (2)

wolf 🐺

i was having that same problem, i'm glad you figured it out!

snoopyt7

hey that was an interesting read, thanks for posting