DEV Community

Discussion on: Why No Modern Programming Language Should Have a 'Character' Data Type

Andrew (he/him)

I think 4-byte len UTF-8 is possible (not essentially max to 3 bytes)

It is. UTF-8 encodes each code point in 1 to 4 bytes.
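
To make the 1-to-4-byte range concrete, here is a minimal Python sketch. The sample code points and the helper name `utf8_length` are my own picks for illustration, not anything from the article.

```python
# Sample code points chosen (for illustration) to hit each UTF-8 length.
samples = [
    "A",           # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001F600",  # U+1F600 GRINNING FACE
]

def utf8_length(code_point: str) -> int:
    """Return how many bytes UTF-8 uses to encode a single code point."""
    return len(code_point.encode("utf-8"))

for s in samples:
    print(f"U+{ord(s):04X} -> {utf8_length(s)} byte(s)")
```

The four samples come out at 1, 2, 3, and 4 bytes respectively.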

My point is that the terminology around what a "character" is has gotten so confusing that we should just stick to well-defined terms like "code point" and "grapheme". "Character" is sometimes confused with one or other of those (or something else entirely) and so I don't think it's a good name for a data type.
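
A small Python sketch of why "character" is ambiguous, assuming a string I made up for the purpose: one visible grapheme that is two code points and three UTF-8 bytes. Python's standard library has no grapheme segmentation, so the grapheme count below is only noted in a comment.

```python
# "e" followed by U+0301 COMBINING ACUTE ACCENT: displays as a single grapheme.
text = "e\u0301"

code_point_count = len(text)            # Python's len() on str counts code points
byte_count = len(text.encode("utf-8"))  # length of the UTF-8 encoding in bytes

print(code_point_count, byte_count)
```

Depending on which of those numbers you call the "character count", you get a different answer for the same string.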

If you want to loop over "characters" in a string, you should loop over code points (each of which UTF-8 encodes as 1 to 4 bytes). But why would anyone want to loop over the individual bytes of a code point? That functionality could be provided, but not at the expense of clarity.
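
As a rough sketch of the difference, Python happens to work this way already: iterating a `str` yields code points, and you only see bytes if you explicitly ask for an encoding. The sample string is an assumption chosen to mix 1-, 2-, and 4-byte code points.

```python
# Sample text (illustrative): includes U+00EF (2 bytes) and U+1F600 (4 bytes).
text = "na\u00efve \U0001F600"

# Default iteration over a str goes code point by code point.
code_points = [f"U+{ord(c):04X}" for c in text]

# Bytes are only visible after an explicit encode step.
raw_bytes = list(text.encode("utf-8"))

print(len(code_points), "code points:", code_points)
print(len(raw_bytes), "bytes:", [f"{b:02X}" for b in raw_bytes])
```

The byte-level view is still available, but it is opt-in, which keeps the common case (walking code points) clear.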