Discussion on: Why No Modern Programming Language Should Have a 'Character' Data Type

awwsmm profile image
Andrew (he/him) Author

UTF-8 tries to straddle performance and usability. By using a variable-width encoding, you're minimising memory usage. It just means that a few of the bits in the leading byte "go unused" because they indicate the number of bytes in the multi-byte character. This is a good compromise, but it still doesn't mean that a "character" should be defined as a variable-width UTF-8 element.

patarapolw profile image
Pacharapol Withayasakpunt

Hmm? What it have to do with combined characters and Zalgo?

Thread Thread
awwsmm profile image
Andrew (he/him) Author

Zalgo is just an extreme form of combining marks / joiner characters. Most people would consider


...to be a single character that just happens to have a lot of accent marks on it. If you define "character" as "grapheme", this is true. If you define "character" as "Unicode code point", it is not. That single "character" contains 34 UTF-16 elements. Try running this on CodePen and have a look at the console:

let ii = 0
const zalgo = "Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦"

while (true) {
  let code = zalgo.charCodeAt(ii)
  if (Number.isNaN(code)) break
  console.log(`${ii}: ${code}`)
  ii += 1

The problem arises because programmers' intuitive understanding of "character" tends to be closer to "code point", while the average person's understanding of "character" tends to be closer to "grapheme".