At my job (and probably yours) we're always copying and pasting stuff. Be it little snippets, big chunks of code from old projects or from the web (of course you never do that), or just plain client content, we're doing it. This post focuses mainly on copying content because that's when you're copying from different sources, programs and interfaces, but it applies to everything.
On one bug-fixing morning I got this screenshot, saying there was a bug with the word "código" on Firefox:
As obvious and visible as a bug can be, I didn't see it. I opened the faulty page on my machine with Chromium: it looked OK. Weird. I opened Chrome, Safari and Firefox and it only occurred on Firefox. I looked at the code in Sublime Text and it looked fine:
So I got stuck looking at it, inspecting it with all the browser tools I could, but couldn't find any lead. I wrote an ó beside the word código and it looked fine. It was definitely a problem with that specific ó. I could have deleted the word and moved on, but no, I needed to get to the bottom of this.
So I copied the word from Sublime Text and searched the web for "translate unicode" and "copy characters reveal unicode" (you can see I was very lost on this) and I was brought to a couple of pages that helped.
One is r12a's Unicode code converter, which converted the copied character to o&#x0301;. That is two characters, not one as intended. The other page is Grant McLean's Unicode Character Finder, which showed this when I pasted the culprit character:
It dropped the first "o" because, when you paste into this box, it only shows the last character. Definitely two characters. How can this be? I don't know.
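The same check can be done locally instead of pasting into a web page. Here is a minimal sketch in Python using only the standard `unicodedata` module. The function name `describe` and the two sample strings are made up for illustration: the first string is an "o" followed by U+0301, matching what the converter reported, and the second is the single precomposed ó you get when typing it directly.

```python
import unicodedata

def describe(text):
    """Return one (code point, Unicode name) pair per character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# "o" followed by a combining acute accent, like the copied text
decomposed = "o\u0301"
# the single precomposed character, like the one typed by hand
precomposed = "\u00f3"

print(describe(decomposed))
# [('U+006F', 'LATIN SMALL LETTER O'), ('U+0301', 'COMBINING ACUTE ACCENT')]
print(describe(precomposed))
# [('U+00F3', 'LATIN SMALL LETTER O WITH ACUTE')]
```

Both strings are meant to look identical on screen, which is why the difference is so easy to miss in an editor.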
This is client text; I copied it from somewhere. I can't recall exactly, but I think this particular text was forwarded from a client email. So it is weird character handling on my end, the client's end or the man in the middle. And only Firefox shows it incorrectly.
Note: it seems that some fonts handle these characters in different ways. When writing this post I noticed that the font face I'm using in Sublime renders these two characters as the single one they should be, but if I change it to, for example, Inconsolata, it shows up differently:
This is because Inconsolata doesn't have this character in its table, so the editor switches to the default font for it.
Of course I will always copy and paste client text. I will not retype everything; it would consume a lot of time.
Is there any way to sanitize text to avoid this kind of problem?
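One candidate answer is Unicode normalization. The NFC form recomposes a base letter plus combining mark into a single precomposed character whenever one exists. Below is a minimal sketch in Python, assuming the pasted text arrives as a normal string. The name `sanitize` is hypothetical, and whether this would have fixed the Firefox rendering in my case is an assumption I haven't verified.

```python
import unicodedata

def sanitize(text):
    """Recompose base letter + combining mark sequences into single characters (NFC)."""
    return unicodedata.normalize("NFC", text)

# "código" as it was pasted: plain "o" followed by a combining acute accent
pasted = "co\u0301digo"
clean = sanitize(pasted)

print(len(pasted), len(clean))   # 7 6
print(clean == "c\u00f3digo")    # True
```

NFC only helps where a precomposed character exists. Sequences with no single-character equivalent stay as they are, so this is a cleanup step for pasted content and not a guarantee.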