If you have ever needed to scrape a legacy site in Hebrew, you have probably seen a lot of these: �������
The replacement character � (often displayed as a black diamond with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials block. It is used to indicate that a system could not decode part of a stream of data into a valid character.
Here is a snippet of what you get if you read the response as text without decoding it yourself (res.text() always decodes the bytes as UTF-8):
<html dir=ltr>\n<head>\n<title>�����</title>.......</html>\n
And here is what you get if you decode it with the right encoding:
<html dir=ltr>\n<head>\n<title>חדשות</title>.......</html>\n
To get the Hebrew letters, we'll need to use the TextDecoder class, which is available as a global in Node.js (and in browsers).
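// fetch needs an absolute URL, including the scheme
fetch('https://www.example.com')
  // read the raw bytes instead of letting fetch decode them as UTF-8
  .then(res => res.arrayBuffer())
  .then(buffer => {
    const decoder = new TextDecoder('windows-1255');
    return decoder.decode(buffer);
  });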
Here we are using the windows-1255 encoding because we are decoding Hebrew characters. For a site written in Cyrillic script, we could choose windows-1251 instead.
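To see why the label matters, here is a small sketch that decodes the same five bytes three different ways. The byte values and the decodeAs helper are my own illustration, not something from a real site, and it assumes a Node.js build with full ICU support (the default in official builds), since legacy encodings such as windows-1255 depend on it.

```javascript
// The Hebrew word "חדשות" as windows-1255 bytes (one byte per letter).
const bytes = new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa]);

// Hypothetical helper: decode the same bytes under a given encoding label.
const decodeAs = label => new TextDecoder(label).decode(bytes);

const asUtf8 = decodeAs('utf-8');            // invalid UTF-8, so replacement characters
const asHebrew = decodeAs('windows-1255');   // the intended Hebrew text
const asCyrillic = decodeAs('windows-1251'); // valid letters, but the wrong script

console.log({ asUtf8, asHebrew, asCyrillic });
```

Note that the windows-1251 call does not fail. It quietly produces Cyrillic letters, so a wrong label shows up as nonsense text rather than as an error.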
And of course, we like DRY code!
So I recommend extracting this into a function in a utils folder, both for reuse and for readability (the function's name tells us what it does).
export const decodeLegacyWebsite = async promise =>
  promise
    // read the raw bytes of the response
    .then(res => res.arrayBuffer())
    .then(buffer => {
      // decode the bytes as windows-1255 (Hebrew)
      const decoder = new TextDecoder('windows-1255');
      return decoder.decode(buffer);
    });
The decodeLegacyWebsite function receives a promise (such as the one returned by fetch) and returns a promise that resolves to a string containing the HTML response from the site.
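As a rough usage sketch, here is a variant of the helper that takes the encoding as a parameter, so the same function could serve Hebrew and Cyrillic sites. The names decodeLegacyResponse and fakeFetch are hypothetical, and fakeFetch only stands in for a real fetch call so the example runs without a network. It assumes Node.js 18 or later, where Response is a global.

```javascript
// Hypothetical variant of decodeLegacyWebsite with a configurable encoding.
const decodeLegacyResponse = (promise, encoding = 'windows-1255') =>
  promise
    .then(res => res.arrayBuffer())
    .then(buffer => new TextDecoder(encoding).decode(buffer));

// Stand-in for fetch(): resolves to a Response whose body is
// the word "חדשות" encoded as windows-1255 bytes.
const fakeFetch = () =>
  Promise.resolve(new Response(new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa])));

// With a real site this would be decodeLegacyResponse(fetch(url)).
const demo = decodeLegacyResponse(fakeFetch()).then(html => {
  console.log(html);
  return html;
});
```

Defaulting the encoding to windows-1255 keeps the call sites for Hebrew pages as short as before, while a Cyrillic page would pass 'windows-1251' as the second argument.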