DEV Community

Daniel Bellmas

Decode a Legacy Website

If you have ever needed to scrape a legacy (old) website in Hebrew, you probably got a lot of these: �������

The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system cannot render a stream of data to a correct symbol.

Here is a snippet of what you get if you do not decode:
<html dir=ltr>\n<head>\n<title>�����</title>.......</html>\n

And here is what you get if you do:
<html dir=ltr>\n<head>\n<title>חדשות</title>.......</html>\n
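
To see where the � characters come from, here is a minimal sketch. It assumes Node.js 18 or later (where Response is a global) and uses a hand-written byte array, titleBytes, as a stand-in for what a windows-1255 server would send for the title above. Reading a response with text() always decodes the body as UTF-8, and none of these bytes form valid UTF-8, so each one turns into U+FFFD.

```javascript
// Assumed sample: the bytes of "חדשות" as a windows-1255 page would send them.
const titleBytes = new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa]);

// Response.text() always decodes the body as UTF-8, so every byte here
// is invalid on its own and is replaced with U+FFFD.
const garbledTitle = new Response(titleBytes).text();

garbledTitle.then(text => {
  console.log(text); // five replacement characters
  console.log(text.includes('\uFFFD'));
});
```

Checking a scraped string for '\uFFFD' like this is a cheap way to notice that a page was decoded with the wrong encoding.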

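To get the Hebrew letters, we'll use the TextDecoder class, which is available as a global in Node.js (and in browsers). Instead of letting the response be read as UTF-8 text, we read the raw bytes with arrayBuffer() and decode them ourselves. Legacy encodings such as windows-1255 need a Node.js build with full ICU data, which the official builds have included by default since Node.js 13.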

fetch('https://www.example.com')
    .then(res => res.arrayBuffer())
    .then(buffer => {
      // Decode the raw bytes as windows-1255 instead of the default UTF-8
      const decoder = new TextDecoder('windows-1255');
      return decoder.decode(buffer);
    });

Here we use the windows-1255 encoding because that is the encoding the site uses for its Hebrew text.
For a site in another script, pass the matching encoding label instead, for example windows-1251 for Cyrillic.
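
The encoding label matters because the same bytes map to different letters in each code page. As a small illustration (the byte values are picked by hand for this sketch, not taken from a real site), the three bytes below decode to the first three Hebrew letters under windows-1255 and to the first three lowercase Cyrillic letters under windows-1251:

```javascript
// Three hand-picked bytes from the upper half of the single-byte range.
const sampleBytes = new Uint8Array([0xe0, 0xe1, 0xe2]);

// Same bytes, two different code pages.
const asHebrew = new TextDecoder('windows-1255').decode(sampleBytes);
const asCyrillic = new TextDecoder('windows-1251').decode(sampleBytes);

console.log(asHebrew);   // "אבג"
console.log(asCyrillic); // "абв"
```

So if the decoded page shows readable letters from the wrong alphabet rather than �, the label passed to TextDecoder is probably the thing to check.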

And of course, we like DRY code!
So I recommend extracting this into a function in a utils folder, both for reuse and for readability (the function's name tells us what it does).

export const decodeLegacyWebsite = async promise => {
  // Wait for the fetch Response, then read its body as raw bytes
  const res = await promise;
  const buffer = await res.arrayBuffer();
  // Decode the bytes as windows-1255 (Hebrew)
  return new TextDecoder('windows-1255').decode(buffer);
};

The decodeLegacyWebsite function receives a promise that resolves to a fetch Response and returns a promise that resolves to a string containing the site's HTML.
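
Here is a rough sketch of how such a helper might be called. The names decodeResponse and fakeFetch are hypothetical and not part of the original snippet: decodeResponse is a variant that takes the encoding as an optional second argument, and fakeFetch is a locally built Response (assuming Node.js 18 or later) that stands in for a real fetch call so the example runs without a network.

```javascript
// Hypothetical variant of decodeLegacyWebsite with a configurable encoding.
const decodeResponse = async (promise, encoding = 'windows-1255') => {
  const res = await promise;
  const buffer = await res.arrayBuffer();
  return new TextDecoder(encoding).decode(buffer);
};

// Stand-in for fetch(...): a Response whose body is "חדשות" in windows-1255.
const fakeFetch = Promise.resolve(
  new Response(new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa]))
);

const decoded = decodeResponse(fakeFetch);
decoded.then(html => console.log(html)); // "חדשות"
```

Against a real site, the call would look like decodeResponse(fetch('https://www.example.com')), with a second argument such as 'windows-1251' when the site is not in Hebrew.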
