Decode a Legacy Website

#webdev #javascript #decode #legacy

If you have ever needed to scrape a legacy (old) site in Hebrew, you have probably seen a lot of these: ��

The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It marks the places where a system could not decode a stream of bytes into a valid character.

Here is a snippet of what you get if you do not decode:
<html dir=ltr>\n<head>\n<title>��</title>.......</html>\n

And here is what you get if you do:
<html dir=ltr>\n<head>\n<title>חדשות</title>.......</html>\n
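To see where the � characters come from, here is a minimal sketch that decodes the same bytes twice. The byte values are my own illustration (the word חדשות written in windows-1255, one byte per letter), not data taken from a real site, and it assumes a Node.js build that supports legacy encodings.

```javascript
// "חדשות" as windows-1255 bytes (illustrative data, one byte per Hebrew letter).
const bytes = new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa]);

// Decoded as UTF-8, none of these bytes form a valid sequence,
// so the decoder substitutes U+FFFD for them.
const asUtf8 = new TextDecoder('utf-8').decode(bytes);

// Decoded with the encoding the bytes were written in, the text comes back.
const asHebrew = new TextDecoder('windows-1255').decode(bytes);

console.log(asUtf8.includes('\uFFFD')); // true
console.log(asHebrew);
```

The bytes themselves are never damaged; only the choice of decoder decides whether you see Hebrew or replacement characters.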

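To get the Hebrew letters we'll need to use the TextDecoder class, which is built into Node.js (and browsers), so no extra package is required. Legacy encodings such as windows-1255 need a Node.js build with full ICU data, which recent official builds include by default.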

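// fetch needs an absolute URL, including the scheme
fetch('https://www.example.com')
    // read the raw bytes; res.text() would always decode them as UTF-8
    .then(res => res.arrayBuffer())
    .then(buffer => {
      const decoder = new TextDecoder('windows-1255');
      return decoder.decode(buffer);
    });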

Here we use the windows-1255 encoding label because we are decoding Hebrew characters.
For a site written in Cyrillic script, windows-1251 would be the appropriate choice.
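The label matters because each legacy encoding maps the same byte values to different letters. A small sketch, using made-up sample bytes rather than a real response, shows one byte sequence read as Hebrew and as Cyrillic, and what happens with a label the decoder does not know:

```javascript
// Illustrative bytes only: the same five values, interpreted two ways.
const sample = new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa]);

const hebrew = new TextDecoder('windows-1255').decode(sample);
const cyrillic = new TextDecoder('windows-1251').decode(sample);

console.log(hebrew);   // Hebrew letters
console.log(cyrillic); // Cyrillic letters, a different text entirely

// An unknown label is rejected up front with a RangeError,
// so a typo in the label fails loudly instead of producing garbage.
let unknownLabelRejected = false;
try {
  new TextDecoder('not-a-real-encoding');
} catch (err) {
  unknownLabelRejected = err instanceof RangeError;
}
console.log(unknownLabelRejected);
```

If the output of a scrape is readable but in the wrong alphabet, a mismatched label like this is a likely cause.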

And of course, we like DRY code!
So I recommend extracting this into a function in a utils folder, both for reuse and for readability (the function's name tells us what it does).

export const decodeLegacyWebsite = async promise =>
  promise
    .then(res => res.arrayBuffer())
    .then(buffer => {
      const decoder = new TextDecoder('windows-1255');
      return decoder.decode(buffer);
    });

The decodeLegacyWebsite function receives a promise (such as the one returned by fetch) and returns a promise that resolves to a string containing the HTML response from the site.
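A possible way to call it is sketched below. To keep the example runnable without a network connection, the function is repeated without the export, and a hypothetical fakeFetch helper stands in for fetch by wrapping sample windows-1255 bytes in a Response object (a global in Node.js 18 and later). In real code you would pass fetch('https://www.example.com') instead.

```javascript
// Same logic as decodeLegacyWebsite above, repeated so this sketch runs on its own.
const decodeLegacyWebsite = promise =>
  promise
    .then(res => res.arrayBuffer())
    .then(buffer => new TextDecoder('windows-1255').decode(buffer));

// Hypothetical stand-in for fetch(url): resolves to a Response whose body
// is the word "חדשות" encoded as windows-1255 (illustrative data).
const fakeFetch = () =>
  Promise.resolve(new Response(new Uint8Array([0xe7, 0xe3, 0xf9, 0xe5, 0xfa])));

// The result is a promise, so use .then (or await) to get the decoded string.
const decoded = decodeLegacyWebsite(fakeFetch());
decoded.then(html => console.log(html));
```

Because the function takes the promise rather than a URL, the caller stays in control of the request itself (headers, method, and so on).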
