DEV Community

Yogi Saputro
Yogi Saputro

Posted on

Data Wrangling with VSCode (Yes, Really!)

In this article, I'm going to show techniques and tips for transforming raw text data to standard format using VSCode. It's effective, quick, and dirty. Suitable for one-time wrangling.
Data Wrangling with Visual Studio Code

People who works around data sometimes get strange one-time requests.

"Hey, can you get this data real quick? Shouldn't be too hard, right? Appreciate it".

It happended to me (as full-stack dev) few times, say: list of banks, list of loan rates, and so on. It's good thing that the data is available. It's bad thing that they are in PDF or HTML table. My system only works with CSV or JSON. So it is wrangling time!

There are many tools available for wrangling data. Nearly all of them are sophisticated and packed with features. Too many features means too many resource needed. I don't want to open Jupyter Notebooks or Tableau just to add commas or parse words. I need my resource for that 10 tabs of Stack Overflow, you know.

Finally, I resort to VSCode. It is software used by me and millions of people, has powerful editing features, and has various extensions. I can tailor it to wrangle data quickly.

Now, I will show how to wrangle data from PDF or HTML and transform it to CSV or JSON, using only VSCode.

Skills

Wrangling data in VSCode comes with a caveat: you need to understand regular expressions a.k.a regex. Don't be intimidated. Treat regex as a friend. Play around with regex tester here, or check out the documentation and cheat-sheet.

Extensions

Before going further, I recommend installing at least three extensions on VSCode.

  1. Rainbow CSV : it highlights columns by colors, detects inconsistency in data, and provides RBQL (basically SQL for CSV file).
  2. JSON to CSV : it converts JSON format to CSV and vice-versa.
  3. Prettify JSON : it makes JSON easier to read.

Data Source

PDF

I'm going to use list of registered lending fintech companies in Indonesia, downloadable here.

HTML

I choose Wikipedia list of national heroes of Indonesia as example. Available both in English and Bahasa Indonesia.

From HTML to CSV

  1. First, note that the first output format is always CSV because of its similarity with natural structure of table (e.g columns, rows, header). Copying table from HTML document
  2. Open the HTML data source. Then, copy text from HTML. On this case, I will copy the whole table. It is also useful to note which column is needed.
  3. On VSCode, create new untitled file ctrl + N Paste text into VSCode
  4. Paste the text. Now we have nice list. Next, we want to remove unnecessary lines. Things like title, page number might get copied accidentally. It can be removed.
  5. Now we need to remove unusable columns. For this case, column Ref(s) is not unusable and meaningless. So we need to get rid of it. If you need to filter columns, there is a way to do it later.
  6. Access replace text menu ctrl + H, then press alt + R for regex search.
  7. Now, what do these refs have in common? They have square brackets. Then, inside square bracket is number or letter. There are some possible regex. Try to approach it. Example of inaccurate regex The picture above with \[[0-9]\] regex works for single digit, but failed for two or more digits and letters. I also prefer to remove trailing whitespaces. Example of accurate regex Going further, I acknowledged that \[[0-9a-z]*\] works. \t?\[..?.?\] also works (it removes whitespace as well). Then, replace with '' a.k.a nothing.
  8. Replace all occurences : ctrl + alt + enter. Now the refs are no more.
  9. Replace characters that confuses CSV, like comma (,) or double-tick ("). It is important to avoid mistakes when parsing CSV. No need for regex now.
    • Replace text ctrl + H
    • Use text : , , replace with ; or leave it empty
    • Replace all occurences : ctrl + alt + enter
    • Repeat on other characters
  10. Setup header manually to match standard CSV format. Setup header This is my recommendation:
    • avoid spaces
    • use snake_case
    • add commas manually
  11. Add proper separator, in this case: comma.
    • For this data, I am a bit lucky. Each data is consistently separated by tab or whitespace. Just need to convert whitespace to comma. This is also why removing unnecessary whitespace on step 7 is important. Extra commas will be confusing.
    • In some cases, data is separated by spaces. It can be confusing, especially when the column is string with spaces. There are strategies to overcome it below.
    • Replace text ctrl + H, then press alt + R for regex search option
    • Use text : \t , replace with ,
    • Replace all occurences : ctrl + alt + enter Highlighted CSV
  12. et voila! Data should be ready by now. Save it as csv file. Rainbow CSV should automatically highlight it afterwards.

From PDF to CSV

The basic is similar to HTML data source

  • Copy all data from PDF
  • Paste into VSCode workspace
  • Remove unnecessary lines
  • Remove unusable columns
  • Regex search and replace
  • Remove special characters (commas, double-ticks, etc.)
  • Setup header manually
  • Add proper separators
  • Save as CSV

Let's dive into action.

  1. PDF data often returns a mess. Unnecessary lines, inconsistent line breaks, space separators, are all present here. Paste PDF mess It is better to observe this pile of mess for a while. Also check with original PDF file to figure out "how did this line ended up here?"
  2. Check header and footer of each PDF page. Something like page number, repeated header, or repeated title might slip in. Clean them all.
  3. Nomor or row number column has no value, so it can be removed. Regex including newline Now there's a little trick. Normal number regex like [0-9. ]+ is not accurate since there are matches at other places. But I know number column is always at the beginning of row. So I add newline \n to my number regex, then replace it with \n.
  4. A lot of regex find-and-replace are required.
    • Replace Android\ndan iOS with Android dan iOS
    • Replace -\n (it means web based fintech, not mobile-based) with Web\n
    • Replace Konvensional\ndan Syariah with Konvensional dan Syariah
  5. Fortunately, there's no character that might confuse CSV. Move on to next step.
  6. Handling header might require domain knowledge. So I basically change the header into this: platform_name, website, company_name, official_register, date_of_registration, lending_type, platform_type.
  7. Now here comes the fun part of adding separators. Remember that we cannot change data, and regex find-and-replace always replace pattern to constant value. Thus, the key is identifying constant values.
    • Many URLs begin with either http, https, or www. So, I can replace http with ,http and www (spaced) with ,www. Just add comma before replaced value. Now that's progress. Multiple cursors
    • Some URLs have no prefix, but have .id suffix. Find it with [a-z]+\.id regex, then press alt + enter to have multiple cursors working on them. Move to leftmost string with ctrl + ←, then add commas manually.
    • All companies in the list begin with PT. So, replace PT (spaced) with ,PT
    • All official registers begin with KEP or S. There are some inconsistencies about spacing and dash/minus character. So, replace KEP ?-?–? with ,KEP-. Also replace S ?[-?–?]+ with ,S-.
    • Lending type is either 'Konvensional', 'Syariah', or 'Konvensional dan Syariah'. Replace Konvensional with ,Konvensional,. Replace Syariah with ,Syariah,. Then replace Konvensional,dan,Syariah with Konvensional dan Syariah
    • Official register ends with year. So I can replace /2021 to /2021, and /2020 to /2020, and so on. CSV Lint menu
    • Use CSV Lint from ctrl + shift + P or menu at bottom of VSCode window to check any incorrect row.
  8. Save as csv file, and we're done!

Extra Miles

Select Specific CSV Columns

Rainbow CSV has RBQL feature, which works like SQL. It can be used to select specific columns if needed.
Query menu

  1. Click on query menu, a new tab will open. Run query to select column
  2. Select columns like in SQL query, then click Run Query result
  3. The result appears in new tab. Save as new CSV using ctrl + shift + s

Save as JSON

Here's the tips to get JSON from CSV file, completely within VSCode.
Convert CSV to JSON

  1. Use JSON to CSV extension. Open the CSV file, then press F1 > Convert CSV to JSON
  2. Use Prettify JSON extension. Press F1 > Prettify JSON. Now it looks good! Pretty JSON
  3. On Windows, there are trailing whitespace \r. Simply find \r and replace with empty string ``
  4. Save as JSON file. Do not save directly using ctrl + S, it will overwrite the CSV file.

Handle Separator Placing

To summarize, here are tips and trick to handle difficult separator placing:

  • Utilize constant value
  • Utilize newline
  • Utilize multiple cursors like alt + enter when searching
  • Utilize manual multiple cursors with ctrl + alt + ↑ or ctrl + alt + ↓. Note that only maximum of 10000 parallel cursors are allowed on VSCode.
  • Utilize CSV Lint

Summary

I've been using this method quite often on various data source. Once, I wrangle 80-page PDF file and transform it to 40000-row CSV in about 20 minutes. Just keep in mind these three things:

  1. Follow these steps in order:
    • Copy all data then paste it into VSCode workspace
    • Remove unnecessary lines
    • Remove unusable columns
    • Regex search and replace
    • Remove special characters (commas, double-ticks, etc.)
    • Setup header manually
    • Add proper separators
    • Save as CSV
    • Add or remove column (optional)
    • Convert to JSON (optional)
  2. Be nimble on regex and its behaviors
  3. Don't be afraid to undo your steps when things get messy. Maybe your regex wasn't accurate enough. Go back and fix your regex. It will save you time rather than handling data manually. Manual wrangling should be exception, not mandatory step.

I am Yogi Saputro, a full stack developer at Pinjammodal.id. I’m also passionate in building system, pipeline, and provide value to businesses. Check out my Github and portfolio as well.

Discussion (0)