In this article, I'm going to show techniques and tips for transforming raw text data to standard format using VSCode. It's effective, quick, and dirty. Suitable for one-time wrangling.
People who works around data sometimes get strange one-time requests.
"Hey, can you get this data real quick? Shouldn't be too hard, right? Appreciate it".
It happended to me (as full-stack dev) few times, say: list of banks, list of loan rates, and so on. It's good thing that the data is available. It's bad thing that they are in PDF or HTML table. My system only works with CSV or JSON. So it is wrangling time!
There are many tools available for wrangling data. Nearly all of them are sophisticated and packed with features. Too many features means too many resource needed. I don't want to open Jupyter Notebooks or Tableau just to add commas or parse words. I need my resource for that 10 tabs of Stack Overflow, you know.
Finally, I resort to VSCode. It is software used by me and millions of people, has powerful editing features, and has various extensions. I can tailor it to wrangle data quickly.
Now, I will show how to wrangle data from PDF or HTML and transform it to CSV or JSON, using only VSCode.
Skills
Wrangling data in VSCode comes with a caveat: you need to understand regular expressions a.k.a regex. Don't be intimidated. Treat regex as a friend. Play around with regex tester here, or check out the documentation and cheat-sheet.
Extensions
Before going further, I recommend installing at least three extensions on VSCode.
- Rainbow CSV : it highlights columns by colors, detects inconsistency in data, and provides RBQL (basically SQL for CSV file).
- JSON to CSV : it converts JSON format to CSV and vice-versa.
- Prettify JSON : it makes JSON easier to read.
Data Source
I'm going to use list of registered lending fintech companies in Indonesia, downloadable here.
HTML
I choose Wikipedia list of national heroes of Indonesia as example. Available both in English and Bahasa Indonesia.
From HTML to CSV
- First, note that the first output format is always CSV because of its similarity with natural structure of table (e.g columns, rows, header).
- Open the HTML data source. Then, copy text from HTML. On this case, I will copy the whole table. It is also useful to note which column is needed.
- On VSCode, create new untitled file
ctrl + N
- Paste the text. Now we have nice list. Next, we want to remove unnecessary lines. Things like title, page number might get copied accidentally. It can be removed.
- Now we need to remove unusable columns. For this case, column
Ref(s)
is not unusable and meaningless. So we need to get rid of it. If you need to filter columns, there is a way to do it later. - Access replace text menu
ctrl + H
, then pressalt + R
for regex search. - Now, what do these refs have in common? They have square brackets. Then, inside square bracket is number or letter. There are some possible regex. Try to approach it.
The picture above with
\[[0-9]\]
regex works for single digit, but failed for two or more digits and letters. I also prefer to remove trailing whitespaces. Going further, I acknowledged that\[[0-9a-z]*\]
works.\t?\[..?.?\]
also works (it removes whitespace as well). Then, replace with''
a.k.a nothing. - Replace all occurences :
ctrl + alt + enter
. Now the refs are no more. -
Replace characters that confuses CSV, like comma (
,
) or double-tick ("
). It is important to avoid mistakes when parsing CSV. No need for regex now.- Replace text
ctrl + H
- Use text :
,
, replace with;
or leave it empty - Replace all occurences :
ctrl + alt + enter
- Repeat on other characters
- Replace text
-
Setup header manually to match standard CSV format.
This is my recommendation:
- avoid spaces
- use
snake_case
- add commas manually
-
Add proper separator, in this case: comma.
- For this data, I am a bit lucky. Each data is consistently separated by tab or whitespace. Just need to convert whitespace to comma. This is also why removing unnecessary whitespace on step 7 is important. Extra commas will be confusing.
- In some cases, data is separated by spaces. It can be confusing, especially when the column is string with spaces. There are strategies to overcome it below.
- Replace text
ctrl + H
, then pressalt + R
for regex search option - Use text :
\t
, replace with,
- Replace all occurences :
ctrl + alt + enter
-
et voila! Data should be ready by now. Save it as
csv
file. Rainbow CSV should automatically highlight it afterwards.
From PDF to CSV
The basic is similar to HTML data source
- Copy all data from PDF
- Paste into VSCode workspace
- Remove unnecessary lines
- Remove unusable columns
- Regex search and replace
- Remove special characters (commas, double-ticks, etc.)
- Setup header manually
- Add proper separators
- Save as CSV
Let's dive into action.
- PDF data often returns a mess. Unnecessary lines, inconsistent line breaks, space separators, are all present here. It is better to observe this pile of mess for a while. Also check with original PDF file to figure out "how did this line ended up here?"
- Check header and footer of each PDF page. Something like page number, repeated header, or repeated title might slip in. Clean them all.
-
Nomor
or row number column has no value, so it can be removed. Now there's a little trick. Normal number regex like[0-9. ]+
is not accurate since there are matches at other places. But I know number column is always at the beginning of row. So I add newline\n
to my number regex, then replace it with\n
. - A lot of regex find-and-replace are required.
- Replace
Android\ndan iOS
withAndroid dan iOS
- Replace
-\n
(it means web based fintech, not mobile-based) withWeb\n
- Replace
Konvensional\ndan Syariah
withKonvensional dan Syariah
- Replace
- Fortunately, there's no character that might confuse CSV. Move on to next step.
- Handling header might require domain knowledge. So I basically change the header into this:
platform_name, website, company_name, official_register, date_of_registration, lending_type, platform_type
. - Now here comes the fun part of adding separators. Remember that we cannot change data, and regex find-and-replace always replace pattern to constant value. Thus, the key is identifying constant values.
- Many URLs begin with either
http
,https
, orwww
. So, I can replacehttp
with,http
andwww
(spaced) with,www
. Just add comma before replaced value. Now that's progress. - Some URLs have no prefix, but have
.id
suffix. Find it with[a-z]+\.id
regex, then pressalt + enter
to have multiple cursors working on them. Move to leftmost string withctrl + ←
, then add commas manually. - All companies in the list begin with
PT
. So, replacePT
(spaced) with,PT
- All official registers begin with
KEP
orS
. There are some inconsistencies about spacing and dash/minus character. So, replaceKEP ?-?–?
with,KEP-
. Also replaceS ?[-?–?]+
with,S-
. - Lending type is either 'Konvensional', 'Syariah', or 'Konvensional dan Syariah'. Replace
Konvensional
with,Konvensional,
. ReplaceSyariah
with,Syariah,
. Then replaceKonvensional,dan,Syariah
withKonvensional dan Syariah
- Official register ends with year. So I can replace
/2021
to/2021,
and/2020
to/2020,
and so on. - Use
CSV Lint
fromctrl + shift + P
or menu at bottom of VSCode window to check any incorrect row.
- Many URLs begin with either
- Save as
csv
file, and we're done!
Extra Miles
Select Specific CSV Columns
Rainbow CSV
has RBQL feature, which works like SQL. It can be used to select specific columns if needed.
- Click on query menu, a new tab will open.
- Select columns like in SQL query, then click Run
- The result appears in new tab. Save as new CSV using
ctrl + shift + s
Save as JSON
Here's the tips to get JSON from CSV file, completely within VSCode.
- Use
JSON to CSV
extension. Open the CSV file, then pressF1 > Convert CSV to JSON
- Use
Prettify JSON
extension. PressF1 > Prettify JSON
. Now it looks good! - On Windows, there are trailing whitespace
\r
. Simply find\r
and replace with empty string `` - Save as JSON file. Do not save directly using
ctrl + S
, it will overwrite the CSV file.
Handle Separator Placing
To summarize, here are tips and trick to handle difficult separator placing:
- Utilize constant value
- Utilize newline
- Utilize multiple cursors like
alt + enter
when searching - Utilize manual multiple cursors with
ctrl + alt + ↑
orctrl + alt + ↓
. Note that only maximum of 10000 parallel cursors are allowed on VSCode. - Utilize CSV Lint
Summary
I've been using this method quite often on various data source. Once, I wrangle 80-page PDF file and transform it to 40000-row CSV in about 20 minutes. Just keep in mind these three things:
- Follow these steps in order:
- Copy all data then paste it into VSCode workspace
- Remove unnecessary lines
- Remove unusable columns
- Regex search and replace
- Remove special characters (commas, double-ticks, etc.)
- Setup header manually
- Add proper separators
- Save as CSV
- Add or remove column (optional)
- Convert to JSON (optional)
- Be nimble on regex and its behaviors
- Don't be afraid to undo your steps when things get messy. Maybe your regex wasn't accurate enough. Go back and fix your regex. It will save you time rather than handling data manually. Manual wrangling should be exception, not mandatory step.
I am Yogi Saputro, a full stack developer at Pinjammodal.id. I’m also passionate in building system, pipeline, and provide value to businesses. Check out my Github and portfolio as well.
Top comments (0)