
Chillar Anand

Originally published at avilpage.com

Common Crawl on Laptop - Building Web Directory

This series of posts discusses how to process the Common Crawl dataset on a laptop.

  1. Extracting Subset of Common Crawl
  2. Building web directory (this post)

Introduction

In the earlier post, we extracted all Telugu web page URLs to a CSV file. In this post, let's explore these URLs and build a web directory from them.

Explore Data

Let's see how many URLs are present in the extracted subset of data.

$ wc -l telugu.csv
  852025 telugu.csv 


In the earlier post, we installed duckdb and used it to process Parquet files. duckdb can also execute SQL queries directly on CSV files. Let's use it to explore the data stored in telugu.csv.

Let's see how many unique domains are present in the data.

$ duckdb -c """
 SELECT COUNT(DISTINCT url\_host\_name\_reversed) as unique\_sites
 FROM read\_csv('telugu.csv', auto\_detect = TRUE);
"""
┌──────────────┐
│ unique_sites │
├──────────────┤
│ 13632 │
└──────────────┘


There are ~14k unique domains. Let's see the page density across these domains.

$ duckdb -c """
SELECT count AS page\_count,
COUNT(\*) AS sites
FROM (SELECT url\_host\_name\_reversed, COUNT(\*) AS count
FROM read\_csv('te.csv', auto\_detect = TRUE)
GROUP BY url\_host\_name\_reversed) AS t
GROUP BY page\_count
ORDER BY page\_count;
"""
┌────────────┬───────┐
│ page_count │ sites │
├────────────┼───────┤
│ 1 │ 6326 │
│ 2 │ 1904 │
│ 3 │ 733 │
│ 4 │ 459 │
│ 5 │ 315 │


About 75% of the sites have fewer than 5 pages. It is highly unlikely that the entire content of these sites is in Telugu. After manually checking a few of these sites, I found a lot of false positives.
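To spot-check such sites, we can pull a sample URL for domains that have only a handful of pages. A minimal sketch, reusing telugu.csv and the columns from the queries above (the sample size of 10 is arbitrary):

$ duckdb -c """
 -- one sample URL per small site, for manual inspection
 SELECT url_host_name_reversed, MIN(url) AS sample_url
 FROM read_csv('telugu.csv', auto_detect = TRUE)
 GROUP BY url_host_name_reversed
 HAVING COUNT(*) < 5
 LIMIT 10;
"""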

In the earlier post, we extracted all pages that contain Telugu content. Let's now keep only the pages where Telugu is the primary language.

$ duckdb -c """
 COPY (
 SELECT \* FROM read\_csv('cct.csv', auto\_detect=true) 
 WHERE content\_languages like 'tel%'
 ) TO 'te\_primary.csv' (DELIMITER ',', HEADER TRUE);
"""


$ wc -l te_primary.csv
  573130 te_primary.csv


$ duckdb -c "SELECT COUNT(DISTINCT url\_host\_name\_reversed) as unique\_sites FROM read\_csv('te\_primary.csv', auto\_detect = TRUE)"                           
┌──────────────┐
│ unique_sites │
├──────────────┤
│ 5666 │
└──────────────┘    


Let's see how page density per domain has changed.

$ duckdb -c """
SELECT count AS page\_count,
COUNT(\*) AS sites
FROM (SELECT url\_host\_name\_reversed, COUNT(\*) AS count
FROM read\_csv('te\_primary.csv', auto\_detect = TRUE)
GROUP BY url\_host\_name\_reversed) AS t
GROUP BY page\_count
ORDER BY page\_count
;
"""
┌────────────┬───────┐
│ page_count │ sites │
├────────────┼───────┤
│ 1 │ 2183 │
│ 2 │ 843 │
│ 3 │ 235 │
│ 4 │ 146 │
│ 5 │ 98 │


Page density remains almost the same.

Let's keep only the sites that have at least 5 pages in Telugu. This will eliminate a lot of false positives.
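A minimal sketch of this filter, reusing te_primary.csv from above (the output file name te_sites.csv is just a placeholder):

$ duckdb -c """
 COPY (
 -- keep only domains with at least 5 Telugu pages
 SELECT url_host_name_reversed, COUNT(*) AS pages
 FROM read_csv('te_primary.csv', auto_detect = TRUE)
 GROUP BY url_host_name_reversed
 HAVING COUNT(*) >= 5
 ) TO 'te_sites.csv' (DELIMITER ',', HEADER TRUE);
"""

Let's look at the most popular sites from the results.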

Rank,Domain,Open Page Rank
25,support.google.com,8.55
57,t.me,7.76
76,chrome.google.com,7.49
163,support.mozilla.org,6.99
170,groups.google.com,6.94


A lot of unrelated domains show up here because these domains may also have 10+ pages in Telugu. But we don't need them.

Let's look only at the home page (or the translated home page) where the primary content language is Telugu.

$ duckdb -c """
 SELECT COUNT(distinct url) 
 FROM read\_csv('te\_primary.csv', auto\_detect=true) 
 WHERE (url\_path = '/' or url\_path = '/te/') and url\_query is null;
"""


Now the domain count has come down to ~6k. Let's export these domains to a CSV file.
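A sketch of the export, writing the reversed host names to domains.csv, the file the Python script below reads (HEADER FALSE is an assumption, so that every line in the file is a domain):

$ duckdb -c """
 COPY (
 SELECT DISTINCT url_host_name_reversed
 FROM read_csv('te_primary.csv', auto_detect = TRUE)
 WHERE (url_path = '/' OR url_path = '/te/') AND url_query IS NULL
 -- HEADER FALSE so domains.csv has no header line, only reversed host names
 ) TO 'domains.csv' (DELIMITER ',', HEADER FALSE);
"""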

Common Crawl doesn't yet provide any kind of categorisation for these domains. For now, we can use Open PageRank to sort them by rank.

We can download the top 10 million domains from Open PageRank [3]. Here is a simple Python script to extract the Telugu domains from that list.

import pandas as pd

# Reversed host names exported from the duckdb query above
domains_file = 'domains.csv'
with open(domains_file, 'r') as f:
    telugu_domains = [line.strip() for line in f.readlines()]

# Convert reversed host names (com.example) back to normal form (example.com)
telugu_domains = ['.'.join(reversed(domain.split('.'))) for domain in telugu_domains]

# Keep only the Telugu domains from the Open PageRank top 10 million list
df = pd.read_csv('t10m.csv')
df = df[df['Domain'].isin(telugu_domains)]

df.to_csv('t10m_telugu.csv', index=False)


Now we have a list of all Telugu domains sorted by rank. In the next post, we will use this list to categorize the domains.


  1. https://commoncrawl.org

  2. https://en.wikipedia.org/wiki/Column-oriented_DBMS

  3. https://www.domcop.com/openpagerank

  4. https://duckdb.org
