Sebastian

Wikipedia Article Crawler & Clustering: Text Classification with Spacy

Spacy is a powerful NLP library that performs many NLP tasks in its default configuration, including tokenization, lemmatization, and part-of-speech tagging. These steps can be extended with a text classification task by providing training data in the form of preprocessed texts and their expected categories as dictionary objects. Both multi-label and single-label classification are supported.

This article shows how to apply text classification to Wikipedia articles using Spacy. You will learn how to process Wikipedia articles, extract their categories, and transform them into a Spacy-compatible format. Furthermore, you will see how to create a classification task configuration file, customize it, and run the training with Spacy CLI commands.

The technical context of this article is Python v3.11 and spacy v3.7.2. All examples should work with newer library versions too.

This article originally appeared at my blog admantium.com.

Context

This article is part of a blog series about NLP with Python. My previous article provided the core implementation of a WikipediaReader object that crawls and downloads articles. With this, a corpus of several hundred or thousand documents can be created, which is then abstracted by a WikipediaCorpus object for coarse corpus statistics data as well as access to individual articles. For all articles, their title, raw text, tokenized text, and vector representation are created and stored in a Pandas data frame object. This object is then consumed for successive clustering or classification tasks.

Prerequisite: Extract Wikipedia Article Categories

The first step is to provide the necessary data. Continuing with the program and features provided in my earlier articles, a Pandas DataFrame object is created that contains in each row the article name, its extracted categories, and the raw text.

Here starts the first extension: proper category names. On closer inspection, the categories extracted with the Python library wikipedia-api contain both the intended, lexical categories shown in an article's footer and internal "tags" that specify whether an article needs to be reviewed or otherwise modified.

To illustrate this point, consider the categories of the article about Vehicular automation. At the bottom of the article page, you see two categories: Uncrewed vehicles and Vehicular automation. However, when using the library, a total of 13 categories are returned. See the following code snippet.

import wikipediaapi as wiki_api

wiki = wiki_api.Wikipedia(
                language = 'en',
                extract_format=wiki_api.ExtractFormat.WIKI)

p = wiki.page('Vehicular_automation')

print(len(p.categories))
#13

print([name.replace('Category:', '') for name in p.categories.keys()])
#['All articles containing potentially dated statements', 'All articles with unsourced statements', 'Articles containing potentially dated statements from December 2021', 'Articles using small message boxes', 'Articles with short description', 'Articles with unsourced statements from June 2021', 'CS1 Japanese-language sources (ja)', 'Incomplete lists from April 2023', 'Incomplete lists from June 2021', 'Short description is different from Wikidata', 'Uncrewed vehicles', 'Use dmy dates from April 2023', 'Vehicular automation']

Initially, I tried to filter out categories that start with Articles using or All articles with, but new and unexpected variants kept appearing. Therefore, the categories need to be extracted from the article's content. Unfortunately, for both export formats, ExtractFormat.WIKI and ExtractFormat.HTML, the categories are not included.

Therefore, a new function is required. Consider the following target HTML structure:

<div id="catlinks" class="catlinks" data-mw="interface">
  <div id="mw-normal-catlinks" class="mw-normal-catlinks"><a href="/wiki/Help:Category"
      title="Help:Category">Categories</a>: <ul>
      <li><a href="/wiki/Category:Uncrewed_vehicles" title="Category:Uncrewed vehicles">Uncrewed vehicles</a></li>
      <li><a href="/wiki/Category:Vehicular_automation" title="Category:Vehicular automation">Vehicular automation</a>
      </li>
    </ul>
  </div>
  <div id="mw-hidden-catlinks" class="mw-hidden-catlinks mw-hidden-cats-hidden">Hidden categories: <ul>
      <li><a href="/wiki/Category:CS1_Japanese-language_sources_(ja)"
          title="Category:CS1 Japanese-language sources (ja)">CS1 Japanese-language sources (ja)</a></li>
      <li><a href="/wiki/Category:Articles_with_short_description"
          title="Category:Articles with short description">Articles with short description</a></li>
      <li><a href="/wiki/Category:Short_description_is_different_from_Wikidata"
          title="Category:Short description is different from Wikidata">Short description is different from Wikidata</a>
      </li>
      <li><a href="/wiki/Category:Use_dmy_dates_from_April_2023" title="Category:Use dmy dates from April 2023">Use dmy
          dates from April 2023</a></li>
      <li><a href="/wiki/Category:Articles_containing_potentially_dated_statements_from_December_2021"
          title="Category:Articles containing potentially dated statements from December 2021">Articles containing
          potentially dated statements from December 2021</a></li>
      <li><a href="/wiki/Category:All_articles_containing_potentially_dated_statements"
          title="Category:All articles containing potentially dated statements">All articles containing potentially
          dated statements</a></li>
      <li><a href="/wiki/Category:All_articles_with_unsourced_statements"
          title="Category:All articles with unsourced statements">All articles with unsourced statements</a></li>
      <li><a href="/wiki/Category:Articles_with_unsourced_statements_from_June_2021"
          title="Category:Articles with unsourced statements from June 2021">Articles with unsourced statements from
          June 2021</a></li>
      <li><a href="/wiki/Category:Articles_using_small_message_boxes"
          title="Category:Articles using small message boxes">Articles using small message boxes</a></li>
      <li><a href="/wiki/Category:Incomplete_lists_from_April_2023"
          title="Category:Incomplete lists from April 2023">Incomplete lists from April 2023</a></li>
      <li><a href="/wiki/Category:Incomplete_lists_from_June_2021"
          title="Category:Incomplete lists from June 2021">Incomplete lists from June 2021</a></li>
    </ul>
  </div>
</div>
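The extraction idea can be sketched on a minimal excerpt of this block before wiring it into a function. Note that the HTML string below is a trimmed toy stand-in for the live page, not the full markup:

```python
import re

# Minimal excerpt of the "mw-normal-catlinks" block from a rendered article page
html = ('<div id="mw-normal-catlinks" class="mw-normal-catlinks">'
        '<a href="/wiki/Help:Category" title="Help:Category">Categories</a>: <ul>'
        '<li><a href="/wiki/Category:Uncrewed_vehicles">Uncrewed vehicles</a></li>'
        '<li><a href="/wiki/Category:Vehicular_automation">Vehicular automation</a></li>'
        '</ul></div>')

# First regex: isolate the normal-catlinks block, second: grab all link texts
catlinks_regexp = re.compile(r'class="mw-normal-catlinks".*?<\/div>')
catnames_regexp = re.compile(r'<a.*?>(.*?)<\/a>')

cat_src = catlinks_regexp.findall(html)[0]
cats = catnames_regexp.findall(cat_src)

# The first match is the "Categories" help link, so it is skipped
print(cats[1:])
# ['Uncrewed vehicles', 'Vehicular automation']
```

The same two expressions are reused in the function below on the full page source.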

Only the categories from the first list, identified by mw-normal-catlinks, should be parsed. To solve this, I wrote the following function that uses the requests library to load the full web page, and then uses regular expressions to extract the categories:

import re

import requests
import wikipediaapi as wiki_api

def get_categories(article_name):
    wiki = wiki_api.Wikipedia(language = 'en')

    p = wiki.page(article_name)

    if not p.exists():
        return None

    r = requests.get(p.fullurl)
    html = r.text.replace('\r', '').replace('\n', '')

    catlinks_regexp = re.compile(r'class="mw-normal-catlinks".*?<\/div>')
    catnames_regexp = re.compile(r'<a.*?>(.*?)<\/a>')

    cat_src = catlinks_regexp.findall(html)[0]
    cats = catnames_regexp.findall(cat_src)

    if len(cats) == 0:
        return ['Uncategorized']
    else:
        # drop the leading "Categories" help link
        return cats[1:]

Let’s see the categories for "Vehicular automation" and "Artificial intelligence".

get_categories('Vehicular_automation')
#['Uncrewed vehicles', 'Vehicular automation']

get_categories('Artificial_intelligence')
#['Artificial intelligence',
# 'Cybernetics',
# 'Computational neuroscience',
# 'Computational fields of study',
# 'Data science',
# 'Emerging technologies',
# 'Formal sciences',
# 'Intelligence by type',
# 'Unsolved problems in computer science']

With this function, only relevant categories are extracted.

Step 1: Data Preparation

To extract the Wikipedia articles and their categories properly, a modified SciKit Learn pipeline is used. It's based on the same unmodified WikipediaCorpusTransformer object - see my article about Wikipedia article corpus transformation - and a modified Categorizer that uses the method described above.

The pipeline definition is as follows:

pipeline = Pipeline([
    ('corpus', WikipediaCorpusTransformer(root_path=root_path)),
    ('categorizer', Categorizer()),
])

The updated Categorizer implementation uses the code from the previous section, but also handles articles that have no categories, which are then represented by the dictionary object {'Uncategorized': 1}.

class Categorizer(SciKitTransformer):
    def __init__(self):
        self.wiki = wiki_api.Wikipedia(language = 'en')
        self.catlinks_regexp = re.compile(r'class="mw-normal-catlinks".*?<\/div>')
        self.catnames_regexp = re.compile(r'<a.*?>(.*?)<\/a>')

    def get_categories(self, article_name):
        p = self.wiki.page(article_name)

        if p.exists():
            try:
                r = requests.get(p.fullurl)
                html = r.text.replace('\r', '').replace('\n', '')

                cat_src = self.catlinks_regexp.findall(html)[0]
                cats = self.catnames_regexp.findall(cat_src)

                if len(cats) > 0:
                    return dict.fromkeys(cats[1:len(cats)], 1)
            except Exception as e:
                print(e)

        return {'Uncategorized':1}


    def transform(self, X):
        X['categories'] = X['title'].apply(lambda title: self.get_categories(title))
        return X

When running the pipeline, a DataFrame is created in which each row contains the article title, the raw text, and the newly added categories dictionary.

Step 2: Data Transformation For Spacy

The next step is to take the DataFrame object, perform a test-train split, and create the Spacy DocBin objects.

For the data split, SciKit Learn's handy train_test_split method is used, with its random_state parameter ensuring repeatable results.

train, test = train_test_split(X, test_size=0.2, random_state=42, shuffle=True)

Next, the data sets need to be transformed to a Spacy DocBin object and persisted on disk.
The individual steps are as follows:

  • Create the DocBin object
  • Load spacy default model en_core_web_lg for vectorizing the text
  • Create a dictionary of all categories from all articles
  • For each category, apply text vectorization to its raw text, and in the category dictionary object, set all applicable categories to the value 1
  • Finally, create a Spacy Doc object and store it in the DocBin object

Let’s examine the category creation a bit more, because it is complex in itself:

  • Transform the pandas Series object for all data in the "categories" column to a list
  • Join all lists together
  • Create a set from all lists (to reduce duplicate entries)
  • Create the dictionary with a dictionary comprehension for each entry of the list
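These four steps can be sketched on toy data. The two category dictionaries below are illustrative stand-ins, not the real corpus:

```python
from itertools import chain

# Toy stand-in for the "categories" column: one dict per article
article_categories = [
    {'Uncrewed vehicles': 1, 'Vehicular automation': 1},
    {'Artificial intelligence': 1, 'Vehicular automation': 1},
]

# Join all per-article category lists and deduplicate them via a set ...
categories_list = set(chain.from_iterable([list(d) for d in article_categories]))

# ... then build the template dictionary with every category set to 0
categories_dict = {cat: 0 for cat in categories_list}

print(sorted(categories_dict.items()))
# [('Artificial intelligence', 0), ('Uncrewed vehicles', 0), ('Vehicular automation', 0)]
```

For each article, a copy of this template is taken and the applicable categories are flipped to 1.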

Considering all of these aspects, the final function is as follows:

from copy import copy
from itertools import chain
from time import time

import spacy
from spacy.tokens import DocBin

def convert_multi_label(df, filename):
    db = DocBin()
    nlp = spacy.load('en_core_web_lg')
    total = len(df.index)
    print(f'{time()}: start processing {filename} with {total} files')

    categories_list = set(list(chain.from_iterable([list(d) for d in df["categories"].tolist()])))
    categories_dict = { cat: 0 for cat in categories_list }
    #print(categories_dict)

    count = 0
    for _, row in df.iterrows():
        count += 1
        print(f'Processing {count}/{total}')
        doc = nlp(row['raw'])
        cats = copy(categories_dict)
        for cat in row['categories']:
            cats[cat] = 1

        doc.cats = cats
        #print(doc.cats)
        db.add(doc)

    print(f'{time()}: finish processing {filename}')
    db.to_disk(filename)

Then, running it on the split data:

convert_multi_label(train, 'wikipedia_multi_label_train2.spacy')
# 2023-07-06 19:22:47.588476: start processing wikipedia_multi_label_train3.spacy with 217 files
# Processing 1/217
# Processing 2/217
# ...
# Processing 217/217
# 2023-07-06 19:25:06.319614: finish processing wikipedia_multi_label_train3.spacy

convert_multi_label(test, 'wikipedia_multi_label_test2.spacy')
# 2023-07-06 19:25:10.557678: start processing wikipedia_multi_label_test3.spacy with 55 files
# Processing 1/55
# Processing 2/55
# Processing 3/55
# ...
# Processing 55/55
# 2023-07-06 19:25:55.963293: finish processing wikipedia_multi_label_test3.spacy

Training

Training is started by running the command line interface spacy train with the following parameters:

  • training pipeline configuration file
  • training and testing input data
  • output directory where the trained model is stored

Example command:

spacy train spacy_wikipedia_multi_label_categorization_pipeline.cfg \
    --paths.train wikipedia_multi_label_train2.spacy \
    --paths.dev wikipedia_multi_label_test2.spacy \
    --output wikipedia_multi_label_model

During training, several pieces of information are shown: details about the configuration itself, hardware capabilities, and a table containing the statistical data that is produced. (For a full explanation, see my previous article.)

2023-07-06 18:47:11.934948: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
✔ Created output directory: wikipedia_multi_label_model
ℹ Saving to output directory: wikipedia_multi_label_model
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-07-06 18:47:16,428] [INFO] Set up nlp object from config
[2023-07-06 18:47:16,441] [INFO] Pipeline: ['tok2vec', 'textcat_multilabel']
[2023-07-06 18:47:16,445] [INFO] Created vocabulary
[2023-07-06 18:47:18,144] [INFO] Added vectors: en_core_web_lg
[2023-07-06 18:47:18,144] [INFO] Finished initializing nlp object
[2023-07-06 18:48:01,253] [INFO] Initialized pipeline components: ['tok2vec', 'textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE
---  ------  ------------  -------------  ----------  ------
  0       0          7.66           0.28        2.53    0.03
  0     200        148.32           1.37        2.55    0.03
  1     400          0.00           1.04        2.83    0.03
  2     600          0.01           1.04        2.65    0.03
  3     800          0.00           1.21        2.68    0.03
  4    1000          0.04           1.13        2.74    0.03
  5    1200          0.00           1.09        2.62    0.03
  6    1400          0.00           1.11        2.59    0.03
  7    1600          0.00           1.27        2.49    0.02
  8    1800          0.00           1.26        2.47    0.02
  9    2000          0.00           1.23        2.56    0.03

Training Logbook

After some initial training and testing with multi-label classification, I decided to train two additional models and compare their results. In total, the following models were trained:

  • single-label categorization
  • multi-label categorization
  • multi-label categorization with single label categories
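The single-label and multi-label variants differ in which text-classification component the pipeline configures. As an abridged sketch (not the full configuration files used for these runs), the relevant component sections look like this:

```ini
# Multi-label model: scores are independent, several categories may apply
[components.textcat_multilabel]
factory = "textcat_multilabel"

# Single-label model: scores form a distribution, exactly one category applies
[components.textcat]
factory = "textcat"
```

A given config uses one of the two components; the rest of the pipeline (here, tok2vec) stays the same.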

All models were trained with the same optimizer:

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

The following tables summarize the training process and duration.

| Training Articles | Training Categories | Testing Articles | Testing Categories |
| --- | --- | --- | --- |
| 217 | 1252 | 55 | 298 |

| Model | Category Score | Duration |
| --- | --- | --- |
| single-label categorization | 0.19 | 8h |
| multi-label categorization | 2.83 | 2h10m |
| multi-label categorization with single-label categories | 0.56 | 2h40m |

Comparison

Finally, let’s compare the models’ category estimations. The following function extracts the article name, raw text, and categories from the DataFrame object, then creates a Spacy nlp object to process the raw text and generate categories.

import spacy

def estimate_cats(model, article):
    name = article["title"]
    text = article["raw"]
    expected_cats = article["categories"]

    nlp = spacy.load(f'{model}/model-best')
    doc = nlp(text)
    estimated_cats = sorted(doc.cats.items(), key=lambda i: float(i[1]), reverse=True)

    print(f'Article {name} || model {model}')
    print(expected_cats)
    print(estimated_cats)
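The sorting step can be illustrated with a toy cats dictionary (hypothetical scores, not output of the trained models), sorted the same way as in the function above:

```python
# Toy cats dict as spaCy attaches it to a Doc (scores are made up)
cats = {'Spacecraft': 0.91, 'Astronautics': 0.43, 'Uncategorized': 0.02}

# Highest-scoring category first, as in estimate_cats
estimated = sorted(cats.items(), key=lambda i: float(i[1]), reverse=True)
print(estimated[0])
# ('Spacecraft', 0.91)
```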

The following table summarizes the results for the articles about artificial intelligence and spacecraft.


| Article | Model | Expected | Estimated |
| --- | --- | --- | --- |
| Artificial_intelligence | wikipedia_single_label_model | {'Artificial intelligence': 1, 'Cybernetics': 1, 'Computational neuroscience': 1, 'Computational fields of study': 1, 'Data science': 1, 'Emerging technologies': 1, 'Formal sciences': 1, 'Intelligence by type': 1, 'Unsolved problems in computer science': 1} | ('Artificial intelligence', 0.9999957084655762), ('Algorithms', 1.2372579476505052e-06), ('Data types', 7.351681574618851e-07), ('Machine learning', 3.5465578207549697e-07), ('Optimal decisions', 2.7041042471864785e-07), ('Computational statistics', 2.517348320907331e-07) |
| | wikipedia_multi_label_model | {'Artificial intelligence': 1, 'Cybernetics': 1, 'Computational neuroscience': 1, 'Computational fields of study': 1, 'Data science': 1, 'Emerging technologies': 1, 'Formal sciences': 1, 'Intelligence by type': 1, 'Unsolved problems in computer science': 1} | ('Computer programming folklore', 0.0018510001245886087), ('Dynamically typed programming languages', 0.0018117608269676566), ('Near Eastern countries', 0.0012830592459067702), ('Test items in computer languages', 0.0011254188138991594), ('High-level programming languages', 0.0009476889972575009), ('Earth observation satellites', 0.0008560022106394172) |
| | wikipedia_multi_label_model_single_label_input | {'Artificial intelligence': 1, 'Cybernetics': 1, 'Computational neuroscience': 1, 'Computational fields of study': 1, 'Data science': 1, 'Emerging technologies': 1, 'Formal sciences': 1, 'Intelligence by type': 1, 'Unsolved problems in computer science': 1} | ('Philosophers of psychology', 0.8293902277946472), ('Theoretical computer scientists', 0.8165731430053711), ('20th-century American psychologists', 0.8153121471405029), ('History of technology', 0.8115885257720947), ('Films that won the Best Visual Effects Academy Award', 0.8109292387962341), ('AI accelerators', 0.8050633072853088) |
| Spacecraft | wikipedia_single_label_model | {'Spacecraft': 1, 'Astronautics': 1, 'Pressure vessels': 1} | ('Automation', 0.19181914627552032), ('Free software programmed in Python', 0.0869596004486084), ('Uncategorized', 0.08303041756153107), ('Aircraft categories', 0.0706774964928627), ('Uncrewed vehicles', 0.06570713222026825) |
| | wikipedia_multi_label_model | {'Spacecraft': 1, 'Astronautics': 1, 'Pressure vessels': 1} | ('Dynamically typed programming languages', 0.002786380937322974), ('Backward compatibility', 0.0009597839671187103), ('Near Eastern countries', 0.0009351804037578404), ('Member states of the United Nations', 0.0008954961667768657), ('Computer programming folklore', 0.0006171380518935621) |
| | wikipedia_multi_label_model_single_label_input | {'Spacecraft': 1, 'Astronautics': 1, 'Pressure vessels': 1} | ('Motor control', 0.8138118386268616), ('19th-century English philosophers', 0.7775416970252991), ('Theoretical computer scientists', 0.766461193561554), ('Near Eastern countries', 0.7577261328697205), ('Computer science literature', 0.7563655972480774), ('Programming languages with an ISO standard', 0.7425990104675293) |

Summary

Spacy is an NLP library in which pre-built models accomplish NLP tasks, like tokenization and lemmatization, and NLP semantics, like part-of-speech tagging, out of the box. Internally, these tasks are represented and abstracted by a pipeline object, which can be extended with additional tasks. This article showed how to add text categorization to Spacy for processing Wikipedia articles. You learned how to extract Wikipedia article categories from the raw HTML representation of the articles, how to extend Spacy with a text categorization task, and how to prepare the training data and execute the training with Spacy. Overall, three different models were trained: single-label categorization, multi-label categorization, and multi-label categorization with single-label categories. The final section compared the categorization results for two articles.
