Matheus Farias de Oliveira Matsumoto

Analyzing Wikipedia's Articles Classification Chain With Apache AGE

There is a phenomenon on Wikipedia known as "Getting to Philosophy". It consists of clicking the very first link in the main text of an English Wikipedia article and repeating this process for each subsequent article. Usually, the process ends when we reach the Philosophy article.

The most popular theory regarding this phenomenon is that Wikipedia pages have a propensity to move up a "classification chain". The Wikipedia Manual of Style describes how to write the lead section of an article and recommends that articles begin by defining their topic. As a result, the first sentence of an article usually provides a definition, directly answering the question, "What is [the subject]?".

Mathematician Hannah Fry demonstrates how most of Wikipedia's articles get to Philosophy in ‘Marmalade’, ‘socks’ and ‘One Direction’. In the video, she tells how a Wikipedia user, Mark J, started this in early 2008.

Connecting the Dots

In Hannah's video, she shows how we can get from the "Data" article to "Philosophy" and does the same with other articles, while a graph is drawn on the screen representing the many articles on Wikipedia and how they are connected to one another.

Wikipedia articles graph

We can replicate part of this graph with Apache AGE, which is a graph extension for PostgreSQL. If you do not have any experience with Apache AGE and have not installed it on your machine, it is recommended to follow one of these tutorials: macOS or Windows.

Let's start by creating the graph and adding some vertices and edges to represent the path from Data to Philosophy:

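-- Load the extension and put ag_catalog on the search path (once per session),
-- so that cypher() and agtype resolve without a schema prefix
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Creating the Graph
SELECT ag_catalog.create_graph('Wikipedia');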
NOTICE:  graph "Wikipedia" has been created
 create_graph 
--------------

(1 row)

-- (Data)-(...)->(Philosophy)
SELECT * FROM cypher('Wikipedia', $$
    CREATE (:Article {name: 'Data'})-[:RELATED_TO]->(:Article {name: 'Knowledge'})
    -[:RELATED_TO]->(:Article {name: 'Descriptive knowledge'})-[:RELATED_TO]->
    (:Article {name: 'Epistemology'})-[:RELATED_TO]->(:Article {name: 'Outline of philosophy'})
    -[:RELATED_TO]->(:Article {name: 'Philosophy'})
$$) as (a agtype);

 a 
---
(0 rows)

Now let's trace the path from Tulpa to Knowledge (we only need to go as far as Knowledge, because there is already a path from Knowledge to Philosophy):

SELECT * FROM cypher('Wikipedia', $$
MATCH (k)
WHERE k.name = 'Knowledge'
CREATE (:Article {name: 'Tulpa'})-[:RELATED_TO]->
(:Article {name: 'Theosophy'})-[:RELATED_TO]->
(:Article {name: 'Religion'})-[:RELATED_TO]->
(:Article {name: 'Social system'})-[:RELATED_TO]->
(:Article {name: 'Sociology'})-[:RELATED_TO]->
(:Article {name: 'Social science'})-[:RELATED_TO]->
(:Article {name: 'Branches of science'})-[:RELATED_TO]->
(:Article {name: 'Science'})-[:RELATED_TO]->
(:Article {name: 'Scientific method'})-[:RELATED_TO]->
(:Article {name: 'Empirical evidence'})-[:RELATED_TO]->
(:Article {name: 'Proposition'})-[:RELATED_TO]->
(:Article {name: 'Philosophy of language'})-[:RELATED_TO]->
(:Article {name: 'Analytic philosophy'})-[:RELATED_TO]->
(:Article {name: 'Academic discipline'})-[:RELATED_TO]->
(k)
$$) as (k agtype);
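To check that the Tulpa chain really connects to the existing one, a variable-length match can look for a path from Tulpa all the way to Philosophy. This is only a sketch: it assumes the two statements above ran exactly once against the 'Wikipedia' graph, and the output column name `hops` is an arbitrary choice.

```sql
-- Find a directed RELATED_TO path of any length from Tulpa to Philosophy
-- and report how many edges it contains.
SELECT * FROM cypher('Wikipedia', $$
    MATCH p = (:Article {name: 'Tulpa'})-[:RELATED_TO*]->(:Article {name: 'Philosophy'})
    RETURN length(p)
$$) AS (hops agtype);
```

If the query returns no rows, the MATCH on Knowledge in the previous statement probably found nothing, so the new chain was never attached.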

Now, with AGE Viewer, we can visualize the graph in a browser:

graph on AGE Viewer

Notice that the Knowledge vertex has two incoming edges, one from Data and another from Academic discipline, which is the last article in the chain that starts at Tulpa.
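The same observation can be checked without the viewer by asking for every article that points directly at Knowledge. A minimal sketch, assuming the graph contains only the vertices created above (the column name `source` is arbitrary):

```sql
-- List the articles with a direct RELATED_TO edge into Knowledge.
SELECT * FROM cypher('Wikipedia', $$
    MATCH (src:Article)-[:RELATED_TO]->(:Article {name: 'Knowledge'})
    RETURN src.name
$$) AS (source agtype);
```

For the graph built in this post, the result should list Data and Academic discipline.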

Next Steps

From here on, you could continue doing this with other articles and add them to your graph in the same way. A more efficient method, however, would be to use a web crawler to collect the links and store them in AGE.

There are other things you could do with AGE as well, such as creating a graph to analyze the six degrees of separation, which is the idea that all people are six or fewer social connections away from each other. You could even create a database for a game like Contexto, in which you need to guess a secret word and, for each guess, the game tells you how closely related it is to the secret word.
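If a crawler feeds the graph, the same article will show up many times, so plain CREATE statements would duplicate vertices. One possible approach is to use MERGE, which creates a vertex or edge only when no match exists. The sketch below assumes an AGE version that supports MERGE, and the article names are placeholders for whatever pair the crawler found, not a verified first-link pair.

```sql
-- Placeholder pair reported by a hypothetical crawler:
-- the first link of 'Marmalade' points to 'Fruit preserves'.

-- Make sure both article vertices exist without duplicating them.
SELECT * FROM cypher('Wikipedia', $$
    MERGE (a:Article {name: 'Marmalade'})
$$) AS (a agtype);

SELECT * FROM cypher('Wikipedia', $$
    MERGE (b:Article {name: 'Fruit preserves'})
$$) AS (b agtype);

-- Connect them only if the edge is not already there.
SELECT * FROM cypher('Wikipedia', $$
    MATCH (a:Article {name: 'Marmalade'}), (b:Article {name: 'Fruit preserves'})
    MERGE (a)-[:RELATED_TO]->(b)
$$) AS (e agtype);
```

Running these three statements twice leaves the graph unchanged the second time, which is the property a crawler needs when it revisits pages.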

More About Apache AGE

Overall, Apache AGE is a great way to add graph capabilities to PostgreSQL. It lets you run openCypher queries, which makes complex queries much easier to write. It also allows you to perform graph queries that are the basis for many web services such as fraud detection, master data management, product recommendations, identity and relationship management, experience personalization, knowledge management, and more.

If you want to check out this extension and use it in your database, check out the links below:
