DEV Community

mormatalon
mormatalon

Posted on

Three Ways To Evaluate Your Partition via Community Detection

Community Detection is the task of partitioning networks into meaningful clusters.

The most intuitive type of network is the social one. Every one of us is a part of one or more communities, such as our home town, our work, our hobbies, and even a favorite sport's team. In a way, the combination of these communities defines who we are.

As we already know, the social platforms have a significant interest in classifying us in the right communities to suggest the perfect advertisement or, in my case, puppy videos.

The essential rule in Community Detection is that there is no one definition of a community, i.e., there is no one golden partition. For the mathematicians in the crowd it might sound disappointing - and for me, at first, this was very confusing, as I didn't know how to proceed. But then I realized two things:

  1. This lack of definition is actually the beauty of Community Detection. To find meaningful communities, you have to know your network, understand how it behaves, and how you can distinguish between communities.
  2. There is a rule of thumb: in most cases, a good community maximizes the number of within-group connections and minimizes the number of between-group connections. This rule expresses our desire to find nodes that are more likely to be in the same community rather than others.

With this rule in mind and familiarity with your network, you can start to examine algorithms that will help partition your network.

Let's take a deep dive into three concepts that will help you evaluate how good the communities you have found are. For each concept, we will learn the intuition behind it, along with the math. Note that in the blog post I assume a basic knowledge in graph theory, so if you feel like you need some recap I recommend this Wikipedia entry.

#1 - Internal Density

Internal Density is the most intuitive and straightforward concept. It measures the number of edges inside the community out of the total possible number of edges the community could have. The more, the merrier.

🧮 How Do We Calculate Internal Density?

f(C)=∣EC∣∣C∣(∣C∣−1)/2 \displaystyle f(C) = \frac{|E_C|}{|C|(|C|-1)/2}
  • C⊆VC ⊆ V is a set of nodes grouped in a community.

  • ∣C∣(∣C∣−1)|C|(|C|-1) represents the maximum number of edges between nodes in the community CC . Note that we divide this number by 2 because every edge exists twice within the community but we want to count each edge only once.

  • EC={(u,v):u∈C,v∈C}E_C = \lbrace{(u,v): u \isin C, v \isin C} \rbrace . This expression represents the set of edges that are entirely inside of community CC .

As promised, f(C)f(C) measures the fraction of the existing edges inside the community out of the total possible number of edges the community could have.

Note that you can't build an algorithm that maximizes internal density because you will only find cliques in your networks. Remember, a clique is a subset of the graph G=(V,E)G = (V, E) so that for each pair of different nodes in the clique, there is an edge that connects between them. Thus, by definition, cliques have a high internal density.

#2 - Conductance

Conductance is a concept borrowed from the world of electricity. To explain this concept, we will use guided imagery. Let's imagine that our community is a network that transmits electricity from one point to another. We want the conductivity to be high within the community, meaning that the points can share electricity between them. But between different communities, we want the conductivity to be low, meaning the communities will not transfer electricity between them. In Community Detection, we measure conductivity by the number of edges pointing outside the community, and we strive to minimize it.

🧮 How Do We Calculate Conductance?

f(C)=∣EB,C∣2∣EC∣+∣EB,C∣\displaystyle f(C) = \frac{|E_B,_C|}{2|E_C| + |E_B,_C|}

Let's explain it step by step.

  • EC={(u,v):u∈C,v∈C}E_C = \lbrace{(u,v): u \isin C, v \isin C} \rbrace , as before, it represents the set of edges that are entirely inside of community CC .

  • EB,C={(u,v):u∈C,v∉C}E_B,_C = \lbrace{(u,v): u \isin C, v \notin C} \rbrace , meaning it is the set of edges attached to one node in the community CC and one node outside of it.

Thus f(C)f(C) measures the proportion of the edges pointing outside the community, out of the total amount of edges in and outside the community.

Conductance aims to minimize f(C)f(C) because doing so will reduce the relative part of edges that point outside the community.

Note that Conductance doesn't care about maximizing the number of internal edges. Instead, it wants to make sure that the community is separated from the rest of the graph. I encourage you to think about an example that illustrates this claim.

#3 - Modularity

To understand the concept of modularity, we first need to distinguish between two definitions: original graph and random graph. The original graph is the graph that we want to partition into communities. The random graph is a graph that we are building out of the original graph. Both have the same nodes with the same degrees, but in the random graph, the edges are shuffled.

Modularity "compares" the original number of edges inside a community with the expected number of edges in the random graph. This comparison is interesting because we need a way to know that the connection between nodes is not random but meaningful.

🧮 How Do We Calculate Modularity?

M=12∣E∣∑u,v∈V[Auv−kvku2∣E∣]δ(cv,cu) \displaystyle M = \frac {1}{2|E|} \sum_{u,v \in V} {\bigg[A_{uv} - \frac {k_vk_u}{2|E|}\bigg]\delta(c_v,c_u)}

It looks scary, but not for long. Let's break it down into parts and then explain the entire formula. Stay with me, I guarantee you will understand everything in less than 5 minutes.

  • EE represents the set of edges in the graph.

  • VV represents the group of nodes in the graph.

  • uu and vv are a pair of nodes in VV .

  • AA is the adjacency matrix of the original graph.

  • AuvA_{uv} represents the value in the adjacency matrix in the place uvuv . For simplicity AuvA_{uv} will be 1 if there is an edge from uu to vv , and zero otherwise.

  • δ(cv,cu)\delta(c_v,c_u) is a function that returns one if uu and vv are in the same community (cu=cv)(c_u = c_v) in the original graph, and zero otherwise.

  • kvk_v represents the degree of node vv in the graph.

  • The fraction kvku2∣E∣\displaystyle \frac {k_vk_u}{2|E|} is the probability that uu and vv are connected in the random graph.

So now let's read the modularity function carefully. For every pair of nodes uu and vv that belongs to the same community, subtract from their observed relation the expected number of relations. Sum it up and normalize it so that the maximum is one.

The modularity score can be either positive or negative, where positive values indicate the possible presence of community structure. The modularity score can help us compare between partitions by comparing the scores of the partitions. A higher score implies a better partition.

Modularity is not only a way to evaluate your partition, there are Community Detection algorithms that use modularity to find the partition with the highest score, where the most famous example is Luvain.

Conclusion

We have covered three concepts that you can use to evaluate how good the communities you have found are, and explained the math behind them.

Less formally, I tried to build with you the intuition for each concept. In my opinion, a good way to remember a new subject is not by understanding every piece of the equation but by understanding the idea behind it and what it strives to achieve.

As you have probably guessed, the Community Detection world is broader than the length of this blog post, so I encourage you to dive into more concepts and algorithms to "discover" how to find the almost-best partition in your graph. For starters, you can read The Atlas for the Aspiring Network Scientist which covers everything about Community Detection in particular and Network Science in general. If you prefer some more hands-on experience, I recommend using CDLib, which stands for Community Discovery Library. It is a Python software package that allows to extract, compare and evaluate communities from complex networks.

Discussion (0)