Easy, flexible framework for community detection

October 25, 2013Postsbisection, community detection, igraph, python, resolution, significanceVincent Traag

\(\)
Let me start off by saying that working with igraph through its python bindings has been a great relieve! Although its R bindings seem to be much popular, I much prefer python over R. It’s not that R is bad—on the contrary, it’s very good—but it’s simply not very much a programmer’s tool, it’s more a statistician’s type of tool. I guess I’m not the latter.

Although igraph already implements a wide range of different methods for community detection, I thought it lacked a more general interface. That’s why I decided to create a new framework for community detection, which should provide quite some flexibility. The general algorithm behind it, is the well-known Louvain algorithm. This algorithm is known to perform quite well, both in terms of computational time and in terms of quality of the found partition.

The general idea of the Louvain algorithm is very simple. We start out with a partition in which each node is in its own community. We then move nodes to other communities as long as this increases our quality function. If we can no longer improve the partition by moving around nodes, we aggregate the graph, and then repeat the same procedure on this larger graph.

So, given this most general setup of the framework, there is only one function that needs to be changed if we want to optimise some other quality function: how much do we improve the quality if we move a node.

This is exactly the setup I chose here. I implemented a base class VertexPartition that takes care of most of the basic administration of a partition (e.g. how many links there are inside a community, how big the community is, etc…). If we want to move a node, we call the function move_node(v, c) to move node v to another community c and all the administration is updated automatically. Internally, in the optimisation procedure, the Louvain algorithm calls diff_move(v, c) to know what is the improvement of the quality function if were to actually call move_node(v,c). The Louvain algorithm simply choses the community which maximises this improvement. In summary, the following example illustrates exactly how these different functions should work:

diff = partition.diff_move(v,c);
q1 = partition.quality();
partition.move_node(v,c);
q2 = partition.quality();

Now diff == q2 - q1. Always. Or at least that should be the case.

Simply taking this base class, deriving from it, and implementing a different diff_move function and a different quality function, allows to quickly implement your favorite method, and optimise it using the Louvain algorithm.

I’ve now implemented four different classes that derived from VertexPartition:

ModularityVertexPartition, which implements modularity.
CPMVertexPartition, which implements the CPM method.
SignificanceVertexPartition, which optimises significance.
RBConfigurationVertexPartition, which is the Reichardt and Bornholdt model with a configuration model, which basically comes down to modularity but with a multiplicative resolution parameter.

So how would you use this in python? Well first, we import the package of course:

import igraph as ig
import louvain as louvain

What then remains is very simple:

G = ig.Graph.Tree(n=100, children=3);
opt = louvain.Optimiser();
partition = opt.find_partition(graph=G,
    partition_class=louvain.SignificanceVertexPartition);

Now partition contains the partition which optimises significance. Simply providing a different type of class to the optimiser will use that class to construct the partitions, and hence will use that class’ diff_move function to move around nodes.

But wait, there’s more. Some methods, such as CPM, implement resolution parameters. Given some conditions, these can be effectively scanned using bisectioning. This is also implemented in the optimiser. Again, usage is very simple, and should be nothing more than this:

res_results = opt.bisect(graph=G, 
    partition_class=louvain.CPMVertexPartition,
    resolution_range=(0,1));

Now res_results contains all the relevant resolution values (i.e. where the optimal partition changes) including all the partitions.

So, here you have it, everything I mentioned in my most recent paper implemented in python. In fact, this idea started simply to have a nice framework which, once done, I could easily translate to C++. Of course, implementing it in C++ should make it run much faster. On the other hand, the current implementation allows for very fast, flexible and easy extension of the framework with new methods. Also, this python version should work quite well up to 1000 nodes or so. For everything bigger, wait a bit until the C++ version is out.

Almost forgot, you can download it from launchpad: https://launchpad.net/pylouvain-igraph. Oh, and it also works on directed graphs and weighted graphs (if the method supports it).

References

18 thoughts on “Easy, flexible framework for community detection”

Tamás Nepusz says:

October 28, 2013 at 11:03

Great job! Yes, it would have been great if igraph allowed arbitrary quality functions in its implementation of the Louvain method (and also the fast greedy community detection algorithm), but unfortunately it’s more complicated to implement that in the C layer.

I have looked into the source code on Launchpad and I was wondering whether you considered using the contract_vertices() method of igraph’s Graph objects instead of implementing your own in the collapse_graph() method?

Also, you could probably also make use of igraph’s compare_communities() function in your implementation in place of the community comparison methods of your class.

Reply
1. Vincent Traag says:
  
  October 28, 2013 at 11:23
  
  Yes, I already feared it would be slightly more difficult to implement in the C layer. That’s why I hoped that starting out with a python prototype would allow me do to it a bit faster… 🙂
  
  I did consider using the collapse_graph method, but for some graphs it crashed. Furthermore, the collapse_graph method does it in-place, so that if you wouldn’t like to change the original graph you’d have to copy the original graph. This consumes more memory than needed (which, admittedly, is only relevant for large graphs, which are impossible to analyse currently anyway with this code, but I’m just thinking ahead…) But indeed, that should be one possible improvement, especially if you could get a new graph returned by collapse_graph! I really like the flexible way it is implemented! By the way, if you want details for the bug in the collapse_graph method, I would have to check out what makes it crash. I know it has something to do with the vertex attributes that are present in the graph, because if I remove those attributes, the collapse_graph doesn’t crahs.
  
  In fact, ideally I would simply derive from the VertexClustering class, upon which I could then implement the additional framework. Unfortunately, the membership vector in that class seems to be unmutable, which makes moving around nodes in communities impossible. I agree that it shouldn’t be possible to arbitrarily assign a new community to a node, but rather that it should happen through a method that also updates the rest of the administration (as I do now in the move_node method).
  
  What do you think? Would it be possible to udpate the collapse_graph function so that it returns a new graph? And would it be possible to update the membership vector (even if only through some method)?
  
  Reply
Tamás Nepusz says:

October 28, 2013 at 12:04

collapse_vertices should definitely not crash so please send me a small example on which I could reproduce the problem. Once I managed to fix this, we can think about making collapse_vertices return a new graph instead of copying, although I would probably avoid premature optimization and try simply copying the graph first using the copy() method.

Re the mutability of VertexClustering: yes, it is true that the class was not meant to be mutable (especially because some properties of that class, such as the modularity score, are cached internally, and mutating the object should therefore invalidate all the cached values). However, Python does not stop you from messing around with the internals of the VertexClustering class if you know what you are doing, so you can simply modify the _membership property (which is just an ordinary list).

A nicer solution would probably be to derive your own MutableVertexClustering class from VertexClustering, add your own mutator methods such as move_node(), and make the method invalidate the cached _modularity property.

Reply
1. Vincent Traag says:
  
  October 28, 2013 at 12:14
  
  Yeah, I’ll have a closer look at what made it crash when I have some time. True that for now it’s not necesarry to re-implement the collapse_vertices method for optimisation purposes. If I re-use the VertexClustering by the way, I could of course just use the cluster_graph of that class.
  
  Good idea to derive a MutableVertexClustering class first! I didn’t look further into the code to see if I could mutate the membership vector some other way, good to know I can use _membership. I’ll update the code once I get a chance!
  
  Reply
2. Vincent Traag says:
  
  November 1, 2013 at 11:24
  
  I just pushed the new code to launchpad. The MutableVertexPartition now derives form the VertexClustering, so all previous code should be compatible with it. I haven’t looked yet at the error for cluster_graph.
  
  Reply
Marco says:

March 30, 2014 at 19:13

Hello,

I am studying your Python code. Thank you for sharing it!
I was looking for the Reichardt-Bornholdt implementation and I have noticed two things:

1) In the RBConfigurationVertexPartition class, in the diff_move function, you write:

if (not self._graph.is_directed()): diff /= 2.0;

In the modularity, you commented these lines. Is it right, or should it be commented, too?

2) In the page http://www.traag.net/code/ you write that the Reichardt-Bornholdt measure is implemented with an Erdos-Renyi null-model.
In the RBConfigurationVertexPartition class, it seems that in the diff_move function you use a configuration null-model, while in the RBConfiguration function, instead, it seems that you use a different null-model. Could you explain me if I am wrong, please?

Thank you.
Have a nice evening!

Reply
1. Vincent Traag says:
  
  March 30, 2014 at 19:32
  Hi Marco,
  
  Hopefully you can use the code!
  
  To answer your questions:
  1. I am not sure, it might be that I made a mistake there, which I corrected in the one, but not in the other file, I’ll have to check it out. I’ll get back to you on this.
  2. Those comments on the http://www.traag.net/code/ page refer to the C++ code available from https://launchpad.net/louvain. I should make that more clear. I am not sure what you mean exactly with the RBConfiguration function, as compared to the diff_move function. Could you perhaps elaborate?
  Reply
  1. Marco says:
    
    April 1, 2014 at 22:44
    
    Hi,
    
    Sorry for my late answer, but I have been busy 🙂
    
    1. Ok, thank you. Maybe I am wrong, but I think that function should be very similar to modulary, except for the resolution parameter.
    
    2. Yes, maybe I could have explained better. In your RBConfiguration function, you do:
    
    mod = mod + w - \ self.resolution*\ w_out*w_in/(float(4 - 3*self._graph.is_directed())*self._total_weight);
    
    If self._graph.is_directed() is 0, then it sums:
    mod = mod + w – self.resolution * w_out * w_in / 4 * self._total_weight
    
    Why do you divide by 4?
    
    I tried to write a sort of the derivation of the Reichardt-Bornholdt measure by communities and it should be something like the last equation of this PDF:
    https://www.dropbox.com/s/mrabdwsctgrzgwu/ReichardtBornholdtProof.pdf
    In the equation that I have derived, I don’t divide by 4. Am I wrong?
    
    Thank you, goodbye!
    
    Reply
    1. Marco says:
      
      April 1, 2014 at 23:27
      
      Sorry, I was wrong with my derivation. I think the RB Measure is right 🙂
      
      Reply
    2. Vincent Traag says:
      
      April 2, 2014 at 08:30
      
      Hi Marco,
      
      I believe you’re right, I made some changes in the Modularity class at some point, which I didn’t do in the RBConfiguration class. I believe you are in fact right about the formula in the RBConfiguration class, it should now be consistent with the original modularity formula. The thing is that some things depend on whether the graph is directed or not, and for some properties you do need to take this into account, whereas for others you do not. Anyway, thanks for spotting those errors!
      
      I’ve pushed the latest code to launchpad, you should be able to find it there.
      
      Cheers,
      
      Vincent
      
      Reply
Marco says:

April 5, 2014 at 21:47

Hi Vincent,

Thank you for your quick help!
Your project has been very helpful.

Best Regards,
Have a nice weekend!

Reply
praveen kumar says:

June 22, 2014 at 05:47

hi.i was intrested in detecting communities in twitter…please any one help me how to start work

Reply
1. Vincent Traag says:
  
  June 24, 2014 at 08:13
  
  You’re probably better off checking out a website such as https://dev.twitter.com/docs/twitter-libraries for getting some twitter data. Once you have that, resort to libraries such as igraph for help with creating networks. You can then either use the method from igraph itself or use the framework described here.
  
  Reply
Vincent Traag says:

March 2, 2015 at 09:22

Should you be interested, I just created a C++ version of this implementation, which is available from https://github.com/vtraag/louvain-igraph. The python package (including Windows binaries) can be downloaded from https://pypi.python.org/pypi/louvain/.

Reply
Reza says:

June 4, 2015 at 01:24

Hi Vincent,

There is also a quite famous Louvain implementation for Networkx.
http://perso.crans.org/aynaud/communities/

However, that implementation does not accept directed graphs. I was wondering if you had to redefine any metrics for directed graphs for your algorithm to work.

Reply
1. Vincent Traag says:
  
  June 4, 2015 at 20:39
  
  Hi Reza,
  
  Yes, I know that Thomas Aynaud also has an implementation, but I never used it. Although it is interesting to have a pure python package, it also makes networkx in general not very suitable for larger graphs. Similarly with the package referred to in this post, it is nice as idea, but the actual implementation is generally too slow to work with larger graphs.
  
  Therefore, please notice that it is superseded by a C++ version, which is much, much faster. It can deal with directed, undirected, weighted and unweighted graphs. Additionally, it can deal with multiple layers (some of which may contain negative links for example), and implements currently four different methods (Modularity variants, Significance and Surprise). So you should be able to use this package for your purpose.
  
  Best,
  
  Vincent
  
  Reply
Pablo says:

September 28, 2015 at 14:19

Hi Vincent,

Thank you for sharing your code. I was wondering what are the methods that allow community detection in weighted directed graphs. Thank you.

Reply
1. Vincent Traag says:
  
  September 28, 2015 at 14:27
  
  Hi Pablo,
  
  All of them support directed graphs, only Significance doesn’t support weighted graphs. Please note that you should use the latest python package built in C++ not the pure Python package (unless for some simple conceptual testing).
  
  Best,
  
  Vincent
  
  Reply

V.A. Traag

Computational Social Sciences

Easy, flexible framework for community detection

References

18 thoughts on “Easy, flexible framework for community detection”

Leave a Reply to Vincent Traag Cancel reply