Code

Source code for community detection can be found at https://github.com/vtraag/louvain-igraph, but it is easier to simply install the Python package directly from https://pypi.python.org/pypi/louvain/. Some of its features:

  • Constant Potts Model (CPM) resolution free community detection
  • Modularity optimization
  • Reichardt-Bornholdt model with Erdos-Renyi null-model
  • Dealing with negative links (i.e. with negative weights)
  • Tuneable resolution parameters
  • Multislice community detection

This implementation is designed to be used with Python and igraph, and it is the one I highly recommend: it is fast and flexible, implements a variety of different methods, and is easy to use.
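
For example, a minimal sketch of typical usage (based on the find_partition call that also appears in the comments below; exact method names and return values may differ between versions):

import igraph as ig
import louvain

# Minimal sketch: detect communities by modularity optimization
# (older louvain package API; signatures may differ between versions).
G = ig.Graph.Erdos_Renyi(n=100, p=0.1)
partition = louvain.find_partition(G, method='Modularity')
print(partition)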

An earlier implementation used pure Python and was therefore nowhere near as fast. Before that there was another C++ implementation, but it was quite a headache to use because of its awkward file formats. I mention these only in case they are still of interest; normally you should not use them.

Finally, a small but important note concerning the significance of a partition. The significance of a partition \(\sigma\) is defined as follows:

\[\mathcal{S}(\sigma) = \sum_c {n_c \choose 2} D(p_c \parallel p),\]

where \(n_c\) is the number of nodes in community \(c\), \(p_c\) is the
density of community \(c\) and \(p\) is the density of the graph as a whole.
\(D(p_c \parallel p)\) represent the binary Kullback-Leibler divergence:

\[D(q \parallel p) = q \log \frac{q}{p} + (1 - q) \log \frac{1 - q}{1 - p}.\]

In practice we used

\[\mathcal{S}(\sigma) = \sum_c n_c(n_c - 1) D(p_c \parallel p),\]

because the division by \(2\) is simply a scaling factor. The significance of a random graph is expected to behave as \(\mathcal{S}(\sigma) \sim n \log n\), so you should compare the significance of the observed partition to \(n \log n\) using this definition, or to \(\frac{1}{2} n \log n\) using the earlier definition.
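
For illustration, here is a plain (and unoptimized) Python sketch of this quantity for an undirected, unweighted graph with igraph, following the second definition above; it assumes \(0 < p < 1\), and in practice you would of course use the quality function of the package itself rather than this sketch.

import numpy as np
import igraph as ig

def binary_kl(q, p):
    # Binary Kullback-Leibler divergence D(q || p); terms with q = 0 or q = 1 contribute zero.
    d = 0.0
    if q > 0:
        d += q * np.log(q / p)
    if q < 1:
        d += (1 - q) * np.log((1 - q) / (1 - p))
    return d

def significance(G, membership):
    # S(sigma) = sum_c n_c (n_c - 1) D(p_c || p), as defined above.
    p = G.density()  # density of the graph as a whole
    S = 0.0
    for c in set(membership):
        nodes = [v for v, m in enumerate(membership) if m == c]
        n_c = len(nodes)
        if n_c < 2:
            continue  # a single node has no internal density
        m_c = G.subgraph(nodes).ecount()  # number of edges inside community c
        p_c = 2.0 * m_c / (n_c * (n_c - 1))
        S += n_c * (n_c - 1) * binary_kl(p_c, p)
    return S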


16 thoughts on “Code”

  1. Hi, Dr. Vincent Traag,

    I have been using your Python Louvain code recently. As in Mucha et al.’s 2010 Science paper, I want to use weighted coupling strengths ω between slices, but I cannot see how to set them in the code. Is there any way that I can add weights to the inter-slice links?

    Looking forward to your reply!

    Thank you so much!

    1. Dear Peng Fang,

      You can indeed use this implementation to apply Mucha et al.’s idea for community detection. However, it is a bit trickier, and you have to prepare some data yourself. This code works with layers, not with slices; the difference is that all layers need to share the same vertex set, i.e. all nodes should be contained in all graphs and only the edges may differ. To use time slices in this module, you should first join everything together so that all edges for all times are contained in one graph, including the interslice edges. Then create a layer for each time, as a subgraph that only includes the edges for that particular time. Finally, add a layer containing only the interslice edges.

      In a bit more detail, suppose you have graph G1 at time 1 and G2 at time 2. Then create a graph G which combines G1 and G2; G thus has n = n1 + n2 nodes and m = m1 + m2 edges. Hence, if node i appears both in G1 and G2, it appears twice in G. Also add the interslice edges to G (i.e. connect the two copies of node i whenever it appears in both G1 and G2). Then take the subgraph of the edges (but leave the nodes in place) for time 1, for time 2, and for the interslice edges, and use these as layers.

      Perhaps an explicit example makes it even clearer:

      import igraph as ig
      import louvain
      import numpy as np
      import pandas as pd
      #%%
      n = 100
      k = 10
      p = k / float(n)
      G1 = ig.Graph.Erdos_Renyi(n, p)
      G2 = ig.Graph.Erdos_Renyi(n, p)
      G1.vs['id'] = np.arange(n)
      G2.vs['id'] = np.arange(n)
      #%%
      E1 = pd.DataFrame(G1.get_edgelist(), columns=['source', 'target'])
      E2 = pd.DataFrame(G2.get_edgelist(), columns=['source', 'target'])
      
      # Set time for edges
      E1['t'] = 1
      E2['t'] = 2
      
      # Concatenate
      E = pd.concat([E1, E2])
      
      # Make sure that the id's are different for different slices
      E['source'] = ['{0}-{1}'.format(row['source'], row['t']) for idx, row in E.iterrows()]
      E['target'] = ['{0}-{1}'.format(row['target'], row['t']) for idx, row in E.iterrows()]
      
      # Add weight
      E['weight'] = 1.0
      
      #%%
      # Add interslice links
      w = 0.1
      for v in G1.vs:
        for u in G2.vs:
          if v['id'] == u['id']:
            source = '{0}-1'.format(v.index)
            target = '{0}-2'.format(u.index)
            E = E.append([{'source': source, 'target': target, 'weight': w, 't': 'interslice'}])
      #%%
      def GraphFromPandasEdgeList(edges_df,
          edge_cols=['source', 'target'],
          **kwargs):
      
        def renumber_edges(edges, ids):
          for e in edges:
            if e[0] not in ids:
              raise Exception('Couldn\'t find {0} in nodes dataframe'.format(e[0]))
            e0 = ids[e[0]]
            e1 = ids[e[1]]
            yield (e0, e1)
      
        # Renumber nodes
        node_id = ig.UniqueIdGenerator()
        nodes = edges_df[edge_cols].stack().unique()
        for node in nodes:
          node_id.add(node)
      
        # Create graph
        G = ig.Graph(**kwargs)
        G.add_vertices(len(nodes))
        G.vs['id'] = node_id.values()
      
        # Renumber edges so that they are consistent
        # with the node numbering.
        edges = renumber_edges(
                     edges_df[edge_cols].itertuples(index=False),
                     node_id)
        G.add_edges(edges)
      
        for k, v in edges_df.iteritems():
          if k not in edge_cols:
            G.es[k] = list(v)
      
        return G
      
      #%%
      G = GraphFromPandasEdgeList(E)
      
      H1 = G.subgraph_edges(G.es.select(t_eq=1), delete_vertices=False)
      H2 = G.subgraph_edges(G.es.select(t_eq=2), delete_vertices=False)
      H_interslice = G.subgraph_edges(G.es.select(t_eq='interslice'), delete_vertices=False)
      
      #%%
      membership, quality = louvain.find_partition_multiplex(
        [louvain.Layer(H1, 'Modularity'),
         louvain.Layer(H2, 'Modularity'),
         louvain.Layer(H_interslice, 'CPM', resolution_parameter=0)])
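
      The resulting membership vector is indexed by the vertices of the combined graph G, in which each original node appears once per slice. As a small follow-up sketch (not part of the original example), you could map the communities back to the nodes of G1 and G2 via the '{index}-{time}' ids constructed above; this works here because every node occurs in E thanks to the interslice links:

      # Map the combined membership back to the per-slice nodes (sketch).
      node_community = dict(zip(G.vs['id'], membership))
      comm_t1 = {v.index: node_community['{0}-1'.format(v.index)] for v in G1.vs}
      comm_t2 = {v.index: node_community['{0}-2'.format(v.index)] for v in G2.vs}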
      
      1. Dear Vincent,

        Hi,
        Thank you very much for your code! I tried your script on a set of weighted graphs. The “E” that I create looks correct and shows the weights of the links both within and between slices. But when making “G”, it seems that GraphFromPandasEdgeList only looks at the edge list, disregards the weights and binarizes the input (if weight > 0: weight = 1). How can I modify the script so that it uses the weights when finding the partition?

        Thanks,
        Mahshid

        1. Hi Mahshid,

          Sorry for the long delay in replying (and approving your comment), your message got lost among other messages.

          There was a bug in the passing of the weights from Python to C++: they were parsed as integers. This is now fixed, but it is not (yet) part of any release, so you would have to download the source and compile it yourself. Alternatively, you can simply take the most recent development version (which also includes other changes).

          I am currently working towards a release version, but it will take some time still before that is done. So if compiling is a bridge too far, you may simply want to wait for that release.

          Best,

          Vincent

  2. Dear Vincent, (dear Peng)
    That is exactly what I am looking for! I can get the example running nicely. However, I am now stuck adapting it to my own data. Basically, I have exported a multi-slice adjacency matrix (i.e. [50*50*20], 20 being the temporal samples) from Matlab, which is where I used to do things. I read that in as a numpy array, but then get stuck trying to set up the multiple layers (as opposed to the slices, which would be easier). Somehow I cannot figure out the right way through numpy -> pandas -> igraph -> subgraph to prepare the data for analysis. Any hints would be appreciated.

    Best, ralf.

    1. Dear Ralf,

      I have some code lying around to make this easier, assuming you have a sequence of graphs Gt = [G_0, G_1, ....]. I am currently working on a re-implementation of my Python code. The code will probably find its way into the public package, but I’m not completely done (yet) with the re-implementation.

      If you want, I can send you the code via mail. I’ll first have to tidy it up a bit and make sure it works with the current public implementation. But perhaps you don’t have a sequence of graphs yet, and you need some other help first? Let me know.
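
      In the meantime, a rough sketch of the kind of conversion involved (not the code I mentioned above, just an illustration assuming your numpy array A has shape (50, 50, 20), with one symmetric, weighted adjacency matrix per time step) could look like this:

      import numpy as np
      import igraph as ig

      # Sketch: build one weighted igraph graph per time step from an (n, n, T) array.
      # Assumes each A[:, :, t] is a symmetric weighted adjacency matrix.
      Gt = [ig.Graph.Weighted_Adjacency(A[:, :, t].tolist(), mode='undirected')
            for t in range(A.shape[2])]

      Each graph in Gt would then still need to be joined with the interslice links, as in the example above.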

      Best,

      Vincent

  3. Dear Dr. Vincent Traag,

    Thank you for your fantastic library! May I ask you two questions?

    I checked the quality function for RBConfiguration, which is below. Could you tell me how you arrived at this function (a reference)? It looks very different from what I can find, and I failed to understand it.

    mod += w - this->resolution_parameter*w_out*w_in/((this->graph->is_directed() ? 1.0 : 4.0)*this->graph->total_weight()); 
    q = (2.0 - this->graph->is_directed())*mod;
    

    When I call this function, I write

    for (k, v) in comm.items():
        louvain.quality(graph, v, method="RBConfiguration", resolution_parameter=k)
    

    But in this case, partitions with only one community always get the highest quality. How does this resolution_parameter argument work?

    Thank you in advance.

    Sincerely,
    Keisuke Daimon

    1. Sorry, but let me add one thing. This is how I calculate the variable comm:

      comm = louvain.bisect(graph, method="RBConfiguration", weight="weight", resolution_range=[0,1])
      
    2. Dear Keisuke,

      The formulation of modularity is a generalised form by Reichardt and Bornholdt [1] of the modularity introduced by Newman and Girvan [2]. They also introduced the resolution parameter (among other things). In a sense, the formulation also builds on some minor variations for directed networks [3] and weighted networks [4], which are straightforward extensions of the definition in [1]. The formulation as used in the code is applicable to directed and undirected networks, and to weighted and unweighted networks.

      A higher resolution parameter essentially puts a higher penalty on community size, so that you get smaller communities for higher resolution parameters and larger communities for lower ones. For a very low (near zero) resolution parameter you incur (near) zero cost and only obtain the benefit of putting all links within a single community, so the modularity equals the number of edges \(m\) in the graph (or the total weight). For a very high resolution parameter you get a singleton partition: you obtain zero benefit (assuming no loops in the network) and incur a cost of \(\frac{\gamma}{4m} \sum_i k_i^2\). Hence, the modularity decreases from \(m\) to \(-\frac{\gamma}{4m} \sum_i k_i^2\) as you move from a low to a high resolution parameter.

      In summary, you can’t rely on the modularity values to choose the ‘best’ partition, since it always decreases, and you need to rely on other ways to choose the right resolution. This is what we tried to do in [5]. But more profoundly, automatically selecting the ‘right’ partition is bound to create some other issues (similar to the resolution limit). In the end, choosing the ‘right’ resolution depends on your goals and should probably also be guided by more substantive concerns.
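
      As a concrete (hypothetical) sketch of such a scan, building on the bisect and quality calls you already use, you could evaluate Significance across the resolutions found by bisect and look for a plateau or maximum (the resolution range here is just an example, and return values may differ slightly between versions):

      # Scan resolution parameters with bisect and score each resulting partition
      # with Significance (sketch; assumes bisect returns {resolution: partition}).
      comm = louvain.bisect(graph, method='RBConfiguration', resolution_range=[0.01, 100])
      for gamma, partition in sorted(comm.items()):
          sig = louvain.quality(graph, partition, method='Significance')
          print(gamma, sig)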

      Hope this helps.

      Best,

      Vincent

      References

      1. Reichardt, J., & Bornholdt, S. (2006). Statistical mechanics of community detection. Physical Review E, 74(1), 016110+. doi:10.1103/PhysRevE.74.016110
      2. Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113+. doi:10.1103/PhysRevE.69.026113
      3. Leicht, E. A., & Newman, M. E. J. (2008). Community structure in directed networks. Physical Review Letters, 100(11), 1–4. doi:10.1103/PhysRevLett.100.118703
      4. Newman, M. E. J. (2004). Analysis of weighted networks. Physical Review E, 70(5), 056131. doi:10.1103/PhysRevE.70.056131
      5. Traag, V. A., Krings, G., & Van Dooren, P. (2013). Significant scales in community structure. Scientific Reports, 3, 2930. doi:10.1038/srep02930
      1. Dear Dr. Vincent Traag,

        Thank you for your answer. I had actually confused the Significance vertex partition with using Significance to evaluate structures. What methods are available for louvain.quality? I tried Modularity, RBConfiguration and Significance. Are there more? Also, do you recommend Significance?

        1. Dear Keisuke,

          There are two different possible uses of Significance (or any other quality function for that matter). You can optimize Significance directly, which is what you do if you call

          louvain.find_partition(G, method='Significance')
          

          You can always evaluate the quality of any given partition using

          louvain.quality(G, partition, method='Significance')
          

          The available methods are listed on the PyPI package page and on the GitHub repository. They can all be used to optimize directly or to simply evaluate the quality of a given partition.

          The combination of doing the bisectioning (as you do) and using Significance to guide your selection of interesting resolutions should work quite well, I think. Optimizing Significance directly also works quite well, but seems to give somewhat smaller communities, similar to Surprise (see also [1]).

          Best,

          Vincent

          References

          1. Traag, V. A., Aldecoa, R., & Delvenne, J.-C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 22816. doi:10.1103/PhysRevE.92.022816
          1. I am trying methods other than Significance because my data includes weight and direction. I will use it as a quality function.
            Thank you for the explanation.

          2. You’re quite correct to use methods other than Significance if you have a weighted network. However, whether you use Significance as a quality function (i.e. to determine the quality of a given partition) or for optimization (i.e. to try to find the partition with the best quality), it really is one and the same thing. So you can’t use Significance at all if you have a weighted network (you can use Surprise though). Note that Significance is well-defined for directed networks, by the way.
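
            For example, a minimal (hypothetical) sketch of optimizing Surprise on a weighted network, following the same call style as the bisect example above (the method name and the weight argument are assumptions here and may differ per version):

            # Sketch: optimize asymptotic Surprise on a weighted graph
            # (assumes a 'weight' edge attribute; API details may vary by version).
            partition = louvain.find_partition(G, method='Surprise', weight='weight')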

            Good luck!

  4. I’m sorry for the late reply.
    As far as I understand, Significance is useful for finding which resolution is best for making clusters. If Significance is not applicable to weighted networks, is there no method to choose the best resolution?
    If so, I am thinking of choosing the resolution such that I get 4 clusters. Is this a good idea?

    1. Yes, you could use Significance to guide your search for “good” resolutions. Surprise is a very similar measure, see [1], which you could use for the same purpose.

      If you want 4 clusters, then selecting a resolution such that you get 4 clusters seems like a good idea.

      1. Traag, V. A., Aldecoa, R., & Delvenne, J.-C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 22816. doi:10.1103/PhysRevE.92.022816
      1. Thank you for your help. I will read the paper about Surprise.
        I will just avoid Significance because my data is weighted, and try to decide how many clusters I should get and what method I should use.
