The paper I would like to look at today is a 60-page study of communities in complex graphs by Leskovec, Lang, Dasgupta, and Mahoney — an earlier version was presented at WWW 2008 in Beijing.
Jure Leskovec is a rising star in complex networks: he finished his PhD earlier this year (see a presentation he gave at Google on complex network dynamics), and has already published striking results thanks to internships at several major internet companies; I don’t know the other three, but the breadth and extent of the paper speak highly of their work.
For those unable to commit to reading through that forest of references and information:
- Complex graphs are large databases of relations (links) between entities or individuals (nodes); they can be flights between airports, links between websites, phone calls, etc.; these have been among the hottest topics in science for almost ten years because they share many fascinating properties;
- The current focus is on community clustering: finding lumps of nodes that have more links inside than outside;
- The authors consider a “network community profile (NCP)”: a function φ(k) that is the minimum (best) conductance — roughly #OutLinks / #InLinks, so lower is better — over all possible communities of size k;
- Up to a scale of roughly three figures (~100 nodes), such communities are generally “whiskers”: outlying lumps connected to the rest by only one thread;
- Beyond that, they find that communities become increasingly harder to isolate;
Doing so, they also suggest another way to simulate graphs with realistic properties, the « forest fire »: start from scratch, and introduce agents one by one:
- choose at random (a geometric draw, i.e. repeated Bernoulli trials) the number of relations the newcomer will make on arrival;
- choose at random a first relation, the “ambassador”; and then
- follow this ambassador’s contacts, establishing relationships with the relations of your new relations, recursively, until you run out of “contact budget”.
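To make the NCP’s score concrete, here is a minimal sketch of conductance in the simplified reading above (links leaving the community over the community’s total degree; the paper’s actual definition also normalises by the smaller side of the cut). The toy graph and names are mine, not the paper’s.

```python
# Conductance of a candidate community in an undirected graph stored as
# an adjacency dict. The NCP phi(k) is the minimum (best) such score
# over all communities of size k.
def conductance(adj, community):
    community = set(community)
    cut = 0     # links crossing the community boundary
    volume = 0  # total degree of the community's nodes
    for node in community:
        for neighbour in adj[node]:
            volume += 1
            if neighbour not in community:
                cut += 1
    return cut / volume if volume else 1.0

# Toy graph: a triangle {a, b, c} attached to d by a single link —
# a tiny "whisker" situation.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(conductance(adj, {"a", "b", "c"}))  # 1 cut link / volume 7 ≈ 0.143
```

The triangle scores well (one link out of seven degree-endpoints leaves it), which is exactly why small whisker-like lumps dominate the low end of the NCP.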
This is a new way to understand how complex graphs work, and both the ‘scan’ and the simulation are very promising analysis tools; neither is completely new, but the improvement is substantial, and decisively justifies forest fire over preferential attachment. Forest fire has parameters, and the simulation matches the observed NCP only for a narrow range: there is definitely more there to be explored.
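The forest-fire growth described above could be sketched like this; the parameter p, the geometric budget, and the exact burning rule are my simplifications, not the paper’s precise model.

```python
import random

def forest_fire(n, p=0.35, seed=None):
    """Simplified forest-fire growth: each arriving node links to a
    random 'ambassador', then recursively 'burns' through some of each
    burned node's neighbours, with a geometric contact budget."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        adj[new] = set()
        frontier = [rng.randrange(new)]  # the ambassador
        visited = set()
        while frontier:
            amb = frontier.pop()
            if amb in visited:
                continue
            visited.add(amb)
            adj[new].add(amb)
            adj[amb].add(new)
            # geometric draw: how many of this contact's relations to follow
            budget = 0
            while rng.random() < p:
                budget += 1
            neighbours = [v for v in adj[amb]
                          if v != new and v not in visited]
            rng.shuffle(neighbours)
            frontier.extend(neighbours[:budget])
    return adj

g = forest_fire(200, p=0.35, seed=42)
print(len(g), sum(len(v) for v in g.values()) // 2)  # nodes, links
```

Since every newcomer links at least to its ambassador, the graph stays connected; varying p moves the model between tree-like growth and dense, community-rich structure, which is what makes the NCP match so parameter-sensitive.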
Reading it, I was not sure that whiskers are not simply elements connected ‘by one mistake’ to the largest component, that ‘should’ have been removed along with the other unconnected components — but the authors address that: all clusters are independent clusters, connected by one, two or several more ‘mistakes’; those few links are an essential property to consider in complex graphs. If those links were not so few, the result would be a hyper-connected graph: something rarely observed unless links encode very tenuous, often meaningless, relations.
However, I would like to say that three things are missing from that work: actually large graphs, a parallel with hierarchical clustering, and a more in-depth study of conductance, modularity and declared groups — a first lead on these can be found in a rather unnoticed draft by Blondel et al. about both a fast hierarchical modularity algorithm and Belgium’s current linguistic woes [I should write a post on that one]. Leskovec has published on MSN (aka Live Messenger) data, so he had access to larger graphs than those in Tables 1, 2 & 3 — one fairly large graph labelled « Messenger » is used in Figure 7 (top middle plot), but I can’t find a mention of it in Table 1. Blondel et al., or Onnela et al., the authors of this great paper that also considers a ten-million-strong phone database [yet another paper I need to comment on in this blog], would all certainly be interested to see the NCPs of those graphs.
My main thought while reading this was about a possible connection with hierarchical clustering (once you have found nice, small communities, you pool their nodes and outer links together to make a new meta-graph and find clusters of clusters): I once asked J-L. Guillaume about the ratio of links that were “explained”, i.e. within a community, at each (of the five) steps; apparently (based on my memory of a conversation) the ratio is rather constant — about half of the remaining ties are accounted for each time — to be checked on several graphs and with a significant number of different processing orders. It would be interesting to investigate how this merging into the core looks from that perspective: whether the information ratio drops past a certain threshold, or whether hierarchical clustering separates whiskers and only merges them into the already-large core cluster at the last minute.
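The “explained links” ratio from that conversation is easy to make concrete; this is my own sketch with a hypothetical membership map at one level of the hierarchy, not Guillaume’s actual procedure.

```python
# Fraction of links "explained" (internal to some community) at one
# level of a hierarchical clustering, given a node -> community map.
# Tracking this ratio level by level would test the "about half each
# time" recollection.
def explained_ratio(edges, membership):
    internal = sum(1 for u, v in edges if membership[u] == membership[v])
    return internal / len(edges)

# Toy level: communities {a, b} and {c, d, e}; one link crosses.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "c")]
membership = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
print(explained_ratio(edges, membership))  # 4 of 5 links internal = 0.8
```

Running this at each merge level of a real hierarchy, and over several processing orders, would show whether the ratio really stays near one half.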
Of course, all this would need work on the distinction between conductance and modularity; actually, there is a key discussion to be had with sociologists — and user-interface experts — about what makes sense, depending on the goal we have: the distinction between “common bond vs. common identity communities” is of course essential, but it doesn’t go far enough, and more interdisciplinary effort is needed. I was surprised to see that official groups appear not just slightly less relevant at larger sizes than algorithmically found ones, but often an order of magnitude worse in conductance! Spontaneous networks have always proved efficient and resilient: the Darwinian network analyst in me was shocked. Several explanations are possible: conductance could be far from the best measure of group relevance; or, on the other hand, groups compensate for strong bond-based social ties by offering support for weak, identity-based ties. Based on my experience and the papers on the subject, ties are strongly correlated with shared group membership, so I would rather first ask: ‘what makes a good measure of group relevance?’
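To ground that comparison, here is Newman’s standard modularity for a partition, in the same minimal style as above (the toy graph and names are mine); unlike conductance, higher is better, and it scores a whole partition rather than one group.

```python
# Newman modularity of a partition: sum over communities of
# (internal links / m) - (community degree / 2m)^2, where m is the
# total number of links.
def modularity(edges, membership):
    m = len(edges)
    deg, internal = {}, {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
        if membership[u] == membership[v]:
            c = membership[u]
            internal[c] = internal.get(c, 0) + 1
    deg_sum = {}
    for node, d in deg.items():
        c = membership[node]
        deg_sum[c] = deg_sum.get(c, 0) + d
    return sum(internal.get(c, 0) / m - (s / (2 * m)) ** 2
               for c, s in deg_sum.items())

# Two triangles joined by a single link: a clean two-community split.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(modularity(edges, membership))  # 5/14 ≈ 0.357
```

A declared group could score well on one measure and badly on the other, which is exactly why the choice of relevance measure matters before judging official groups.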
In social networks, should we try to make clusters of “everybody I know and who they know” (conductance), or coherent groups of “how I know them”, accepting significant overlapping of different life contexts, or foci as suggested by Feld? What would be a measure of the relevance of such groups, besides user adoption? A high proportion of explained links would be essential, and relevance would be measured by how many links are effective in each context: #InLink/tr(n). Does it make sense to partition a graph with k-cores, LS sets or k-clusters (a minimum number of them, and with k as large as possible)? How do we scale that to larger communities: how do we make hierarchies overlap? Do we consider that two contexts with only one common individual share something, or that those are two separate facets of his life and he is trying to keep them that way?
Finally, a great quote for management science: considering two graphs based on Enron’s e-mail database, one including and the other excluding addresses from outside the company, the relevance of the official organisation appears to be context-dependent:
[The] organizational structure, (e.g., departments) manifest themselves in the internal communication network, but as soon as we put the organization into the broader context (i.e., how it communicates to the rest of the world) then the internal department structure seems to disappear.
There is most likely an essential and hard-to-grasp property in that: do we have company ambassadors who represent their team, their set forming a different group, interconnected through their common correspondents? Or should we interpret the links between those grouped-through-outsiders as weak clusters, notoriously hard to find, revealing themselves? Maybe think of it as a shift in the personal-vs-professional balance: are there findings that (Enron-like) co-workers tend to be friends with colleagues further down the hall rather than with direct team-mates? Anyway, this pleads in favour of overlapping clustering — a harder, but needed, effort.
What do you think?