Blog

The place for sharing thoughts on Complexity Science.

This blog is non-commercial. Everyone can read and re-use the content for free under a Creative Commons licence.

Disclaimer: This blog may contain opinions and views, which do not necessarily reflect the opinions or position of the Complexity & Networks Group of Imperial College London.

Networks, Geometry and Clustering

Posted by on Feb 13, 2016 in Network | 0 comments

Clustering is a vital tool when handling data, making it a central part of data science. By grouping similar objects together, it helps us find what we are looking for. I don’t go to a bakery to find a book. Clustering is part of a wider idea in science: we are always faced with thousands of potential or actual measurements but we need to focus on the few which are relevant to the process we are trying to understand. I do not need to know the nuclear properties of the constituents of a gas to understand its properties, while measuring temperature, pressure and volume does throw a lot of light on that problem. In whatever branch of science we are working, we are always trying to reduce the dimensionality of our data, to use the language of statistics and data analysis.

Many of the techniques we use will need a measure of distance, and it is most natural to call upon the everyday distance as defined by any ruler – formally the Euclidean distance d, where for example d² = x² + y² + z² for the distance between the origin and a point at (x,y,z) in three dimensions. However, what if time is present? Time is very different from space. Mathematically it leads to new types of geometry for space-times, Lorentzian rather than Euclidean. The simplest example is the Minkowski space-time used for studying special relativity.

James Clough and I have been using Minkowski space as part of our study of networks which have a sense of time built into them – directed acyclic graphs (see my blog on Time Constrained Networks for instance). Essentially these networks have a time associated with each vertex, and any edges present always point in one direction in time, say from the more recent vertex to the older one. Typically the time is a real physical time, but for these types of network one can always construct an effective if artificial time coordinate. There are many types of data with a directed acyclic graph structure.
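To make the distinction concrete, here is a minimal sketch of the two distance notions: the everyday Euclidean distance and the Lorentzian interval of Minkowski space-time (written in units where the speed of light c = 1; the function names are just illustrative).

```python
import math

def euclidean_distance(p):
    """Ordinary ruler distance from the origin to (x, y, z): d^2 = x^2 + y^2 + z^2."""
    return math.sqrt(sum(c * c for c in p))

def minkowski_interval(t, p):
    """Lorentzian interval of Minkowski space-time (units with c = 1):
    s^2 = t^2 - x^2 - y^2 - z^2.  Positive s^2 means the two events are
    time-like separated (one can causally influence the other); negative
    means they are space-like separated."""
    return t * t - sum(c * c for c in p)

# A point one unit along each spatial axis: d = sqrt(3) ≈ 1.732
print(euclidean_distance((1.0, 1.0, 1.0)))

# Two events two time units apart but only one spatial unit apart
# are time-like separated: s^2 = 4 - 1 = 3 > 0, so prints 3.0
print(minkowski_interval(2.0, (1.0, 0.0, 0.0)))
```

The sign of the interval, not just its size, is what carries the causal information that our directed acyclic graphs encode.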
Citation networks are excellent examples and we will use them to illustrate our ideas in the rest of this article. Each node in a citation network is a document. The edges represent the entries in the bibliography of one document, which always reference older documents – our arrow of time. We have worked with several different types of citation network: academic paper networks based on sections of the arXiv paper repository, US Supreme Court judgements, and patents. My blog on citation network modelling gives some more background and explains how I think about citation networks in general.

Combining these two concepts, James Clough and I have adapted a well-known clustering method, MDS (multidimensional scaling), so that it works for directed acyclic graphs (Clough and Evans 2016b). Traditional MDS is usually applied to data sets where you have a matrix of distances between each pair of objects. For a network, this would usually be the length of the shortest path between each pair of nodes. MDS then assumes that these objects/nodes are embedded in a Euclidean space and suggests the best set of coordinates for the objects in that space. Clustering can then be performed by looking at which points are close together in this space. We found a way...
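The traditional Euclidean pipeline can be sketched in a few lines (a toy example using networkx and scikit-learn, not the directed-acyclic-graph variant described above; the six-node graph is invented for illustration):

```python
import networkx as nx
import numpy as np
from sklearn.manifold import MDS

# A toy undirected network with two tightly knit groups joined by one bridge.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # group A
                  (3, 4), (4, 5), (3, 5),   # group B
                  (2, 3)])                  # single bridge between the groups

# Matrix of shortest-path lengths between every pair of nodes.
nodes = sorted(G.nodes())
lengths = dict(nx.all_pairs_shortest_path_length(G))
D = np.array([[lengths[i][j] for j in nodes] for i in nodes], dtype=float)

# Classical MDS: embed the nodes in a 2-d Euclidean space whose pairwise
# distances approximate the graph distances; clustering then amounts to
# looking for points that end up close together in that space.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)  # (6, 2) – one 2-d coordinate per node
```

The adaptation for directed acyclic graphs replaces the Euclidean embedding space in the last step with a Lorentzian one.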

read more

Exploring Big Historical Data

Posted by on Feb 10, 2016 in Network | 0 comments

I’ve really enjoyed reading my copy of Exploring Big Historical Data: The Historian’s Macroscope (Macroscope for short here) by Shawn Graham, Ian Milligan and Scott Weingart. As the authors suggest, the book will be ideal for students or researchers from the humanities asking if they can use big data ideas in their work. While history is the underlying context here, most of the questions and tools are relevant whenever you have text-based data, large or small. For physical scientists, many of whom are not used to text data, Macroscope prompts you to ask all the right questions. So this is a book which can really cross the disciplines. Even if some readers are like me and find some aspects of the book very familiar, they will still find some new stimulating ideas. Failing that, they will be able to draw on the simplicity of the explanations in Macroscope for their own work. I know enough about text and network analysis to see that the details of the methods were skipped over, but enough of a broad overview was given for someone to start using the tools. PageRank and tf-idf (term frequency–inverse document frequency) are examples where that practical approach was followed. The humanities have a lot of experience of working with texts, and a physical scientist like myself can learn a lot from that experience. I have heard these lessons piecemeal in talks and articles over the last ten years or so, but I enjoyed having them reinforced in a coherent way in one place. I worry a bit that the details in Macroscope of how to use one tool or another will rapidly date, but on the other hand it means a novice has a real chance to try these ideas out from this book alone. It is also where the online resources will come into their own. So I am already planning to recommend this text to my final-year physics students tackling projects involving text. My students can handle the technical aspects without the book, but even there they will find this book gives them a quick way in.
Staff in the Physics Department of Imperial College London clustered on the basis of the abstracts of their recent papers.

I can see that this book works, as I picked up some of the simpler suggestions and used them on a pet project, which is to look at the way that the staff in my department are related through their research interests. I want to see if any bottom-up structure of staff research I can produce from texts written by staff matches up to existing and proposed top-down structures of faculties – departments – research groups. I started by using Python to access the Scopus API. I’m not sure you can call Elsevier’s pages on this API documentation, and even Stack Overflow struggled to help me, but the blog Getting data from the Scopus API helped a lot. A hand-collected list of Scopus author ids enabled me to collect all the abstracts from recent papers coauthored by each staff member. I used Python libraries to cluster and display the data, following a couple of useful blogs on this process, and got some very acceptable results. However I then realised that I could use the...
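For anyone wanting to try something similar, the clustering step can be sketched with scikit-learn as follows. The four stand-in abstracts are invented for illustration only (the real project pulled abstracts from the Scopus API):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented stand-in abstracts; in practice these came from Scopus.
abstracts = [
    "quantum entanglement photon interference experiment",
    "photon qubit quantum gate superconducting circuit",
    "galaxy cluster dark matter survey redshift",
    "dark energy supernova cosmology redshift survey",
]

# tf-idf weights each term by how often it appears in a document,
# discounted by how many documents contain it at all, so common filler
# words count for little and distinctive research vocabulary dominates.
vec = TfidfVectorizer()
X = vec.fit_transform(abstracts)

# Group the documents by their tf-idf vectors; here two clusters should
# separate the quantum abstracts from the cosmology ones.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The same vectors can be fed into a dimensionality-reduction step for plotting, which is how the staff map above was produced.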

read more

Modelling the Footprints of Innovation: Citation Networks

Posted by on Oct 10, 2015 in Network | 0 comments

When one document refers to another in its text, this is called a citation. The pattern of these citations is most naturally represented as a network where the nodes are the documents and the links are the citations between the documents. When documents were physical documents, printed or written on paper, these citations must (almost always) point back in time to older documents. This arrow of time is imprinted on these citation networks and it leads to interesting mathematical properties.

One of the most interesting features of citations is that they have been carefully curated, sometimes for hundreds of years. The data I use on the US Supreme Court judgments goes back to the founding of the USA. So citation data is one of the oldest and most continuous ‘big data’ sets to study. The reason why records of citations have been maintained so carefully is that they record the process of innovation, be it in patents, law or academic study. When you try to claim a patent you must by law indicate the prior art, earlier patents with relevant (but presumably less advanced) ideas. When a judge makes a judgement in a case in the USA, they draw on earlier cases which have interpreted, and so created, the law needed to come to a conclusion in the current case. And of course, academics can’t discuss the whole of existing science when explaining their own new ideas, so they have to refer back to papers where previous researchers have set out a key idea needed in the current work.

Citations are therefore a vital part of knowledge transfer: new ideas build on all the work done in earlier judgments, previous patents or older papers. That is why citations have been so carefully recorded. The network formed by documents and their citations shows whose giant shoulders we are standing on when innovations are made, to paraphrase Newton. From a theoretical point of view there are many interesting features in these networks.
If you follow citations from one document to another and so on, at each step you will always reach an older paper, so you can never come back to your starting point along such a path. There are no cycles in a citation network (it is an example of a directed acyclic graph). If you look at the number of citations each document gets – how many newer documents refer back to it – they follow a fat-tailed distribution: a few documents have most of the citations, while most documents have very few citations each. See Derek de Solla Price’s 1965 paper for an early discussion of this feature. Moreover, if you look at documents from the same year, you get roughly the same shape for the number of documents with a given number of citations (see Radicchi et al 2008, Evans et al 2012), at least for well-cited documents. Since these networks are of such great interest, many other features have been noted too.

One way for a theorist to understand what is happening is to build a model from a few simple rules which captures as many of the features as possible. One of the first was that of Derek de Solla Price (1965), whose theory of “cumulative advantage” suggested that as new documents...
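A minimal sketch of a cumulative-advantage model (an illustration of the idea, not Price’s exact formulation; the parameter values are arbitrary choices) shows how a fat-tailed citation distribution emerges from one simple rule:

```python
import random
from collections import Counter

def cumulative_advantage(n_docs, refs_per_doc=3, seed=42):
    """Each new document cites earlier documents with probability
    proportional to (citations already received + 1): the rich get richer,
    while the "+1" lets uncited documents still be picked occasionally."""
    rng = random.Random(seed)
    citations = [0] * n_docs              # citations received by each document
    for new in range(refs_per_doc, n_docs):
        weights = [citations[old] + 1 for old in range(new)]
        for old in rng.choices(range(new), weights=weights, k=refs_per_doc):
            citations[old] += 1
    return citations

cites = cumulative_advantage(5000)
dist = Counter(cites)
# Fat tail: most documents end up with few citations, a few with very many.
print(max(cites), sum(1 for c in cites if c <= 1))
```

Despite having no notion of document quality at all, the rule reproduces the broad, highly unequal citation counts seen in real data.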

read more

Elucidating a Complex Rhythm with a Simple Model

Posted by on Jan 16, 2015 in Complexity Science | 0 comments

Burning heart tissue can cure the heart of beating abnormally; however, the right bit of tissue needs to be destroyed. Atrial fibrillation (AF) is the most common abnormal heart rhythm, affecting 1% of the population, and can cause stroke. Despite being so common, AF is much more difficult for clinicians to treat by burning, because it is not clear which part of the heart is responsible for the disease. My supervisors, Kim Christensen and Nicholas Peters, and I have recently developed and studied a mathematical model (published in Physical Review Letters) which provides insight into where clinicians should burn and why the disease is difficult to treat.

Normally, the heart beats in synchrony: the upper chambers pump blood into the lower chambers first and then the lower chambers pump the blood out of the heart. Pacemaker cells (found in the upper chambers) rhythmically send out electrical signals to neighbouring muscle cells, telling them to contract. This electrical wave travels across heart muscle cells from the upper chambers to the lower chambers like a smooth Mexican wave (see this video). An abnormal heart rhythm occurs when these waves travel abnormally, which removes the synchrony of the pumping action. As we age we all develop fibrosis, connective tissue in the heart that disrupts connections between muscle cells and as a result can disrupt the electrical waves. Our risk of AF also increases with age, and although the mechanism is not fully understood, the increase in fibrosis with age is a potential explanation.

We developed a mathematical model that represents how cells are organised and connected within heart muscle tissue and how that changes with age. As the amount of fibrosis increased to a critical point, the electrical waves would spontaneously re-organise into circular and spiral patterns, mirroring atrial fibrillation (see video below).
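To give a flavour of this kind of model, here is a much-simplified excitable-medium sketch inspired by it (an illustration only, not the published implementation; the grid size, refractory period and coupling fraction are arbitrary choices). Cells are always coupled to their horizontal neighbours, while vertical couplings exist only with probability nu – a lower nu stands in for more fibrosis:

```python
import random

L = 50                   # cells on an L x L grid
REFRACTORY = 10          # steps a cell stays unexcitable after firing
nu = 0.4                 # fraction of vertical connections present
rng = random.Random(0)

# vertical[y][x] is True if cell (y, x) is coupled to cell (y+1, x).
vertical = [[rng.random() < nu for _ in range(L)] for _ in range(L)]
state = [[0] * L for _ in range(L)]   # 0 = resting, REFRACTORY = just fired

def step(state):
    """Advance one time step: cells that fired last step excite any
    coupled, resting neighbours; everyone else recovers by one step."""
    new = [[max(s - 1, 0) for s in row] for row in state]
    for y in range(L):
        for x in range(L):
            if state[y][x] == REFRACTORY:          # this cell just fired
                for dx in (-1, 1):                 # horizontal: always coupled
                    if 0 <= x + dx < L and state[y][x + dx] == 0:
                        new[y][x + dx] = REFRACTORY
                for dy in (-1, 1):                 # vertical: coupled with prob nu
                    yy = y + dy
                    if 0 <= yy < L and vertical[min(y, yy)][x] and state[yy][x] == 0:
                        new[yy][x] = REFRACTORY
    return new

# Pacemaker: fire the whole left-hand column, then let the wave propagate.
for y in range(L):
    state[y][0] = REFRACTORY
for _ in range(30):
    state = step(state)
active = sum(s > 0 for row in state for s in row)
print(active)   # number of cells currently excited or refractory
```

With plenty of vertical couplings a clean planar wave sweeps across the grid; in the full model, reducing the couplings past a critical point lets wave fronts break and curl into the re-entrant spirals that mirror fibrillation.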
We found that “burning” particular regions in the model where the cells were structured in a certain way could stop the fibrillation; however, when there was too much fibrosis the burning was unsuccessful. The model was able to reproduce features of AF observed in patients, namely that an increase in fibrosis was related to an increase in the prevalence of AF and that destroying specific regions of tissue could sometimes stop the abnormal rhythm. In addition, the model reflects how the disease develops in time: AF initially occurs for short periods which gradually get longer as the condition progresses. However, medical imaging is unable to identify the structure of tissue at this scale in patients undergoing treatment. The mathematical model highlights where future research could be targeted. A lot of work and caution is required to translate results from a mathematical model to biological and clinical studies. Furthermore, the study is an interesting example of how simple models can reproduce features of real complex systems.

References

K. Christensen, K. A. Manani, N. S. Peters. Simple Model For Identifying Critical Regions in Atrial Fibrillation. Physical Review Letters 114:028104, 2015. DOI: http://dx.doi.org/10.1103/PhysRevLett.114.028104 (open access)

U. Schotten, S. Verheule, P. Kirchhof, A. Goette. Pathophysiological mechanisms of atrial fibrillation: a translational appraisal. Physiological Reviews 91(1):265-325, 2011.

Press Releases

http://www3.imperial.ac.uk/newsandeventspggrp/imperialcollege/newssummary/news_22-12-2014-15-54-17...

read more

Time Constrained Networks

Posted by on Sep 14, 2014 in Network | 0 comments

One of the key features of complex networks is that they capture interactions which have no limitations. In most electronic systems, be they Facebook, emails or web pages, we can make connections across the world with little if any cost. However, what if there are constraints on the links made in a network? Surely we should change the way we study networks if space, time or some other constraint is having a significant effect on the formation or use of the network. This has been a major interest of mine over the last few years.

Space is one obvious limitation, as in some cases long-distance links are less likely to be made. There has been a lot of work in this area over many decades but I will leave this constraint for another blog. It is only more recently that the role of time in networks has begun to receive more attention. A lot of this recent interest is in how to deal with networks where the connections are made at one time. That is because most communication networks – emails, phone calls and so forth – are of this type. The recent review by Holme and Saramäki (2012) covers such temporal edge networks.

Yet networks are made of two parts: vertices and edges. My recent work has focussed on the case where it is the vertices, not the edges, which are created at a definite time. In such temporal vertex networks, causality forces the interactions between nodes to always point in one direction. For example, consider a citation network formed by academic papers. The nodes in our network are the academic papers and the links are formed by their bibliographies. So if paper A refers to another paper B then we can be (almost) sure that A was written after B. Information can therefore flow only from B to A. In fact, since any set of documents can only refer to older ones, such networks are common. In law, judges refer to previous judgements to support their arguments.
When registering a patent, prior art needs to be cited, that is, other previously granted work which may have some relevance to the current claim. The same types of structure occur in several other situations: any situation where there is a logical flow has the same causal structure. If we have a project where the nodes represent individual tasks, then an edge from task S to task T could represent the fact that task T requires task S to have been completed before task T is started. This has been the focus of work on temporal vertex networks in computer science. The logical flow of a mathematical argument or an Excel spreadsheet shows the same properties. These networks define what is called a partially ordered set, or poset, and it is under this title that you find relevant work coming from mathematicians. A final example comes from the Causal Sets approach to quantum gravity (see Dowker 2006 for a review). Here space-time is discrete not continuous, and these discrete points are the nodes of the network. The nodes are connected by edges only if they are causally connected, and causality again gives these edges a common direction. All of these temporal vertex networks have a key property: they contain no loops if you...
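These causal constraints are easy to check in code. Here is a toy citation network in Python using networkx (the four documents are invented for illustration):

```python
import networkx as nx

# Toy citation network: an edge A -> B means document A cites the older B.
G = nx.DiGraph()
G.add_edges_from([("D", "B"), ("D", "C"), ("C", "A"), ("B", "A")])

# The arrow of time forbids cycles, so the graph is a DAG...
print(nx.is_directed_acyclic_graph(G))   # True

# ...and its nodes can be listed in an order consistent with causality:
# every document appears before anything it cites.
print(list(nx.topological_sort(G)))      # one valid causal ordering

# Following citations from "D" can only ever reach older documents.
print(sorted(nx.descendants(G, "D")))    # ['A', 'B', 'C']
```

The existence of such a topological ordering is exactly the sense in which these networks have an effective time coordinate, even when no explicit timestamps are recorded.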

read more