Basic Network Analysis in R

using igraph and related packages

David Schoch

Introduction

Throughout the tutorial, a basic knowledge of R and network analysis is assumed.

The main focus of this tutorial is the empirical analysis of networks; it skips a lot of additional functionality of igraph. If you are interested in, e.g., creating random networks, consult the literature at the end of this document.

After this tutorial, you should have a basic understanding of how to import network data into R and perform first analyses. These include computing descriptive statistics, centrality scores and clusterings.

Required libraries

To run all the code in this tutorial, you need to install and load two packages.

install.packages(c("igraph","netrankr"))

igraph is used for the majority of analytic tasks and for its data structures. netrankr is a package for network centrality; its main functionality is explained in another tutorial. Here, we only need some of the indices it implements.

library(igraph)
library(netrankr)

Loading network data into R

igraph can deal with many different foreign network formats with the function read_graph. The rgexf package can be used to import Gephi files.

read_graph(file, format = c("edgelist", "pajek", "ncol", "lgl",
  "graphml", "dimacs", "graphdb", "gml", "dl"), ...)

I personally have encountered some issues when importing graphml files that were produced in visone. Also, dl files sometimes throw an error.

If your network data is in one of the above formats you will find it easy to import your network. Below, we read in the Game of Thrones interaction network from the first season, which is stored as a graphml file.

gotS1 <- read_graph("data/GoT/gotS1.graphml",format="graphml")
gotS1
## IGRAPH 8a078ab UNW- 127 550 -- 
## + attr: title (g/c), name (v/c), clu (v/c), size (v/n), id (v/c),
## | Season (e/n), weight (e/n)
## + edges from 8a078ab (vertex names):
##  [1] Ned         --Robert       Daenerys    --Jorah       
##  [3] Jon         --Sam          Ned         --Littlefinger
##  [5] Ned         --Varys        Daenerys    --Drogo       
##  [7] Ned         --Arya         Catelyn     --Robb        
##  [9] Bronn       --Tyrion       Ned         --Cersei      
## [11] Cersei      --Robert       Littlefinger--Varys       
## [13] Shae        --Tyrion       Ned         --Catelyn     
## + ... omitted several edges

The output shows a summary of the network. The first line contains the number of nodes (127) and the number of edges (550). These can also be obtained via vcount(gotS1) and ecount(gotS1).

The following rows list the graph, vertex and edge attributes of the graph. The attributes can be distinguished by the string in parentheses: (g/c) denotes a character graph attribute, (v/n) and (v/c) numeric and character vertex attributes, and analogously (e/n) and (e/c) for edge attributes. The remaining rows show an excerpt of the edges.
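As an aside, these attributes can be queried (and modified) directly; a quick sketch using the attribute names from the summary above:

#query attributes of the imported network
graph_attr(gotS1, "title")   #graph attribute (g/c)
V(gotS1)$name[1:5]           #vertex attribute (v/c)
E(gotS1)$weight[1:5]         #edge attribute (e/n)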

One of the advantages of using R over other software tools is that you can easily work with complete collections of networks. The code below reads all Game of Thrones networks at once and stores them in a list.

file_names <- c("data/GoT/gotS1.graphml","data/GoT/gotS2.graphml",
                "data/GoT/gotS2.graphml","data/GoT/gotS4.graphml",
                "data/GoT/gotS3.graphml","data/GoT/gotS6.graphml",
                "data/GoT/gotS7.graphml")

gotS1to7 <- lapply(file_names,read_graph,format = "graphml")
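Typing each path by hand invites typos. Assuming the files are consistently named gotS1.graphml to gotS7.graphml, the paths can also be constructed programmatically:

#construct the file paths programmatically
file_names <- sprintf("data/GoT/gotS%d.graphml", 1:7)
gotS1to7 <- lapply(file_names, read_graph, format = "graphml")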

If your network data is not in a network file format, you will need one of the following functions to turn raw network data into an igraph object: graph_from_edgelist(), graph_from_adjacency_matrix(), graph_from_adj_list() and graph_from_data_frame().

Before using these functions, however, you still need to get the raw data into R. The procedure depends on the file format. If your data is stored as an Excel spreadsheet, you need additional packages. If you are familiar with the tidyverse, you can use the readxl package. Other options include, e.g., the xlsx package. However, I recommend refraining from using Excel spreadsheets, especially when your network is large.
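For completeness, a minimal sketch of reading a spreadsheet with readxl (the file name and sheet are hypothetical):

#hypothetical spreadsheet with an edgelist on the first sheet
edges <- readxl::read_excel("data/edges.xlsx", sheet = 1)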

Most network data you’ll find is in a plain text format (csv or tsv), either as an edgelist or an adjacency matrix. To read in such data, you can use read.table(), the base R solution. Further options are given at the end of this section.

Make sure you check the following before trying to load a file: Does it contain a header (e.g. row/column names of an adjacency matrix)? How are values delimited (comma, whitespace or tab)?

The data repository for the Game of Thrones networks contains two adjacency matrices of the first season. The file gotS1_matrix.dat has a header (and row names) and values are separated by a single space. The file gotS1_matrix1.dat on the other hand has no header (and no row names) and values are separated by a comma.

Knowing this, you can read the files as follows.

#file with header, rownames and separated by a space
A1 <- read.table(file = "data/GoT/gotS1_matrix.dat",
                 sep = " ",header = TRUE,row.names = 1)

#file without header and rownames, and comma separated 
A2 <- read.table("data/GoT/gotS1_matrix1.dat",
                 header = FALSE,sep=",")

Note that read.table() reads the data as a data frame, so we need to convert the result into a matrix.

A1 <- as.matrix(A1)
A2 <- as.matrix(A2)
A1[1:5,1:5]
##              Ned Daenerys Jon Littlefinger Arya
## Ned            0        3  29          107   90
## Daenerys       3        0   0            0    0
## Jon           29        0   0            0   20
## Littlefinger 107        0   0            0    5
## Arya          90        0  20            5    0
A2[1:5,1:5]
##       V1 V2 V3  V4 V5
## [1,]   0  3 29 107 90
## [2,]   3  0  0   0  0
## [3,]  29  0  0   0 20
## [4,] 107  0  0   0  5
## [5,]  90  0 20   5  0

The procedure is very similar if your data is an edgelist; a minimal sketch follows, assuming a hypothetical comma-separated file with one edge per row.
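#the file name is hypothetical; each row holds one "from,to" pair
el <- read.table("data/GoT/gotS1_edges.csv", sep = ",",
                 header = FALSE, stringsAsFactors = FALSE)
gE <- graph_from_edgelist(as.matrix(el), directed = FALSE)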

The data is now ready to be converted into an igraph object. The networks are both undirected and weighted, so we need to adjust the corresponding parameters.

gA1 <- graph_from_adjacency_matrix(A1,mode = "undirected",weighted = TRUE)
gA2 <- graph_from_adjacency_matrix(A2,mode = "undirected",weighted = TRUE)

gA1
## IGRAPH 82bdbb0 UNW- 127 550 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 82bdbb0 (vertex names):
##  [1] Ned--Daenerys       Ned--Jon            Ned--Littlefinger  
##  [4] Ned--Arya           Ned--Catelyn        Ned--Cersei        
##  [7] Ned--Joffrey        Ned--Sansa          Ned--Tyrion        
## [10] Ned--Jeor           Ned--Robb           Ned--Bran          
## [13] Ned--Jaime          Ned--Loras          Ned--Jory.Cassel   
## [16] Ned--Jorah          Ned--Ros            Ned--Benjen        
## [19] Ned--Greatjon.Umber Ned--Pyp            Ned--Renly         
## [22] Ned--Barristan      Ned--Hound          Ned--Theon         
## + ... omitted several edges
gA2
## IGRAPH 84ea4f6 UNW- 127 550 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 84ea4f6 (vertex names):
##  [1] V1--V2   V1--V3   V1--V4   V1--V5   V1--V6   V1--V8   V1--V10 
##  [8] V1--V11  V1--V12  V1--V13  V1--V14  V1--V15  V1--V16  V1--V17 
## [15] V1--V18  V1--V19  V1--V20  V1--V22  V1--V23  V1--V27  V1--V28 
## [22] V1--V30  V1--V35  V1--V36  V1--V37  V1--V38  V1--V39  V1--V41 
## [29] V1--V43  V1--V44  V1--V46  V1--V48  V1--V49  V1--V51  V1--V52 
## [36] V1--V53  V1--V56  V1--V64  V1--V65  V1--V66  V1--V67  V1--V76 
## [43] V1--V77  V1--V85  V1--V90  V1--V102 V1--V105 V1--V106 V1--V107
## [50] V1--V109 V1--V114 V1--V116 V1--V117 V1--V118 V1--V125 V1--V126
## + ... omitted several edges

Note that gA1 and gA2 are basically the same network, except that the second one is missing the real name attribute.

Speaking of attributes: both functions graph_from_edgelist() and graph_from_adjacency_matrix() come with a drawback. You can’t add node or edge attributes automatically.

In general, it is always a good idea to organize network data in two different data frames. One for the nodes (with attributes) and one for the edges (edgelist+attributes). This will save you a lot of pain when trying to convert the data into an igraph object.

If you are wondering what this network represents: Nodes are characters from Grey’s Anatomy and links indicate who slept with whom…

edges <- read.csv("data/greys-edges.csv", header = FALSE,stringsAsFactors = FALSE)
nodes <- read.csv("data/greys-nodes.csv", header = TRUE,stringsAsFactors = FALSE)

ga <- graph_from_data_frame(d = edges,directed = FALSE,vertices = nodes)
ga
## IGRAPH fe3c46c UN-- 54 57 -- 
## + attr: name (v/c), sex (v/c), race (v/c), birthyear (v/n),
## | position (v/c), season (v/n), sign (v/c)
## + edges from fe3c46c (vertex names):
##  [1] Arizona Robbins--Leah Murphy     Alex Karev     --Leah Murphy    
##  [3] Arizona Robbins--Lauren Boswell  Arizona Robbins--Callie Torres  
##  [5] Erica Hahn     --Callie Torres   Alex Karev     --Callie Torres  
##  [7] Mark Sloan     --Callie Torres   George O'Malley--Callie Torres  
##  [9] Izzie Stevens  --George O'Malley Meredith Grey  --George O'Malley
## [11] Denny Duqutte  --Izzie Stevens   Izzie Stevens  --Alex Karev     
## [13] Derek Sheperd  --Meredith Grey   Preston Burke  --Cristina Yang  
## + ... omitted several edges

All code in this section is sufficient if your networks only have several hundred nodes. But at some point, read.table() may become a bottleneck. Consider switching to readr::read_csv(), data.table::fread() or vroom::vroom() instead.
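Both behave much like read.table(); a quick sketch with a hypothetical large edgelist file:

#both return data-frame-like objects that graph_from_data_frame() accepts
edges_big <- readr::read_csv("data/big-edges.csv")
edges_big <- data.table::fread("data/big-edges.csv")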

Basic descriptive statistics

Once you have properly read in your network, you can start by looking at some very basic network statistics.

Node and edge counts can be computed as follows. (There appears to be quite a lot of action in Grey’s Anatomy.)

vcount(ga)
## [1] 54
ecount(ga)
## [1] 57

To do this for a list of networks, use the sapply() function.

sapply(gotS1to7,vcount)
## [1] 127 129 129 172 124 143  81
sapply(gotS1to7,ecount)
## [1] 550 486 486 667 504 542 412

The density and diameter of networks are calculated with graph.density() and diameter().

graph.density(ga)
## [1] 0.03983229
diameter(ga)
## [1] 8
sapply(gotS1to7,graph.density)
## [1] 0.06874141 0.05886628 0.05886628 0.04535564 0.06608969 0.05338324
## [7] 0.12716049
sapply(gotS1to7,diameter,weights = NA) #remember that the networks are weighted
## [1] 6 6 6 8 7 7 5

The degree distribution can be visualised with the hist() function.

hist(degree(ga), col="firebrick", xlab="Vertex Degree", 
     ylab="Frequency", main="")

The neighbors of a specific node can be extracted with the ego() function. Below, we are looking for all characters that “were intimate” with Alex Karev.

karev <- which(V(ga)$name=="Alex Karev")
ego(ga,order = 1,nodes = karev, mode = "all",mindist = 1)
## [[1]]
## + 11/54 vertices, named, from fe3c46c:
##  [1] Addison Montgomery Rebecca Pope       Izzie Stevens     
##  [4] Lexie Grey         Lucy Fields        Dana Seabury      
##  [7] Nurse Olivia       Callie Torres      Heather Brooks    
## [10] Jo Wilson          Leah Murphy

The clustering coefficient of a network can be computed with the transitivity() function.

Why is the clustering coefficient of Grey’s Anatomy zero?

transitivity(ga)
## [1] 0
sapply(gotS1to7,transitivity)
## [1] 0.3818514 0.4157044 0.4157044 0.4355249 0.4810657 0.4632628 0.4669358
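A hint for the question above: a clustering coefficient of zero means that the network does not contain a single triangle, which can be verified directly.

#transitivity of 0 implies that there are no triangles at all
sum(count_triangles(ga))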

Example: triad census of football leagues

In a directed network, there are 16 possible triad types. The triad census of a network gives the number of occurrences of each type. Triads are labelled xyzL, where x is the number of mutual (reciprocated) dyads, y the number of asymmetric (unreciprocated) dyads and z the number of null dyads. The letter L (U, C, D or T) differentiates between triads for which these counts coincide.
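To see what the 16 counts look like, you can compute the census of a small random directed graph; a quick sketch:

#triad census of a small random directed graph: 16 counts, one per type
g_rnd <- sample_gnp(10, p = 0.3, directed = TRUE)
triad_census(g_rnd)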

If you were at Sunbelt 2019, you may have learned how the triad census shows that social licking among cows is similar to legislators sponsoring bills.

One of the many applications of the triad census is to compare a set of networks. In this example, we tackle the question “how transitive is football?” and assess structural differences among a set of 112 football leagues. If you played football as a kid, you have surely heard the sentence: “We won against B and B won against C, so naturally we will win against C”. Of course, reality is not that simple.

RDS files store single R objects, which can be loaded directly without any parsing. The data repository contains football.RDS, which we can read with readRDS().

fb_graphs <- readRDS("data/football.RDS")

fb_graphs is a list which contains the networks of 112 football leagues as igraph objects. A directed link from team A to team B indicates that A won a match against B. Note that there can also be an edge from B to A, since most leagues play a double round robin. For the sake of simplicity, all draws were deleted, so there can also be a null tie between two teams if both their matches ended in a draw.

Below, we calculate the triad census for all networks at once using lapply(). The function returns the triad census of each network as a list, which we turn into a matrix in the second step. Afterwards, we manually add the row and column names of the matrix. The name of each league is stored as a graph attribute, and we gather all names with sapply().

fb_triads <- lapply(fb_graphs,triad_census) 
fb_triads <- matrix(unlist(fb_triads),ncol=16,byrow = T)
rownames(fb_triads) <- sapply(fb_graphs,function(x) x$name)
colnames(fb_triads) <- c("003","012","102","021D","021U","021C","111D","111U",
                         "030T","030C","201","120D","120U","120C","210","300")

#normalize to make proportions comparable across leagues
fb_triads_norm <- fb_triads/rowSums(fb_triads)

#check the Top 5 leagues
idx <- which(rownames(fb_triads)%in%c("england","spain","germany",
                                      "italy","france"))
fb_triads[idx,]
##         003 012 102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C
## england   2  10   0   58   31   40   34   44  338   29  19  118  129  143
## france    1  23   5   30   33   44   48   40  332   41  16  132  108  160
## germany   0  21   6   27   19   49   38   46  165   16  23   77   79  117
## italy     1   4   2   35   43   30   30   22  419   38   5  164  116  118
## spain     0   8   4   27   42   45   32   35  364   43  11  126  105  148
##         210 300
## england 131  14
## france  114  13
## germany 120  13
## italy    99  14
## spain   130  20

The figure below shows the fraction of transitive triads per league (Big five European leagues in red).

It would be interesting to compare these values to other sports. On average, 28% of all triads are transitive. The most transitive league overall is Macao with more than 60%. Among the top European leagues, the German Bundesliga is the least transitive (and thus the most competitive?) and the Italian Serie A the most transitive (and thus the least competitive?).

In empirical studies, we are not necessarily only interested in transitive triads, but rather in how the triad census profiles compare across networks. Katherine Faust suggests using correspondence analysis (CA) for this purpose. There are several possibilities to run a CA in R. We here use the packages FactoMineR and factoextra. You may also need to install ggrepel to use repel = TRUE in fviz_ca_row() to reduce label overlaps. The function CA() from FactoMineR performs the CA, and fviz_ca_row() from factoextra visualizes the first two dimensions. The parameter ncp controls the number of dimensions to extract; we here follow Katherine Faust and use four.

#install.packages(c("FactoMineR","factoextra"))

library(FactoMineR)
library(factoextra)
ca_triads <- CA(fb_triads,ncp = 4,graph = FALSE)
fviz_ca_row(ca_triads,repel = TRUE)

Unfortunately, we have a huge overplotting problem. In this case, it may be better not to plot the labels.

fviz_ca_row(ca_triads,repel = TRUE,geom = "point")

How should we interpret the dimensions? To investigate this question, we take a closer look at the first two dimensions of the CA solution and compare them to some network descriptives. For the sake of brevity, we here only look at the density and the proportion of 030T triads. In principle, any node/dyad/triad level statistic could be used.

plot(
  x = sapply(fb_graphs,graph.density),
  y = ca_triads$row$coord[,"Dim 1"],
  xlab = "density",ylab = "dimension 1"
  )

The correlation is -0.96. We thus have a strong association between the first dimension of the CA and the density of the networks.

plot(
  x = fb_triads_norm[,"030T"],
  y = ca_triads$row$coord[,"Dim 2"],
  xlab = "proportion of 030T",ylab = "dimension 2"
  )

In this case, the correlation is -0.83 and we have a strong association between the second dimension of the CA and the proportion of transitive triads.
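As a sketch, the reported correlations can be computed directly from the objects created above:

#correlations of the CA dimensions with network descriptives
cor(sapply(fb_graphs, graph.density), ca_triads$row$coord[, "Dim 1"])
cor(fb_triads_norm[, "030T"], ca_triads$row$coord[, "Dim 2"])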

For a more extensive treatment of the triad census, see this paper by Katherine Faust.

Besides the triad census, there is also the dyad census, which counts the numbers of mutual, asymmetric and null dyads. The code below computes the dyad census for all leagues and prints the result for the big five European leagues.

fb_dyads <- lapply(fb_graphs,dyad_census) 
fb_dyads <- matrix(unlist(fb_dyads),ncol=3,byrow = T)

rownames(fb_dyads) <- sapply(fb_graphs,function(x) x$name)
colnames(fb_dyads) <- c("mutual","asym","null")

fb_dyads[idx,]
##         mutual asym null
## england     45  131   14
## france      44  131   15
## germany     43   94   16
## italy       39  141   10
## spain       44  134   12

Centrality

igraph implements all of the commonly used centrality indices, among them degree(), closeness(), betweenness(), eigen_centrality(), page_rank(), subgraph_centrality(), authority_score() and hub_score(). If you want more indices, check out the centiserve and CINNA packages. However, the standard indices are usually more than enough.

Although using these functions is fairly straightforward, there are several potential pitfalls to be aware of. The functions eigen_centrality(), page_rank(), subgraph_centrality(), authority_score() and hub_score() always return the centrality scores of all vertices. For the others, you can control for which vertices the scores should be calculated. The parameter names are not consistent, though; check the help to see whether it is v or vids.

If a graph has an edge attribute weight, then most centrality functions will use the weights by default. An exception is degree(), since its weighted version is graph.strength().
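The difference is easy to check; a quick sketch with the weighted Game of Thrones network from above:

#degree counts incident edges, graph.strength sums their weights
degree(gotS1)[1:5]
graph.strength(gotS1)[1:5]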

If the graph is directed, then the mode parameter controls which version is computed. Possible values are “in”, “out” and “all”. However, some functions instead have a logical parameter directed that controls whether directed or undirected scores are calculated.
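A quick sketch with the directed football networks from the previous section (recall that an edge from A to B means that A won against B):

#out-degree counts wins, in-degree counts losses
head(degree(fb_graphs[[1]], mode = "out"))
head(degree(fb_graphs[[1]], mode = "in"))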

Given these potential pitfalls, it is always a good idea to check the help before applying any index.

The below example shows the difference between weighted and unweighted betweenness.

rbind(betweenness(gotS1)[1:10], betweenness(gotS1,weights = NA)[1:10])
##           Ned Daenerys      Jon Littlefinger     Arya   Catelyn    Bronn
## [1,] 1599.531 294.8472 209.5429     209.6917 325.0375  778.6164 75.56667
## [2,] 2364.557 798.4258 672.8643     123.3748 516.7144 1041.7014 23.41364
##        Cersei Shae  Joffrey
## [1,] 543.6563    0 345.0673
## [2,] 138.9682    0 224.6781

Similar to descriptive statistics, we can also compute centrality scores for a network collection stored in a list. If you want to compare centrality scores across networks, you need to normalize the scores. Most indices have a built-in normalization parameter.

wdeg_GoT <- lapply(gotS1to7,graph.strength)
wdeg_GoT_norm <- lapply(wdeg_GoT,function(x) x/sum(x))

# most central characters per Season
lapply(wdeg_GoT_norm,function(x) sort(x,decreasing = TRUE)[1:5])
## [[1]]
##        Ned     Tyrion    Catelyn     Robert   Daenerys 
## 0.09885057 0.05425287 0.04475096 0.04314176 0.04099617 
## 
## [[2]]
##     Tyrion     Cersei       Arya      Theon    Joffrey 
## 0.07784625 0.04451833 0.03900422 0.03624716 0.03446319 
## 
## [[3]]
##     Tyrion     Cersei       Arya      Theon    Joffrey 
## 0.07784625 0.04451833 0.03900422 0.03624716 0.03446319 
## 
## [[4]]
##     Tyrion     Cersei      Jaime      Tywin        Jon 
## 0.06255356 0.04420166 0.04270209 0.04184519 0.03870323 
## 
## [[5]]
##     Tyrion       Robb      Tywin      Sansa     Cersei 
## 0.05458953 0.03134927 0.03059689 0.03034610 0.02959371 
## 
## [[6]]
##        Jon      Sansa     Tyrion      Jaime      Davos 
## 0.05159620 0.04736842 0.03813632 0.03511648 0.03399482 
## 
## [[7]]
##        Jon   Daenerys     Tyrion     Cersei      Sansa 
## 0.09738431 0.07374245 0.06971831 0.05915493 0.05452716
# same for betweenness
betw_GoT <- lapply(gotS1to7,betweenness,normalized=TRUE)
lapply(betw_GoT,function(x) sort(x,decreasing = TRUE)[1:5])
## [[1]]
##        Ned      Jorah      Aerys    Catelyn     Tyrion 
## 0.20311506 0.15498075 0.10114458 0.09887193 0.09744218 
## 
## [[2]]
##    Tyrion       Ned     Tywin    Robert      Arya 
## 0.1600906 0.1179841 0.1047071 0.1026387 0.1008107 
## 
## [[3]]
##    Tyrion       Ned     Tywin    Robert      Arya 
## 0.1600906 0.1179841 0.1047071 0.1026387 0.1008107 
## 
## [[4]]
##   Stannis       Ned   Joffrey     Varys      Arya 
## 0.2377589 0.2005887 0.1960489 0.1518714 0.1287810 
## 
## [[5]]
##      Robb       Jon     Tywin       Ned   Catelyn 
## 0.2214522 0.2100088 0.2003533 0.1980826 0.1811807 
## 
## [[6]]
##       Jon     Varys     Sansa     Jaime     Theon 
## 0.2154613 0.2145440 0.1559249 0.1403925 0.1370767 
## 
## [[7]]
##    Robert  Daenerys       Jon       Sam     Sansa 
## 0.1996923 0.1976556 0.1360417 0.1072521 0.1031329
# an example of eigenvector centrality with the Grey's Anatomy data
sort(eigen_centrality(ga)$vector)
##       Adele Webber     Miranda Bailey         Ellis Grey 
##       0.000000e+00       0.000000e+00       0.000000e+00 
##      Megan Nowland       Steve Mostow      Pierce Halley 
##       0.000000e+00       0.000000e+00       0.000000e+00 
##         Susan Grey      Thatcher Grey     Richard Webber 
##       0.000000e+00       0.000000e+00       0.000000e+00 
##    Catherine Avery              Lloyd          Eli James 
##       1.693033e-17       2.119297e-17       2.236721e-17 
##       Tucker Jones         Ben Warren      Preston Burke 
##       2.435691e-17       2.984658e-17       9.227332e-03 
##         Shane Ross       Colin Marlow               Emma 
##       9.227332e-03       9.227332e-03       2.868554e-02 
##  Stephanie Edwards       April Kepner      Cristina Yang 
##       3.312497e-02       3.312497e-02       3.581708e-02 
##               Rose       Steve Murphy     Finn Dandridge 
##       4.489222e-02       4.814684e-02       4.814684e-02 
##       Nathan Riggs       Henry Burton     Andrew Perkins 
##       4.814684e-02       5.073166e-02       5.073166e-02 
##     Lauren Boswell      Denny Duqutte               Hank 
##       6.652186e-02       1.076552e-01       1.076552e-01 
##          Owen Hunt      Jackson Avery      Nancy Shepard 
##       1.113466e-01       1.285788e-01       1.420966e-01 
##       Reed Adamson         Erica Hahn    Amelia Shepherd 
##       1.420966e-01       1.575679e-01       1.707821e-01 
##      Derek Sheperd      Meredith Grey       Teddy Altman 
##       1.742549e-01       1.868882e-01       1.969215e-01 
##       Rebecca Pope       Dana Seabury        Lucy Fields 
##       2.576238e-01       2.576238e-01       2.576238e-01 
##     Heather Brooks          Jo Wilson    Arizona Robbins 
##       2.576238e-01       2.576238e-01       2.582132e-01 
##        Leah Murphy       Nurse Olivia    George O'Malley 
##       3.241457e-01       3.624084e-01       4.067350e-01 
##      Izzie Stevens         Lexie Grey Addison Montgomery 
##       4.178775e-01       4.328453e-01       4.446126e-01 
##         Mark Sloan      Callie Torres         Alex Karev 
##       5.515661e-01       6.116201e-01       1.000000e+00

Example: which index to choose?

This example should make you aware of some general issues with centrality indices and make you think about the question: “How do I choose an appropriate index?”

The data repository contains two small networks that are used in this example.

g1 <- readRDS("data/cent_example_1.rds")
g2 <- readRDS("data/cent_example_2.rds")

Both networks have the same number of nodes (11) and a similar density (0.31 and 0.35).

The networks are undirected and unweighted. We start by calculating the scores of all indices that are applicable to such networks.

cent1_df <- data.frame(
  vertex = 1:11,
  dc = degree(g1),
  bc = betweenness(g1),
  cc = closeness(g1),
  ec = eigen_centrality(g1)$vector,
  sc = subgraph_centrality(g1))

cent2_df <- data.frame(
  vertex = 1:11,
  dc = degree(g2),
  bc = betweenness(g2),
  cc = closeness(g2),
  ec = eigen_centrality(g2)$vector,
  sc = subgraph_centrality(g2))

The table below shows the scores for the first network (dc = degree, bc = betweenness, cc = closeness, ec = eigenvector centrality, sc = subgraph centrality). Column maxima are marked with an asterisk.

vertex  dc   bc        cc       ec       sc
1       1    0         0.037    0.226    1.8251
2       1    0         0.0294   0.0646   1.5954
3       2    0         0.04     0.3786   3.1486
4       2    9         0.04     0.2415   2.4231
5       3    3.8333    0.05     0.5709   4.3871
6       4    9.8333    0.0588*  0.9847   7.8073
7       4    2.6667    0.0526   1*       7.9394
8       4    16.3333*  0.0556   0.8386   6.6728
9       4    7.3333    0.0556   0.9114   7.0327
10      4    1.3333    0.0526   0.9986   8.2421*
11      5*   14.6667   0.0556   0.845    7.3896

Notice how every index considers a different vertex to be the most central one. Let’s check out the second network.

vertex  dc   bc     cc       ec       sc
1       3    0      0.0588   0.557    10.337
2       3    0      0.0588   0.557    10.337
3       3    0      0.0588   0.557    10.337
4       5    1      0.0667   0.751    18.0023
5       2    0      0.0556   0.3986   5.9211
6       2    0      0.0556   0.3986   5.9211
7       7    5.5    0.0769   0.8898   24.7149
8       1    0      0.0526   0.2109   2.5587
9       1    0      0.0526   0.2109   2.5587
10      1    0      0.0526   0.2109   2.5587
11      10*  29.5*  0.1*     1*       31.159*

Here, all indices agree on the most central node! If you look more closely, you will even see that they all induce the same node ranking.

You may be wondering why we are only looking at the ranking and not the actual values. Effectively, the values themselves don’t have any meaning. There is no such thing as a “unit of centrality” if we look at it from a measurement perspective. For instance, we can’t say that a node is “twice as between” as another if its betweenness value is twice as high. Centrality should thus not be considered to be on an interval scale, but rather an ordinal one. This might seem like a restriction at first, but we will see later on (in another tutorial) that it facilitates many theoretical examinations.
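Treating centrality as ordinal also means that comparisons between indices should be rank-based; a minimal sketch using a rank correlation on the scores of the first network:

#rank correlation between betweenness and closeness scores
cor(cent1_df$bc, cent1_df$cc, method = "kendall")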

The two networks illustrate the big problem of choice. We have only tried 5 different indices, so we actually can’t make any conclusive statements about central nodes. After all, 5 indices can in the best case produce 5 completely different rankings. But theoretically, there are 11! = 39,916,800 possibilities to rank the nodes of the network without ties (and indices can additionally produce tied ranks). So, what if we missed hundreds of thousands of potential indices that would rank, say, node nine on top for network 1? What if those 5 indices are exactly the ones that rank node eleven on top for network 2, but no other index does? In the next example, we add some (made-up) empirical context to illustrate the problem of how to validate the appropriateness of chosen indices.

Centrality indices are commonly used as an explanatory variable for some observed phenomenon or node attribute in a network. Let’s say we have the following abstract research question: given a network where each node carries a binary attribute, which could signify the presence or absence of some property, can a centrality index explain the presence of this attribute?

The data repository contains a network that is equipped with such a binary attribute. Since the network is rather large, there is also an RDS file which contains the precomputed centrality scores.

g3 <- readRDS("data/cent_example_3.rds")
cent3_df <- readRDS("data/cent_example3_scores.rds")
head(cent3_df)
##   attr degree betweenness    closeness        eigen    subgraph  infocent
## 1    0      1      0.0000 9.448224e-05 9.356028e-03  830.570056 0.5634474
## 2    0      1      0.0000 9.559316e-05 6.160046e-03  364.526673 0.5843892
## 3    0      1      0.0000 7.303535e-05 5.509022e-05    1.632701 0.3409455
## 4    0      3    112.0519 1.089562e-04 1.107902e-02 2814.242127 1.0096304
## 5    1      1      0.0000 7.750136e-05 5.010141e-05    1.898569 0.4697716
## 6    0      5   3181.8374 1.013993e-04 3.520925e-03  167.154082 1.0072708

If we assume that one of the indices is somehow related to the attribute, then we should see that nodes with the attribute tend to be ranked at the top of the induced ranking. Below, we calculate, for each index, the number of nodes with the attribute that are ranked in the top 150.

apply(cent3_df[,2:7],2,function(x) 
  sum(cent3_df$attr[order(x,decreasing = TRUE)][1:150]))
##      degree betweenness   closeness       eigen    subgraph    infocent 
##          63          53          56          72          76          62

According to this crude evaluation, subgraph centrality is best at explaining the node attribute. But how conclusive is this? Note that we did not specify any real hypothesis, so basically any index could be a valid choice. Instead of trying out one of the other indices mentioned above, we now design a new index which hopefully gives us an even better fit. After some wild math, we may end up with something like this: \[ c(u)= ccoef(u) \left[\sum\limits_{v \in N(u)} \sum\limits_{k=0}^{\infty} \frac{(A^{[u]})_{vv}^{2k}}{(2k)!} \right] \]

Ok, so what is happening here? ccoef(u) is the local clustering coefficient of node u (transitivity(g, type = "local")). The first sum is over all neighbors v of u. The second sum adds up all closed walks of even length, weighted by the inverse factorial of their length.

We can directly invent a second one, based on the walks of odd length. \[ c(u)= ccoef(u) \left[\sum\limits_{v \in N(u)} \sum\limits_{k=0}^{\infty} \frac{(A^{[u]})_{vv}^{2k+1}}{(2k+1)!} \right] \]

Mathematically fascinating, yet both indices defy any rational meaning and have not yet been considered in the literature (please do not write a paper about them!). They are implemented in netrankr as hyperbolic_index(); the even and odd sums are the series expansions of the hyperbolic cosine and sine, hence the name.

How do they compare to subgraph centrality, the best performing index so far?

cent3_df$hyp_eve <- hyperbolic_index(g3,type = "even")
cent3_df$hyp_odd <- hyperbolic_index(g3,type = "odd")
apply(cent3_df[,c(6,8:9)],2,function(x) 
  sum(cent3_df$attr[order(x,decreasing = TRUE)][1:150]))
## subgraph  hyp_eve  hyp_odd 
##       76       99      102

Both indices are far superior: around 66% of the top 150 nodes carry the attribute, compared to 50% for subgraph centrality. A better evaluation would be to treat the problem as a binary classification task and calculate the area under the ROC curve as a performance estimate. You will have to trust me that the indices also outperform the others with that method.
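A sketch of this ROC evaluation, assuming the pROC package (not used elsewhere in this tutorial) is installed:

#install.packages("pROC")
library(pROC)
#area under the ROC curve, treating the attribute as the class label
auc(response = cent3_df$attr, predictor = cent3_df$hyp_odd)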

Obviously, this is a very contrived example, yet it emphasizes some important points. First, it is relatively easy to design an index that gives you the results you intend to get, and hence to justify the importance of that index. Second, you can never be sure that you have found “the best” index for the task; there may well be some even more obscure index that gives better results. Third, if you do not find a fitting index, you cannot be sure that one does not exist after all.

Clustering

The goal of clustering is to find cohesive subgroups within a network (also referred to as “community detection”). The following algorithms for graph clustering are implemented in igraph.

## [1] "cluster_edge_betweenness" "cluster_fast_greedy"     
## [3] "cluster_infomap"          "cluster_label_prop"      
## [5] "cluster_leading_eigen"    "cluster_louvain"         
## [7] "cluster_optimal"          "cluster_spinglass"       
## [9] "cluster_walktrap"

It is important to note that there is no real theoretical basis for what constitutes a cluster in a network, only the vague notion of “internally dense, externally sparse”. As such, there is no convincing argument for or against certain algorithms/methods. In this tutorial, we follow the “modularity maximization” principle.

As for centrality, if a weight attribute is present, it will be used by default!

No matter which algorithm is chosen, the workflow is always the same.

clu <- cluster_louvain(ga)

#membership vector
mem <- membership(clu)
head(mem)
## Addison Montgomery       Adele Webber       Teddy Altman 
##                  4                  8                  7 
##    Amelia Shepherd    Arizona Robbins       Rebecca Pope 
##                  7                  6                  6
#communities as list
com <- communities(clu)
com[[1]]
## [1] "Miranda Bailey" "Ben Warren"     "Lloyd"          "Tucker Jones"  
## [5] "Eli James"

An example with the infamous karate network is shown below.

karate <- make_graph("Zachary")
imc <- cluster_infomap(karate)
lec <- cluster_leading_eigen(karate)
loc <- cluster_louvain(karate)
sgc <- cluster_spinglass(karate)
wtc <- cluster_walktrap(karate)
scores <- c(infomap = modularity(karate,membership(imc)),
            eigen = modularity(karate,membership(lec)),
            louvain = modularity(karate,membership(loc)),
            spinglass = modularity(karate,membership(sgc)),
            walk = modularity(karate,membership(wtc)))
scores
##   infomap     eigen   louvain spinglass      walk 
## 0.4020381 0.3934089 0.4188034 0.4197896 0.3532216

For the karate network, cluster_spinglass() produces the highest modularity score. In general, though, it is advisable to use cluster_louvain() since it has the best speed/performance trade-off.
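The result of any of these algorithms can be passed straight to plot() together with the graph, which colors the nodes by community; a quick sketch:

#plot the karate network, coloring nodes by their louvain community
plot(loc, karate)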

Miscellaneous

This tutorial focused on the package igraph, yet there are many other packages for more specific network analytic tasks, some of which (e.g. qgraph, tidygraph and network) appear in the further reading below.

Further reading

The book “Statistical Analysis of Network Data with R” by Eric Kolaczyk and Gábor Csárdi is an excellent source for SNA material in R.

There is also a variety of tutorials on the web that cover basic and more advanced network analysis with R. One of the most extensive is “Network Analysis and Visualization with R and igraph” by Katherine Ognyanova (link). Her tutorial covers most of the functionality of igraph with in-depth explanations of the built-in plotting function.

The “Network Analysis R Cookbook” by Sacha Epskamp (link) makes use of the qgraph package which implements several methods to analyze psychometric networks.

“Introduction to Network Analysis with R” by Jesse Sadler (link) provides a short comparison of igraph, network and tidygraph.