Re: [igraph] Real Network Analysis

Dear All,
An apology for this long email, but I am planning to use extensively the
igraph library this year and beyond and I have gathered together a few
questions.
I would now apply the igraph library to study a real network, e.g. the
US airport network.
I found out that some data are freely available at

http://cxnets.googlepages.com/US_largest500_airportnetwork.txt

There a few topological quantities of interest I would like to calculate
(I hope that this may be useful to other igraph learners)

(1) The degree distribution P(k) [which I estimate using an empirical
frequency] as a function of k (node degree)
(2) Fit (1) to a power law
(3) The clustering coefficients (sometimes called transitivity) as a
function of k
(4) The distribution of the vertex betweenness
(5) The average nearest neighbour degree as a function of k
(6) The mean shortest path in the network
(7) The average number of nodes within a distance l from any given node
(as a function of l)

I have written an R code which hopefully performs these tasks (see
below), but I have a few questions

(a) Do I have any problems if there are some missing degrees in my
distribution (other than getting some warnings in a few log-log plots)?

I don't think so. What kind of problems do you fancy?

(b) When performing (2), whatever xmin I set by hand, I get an error...why?

I don't get an error, only a warning that some values of the input are omitted. This is fine. If you want to get rid of the warning, then remove the values that are less than xmin from the input vector.

(c) In (3), I set equal to zero by hand the NaN returned by the igraph transitivity routine (I think it is part of the definition of the clustering coefficients when the numerator is zero).

If you feel like it. I don't think that there is much consensus about this case, but 0/0=NaN seemed like a good choice.

(d) I have no idea of how to calculate the quantity in (7).

Use neighborhood.size and then mean. It should be OK for a small graph like this, the diameter of the graph is 7 anyway, so in seven steps you always reach all nodes.

(e) Can I tune the parameters (e.g. number of bins) of the histogram
generated by path.length.hist ?

path.length.hist doesn't do binning, it returns a separate number for every different distance. You can then bin that, if you like.

I think this really amounts to knowing the graph topology in detail. The last questions are really about the R/Python bindings and the speed penalty.

(f) I resorted to the R bindings of the igraph library, but can I perform the same tasks with the Python bindings?

I think so.

(g) Approximately, what is the penalty in using the R/Python bindings rather than the C library directly? A factor 2 or three is not big deal, but one order of magnitude would be.

Personally I never measured this, because I never felt that the R bindings are slow (and don't use the Python bindings).

It depends on the operations you are using. In R there is only copying for the results that you obtain from igraph. I.e. if you create a new graph, that is created in the C address space first and then copied to the R address space. The same is true for all results.

I would the very surprised if during general use you would have a performance penalty factor of two. I think it more like 1.1 or even less. But this doesn't mean much without actual measurements.

Best,
Gabor

Many thanks for any suggestions

Lorenzo

rm(list=ls())

library(Cairo)
library(igraph)

###########################################################
# http://cxnets.googlepages.com/US_largest500_airportnetwork.txt
###########################################################

graph_input<-read.table("US_largest500_airportnetwork.txt",header=FALSE)
graph_input<-as.matrix(graph_input)

graph_input_unweighted <- graph_input[ ,1:2]

# Now I have the graph as a two-column matrix

g <- graph(t(graph_input_unweighted), directed=FALSE)

g <- simplify(g)

#save the graph in a convenient format (e.g. edgelist)

write.graph(g, "edgelist-airport", format="edgelist")

#Now plot the graph

l <- layout.fruchterman.reingold(g)
l <- layout.norm(l, -1,1, -1,1)

E(g)$color <- "grey"
E(g)$width <- 1
V(g)$label.color <- "blue"
V(g)$color <- "SkyBlue2"

pdf("airport-network.pdf")

plot(g, layout=layout.fruchterman.reingold,
vertex.label.dist=0.5,vertex.label=NA, vertex.size=5)

dev.off()

############################################################################################
############################################################################################
############################################################################################

#Now some useful statistics

#degree of each vertex

dd <- degree(g)

print("The mean degree is, ")
print(mean(dd))

#degree distribution

dd2 <- degree.distribution(g)

#dd2 is calculated for a node degree ranging from 0 to the max degree found in the network, hence I
#define the following

k_range <- 0:(length(dd2)-1)

#Get also the cumulative distribution

dd2_cum <- degree.distribution(g, cumulative=TRUE)

#From now on, I get some warnings in the log-log plots, but it should not be anything to worry about

pdf("frequency_degree2.pdf")
par( mar = c(4.5,5, 2, 1) + 0.1)

plot(k_range,dd2, xlab="Degree",
ylab="Frequency", pch=3, col=3, type="b",cex.axis=1.4,cex.lab=1.6)

dev.off()

pdf("frequency_degree2-loglog.pdf")

par( mar = c(4.5,5, 2, 1) + 0.1)

plot(k_range,dd2, xlab="Degree",
ylab="Frequency",log="xy", pch=3, col=3, type="b", ylim=c(2e-3,3e-1), yaxt="n",cex.axis=1.4,cex.lab=1.6)

axis(side=2, at=c( 2e-3, 3e-2, 3e-1),
labels=_expression_(2%*%10^3,3%*%10^2,3%*%10^1 ),cex.lab=1.4,cex.axis=1.6)

dev.off()

pdf("frequency_degree2-cumulative-loglog.pdf")

par( mar = c(4.5,5, 2, 1) + 0.1)

plot(0:(length(dd2)-1),dd2_cum, xlab="Degree",
ylab="Cumulative frequency",log="xy", pch=3, col=3, type="b",cex.axis=1.4,cex.lab=1.6)
dev.off()

pdf("frequency_degree2-cumulative.pdf")

par( mar = c(4.5,5, 2, 1) + 0.1)

plot(0:(length(dd2)-1),dd2_cum, xlab="Degree",
ylab="Cumulative frequency", pch=3, col=3, type="b",cex.axis=1.4,cex.lab=1.6)
dev.off()

#Now I fit the degree distribution to a power law, but I get an error if I try to set xmin. Why????

pfit <- power.law.fit(dd)

pdf("frequency_degree2-loglog-and-fit.pdf")

par( mar = c(4.5,5, 2, 1) + 0.1)

plot(k_range,dd2, xlab="Degree",
ylab="Frequency",log="xy", ylim=c(0.002,0.5), pch=3, col=3, type="b",cex.axis=1.4,cex.lab=1.6, yaxt="n")

axis(side=2, at=c( 2e-3, 3e-2, 3e-1),
labels=_expression_(2%*%10^3,3%*%10^2,3%*%10^1 ),cex.lab=1.4,cex.axis=1.6)

lines(2:20,1*(2:20)^(-coef(pfit)) )

dev.off()

#Now calculate graph diameter

d <- get.diameter(g)

E(g, path=d)$color <- "red"
E(g, path=d)$width <- 1

V(g)[ d ]$label.color <- "black"
V(g)[ d ]$color <- "red"

pdf("airport_graph_diameter.pdf")

plot(g, layout=layout.fruchterman.reingold,
vertex.label.dist=0.5,vertex.label=NA, vertex.size=5)

title(main="Diameter of the random graph",
xlab="created by igraph 0.5")
axis(1, labels=FALSE, tick=TRUE)
axis(2, labels=FALSE, tick=TRUE)

dev.off()

#Now calculate the vertex betweenness
#and find the node with the highest betweenness

ver_bet <- betweenness(g)

## max_bet <- ver_bet[which(ver_bet==max(ver_bet))]

## print("The maximumum value of the betweenness is, ")
## print (max_bet)

## node_max_bet <- which(ver_bet==max(ver_bet))-1 #consistent with the labelling of vertices starting from 0 and not from 1

## print("The node with the maximumum value of the betweenness is, ")
## print (node_max_bet)

print("The mean vertex betweenness is, ")
print(mean(ver_bet))

#Now calculate the clustering coefficients (also called transitivity of a graph)

clust <- transitivity(g, type="local")

#Now there is a problem: the definition of clustering coefficient may lead to a 0/0
# which becomes a NaN for R. The idea is then to remove this NaN and set them all to zero (as they should)

bad <- sapply(clust, function(x) all(is.nan(x)))
clust[bad] <- 0 #done!

print("The mean clustering coefficient is, ")
print(mean(clust))

#Now I can plot the clustering coefficients as a function of the node degrees

#The following figure looks awful, am I doing anything wrong?

pdf("clustering-vs-degree.pdf")
par( mar = c(4.5,5, 2, 1) + 0.1)

plot(dd,clust, xlab="Degree",
ylab="Clustering coefficient", pch=4, col="blue", type="p",cex.axis=1.4,cex.lab=1.6)

dev.off()

# calculate average nearest-neighbour degree

knn <- graph.knn(g)

#knn$knnk lists the average nearest neighbour degree for vertices having at least one neighbour, hence
# the vector is one element shorter than dd2

pdf("nearest-neighbour-frequency.pdf")
par( mar = c(4.5,5, 2, 1) + 0.1)

plot(knn$knnk, xlab="Degree",
ylab="Frequency", pch=3, col=3, type="b",cex.axis=1.4,cex.lab=1.6)

dev.off()

pdf("nearest-neighbour-frequency-log-log.pdf")
par( mar = c(4.5,5, 2, 1) + 0.1)

plot(knn$knnk, xlab="Degree",
ylab="Frequency", pch=3, col=3,log="xy", type="b",cex.axis=1.4,cex.lab=1.6)

dev.off()

#Now get the shortest paths in the network

mean_shortest_path <- average.path.length(g)

print("The mean shortest path is, ")
print(mean_shortest_path)

h_list <- path.length.hist(g)

print ("So far so good")

_______________________________________________
igraph-help mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/igraph-help

From:	Gábor Csárdi
Subject:	Re: [igraph] Real Network Analysis
Date:	Fri, 19 Jun 2009 00:35:47 +0200