3 Higher-order analysis of path data in pathpy

Ingo Scholtes
Data Analytics Group
Department of Informatics (IfI)
University of Zurich

September 5 2018

So far, we have focused on network models, but the real purpose of pathpy is to fit and analyse higher-order models for paths in complex networks. For this we can use the class HigherOrderNetwork. Let us look at the docstring of this class.

**TODO:** Import the module pathpy and use the `help` function to print the documentation of class `pp.HigherOrderNetwork`.

In [1]:
import pathpy as pp
help(pp.HigherOrderNetwork)
Help on class HigherOrderNetwork in module pathpy.classes.higher_order_network:

class HigherOrderNetwork(pathpy.classes.network.Network)
 |  A higher-order graphical model of path statistics with order k.
 |  
 |  Attributes:
 |  -----------
 |  
 |  edges: dict
 |      In a higher-order network, edge weights as well as in- and out
 |      weights of nodes are numpy arrays consisting of two weight components [w0, w1].
 |      w0 counts the weight of an edge based on its occurrence in a subpaths
 |      while w1 counts the weight of an edge based on its occurrence in
 |      a longest path. As an illustrating example, consider the single
 |      path a -> b -> c. In the first-order network, the weights of edges
 |      # (a,b) and (b,c) are both (1,0). In the second-order network, the
 |      weight of edge (a-b, b-c) is (0,1).
 |      Here, we will store these weights (as well as in- and out-degrees in
 |      node and edge attributes)
 |  
 |  Method resolution order:
 |      HigherOrderNetwork
 |      pathpy.classes.network.Network
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, paths, k=1, null_model=False, separator=None)
 |      Generates a k-th-order representation based on the given path statistics.
 |      
 |      Parameters
 |      ----------
 |      paths: Path
 |          An instance of class Paths, which contains the path statistics to be used in
 |          the generation of the k-th order representation
 |      k: int
 |          The order of the network representation to generate. For the default case of
 |          k=1, the resulting representation corresponds to the usual (first-order)
 |          aggregate network, i.e. links connect nodes and link weights are given by the
 |          frequency of each interaction. For k>1, a k-th order node corresponds to a
 |          sequence of k nodes. The weight of a k-th order link captures the frequency
 |          of a path of length k.
 |      null_model: bool
 |          For the default value False, link weights capture the frequencies of paths of length k 
 |          in the underlying paths object. If True, link weights capture expected frequencies
 |          under the assumption of independent links (i.e. corresponding to a first-order
 |          Markov model).
 |      separator: str
 |          The separator character to be used in higher-order node names. If this parameter 
 |          is not specified, the separator character of the underlying paths object will be 
 |          used.
 |  
 |  adjacency_matrix(self, include_subpaths=True, weighted=True, transposed=False)
 |      Returns a sparse adjacency matrix of the higher-order network. By default,
 |      the entry corresponding to a directed link source -> target is stored in row s and
 |      column t and can be accessed via A[s,t].
 |      
 |      Parameters
 |      ----------
 |      include_subpaths: bool
 |          if set to True, the returned adjacency matrix will account for the occurrence
 |          of links of order k (i.e. paths of length k-1) as subpaths
 |      weighted: bool
 |          if set to False, the function returns a binary adjacency matrix.
 |          If set to True, adjacency matrix entries will contain the weight of an edge.
 |      transposed: bool
 |          whether to transpose the matrix or not.
 |      
 |      Returns
 |      -------
 |      numpy cooc matrix
 |  
 |  degrees_of_freedom(self, assumption='paths')
 |      Calculates the degrees of freedom (i.e. number of parameters) of
 |      this k-order model. Depending on the modeling assumptions, this either
 |      corresponds to the number of paths of length k in the first-order network
 |      or to the number of all possible k-grams. The degrees of freedom of a model
 |      can be used to assess the model complexity when calculating, e.g., the
 |      Bayesian Information Criterion (BIC).
 |      
 |      Parameters
 |      ----------
 |      assumption: str
 |          if set to 'paths', for the degree of freedom calculation in the BIC, only
 |          paths in the first-order network topology will be considered. This is needed
 |          whenever we are interested in a modeling of paths in a given network topology.
 |          If set to 'ngrams' all possible n-grams will be considered, independent of
 |          whether they are valid paths in the first-order network or not. The 'ngrams'
 |          and the 'paths' assumption coincide if the first-order network is fully
 |          connected.
 |      
 |      Returns
 |      -------
 |      int
 |  
 |  first_order_nodes(self)
 |      Returns a set of nodes projected to a first-order network
 |  
 |  higher_order_node_to_path(self, node)
 |      Helper function that transforms a node in a higher-order network of order k
 |      into a corresponding path of length k-1. For a higher-order node 'a-b-c-d'
 |      this function will return ('a','b','c','d')
 |      
 |      Parameters
 |      ----------
 |      node: str
 |          The higher-order node to be transformed to a path.
 |      
 |      Returns
 |      -------
 |      tuple
 |  
 |  higher_order_path_to_first_order(self, path)
 |      Maps a path in the higher-order network to a path in the first-order network.
 |      As an example, the second-order path ('a-b', 'b-c', 'c-d') of length two is mapped
 |      to the first-order path ('a','b','c','d') of length four.
 |      In general, a path of length l in a network of order k is mapped to a path of
 |      length l+k-1 in the first-order network.
 |      
 |      Parameters
 |      ----------
 |      path: str
 |          The higher-order path that shall be mapped to the first-order network
 |      
 |      Returns
 |      -------
 |      tuple
 |  
 |  laplacian_matrix(self, include_subpaths=True)
 |      Returns the transposed Laplacian matrix corresponding to the higher-order network.
 |      
 |      Parameters
 |      ----------
 |      include_subpaths: bool
 |          Whether or not subpath statistics shall be included in the calculation of
 |          matrix weights
 |      
 |      Returns
 |      -------
 |  
 |  likelihood(self, paths, log=True)
 |      Calculates the likelihood of this higher-order model under the observed path 
 |      statistics given in paths.
 |  
 |  model_size(self)
 |      Returns the number of non-zero elements in the adjacency matrix
 |      of the higher-order model.
 |  
 |  node_to_name_map(self)
 |      Returns a dictionary that can be used to map node names to matrix/vector indices
 |  
 |  path_to_higher_order_nodes(self, path, k=None)
 |      Helper function that transforms a path of first-order nodes into a
 |      sequence of k-order nodes using the separator character of the
 |      HigherOrderNetwork instance
 |      
 |      Parameters
 |      ----------
 |      path:
 |          the path tuple to turn into a sequence of higher-order nodes
 |      k: int
 |          the order of the representation to use (default: order of the
 |          HigherOrderNetwork instance)
 |      
 |      Returns
 |      -------
 |      list
 |      
 |      Examples
 |      --------
 |      
 |      Consider an example path (a,b,c,d) with a separator string '-'
 |      
 |      >>> from pathpy import Paths
 |      >>> path_tuple = ('a', 'b', 'c', 'd')
 |      >>> paths = Paths()
 |      >>> paths.add_path_tuple(path_tuple)
 |      >>> hon = HigherOrderNetwork(paths, separator='-')
 |      >>> hon.path_to_higher_order_nodes(path_tuple, k=1)
 |      ['a', 'b', 'c', 'd']
 |      >>> hon.path_to_higher_order_nodes(path_tuple, k=2)
 |      ['a-b', 'b-c', 'c-d']
 |      >>> hon.path_to_higher_order_nodes(path_tuple, k=3)
 |      ['a-b-c', 'b-c-d']
 |  
 |  summary(self)
 |      Returns a string summary of this higher-order
 |      network
 |  
 |  total_edge_weight(self)
 |      Returns the sum of all edge weights
 |  
 |  transition_matrix(self, include_subpaths=True)
 |      Returns a (transposed) random walk transition matrix corresponding to the
 |      higher-order network.
 |      
 |      Parameters
 |      ----------
 |      include_subpaths: bool
 |          whether or not to include subpath statistics in the transition probability
 |          calculation (default True)
 |      
 |      Returns
 |      -------
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  generate_possible_paths(network, k)
 |      Returns all paths of length k that can
 |      possibly exist in a given network
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from pathpy.classes.network.Network:
 |  
 |  __str__(self)
 |      Returns the default string representation of this network instance
 |  
 |  add_edge(self, v, w, **edge_attributes)
 |      Adds an edge to a network and assigns arbitrary
 |      key-value pairs as edge attribute.
 |      
 |      Parameters:
 |      -----------
 |      v: str
 |          String label of the source node
 |      w: str
 |          String label of the target node
 |      edge_attributes: dict
 |          Key-value pairs that will be stored as
 |          named edge attributes in a dictionary. An
 |          attribute set via network.add_edge(v, w, x=42) can be
 |          assessed via network.edges[(v,w)]['x'].
 |          Any edge will have the default attribute 'weight', which
 |          is set to 1.0 by default. Edge weights are overwritten
 |          if an additional weighted edge is added later
 |          (see example below).
 |      
 |      Example:
 |      --------
 |          >>> network.add_edge(v,w)
 |          >>> print(network.edges['weight'])
 |          >>> 1.0 
 |          >>> network.add_edge(v,w, weight=2.0)
 |          >>> 2.0
 |  
 |  add_node(self, v, **node_attributes)
 |      Adds a node to a network and assigns arbitrary
 |      node attributes.
 |      
 |      Parameters:
 |      -----------
 |      node_attributes: dict
 |          Key-value pairs that will be stored as
 |          named node attributes in a dictionary. An
 |          attribute set via network.add_node(v, x=42) can be
 |          assessed via network.nodes[v]['x']. Any node in an undirected
 |          network will have the default attributes 'degree', 'inweight',
 |          and 'outweight'. Any node in a directed network will have the 
 |          default attributes 'indegree', 'outdegree', 'inweight', and 'outweight'.
 |          See examples below.
 |      
 |      Example:
 |      --------
 |          >>> network = pathpy.Network(directed=False)
 |          >>> network.add_node(v)
 |          >>> print(network.nodes[v])
 |          >>> {'inweight': 0.0, 'outweight': 0.0, 'degree': 0}
 |          
 |          >>> network = pathpy.Network(directed=True)
 |          >>> network.add_node(v)
 |          >>> print(network.nodes[v])
 |          >>> {'inweight': 0.0, 'outweight': 0.0, 'indegree': 0, 'outdegree': 0}
 |  
 |  ecount(self)
 |      Returns the number of links
 |  
 |  find_edges(self, select_nodes=<function Network.<lambda> at 0x000001C4F8EA0AE8>, select_edges=<function Network.<lambda> at 0x000001C4F8EA0B70>)
 |      Returns all edges that satisfy a given condition. Edges can be selected based
 |      on attributes of the adjacent nodes as well as attributes of the edge. In the select_edges
 |      lambda function,.
 |      
 |      Parameters
 |      ----------
 |      select_nodes: 
 |          a lambda function that takes two parameters v, w corresponding to the source and 
 |          target node of an edge. All edges for which the lambda function returns True will be 
 |          selected. Default is lambda v,w: True.
 |      select_edges:
 |          a lambda function that takes a single parameter e corresponding to an edge tuple. 
 |          Edge attributes can be accessed by e['attr']. All edges for which the lambda function 
 |          returns True will be selected.  Default is lambda e: True.
 |      
 |      Example:
 |      >>> network.find_edges(select_nodes = lambda v,w: if v['desired_node_property'] return True, 
 |                             select_edges = lambda e: if e['desired_edge_propert'] return True)
 |  
 |  find_nodes(self, select_node=<function Network.<lambda> at 0x000001C4F8EA09D8>)
 |      Returns all nodes that satisfy a given condition. In the select_node
 |      lambda function, node attributes can be accessed by calling v['attr']
 |  
 |  ncount(self)
 |      Returns the number of nodes
 |  
 |  remove_edge(self, source, target)
 |      Remove an edge and all of its attributes from the network.
 |      
 |      Parameters
 |      ----------
 |      source: str
 |          Source node of the edge to remove
 |      target: str
 |          Target node of the edge to remove
 |  
 |  remove_node(self, v)
 |      Removes a node and all of its attributes from the network.
 |  
 |  to_undirected(self)
 |      Returns an undirected copy of the network, in which all
 |      node and edge properties are removed.
 |  
 |  to_unweighted(self)
 |      Returns an unweighted copy of a directed or undirected network.
 |      In this copy all edge and node properties of the original network
 |      are removed, but the directionality of links is retained.
 |  
 |  write_file(self, filename, separator=',', weighted=False, header=False)
 |      Writes a network to an edge file
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from pathpy.classes.network.Network:
 |  
 |  from_paths(paths) from builtins.type
 |  
 |  from_sqlite(cursor, directed=True) from builtins.type
 |      Returns a new Network instance generated from links obtained 
 |      from an SQLite cursor. The cursor must refer to a table with at least
 |      two columns
 |      
 |              source target
 |      
 |      in which each row contains one link. Additional columns will be used as
 |      named edge properties. Since columns are accessed by name this function requires that a
 |      row factory object is set for the SQLite connection prior to cursor creation,
 |      i.e. you should set
 |      
 |              connection.row_factory = sqlite3.Row
 |      
 |      Parameters
 |      ----------
 |      cursor:
 |          The SQLite cursor to fetch rows from. 
 |      directed: bool
 |          Whether or not links should be interpreted as directed. Default is True.
 |      
 |      Returns
 |      -------
 |      Network:
 |          A Network instance created from the SQLite database.
 |  
 |  from_temporal_network(tempnet) from builtins.type
 |      Returns a time-aggregated directed network representation
 |      of a temporal network. The number of occurrences of
 |      the same edge at different time stamps is captured
 |      by edge weights.
 |  
 |  read_file(filename, separator=',', weighted=False, directed=False, header=False) from builtins.type
 |      Reads a network from an edge list file
 |      
 |      Reads data from a file containing multiple lines of *edges* of the
 |      form "v,w,frequency,X" (where frequency is optional and X are
 |      arbitrary additional columns). The default separating character ','
 |      can be changed. In order to calculate the statistics of paths of any length,
 |      by default all subpaths of length 0 (i.e. single nodes) contained in an edge
 |      will be considered.
 |      
 |      Parameters
 |      ----------
 |      filename : str
 |          path to edgelist file
 |      separator : str
 |          character separating the nodes
 |      weighted : bool
 |          is a weight given? if ``True`` it is the last element in the edge
 |          (i.e. ``a,b,2``)
 |      directed : bool
 |          are the edges directed or undirected
 |      header: bool
 |          if true skip the first row, useful if header row in file
 |      
 |      Returns
 |      -------
 |      Network:
 |          a ``Network`` object obtained from the edgelist
 |  
 |  ----------------------------------------------------------------------
 |  Static methods inherited from pathpy.classes.network.Network:
 |  
 |  leading_eigenvector(A, normalized=True, lanczos_vecs=None, maxiter=None)
 |      Compute normalized leading eigenvector of a given matrix A.
 |      
 |      Parameters
 |      ----------
 |      A:
 |          sparse matrix for which leading eigenvector will be computed
 |      normalized: bool
 |          whether or not to normalize, default is True
 |      lanczos_vecs: int
 |          number of Lanczos vectors to be used in the approximate
 |          calculation of eigenvectors and eigenvalues. This maps to the ncv parameter
 |          of scipy's underlying function eigs.
 |      maxiter: int
 |          scaling factor for the number of iterations to be used in the
 |          approximate calculation of eigenvectors and eigenvalues.
 |      
 |      Returns
 |      -------
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from pathpy.classes.network.Network:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

From a higher-order network analytic point of view, standard graphs or networks are first-order probabilistic generative models for paths in complex networks. As we have seen in 1.2, they can be viewed as maximum entropy models that consider first-order dyad statistics (i.e. edge frequencies), while ignoring higher-order dependencies in real-world path, sequence, or time series data.

In the following works, we have studied measures for higher-order correlations in such data and we generalised network models to higher-order models with arbitrary order:

  • R Pfitzner, I Scholtes, A Garas, CJ Tessone, F Schweitzer: Betweenness Preference: Quantifying Correlations in the Topological Dynamics of Temporal Networks, In Physical Review Letters, May 2013, arXiv 1208.0588

  • I Scholtes, N Wider, R Pfitzner, A Garas, CJ Tessone, F Schweitzer: Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks, In Nature Communications, September 2014, arXiv 1307.4030

  • I Scholtes, N Wider, A Garas: Higher-Order Aggregate Networks in the Analysis of Temporal Networks: Path structures and centralities, In The European Physical Journal B, March 2016, arXiv 1508.06467

  • Yan Zhang, Antonios Garas, Ingo Scholtes: Controllability of temporal networks: An analysis using higher-order networks, preprint, January 2017, arXiv 1701.06331

  • I Scholtes: When is a Network a Network? Multi-Order Graphical Model Selection in Pathways and Temporal Networks, In KDD'17, February 2017, arXiv 1702.05499

A broader overview of the research on higher-order models for complex systems is available the following preprint:

  • R Lambiotte, M Rosvall, I Scholtes: Understanding Complex Systems: From Networks to Optimal Higher-Order Models, preprint, June 2018, arXiv 1806.05977

The data analysis and modelling framework outlined in these works builds on a generalisation of standard, first-order networks to $k$-dimensional De Bruijn graph models for paths in complex networks.

The class HigherOrderNetwork allows us to generate such higher-order network models of paths. In the documentation, we find that the constructor takes a parameter paths, i.e. the statistics of the observed paths that we want to model. With the parameter k we specify the order $k$ of the higher-order model that we want to fit. To understand this better, let us do this for our toy example.

**TODO:** Read the toy example from unit 1.2 from the file `data/toy_paths.ngram`, generate a **first-order** model instance `hon_1` and print a summary of the resulting instance.

In [2]:
toy_paths = pp.Paths.read_file('../data/toy_paths.ngram')
print(toy_paths)

hon_1 = pp.HigherOrderNetwork(toy_paths, k=1)
print(hon_1)
2018-09-04 22:07:08 [Severity.INFO]	Reading ngram data ... 
2018-09-04 22:07:08 [Severity.INFO]	finished. Read 2 paths with maximum length 2
2018-09-04 22:07:08 [Severity.INFO]	Calculating sub path statistics ... 
2018-09-04 22:07:08 [Severity.INFO]	finished.
Total path count: 		20.0 
[Unique / Sub paths / Total]: 	[2.0 / 100.0 / 120.0]
Nodes:				5 
Edges:				4
Max. path length:		2
Avg path length:		2.0 
Paths of length k = 0		0.0 [ 0.0 / 60.0 / 60.0 ]
Paths of length k = 1		0.0 [ 0.0 / 40.0 / 40.0 ]
Paths of length k = 2		20.0 [ 2.0 / 0.0 / 20.0 ]

Higher-order network of order k = 1

Nodes:				5
Links:				4
Total weight (subpaths/longest paths):	40.0/0.0

This generates a first-order model of our paths, with five nodes $a,b,c,d$ and $e$, and four links $(a,c), (b,c), (c,d), (c,e)$. It is identicaly to the Network instance that we have previously created using Network.from_paths. Indeed, each HigherOrderNetwork instance is derived from the class Network, which means we can store edge and node attributes and visualise it by exactly the same methods.

**TODO:** Plot the `HigherOrderModel` instance `hon_1` and print the weight of all edges.

In [3]:
style = { 'label_offset': [0,-1], 'label_color' : 'black', 'width': 800, 'height': 250}
pp.visualisation.plot(hon_1, **style)
for e in hon_1.edges:
    print(e, hon_1.edges[e]['weight'])
[save svg]
('c', 'e') [10.  0.]
('b', 'c') [10.  0.]
('a', 'c') [10.  0.]
('c', 'd') [10.  0.]

This output confirms that a HigherOrderModel with $k=1$ is identical to our Network model. WIth one exception: edge weights are vector-valued. Just like in Paths, the first entry captures the sub-path frequency while the second entry counts the occurrence of an edge as a longest path.

We can see this network as a first-order model for paths where edges are paths of length one. That is, in a model with order $k=1$ edge weights capture the statistics of paths of length $k=1$.

We can generalise this idea to k-th-order models for paths, where nodes are paths of length $k-1$ while edge weights capture the statistics of paths of length $k$. We can generate such a $k$-th order model by performing a line graph transformation on a model with order $k-1$. That is, edges in the model of order $k-1$ become nodes in the model with order $k$. We then draw edges between higher-order nodes whenever there is a possible path of length $k$ in the underlying network. The result is a $k$-dimensional De Bruijn graph model for paths. Let us try this in our example.

**TODO:** Create a second-order model `hon_2` for `toy_paths`. Visualise the model and print the weights of all edges.

In [4]:
hon_2 = pp.HigherOrderNetwork(toy_paths, k=2)
pp.visualisation.plot(hon_2, **style)

for e in hon_2.edges:
    print(e, hon_2.edges[e])
[save svg]
('b,c', 'c,e') {'weight': array([ 0., 10.])}
('a,c', 'c,d') {'weight': array([ 0., 10.])}

Each of the four edges in the first-order model is now represented by a node in the second-order model. We further have two directed edges $(a-c, c-d)$ and $(b-c,c-e)$ that represent the two paths of length two that occur in our data.

This is important because it captures to what extent the paths that we observe in our data deviate from what we would expect based on the (first-order) network topology of the system. Considering such a first-order model, all four paths $a \rightarrow c \rightarrow d, a \rightarrow c \rightarrow e, b \rightarrow c \rightarrow d$ and $b \rightarrow c \rightarrow e$ of length two are possible. If edges were statistically independent we would expect those four paths to occur with the same frequency.

Another way to express this independence assumption is to consider Markov chain models for the sequences of nodes traversed by a path. In this view, independently occurring edges translate to a memoryless first-order Markov process for the node sequence. In our example, we expect paths $a \rightarrow c \rightarrow d$ and $a \rightarrow c \rightarrow e$ to occur with the same probability, i.e. the next nodes $d$ or $e$ on a path through $c$ are independent from the previous node $a$, their probabilities only depending on the relative frequency of edges $(c,d)$ vs. $(c,e)$. In our toy example, we have a total of 20 observed paths of length two, so we expect each of the path to occur 5 times on average.

pathpy can actually generate this null-model for paths within the space of possible second-order models. This allows us to compare how the observed path statistics deviate from a (Markovian) expectation.

**TODO:** Use the `null_model` parameter in the constructor of `HigherOrderNetwork` to generate a second-order null model `hon_2_null` for `toy_paths`. Visualise the model and output all edge weights.

In [5]:
hon_2_null = pp.HigherOrderNetwork(toy_paths, k=2, null_model=True)
pp.visualisation.plot(hon_2_null, **style)

for e in hon_2_null.edges:
    print(e, hon_2_null.edges[e])
[save svg]
('b,c', 'c,e') {'weight': array([0., 5.])}
('a,c', 'c,d') {'weight': array([0., 5.])}
('a,c', 'c,e') {'weight': array([0., 5.])}
('b,c', 'c,d') {'weight': array([0., 5.])}

We can easily find out which of the paths of length two occur more or less often than expected under the null model. We can just subtract the adjacency matrices of the two instances hon_2 and hon_2_null.

**TODO:** For all egdes in `hon_2_null`, calculate how much the observed frequency in `hon_2` deviates from the random expectation.

**Hint:** Use the function `HigherOrderNetwork.node_to_name_map()` to map node names to matrix indices.

In [7]:
A_2 = hon_2.adjacency_matrix()
idx_2 = hon_2.node_to_name_map()

A_2n = hon_2_null.adjacency_matrix()
idx_2n = hon_2_null.node_to_name_map()

for (v,w) in hon_2_null.edges:
    print((v,w), A_2[idx_2[v],idx_2[w]] - A_2n[idx_2n[v],idx_2n[w]])
('b,c', 'c,e') 5.0
('a,c', 'c,d') 5.0
('a,c', 'c,e') -5.0
('b,c', 'c,d') -5.0

This analysis confirms that the paths $b \rightarrow c \rightarrow e$ and $a \rightarrow c \rightarrow d$ occur five more times than we would expect at random, while the other two paths do occur five less times than expected (i.e. not at all).This deviation from our expectation changes the causal topology of the system, i.e. who can influence whom. In a network model we implicitly assume that paths are transitive, i.e. since $a$ is connected to $c$ and $c$ is connected to $d$ we assume that there is a path by which $a$ can influence $d$ via node $c$. The second-order model of our toy example reveals that this transitivity assumption is misleading, highlighting higher-order dependencies in our data that result in the fact that neither $a$ can influence $d$ nor $b$ can influence $e$.

Ranking nodes in higher-order networks

Higher-order models capture deviations from the transitivity assumption that we often implicitly make when we apply standard graph mining or network analysis techniques. An important class of methods that rely on this assumption are centrality measures, which are often used to rank nodes in networked systems.

Let us consider the betweenness value of a node $v$, which (in its unnormalized variant) captures for how many pairs of nodes $x,y$ the shortest path from $x$ to $y$ passes through $v$. This measure is important, because is teaches us something about the role of nodes in information flow. It further captures to what extent removing a node will affect the shortest paths between other nodes. Let us try this in our example:

**TODO:** Use the method `pp.algorithms.centralities.betweenness` to calculate the betweenness of node `c` in the first-order network model of `toy_paths`.

In [8]:
print(pp.algorithms.centralities.betweenness(hon_1)['c'])
2018-09-04 22:10:31 [Severity.INFO]	Calculating betweenness (order k = 1) ...
2018-09-04 22:10:31 [Severity.INFO]	finished.
4.0

This is easy to understand, as there are four pairs $(b,e)$, $(b,d)$, $(a,e)$, $(a,d)$ for which the shortest path in the topology passes through node $c$. However, this notion of importance of node $c$ is based on the assumption that paths are transitive and Markovian, which is not the case in our observed paths. pathpy actually allows us to calculate the betweenness of node $c$ in the observed paths toy_paths.

**TODO:** Use the method `pp.algorithms.centralities.betweenness` to calculate the betweenness of node `c` in the paths observed in `toy_paths`.

In [9]:
print(pp.algorithms.centralities.betweenness(toy_paths)['c'])
2018-09-04 22:10:38 [Severity.INFO]	Calculating betweenness in paths ...
2018-09-04 22:10:38 [Severity.INFO]	finished.
2.0

Not surprisingly, the betweenness of node $c$ in the actual paths is two, as $c$ only connects two pairs of nodes are connected via a (shortest) path. This change in centrality is captured in the higher-order model and pathpy allows us to directly generalize centrality measures to higher orders, simply by calling the functions on the class HigherOrderNetwork.

**TODO:** Use the method `pp.algorithms.centralities.betweenness` to calculate the betweenness of node `c` in a second-order model for `toy_paths`.

In [10]:
print(pp.algorithms.centralities.betweenness(hon_2)['c'])
2018-09-04 22:10:50 [Severity.INFO]	Calculating betweenness (order k = 2) ...
2018-09-04 22:10:50 [Severity.INFO]	finished.
2.0

We see that in this example (since we only have second-order dependencies) the second-order model accurately captures the betweenness of node $c$. Let us contrast this to the second-order null model.

**TODO:** Use the method `pp.algorithms.centralities.betweenness` to calculate the betweenness of node `c` in the second-order null model for `toy_paths`.

In [11]:
print(pp.algorithms.centralities.betweenness(hon_2_null)['c'])
2018-09-04 22:10:57 [Severity.INFO]	Calculating betweenness (order k = 2) ...
2018-09-04 22:10:57 [Severity.INFO]	finished.
4.0

This is what we expect, since this null model is simply a second-order representation of the first-order model, in which paths are transitive and memoryless.

Predicting node rankings with higher-order models

Let us now explore how we can apply higher-order models in practical prediction tasks. Assume we are given the task to rank nodes in a network according to how often they are traversed by paths. This has practical applications, e.g. predicting the most frequently clicked Web pages based on the topology of the link graph, or predicting the most congested airports based on the topology of flight connections. Let us study the latter example based on a data set on the paths of airline passengers in the United Stated.

For your convenience, we have partitioned the data set into a training and a validation set of equal size. Our goal will be to use the training set to learn (higher-order) network models. We then calculate a higher-order generalisation of pagerank described in this paper in these models to predict the ranking of airports according to the number of passengers in the validation set.

**TODO:** Use the `Paths.read_file` function to read path statistics from the ngram data file `US_flights_train.ngram`. Set the parameter `frequency` of the read_file function to `False`. Generate a first-, second-, and third-order model for the paths. Use `pp.algorithms.centralities.pagerank` to calculate the PageRank of nodes in the higher-order models.

In [12]:
training_paths = pp.Paths.read_file('../data/US_flights_train.ngram', frequency=False)

hon_1 = pp.HigherOrderNetwork(training_paths, k=1)
hon_2 = pp.HigherOrderNetwork(training_paths, k=2)
hon_3 = pp.HigherOrderNetwork(training_paths, k=3)

pr_1 = pp.algorithms.centralities.pagerank(hon_1)
pr_2 = pp.algorithms.centralities.pagerank(hon_2)
pr_3 = pp.algorithms.centralities.pagerank(hon_3)

# with this, we generate 
pr_1 = [pr_1[v] for v in training_paths.nodes]
pr_2 = [pr_2[v] for v in training_paths.nodes]
pr_3 = [pr_3[v] for v in training_paths.nodes]
2018-09-04 22:11:19 [Severity.INFO]	Reading ngram data ... 
2018-09-04 22:11:20 [Severity.INFO]	finished. Read 143405 paths with maximum length 13
2018-09-04 22:11:20 [Severity.INFO]	Calculating sub path statistics ... 
2018-09-04 22:11:20 [Severity.INFO]	finished.
2018-09-04 22:11:26 [Severity.INFO]	Calculating PageRank in 1-th order network...
2018-09-04 22:11:26 [Severity.INFO]	Calculating PageRank in 2-th order network...
2018-09-04 22:11:26 [Severity.INFO]	finished.
2018-09-04 22:11:26 [Severity.INFO]	Calculating PageRank in 3-th order network...
2018-09-04 22:11:26 [Severity.INFO]	finished.

We can now use the values as a prediction for the ranking of airports based on a first-, second-, and third-order model for passenger itineraries. To validate the prediction performance of these models, we need a ground truth. We can calculate this in the validation data set. Specifically, we are interested in a ranking of airports in terms of the passenger itineraries passing through any given airport.

**TODO:** Use the `Paths.read_file` function to read path statistics from the ngram data file `US_flights_validate.ngram`. Set the parameter `frequency` of the read_file function to `False`. Use the method `pp.algorithms.centralities.node_traversals` to calculate the number of passengers that through each airport in the validation data set.

In [13]:
validate_paths = pp.Paths.read_file('../data/US_flights_validate.ngram', frequency=False)

traversals = pp.algorithms.centralities.node_traversals(validate_paths)
rank_true = [traversals[v] for v in training_paths.nodes]
2018-09-04 22:11:56 [Severity.INFO]	Reading ngram data ... 
2018-09-04 22:11:57 [Severity.INFO]	finished. Read 143405 paths with maximum length 10
2018-09-04 22:11:57 [Severity.INFO]	Calculating sub path statistics ... 
2018-09-04 22:11:57 [Severity.INFO]	finished.
2018-09-04 22:11:57 [Severity.INFO]	Calculating node traversals...
2018-09-04 22:11:57 [Severity.INFO]	finished.

We can validate our models by calculating the Kendall-Tau rank correlations between the ground-truth ranking of airports, and the rankings that we obtain based on the different higher-order models.

**TODO:** Use the function `kendalltau` from the package `scipy.stats` to calculate the rank correlation between (i) the ground truth ranking in the validation data, and (ii) the rankings obtained based on the first-, second-, and third-order models respectively.

In [14]:
from scipy.stats import kendalltau

print('tau(pr_1, traversals) = {0}'.format(kendalltau(pr_1, rank_true).correlation))
print('tau(pr_2, traversals) = {0}'.format(kendalltau(pr_2, rank_true).correlation))
print('tau(pr_3, traversals) = {0}'.format(kendalltau(pr_3, rank_true).correlation))
tau(pr_1, traversals) = 0.7765415880849217
tau(pr_2, traversals) = 0.8724501645117442
tau(pr_3, traversals) = 0.8751149821774835

We make three observations:

  1. The first-order topology of flight routes alone allows us to generate a fairly good ranking of airports.
  2. We can considerably increase the prediction performance of our model by taking into account second-order dependencies in the observed path statistics. This confirms that higher-order models are useful to improve predictions in real-world systems.
  3. Interestingly, moving to the third-order model does not translate to a further increase of prediction performance. This suggests that the data set does not contain strong third-order dependencies.

Considering that in reality we often do not have ground-truth that allows us to test which order performs best, this highlights the problem that we must decide which order to use for a given data set. We will solve this riddle in session 2, when we introduce a method to learn the optimal order for a given data set.

Path statistics from origin-destination data

In the example above, the data provide us with full knowledge about the exact itinerary taken by each passenger. However, we are often confronted with situations where we do not have such detailed information about paths. Nevertheless, we often have aggregate information that allows us to generate path statistics: Consider a setting where we know (1) the topology of a transportation network, and (2) the origin and destination stations of individual passengers, i.e. where passengers start and finish their journey. Under the assumption that passengers travel along shortest paths, we can now use this information to extract the path statistics that we need.

pathpy provides a number of path extraction methods that help you to deal with such situations. For the situation described above, we can use the pp.path_extraction.paths_from_origin_destination method to generate path statistics based on tuples capturing origin/destination statistics and an instance of the class Network. Let us try this in a toy example.

**TODO:** Generate a directed network with six nodes and six edges $(a,c), (b,c), (c,d), (d,f), (d,g)$. Plot the network. Based on a list of tuples $(a, f, 5), (b, g, 10)$ capturing origin destination statistics, use the method `pp.path_extraction.paths_from_origin_destination` to generate a `Paths` object and print the result.

In [16]:
n = pp.Network(directed=True)
n.add_edge('a', 'c')
n.add_edge('b', 'c')
n.add_edge('c', 'd')
n.add_edge('d', 'f')
n.add_edge('d', 'g')

pp.visualisation.plot(n)

od_stats = [('a', 'f', 5), ('b', 'g', 10)]

paths = pp.path_extraction.paths_from_origin_destination(od_stats, n)
print(paths)
[save svg]
2018-09-04 22:13:08 [Severity.INFO]	Starting origin destination path calculation ...
2018-09-04 22:13:08 [Severity.INFO]	finished.
Total path count: 		15.0 
[Unique / Sub paths / Total]: 	[2.0 / 135.0 / 150.0]
Nodes:				6 
Edges:				5
Max. path length:		3
Avg path length:		3.0 
Paths of length k = 0		0.0 [ 0.0 / 60.0 / 60.0 ]
Paths of length k = 1		0.0 [ 0.0 / 45.0 / 45.0 ]
Paths of length k = 2		0.0 [ 0.0 / 30.0 / 30.0 ]
Paths of length k = 3		15.0 [ 2.0 / 0.0 / 15.0 ]

As expected, we get an instance with $15$ paths of length four. The shortest path $a \rightarrow c \rightarrow d \rightarrow f$ occurs 5 times, while the shortest path $b \rightarrow c \rightarrow d \rightarrow g$ occurs 10 times.

With this we close this session, and we will move to the open-ended exploration, in which you can apply higher-order models to analyse data on paths and origin-destination statistics.