5 Exploration: Higher-order analysis of real-world pathway data

Ingo Scholtes
Data Analytics Group
Department of Informatics (IfI)
University of Zurich

September 5, 2018

In the last (open-ended) exploration of this first tutorial session, you have the chance to use higher-order network analytics to study real data sets for yourself.

Details on the available data sets can be found here. Using the methods introduced in the previous unit, you can, for instance, address the following questions (in ascending order of difficulty):

  • Repeat the analysis of higher-order centralities in the toy example from 1.3 with the closeness centrality of nodes. What do you observe?
  • Test the prediction performance of higher-order models for the London Tube and/or the Wikipedia clickstream data set. Does the prediction performance saturate at k=2 as it does for the US Flight data?
  • Study the difference between higher- and first-order centralities in the dynamic social networks contained in the SQLite database file.
  • Use the higher-order framework to identify those paths of length k that show "anomalous statistics" (compared to a memoryless null model). Which are these paths and how can we interpret the result?
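As a starting point for the last question, the following is a minimal sketch in plain Python (it deliberately does not use pathpy, and it is not how pathpy implements its null models). It assumes that "memoryless" means a first-order Markov chain: the expected probability of continuing a path a → b with node c is the first-order transition probability from b to c, regardless of a. The function names `count_subpaths` and `second_order_deviations` and the toy paths are hypothetical and chosen for illustration only.

```python
from collections import Counter


def count_subpaths(paths, k):
    """Count all contiguous sub-paths with k edges (k + 1 nodes)."""
    counts = Counter()
    for path in paths:
        for i in range(len(path) - k):
            counts[tuple(path[i:i + k + 1])] += 1
    return counts


def second_order_deviations(paths):
    """Map each candidate path (a, b, c) to (observed, expected) probabilities.

    observed: relative frequency of c among the observed continuations of a -> b
    expected: first-order transition probability b -> c (memoryless assumption)
    """
    edges = count_subpaths(paths, 1)
    two_paths = count_subpaths(paths, 2)

    # Total number of observed transitions leaving each node
    out_weight = Counter()
    for (u, _), n in edges.items():
        out_weight[u] += n

    # Number of observed two-paths that start with the edge (a, b)
    continuations = Counter()
    for (a, b, _), n in two_paths.items():
        continuations[(a, b)] += n

    result = {}
    for (a, b), total in continuations.items():
        for (u, c), n in edges.items():
            if u != b:
                continue
            observed = two_paths.get((a, b, c), 0) / total
            expected = n / out_weight[b]
            result[(a, b, c)] = (observed, expected)
    return result


# Toy data (made up): travellers from a always continue to d, those from b to e
toy_paths = [('a', 'c', 'd')] * 2 + [('b', 'c', 'e')] * 2
deviations = second_order_deviations(toy_paths)

for path, (observed, expected) in sorted(deviations.items()):
    print(path, 'observed:', observed, 'expected:', expected)
```

In this toy example the memoryless model expects every continuation from c with probability 0.5, while the paths a → c → d and b → c → e are observed with probability 1.0 and the other two candidates never occur. On real data you would additionally need to judge whether such deviations are statistically significant, which this sketch does not attempt.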

The data sets and questions above are mere suggestions for your exploration of higher-order network analytics. You are welcome to study other data sets or questions instead. Please reach out to me if you encounter any problems or have questions (also after the tutorial). You can reach me at scholtes@ifi.uzh.ch.

In [1]:
import pathpy as pp

# Flight data
flight_paths = pp.Paths.read_file('../data/US_flights.ngram', frequency=False)

# Clickstreams: max_ngram_length=100 skips overly long n-grams,
# including a single path with more than 400 clicks
clickstreams = pp.Paths.read_file('../data/wikipedia_clickstreams.ngram', frequency=False, max_ngram_length=100)

# London Tube trips based on Oyster card checkin-checkouts
tube_net = pp.Network.read_file('../data/tube.edges', separator=';')
od_stats = pp.path_extraction.read_origin_destination('../data/tube_od.csv', separator=';')
tube_trips = pp.path_extraction.paths_from_origin_destination(od_stats, tube_net)
2018-08-21 15:51:57 [Severity.INFO]	Reading ngram data ... 
2018-08-21 15:51:59 [Severity.INFO]	finished. Read 286810 paths with maximum length 13
2018-08-21 15:51:59 [Severity.INFO]	Calculating sub path statistics ... 
2018-08-21 15:51:59 [Severity.INFO]	finished.
2018-08-21 15:51:59 [Severity.INFO]	Reading ngram data ... 
2018-08-21 15:52:00 [Severity.INFO]	finished. Read 51318 paths with maximum length 99
2018-08-21 15:52:00 [Severity.INFO]	Calculating sub path statistics ... 
2018-08-21 15:52:02 [Severity.INFO]	finished.
2018-08-21 15:52:02 [Severity.INFO]	Reading edge list ... 
2018-08-21 15:52:02 [Severity.INFO]	finished.
2018-08-21 15:52:02 [Severity.INFO]	Reading origin/destination statistics from file ...
2018-08-21 15:52:02 [Severity.INFO]	Finished.
2018-08-21 15:52:19 [Severity.INFO]	Starting origin destination path calculation ...
2018-08-21 15:54:42 [Severity.INFO]	finished.
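
To make the loading parameters used above more concrete, here is a small plain-Python sketch of how n-gram path data could be read. It is not pathpy's implementation. It assumes that each line holds one comma-separated path, that without a frequency column every line counts as a single observation, and that n-grams with more nodes than `max_ngram_length` are skipped (which would be consistent with the maximum path length of 99 reported in the log above). The helper `read_ngrams` and the sample lines are hypothetical.

```python
from collections import Counter


def read_ngrams(lines, separator=',', max_ngram_length=100):
    """Hypothetical helper: count paths from n-gram lines, one observation per line."""
    paths = Counter()
    for line in lines:
        nodes = tuple(field.strip() for field in line.strip().split(separator))
        if len(nodes) < 2:
            continue  # a path needs at least two nodes
        if len(nodes) > max_ngram_length:
            continue  # assumed semantics: over-long n-grams are skipped entirely
        paths[nodes] += 1
    return paths


# Made-up sample lines in the assumed format (not taken from the tutorial data)
sample = ["JFK,ATL,MIA", "JFK,ATL,MIA", "SFO,DEN", "A,B,C,D,E"]
counts = read_ngrams(sample, max_ngram_length=4)

# Path length is measured in edges, i.e. number of nodes minus one
max_length = max(len(path) - 1 for path in counts)
print(len(counts), "distinct paths, maximum length", max_length)
```

With `max_ngram_length=4`, the five-node line is dropped, so the longest remaining path has two edges. Checking the number of paths and the maximum length in this way is a quick sanity test before running more expensive analyses.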