Graph Data from Wikipedia

Building a Network from Unclean Data

I’m working through Complex Network Analysis in Python, and one of the earlier examples is just too good to pass up: the author generates useful network data by combing through links from a starting Wikipedia page.

This technique is called “Snowball Sampling” and entails starting from some seed, or starting point, and running an algorithm that “uses retrieved data to find more data,” hence the snowball. It does this by executing a Breadth-First Search from the seed, with a few thoughtful checks and data-cleaning steps that I’ll go through in the next section.
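
To make the shape of the algorithm concrete before diving into the author’s code, here’s a minimal, generic sketch of snowball sampling (my own toy version, not from the book, with a hypothetical get_links callback standing in for the Wikipedia lookup):

from collections import deque

def snowball(seed, get_links, max_layer=2):
    """Generic snowball sampler: BFS outward from seed.
    get_links(page) is assumed to return the pages that page links to."""
    todo = deque([(0, seed)])   # queue of (layer, page) pairs
    seen = {seed}               # pages already queued or visited
    edges = []                  # (source, target) links we discover
    while todo:
        layer, page = todo.popleft()
        if layer >= max_layer:  # the queue is in layer order, so we can stop here
            break
        for link in get_links(page):
            edges.append((page, link))
            if link not in seen:
                seen.add(link)
                todo.append((layer + 1, link))
    return edges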

In the Code

The code, copied from the book repository here, looks pretty dense but is actually quite elegant. I’m pasting it in its entirety, segmenting it to make observations on what’s going on under the hood.

Core Objects

After the relevant library imports, we start with our main objects:

  • F: the blank network object that will hold our data
  • SEED: the title of the starting page
  • todo_set: A set (not a list, for faster membership checks!) that holds all pages we still need to crawl
  • done_set: A set holding all pages we’ve already been to
from operator import itemgetter
import networkx as nx
import wikipedia

F = nx.DiGraph()

SEED = "Complex network".title()

todo_set = {SEED}      # The SEED itself
done_set = set()       # Nothing is done yet

For this next section, it’s extremely important that we keep track of which layer we’re currently operating on. Wikipedia is enormous, and if we’re not careful, our scraper will just go down the “one more link” rabbit hole and spend hours, if not days, scraping until it has covered the entire site.

Therefore, for each page we scrape, we want to catalog both its title and how many layers away from our seed it sits. This gives us:

  • todo_lst: A list of (layer, page_title) tuples
  • layer: The current (integer) layer value, representing our distance from the seed
  • page: The current page name as a string

We initialize todo_lst as below, because we haven’t yet begun scraping. The first values of layer and page are unpacked from it accordingly.

todo_lst = [(0, SEED)] # The SEED is in the layer 0

layer, page = todo_lst[0]

A Clever Conditional

Specific to this particular use case, the author also provides a tuple, STOPS, of links that are either of very little informational value or show up on virtually every page, and so provide no real insight into the network built around your seed.

STOPS = ("International Standard Serial Number",
         "International Standard Book Number",
         "National Diet Library",
         "International Standard Name Identifier",
         "International Standard Book Number (Identifier)",
         "Pubmed Identifier", "Pubmed Central",
         "Digital Object Identifier", "Arxiv",
         "Proc Natl Acad Sci Usa", "Bibcode",
         "Library Of Congress Control Number", "Jstor",
         "Doi (Identifier)", "Isbn (Identifier)",
         "Pmid (Identifier)", "Arxiv (Identifier)",
         "Bibcode (Identifier)", "Pmc (Identifier)",
         "Issn (Identifier)", "S2Cid (Identifier)")

Data Collection Loop

With all of these parts together, the following algorithm basically:

  • Removes the next (layer, page_title) pair from the todo_lst
  • Uses this to open up the Wikipedia article corresponding to the page_title
  • Finds all of the links the article contains while

    • Filtering out any pages that appear in our STOPS tuple (or that are “List Of …” index pages)
    • Tagging each newly discovered page with a layer value of current_layer + 1
    • Adding the new (layer, page_title) pair to the end of our todo_lst

This ordering is crucial, as it ensures that we scrape the pages in layer order (layer 0 before any 1s, all 1s before any 2s, etc.) until the only pages left in our todo_lst are those with a higher layer value than we’re interested in.

from warnings import filterwarnings

filterwarnings('ignore')

while layer < 2:
    # (1) Remove the current page from the to-do list and mark it as done
    del todo_lst[0]
    done_set.add(page)
    # print(layer, page)

    # (2) Try to download the page; if the lookup fails, skip to the next one
    try:
        wiki = wikipedia.page(page)
    except Exception:
        layer, page = todo_lst[0]
        # print("Could not load", page)
        continue

    # (3) Record every acceptable link as an edge, queueing unseen pages
    #     one layer further out
    for link in wiki.links:
        link = link.title()
        if link not in STOPS and not link.startswith("List Of"):
            if link not in todo_set and link not in done_set:
                todo_lst.append((layer + 1, link))
                todo_set.add(link)
            F.add_edge(page, link)

    # (4) Move on to the next (layer, page) pair in the queue
    layer, page = todo_lst[0]
print("{} nodes, {} edges".format(len(F), nx.number_of_edges(F)))
13506 nodes, 24391 edges

Some More Clever Data Fixes

The author also provides a nice chunk of code to resolve any naming inconsistencies between two pages of the same content (e.g. “Complex Network” vs the plural “Complex Networks”).

It does this by checking, for each page title, whether the graph also contains that title with an “s” appended (the plural of the same page), collecting any matches into a list, duplicates.

Then, they make use of nx.contracted_nodes(), whose second and third arguments are the two nodes networkx should treat as one. Concretely, if we had a linear network (A - B - C - D) and called nx.contracted_nodes() on the pair (B, C), then C would be folded into B, giving us a new network A - B - D (with B recording the merge). Finally, the self_loops=False argument ensures that we don’t preserve the old “connection/relationship between B and C” as a self-loop – they’re the same node now, and that self-referencing loop would be redundant.
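
To make that concrete, here’s a tiny, self-contained toy example (a four-node path, not the Wikipedia data):

import networkx as nx

P = nx.path_graph(["A", "B", "C", "D"])                 # A - B - C - D
M = nx.contracted_nodes(P, "B", "C", self_loops=False)  # fold C into B
print(list(M.nodes()))  # ['A', 'B', 'D']
print(list(M.edges()))  # [('A', 'B'), ('B', 'D')]
print(M.nodes["B"])     # {'contraction': {'C': {}}}

With that picture in mind, here is the author’s dedup code: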

F.remove_edges_from(list(nx.selfloop_edges(F)))  # drop any self-loops first
duplicates = [(node, node + "s") for node in F if node + "s" in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)

Similarly, they check for multi-word titles that are space-delimited vs hyphen-delimited but represent the same content.

duplicates = [(x, y) for x, y 
              in [(node, node.replace("-", " ")) for node in F]
              if x != y and y in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)

This part is a bit hand-wavy, but essentially any node that survives the deduping gets stuck with a new contraction attribute that records which node was merged into it in the previous step.

F.nodes['Complex Network']
{'contraction': {'Complex Networks': {}}}

And so this line just goes through and zeroes out all of the contraction values, because the attribute is useless in this instance, and its dictionary values make exporting the network to a standard format a chore.

nx.set_node_attributes(F, 0, "contraction")

Subgraphing

At this point, we have a ton of nodes, and even more edges connecting them.

print(len(F.nodes), len(F.edges))
13390 23223

But do we need all of this data?

Of the roughly 13k nodes we pulled into our dataset, over 10k of them only have degree 1 (a single in-edge from a neighbor and no out-edges connecting to any other node in our first two layers). This means that we can lop off nearly 80% of the data we collected without losing any nodes with genuinely interesting properties.

import pandas as pd

node_degrees = pd.Series([degree for title, degree in F.degree()])
node_degrees.value_counts().head()
1    10436
2     1596
3      605
5      230
4      184
dtype: int64

The author does this by defining a core set of nodes that all have degree >= 2.

core = [node for node, deg in dict(F.degree()).items() if deg >= 2]

Then they define a new, final graph G, derived from our original graph F but restricted to the nodes of interest.

G = nx.subgraph(F, core)

Not only does the node count go down considerably, but we also automatically rid ourselves of any edges connecting to nodes outside of our core set.

print(len(G.nodes), len(G.edges))
2954 12787
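
As a quick sanity check on that last point, here’s a toy example (unrelated to the Wikipedia data) showing that nx.subgraph() silently drops edges whose endpoints fall outside the kept node set:

import networkx as nx

H = nx.DiGraph([("X", "Y"), ("Y", "Z")])   # X -> Y -> Z
S = nx.subgraph(H, ["X", "Y"])             # keep only X and Y
print(list(S.edges()))                     # [('X', 'Y')] -- the edge to Z is gone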

Writing

I’m leaving this commented out because I have no use for it in this notebook, but writing the graph object to disk is as easy as calling the function that writes the data to (one of many) standard graph data formats.

# nx.write_graphml(G, "output.graphml")
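
If you do run the write, loading the graph back into networkx in a later session is just as simple (assuming the same filename):

G_loaded = nx.read_graphml("output.graphml")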

All Together

Finally, we can use the in_degree values as a measure of “how fundamental is this page in the world of our starting seed?” Anyone with even a cursory understanding of Graph/Network Analysis won’t be surprised to see these results.

top_indegree = sorted(dict(G.in_degree()).items(),
                      reverse=True, key=itemgetter(1))[:100]

top_indegree[:10]
[('Graph (Discrete Mathematics)', 68),
 ('Vertex (Graph Theory)', 64),
 ('Directed Graph', 58),
 ('Social Network', 56),
 ('Network Theory', 53),
 ('Adjacency Matrix', 53),
 ('Degree (Graph Theory)', 53),
 ('Network Science', 51),
 ('Graph Drawing', 50),
 ('Edge (Graph Theory)', 50)]