Graph Data from Wikipedia
Building a Network from Unclean Data
I’m working through Complex Network Analysis in Python, and one of the earlier examples is just too good to pass up: the author generates useful network data by combing through the links of a starting Wikipedia page.
This technique is called “Snowball Sampling” and entails starting from some seed, or starting point, and running the algorithm which “uses retrieved data to find more data,” hence snowball. It does this by executing a Breadth-First Search from the starting seed, with a few thoughtful checks and data cleaning steps that I’ll go through in the next section.
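Before diving into the author's code, here is a minimal sketch of that general shape (my own illustration, not from the book), with a hypothetical get_links() standing in for whatever fetches a page's outgoing links:

from collections import deque

def snowball_sample(seed, get_links, max_depth=2):
    # get_links is a hypothetical stand-in for whatever fetches a page's
    # outgoing links (e.g. a Wikipedia API call).
    seen = {seed}
    queue = deque([(0, seed)])       # (layer, page) pairs; FIFO order = breadth-first
    edges = []
    while queue:
        layer, page = queue.popleft()
        if layer >= max_depth:       # don't expand pages beyond the last layer
            continue
        for link in get_links(page):
            edges.append((page, link))
            if link not in seen:     # only queue pages we haven't seen yet
                seen.add(link)
                queue.append((layer + 1, link))
    return edges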
In the Code
The code, copied from the book repository here, looks pretty dense but is actually quite elegant. I’m pasting it in its entirety, segmenting it to make observations on what’s going on under the hood.
Core Objects
After the relevant library imports, we start with our main objects:
- F: the blank network object that will hold our data
- SEED: the title of the starting page
- todo_set: a set (not a list, for fast membership checks!) that holds all pages we still need to crawl
- done_set: a set holding all pages we’ve already been to
from operator import itemgetter
import networkx as nx
import wikipedia
F = nx.DiGraph()
SEED = "Complex network".title()
todo_set = {SEED} # The SEED itself (note: set(SEED) would give a set of its characters)
done_set = set() # Nothing is done yet
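As an aside on the "set, not list" choice above: membership tests like link not in todo_set are effectively constant-time on a set but scale with the length of a list, which matters when we're checking tens of thousands of links. A quick illustration of my own (not from the book):

import timeit

titles = [f"Page {i}" for i in range(100_000)]
as_list = titles
as_set = set(titles)

# A set hashes the key (O(1) on average); a list scans every element (O(n)).
print(timeit.timeit(lambda: "Page 99999" in as_list, number=100))
print(timeit.timeit(lambda: "Page 99999" in as_set, number=100))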
For this next section, it’s extremely important that we keep track of what layer we’re currently operating on. Wikipedia is enormous, and if we’re not careful, our scraper would just go down the “one more link” rabbit hole and spend hours, if not days/months scraping until we finish the site.
Therefore, for each page we scrape, we want to catalog both its title and what layer away from our seed the node is. This gives us:
- todo_lst: a list of (layer, page_title) tuples
- layer: the current (integer) layer value, representing our distance from the seed
- page: the current page title as a string

We initialize todo_lst with the below because we haven’t yet begun scraping. The first values of layer and page get set accordingly.
todo_lst = [(0, SEED)] # The SEED is in the layer 0
layer, page = todo_lst[0]
A Clever Conditional
Specific to this particular use case, the author also provides us with a tuple, STOPS, containing links that either carry very little informational value or show up on virtually every page, and so provide no real insight into the network built around our seed.
STOPS = ("International Standard Serial Number",
         "International Standard Book Number",
         "National Diet Library",
         "International Standard Name Identifier",
         "International Standard Book Number (Identifier)",
         "Pubmed Identifier", "Pubmed Central",
         "Digital Object Identifier", "Arxiv",
         "Proc Natl Acad Sci Usa", "Bibcode",
         "Library Of Congress Control Number", "Jstor",
         "Doi (Identifier)", "Isbn (Identifier)",
         "Pmid (Identifier)", "Arxiv (Identifier)",
         "Bibcode (Identifier)", "Pmc (Identifier)",
         "Issn (Identifier)", "S2Cid (Identifier)")
Data Collection Loop
With all of these parts together, the following algorithm basically:

- Removes the next (layer, page_title) pair from the todo_lst
- Uses it to open the Wikipedia article corresponding to page_title
- Finds all of the links the article contains, while:
  - Filtering out any pages that appear in our STOPS tuple
  - Incrementing the layer value to current_layer + 1
  - Adding each new (layer + 1, link) pair to the end of our todo_lst
This ordering is crucial, as it ensures that we scrape the pages in layer order (e.g. layer 0 before any 1s, all 1s before any 2s, etc.) until the only pages left in our todo_lst are those with a higher layer value than we’re interested in.
from warnings import filterwarnings
filterwarnings('ignore')

while layer < 2:
    del todo_lst[0]  #(1) Remove the current page from the queue...
    done_set.add(page)  # ...and mark it as done
    # print(layer, page)
    try:  #(2) Try to download the page; skip it if that fails
        wiki = wikipedia.page(page)
    except:
        layer, page = todo_lst[0]
        # print("Could not load", page)
        continue
    for link in wiki.links:  #(3) Walk the page's links
        link = link.title()
        if link not in STOPS and not link.startswith("List Of"):
            if link not in todo_set and link not in done_set:
                todo_lst.append((layer + 1, link))  # queue unseen links one layer deeper
                todo_set.add(link)
            F.add_edge(page, link)  # record the edge even if the link was already seen
    layer, page = todo_lst[0]  #(4) Move on to the next queued page

print("{} nodes, {} edges".format(len(F), nx.number_of_edges(F)))
13506 nodes, 24391 edges
Some More Clever Data Fixes
The author also provides a nice chunk of code to resolve naming inconsistencies between two pages covering the same content (e.g. “Complex Network” vs the plural “Complex Networks”).
It does this by checking every page title to see whether its plural form (title + "s") is also a node in the graph, collecting the matches into a list, duplicates.
Then, they make use of nx.contracted_nodes(), which takes the graph plus two nodes and tells networkx “treat these two nodes as one node”: the second node is merged into the first, which inherits its edges. Concretely, if we had a linear network (A - B - C - D) and called nx.contracted_nodes(G, B, C), the result would be a new network A - B - D, where B now stands for both of the squashed-together nodes. Finally, the self_loops=False argument ensures that we don’t preserve the “connection/relationship between B and C” as a self-loop: they’re the same node now, and that self-referencing edge would be redundant.
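A tiny sketch of that behavior (my own illustration, not part of the book's code):

import networkx as nx

# A linear network A - B - C - D
P = nx.path_graph(["A", "B", "C", "D"])

# Merge C into B; drop the would-be self-loop between them
Q = nx.contracted_nodes(P, "B", "C", self_loops=False)

print(sorted(Q.nodes))   # ['A', 'B', 'D']
print(sorted(Q.edges))   # [('A', 'B'), ('B', 'D')]
print(Q.nodes["B"])      # {'contraction': {'C': {}}}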
F.remove_edges_from(nx.selfloop_edges(F))
duplicates = [(node, node + "s") for node in F if node + "s" in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)
Similarly, they check for multi-word titles that are space-delimited vs hyphen-delimited but represent the same content.
duplicates = [(x, y) for x, y
              in [(node, node.replace("-", " ")) for node in F]
              if x != y and y in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)
This part is a bit hand-wavy, but essentially any node that survives the deduping gets stuck with a new contraction attribute that records which node got merged into it in the last step.
F.nodes['Complex Network']
{'contraction': {'Complex Networks': {}}}
And so this line just goes through and zeroes out all contraction values: the attribute is useless in this instance, and because its value is a nested dict, it makes exporting the network to a standard format like GraphML a chore.
nx.set_node_attributes(F, 0, "contraction")
Subgraphing
At this point, we have a ton of nodes, and even more edges connecting them.
print(len(F.nodes), len(F.edges))
13390 23223
But do we need all of this data?
Of the ~13.4k nodes we pulled into our dataset, over 10k of them have degree 1 (a single in-edge from a neighbor, and no out-edges connecting to anything else in our first two layers). This means that we can lop off nearly 80% of the data we collected without losing nodes that provide any real interesting properties.
import pandas as pd
node_degrees = pd.Series([degree for title, degree in F.degree()])
node_degrees.value_counts().head()
1 10436
2 1596
3 605
5 230
4 184
dtype: int64
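A quick sanity check of that claim (my own addition), reusing the node_degrees series from above:

# Fraction of nodes with degree 1 -- roughly 0.78 on the run above
print((node_degrees == 1).mean())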
The author does this by defining a core set of nodes, keeping only those with degree >= 2.
core = [node for node, deg in dict(F.degree()).items() if deg >= 2]
Then they define a new, final graph G, derived from our original graph F, but containing only the nodes of interest.
G = nx.subgraph(F, core)
Not only does the node count go down considerably, we also automatically rid ourselves of edges connecting to nodes outside of our core set.
print(len(G.nodes), len(G.edges))
2954 12787
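This is just how induced subgraphs work: an edge survives only if both of its endpoints are in the node list we pass in. A tiny sketch of my own:

# Induced subgraph on a toy directed path: the C -> D edge disappears
# because D is not in the node list we keep.
T = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "D")])
print(list(nx.subgraph(T, ["A", "B", "C"]).edges))   # [('A', 'B'), ('B', 'C')]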
Writing
I’m leaving this commented out because I have no use for it in this notebook, but writing the graph object to disk is as easy as calling the function that writes the data to (one of many) standard graph data formats:

# nx.write_graphml(G, "output.graphml")
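And if you ever want the graph back in memory, networkx ships a matching reader:

# Round-trip: load the saved graph back from the GraphML file
# (assuming the write above was actually run).
G2 = nx.read_graphml("output.graphml")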
All Together
Finally, we can use the in_degree values as a measure of “how fundamental is this page in the world of our starting seed?” Anyone with even a cursory understanding of graph/network analysis won’t be surprised to see these results.
top_indegree = sorted(dict(G.in_degree()).items(),
                      reverse=True, key=itemgetter(1))[:100]
top_indegree[:10]
[('Graph (Discrete Mathematics)', 68),
('Vertex (Graph Theory)', 64),
('Directed Graph', 58),
('Social Network', 56),
('Network Theory', 53),
('Adjacency Matrix', 53),
('Degree (Graph Theory)', 53),
('Network Science', 51),
('Graph Drawing', 50),
('Edge (Graph Theory)', 50)]