# Brief

By now I have crawled all the pages I need with Crawler4j. To rank them for my search engine, PageRank is a famous and well-proven algorithm, so it is a natural choice.

In the PageRank algorithm, a directed graph of web pages is the essence of all the calculations. With the help of Jsoup, I can easily build such a graph from the crawled pages. The intermediate result from Jsoup is then fed into NetworkX and transformed into PageRank scores.

# Quick Start with Jsoup

## Jsoup

Jsoup gets the "J" in its name from being a Java library. It's open-source and mainly works with HTML, with manipulations around URLs, CSS selectors, and the DOM. For more, refer to the official website.

## Installation

I choose to build with Maven; using the binary files directly is also available.

- Create a Maven project in IntelliJ (or Eclipse, of course).
- In the Maven POM, add:

  ```xml
  <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.11.2</version>
  </dependency>
  ```

  Then import/install it.
- In `/src/main/java`, create `Extractor.java`. Do `import org.jsoup.Jsoup;` and compile.
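To verify the setup works, here is a tiny, hypothetical smoke test (the HTML string and `main` method are only for illustration; the real `Extractor` is fleshed out below):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical smoke test: parse an HTML snippet and print its title.
public class Extractor {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><head><title>Hello Jsoup</title></head><body></body></html>");
        System.out.println(doc.title()); // prints "Hello Jsoup"
    }
}
```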

## Do extraction from crawled pages
In the current PageRank project, what I need most is to extract outgoing URLs from a batch of downloaded pages.
### Setups
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
```

These imports are all I need for this task.
Define an `Extractor` class with a `public void extract(String pathToPageFile, String baseUri) throws Exception` method. `pathToPageFile` is the path to a downloaded HTML file, and `baseUri` is the base URI of that page, used when outgoing links are written as relative paths.

Then, set up Jsoup's `Document` and `Elements` objects as follows:

```java
File file = new File(pathToPageFile);            // java.io.File
Document doc = Jsoup.parse(file, "UTF-8", baseUri);
Elements links = doc.select("a[href]");          // anchors with an href
Elements media = doc.select("[src]");            // anything with a src attribute (img, script, ...)
Elements imports = doc.select("link[href]");     // stylesheets and other <link> imports
```
Calling `doc.select` with CSS queries is the key to how Jsoup identifies outgoing links among all kinds of references.
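Because `baseUri` was passed to `Jsoup.parse`, relative references can also be resolved to absolute URLs via Jsoup's `abs:` attribute prefix (or `absUrl`). A small sketch:

```java
for (Element link : links) {
    // "abs:href" resolves the href against the baseUri given to Jsoup.parse
    String absolute = link.attr("abs:href");   // equivalent: link.absUrl("href")
    System.out.println(absolute);
}
```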

### Extract numbers

`Elements` encapsulates all the details selected by a given CSS query. Its `size()` method tells how many matching resources were extracted, which is enough as a general statistic. Simply do:

	printf("\nMedia: %d", media.size());
	printf("\nImports: %d", imports.size());
	printf("\nLinks: %d", links.size());
...

  private void printf(String msg, Object... args) {
    System.out.println(String.format(msg, args));
}

And it shows:

```
Media: 82
Imports: 12
Links: 389
```

### Extract Details

To dig deeper, we need the details of the extracted components. An `Elements` object is essentially an `ArrayList<Element>`, so we can loop through it and work with each `Element` inside. For example:

```java
for (Element src : media) {
    printf(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
}
```

to extract every absolute `src`. And it outputs what we want:

```
Media: 14
 * script: <http://global.fncstatic.com/static/v/all/js/ag.jquery.js?20171226104054>
 * script: <http://global.fncstatic.com/static/v/all/js/ag.base.js?20171226104054>
 * img: <http://a57.foxnews.com/www.foxnews.com/content/dam/fox-news/images/2018/01/19/132/74/01_vegas5.jpg?ve=1&tl=1>
 ......
```

For more detailed APIs, refer to the API docs.

# Construct Graph for PageRank

## How PageRank works

Basically, PageRank ranks pages by the weight they receive from the pages that link to them. As illustrated here, the web can be modeled as a directed graph, and the weight of each node is computed recursively. In this part, we are going to construct a graph represented as a list of edges, which will serve as the input for the PageRank computation.
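As a reference for what NetworkX will compute for us later, here is a minimal, hypothetical power-iteration sketch of the standard PageRank update (damping factor `d`, uniform teleport). It is only an illustration, not the code used in this project:

```java
import java.util.*;

// Minimal sketch of PageRank power iteration; graph maps each node to its outgoing neighbors.
public class PageRankSketch {
    public static Map<String, Double> pagerank(Map<String, List<String>> graph, double d, int iters) {
        int n = graph.size();
        Map<String, Double> pr = new HashMap<>();
        for (String node : graph.keySet()) pr.put(node, 1.0 / n);           // uniform start

        for (int i = 0; i < iters; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String node : graph.keySet()) next.put(node, (1 - d) / n); // teleport term
            for (Map.Entry<String, List<String>> e : graph.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) continue;                                // dangling nodes ignored in this sketch
                double share = d * pr.get(e.getKey()) / out.size();         // spread current score over out-links
                for (String dst : out) next.merge(dst, share, Double::sum);
            }
            pr = next;
        }
        return pr;
    }
}
```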

## Enhance my Jsoup code for the edge list file

Based on the code above, we will now focus on the extracted links.

### Mapping files to URLs

To identify pages, we need a hashed ID for each of them, plus a CSV file storing the mapping from each hash to the actual URL:

```
8da923084554b2821233d6c5c5f02c97.html,https://www.foxnews.com/
a853ae2a38c1dd52f9e616255e69aa6d.html,http://www.foxnews.com/weather/your-weather/index.html
c8eacf1cac157943eb838eed1beb6a72.html,https://www.foxnews.com/category/us/crime/homicide.html
......
```
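For illustration only, here is a hypothetical sketch of how such a CSV could be produced by MD5-hashing each URL into a file name; the real mapping comes from the crawler, which may derive the IDs differently:

```java
import java.io.FileWriter;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.List;

// Hypothetical helper: derive a file name from the MD5 of the URL and dump (file, url) pairs as CSV.
public class MappingWriter {
    public static void writeMapping(List<String> urls, String pathToCsv) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (FileWriter out = new FileWriter(pathToCsv)) {
            for (String url : urls) {
                byte[] digest = md5.digest(url.getBytes("UTF-8"));
                String hash = String.format("%032x", new BigInteger(1, digest));
                out.write(hash + ".html," + url + "\n");
            }
        }
    }
}
```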

In `Extractor`, we read this CSV to establish the mapping in both directions before we can build any graph:

```java
Map<String, String> fileUrlMap = new HashMap<>(); // file name -> URL
Map<String, String> urlFileMap = new HashMap<>(); // URL -> file name

BufferedReader br = new BufferedReader(new FileReader(pathToCsv));
String line;

while ((line = br.readLine()) != null) {
    String[] mapping = line.split(",");
    fileUrlMap.put(mapping[0], mapping[1]);
    urlFileMap.put(mapping[1], mapping[0]);
}

br.close();
```

### Build edges

In `Extractor`, declare an `extractAll` method that reads all files from the download directory and builds the graph in edge-list form.

```java
public void extractAll(String pathToDir) throws Exception {
    File dir = new File(pathToDir);
    File[] files = dir.listFiles();

    Set<String> edges = readEdges(files);
    writeEdgeFile(edges, pathToDir + "/../edge_file.txt");
}
```
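`writeEdgeFile` is just a small helper; a minimal sketch of what it might look like (one space-separated edge per line, matching the sample shown later):

```java
// Hypothetical helper: dump each "src dst" edge on its own line (needs java.io.FileWriter).
private void writeEdgeFile(Set<String> edges, String pathToEdgeFile) throws Exception {
    System.out.println("Writing edge list file......");
    try (FileWriter out = new FileWriter(pathToEdgeFile)) {
        for (String edge : edges) {
            out.write(edge + "\n");
        }
    }
}
```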

And in `readEdges`, I read every out-link of each page and add it to the edge set:

```java
private Set<String> readEdges(File[] files) throws Exception {
    Set<String> edges = new HashSet<>();
    int counter = 0;
    int totalFileNumber = files.length;

    for (File file : files) {
        String fileName = file.getName();
        Document doc = Jsoup.parse(file, "UTF-8", fileUrlMap.get(fileName));
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            String url = link.attr("href").trim();
            if (urlFileMap.containsKey(url)) { // Skip links outside the crawled set
                String edge = String.format("%s %s", fileName, urlFileMap.get(url));
                edges.add(edge);
            }
        }

        System.out.println(String.format("DEBUG: Complete: %d, Total: %d", counter++, totalFileNumber));
    }
    return edges;
}
```

As shown in my DEBUG output, things work perfectly.

```
......
DEBUG: Complete: 18561, Total: 18563
DEBUG: Complete: 18562, Total: 18563
Writing edge list file......
```

and our space-separated edge file should look like this:

```
1c7c80cabd4d9447871319a6f51e04f9.html e74b6c094853981f8b7c1ccde49b28bc.html
1dfdde3ab29662f72f63afe467385914.html cb83f91a16b49201941cacd85ac68148.html
aa86a9090be65917d257b5acc626bdda.html 190c3635396c57d2005576ceb456c2e5.html
......
```

# Compute with NetworkX

NetworkX is a powerful Python library for studying networks. From the edge list we just generated, it can help us derive PageRank scores for the underlying graph.

## Installation

`pip install networkx`. Make sure the Python version in use is the same one that pip points to (i.e. if you are using Python 3, `pip -V` should report Python 3.x.x).

## Load the network graph into NetworkX

Because I represent the graph in edge-list form, `read_edgelist` works for me:

```python
>>> import networkx as nx
>>> G = nx.read_edgelist("/path/to/edge/file/edge_file.txt", create_using=nx.DiGraph())
```

## Compute PageRank

First we need to set up the parameters for PageRank. Refer to the NetworkX documentation:

- `alpha` (float, optional) – Damping parameter for PageRank, default=0.85.
- `personalization` (dict, optional) – The "personalization vector" consisting of a dictionary with a key for every graph node and nonzero personalization value for each node. By default, a uniform distribution is used.
- `max_iter` (integer, optional) – Maximum number of iterations in power method eigenvalue solver.

```python
>>> pr = nx.pagerank(G, alpha=0.85, personalization=None, max_iter=30, tol=1e-06, nstart=None, weight='weight', dangling=None)
```

The PageRank scores are computed on the given graph and returned as a dictionary. It's better to write them into a file in the format `<id>=<PR_score>`:

```python
f = open('external_pr.txt', 'w')
for key, val in pr.items():
    f.write('{}={}\n'.format(key, val))
f.close()
```

And the result looks like:

```
cb35e045b7b0470631719e8bb0548c29.html=8.081802212918882e-06
690f14924de5735e697a35f97e39e428.html=0.007704396255553174
3f2a3e1a34498d7393602a0f9ceaa59b.html=0.013412990502354767
......
```

# Summary

By now, we have successfully computed PageRank over the HTML files that our crawler downloaded earlier. PageRank scores will be very useful for building a search engine and ranking millions of web pages. With this PageRank result, I will build my search engine using Apache Solr in the coming days. In my next post, there will be a prototype search engine set up. Let's start from here!