How Google produces relevant search results
25th May 2009 | 09:00
Indexing and PageRank explained
How indexing works
Search engines can seem like magic. You type what you're looking for into a text box, and – ping! – it appears mere milliseconds later. Of course, we all know that Google relies on more than a magic wand to produce relevant search results; but what exactly does it use instead?
Well, any search engine has three distinct phases: spidering the web to find and read web pages, indexing the web pages found and responding to user queries to produce a set of ranked results.
In this article we're not going to talk at all about the spidering activities, but you can imagine it as a special program that reads a web page, analyses the source HTML and then follows the URL links found. The pages are then stored and further analysed.
Indexing is an interesting algorithm. The purpose of indexing the web pages found through spidering is to produce fast and accurate search results. If it's not fast, the user will go elsewhere; if it's not accurate, then ditto.
This is where Google out-did its competitors at the start: it was faster (all the way from loading the page to the production of the results) and it was more accurate (due to its use of the PageRank algorithm). Once the search engine has indexed a huge number of pages, it can start to respond to search queries.
The index must be built in such a way that the search engine can produce a list of web pages that satisfy the query, and rank the results – speedily – so that the most applicable and relevant results are shown first. So the index is of paramount importance, with the ranking algorithm being equally significant.
After all, before Google and PageRank, there were other search engines like AltaVista that knew how to index. It was Google's ranking that changed everything. The standard type of data structure for text indexes is known as an inverted file. Suppose we have a set of text documents. The first stage is to analyse the content by parsing the text in order to identify the individual words (or tokens) in the text.
Once we've identified the words, writing a parser is pretty simple. We can use something like a state machine (or automaton) or regular expressions (which admittedly were designed for this kind of text parsing task). But what are we actually looking for?
The simple, intuitive answer is 'the bits in between the white space'. But that simple answer reveals a difference between how we humans scan text and how the computer does. For example, take another look at the first sentence in this paragraph. Implementing this definition as a program (and using square brackets to delimit the tokens here), we would see words like [simple,], ['bits] and [space.'].
Since that's a little too basic, we might refine the definition to 'words are the bits in between the white space and punctuation'. But again, we fall into a trap: for the first sentence in this paragraph, this definition would now produce the words [that] and [s] for the word 'that's'.
So probably the best answer – at least for the purposes of indexing English text – is to write a parser that, when it sees a punctuation character, checks the character after the punctuation mark.
If it's a letter, then it's most likely to be an abbreviation like 'that's' or 'it's', and it should be included in the word. Even better, these days documents will contain things like email addresses and URLs. These can be viewed as 'words' in their own right, and this word-parser algorithm would work with them too.
Of course, for other languages, parsing words could require other algorithms depending on issues like the character set being used (ideographic text, for example, would be interesting) or the grammar rules for the language. In parsing the text, it's likely that we'll find duplicate words.
For a simple index, such as the one we're using here, all we need to know is whether a particular document includes a given word. We're not particularly interested in whether the document uses that word once, twice or a thousand times. After all, when we get to displaying the list of search results, the user will just click on a given link and go to that particular URL to browse the document.
In our parsing exercise, then, we should ignore duplicates and aim for a list of words used. The next step is to throw out the words that have no searchable meaning because every document will use them. Examples of these stop words (as they're called) are 'the', 'a', 'is', 'I', 'in' and so on (and those last three words too!).
Every application of a text indexing algorithm will use its own set of stop words, and these lists will be built up and maintained at runtime from the content being seen. After all that is done, we'll have a table of document names with a list of words for each document, as shown in Figure 1.
Figure 1: A forward index lists all of the words that are found in a particular document
Here I've named the documents as numbers, and in practice we'd have another table that cross referenced these numbers to the actual URL for the document. The next step is to invert this table; after all, for a search we're going to be given a word, and we have to identify the documents that contain this word.
The word should be the key, and at the moment the table is keyed by document number. What we end up with is a table as shown in Figure 2. The table is keyed by word, and the data for each record in the table is a list of document numbers. This structure is known as an inverted file. Starting from a table of words per document, we end up with a table of documents per word.
Figure 2: This inverted index is a list that shows which documents contain a certain word
Searching for keywords
Searching for keywords
For a search engine, the original table is going to have billions of rows since there are billions of pages on the web. The rows will be relatively short (maybe less than a thousand fields). However, the inverted table will have correspondingly fewer rows – let's say of the order of magnitude of a million or so – but each row could be huge.
Note though, something interesting about this data per word: it is highly compressible. We made the document names numbers, with a table that would give us the actual document for a given number. Now, there's no reason to store the numbers. Instead, we could have a large bitmap – an array of bits for each word. If document number x has the word, then bit x in the bitmap would be set, otherwise it would be clear.
If we posit our hypothetical search engine indexing a trillion pages from the web, these bit arrays would be 130GB in size or so when uncompressed. However, it's likely that for words that are not commonly used, the majority of these bitmaps would be empty, or 0. Highly compressible, in other words.
There's another reason for storing the inverted file as a bitmap like this. Consider now the search string entered by the user and how that will be used. When we use a search engine, we don't usually search for individual words. More commonly we tend to search for phrases – 'inverted file', for example.
Consider what this search phrase means to the search engine: it means that each document presented to the user in the result set must contain both the word 'inverted' and the word 'file'. This is a logical 'AND' operation. The search engine is able to retrieve the data for each word, and in order to find the document numbers that have both words, all it needs to do is the same logical AND operation on the bitmaps.
This will produce a new bitmap with bits set where both bits were set in the original two bitmaps, and clear otherwise. Thus queries can be easily mapped onto the inverted file data structure. Thus far, we've seen how to use an inverted file in order to find documents that contain a particular word.
It must be emphasised that for a commercial search engine like Google, these indices are gigantic. They're so big that they are spread out and duplicated over many servers. However, for smaller sets of documents (usually known as a corpus) – for example a large group of PDFs, Office files or emails that you have on disk – creating an inverted file to index them is very easy, and can reduce search times dramatically.
Indeed, the hardest part of completing such an exercise is understanding the document format. Another point we must make is that for the vast majority of searches done on the Internet, users only look at the first couple of pages of results.
Rarely do people venture beyond those first 20 matches, and indeed they mostly stay on the first page (if the result you want is not there it's easier to type in another slightly different search phrase than to scan-read the next two or three pages. After all, that first page of results may have given you extra hints on how to search for what you want). So it is important that those first 10 to 20 results are as relevant as possible.
While they were at Stanford University, Sergey Brin and Larry Page published a paper called The Anatomy of a Large-Scale Hypertextual Web Search Engine in which they described implementing a new prototype of a search engine.
This search engine was called Google (a misspelling of googol, a word meaning 10 to the hundredth power). In the paper, Brin and Page talk about how to make search results more relevant and compelling by using an algorithm called PageRank (named after Larry Page).
Ranking the results
PageRank is a voting system for analysing links. Suppose we have a web page on the Internet, and there are 10 other pages linking to it. That page will then be deemed more important or relevant than a second page that has only five other pages linking to it.
The importance (or quality) of the first page will be higher than that of the second page, just because it has more incoming links. Page and Brin had recognised that not all incoming links are equally important. They modified this simple citation-counting algorithm by assigning a PageRank to each page and then weighting the importance of the incoming link by the PageRank of the source page.
The upshot of this change is that a link from a page with a high PageRank is more important than a link from a page that has a low PageRank. This concept makes sense; some sites are more trustworthy than others. Brin and Page extended this concept further by deciding that if the source page has a high number of outgoing links, it dilutes the importance of each link.
The PageRank of a page, then, is the sum of all the PageRanks of the pages linking to it, divided by the number of outgoing links they have. Intuitively, this formula encapsulates the view that if a page is cited from other (well-cited) pages on the Internet, it's probably relevant and worth investigating.
Figure 3: The PageRank of a website depends on how many other well-cited pages link to it
Figure 3 shows a calculation of PageRank for a small corpus of six documents, A though F. The links between the documents are shown by arrows, and the area of a document is proportional to its PageRank. Notice that although A has more links coming in than B does, B has the higher PageRank, because A has more links from pages that are irrelevant (E and F in particular).
Another way to view PageRank is to imagine a person that, once he's given a random page to start from, randomly clicks on links on the pages he visits without hitting the Back button. The PageRank of a page is essentially the statistical probability that he visits that page. The PageRank that Google Toolbar displays for pages is not this probability.
Instead, using a formula that remains a trade secret, Google converts a PageRank probability to a whole number between 0 and 10, with 0 meaning 'unranked' (the page is too new to have been cited often) and 10 assigned to the most important or highest-quality pages. There's only one 10 as far as we know: the Google search page.
Sites like Wikipedia, Twitter and Yahoo! manage a 9 for their homepages, whereas Facebook only manages an 8. Now that we have a method for determining a page's importance, we can fairly easily order the search results for a given page by order of their importance and quality.
This article should have given you some insight into how a modern search engine works. There are more details to understand, though, and a good place to start is Page and Brin's Paper.
First published in PC Plus Issue 281
Liked this? Then check out The tech that's shaping the web's evolution
Sign up for TechRadar's free Weird Week in Tech newsletter
Get the oddest tech stories of the week, plus the most popular news and reviews delivered straight to your inbox. Sign up at http://www.techradar.com/register