 |
How does
Google collect and rank results?
One of the most common questions we hear from librarians is
"How does Google decide what result goes at the top of the list?"
Here, from quality engineer Matt Cutts, is a quick primer on how we
crawl and index the web and then rank search results. Matt also
suggests exercises school librarians can do to help
students.
Crawling and Indexing A
lot of things have to happen before you see a web page containing
your Google search results. Our first step is to crawl and index the
billions of pages of the World Wide Web. This job is performed by
Googlebot, our "spider," which connects to web servers around the
world to fetch documents. The crawling program doesn't really roam
the web; it instead asks a web server to return a specified web
page, then scans that web page for hyperlinks, which provide new
documents that are fetched the same way. Our spider gives each
retrieved page a number so it can refer to the pages it fetched.
Our crawl has produces an enormous set of documents, but
these documents aren't searchable yet. Without an index, if you
wanted to find a term like civil war, our servers would have
to read the complete text of every document every time you searched.
So the next step is to build an index. To do this, we
"invert" the crawl data; instead of having to scan for each word in
every document, we juggle our data in order to list every document
that contains a certain word. For example, the word "civil" might
occur in documents 3, 8, 22, 56, 68, and 92, while the word "war"
might occur in documents 2, 8, 15, 22, 68, and 77.
Once
we've built our index, we're ready to rank documents and determine
how relevant they are. Suppose someone comes to Google and types in
civil war. In order to present and score the results, we need
to do two things:
- Find the set of pages that contain the user's query somewhere
- Rank the matching pages in order of relevance
We've
developed an interesting trick that speeds up the first step:
instead of storing the entire index on one very powerful computer,
Google uses hundreds of computers to do the job. Because the task is
divided among many machines, the answer can be found much faster. To
illustrate, let's suppose an index for a book was 30 pages long. If
one person had to search for several pieces of information in the
index, it would take at least several seconds for each search. But
what if you gave each page of the index to a different person?
Thirty people could search their portions of the index much more
quickly than one person could search the entire index alone.
Similarly, Google splits its data between many machines to find
matching documents faster.
How do we find pages that contain
the user's query? Let's return to our civil war example. The word
"civil" was in documents 3, 8, 22, 56, 68, and 92; the word "war"
was in documents 2, 8, 15, 22, 68, and 77. Let's write the documents
across the page and look for those with both words.
| civil |
3 |
8 |
|
22 |
56 |
68 |
92 |
| war |
2 |
8 |
15 |
22 |
|
68 |
77 |
| both words |
|
8 |
|
22 |
|
68 |
|
Arranging the documents this way makes clear that the words
"civil" and "war" appear in three documents (8, 22, and 68). The
list of documents that contain a word is called a "posting list,"
and looking for documents with both words is called "intersecting a
posting list." (A fast way to intersect two posting lists is to walk
down both at the same time. If one list skips from 22 to 68, you can
skip ahead to document 68 on the other list as well.)
|
An exercise for students Once you see how to
intersect two words in an index, it's not hard to do it for
three or more words as well. Here's a fun exercise: try to
find all the documents below that contain the words "civil"
and "war" and "reconstruction."
civil: 1 9 15 19 22 35
38 48 53 55 65 68 73 78 82 88 91 99 war: 15 18 25 29 31 35
37 40 42 46 48 65 75 85 91 96 reconstruction: 35 42 48 64
73 91 95
The answer is at the end of the article.
|
Ranking Results Now we have the set of pages that
contain the user's query somewhere, and it's time to rank them in
terms of relevance. Google uses many factors in ranking. Of these,
the PageRank algorithm might be the best known. PageRank evaluates
two things: how many links there are to a web page from other pages,
and the quality of the linking sites. With PageRank, five or six
high-quality links from websites such as www.cnn.com and
www.nytimes.com would be valued much more highly than twice as many
links from less reputable or established sites.
But we use
many factors besides PageRank. For example, if a document contains
the words "civil" and "war" right next to each other, it might be
more relevant than a document discussing the Revolutionary War that
happens to use the word "civil" somewhere else on the page. Also, if
a page includes the words "civil war" in its title, that's a hint
that it might be more relevant than a document with the title "19th
Century American Clothing." In the same way, if the words "civil
war" appear several times throughout the page, that page is more
likely to be about the civil war than if the words only appear once.
|
An exercise for students
Pretend that you're a search engine. Pick
a query like civil war or recycling or whatever
you want. Search for the phrase on Google, pick three or four
pages from the results, and print them out. On each printout,
find the individual words from your query (such as "civil" and
"war") and use a highlighter to mark each word with color. Do
that for each of the 3-5 documents that you print out. Now
tape those documents on a wall, step back a few feet, and
squint your eyes. If you didn't know what the rest of a page
said, and could only judge by the colored words, which
document do you think would be most relevant? Is there
anything that would make a document look more relevant to you?
Is it better to have the words be in a large heading or to
occur several times in a smaller font? Do you prefer it if the
words are at the top or the bottom of the page? How often do
the words need to appear? See if you can come up with 2-3
things you would look for to see if a document matched a query
well. This can help students learn to evaluate website
relevance the way a search engine would evaluate it so that
they can better understand why a search engine returns certain
results over others. |
As a rule, Google tries to find pages that are both reputable and
relevant. If two pages appear to have roughly the same amount of
information matching a given query, we'll usually try to pick the
page that more trusted websites have chosen to link to. Still, we'll
often elevate a page with fewer links or lower PageRank if other
signals suggest that the page is more relevant. For example, a web
page dedicated entirely to the civil war is often more useful than
an article that mentions the civil war in passing, even if the
article is part of a reputable site such as Time.com.
Once
we've made a list of documents and their scores, we take the
documents with the highest scores as the best matches. Google does a
little bit of extra work to try to show snippets – a few sentences –
from each document that highlight the words that a user typed. Then
we return the ranked URLs and the snippets to the user as results
pages.
As you can see, running a search engine takes a lot
of computing resources. For each search that someone types in, over
500 computers may work together to find the best documents, and it
all happens in under half a second.
[Answer: Only documents 35, 48, and 91 contain all three words
"civil" and "war" and "reconstruction."]
Matt Cutts is a
software engineer in the quality group at Google. He spends his days
trying to help good sites rank where they should and developing
techniques that keep deceptive or spammy sites from showing up in
Google's search. He also has a web log at http://www.mattcutts.com/blog/
that often discusses webmaster issues.
Other questions? Send us a note at librarian-newsletter@google.com.
Every newsletter we’ll try to answer 1 or 2 of the most frequently
asked questions.
Sign up
to receive this newsletter.
|