Google first crawls your web site and makes a copy of it. This is called an index.
Think of an index you might find at the end of a book. Traditional information retrieval systems (search engines) work similarly when they look up web documents.
But the web is ever-changing. Size isn’t everything, Nayak explained, and there’s a lot of duplication on the web. Google’s goal is to create a “comprehensive index.”
In 2020, the index was “maybe” about 400 billion documents, Nayak said. (We learned that there was a period of time when that number came down, though exactly when was unclear.)
“I don’t know in the past three years if there’s been a specific change in the size of the index.”
“Bigger is not necessarily better, because you might fill it with junk.”
You can keep the size of the index the same if you decrease the amount of junk in it,” Nayak said. “Removing stuff that is not good information” is one way to “improve the quality of the index.”
Nayak also explained the role of the index in information retrieval:
“So when you have a query, you need to go and retrieve documents from the index that match the query. The core of that is the index itself. Remember, the index is for every word, what are the pages on which that word occurs. And so — this is called an inverted index for various reasons. And so the core of the retrieval mechanism is looking at the words in the query, walking down the list — it’s called the postings list — and intersecting the postings list. This is the core retrieval mechanism. And because you can’t walk the lists all the way to the end because it will be too long, you sort the index in such a way that the likely good pages, which are high quality — so sometimes these are sorted by page rank, for example, that’s been done in the past, are sort of earlier in the thing. And once you’ve retrieved enough documents to get it down to tens of thousands, you hope that you have enough documents. So this is the core of the retrieval mechanism, is using the index to walk down these postings lists and intersect them so that all the words in the query are retrieved.”
We know Google uses the index to retrieve pages matching the query. The problem? Millions of documents could “match” many queries.
This is why Google uses “hundreds of algorithms and machine learning models, none of which are wholly reliant on any singular, large model,” according to a blog post Nayak wrote in 2021.
These algorithms and machine learning models essentially “cull” the index to the most relevant documents, Nayak explained.
“So that’s — the next phase is to say, okay, now I’ve got tens of thousands. Now I’m going to use a bunch of signals to rank them so that I get a smaller set of several hundred. And then I can send it on for the next phase of ranking which, among other things, uses the machine learning.”
Google’s A guide to Google Search ranking systems contains many ranking systems you’re probably well familiar with by now (e.g., BERT, helpful content system, PageRank, reviews system).
But Nayak (and other antitrust trial exhibits) have revealed new, previously unknown systems, for us to dig deeper into.
‘Maybe over 100’ ranking signals
Many years ago, Google used to say it used more than 200 signals to rank pages. That number ballooned briefly to 10,000 ranking factors in 2010 (Google’s Matt Cutts explained at one point that many of Google’s 200+ signals had more than 50 variations within a single factor) – a stat most people have forgotten.
Well, now the number of Google signals is down to “maybe over a hundred,” according to Nayak’s testimony.
What is “perhaps” the most important signal (which matches what Google’s Gary Illyes said at Pubcon this year) for retrieving documents is the document itself, Nayak said.
“All of our core topicality signals, our page rank signals, our localization signals. There’s all kinds of signals in there that look at these tens of thousands of documents and together create a score that then you extract the top few hundred from there,” Nayak said.
The key signals, according to Nayak, are:
The document (a.k.a., “the words on the page and so forth”).
Here’s the full quote from the trial:
“I mean, overall, there’s a lot of signals. You know, maybe over a hundred signals. But for retrieving documents, the document itself is perhaps the most important thing, those postings lists that we have that we use to retrieve documents from. That’s perhaps the most important thing, to get it down to the tens of thousands. And then after that, there are many factors, again. There are sort of code IR type, information retrieval type algorithms which cull topicality and things, which are really important. There is page quality. The reliability of results, that’s another big factor. There’s localization type things that go on there. And there is navboost also in that.”
Google core algorithms
Google uses core algorithms to reduce the number of matches for a query down to “several hundred” documents. Those core algorithms give the documents initial rankings or scores.
Each page that matches a query gets a score. Google then sorts the scores, which are used in part for Google to present to the user.
Web results are scored using an IR score (IR stands for information retrieval).
Navboost system (a.k.a., Glue)
Navboost “is one of the important signals” that Google has, Nayak said. This “core system” is focused on web results and is one you won’t find on Google’s guide to ranking systems. It is also referred to as a memorization system.
The Navboost system is trained on user data. It memorizes all the clicks on queries from the past 13 months. (Before 2017, Navboost memorized historical user clicks on queries for 18 months.)
The system dates back at least to 2005, if not earlier, Nayak said. Navboost has been updated over the years – it is not the same as it was when it was first introduced.
“Navboost is looking at a lot of documents and figuring out things about it. So it’s the thing that culls from a lot of documents to fewer documents,” Nayak said.
Trying not to minimize the importance of Navboost, Nayak also made it clear that Navboost is just one signal Google uses. Nayak was asked whether Navboost is “the only core algorithm that Google uses to retrieve results,” and he said “no, absolutely not.”
Navboost helps reduce documents to a smaller set for Google’s machine learning systems – but it can’t help with ranking for any “documents that don’t have clicks.”
Navboost can “slice locale information” (i.e., the origin location of a query) and the data information that it has in it by locale.
When discussing “the first culling” of “local documents” and the importance of retrieving businesses that are close to a searcher’s particular location (e.g., Rochester, N.Y.), Google is presenting them to the user “so they can interact with it and create Navboost and so forth.”
“Remember, you get Navboost only after they’re retrieved in the first place,” Nayak said.
So this means Navboost is a ranking signal that can only exist after users have clicked on a document/page.
Navboost can also create different datasets (slices) for mobile vs. desktop searches. For each query, Google tracks what kind of device it is made on. Location matters whether the search is conducted via desktop or mobile – and Google has a specific Navboost for mobile.
“It’s one of the slices,” Nayak said.
What is Glue?
“Glue is just another name for Navboost that includes all of the other features on the page,” according to Nayak, confirming that Glue does everything else on the Google SERP that’s not web results.
Glue was also explained in a different exhibit (Prof. Douglas Oard Presentation, Nov. 15, 2023):
“Glue aggregates diverse types of user interactions–such as clicks, hovers, scrolls, and swipes–and creates a common metric to compare web results and search features. This process determines both whether a search feature is triggered and where it triggers on the page.”
Also, as of 2016, Glue was important to Whole-Page Ranking at Google:
“User interaction data from Glue is already being used in Web, KE [Knowledge Engine], and WebAnswers. More recently, it is one of the critical signals in Tetris.”
We also learned about something called Instant Glue, described in 2021 as a “realtime pipeline aggregating the same fractions of user-interaction signals as Glue, but only from the last 24 hours of logs, with a latency of ~10 minutes.”
Navboost and Glue are two signals that help Google find and rank what ultimately appears on the SERP.
Deep learning systems
Google “started using deep learning in 2015,” according to Nayak (the year RankBrain launched).
Once Google has a smaller set of documents, then the deep learning can be used to adjust document scores.
Some deep learning systems are also involved in the retrieval process (e.g., RankEmbed). Most of the retrieval process happens under the core system.
Will Google Search ever trust its deep learning systems entirely for ranking? Nayak said no:
“I think it’s risky for Google — or for anyone else, for that matter, to turn over everything to a system like these deep learning systems as an end-to-end top-level function. I think it makes it very hard to control.”
Nayak discussed three main deep learning models Google uses in ranking, as well as how MUM is used.
Looks at the top 20 or 30 documents and may adjust their initial score.
Is a more expensive process than some of Google’s other ranking components (it is too expensive to run on hundreds or thousands of results).
Is trained on queries across all languages and locales Google operates in.
Is fine-tuned on IS (Information Satisfaction) rating data.
Can’t be trained on only human rater data.
RankBrain is always retrained with fresh data (for years, RankBrain was trained on 13 months’ worth of click and query data).
“RankBrain understands long-tail user need as it trains…” Nayak said.
Is BERT when BERT is used for ranking.
Is taking on more of RankBrain’s capability.
Is trained on user data.
Is fine-tuned on IS rating data.
Understands language and has common sense, according to a document read to Nayak during the trial. As quoted during Nayak’s testimony from a DeepRank document:
“DeepRank not only gives significant relevance gains, but also ties ranking more tightly to the broader field of language understanding.”
“Effective ranking seems to require some amount of language understanding paired with as much world knowledge as possible.”
“In general, effective language understanding seems to require deep computation and a modest amount of data.”
“In contrast, world knowledge is all about data; the more the better.”
“DeepRank seems to have the capacity to learn the understanding of language and common-sense that raters rely on to guesstimate relevance, but not nearly enough capacity to learn the vast amount of world knowledge needed to completely encode user preferences.”
DeepRank needs both language understanding and world knowledge to rank documents, Nayak confirmed. (“The understanding language leads to ranking. So DeepRank does ranking also.”) However, he indicated DeepRank is a bit of a “black box”:
“So it’s learned something about language understanding, and I’m confident it learned something about world knowledge, but I would be hard-pressed to give you a crisp statement on these. These are sort of inferred kind of things,” Nayak explained.
What exactly is world knowledge and where does DeepRank get it? Nayak explained:
“One of the interesting things is you get a lot of world knowledge from the web. And today, with these large language models that are trained on the web — you’ve seen ChatGPT, Bard and so forth, they have a lot of world knowledge because they’re trained on the web. So you need that data. They know all kinds of specific facts about it. But you need something like this. In search, you can get the world knowledge because you have an index and you retrieve documents, and those documents that you retrieve gives you world knowledge as to what’s going on. But world knowledge is deep and complicated and complex, and so that’s — you need some way to get at that.”
Was initially launched earlier, without BERT.
Augmented (and renamed) to use the BERT algorithm “so it was even better at understanding the language.”
Is trained on documents, click and query data.
Is fine-tuned on IS rating data.
Needs to be retrained so that the training data reflects fresh events.
Identifies additional documents beyond those identified by traditional retrieval.
Trained on a smaller fraction of traffic than DeepRank – “having some exposure to the fresh data is actually quite valuable.”
MUM is another expensive Google model so it doesn’t run for every query at “run time,” Nayak explained:
“It’s too big and too slow for that. So what we do there is to train other smaller models using the special training, like the classifier we talked about, which is a much simpler model. And we run those simpler models in production to serve the use cases.”
QBST and Term weighting
QBST (Query Based Salient Terms) and term weighting are two other “ranking components” Nayak was not asked about. But these appeared in two slides of the Oard exhibit linked earlier.
These two ranking integrations are trained on rating data. QBST, like Navboost, was referred to as a memorization system (meaning it most likely uses query and click data). Beyond their existence, we learned little about how they work.
The term “memorization systems” is also mentioned in an Eric Lehman email. It may just be another term for Google’s deep learning systems:
“Relevance in web search may not fall quickly to deep ML, because we rely on memorization systems that are much larger than any current ML model and capture a ton of seemingly-crucial knowledge about language and the world.”
Assembling the SERP
Search features are all the other elements that appear on the SERP that are not the web results. These results also “get a score.” It was unclear from the testimony if it’s an IR Score or a different kind of score.
The Tangram system (formerly known as Tetris)
We learned a little about Google’s Tangram system, which used to be called Tetris.
The Tangram system adds search features that aren’t retrieved through the web, based on other inputs and signals, Nayak said. Glue is one of those signals.
You can see a high-level overview of how Freshness in Tetris worked in 2018, in a slide from the Oard trial exhibit:
Applies Instant Glue in Tetris.
Demotes or suppresses stale features for fresh-deserving queries; promotes TopStories.
Signals for newsy queries.
Evaluating the SERP and Search results
The IS Score is Google’s primary top-level metric of Search quality. That score is computed from search quality rater rankings. It is “an approximation of user utility.”
IS is always a human metric. The score comes from 16,000 human testers around the world.
IS, ranking and search quality raters
Sometimes, IS-scored documents are used to train the different models in the Google search stack. As noted in the Ranking section, IS rater data helps train multiple deep learning systems Google uses.
While specific users may not be satisfied with IS improvement, “[Across the corpus of Google users] it appears that IS is well correlated with helpfulness to users at large,” Nayak said.
Google can use human raters to “rapidly” experiment with any ranking change, Nayak said in his testimony.
“Changes don’t change everything. That wouldn’t be a very good situation to have. So most changes change a few results. Maybe they change the ordering of results, in which case you don’t even have to get new ratings, or sometimes they add new results and you get ratings for those. So it’s a very powerful way of being able to iterate rapidly on experimental changes.”
Nayak also provided some more insights into how raters assign scores to query sets:
“So we have query sets created in various ways as samples of our query stream where we have results that have been rated by raters. And we use this — these query sets as a way of rapidly experimenting with any ranking change.”
“Let’s say we have a query set of let’s say 15,000 queries, all right. We look at all the results for these 15,000 queries. And we get them rated by our raters.”
“These are constantly running in general, so raters have already given ratings for some of them. You might run an experiment that brings up additional results, and then you’d get those rated.”
“Many of the results that they produce we already have ratings from the past. And there will be some results that they won’t have ratings on. So we’ll send those to raters to say tell us about this. So now all the results have ratings again, and so we’ll get an IS score for the experimental set.”
Another interesting discovery: Google decided to do all rater experiments with mobile, according to this slide:
Problems with raters
Human raters are asked to “put themselves in the shoes of the typical user that might be there.” Raters are supposed to represent what a general user is looking for. But “every user clearly comes with an intent, which you can only hope to guess,” Nayak said.
Documents from 2018 and 2021 highlight a few issues with human raters:
Raters may not understand technical queries.
Raters can not accurately judge the popularity of anything.
In IS Ratings, human raters don’t always pay enough attention to the freshness aspect of relevance or lack the time context for the query, thus undervaluing fresh results for fresh-seeking queries.
A slide from a presentation (Unified Click Prediction) indicates that one million IS ratings are “more than sufficient to superbly tune curves via RankLab and human judgment” but give “only a low-resolution picture of how people interact with search results.”
Other Google Search evaluation metrics
A slide from 2016 revealed that Google Search Quality uses four other main metrics to capture user intent, in addition to IS:
PQ (page quality)
On Live Experiments:
All Ranking experiments run LE (if possible)
Measures position weighted long clicks
Eval team now using attention as well
“One important aspect of freshness is ensuring that our ranking signals reflect the current state of the world.” (2021)
All of these metrics are used for signal development, launches and tracking.
Learning from users
So, if IS only provides a “low-resolution picture of how people interact with search results,” what provides a clearer picture?
No, not individual clicks. We’re talking about trillions of examples of clicks, according to the Unified Click Prediction presentation.
As the slide indicates:
provide a vastly clearer picture of how people interact with search results.
A behavior pattern apparent in just a few IS ratings may be reflected in hundreds of thousands of clicks, allowing us to learn second and third order effects.”
Google illustrates an example with a slide:
Click data indicates that documents whose title contains dvm are currently under-ranked for queries that start with [dr …]
dvm = Doctor of Veterinary Medicine
There are a couple relevant examples in the 15K set.
There are about a million examples in click data.
So the volume of click data for this special situation roughly equals the total volume of all human rating data.
Learning this association is not only possible from the training data, but required to minimize the objective function.
Clicks in ranking
Google seems to equate using clicks with memorizing rather than understanding the material. Like how you can read a whole bunch of articles about SEO but not really understand how to do SEO. Or how reading a medical book doesn’t make you a doctor.
Let’s dig deeper into what the Unified Click Prediction presentation has to say about clicks in ranking:
Reliance on user feedback (“clicks”) in ranking has steadily increased over the past decade.
Showing results that users want to click is NOT the ultimate goal of web ranking. This would:
Promote low-quality, click-bait results.
Promote results with genuine appeal that are not relevant.
Be too forgiving of optionalization.
Demote official pages, promote porn, etc.
Google’s goal is to figure out what users will click on. But, as this slide shows, clicks are a proxy objective:
But showing results that users want to click is CLOSE to our goal.
And we can do this “almost right” thing extremely well by drawing upon trillions of examples of user behavior in search logs.
This suggests a strategy for improving search quality:
Predict what results users will click.
Boost those results.
Patch up problems with page quality, relevance, optionalization, etc.
Not a radical idea. We have done this for years.
The next three slides dive into click prediction, all titled “Life Inside the Red Triangle.” Here’s what Google’s slides tell us:
The “inner loop” for people working on click prediction becomes tuning on user feedback data. Human evaluation is used in system-level testing.
We get about 1,000,000,000 new examples of user behavior every day, permitting high-precision evaluation, even in smaller locales. The test is:
Were your click predictions better or worse than the baseline?
This is a fully-quantifiable objective, unlike the larger problem of optimizing search quality. The need to balance multiple metrics and intangibles is largely pushed downstream.
The evaluation methodology is “train on the past, predict the future”. This largely eliminates problems with over-fitting to training data.
Continuous evaluation is on fresh queries and the live index. So the importance of freshness is built into the metric.
The importance of localization and further personalization are also built into the metric, for better or worse.
This refactoring creates a monstrous and fascinating optimization problem: use hundreds of billions of examples of past user behavior (and other signals), to predict future behavior involving a huge range of topics.
The problem seems too large for any existing machine learning system to swallow. We will likely need some combination of manual work, RankLab tuning, and large-scale machine learning to achieve peak performance.
In effect, the metric quantifies our ability to emulate a human searcher. One can hardly avoid reflections on the Turing Test and Searle’s Chinese Room.
Moving from thousands of training examples to billions is game-changing…
User feedback (i.e., click data)
Whenever Google talks about collecting user data for X number of months, that’s all “the queries and the clicks that occurred over that period of time,” from all users, Nayak said.
If Google were launching just a U.S. model, it would train its model on a subset of U.S. users, for example, Nayak said. But for a global model, it will look at the queries and clicks of all users.
Not every click in Google’s collection of session logs has the same value. Also, fresher user, click and query data is not better in all cases.
“It depends on the query … there are situations where the older data is actually more valuable. So I think these are all sort of empirical questions to say, well, what exactly is happening. There are clearly situations where fresh data is better, but there are also cases where the older data is more valuable,” Nayak said.
Previously, Nayak said there is a point of diminishing returns:
“…And so there is this trade-off in terms of amount of data that you use, the diminishing returns of the data, and the cost of processing the data. And so usually there’s a sweet spot along the way where the value has started diminishing, the costs have gone up, and that’s where you would stop.”
The Priors algorithm
No, the Priors algorithm is not an algorithm update, like a helpful content, spam or core update. In these two slides, Google highlighted its take on “the choice problem.”
“The idea is the score the doors based on how many people took it.
In other words, you rank the choices based on how popular it is.
This is simple, yet very powerful. It is one of the strongest signals for much of Google’s search and ads ranking! If we know nothing about the user, this is probably the best thing we can do.”
Google explains its personalized “twist” – looking at who went through each door and what actions describe them – in the next slide:
“We bring two twists to the traditional heuristic.
Instead of attempting to describe – through a noisy process – what each door is about, we describe it based on the people who took it.
We can do this at Google, because at our scale, even the most obscure choice would have been exercised by thousands of people.
When a new user walks in, we measure their similarity to the people behind each door.
This brings us to the second twist, which is that while describing a user, we don’t not [sic] use demographics or other stereotypical attributes.
We simply use a user’s past actions to describe them and match users based on their behavioral similarity.”
The ‘data network effect’
One final tidbit comes from Hal Varian email that was released as a trial exhibit.
The Google we know today is a result of a combination of countless algorithm tweaks, millions of experiments and invaluable learnings from end-user data. Or, as Varian wrote:
“One of the topics that comes up constantly is the ‘data network effect’ which argues that
High quality => more users => more analysis => high quality
Though this is more or less right, 1) it applies to every business, 2) the ‘more analysis’ should really be ‘more and better analysis”.
Much of Google’s improvement over the years has been due to thousands of people … identifying tweaks that have added up to Google as it is today.
This is a little too sophisticated for journalists and regulators to recognize. They believe that if we just handed Bing a billion long-tail queries, they would magically become a lot better.”