By now, you’ve likely heard about the Google document leak: hundreds of pages of confidential, internal search API documentation sitting on an unprotected GitHub repo, discovered by an SEO practitioner, shared with popular search marketing influencers, revealed to the public, and picked apart and vigorously debated across SEO social media and beyond, ultimately breaking out of the search-geek bubble to receive coverage in tech-adjacent lay publications like The Register, Mashable, Ars Technica, and CNET. It’s quite a story, replete with anonymous tip-offs, crowdsourced sleuthing, denials and accusations, legal and political implications, real anger, and feigned surprise. But for all the chatter, did we learn anything?
The goal of this post is practical: dip our pan into the controversy-laden stream of leak discourse, sift out the nuggets of actionable intelligence, and deliver them to online publishers, content creators, and strategists to do…whatever it is you all do with precious metals in the wonderful world of literary imagery (we won’t judge). So, we have a job to do, one that requires us to largely steer around the leak story’s more dramatic twists and the subplots that inspire prurient curiosity, but a stage-setting recap is in order, as the context might help us properly interpret the discoveries themselves.
1. A Bot Mysteriously Released Thousands of Proprietary Google Search Documents
In mid-March of this year, an automated Google agent uploaded confidential, internal API documentation describing over 2,500 search-related modules with over 14,000 attributes to a public repository (since taken down, but preserved—legally, as it was published under a permissive license—at an alternate site). Georgia-based marketer Erfan Azimi discovered the repository and began investigating its contents, uncovering details that contradicted Google’s official statements about which signals its ranking algorithm uses to confer and deny the ample rewards of its massive network of users.
Sensing the damaging revelations in the leak and anticipating a potentially aggressive response from Google, Azimi sought counsel, reaching out to veteran search thought leader Rand Fishkin, a widely cited analyst during Google’s rise from innovative online navigation tool to the internet’s most powerful company, who has since transcended a strict SEO focus and no longer considers himself a domain expert. In response to Azimi's outreach, Fishkin enlisted ace SEO expert Mike King to evaluate the technical details.
On May 27th, Fishkin and King both published posts about the leak, taking the story public for the first time. Fishkin focused largely on the narrative of the leak, how it came to his attention, how he authenticated the materials, and—at a high level—what the documents revealed about Google’s inner workings. He referred to Azimi only as an anonymous tipster whom he had not previously known.
Google search is one of the most secretive, closely-guarded black boxes in the world. Well, maybe not anymore.
In the last quarter century, no leak of this magnitude or detail has ever been reported from Google’s search division. If you're in #SEO, you should probably see this. pic.twitter.com/JxEs55IV21
— Rand Fishkin (@randfish), May 28, 2024
King’s post was a deep dive into the leak’s contents, what they did and didn’t contain, and example statements from Google representatives that contradicted both the leak and testimony from the US Department of Justice’s ongoing antitrust case against Google.
Ok, let's get this party started!
A couple weeks ago I said I was publishing the most important thing I ever wrote. I was wrong.
Documentation related to the Google Search algorithm leaked and I spent the weekend tearing it apart. https://t.co/v71B16Ggov ✌🏾
— Mic King (@iPullRank), May 28, 2024
Azimi revealed himself in a dramatic video testimonial the following day, characterizing his motivation as a pure determination to expose the corrupt Googlers who had misled publishers and marketers for years.
On May 29th, The Verge reported that Google spokesperson Davis Thompson had confirmed the leak’s authenticity but disparaged its value, cryptically “caution[ing] against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information.”
Appearing together in a video interview on June 7th, Fishkin shared that an ex-Googler he consulted believed the documents contained “the whole shebang,” i.e., the complete set of search attributes, while King hat-tipped Australian SEO Dan Petrovic, who had also discovered the leak but researched it in secret until hearing of King’s plans to go public.
2. The Leak Lists Many Ingredients but Reveals No Secret Recipe
Keep in mind that the documents in question are extensive but not definitive. They describe attributes available for use in ranking modules, but they do not reveal whether the attributes are used at all, which modules use them, or how they’re weighted when used. The documents are intended for engineers building or refining ranking processes, cataloging the array of computed markers their algorithms can call and leverage when queried.
Just because you have a shovel in the shed doesn’t mean you use it when replacing the chain on your bike; it only means that you can. I see you riding that bike, and I admire the performance you get out of it, but I can’t tell if you used a shovel to change the chain or if you changed it at all. And, typically, I also wouldn’t know what’s in your shed, but if a…uh…document leak revealed that, in addition to the shovel, you’ve got a circular saw, a wheelbarrow, some zip ties, a socket wrench set, a leaf blower, and hundreds of other tools and gadgets, I can speculate how they all might be drafted into a bike performance optimization operation.
And so it is with “unsquashedLastLongestClicks,” one of several measurements for clicks described in the API documentation, that—like all the other attributes listed—might get heavy use or just gather dust idly.
Because so much is ambiguous, a debate is stirring in the SEO community about whether we’ve garnered any utility from the leak (more on this later). But there are a few key findings worth highlighting.
3. There Is No Single “Google Ranking Algorithm”
Below, we’ll take a look at several of the major findings in more detail and examine how they might impact digital media publishers in particular. But, to put the leak revelations in context, we should have a working mental model of Google’s search engine.
As King describes, “[c]onceptually, you may think of ‘the Google algorithm’ as one thing, a giant equation with a series of weighted ranking factors. In reality, it’s a series of microservices where many features are preprocessed and made available at runtime to compose the SERP.” Consider the system as layered, with some layers operating in parallel and some in sequence. One obvious division is between processes that have completed their work before a search query (e.g., spiders, indexes, click histories, etc.) and processes that are called immediately after a query is received (e.g., index lookups, ranking algorithms, and SERP populators). For certain common queries, ranks can potentially be cached, but Google must be able to respond quickly to any query, including ones it’s never encountered before.
When Google receives a query, it first gathers links from the index that it deems relevant, often many thousands of them. The links must then be ordered, which is where the vaunted search algorithms come in. Google seems to have several algorithms helping to populate a single SERP, such as algorithms specifically for images, videos, web pages, and ads. Setting aside other content types, Google may still run multiple algorithms for web pages alone, each contributing some links to the blended results. Further algorithms could adjust the blended results based on user personalization, localization, or other factors. Yet another process stitches the adjusted, blended outputs for all content types into the SERP. All of this occurs in milliseconds.
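For a concrete feel for that layered model, here’s a minimal sketch of a multi-stage pipeline in Python. All of it, the function boundaries, the blending order, the no-op adjustment layer, is our own illustration; the leaked documents catalog attributes, not this control flow.

```python
# A hypothetical multi-stage SERP pipeline. Every name here is invented
# for illustration; nothing below comes from the leaked documentation.

def retrieve_candidates(query: str, index: dict[str, list[str]]) -> list[str]:
    """Pre-ranking stage: gather every indexed URL deemed relevant."""
    return index.get(query, [])

def rank_web_pages(candidates: list[str]) -> list[str]:
    """One of several content-type rankers; a real one would score
    candidates against hundreds of precomputed attributes."""
    return sorted(c for c in candidates if "/img/" not in c)

def rank_images(candidates: list[str]) -> list[str]:
    """A parallel ranker for another content type."""
    return sorted(c for c in candidates if "/img/" in c)

def adjust(results: list[str], locale: str) -> list[str]:
    """Post-ranking layer for personalization/localization (no-op here)."""
    return results

def build_serp(query: str, index: dict[str, list[str]], locale: str) -> list[str]:
    candidates = retrieve_candidates(query, index)
    # Independent rankers run over the same candidate pool...
    blended = rank_web_pages(candidates) + rank_images(candidates)
    # ...then adjustment layers and a final stitcher compose the page.
    return adjust(blended, locale)

index = {"bike chain repair": ["https://a.example/guide", "https://b.example/img/chain.jpg"]}
print(build_serp("bike chain repair", index, "en-US"))
```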
4. Clicks and User Engagement Impact Rankings
Despite official pronouncements claiming otherwise from Googlers Matt Cutts, Gary Illyes, and John Mueller, evidence of click data usage in search rankings appeared in both Google’s site quality score patent and testimony from Pandu Nayak, Google VP of Search, during the DoJ case.
The patent describes a method for measuring how well a page matches a search query by analyzing user clicks on the SERPs. From the patent:
This specification describes how a system can determine a score for a site, e.g., a web site or other collection of data resources, as seen by a search engine, that represents a measure of quality for the site. The score is determined from quantities indicating user actions of seeking out and preferring particular sites and the resources found in particular sites. A site quality score for a particular site can be determined by computing a ratio of a numerator that represents user interest in the site as reflected in user queries directed to the site and a denominator that represents user interest in the resources found in the site as responses to queries of all kinds. The site quality score for a site can be used as a signal to rank resources, or to rank search results that identify resources, that are found in one site relative to resources found in another site.
In this case, user actions aren’t limited to clicks, but could include “one or more of a mouse rollover, a click, a click of at least a certain duration, or a click of at least a certain duration relative to a resource length.”
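The patent stops short of prescribing an implementation, but the described ratio is simple enough to sketch. A toy version, with variable names of our own invention:

```python
def site_quality_score(site_directed_interest: float,
                       total_resource_interest: float) -> float:
    """Toy rendering of the patent's ratio: user interest directed at the
    site (numerator) over user interest in the site's resources across
    queries of all kinds (denominator). Names and the zero guard are
    ours; the patent prescribes no implementation."""
    if total_resource_interest == 0:
        return 0.0
    return site_directed_interest / total_resource_interest

print(site_quality_score(120.0, 400.0))  # 0.3
```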
A part of Nayak’s testimony, in cross examination by the government, centered around NavBoost, which Nayak categorized as a core system that Google uses to rank SERPs, though not the only one. NavBoost not only uses click data to influence ranking, it requires click data, which is why it can’t be the only system employed for ranking.
To recap, 1) several Google representatives denied the impact of click data on rankings, 2) Google’s search-related IP includes a patent for a site quality evaluation system that analyzes user behavior, including clicks, and 3) a key component system of Google search requires click data to function.
The leak documents confirm and expand the available evidence, cataloging a number of attributes that track and categorize click behavior, including goodClicks, badClicks, lastLongestClicks, and unsquashedClicks. King summons his many years of study and experience to make very credible guesses as to the definitions of each attribute here, but the documentation itself doesn’t explain their meaning; it simply lists names of attributes, their associated data types, and—in some cases—notes about changes and deprecations. But the details aren’t as important as the overwhelming evidence that user behavior is stored, analyzed, and referenced for ranking.
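For reference, here is a minimal sketch of how those leaked attribute names might be modeled, with the caveat that the field names are real but the comments merely paraphrase King’s educated guesses:

```python
from dataclasses import dataclass

@dataclass
class ClickSignals:
    # Field names appear verbatim in the leak; the comments paraphrase
    # King's educated guesses, since the docs define no semantics.
    goodClicks: int = 0          # clicks that likely ended in satisfaction
    badClicks: int = 0           # clicks likely followed by a quick return
    lastLongestClicks: int = 0   # perhaps the session's final, longest-dwell click
    unsquashedClicks: int = 0    # perhaps raw counts before anti-spam normalization

page = ClickSignals(goodClicks=42, badClicks=3)
print(page)
```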
5. Chrome Extends Google’s User Actions beyond Search
While reviewing references to Chrome browser data in the documentation, Fishkin relays the perspective of an anonymous ex-Googler he consulted, who believes that, around 2005, internal discussions expressed a desire for larger quantities of clickstream data, including user behavior on sites other than Google properties. The ex-Googler insinuates that Chrome’s development targeted this very data, permitting Google engineers to observe nearly everything Chrome users traverse.
Does Google use Chrome clicks for ranking? Back in 2012, a marketer at a conference posed this question to top Google search liaison Matt Cutts, who responded in the negative, adding, “even then, I believe that 98% of that data is not logged at all, and with Google Instant the remaining 2% of that data is deleted after two weeks.” Responses from other marketers demonstrate a split in the community, with some willing to accept Cutts’s assurances on blind faith and others doubting the veracity of Cutts’s statements and expressing cynicism regarding Chrome click data usage.
The cynics were right, not just on the facts but on the incentives. If we accept that user actions are informative and that Chrome gives Google access to orders of magnitude more user actions, we ought not to be surprised that Google APIs include attributes specific to Chrome click data. The data are too valuable, and represent too great an advantage over rival web companies, to ignore.
What does this mean for publishers? Chrome-optimized UX is important, because Chrome user behavior is scrutinized in far greater detail than behavior in any other browser.
6. A Site Is Only As Strong As Its Weakest Page
“Domain authority” is a generic term for a site’s overall search performance health. Observations over the years by countless marketing analysts backed the theory that sites with a considerable body of high-quality content would often achieve high SERP rankings quickly for brand-new content, whereas new sites, even those launching with superb, highly optimized content, would take months to reach a coveted SERP position.
Fishkin’s previous company, Moz, pioneered the measurement and tracking of domain authority, and both Fishkin and King take particular issue in their posts with Google’s venomous denials of the concept. In a February 2019 Reddit AMA, 13-year Google veteran Gary Illyes took a rather mean-spirited dig at Fishkin, writing, “Dwell time, CTR, whatever Fishkin's new theory is, those are generally made up crap. Search is much more simple than people think.”
King referenced an October 2016 video Q&A from another longtime Google search representative, John Mueller, as another denial of site authority, and it’s true that Mueller implies there’s no site authority, but he seems to acknowledge the existence of some sort of sitewide quality metrics that could influence rankings for all site content, both established and fresh.
The leak documents lend credence to Fishkin’s original assertions, listing attributes such as siteAuthority, hostAge, and homepagePagerankNs within Google’s Q* ranking system.
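How such a sitewide signal might combine with page-level scoring is anyone’s guess, but a toy blend illustrates the observed effect: identical content performing differently depending on the domain carrying it. The weighting below is entirely invented.

```python
def page_score(content_score: float, site_authority: float) -> float:
    """Toy blend in which a sitewide prior lifts or drags every page on
    the domain. The 70/30 split is invented; the leak names siteAuthority
    but reveals nothing about how (or whether) it's weighted."""
    return 0.7 * content_score + 0.3 * site_authority

# Identical content performs differently on strong vs. weak domains:
print(page_score(0.9, 0.8))  # established site: 0.87
print(page_score(0.9, 0.1))  # brand-new site:   0.66
```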
7. Google Is Sandboxing New Content
Sandboxing is a long-theorized concept: a temporary restriction preventing highly relevant, optimized, new content from achieving high SERP rankings. By placing new content in the “sandbox,” the search engine delays the gratification of spammy tactics and protects credible websites from unwarranted displacement. During the sandbox period, search engines have time to assess the new content's performance, user engagement, and the authenticity of its backlinks.
Google representative John Mueller denied the existence of sandbox in a now-deleted tweet (screenshotted for posterity in King’s post), but the document leak contradicts him directly. As per King, “In the PerDocData module, the documentation indicates an attribute called hostAge that is used specifically ‘to sandbox fresh spam in serving time.’”
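Taken at face value, hostAge implies a simple serving-time gate. Here’s a minimal sketch of the idea, with an invented threshold, since nothing in the leak specifies the actual logic:

```python
from datetime import date

SANDBOX_DAYS = 180  # illustrative threshold; the real value is unknown

def is_sandboxed(host_first_seen: date, today: date) -> bool:
    """Toy gate suggested by hostAge: fresh hosts may be held back
    ('sandboxed') at serving time until they age past some threshold.
    Purely illustrative; Google's actual logic is not public."""
    return (today - host_first_seen).days < SANDBOX_DAYS

print(is_sandboxed(date(2024, 4, 1), date(2024, 6, 1)))  # True: still fresh
```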
8. Google Increasingly Favors Big Brands over Small Publishers
In recent years, Google's search algorithm has noticeably shifted to favor established brands over small publishers, even when the smaller publishers provide similarly relevant, high-quality, and often superior content, as documented painstakingly by gaming enthusiast publisher Retro Dodo and independent air purification product reviewer HouseFresh.
Google would likely contend that brands generally possess stronger trust signals, e.g., robust backlink profiles, high domain authority, and significant user engagement, but the trend is troubling for publishers and search users alike. If Google is unable to properly evaluate and elevate high-quality small publishers at scale, its utility to consumers is waning, possibly death-spiraling.
Small publishers should absolutely focus on building credibility and fostering user trust to compete effectively in an increasingly brand-centric search landscape, but they should also consider a future without Google as an investment-worthy traffic driver.
9. Google Whitelists Authoritative Sites for Sensitive Queries
The appearance of attributes such as isCovidLocalAuthority and isElectionAuthority points to the existence of extra layers of curation specifically for politically sensitive queries. Presumably, sites would need to earn these labels to receive top rankings for related searches, suggesting that they were vetted and placed on whitelists.
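Nothing in the leak describes how these flags are consumed, but the implied mechanism is easy to sketch. The gating logic below is purely our guess:

```python
def curate_sensitive_serp(results: list[str], topic: str,
                          authorities: set[str]) -> list[str]:
    """Toy curation layer: for sensitive topics, vetted hosts keep the
    top positions and everyone else drops below them. Flags like
    isElectionAuthority imply such a layer exists; this logic is a guess."""
    if topic not in {"covid", "election"}:
        return results  # ordinary queries pass through untouched
    vetted = [r for r in results if r in authorities]
    rest = [r for r in results if r not in authorities]
    return vetted + rest

print(curate_sensitive_serp(["blog.example", "cdc.gov"], "covid", {"cdc.gov"}))
```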
In an otherwise scathing account of Google’s repeated obfuscations and long-running disinformation campaign, Fishkin is complimentary of Google’s whitelisting policy. This seems incoherent; Fishkin at once decries Google’s frequent, self-serving dishonesty while assuming its curation of information on controversial topics is both precise and genuinely altruistic.
If Google is the pathologically disingenuous entity Fishkin portrays throughout his post, it doesn’t deserve the benefit of the doubt on whitelisting, and he’s potentially deceiving himself here. A more coherent reading is that he’s simply more comfortable with bad-faith actors whose deceptions align with his own political commitments than with their counterparts in adversarial political movements. That’s understandable, and probably pretty universal, but far less noble than he portrays, and it means Google’s whitelisting practices should at the very least be scrutinized before praise is heaped on them.
King’s exhaustive post doesn’t mention whitelisting at all, so its import is perhaps in dispute. Nonetheless, whitelisting is exposed in the leak, and its implication for digital media publishers whose content intersects with politics is probably significant.
10. Google’s Human Labelers Offer Clues about Algorithm Signals
Google employs human labelers to identify and flag high-quality content. These evaluators review web pages against criteria such as relevance, authority, and user experience, providing feedback that supplements Google's machine learning models. By identifying exemplary content, the labelers help train these models to better detect and prioritize high-quality signals automatically, distinguishing superior from subpar content and ultimately delivering more reliable, trustworthy information to users.
Source: Google’s Search Quality Rater Guidelines, November 2023
We know something about the roles and activities of Google's human labelers, or “quality raters,” through the publicly available guidelines that Google provides to them, which describe the tests and processes they engage in, the criteria they’re expected to use, and how their feedback informs search. But the API leak revealed the attribute “golden,” a simple Boolean indicating that a resource has been evaluated as “a gold-standard document” by the human-labeling regime.
This might appear to be just a small tidbit at first glance: there doesn’t seem to be a queue for submitting resources for evaluation, so it’s out of our hands, and if a resource isn’t evaluated, it can’t earn a golden flag. True, but the labelers are also providing training data to models that attempt to replicate the work of the curators. Creating pages that adhere well to the listed criteria would align you not only with the labelers but also with the machines that mimic them.
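To make that mechanism concrete, here’s a minimal sketch of the idea: a model trained on rater verdicts learns to score unrated pages. Everything here, the library choice, the features, the data, is invented for illustration; nothing about Google’s actual models is public.

```python
# Illustrative only: a tiny classifier trained on human "golden" labels,
# standing in for whatever far larger models Google trains on rater
# feedback. Features and data are invented for this sketch.
from sklearn.linear_model import LogisticRegression

# Each row: [readability, topical_depth, citation_count] -- made-up features.
pages = [[0.9, 0.8, 12], [0.2, 0.1, 0], [0.7, 0.9, 8], [0.1, 0.3, 1]]
golden = [1, 0, 1, 0]  # rater verdicts: 1 = "gold-standard document"

model = LogisticRegression().fit(pages, golden)
print(model.predict([[0.8, 0.7, 10]]))  # the machine mimics the raters
```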
11. Leak Demotion Attributes Offer Optimization Opportunities
Several attributes listed in the leak documents point to signals that Google treats as evidence of poor quality, undermining a page’s search visibility. Major Google search algorithm updates over the years, like Panda, Penguin, and Hummingbird, rolled out accompanied by official communication positioning the changes as notable events in the cat-and-mouse game Google is forced to play against the increasingly sophisticated tactics of black hats, spammers, and scammers wanting to earn traffic at the expense of users.
King highlights a few causes for demotion that aren’t particularly surprising but weren’t previously associated with past major updates, such as…
- Anchor text-destination mismatches—when a link’s anchor text promises information that doesn’t appear on the landing page
- Internal navigation problems—when the navigational user experience is below par
- SERP click underperformance—when search users regularly decline to click on a link in the SERP or return to the SERP quickly to try another link
- Adult content
- Global (i.e., unlocalized) labeling—when a page or site doesn’t specify a locale
- Exact match domains—when a site uses a common query as its name
Some of these, such as anchor text mismatches and navigation problems, are literal quality defects, while others, like missing localization and exact-match domains, are merely correlated with poor-quality content.
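The leak says nothing about how such demotions are weighted or combined, but a toy multiplicative model conveys the intuition; every name and number below is invented:

```python
# Invented demotion multipliers for the signals listed above; the leak
# names the signals but reveals no weights or combination function.
DEMOTION_WEIGHTS = {
    "anchor_mismatch": 0.85,
    "poor_navigation": 0.90,
    "serp_click_underperformance": 0.80,
    "adult_content": 0.50,
    "unlocalized": 0.95,
    "exact_match_domain": 0.90,
}

def demoted_score(base_score: float, flags: set[str]) -> float:
    """Multiply a page's base score by each triggered demotion factor."""
    for flag in flags:
        base_score *= DEMOTION_WEIGHTS.get(flag, 1.0)
    return base_score

# Example: a page with an anchor mismatch and weak navigation
print(demoted_score(1.0, {"anchor_mismatch", "poor_navigation"}))  # 0.765
```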
12. Google Lies a Lot
Treat Google’s SEO/publisher liaisons with skepticism, and analyze their public statements as expressions of what they want you to believe.
If you’re reading this and thinking it all sounds entirely obvious and unsurprising, you’re not alone. Pretty much every major revelation stemming from the leak has matched a strain of popular speculation. Marketers have hypothesized all of the above, and presented various tests and case studies that seemed to confirm these theories.
None of this might be controversial at all were it not for the avalanche of denials and dismissals rushing down the face of Mountain View. While Google’s reputation has eroded from its “don’t be evil” peak, it still enjoys the support of many unaffiliated commentators who view the corporation as a benevolent dictator and signal-boost its public relations communications uncritically. Fishkin implies that his journey away from SEO was in part spurred on by the dismissive reaction his signature theories received from Google reps and the animosity that emerged between Google and Fishkin partisans.
More than anything, the documents contain smoking guns for solved cases that otherwise lacked the evidence to take to trial. We now have proof that the world is what we thought it was. That may not seem like much, but it’s a stronger foundation upon which to build the future.
13. The Document Leak Is a Marketer’s Rorschach Test
While the leak contained thousands of authentic internal Google technical documents outlining tens of thousands of search-relevant attributes, marketing professionals are split on whether anything of value was revealed.
In a LinkedIn post, noted search marketer Wil Reynolds weighed in, asking rhetorically, “Did you ever think that Google wouldn't use Clicks, Chrome, Gmail, Android and every other data source it has that competitors do not to try to improve its cash cow? I always assumed they used clicks, too high quality of a signal.”
Another prominent search veteran, Will Critchlow, responded to Reynolds’s post with, “So far, I'm not changing any tactics off the back of this new information.”
In his own response to Reynolds, Finnish SEO Seppo Puusa was even more dismissive of the leak’s value, writing “I love digging into the info in the documents, and spent the better part of yesterday in them. But it's important to note that this is really ‘SEO astrology’. There's a risk of using this info to ‘prove’ what you already believe.”
Turning Leak Insights into Actions
1. Prioritize UX Optimization
Since user behavior plays a significant role in rankings, strong UX becomes essential for any site that hopes to earn quality organic search traffic. Remember that Google’s inclusion of Chrome data in the algorithms means that even traffic that does not originate from search—traffic that has no associated query—still impacts search visibility. Google can collect copious, valuable user data on sites that have yet to receive any organic search traffic.
A good user experience—encompassing a logical, organized information architecture, a complementary, non-distracting, highly functional design, an aesthetic and a topical focus that is appealing to the target audience—encourages positive click/user signals and helps to minimize bounces and one-off visits.
2. Don’t Neglect Archival and Low-Performing Content
The leak’s revelations about Google’s sitewide quality metrics and domain authority convincingly imply that low-quality content will negatively impact the search performance of strong content on the same domain. Publishers should look for ways to update, reorganize, and prune all site content with the goal of maximizing quality signals across the domain.
3. Lean into Google’s Brand Biases
If Google likes brands, give it a brand; fake it until you make it. Publishers should focus on projecting a highly branded experience, regardless of the size of the site. Build a strong brand identity and enforce it consistently with a recognizable visual style and content that is curated for topical compatibility, voiced in a recognizable tone with persistent terms and audience-appropriate jargon.
Conclusion
The Google document leak opened an imperfect but sizeable window into the intricate workings of the world's most influential search engine, offering both validation and revelation to the SEO community. Despite Google's official denials and the ensuing debates, the insights derived from the leaked documents provide a clearer picture of the factors influencing search rankings. From the outsized role of user behavior and Chrome browsing habits to the favors bestowed on established brands, these findings underscore the complexity and multi-layered nature of Google's algorithms.
For digital marketers and publishers, the leak reinforces the importance of a few evergreen optimization routines. Prioritizing user experience, maintaining high-quality content across the site, and building a strong, recognizable brand are important activities for improving search visibility and performance and maintaining alignment with Google’s evolutionary path. By leveraging these insights, publishers can better navigate today’s competitive search terrain.