Yandex Data Leak: Initial Findings & SEO Learnings (The 1,922)

January 2023 has been an interesting month for Yandex, with the company suffering a sizeable data leak.

You can read more about this here.

244 Removed Factors, 988 Deprecated

In the document of 1,922 factors, 244 have been categorized as “unused” and removed from consideration.

The original ranking factor name, description, and other identifying information other than its number in the document have been removed.

988 of the ranking factors are also listed as deprecated, meaning that 64% of the document (1,232 of the 1,922 factors) is either not actively used or has been superseded. That leaves roughly 690 potential ranking factors, and a lot of them contain thin descriptions.
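The arithmetic behind those figures can be checked quickly. This is only a sanity check on the counts quoted above; the variable names are our own.

```python
# Counts quoted in this article for the leaked factor list.
total_factors = 1922
removed = 244      # categorized as "unused"
deprecated = 988   # listed as deprecated

inactive = removed + deprecated
inactive_share = inactive / total_factors
remaining = total_factors - inactive

print(f"inactive: {inactive} ({inactive_share:.0%})")
print(f"remaining: {remaining}")
```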

The age of some of these factors is also questionable, with some of the authors/those responsible for certain factors appearing to have left Yandex more than a decade ago.

For example, the author DenPlusPlus hasn’t been with Yandex for some time, and has commented on the leak highlighting that there are no “central folders” within the leak. So at most, we have a small window into the present and past of Yandex’s inner workings, but definitely not the full set of ranking factors or algorithms.

PageRank

The leaked file affirms that Yandex uses a form of PageRank as a ranking factor, and given that a lot of “Google” tactics also work on Yandex, it can be assumed that Yandex PageRank works in the same way as Google PageRank.

It’s also worth highlighting that PageRank is the first ranking factor listed.

Pessimization

This is the factor a lot of people are highlighting. Our interpretation is that when a website is penalized (pessimized), its PageRank is reduced to zero.

This is in line with the longstanding theory that when you receive a penalty in Yandex, recovery is a lot harder.

Clicks & CTR Is A Factor (User Signals)

It has been known for some time that click manipulation works in Yandex. Now, with the leaked ranking factors, we have further confirmation.

There are also mentions of hard clicks, soft clicks, quick backs, and traffic to websites from specific sources.

Overall Site Performance Impacts Individual Queries

The average performance of a URL (and host) is a ranking factor, including the number of times a URL (and host) is requested.

URL Construction Matters

As well as specific ranking factors focusing on the URL, the URL component is tagged in more than 130 ranking factors. Some of the top-level takeaways are:

Negatives

Too many trailing slashes are seen as negative
Using numbers in the URL can be seen as a negative

Positives

URL contains a corresponding country or city (GEO identifier) to the user
URL contains the query or semantic relation to the query

URL length also seems to be a factor in some form, but the descriptions don’t say whether it sways positive or negative. For example, one factor outlined is the URL length divided by 5.

Another factor covers the length of the query (request) and the URL length. It follows on from a similar factor that covers YouTube URLs and specifically uses the Levenshtein distance.

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
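To make the definition concrete, here is a minimal sketch of the standard dynamic-programming calculation. It is a textbook implementation, not anything taken from the leak, and the function name is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                      # 3
print(levenshtein("red running shoes", "red-running-shoes"))  # 2
```

In the second call, the query and a hypothetical URL slug differ only by the two separators, so the distance is low. How Yandex actually combines the metric with query and URL length is not stated in the descriptions.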

Both factors are tagged as being a part of the same “search ticket”, so one could assume that both use the Levenshtein distance metric, but it hasn’t been declared in the descriptions.

So a simplistic takeaway here would be to keep URLs simple, and as focused on the search query as possible.
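As a rough illustration of how the URL takeaways above could be audited on your own site, the sketch below extracts a few of the signals mentioned. The function name, the returned keys, and the example URL are all hypothetical; the leak does not describe how Yandex computes or weights any of these.

```python
from urllib.parse import urlparse


def url_signals(url: str, query: str) -> dict:
    """Hypothetical URL checks loosely modelled on the takeaways above."""
    path = urlparse(url).path
    path_lower = path.lower()
    query_terms = query.lower().split()
    return {
        # Number of consecutive slashes at the end of the path.
        "trailing_slashes": len(path) - len(path.rstrip("/")),
        # Whether any digit appears in the path.
        "contains_digits": any(ch.isdigit() for ch in path),
        # How many query terms appear somewhere in the path.
        "query_terms_in_path": sum(term in path_lower for term in query_terms),
        # Raw URL length in characters.
        "url_length": len(url),
    }


signals = url_signals(
    "https://example.com/shoes/red-running-shoes//",
    "red running shoes",
)
print(signals)
```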

Predicting The Number Of Products On A Page

Yandex uses DSSM (Deep Structured Semantic Model), looking at the URL and page title to determine whether a webpage has one product or multiple products listed on it.

DSSM prediction of the probability, using the document URL and the title, that there is only one product on the page.
DSSM prediction of the probability, using the document URL and the title, that there are likely a lot of products on the page.

This is especially important when determining whether a page listing multiple products (e.g. a typical eCommerce category page) is more suitable, and a better value proposition for users, than a single product page.

Yandex Has Page Quality Scores

There are 7 ranking factors mentioning page quality, and although two allude to page quality experiments, two give further information:

DSSM predicts the page quality score for a document
Page quality aggregated by the host (average score)

It is interesting that the host plays a role in perceived page quality (perhaps on the assumption that cheap hosts attract cheap, spammy websites?).

Other ranking factors in the document also show how the host plays a role…

YMYL Exists/Existed

In total, 15 factors relate to medical, financial, and legal topics.

TikTok Is There

There are factors that mention traffic and links from TikTok. It is not 100% clear if these are implemented.

Host Reliability

The number of URLs on a domain that respond with errors (presumably 5XX and 4XX) is an indicator of quality.

Metrika Data Does Impact Rankings

The ranking factors leak shows that Yandex Metrika data does impact rankings.

A lot of the descriptions simply say that the factor works in a similar way to YabarUrlVisits. YabarUrlVisits has its own ranking factor, which is described as the volume of traffic coming from the Yabar (i bar).

And then through other individual ranking factors, we know the Metrika factors that influence rankings. These are:

Number of visits to individual URLs
Number of visitors to individual URLs
The average time spent by users on individual URLs
The audience data (core audience) of visitors to webpages with a Metrika Counter
The average time a user spends on the host when accessed externally (from another non-search site) from a specific URL
Average ‘depth’ (number of hits within the host) of a user’s stay on the host when accessed externally (from another non-search site) from a specific URL

This also indicates that Yandex Direct (i.e. Yandex PPC/Yandex Paid Search) can, and does, impact organic search performance.

This type of manipulation has long been rumored to work, with anecdotal reports of some Runet webmasters setting up Metrika accounts and artificial traffic that correlated with ranking improvements.

Age of Links

The leak has revealed that the age of backlinks affects how those links influence overall search ranking.

Factors For Query Relevancy In Text & Titles

The leaked ranking factors also give us good insight into how query presence is treated within document text and titles.

Keywords in the text and titles.
Occurrence of keywords in sentences.
Occurrence of keywords in paragraphs.

It’s also worth noting that IDF (Inverse Document Frequency) is mentioned.
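For readers unfamiliar with IDF, the idea is that a term appearing in fewer documents carries more weight. The sketch below uses the classic log(N / df) form on a made-up three-document corpus; the leak does not say which IDF variant Yandex uses, so treat the formula and names as assumptions.

```python
import math


def inverse_document_frequency(term: str, documents: list[str]) -> float:
    """Classic IDF: log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return 0.0  # term never appears; avoid dividing by zero
    return math.log(len(documents) / containing)


# Made-up example corpus, for illustration only.
corpus = ["buy red shoes", "red shoes sale", "blue hats"]
print(inverse_document_frequency("red", corpus))   # appears in 2 of 3 documents
print(inverse_document_frequency("hats", corpus))  # appears in 1 of 3 documents
```

The rarer term ("hats") gets the higher score, which is why common words contribute little to relevance.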

Meta keywords have also been re-confirmed.

BM25 Algorithm Used For Text Analysis

33 different ranking factors utilize the BM25 algorithm for text analysis.

The below explanation of BM25 has been taken from Wikipedia:

In information retrieval, Okapi BM25 (BM is an abbreviation of best matching) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

The name of the actual ranking function is BM25. The fuller name, Okapi BM25, includes the name of the first system to use it, which was the Okapi information retrieval system, implemented at London’s City University in the 1980s and 1990s. BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent TF-IDF-like retrieval functions used in document retrieval.
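To show what that looks like in practice, here is a minimal sketch of the Okapi BM25 scoring formula. The parameter values k1=1.5 and b=0.75 are common defaults, not values from the leak, and the whitespace tokenization, function name, and example documents are our own simplifications.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = doc_freq.get(term, 0)
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Term frequency is dampened by k1 and normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents, for illustration only.
docs = [
    "red running shoes for sale",
    "blue hats and scarves",
    "running tips for beginners",
]
print(bm25_scores("red running shoes", docs))
```

The first document matches all three query terms and scores highest, the third matches only "running", and the second matches nothing and scores zero.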

Presence Of Yandex Ads & Ads In General

The presence of Yandex Ads, and Ads in general, are two separate ranking factors.

Nothing in the description gives an opinion as to whether or not the presence of general ads, or Yandex ads, is a good or bad thing, only that it somehow matters.

Yandex also actively looks to see if the webpage contains adverts for adult content.

Time of Day/Day of Week Impacts

There are around 10 factors listed that indicate that the time of the day, and day of the week, have an impact on rankings.

This makes sense.

If you’re searching for [restaurants near me] at 10am, it makes sense to provide you with localized results and map results that are open/opening soon for lunch, as well as a mix of review articles for lunch, and maybe some for supper.

If you perform the same search at 4pm, lunch related results are no longer as relevant, so restaurants on the map and localized results in the SERPs would better serve users if they’re related to supper.

Identifiers For Specific Websites

There are identifiers for specific websites, e.g. Wikipedia and Vkontakte. So these websites are treated as their own source types within search results, and almost have their own set of rules (to a point).