Many brands and digital marketers want to know which factors have the highest impact on search engine rankings. In practice, it’s very difficult to come up with a definitive list of key factors, as they keep changing and are influenced by updates from Google and social media networks.
Besides, it always depends on which industry you are aiming to rank in. If you are seeking higher rankings in casino SEO, the factors relevant to your strategy might be different from those that matter for a clothing brand website.
With Fortis Media’s expertise and the tools within our reach, we conducted an in-depth analysis to identify the most important ranking factors and to share data that can help you decide which direction to take your SEO strategy.
In January 2021 we conducted an analysis for a leading US horse racing company to identify which factors have the highest impact on Google rankings. Therefore, the data presented in this article is most relevant if you represent a horse racing brand or are interested in learning the ranking aspects that define this industry.
We used a novel approach based on Machine Learning and decided to share some of our findings in this report. As part of the research, we also performed Exploratory Data Analysis, t-tests, Cohen’s d, and Regression Analysis (Linear and Logistic).
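For readers unfamiliar with the effect-size measures mentioned above, here is a minimal sketch of Cohen’s d and a Welch t-statistic for two groups of results. The scores are made-up illustrative values, not data from the study, and the function names are our own.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def welch_t(a, b):
    """Welch's t-statistic (does not assume equal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical similarity scores: first-page results vs. lower-ranked results
first_page = [0.8, 0.7, 0.9, 0.6, 0.75]
other_pages = [0.4, 0.5, 0.3, 0.6, 0.45]

d = cohens_d(first_page, other_pages)
t = welch_t(first_page, other_pages)
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}")
```

The t-statistic indicates whether a difference is likely to be real, while Cohen’s d indicates how large it is relative to the spread of the data, which is why the two are usually reported together.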
This was done by collecting the top 100 ranking positions for 2907 keywords, which resulted in 274 952 data points. The data was further enriched with domain-level metrics from third-party suppliers such as Majestic and Semrush. We started with 95 variables: 79 from data suppliers and 16 new variables created as part of this analysis.
The following advice, drawn from this extensive research and analysis, is aimed at digital marketing specialists and at companies curious about how SEO works.
Note: Analyzing one company in a particular industry and time frame shows results most relevant to those limitations. This means that if we change the company and the industry, the results may vary.
1. Article Titles Should Be Similar To Search Queries You’re Trying To Rank For
Assuming causality, Search Query and Title Similarity is an important ranking factor: having more targeted articles, with Titles that are close to the target queries, could improve ranking more than having broad titles and articles.
All analysis methods agree that the Search Query and Title Similarity Index is most closely associated with the ranking position. As a variable, it penalizes Titles that are long even if they contain all the Search Query terms, meaning that Titles that are succinct and on topic tend to be associated with higher rankings.
This does not consider semantic mapping that might be happening on Google’s side where different words are treated as synonyms.
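To make the idea concrete, here is a minimal sketch of a similarity score that rewards overlap with the query and penalizes extra words in the title. The exact formula behind the Search Query and Title Similarity Index is not given in this article, so the Jaccard similarity below is only a stand-in assumption, and the query and titles are invented examples.

```python
def title_similarity(query: str, title: str) -> float:
    """Jaccard similarity between query terms and title terms.

    Assumption: a stand-in for the article's index, whose formula is not given.
    """
    q = set(query.lower().split())
    t = set(title.lower().split())
    if not q or not t:
        return 0.0
    return len(q & t) / len(q | t)

query = "kentucky derby odds"
short_title = "Kentucky Derby Odds Today"
long_title = ("Everything You Need To Know About Kentucky Derby Odds "
              "And Betting Tips This Year")

short_score = title_similarity(query, short_title)
long_score = title_similarity(query, long_title)
print(short_score, long_score)
```

Both titles contain every query term, but the long title scores lower because the extra words enlarge the union of terms. Like the index in the analysis, this sketch does no synonym matching.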
2. Search Query and Description Relevance Rank Are Not So Important
Search Query and Description Relevance Rank was the least impactful ranking factor (although statistically significant in the regression analysis).
It could be that Google in some cases chooses a description based on the user query, which might distort the results. It’s also possible that the method used to capture relevance in this analysis is deficient.
3. Building a strong portfolio of external backlinks is essential
Both Regression analyses and ML approaches showed External Backlinks to be an important ranking factor.
In the regression analyses, External Backlinks had to be log-transformed before any meaningful association could be discerned. This means that one unit of improvement in ranking corresponds to a tenfold increase in External Backlinks.
For example, if one step up in rank from the current position requires 100 extra External Backlinks, the subsequent step up from the new level would require 1000 extra External Backlinks.
In other words, the effectiveness of each new External Backlink decreases as the total portfolio of External Backlinks grows.
Having the largest External Backlinks portfolio does not guarantee top ranking, as shown in the Bivariate Analysis section. One way to look at it is that a good External Backlinks portfolio is a necessary but not sufficient condition for high ranking.
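The diminishing return described above can be sketched with a simple log10 model. The intercept and slope below are hypothetical values chosen for illustration, not coefficients from our regression, and the function names are our own.

```python
import math

# Assumption: position = intercept - slope * log10(backlinks).
# The intercept and slope are hypothetical, not coefficients from the study.
def predicted_position(backlinks, intercept=10.0, slope=1.0):
    return intercept - slope * math.log10(backlinks)

def extra_backlinks_for_one_step(backlinks, slope=1.0):
    """Extra backlinks needed to improve the predicted position by one."""
    factor = 10 ** (1.0 / slope)
    return backlinks * factor - backlinks

for current in (10, 100, 1000):
    print(current,
          round(predicted_position(current), 2),
          extra_backlinks_for_one_step(current))
```

Under these assumptions each one-position improvement costs ten times as many new backlinks as the previous one, which is the same order-of-magnitude growth as the 100 and 1000 in the example above.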
4. Social media domains tend to cluster around the 10th position
This analysis found that the largest mean External Backlinks values are at position ~10. After a deeper investigation, it was identified that this pattern is mostly driven by social media websites.
They tend to cluster around the 10th position, especially youtube.com, which on its own takes ~10% of all results for position 10.
5. Pay extra attention to the Trust Flow to Citation Flow Ratio
Trust Flow to Citation Flow Ratio showed up as an important ranking factor in the ML model section. The ideal value should be between 0.5 and 1.5.
When the variable’s relationship with ranking is examined in isolation, the mean value rises sharply for the top ranking positions while staying rather steady for the rest of the positions.
The ML modeling section displayed a penalty for domains with a low Trust Flow to Citation Flow Ratio, while the contribution hovers around zero above a certain threshold.
As with other variables, there could be some other confounding factors – one way to interpret the results is that domains with a higher Trust Flow to Citation Flow Ratio tend to have a better External Backlink portfolio.
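If you want to check your own domain against the range suggested above, here is a minimal sketch. It assumes you already have Trust Flow and Citation Flow values from your data supplier; the function names and sample numbers are hypothetical.

```python
def tf_cf_ratio(trust_flow, citation_flow):
    """Trust Flow divided by Citation Flow, or None if Citation Flow is 0."""
    if citation_flow == 0:
        return None
    return trust_flow / citation_flow

def in_suggested_band(ratio, low=0.5, high=1.5):
    """True if the ratio falls in the 0.5 to 1.5 range suggested in this article."""
    return ratio is not None and low <= ratio <= high

# Hypothetical domains: (Trust Flow, Citation Flow)
healthy = tf_cf_ratio(30, 40)
weak = tf_cf_ratio(10, 50)
print(healthy, in_suggested_band(healthy))
print(weak, in_suggested_band(weak))
```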
This section reviews the patterns between ranking and a selected set of potential explanatory variables, investigating the type of relationship (linear or non-linear) and outliers.
Selected on-site variables, like Average Search Query and Title Similarity Index, tend to display a more linear pattern than off-site variables in relation to ranking.
Off-site variables such as External Backlinks, Citation Flow, and Referring Domains display a non-linear relationship, where the mean value for a given rank increases until ~10th position after which the trend reverses.
Part of this pattern for off-site variables is driven by social media websites. Removing social media websites changes the pattern.
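A minimal sketch of this comparison is shown here: average an off-site metric per ranking position, once with all listings and once with social media domains excluded. The rows and the list of social domains are invented for illustration and are not taken from our data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical list of social media domains to exclude
SOCIAL_DOMAINS = frozenset({"youtube.com", "facebook.com", "twitter.com"})

def mean_backlinks_by_position(rows, exclude=frozenset()):
    """Mean external backlinks per ranking position, skipping excluded domains."""
    buckets = defaultdict(list)
    for domain, position, backlinks in rows:
        if domain not in exclude:
            buckets[position].append(backlinks)
    return {pos: mean(vals) for pos, vals in sorted(buckets.items())}

# Made-up listings: (domain, ranking position, external backlinks)
rows = [
    ("example-racing.com", 1, 5000),
    ("example-odds.com", 1, 3000),
    ("example-tips.com", 10, 400),
    ("youtube.com", 10, 1_000_000),
]

with_social = mean_backlinks_by_position(rows)
without_social = mean_backlinks_by_position(rows, exclude=SOCIAL_DOMAINS)
print(with_social)
print(without_social)
```

In this toy data a single social media listing is enough to push the position-10 average far above position 1, and removing it restores the expected ordering.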
Two hypotheses could be considered to explain the pattern for social media websites peaking at the ~10th position.
1. Users are downvoting social media websites by clicking on them less often, and Google down-ranks them as a result.
2. Social media websites have less relevant content.
Overall the relationships between ranking and other variables are weak due to the wide variance of each SEO metric for any given rank.
So, in conclusion, off-site metrics tend to display a piecewise pattern with a transition point at position 10, as can be seen below.
Disclaimer: The data displayed in this section is from January 16th 2021, and does not represent the current context but is included to show the overall context of the analyzed data.
The average number of external backlinks tends to grow until the 10th SERP position, after which it drops.
However, removing social media domains changes the distribution of external backlinks: higher ranking positions then tend to have more backlinks.
Social Media Listings are Potentially Skewing Results
Social media websites comprise only 0.003% of all unique domains but represent ~3% of all listings and ~70% of off-site metric value mass.
This means that social media listings may be skewing the results. According to Investopedia, “skewing means a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data”.
Machine Learning Analysis
Examining SHAP patterns showed that the Search Query and Title Similarity Index seems to display a linear pattern in relation to ranking: higher Index values are associated with a higher likelihood of being ranked on the 1st page, or of ranking higher on the first page. If you are not familiar with the concept of SHAP in Machine Learning, learn more in this article from Towards Data Science.
Trust Flow to Citation Flow Ratio is another important ranking factor according to the model. It tends to show a penalty for low ratio values and to be neutral past a certain threshold. This could be because domains with low Ratio values tend to have poor-quality external backlink portfolios.
The highest External Backlinks (or Citation Flow) values are not necessarily associated with top ranking positions, but having lower values tends to be associated with lower rankings.
Ranked variables when all ranking factors are included:
After Organic Traffic, the Search Query and Title Similarity Index is the second most important variable for model performance. Trust Flow to Citation Flow Ratio is in the top 5, but none of Citation Flow, Trust Flow, or External Backlinks is in the top 20.
This is partly because the model uses other variables, such as Referring IPs or Referring Domains type HTTPS, which are highly correlated with the aforementioned ones. Being highly correlated among themselves, these variables split the importance between them, further diluting each one’s contribution.
Importance of Ranking Factors According to ML
The graphs show variables in descending order of their importance in predicting rank. After extensive internal discussions, we decided to focus only on those factors over which we have the strongest influence.
Trust Flow to Citation Flow Ratio came out as the strongest predictor. However, the graph doesn’t show whether there is a linear relationship between the rank and variable value. Later in this article, we will present a deeper look into potential emerging patterns.
Low Trust Flow to Citation Flow Ratio is associated with lower rankings
Visual inspection of the ML models shows that a lower TF/CF ratio is associated with lower rank predictions. Both Machine Learning models learned that lower Trust Flow to Citation Flow Ratios are associated with lower rankings.
For the classification model, the threshold seems to be at ~0.5: below it, the majority of SHAP values are negative (less likely to be on the first page), and beyond it the values stabilize, taking both positive and negative values. Take note of the color gradient: pages with both low Citation Flow values and a low Trust Flow to Citation Flow Ratio seem to be especially penalized.
A similar pattern holds for the Learn-to-Rank model, except that the ratio value below which most SHAP values lie under the 0 line is slightly below 1.
Higher Search Query and Title Similarity Index Values Are Associated With Higher Ranks
The Search Query and Title Similarity Index displays a rather straightforward linear relationship: higher values are associated with a higher likelihood of being on the 1st page. This is noteworthy because the ML model can learn any pattern to fit the data, yet the pattern is linear.
This pattern holds for both the classification and Learn-to-Rank models, meaning that both models find higher Search Query and Title Similarity Index values to be associated with higher ranks.
It corroborates the findings from the Linear Modeling and the Variable Relationships sections where it was shown to have the strongest association with our target variable – Ranking.
Citation Flow Displays A Non-Linear Pattern
The Classification ML model consistently displays a pattern where the highest contribution towards being classified on the 1st Page is in the range of 40 to 50. The pattern holds even if we switch out Citation Flow for External Backlinks, Trust Flow, or Referring Domains, and if we sample different queries for model building. It also persists when we build an ML model with only a single input variable, Citation Flow.
Note that the data points with high SHAP Values (contribution towards being classified as being on the 1st page) tend to also have high Trust Flow to Citation Flow Ratios. This is consistent with the findings investigating two-way variable relationships, which showed that the 1st page does not have on average the highest External Backlinks values – the 2nd page does.
Description Relevance Rank Is The Least Predictive
In comparison to other variables in the model, Description Relevance Rank seems to be the most weakly associated with the target variable and as such not very predictive. The variable is the rank ordering of the BM25 score (an information retrieval metric) between the Search Result Description text and the Search Query, ranked for every query: the best-matching description gets rank 1, and the rank increases as the metric value decreases.
In the linear modeling section, the variable was statistically significant but had the smallest effect among the selected variables; the same holds when using ML models. The variable displays a linear trend when classifying for 1st-page ranking or ranking the full list (100 rankings per query), but no discernible pattern is visible when ranking the results of the first page.
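As an illustration of how a Description Relevance Rank of this kind could be built, here is a minimal sketch that scores each result description against the query with Okapi BM25 and then ranks the descriptions within that query. The BM25 variant, the parameters (k1 = 1.5, b = 0.75), the whitespace tokenization, and the sample descriptions are all assumptions; the article does not specify the exact setup used in the analysis.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each doc for the query (whitespace tokenization)."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of docs containing each term
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def relevance_ranks(scores):
    """Rank 1 for the highest score, increasing as the score decreases."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

# Made-up descriptions for one hypothetical query
descriptions = [
    "Latest Belmont Stakes odds and picks",
    "Horse racing news and results",
    "Belmont Park visitor guide",
]
scores = bm25_scores("belmont stakes odds", descriptions)
ranks = relevance_ranks(scores)
print(ranks)
```

Because the rank is computed per query, a value of 1 only means "best-matching description among this query’s results", not a high relevance score in absolute terms.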
Client Ranking Profile for Classification (Ranking on the 1st Page)
The red dots show where on the contribution scale (SHAP) our client’s results lie. Because the client is one of the most frequently ranking domains and has the highest mean and median rankings in the data set, it’s reasonable to assume that the results are biased towards their domain, assuming causality.
Among the off-site metrics in the model, the client seems to have rather good values, with perhaps some room for improvement in the Trust Flow to Citation Flow Ratio. On-site metrics are assessed on a query-by-query basis, but it seems that there would be some gains to be had in those cases where the Search Query and Title Similarity Index value is low.
The analysis as it stands found some significant factors associated with Ranking, but other potential factors could be relevant.
Online literature, like Cognitive SEO, as well as SME input, suggests that Recency (time since the article was published) is a potentially important variable. This analysis tried including the variable, but formatting inconsistencies from the data supplier made it infeasible. With better quality data, it could be worthwhile to include this variable in future research.
Another aspect worthwhile investigating is page-level content, such as text length, text relevance to query (BM25, TF-IDF), and others. An effort was put in to gather the HTML documents for all 274 952 ranking results (91 522 unique URLs), but too much data was missing, with non-random gaps in data (lower ranked results tended to have more data missing), which could have led to bias in the analysis.
This was partly because the scraping was done a few months after initial data gathering and because the ranking scraping and page scraping were done from different locales.
In addition, several sources, like the Search Engine Journal, suggest Page Load Speed is an important factor. This data was unavailable for this analysis, but future analyses could consider including it.
Another area of investigation that came up during this analysis was Content Accessibility Metrics as explanatory variables. They were not included for the same reason as the other page-level content metrics, but future research should consider including them.
Lastly, a more generic model would be warranted. This analysis was based on 2907 selected keywords where twinspires.com is already a well-ranking domain, so the patterns learned from the data will be biased towards that domain (and other top-ranking domains).
Creating a model with other keywords (randomly selected or from other subject areas) could improve the generalizability of insight.
Remember that Fortis Media has been selected among the top SEO Companies by Designrush.