*Article* **Ranking by Relevance and Citation Counts, a Comparative Study: Google Scholar, Microsoft Academic, WoS and Scopus**

#### **Cristòfol Rovira \*, Lluís Codina, Frederic Guerrero-Solé and Carlos Lopezosa**

Department of Communication, Universitat Pompeu Fabra, 08002 Barcelona, Spain; lluis.codina@upf.edu (L.C.); frederic.guerrero@upf.edu (F.G.-S.); carlos.lopezosa@upf.edu (C.L.)

**\*** Correspondence: cristofol.rovira@upf.edu; Tel.: +34-667295308

Received: 5 August 2019; Accepted: 14 September 2019; Published: 19 September 2019

**Abstract:** Search engine optimization (SEO) constitutes the set of methods designed to increase the visibility of, and the number of visits to, a web page by means of its ranking on the search engine results pages. Recently, SEO has also been applied to academic databases and search engines, in a trend that is in constant growth. This new approach, known as academic SEO (ASEO), has generated a field of study with considerable future growth potential due to the impact of open science. The study reported here forms part of this new field of analysis. The ranking of results is a key aspect of any information system, since it determines the way in which results are presented to the user. The aim of this study is to analyze and compare the relevance ranking algorithms employed by various academic platforms in order to identify the importance of citations received in their algorithms. Specifically, we analyze two search engines and two bibliographic databases: Google Scholar and Microsoft Academic, on the one hand, and Web of Science and Scopus, on the other. A reverse engineering methodology is employed, based on the statistical analysis of Spearman's correlation coefficients. The results indicate that the ranking algorithms used by Google Scholar and Microsoft Academic are the two that are most heavily influenced by citations received. Indeed, citation counts are clearly the main SEO factor in these academic search engines. An unexpected finding is that, at certain points in time, Web of Science (WoS) used citations received as a key ranking factor, despite the fact that WoS support documents claim this factor does not intervene.

**Keywords:** ASEO; SEO; reverse engineering; citations; Google Scholar; Microsoft Academic; Web of Science; WoS; Scopus; indicators; algorithms; relevance ranking; citation databases; academic search engines

#### **1. Introduction**

The ranking of search results is one of the main challenges faced by the field of information retrieval [1,2]. Search results are sorted so that those best able to satisfy the user's information need are ranked at the top of the page [3]. The task, though, is far from straightforward, given that a successful ranking by relevance depends on the correct analysis and weighting of a document's properties, as well as on the analysis of the information need and the keywords used [1,2,4].

Relevance ranking has been successfully employed in a number of areas, including web page search engines, academic search engines, academic author rankings and the ranking of opinion leaders on social platforms [5]. Many algorithms have been proposed to automate relevance ranking, and some of them have been successfully implemented. In each case, different criteria are applied depending on the specific characteristics of the elements to be ordered. PageRank [6] and Hyperlink-Induced Topic Search (HITS) [7] are the best-known algorithms for ranking web pages. Variants of these algorithms have also been used to rank influencers in social media, and include, for example, IP-Influence [8], TunkRank [9], TwitterRank [10] and TURank [11]. To search for academic documents, various algorithms have been proposed and used, both for the documents themselves and for their authors. These include Authority-Based Ranking [12], PopRank [13], Browsing-Based Model [14] and CiteRank [15]. All of them use the number of citations received by the articles as a search ranking factor in combination with other elements, such as publication date, the author's reputation and the network of relationships between documents, authors and affiliated institutions.

Many information retrieval systems (search engines, bibliographic databases, citation databases, etc.) use relevance ranking in conjunction with other types of sorting, including chronological, alphabetical by author, number of queries and number of citations. In search engines like Google, relevance ranking is the predominant approach and is calculated by considering more than 200 factors [16,17]. Unfortunately, Google does not release precise details about these factors; it publishes only fairly sketchy, general information. For example, the company says that inbound links and content quality are important [18,19]. Google justifies this lack of transparency as a way to fight search engine spam [20] and to prevent low-quality documents from being ranked at the top of the results by falsifying their characteristics.

Search engine optimization (SEO) is the discipline responsible for optimizing websites and their content to ensure they are ranked at the top of the search engine results pages (SERPs), in accordance with the relevance ranking algorithm [21]. In recent years, SEO has also been applied to academic search engines, such as Google Scholar and Microsoft Academic. This new application has received the name of "academic SEO" (or ASEO) [22–26]. ASEO helps authors and publishers to improve the visibility of their publications, thus increasing the chances that their work will be read and cited.

However, it should be stressed that the relevance ranking algorithm of academic search engines differs from that of standard search engines. The ranking factors employed by the respective search engine types are not the same and, therefore, many of those used by SEO are not applicable to ASEO while some are specific to ASEO (see Table 1).

SEO companies [27–29] routinely conduct reverse engineering research to measure the impact of the factors involved in Google's relevance ranking. Based on the characteristics of the pages that appear at the top of the SERPs, the factors with the greatest influence on the relevance ranking algorithm can be deduced. It is not a straightforward task since many factors have an influence and, moreover, the algorithm is subject to constant changes [30].

Studies that have applied a reverse engineering methodology to Google Scholar have shown that citation counts are one of the key factors in relevance ranking [31–34]. Microsoft Academic, on the other hand, has received less attention from the scientific community [35–38] and there are no specific studies of the quality of its relevance ranking.

Academic search engines, such as Google Scholar and Microsoft Academic, are an alternative to commercial bibliographic databases, such as Web of Science (WoS) and Scopus, for indexing scientific citations; they provide a free service of similar performance that competes with the business model developed by the classic services. Unlike search engines, bibliographic databases are fully transparent about how they calculate relevance, clearly informing users on their help pages how their algorithms work [39,40].

The primary aim of this study is to verify the importance attached to citations received in the relevance ranking algorithms of two academic search engines and two bibliographic databases. We analyze the two main academic search engines (i.e., Google Scholar and Microsoft Academic) and the two bibliographic databases of citations providing the most comprehensive coverage (WoS and Scopus) [41].

We address the following research questions: Is the number of citations received a key factor in Google Scholar relevance rankings? Do the Microsoft Academic, WoS and Scopus relevance algorithms operate in the same way as Google Scholar's? Do citations received have a similarly strong influence on all these systems? A similar approach to the one adopted here has been taken in previous studies of the factors involved in the ranking of scholarly literature [22,23,31–34].

The rest of this manuscript is organized as follows. First, we review previous studies of the systems that concern us here, above all those that focus on ranking algorithms. Next, we explain the research methodology and the statistical treatment performed. We then report, analyze and discuss the results obtained before concluding with a consideration of the repercussions of these results and possible new avenues of research.

#### **2. Related Studies**

Google Scholar, Microsoft Academic, WoS and Scopus have been analyzed previously in works that have adopted a variety of approaches, including, most significantly:


However, few studies [43,62] have focused their attention on information retrieval and the search efficiency of academic search engines, while even fewer papers [22,23,31–34] have examined the factors used in ranking algorithms.

The main conclusions to be drawn from existing studies of relevance ranking in the systems studied can be summarized as follows:


Surprisingly, the relevance ranking factors of academic search engines and bibliographic databases have attracted little interest in the scientific community, especially if we consider that a better position in their rankings means enhanced possibilities of being found and, hence, of being read. Indeed, the initial items on a SERP have been shown to receive more attention from users than that received by items lower down the page [63].


**Table 1.** Search engine optimization (SEO) and academic search engine optimization (ASEO) factors. WoS, Web of Science.

In the light of these previous reports, it can be concluded that the number of factors intervening in the academic search engines is likely to be smaller than the number employed by Google and that, therefore, the algorithm is simpler (see Table 1).

#### **3. Methodology**

This study is concerned with analyzing the relevance ranking algorithms used by academic information retrieval systems. We are particularly interested in identifying the factors they employ, especially in the case of systems that do not explain how their ranking algorithm works. A reverse engineering methodology is applied to the two academic search engines (i.e., Google Scholar and Microsoft Academic) and to two bibliographic databases of citations (i.e., WoS and Scopus). These, in both cases, are the systems offering the most comprehensive coverage [41,65,66]. The specific objective is to identify whether the citations received by the documents are a determining factor in the ranking of search results.

Reverse engineering is a research method commonly used to study any type of device in order to identify how it works and what its components are. It is a relatively economical way to obtain information about the design of a device or the source code of a computer program based on the compiled files.

One of the fields in which reverse engineering is most widely applied is precisely that of detecting the factors included in Google's relevance ranking algorithm [28,29,67]. The little information provided by Google [16] is used as a starting point to analyze the characteristics of the pages ranked at the top of the search results, in order to deduce which factors are included and what their respective weightings are. However, the ranking algorithms are complex [68]; moreover, they are subject to frequent modifications, and the results of reverse engineering are usually inconclusive. Recently, a reverse engineering methodology has also been applied to academic search engines [34].

To obtain an indication of the presence of a certain positioning factor, the ranking data are treated statistically by applying Spearman's correlation coefficient, selected here because the distribution is not normal according to Kolmogorov–Smirnov test results. Generally, a ranking created by the researcher using the values of the factor under study is compared with the search engine's native ranking—for example, a ranking based on the frequency of appearance of the keywords in the document versus Google's native ranking. If a high coefficient is obtained, this means that the factor contributes significantly to the ranking. However, in the case of Google, many factors intervene: more than 200, according to many sources [69,70]. It is therefore very difficult to detect high correlations indicating that a certain characteristic has an important weighting. Statistical studies generally consider a correlation between 0.4 and 0.7 to be moderate and a correlation above 0.7 to be high. In reverse engineering studies with Google, the correlation values between the positions of the pages and the quantitative values of the supposed positioning factors do not normally exceed 0.3 [68]. Although the correlations are low, studies of this type can still yield relatively clear indications of the factors intervening in the ranking.
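The core comparison can be sketched in a few lines of Python. The snippet below is an illustrative implementation, not the code used in the study: it converts two sets of values to ranks (sharing average ranks among ties) and computes Spearman's rho as the Pearson correlation of those rank vectors.

```python
from statistics import mean

def rank_with_ties(values):
    """Assign 1-based average ranks; tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank_with_ties(x), rank_with_ties(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

For example, comparing native result positions `[1, 2, 3, 4, 5]` with a citation-based ranking `[1, 3, 2, 4, 5]`, `spearman_rho` returns 0.9: the two middle results swap places, but the orderings remain very similar.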

Google themselves provide even less data on how they rank by relevance in Google Scholar. Perhaps their most explicit statement is the following:

"Google Scholar aims to rank documents the way researchers do, weighing the full text of each document, where it was published, who it was written by as well as how often and how recently it has been cited in other scholarly literature." [71].

Previous research [64,72] has shown that Google Scholar applies far fewer ranking factors than is the case with Google's general search engine. This is a great advantage when applying reverse engineering since the statistical results are much clearer, with some correlations being as high as 0.9 [34].

Likewise, Microsoft Academic does not offer any specific details about its relevance ranking algorithm [73]. We do know, however, that it applies the Microsoft Academic Graph or MAG [74], an enormous knowledge database made up of interconnected entities and objects. A vector model is applied to identify the documents with the greatest impact using the PopRank algorithm [13,75]. However, Microsoft Academic does not indicate exactly what the "impact" is when this concept is applied to the sorting algorithm:

"In a nutshell, we use the dynamic eigencentrality measure of the heterogeneous MAG to determine the ranking of publications. The framework ensures that a publication will be ranked high if it impacts highly ranked publications, is authored by highly ranked scholars from prestigious institutions, or is published in a highly regarded venue in highly competitive fields. Mathematically speaking, the eigencentrality measure can be viewed as the likelihood that a publication will be mentioned as highly impactful when a survey is posed to the entire scholarly community" [76]

Unlike these two engines, the WoS and Scopus bibliographic databases provide detailed information about their relevance ranking factors [39,40]. In systems of this type, a vector model is applied [1] and relevance is calculated based on the frequency and position of the keywords of the searches in the documents; therefore, citations received are not a factor.

Another factor that facilitates the use of reverse engineering in the cases of Google Scholar, Microsoft Academic, WoS and Scopus is the information these systems provide regarding the exact number of citations received, a factor used to compute their rankings. Unlike the general Google search engine, which does not give reliable information about the number of inbound links, in the four systems studied the number of citations received, and even the list of all citing documents, is easily obtained. The relative simplicity of the algorithms and the accuracy of the citation counts mean reverse engineering is especially productive when applied to the study of the influence of citations on relevance ranking in academic search engines and bibliographic databases of citations.

For the study reported here, 25,000 searches were conducted in each system, a similar number to those typically conducted in reverse engineering studies [27–29] or other analyses of Google Scholar rankings [22,23,31–34]. The ranking provided by each tool was then compared with a second ranking created applying only the number of citations received. As the distributions were not normal according to Kolmogorov–Smirnov test results, Spearman's correlation coefficient was calculated. The hypothesis underpinning reverse engineering as applied to search engine results is that the higher the correlation coefficient, the more similar the two rankings are and, therefore, a greater weight can be attributed to the isolated factor used in the second ranking.

To avoid thematic biases, the keywords selected for use in the searches needed to be as neutral as possible. Thus, we chose the 25 most frequent keywords appearing in academic documents [77–79]. Searches were then conducted in Google Scholar and Microsoft Academic using these keywords, which also enabled us to identify two-term searches based on the suggestions made by these two engines. Next, from among these suggestions we selected those with the greatest number of results. In this way, we obtained two sets of 25 keywords, the first formed by a single term and the second by two. The terms used can be consulted in Annex 1. It is critical that each search provides as many results as possible in order to ensure that the statistical treatment of the rankings of these results is robust. It is for this reason that we did not use long-tail keywords.

To address our research question concerning the impact of citations received on ranking, we undertook searches with these keywords in each system, collecting up to 1000 results each time. In the case of the academic search engines, searches were carried out with both one- and two-term keywords. In the case of the bibliographic databases, searches were only carried out with two-term keywords, since our forecasts were very clear in indicating that citations did not affect the results—indeed, the documentation for these systems also makes this quite clear. However, as we see below, the results were not as expected.

The search engine data were obtained using the Publish or Perish tool [80,81] between 10 May 2019 and 30 May 2019 (see Appendices A and B). The data from the bibliographic databases were obtained by exporting records from the systems themselves between these same dates.

In each of the systems studied, our rankings created using citation counts were compared with the native rankings of each system. To do this, the number of citations received was transformed to an ordinal scale, a procedure previously used in other studies [31,32]. According to reverse engineering methodology, if the two rankings correlate then they are similar and, therefore, it can be deduced that citations are an important factor in the relevance ranking algorithms. The ranking by citations received was correlated for each of the 25 searches (and their corresponding 1000 results) carried out in each system with the native ranking of these systems.

To obtain a global value for each system that integrates the 25 searches and their corresponding 25,000 data items, the median values of each of the 25 citation search rankings were used for each position in the native ranking. In the scatter plots that follow, the native ranking positions of each system are shown on the x-axis, while the rankings according to citations received are shown on the y-axis. Each gray dot corresponds to one of the 25,000 data items from the 25 searches of 1000 results each conducted on each system. The blue dots are the 1000 median values that indicate the central tendency of the data. The more compact the medians are, and the closer to the diagonal, the greater the correlation between the two rankings.
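The per-position aggregation can be sketched as follows. This is an illustrative fragment under the paper's setup (25 searches of up to 1000 results each), not the code used in the study:

```python
from statistics import median

def median_per_position(rankings):
    """rankings: one list per search, each holding the citation-based rank
    observed at native positions 1, 2, 3, ...  Returns the median citation
    rank for every native position across all searches (the blue dots in
    the figures)."""
    return [median(position_values) for position_values in zip(*rankings)]
```

For instance, with three toy searches of five results each, `median_per_position([[1, 2, 3, 4, 5], [2, 1, 3, 5, 4], [1, 3, 2, 4, 5]])` yields `[1, 2, 3, 4, 5]`: the per-position medians fall exactly on the diagonal, indicating strong agreement between the native and citation-based rankings.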

The software used in the analysis was R, version 3.4.0 [82], and SPSS, version 20.0.0. The confidence intervals were constructed via normal approximation by applying Fisher's transformation using the R psych package [83,84]; Fisher's transformation, when applied to Spearman's correlation coefficient, is asymptotically normal. Graphs were drawn with Google Sheets and Tableau.
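The confidence-interval construction can be approximated as below. This is a sketch, not the psych package's exact computation: it uses the Fisher z-transform with the standard error sqrt(1.06 / (n − 3)) commonly quoted for Spearman's rho (some implementations use 1 / sqrt(n − 3) instead), and a fixed 1.96 critical value for a 95% interval.

```python
import math

def spearman_ci95(rho, n):
    """Approximate 95% confidence interval for Spearman's rho.
    The estimate is mapped to z-space with Fisher's transformation
    (atanh), where it is approximately normal; the interval is built
    there and mapped back with tanh."""
    z = math.atanh(rho)                   # Fisher transformation
    se = math.sqrt(1.06 / (n - 3))        # approximate SE for Spearman's rho
    return math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)
```

With rho = 0.9 over n = 1000 results per search, this gives an interval of roughly (0.89, 0.91): at this sample size, the high correlations reported in Table 2 are tightly estimated.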

#### **4. Analysis of Results**

Table 2 shows the results obtained when analyzing the four systems. It can be seen that in some cases different analyses were conducted on the same system, reflecting various circumstances impacting the study. For example, in the case of Microsoft Academic we did not perform a full-text search; rather, searches were limited to the bibliographic reference: that is, the title, keywords, name of publication, author and the abstract. Interestingly, in conducting the study we found that in more than 95% of the searches the keywords were present in the result titles. This gave rise to a problem when we compared the results with Google Scholar, since this search engine does perform full-text searches. For this reason, we undertook a second data collection in the case of Google Scholar, restricting searches to the title; in this way, we were able to make a more accurate comparison of the results provided by Google Scholar and Microsoft Academic. A second variant resulted from the number of search terms: the study was carried out using two sets of 25 keywords, the first made up of a single term and the second of two terms. Finally, in the case of WoS, two data collections were undertaken since it became apparent that the ranking criteria had changed. Each of these variants allows us to make partial comparisons and to analyze specific aspects of the systems.


**Table 2.** Correlation coefficients for the academic search engines.

The correlations for the two academic search engines were, in all cases, higher than 0.7 and in some cases reached 0.99 (Table 2). These results indicate that citation counts are likely to constitute a key ranking factor with considerable weight in the sorting algorithm. Other factors can cause certain results to climb or fall in the ranking, but the main factor would appear to be citations received. In the case of the bibliographic databases, the correlations were close to zero and, therefore, in such systems, we have no evidence that citations intervene. However, in the case of WoS, over a period of several days, we found that it was using different sorting criteria and its results were in fact ranked using the number of citations received as the primary factor. This result is surprising since it does not correspond to the criteria the database claims to use in its documentation. We attribute these variations to tests being performed to observe user reactions, on the basis of which decisions can presumably be taken to modify the ranking algorithm, but this is no more than an inference.

We found virtually no differences between searches conducted with one or two terms in the academic search engines (see Figures 1–6). In principle, there is no reason why there should be any differences, as the same ranking algorithm was applied in all cases. However, we did find differences in the case of Google Scholar when searches were performed with or without restriction to the title, the same search providing a different ranking. When the terms were in all the titles, correlation coefficients of 0.9 were obtained (see Figures 3–6). When this was not the case, the correlation coefficient was still very high, but fell to 0.7 in the searches with two words (see Figure 2). This difference of almost 0.2 is a clear indication that the inclusion of keywords in the title is also an important positioning factor. These differences are even more evident if we analyze the correlation coefficients of each search: Figure 7 shows how, in all cases, the correlation coefficients of the searches performed with this restriction are greater than when the restriction is not included. When searches are restricted to the title, this factor is nullified since all the results have it. In such circumstances, the correlation is almost 1, since practically the only factor in operation is that of citations received. Therefore, indirectly, we can verify that the inclusion of the search terms in the document title forms part of the sorting algorithm, which we already knew, but for which we are now able to provide quantitative evidence; this evidence also shows that the weight of this factor is lower than that of citations, since there is little difference between the two correlations. Unfortunately, this same analysis cannot be conducted in the case of Microsoft Academic because it does not permit full-text searches.

**Figure 1.** Google Scholar Searches with One Word and No Title Restriction (rho 0.968, median in blue, the rest of data in gray).

**Figure 2.** Google Scholar Searches with Two Words and No Title Restriction (rho 0.721, median in blue, the rest of data in gray).

**Figure 3.** Google Scholar Searches with One Word and Title Restriction (rho 0.990, median in blue, the rest of data in gray).

**Figure 4.** Google Scholar Searches with Two Words and Title Restriction (rho 0.994, median in blue, the rest of data in gray).

**Figure 5.** Microsoft Academic Searches with One Word and Title Restriction (rho 0.907, median in blue, the rest of data in gray).

**Figure 6.** Microsoft Academic Searches with Two Words and Title Restriction (rho 0.937, median in blue, the rest of data in gray).

**Figure 7.** Google Scholar: Title Restriction vs No Title Restriction with Two Words.

Likewise, we detected no differences between Google Scholar and Microsoft Academic, above all when comparing the results of searches restricted to the title (Figures 3–6). In all four cases, we obtained correlation coefficients above 0.9. Therefore, it seems that both academic search engines apply a very similar weight to the factor of citations received.

Finally, it is worth noting that the two bibliographic databases (see Figures 8 and 9) do not employ citations received as a positioning factor, as stated in their documentation. It is therefore perfectly logical that their corresponding correlation coefficients are almost zero. However, somewhat surprisingly, in the case of WoS, we found that between 20 May 2019 and 25 May 2019 the ranking was different and we obtained a correlation coefficient similar to that of the academic search engines, i.e., 0.9 (see Figures 10–12); therefore, a different algorithm was being applied in which citation counts played a significant part. Figures 10 and 11 show screenshots of the same search under the two different relevance rankings. It is common, before introducing changes in the design of a website, to run tests with real users: different prototypes are published at random to gather information on user behavior and to determine which prototype achieves the greatest acceptance. As discussed above, it would seem that WoS was implementing such a procedure and was carrying out tests aimed at modifying its relevance ranking using an algorithm similar to that of the academic search engines, although we should insist that this is only an inference.

**Figure 8.** Scopus Searches with Two Words and Title Restriction (rho −0.10, median in blue, the rest of data in gray).

**Figure 9.** WoS Searches with Two Words and Title Restriction (Version 1) (rho −0.075, median in blue, the rest of data in gray).

**Figure 10.** Search conducted on WoS with relevance ranking using the number of citations.

**Figure 11.** Same search as in Figure 10 with relevance ranking but without using the number of citations.

**Figure 12.** WoS Searches with Two Words and Title Restriction (Version 2) (rho 0.907, median in blue, the rest of data in gray).

#### **5. Discussion**

The importance attached to citations received in Google Scholar's ranking of search results is not exactly a new finding. Beel and Gipp [31,32] described the great importance of this factor, both in full-text searches and in searches by title only. However, our study incorporates methodological improvements on these earlier studies, giving greater consistency to our results. Beel and Gipp applied a very basic statistical treatment, drawing conclusions from an analysis of scatter plots but without calculating correlation coefficients or conducting other specific statistical tests. Moreover, to obtain a global value across rankings the authors took the mean; it is our understanding that the more appropriate measure of central tendency for ordinal variables is the median. Finally, the words the authors used when conducting their searches were randomly selected from an initial list, a procedure that generated a number of problems since many searches did not return any results. In contrast, the procedure applied in the study described here is based on the most frequent words in academic documents [77–79] and on the searches suggested by the academic search engines themselves, which are derived from an analysis of a large volume of user searches. This procedure ensures the random selection of the searched content and that the vast majority of searches return at least 1000 results. Future studies need to confirm that searches providing few results apply the same ranking criteria, as would be expected.

Beel and Gipp [31,32] found that citations received were more influential in full-text searches than they were in those restricted to the title only. Their conclusions were that Google Scholar applied two slightly different algorithms depending on the search type. Our results differ on this point as we detect a greater weighting for citations in searches restricted to just the title. There is no reason, however, to believe that different algorithms are being applied, rather it would appear to be a case of the same algorithm behaving differently depending on the factors that intervene. The presence of search words in the title is a positioning factor that forms part of the algorithm. If we ensure that all the results have the search words in the title, then we cancel out this factor and the effect of citations received is very clear. On the other hand, in full-text searches this factor does intervene and, therefore, the influence of citations received is less clear since the ranking is also determined by the presence or otherwise of the words in the title.

In a study conducted by Martín-Martín et al. [33], the authors found that in Google Scholar the citations received also had a strong influence on searches by year of publication. The authors calculated Pearson's correlation coefficient and obtained values above 0.85. These results are similar to those obtained in our study. However, Martín-Martín et al. [33] adopted a somewhat unusual method for calculating the overall value of all the searches conducted, taking the arithmetic mean of the correlation coefficients. It is our understanding that no shortcuts should be taken in obtaining a measure of central tendency: it is more appropriate to obtain the median for each position and then calculate the correlation coefficient of these medians with the Google Scholar ranking.

In Rovira et al. [34], the authors focused their attention on the weight of citations received in the relevance ranking, but only in the case of Google Scholar. While in this earlier study the authors considered searches by year, author, publication and the "cited by" link, searches by keyword were not examined. However, a very similar conclusion was reached regarding citations received: namely, that they are a very important relevance ranking factor in Google Scholar. The present study has expanded this earlier work by analyzing other information retrieval systems using keyword searches, the most common search type conducted.

Finally, it is worth stressing that we have not found any previous reports on the specific criteria for the relevance ranking used by the other three systems analyzed here. As such, we believe our study provides new reliable data on these systems.

Relevance is a concept that is clearly open to interpretation, since it seeks to identify the items of highest quality, a characteristic with a very strong subjective element. The diversity of algorithms for determining relevance is a clear indicator of how complex it is to automate. This difficulty may explain why citations received, an objective and easily computed signal, are granted so much weight.

#### **6. Conclusions**

Our results indicate that citation counts are probably the main factor employed by Google Scholar and Microsoft Academic in their ranking algorithms. In the case of Scopus, by contrast, we find no evidence that citations are taken into account, which is consistent with the database's supporting documentation [39].

In the specific case of WoS, we detected two distinct rankings. In the initial data collection exercise, the ranking of results followed the criteria described in the WoS documentation [40], that is, without applying citation counts and weighting the results according to the position and frequency of keywords. However, somewhat surprisingly, in a second data gathering process it became evident that the ranking on this occasion was, in essence, based on citations received. It would seem that these two distinct ranking systems were detected because WoS was running tests with a view to changing its algorithm and, as such, modified its ranking criteria to obtain a better understanding of user behavior.

Our findings improve the experimental foundations of ASEO and enable us to offer useful suggestions to authors as to how they might optimize the ranking of their research in the main academic information retrieval systems. Greater visibility implies a greater probability of a work being read and cited [61,63] and, thereby, of boosting an author's chances of improving their h-index [85]. Any information that allows us to identify the factors that intervene in relevance ranking is of great value, not so that we might manipulate the ranking results, something that is clearly undesirable, but rather so that we can take them into account when promoting the visibility of the academic production of an author or a research group.
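The h-index mentioned above is straightforward to compute: it is the largest h such that an author has h publications with at least h citations each. A minimal sketch, using invented citation counts for illustration:

```python
def h_index(citations):
    """Return the largest h such that the author has h papers
    with at least h citations each."""
    h = 0
    # Sort citation counts from highest to lowest and walk down the list.
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Hypothetical authors: one with evenly spread citations, one whose
# citations are concentrated in a single highly cited paper.
print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([25, 8, 5, 3, 3]))  # 3
```

The second example shows why visibility across an author's whole output matters: a single highly cited paper raises total citations but not the h-index.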

Other academic databases are emerging, including Dimensions and Lens, but they do not provide the same coverage as the two databases considered here. Nevertheless, we cannot rule out their being analyzed in future studies. Such studies would usefully undertake the simultaneous analysis of various factors, including, for example, citations received and keywords in a document's title, as discussed above. One limitation of the present study is precisely that a single factor is studied in isolation, when a ranking algorithm employs many factors simultaneously. It would be of particular interest to analyze whether such algorithms exploit interactions between several factors.

**Author Contributions:** Conceptualization, C.R.; Methodology, C.R.; Validation L.C., F.G-S. and C.L.; Investigation C.R., L.C. and F.G.-S.; Resources, C.R. and C.L.; Data Curation, C.R.; Writing—Original Draft Preparation, C.R.; Writing—Review and Editing, L.C., F.G.-S. and C.L.; Supervision L.C. and F.G.-S.

**Funding:** This research was funded by the project "Interactive storytelling and digital visibility in interactive documentary and structured journalism". RTI2018-095714-B-C21, ERDF and Ministry of Science, Innovation and Universities (Spain).

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Appendix A. List of Terms Used in The Searches**

#### **One-Term Searches**

Search words obtained from [78].



#### **Two-Term Searches**

Search words obtained from the above list and by selecting the search suggestions provided by Google Scholar and Microsoft Academic with the greatest number of results.


**Table A2.** Words and Rho of two-term searches. \*\* *p* < 0.01, \* *p* < 0.05.

#### **Appendix B. Data Files**

Rovira, C.; Codina, L.; Guerrero-Solé, F.; Lopezosa, C. Data set of the article: Ranking by relevance and citation counts, a comparative study: Google Scholar, Microsoft Academic, WoS and Scopus (Version 1) (Data set). Zenodo. Available online: http://doi.org/10.5281/zenodo.3381151 (accessed on 10 September 2019).

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **SEO inside Newsrooms: Reports from the Field**

#### **Dimitrios Giomelakis \*, Christina Karypidou and Andreas Veglis**

Media Informatics Lab, School of Journalism and Mass Communications, Aristotle University of Thessaloniki, 54625 Thessaloniki, Greece; ckarypid@jour.auth.gr (C.K.); veglis@jour.auth.gr (A.V.)

**\*** Correspondence: dgiomela@gmail.com or dgiomela@jour.auth.gr

Received: 13 November 2019; Accepted: 10 December 2019; Published: 13 December 2019

**Abstract:** The journalism profession has changed dramatically in the digital age, as the internet and new technologies in general have created new working conditions in the media environment. Concurrently, journalists and media professionals need to be aware of, and possess, a new set of skills connected to web technologies, as well as respond to new reading tendencies and information consumption habits. A number of studies have shown that search engines are an important source of traffic to news websites around the world, underlining the significance of high rankings in search results. Journalists write to be read, and that means ensuring that their news content is also found by search engines. In this context, this paper presents an exploratory study on the use of search engine optimization (SEO) in news websites. A series of semi-structured, in-depth interviews with professionals at four Greek media organizations uncovers trends and addresses issues such as how SEO policy is operationalized and applied inside newsrooms, which optimization practices are most common, and what impact SEO has on journalism and news content. Today, news publishers have embraced SEO practices, as this study also makes clear. However, the absence of a distinct SEO culture was evident in the newsrooms under study. Finally, according to the results, SEO strategy seems to depend on factors such as ownership and market orientation, editorial priorities and organizational structures.

**Keywords:** search engine optimization; SEO; search engines; search; online journalism; media websites; news content; news articles

#### **1. Introduction**

Major search engines today are considered to be one of the most trusted and common services to retrieve information from the internet and at the same time the main method used for navigation for hundreds of millions of users around the world [1,2]. In this context, recent studies have indicated that a significant percentage of users turn to search engines first when shopping online or when information gathering really matters [3,4]. As a result, online search remains one of the best traffic sources for any website [5–7]; however, it should be noted that the vast majority of all search traffic comes from the top rankings in search engine results [8,9].

Web technologies, new reading tendencies and new information consumption habits create new working conditions for both media organizations and journalists as they seek to improve online news websites and make them more readable. On the one hand, the abundance of news media websites requires news organizations to be on all major platforms, all the time [10]. On the other hand, many people look for specific information, and their priorities are convenience, rapid access and accuracy [11]. As internet technologies have brought about new modes of producing and consuming news content, many changes are observed in basic journalistic work processes, such as newsgathering, news production and distribution, and the way people consume news [12–17]. The journalistic profession is changing. Journalists and news media organizations are required to adapt to the new conditions, to become competitive and to respond to the needs of the market by making the most of their resources [14,16,18]. News websites, online radio and web TV are the main areas of action. It is generally observed that the arrival of digital technologies has made journalistic work both easier, enabling better monitoring of economic and political organizations, and more difficult, overwhelming journalists with more information than they can handle [19].

Search technology has evolved over recent years, using more complicated algorithms and incorporating information from Web 2.0 applications in order to provide better results [20]. As new technologies continue to develop rapidly, different news sources have also emerged, including search engines, online news aggregators, social networks and citizen journalism. A number of studies have shown that search engines are an important source of traffic to many news websites today, underlining the importance of high rankings in search results and creating a significant challenge for digital media outlets seeking to keep their news content at the top of the search rankings [21–27]. Journalists write to be read, and that means ensuring that news content is also found by search engines.

The current exploratory study is focused on the use of search engine optimization (SEO) in news websites. Specifically, it focuses on four Greek news websites with some of them being among the most recognized media outlets in Greece with high traffic volumes. The innovation of the study lies in the fact that it is one of the first research studies to investigate SEO practices in news websites, as well as in a journalism context with a different culture that has received little attention so far. Another strength pertains to the study design, including in-depth interviews with practitioners coming from four news publishers. Through a series of semi-structured interviews with SEO and media professionals, the study examines the familiarity of these news publishers with SEO practices, including common trends and practices inside their own newsrooms, and the perceived impact of SEO on journalism and news content.

#### **2. About SEO**

The practices designed to increase the visibility and traffic (visitors) that a website or webpage receives from organic (i.e., unpaid) search engine results are referred to as search engine optimization (SEO) [28–30]. SEO dates back to the creation of the first search engines in the early 1990s and has been associated with influencing search engine results ever since. In general, the position and frequency with which a site appears in the search engine results page (SERP) influence the number of visitors it will receive from the search engine's users.

SEO can be applied to many different websites and can target different types of search, including image and video search, local, news or academic search [31–33]. It is also very closely connected to e-commerce websites [34]. Moreover, SEO constitutes part of search engine marketing (SEM) and is one of the leading and most influential activities in the field of online marketing, covering the steps taken to organically grow a site's relevancy by building links, writing strong content or submitting to search sites [28,35]. SEO and SEM strategies should be carried out in order to attract customers and clients for business-to-consumer (B2C) companies [36]. In general, a business website can be found via a search engine by an online user in two ways: Through a pay-per-click (PPC) campaign or through an organic result listing that is based essentially on SEO. Malaga [37] divides SEO practices into four major categories: Keyword research, indexing, on-page optimization and off-page optimization.


Keyword research is the main SEO task and involves finding and analyzing the actual search terms and phrases people enter into search engines. This practice (usually aided by keyword suggestion tools such as Google's Keyword Planner) gives SEO professionals a better understanding of how high the demand for specific keywords is, as well as how hard it would be to compete for those terms in organic search engine results. Indexing is the process of attracting the search engine spiders to a website. All of the major search engines have a submission form where users can submit a website (by entering its URL) for consideration. On-page optimization includes the management of all factors associated directly with a website, such as keywords, appropriate content, internal link structure and HTML elements. It also covers the page title (HTML title tag), on-page headlines, the description of web pages (meta description tag) and URLs. Finally, off-page optimization includes all the actions taken away from the website, such as link building or a social signal strategy. Regarding link building, the more referrals a site has across the web, the more search engine spiders notice and categorize its content [38]. Social signals may also have a positive impact on websites and are considered the new link-building metric, as search engines increasingly look to social signals to help rank pages [39,40].
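The on-page elements listed above can be inspected programmatically. The following sketch, using only Python's standard library, extracts a page's title, meta description and on-page headlines from raw HTML; the sample markup is invented for illustration.

```python
from html.parser import HTMLParser

class OnPageSEOParser(HTMLParser):
    """Collects the on-page SEO elements discussed above: the page
    title, the meta description tag and the h1/h2 headlines."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.headlines = []
        self._current = None  # tag whose text we are currently reading

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self._current = tag
        elif tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == "description":
                self.meta_description = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data.strip()
        elif self._current in ("h1", "h2") and data.strip():
            self.headlines.append(data.strip())

# Invented sample page, loosely modelled on a news article.
sample = """<html><head>
<title>Election Results 2019: Live Updates</title>
<meta name="description" content="Live coverage of the 2019 election results.">
</head><body><h1>Election Results 2019</h1><h2>Latest counts</h2></body></html>"""

parser = OnPageSEOParser()
parser.feed(sample)
print(parser.title)
print(parser.meta_description)
print(parser.headlines)
```

A check like this could, for instance, flag articles published without a meta description or with a title too long for the SERP snippet.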

SEO has come a long way from its early days, and the search industry has seen many innovations, from artificial intelligence to voice search. The latter looks like a fast-rising trend in web search, considering that a significant percentage of searches on mobile devices are voice searches [41–44]. Today, recent algorithm changes in search engines, especially in Google, the world's most popular search engine, place more value on quality and content marketing, leading many experts to call these changes the new SEO [45]. As Ledford [29] notes, the search results are affected by the perceived quality of the page (indicated by a quality score) in accordance with the algorithm used, which includes a number of factors such as location, frequency of keywords, links or clickthrough rates.

#### **3. SEO and News Websites**

As people's reading habits change due to web technologies, online journalism finds itself having to chase web traffic [46]. Nowadays, there is no doubt that the internet is the main source of news preferred by many readers in order to get informed [47]. Search engines are used as a basic tool of navigation and filter for news by many people, as internet traffic depends to a great extent on them [25,48]. According to reports from the Reuters Institute for the Study of Journalism, search remains a significant gateway to the news in many countries, such as Poland, Turkey, Germany, France, Italy, the United States and Brazil [26,27]. As a result, the survival of a website is related to its visibility through web search [49].

Although search engine optimization appeared almost in parallel with the creation of the first search engines [50], SEO practices were only adopted by newsrooms within the last few years [32,51]. Many leading online media outlets (e.g., Daily Mail, Guardian, Los Angeles Times, Daily Telegraph) have employed SEO specialists in an attempt to win greater visibility and position their stories at the top of the search rankings. At the British Broadcasting Corporation (BBC), for example, journalists are trained in basic SEO [52–55]. Another important change concerns the implementation of a dual-headline system that is still in use today: A short headline for the front page and other website indexes, and a longer SEO title with more characters and keywords that appears on the story page itself and in search engine results [52,53]. Other noteworthy examples include the Los Angeles Times, as well as the Christian Science Monitor, where the incorporation of an SEO strategy or an SEO manager proved to be a key factor leading to increased traffic [54,56].

The presence of SEO strategies is considered an emerging production norm and practice that directly impacts journalistic workflow and creates new challenges for media professionals. Given the increase in online news and the dependence of media outlets on digital platforms (especially on Google), news publishers constantly try to find techniques to improve the prominence and visibility of their stories in search engines and other news aggregators [6,32,51,57,58]. Today, the digital success of news organizations depends, among other things, on operational changes in the processing and distribution of news. In this context, news publishers need to consider search engine positioning strategies and implement SEO actions in their newsrooms [6,32]. The application of SEO practices to digital media outlets can be divided into three broad categories: On-page, off-page and technical SEO. By investing in the appropriate systems and training, news organizations need to alter their content to attract the interest of the bots and thus improve the exposure of their news stories in search engines [54]. Regarding on-page SEO, many news publishers around the world today create news rich in keywords: They create SEO-friendly titles, use metadata, include relevant keywords in the initial paragraphs (synonyms, plural variations or other forms) and use multimedia content in the form of videos, photographs, podcasts, etc. [6,32]. In contrast, off-page SEO refers to all the actions carried out off the web page, such as obtaining as many quality incoming links as possible or disseminating the news content on social platforms. Finally, technical SEO may include actions such as the use of special formats for the mobile web, good information architecture or an increase in website speed [32,57,58].

Today, SEO is considered among the key skills that a modern journalist must possess. However, the development of online journalism is accompanied by a significant dependence of news publishers on the technology firms that perform the function of infomediation (a mix of editing, aggregation and distribution of third-party content that connects information supply with information demand). This function has changed news production practices and is also creating conflicts between journalistic values and norms and new digital practices [6]. The entrance of SEO strategies into news organizations was, and sometimes still is, criticized by those who believe that they downgrade journalistic work. This belief derives from the observation that journalists change their news agenda and the way they write, producing content written mainly for machines. According to Giomelakis and Veglis [6], the aim of SEO is not better or more diverse journalism; however, journalistic work can benefit if SEO is implemented consciously and wisely. SEO does not require media professionals to dumb down their content or to write with the sole purpose of achieving better rankings. Journalists are still writing to be read, and these practices can help their articles to be found [6,59]. Quality content produced by professionals is still necessary, and thus the role of the journalist is more important than SEO. During the writing process, a web editor must feel creative, combining SEO with quality content production. Given the wealth of news websites and the rapid dissemination of information, the main goal is for a story to be found by readers, including through search engines and news aggregators [59]. This means that journalists, and media organizations in general, must adapt to the new circumstances. As the world of SEO continues to evolve, SEO workers in every business must stay up to date on new developments and make use of the tools and services available for their work.

#### **4. Methodology**

This study focuses on the application of SEO practices inside newsrooms and news media outlets. The paper draws on data derived from in-depth interviews with practitioners from four news publishers in Greece, some of which are among the country's most recognized media outlets with high traffic volumes. The only prerequisite was that the respondents held the position of SEO manager or were responsible for SEO strategy in their newsroom. The sample of four interviewees included an SEO manager, two general directors and one business owner. The main goal of the study was to examine different types of media organizations with different characteristics. Thus, the sample (see Table 1) included some long-established online news publishers along with a newer media outlet (with online presences ranging from 3 to 12 years), both nationwide and local media, and, finally, online-only media organizations as well as outlets co-published with print editions.


**Table 1.** Greek media outlets under study.

\* Data are based on information from the Alexa ranking system that includes top sites from all categories in Greece, not only news websites (foreign websites are also included).

Following the media representatives' initial acceptance, the interview questions and a link to an online, semi-structured questionnaire were sent electronically. Semi-structured interviews were used to allow an in-depth exploration of participants' responses and to give them space to express their experiences and describe the trends in their newsrooms. A standard set of questions was covered, allowing flexibility for follow-up questions and for exploring other issues of relevance to participants. The interviews and data collection took place over a two-month period (June and July 2019). The research, along with the development of the questionnaire, was based on the literature review [6,32,51,53,57–60] and was adapted to the context of the Greek media landscape. The interviews included a mixture of open-ended and closed-ended questions designed to answer the research questions (RQ) in the light of what previous, recent studies have found. It should be noted that the majority of the questions were open-ended, allowing the respondents to answer in their own words in open text format to better capture their complete knowledge, feelings and understanding of the topics. Additionally, the small number of closed-ended questions (including questions with a five-level Likert scale) focused on some general characteristics of every media outlet (e.g., job position of respondents, type of media outlet) and questions where respondents were asked to give their personal opinion. Regarding the latter, the respondents had the opportunity to justify their answers by providing more details. Closed-ended questions were chosen because they were easier and quicker for respondents to answer, decreasing the likelihood of irrelevant or confused answers. It was also taken into account that if respondents were struggling to understand particular questions, they could read the answer options for further context (e.g., names of common SEO tools). In this case, an 'Other' answer option was added if a respondent wanted to provide a unique answer. Where necessary, a follow-up via telephone was carried out to clarify the answers or discuss other issues of relevance. Apart from some general questions about the media outlets under study, the questions covered many areas, including how SEO policy is operationalized and applied inside newsrooms, how long these policies have been in place and the impact on journalism and news content. The main questions asked during the interviews are shown in Table 2, according to their type (open/closed-ended).


**Table 2.** Main questions under study. SEO, search engine optimization.

Prior to the main research, a pilot study was conducted to identify any problems or ambiguities in the questions. A number of improvements were made to the initial questionnaire, mainly concerning readability and usability. An introductory text informed the respondents about the use of the information obtained and the voluntary nature of the research. Thematic analysis (TA) using Braun and Clarke's six-phase framework [61] was used to make sense of the data. The analysis focused on examining patterns within the qualitative data that are important or interesting. Specifically, it was used to explore questions about respondents' experiences, perspectives and practices, and about the factors that influence and shape particular phenomena. The stages followed in the process of analysis were:

1. Familiarizing with data

The researchers read the data repeatedly in order to become familiar with them. The follow-up interviews conducted via telephone were also transcribed. In parallel, thoughts and ideas for coding and meanings were noted, generating an initial list of interesting segments across the data set.

2. Assign preliminary codes to the data

Interesting elements of the data were identified (coding manually) so that they could be assessed in a meaningful way. Important ideas and codes providing a description of respondents' experiences were identified and written down in the form of a list.

3. Searching for repeated patterns and themes

When all interviews were coded separately, the different codes were sorted into potential themes. Also, the researchers considered how different codes may combine to form a primary theme.

4. Reviewing themes

Themes were reviewed and refined, and coherent patterns were formed in the context of the data set. The researchers re-read the data several times and moved backwards and forwards between raw data, codes and themes until they felt confident that different codes could be collated together to form a theme.

5. Defining and naming themes

The scope and content of each theme and sub-theme were clearly defined. The researchers considered the themes themselves, and each theme in relation to the others. Different dimensions of the same pattern or phenomenon were identified at this stage.

6. Producing the report

Final analysis of repeated patterns (themes)—connection with research questions and relevant literature.

Finally, the thematic analysis identified five major themes:


#### **5. Research Questions**

Following the previous discussion and guided by the key concepts of field theory, this study attempts to answer the following research questions:

**RQ1:** Do the media representatives know and thus, utilize SEO practices in their working organization, and what factors may affect their use according to their point of view?

**RQ2:** Is there an individual SEO job position inside the newsrooms under study, and which are the most common SEO tools and practices?

**RQ3:** Do media outlets under study monitor their web traffic and what metrics do they focus on?

**RQ4:** What is the impact of SEO on news content creation inside these newsrooms?

**RQ5:** According to respondents' point of view, what is the impact of SEO on the journalism profession, in general, regarding the publication/selection of news, as well as the relationship with the audience?

#### **6. Results and Discussion**

#### *6.1. Awareness and Usage of SEO (RQ1)*

The respondents were questioned about their experience in using SEO practices, with the answers initially falling into five categories ((1) Poor, (2) Fair, (3) Good, (4) Very Good, (5) Excellent). Subsequently, the respondents had the opportunity to justify their answers by providing more details through open-ended questions (e.g., the number of years they have dealt with SEO). Generally, the sample reported a high degree of awareness of SEO practices. The majority of the respondents had excellent knowledge of them, and one had very good knowledge. It is very interesting that all of them have dealt with SEO for many years in their working organization: Two had ten years of experience, one had eight years and the other had five years. This result seems reasonable given that many studies have shown that search engines are an important source of traffic to many news websites around the world, underlining the importance of high positions in search results [21–27]. Moreover, newsrooms are increasingly being asked to cater to audience interests in order to generate more traffic and "clicks" [62], and in this context many Greek media professionals have begun to place more emphasis on what audiences want to know [63].

Regarding the factors that may influence SEO usage, all the respondents agreed that the use of SEO practices is more prevalent in large news organizations, which try to ensure high traffic numbers, than in smaller media outlets. Moreover, they reported that ownership and market orientation (i.e., public or private), as well as the type of media organization (print, radio, television or online), are highly related to how search engine optimization is used in newsrooms. For example, private news organizations seem to devote more of their resources to SEO strategy than public ones do. Its use might also vary depending on whether media organizations are online-only or coexist with other distribution channels (e.g., print/broadcast). All of the above is in agreement with prior studies in which SEO strategy was found to vary among newsrooms and in which factors such as market orientation or the ownership/business model were shown to play an important role in how new technologies are incorporated into the journalism profession [6,53,64–67]. In short, it may be concluded that news organizations tend to develop distinct forms of SEO use aligned with their organizational imperatives, structures and business models, as well as their editorial priorities.

#### *6.2. SEO Job Position and Most Common SEO Practices*/*Tools (RQ2)*

According to the results of this research, it is noticeable that there were no specific SEO specialist positions in any of the news websites examined. The respondents cited different reasons for this, each for the website in which they work: It may be a decision of the ownership, there may be financial reasons, or someone else may cover the work. Since there were no SEO specialists working for the news websites, the respondents answered that SEO is the duty of the director, the chief editor, the journalists or a freelancer. These results contrast with those of many leading media outlets across the US and in other European newsrooms (e.g., Daily Mail, Guardian, BBC, Los Angeles Times), which have created SEO specialist/editor positions within their newsrooms in recent years [52–55]. In this context, the incorporation of an SEO chief and an SEO strategy proved to be a key factor leading to increased traffic at news publishers such as the Los Angeles Times and the Christian Science Monitor [54,56].

Apart from the SEO job position, the respondents were also asked to indicate the most common search engine optimization practices used in their newsrooms. According to the answers received from the media representatives, the most common practices were keyword research, as well as research into hot topics, the top search queries and users' preferences. Correspondingly, the most popular tools were Google Keyword Planner (for keyword research) and Google Trends (which analyzes the interest in and popularity of a search term and tracks online "buzz"). Alexa.com services and Google Search Console were less popular among the newsrooms. Furthermore, half of the respondents answered that a backlink checker for monitoring inbound links, as well as software solutions for general SEO analysis, are often used in their newsrooms.

Regarding off-page SEO and link-building strategies, all the media outlets under study tend to share their news content on social media after publishing it. All the newsrooms share their content on Facebook, three of them on Twitter and one on Instagram (all three being among the most popular social networks in Greece), in an effort to increase the shelf life and distributed reach of quality content. The newsrooms seem to realize the significance of social media platforms in a changing news media landscape, having an active presence on Facebook, Twitter and YouTube (all of them), as well as on Instagram (three out of four). In Greece in particular, the media market is characterized by high use of social media, since Greek people tend to read the news and to participate by commenting on or sharing content [26]. In this context, social media are considered a great source of user opinions whose structure can offer useful information for the polarity classification task [68]. While search engines are becoming increasingly sophisticated at interpreting web content, social signals and links from social media have already started to play a role in SEO. The extent of social impact on search is still evolving, which is characteristic of the SEO industry [39,50,69,70]. Today, an active presence on social media is considered a requirement for a news organization, and this may affect or have an indirect impact on SEO. For example, the more Facebook page likes or Twitter followers a media outlet has, the more social signals, such as likes or tweets, its posts will generate. Video content (e.g., content on YouTube) is also often ranked higher in search engines.

Finally, it is worth mentioning that two out of four representatives admitted during the interviews to the practice of link exchange in cooperation with other websites, especially those with relevant content. This finding is in line with prior work [6], which also found that many Greek media outlets exploit this practice, establishing contact with other sites in order to share links, boost rankings and cover a wider range of topics. Reciprocal linking between two websites through their content can help build higher-quality inbound links and increase traffic. It is considered a common tactic for media websites, especially when a lack of staff means that a number of thematic categories cannot be covered properly.

#### *6.3. Online Traffic Reports and Main Traits (RQ3)*

Initially, all the respondents reported using web analytics to monitor website traffic in their company, and all of them mentioned Google Analytics. These results were expected, given that Google Analytics is deemed the most popular web analytics software globally and a leading tool for sales and marketing purposes [71,72]. Moreover, the high degree of web analytics use is unsurprising given the growing importance of internet metrics in the journalism profession in recent years across different types of newsrooms [67,73]. Based on the interview data, in three of the four newsrooms under study the traffic coming from search engines was around a third or more (30–40%), while in the remaining outlet it was between 10% and 20%. These results are in consonance with previous studies showing that a large percentage of readers get their news through search engines [21–23,26–28] and confirm that, for news sites, search remains crucially important. The respondents were also asked for information about the use of web analytics, such as who is in charge of tracking the reports and how frequently. In two media outlets the traffic reports are checked daily; one respondent reported checking many times a day, while another reported monitoring on a weekly basis. Chief editors have access to these tools in all the newsrooms under study, while in the majority of media outlets the traffic reports are also monitored by journalists and the marketing department. The IT departments, as well as external partners specializing in SEO, have access in two of the four newsrooms.

Regarding the most popular metrics, newsrooms were more interested in web analytics metrics that report data for the overall website content and less interested in data about specific sections of their website or specific articles. The interest of newsrooms in online traffic metrics (covering a variety of indicators) was measured on a 1–5 scale, with 1 signifying "not at all interested" and 5 "highly interested". The respondents then had the opportunity to justify their answers in more detail through open-ended questions. According to the results, newsrooms use web analytics to obtain general website/content metrics. For example, they seem to be very interested in overall website traffic data, such as page views, sessions and unique visitors, as well as the type of content users prefer to read (popular articles). Additionally, they showed significant interest in visitor behavior metrics, such as bounce/exit rate, average time spent on the website and new/returning users. Finally, the respondents seemed to appreciate the data on traffic sources and channels (e.g., search engines, social media, etc.), as well as the search terms that led users to a website. On the other hand, newsrooms seem to be less interested in the demographic data of users (e.g., language, country, city, age, sex) and in technical data, such as the most used browser (e.g., Chrome, Firefox, etc.), operating system (e.g., Windows, Mac OS), screen resolution, devices, etc. They also showed little interest in the number of comments on their articles and in social media sharing (e.g., Facebook, Twitter, etc.).
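Several of the behavioral metrics mentioned above have simple operational definitions. As a rough illustration (not tied to any particular analytics product, and using the common convention that a "bounce" is a single-page session), they can be computed from raw session records as follows:

```python
from dataclasses import dataclass

@dataclass
class Session:
    pages_viewed: int
    duration_seconds: float
    returning: bool

def bounce_rate(sessions):
    """Share of single-page sessions (the usual 'bounce' definition)."""
    bounces = sum(1 for s in sessions if s.pages_viewed == 1)
    return bounces / len(sessions)

def avg_duration(sessions):
    """Average time spent on the site per session, in seconds."""
    return sum(s.duration_seconds for s in sessions) / len(sessions)

def returning_share(sessions):
    """Share of sessions from returning (not new) visitors."""
    return sum(1 for s in sessions if s.returning) / len(sessions)

sessions = [
    Session(1, 12.0, False),   # bounce, new visitor
    Session(4, 310.0, True),
    Session(2, 95.0, False),
    Session(1, 5.0, True),     # bounce, returning visitor
]
print(f"Bounce rate: {bounce_rate(sessions):.0%}")            # 50%
print(f"Avg. time on site: {avg_duration(sessions):.1f} s")   # 105.5 s
print(f"Returning visitors: {returning_share(sessions):.0%}") # 50%
```

Commercial analytics suites compute these figures over much richer event streams, but the underlying definitions are of this kind.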

#### *6.4. SEO and News Content Creation (RQ4)*

The current study examined real data and certain aspects of everyday routines in the respondents' media outlets. In general, the respondents reported that SEO affects news stories; more specifically, it may have a significant impact on the editing of articles and on news language. Based on the results, SEO practice has a direct impact on journalistic workflow and on the creation of news content, which incorporates techniques designed to ensure a high ranking on search engine results pages.

The most popular SEO practices that media outlets implemented in news content creation were the use of keywords in SEO-friendly titles, the use of the meta description tag to summarize a web page's content, and the use of internal links pointing to other pages on the same website (see Figure 1). Other widely used practices were the presence of keywords in the main text of an article, keyword tags, image optimization (using keywords and short phrases in file names or alternative texts/tags), external links and the use of multimedia content. Less common practices (only two out of four media outlets) were the use of different titles and attention to the title character limit.

**Figure 1.** Most common SEO practices in news content creation.

The above results are unsurprising. Titles are considered one of the most important on-page SEO elements, with major search engines paying close attention to them. The use of the meta description tag is also good practice and highly recommended, because it is commonly used by Google as a snippet/preview of a web page on SERPs [28,29]. The addition of relevant video content adds value, makes content richer, and is often ranked higher by search engines [45]. Video content attracts and engages users more, since it is deemed highly shareable and has a higher click-through rate compared to traditional text results [50]. In addition, the optimization of images helps search engines determine more easily what an image is about, and it is considered very important, especially for image-based search engines such as Google Images [19]. In contrast with our results, the practice of using different titles is common among several large news organizations (e.g., the New York Times, BBC News, the Huffington Post or the Guardian), most often as different headlines between the front page and the story page itself. The latter, which appear in search engine results, are usually more specific and include more keywords [6,32,52,53]. Finally, links that lead to other relevant web pages (preferably high-ranking sites) can contribute positively to SEO, especially if a site is new [6,74].
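Title and meta description lengths are a case in point: search engines truncate snippet text that exceeds the available display width. A minimal check of this kind can be sketched as below; note that the 60- and 155-character thresholds are common industry rules of thumb, not documented limits of any search engine.

```python
# Rule-of-thumb display limits (assumed heuristics, not official caps).
TITLE_MAX = 60
META_DESCRIPTION_MAX = 155

def check_snippet(title: str, meta_description: str) -> list[str]:
    """Return warnings for on-page elements likely to be truncated on a SERP."""
    warnings = []
    if len(title) > TITLE_MAX:
        warnings.append(f"title is {len(title)} chars (> {TITLE_MAX})")
    if len(meta_description) > META_DESCRIPTION_MAX:
        warnings.append(
            f"meta description is {len(meta_description)} chars "
            f"(> {META_DESCRIPTION_MAX})"
        )
    return warnings

print(check_snippet("Short headline", "A concise summary of the article."))
# → []
```

A check like this could run in a CMS before publication, which is essentially what "estimation of the title character limit" amounts to in the newsrooms that reported it.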

#### *6.5. The General Impact of SEO on Journalism (RQ5)*

Other topics addressed by this research included the impact of SEO practices on issues related to both the publication and selection of news, and whether these practices improve relations with the public and with the journalism profession in general. According to the respondents, the position of an article on the site and the period of time it will remain online are connected with SEO: their responses ranged from "quite a lot" to "very much", while one respondent indicated that there is no connection at all. Regarding multimedia content, responses diverged: two respondents said that its presence affects SEO practices a little, and two said a lot. According to the answers, the presence and sharing of articles on social media are also connected with SEO.

Moreover, three out of four respondents indicated that the use of SEO practices within newsrooms could greatly improve their relationship with the audience, while only one believed there is no connection between them. The respondents were also asked whether SEO, in general, affects the way journalists choose the news stories that will be published. All of them answered that SEO considerably affects the news agenda and the news content chosen by their media outlet. From the same perspective, SEO techniques seem to favor news stories covering topical subjects or breaking news, while making the promotion of features or opinion pieces more difficult [59]. However, respondents also reported that an SEO strategy would have a greater impact on others (third person) than on themselves (first person) as regards the selection of news content. From a social perspective, this echoes studies of Davison's [75] third-person effect hypothesis.

Finally, all the respondents considered SEO strategy an essential tool for editors, bloggers, media professionals and, in general, anyone who publishes newsworthy content. They appreciated the usefulness of an SEO strategy and believed that it could only benefit journalistic work by helping news articles to be found. These results are consistent with previous studies in which journalists and media owners in Greece were found to adapt to technological progress, considering it useful for their profession [15,76]. Remarkably, opinions differed on how the journalistic product is affected by SEO practices. Only one of the respondents thought these practices do not diminish journalistic quality; of the others, one thought that SEO affects it a little, one quite a lot and one very much, in agreement with media professionals who speak of a negative impact on the craft of journalism and the creation of stories written exclusively for search engines.

#### **7. Conclusions and Future Extensions**

This paper examined the use of search engine optimization practices in news websites and journalism using a series of semi-structured, in-depth interviews with professionals at four Greek media organizations. The aim of this study was to identify how familiar news publishers are with SEO practices, how SEO policy is applied inside newsrooms, the most common trends and practices, and the impact on news content. Nowadays, newsrooms and media outlets have embraced SEO practices and use them to make their content more easily discoverable through search results, as this study also makes clear. The study likewise noted specific optimization practices that are often used by news websites, such as keyword research, research on hot topics and top search queries, and dissemination via social media. All four newsrooms under study incorporate several techniques designed to ensure a high ranking in search results, such as the use of keywords and SEO-friendly titles, meta description tags, internal/external links, image optimization and multimedia content. Furthermore, search traffic measurement tools and Google Analytics services were used across all the news organizations studied. According to the respondents' viewpoints, SEO strategy varies among newsrooms, depending on factors such as ownership model, editorial priorities and organizational structure. Finally, SEO practices may have a considerable impact on the way journalists and media professionals choose news stories, as well as on their publishing practices (e.g., the position of an article on the site or the amount of time it will remain online). In the context of the above findings, it should be noted that none of the news publishers in the sample seem to have a clear structure for using SEO practices; they have adopted a more rudimentary approach, utilizing various popular off-the-shelf tools. The absence of a distinct SEO team or SEO chief is evident in the studied newsrooms, where journalists or media professionals with many other responsibilities often deal with these practices.

In a constantly changing and competitive media environment, no one can take their readership for granted. As the media industry adapts to the digital age and competition increases, the effective use of SEO within newsrooms seems to be an important element in attracting more online readers. SEO is not only about visibility on search engines; it also involves making a website more user-friendly. As Richmond notes [59], everything comes down to editorial choices. SEO per se is value-neutral, and it does not require journalists to dumb down or write solely to gain traffic. SEO practices reflect the essential need of web users to find information, while also securing long-term promotion for journalism. Content is still the most important element of any website, and it must be characterized by reliability, interest and quality [6]. In this way, readers will be satisfied and motivated to share it through social networks, blog posts or forums, which is exactly what search engines look for. Under these conditions, a journalist must be creative but also strategic, combining SEO with quality content production.

This study is not without limitations. First, as an exploratory study, the sample included only a small number of Greek newsrooms, and thus it does not claim to be representative of the entire population of newsrooms and media outlets; a larger sample could reasonably be expected to yield more robust results. Moreover, our results depend on the accuracy and honesty of the respondents' answers. Nevertheless, the main strength of the study is that it explored the role of SEO practices in media websites within a journalistic culture (Greek newsrooms) that has received little attention so far. Even though SEO is widely used by marketing practitioners, there is relatively little academic research that systematically attempts to capture this phenomenon and its impact on various industries, and to the best of our knowledge, academic research examining the relationship between SEO and journalism is scarce. We believe that this study provides useful insights concerning the use of SEO inside newsrooms and that it will open the door to further longitudinal analysis. Future extensions of this work will include repeating the study with a larger sample size and a more varied selection of newsrooms. In this context, comparative studies with foreign online media would be another goal in the near future.

**Author Contributions:** Conceptualization, D.G. and C.K.; methodology, D.G.; validation, D.G.; formal analysis, D.G.; investigation, D.G. and C.K.; resources, D.G. and C.K.; data curation, D.G.; writing—original draft preparation, D.G. and C.K.; writing—review and editing, D.G., C.K. and A.V.; visualization, D.G.; supervision, D.G. and A.V.; project administration, D.G.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to thank interviewees for their time, as well as for sharing their insights and a valuable overview of the SEO strategy in their news organizations.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
