25 jun 2014

Are Latin-American repositories invisible on Google and Google Scholar?

In this issue, not without some embarrassment, we digest a contribution of our own. The main objective of this study is to ascertain the presence and visibility of Latin American repositories in Google and Google Scholar through the application of page count and visibility indicators. For a sample of 127 repositories, the results indicate that the indexing ratio is low in Google and virtually non-existent in Google Scholar. They also indicate a complete lack of correspondence between the repository records and the data produced by these two search tools. These results are mainly attributable to limitations arising from the use of description schemas that are incompatible with Google Scholar (repository design) and the reliability of web indicators (search engines). We conclude that neither Google nor Google Scholar accurately represents the actual size of open access content published by Latin American repositories; this may indicate a non-indexed, hidden side to OA, which could be limiting the dissemination and consumption of open access scholarly literature.


§  How many documents from Latin American institutional repositories are indexed by Google and Google Scholar?
§  What is the web impact of content published in Latin American institutional repositories?
§  Is there a correlation between page count and web visibility?

Unit of analysis

Latin American institutional repositories listed in the Ranking Web of Repositories (July 2013 edition)
127 Latin American institutional repositories
§ The size of the repositories in number of items hosted is obtained from the information provided by the platform itself.
§ The total number of items listed in Google and Google Scholar is calculated with the “site” command (<site:domain.com>; <site:domain.com filetype:pdf>).
§  Mention values were obtained from the search engine Open Site Explorer. This retrieves both the number of external links for each repository (measured at the aggregate domain level, i.e., all external links from the same domain are counted only once), and the MzRank indicator at subdomain level, which provides an estimated value for the popularity of the websites analysed.
§  Additionally, the number of mentions for each URL was calculated from Google, which gave an estimated indicator of the number of external links (<"domain.com" -site:domain.com -inurl:domain.com>).
§ A correlation analysis was conducted for all indicators (given the unequal distribution of web data, the Spearman correlation coefficient was applied) as was a principal component analysis (PCA).
§  Number of documents hosted by the repository (ITE)
§  Number of files indexed in Google (Gtot)
§  Number of PDF files indexed in Google (Gpdf)
§  Number of files indexed in Google Scholar (GStot)
§  Number of PDF files indexed in Google Scholar (GSpdf)
§  Number of times the URL is mentioned (URL)
§  Number of external links grouped by domain (V)
§  Link popularity score (0 to 10) (Mz)
Period analyzed:  All
Data collection date: October 2013
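
The query patterns listed above can be assembled programmatically for each repository domain. The following is a minimal sketch only: the operator strings follow the patterns quoted in the method description, while the function name `build_queries` and the dictionary keys are hypothetical and not part of the original study.

```python
def build_queries(domain: str) -> dict[str, str]:
    """Return the three search strings used for one repository domain."""
    return {
        # Total page count (used for Gtot and GStot)
        "total": f"site:{domain}",
        # PDF page count (used for Gpdf and GSpdf)
        "pdf": f"site:{domain} filetype:pdf",
        # URL mentions, excluding pages hosted on the domain itself
        "mentions": f'"{domain}" -site:{domain} -inurl:{domain}',
    }


# Example with one of the repository domains named in the text
for name, query in build_queries("tesis.usp.br").items():
    print(f"{name}: {query}")
```

Each string would then be submitted manually or through whatever interface the search engine allows; the sketch deliberately stops at building the query text.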

Errors in the functionality of search engines
Page count values for the repository are lower than those shown for the search engines (these errors vary according to the source).
·  In the case of Google, 109 URLs whose size is greater than the number of items were located. For PDF values, the number of URLs with this error is lower at 47, which indicates that this query is more accurate than that for overall size. It therefore seems clear that the search engine is retrieving not only items from the repository but also other files hosted on the domain (including those pertaining to the application used to manage the repository).
·    In the case of Google Scholar, there are even fewer errors. Total page count yields 11 URLs with page count values greater than those for the repositories, while for PDF files there are only three. In this case, the errors are directly related to errors in the indexing of resources, but they are practically non-existent and are, in any case, detectable and easily controlled.
Google and Google Scholar Coverage
·    If we restrict the coverage analysis to documents in PDF format, coverage is low in Google (48.3%) and virtually non-existent in Google Scholar (2.5%) (Figure 1).

Figure 1. Percentage of documents (PDF files) from 127 Latin American repositories indexed in Google and Google Scholar
Data source: re-elaborated from Orduña-Malea & Delgado López-Cózar (in press)

·       If the search is extended to all file types indexed on Google and Google Scholar, the results indicate that Google indexes 100% of the documents and Google Scholar only 34.2%.
 Impact of the repositories: mention indicators
·   URL mentions: the values obtained are exceptionally high, especially for <tesis.usp.br> (5,380,000 hits). Although search engines round up these values, it is evident that extra noise is high, despite using the <-inurl> command to exclude certain types of spam. Even so, we detected some exceptions in some URLs, which, despite having high page count values (for items both in the repository and indexed by Google), made hardly any impact in URL mentions.
·     Referring domains: the achieved impact was very low: only 4 URLs achieved more than 100 domains linking in, while 21 did not return any result. These data correspond to the MzRank values (which depend directly on the quantity and quality of inbound external links on the analysed websites). In this case, no URL scored more than 5 points (the maximum is 10). Moreover, 23 URLs obtained a “0” value.
Correlation between page count and impact
·    The number of items retrieved directly from the platform (ITE) correlated significantly with various page count indicators, especially with PDF file page count in Google (r=.75) and total page count in Scholar (r=.68). However, a very low correlation was obtained with PDF page count in Google Scholar (r=.31), precisely the indicator that should have been the most accurate in capturing the number of articles deposited in an institutional repository; it returned very low indexing ratios.
·    With regard to the correlation of ITE with mention indicators, unexpectedly significant results were achieved with the number of URL mentions (r=.63), which suggests that, despite the document noise of this indicator, the results do have a certain value.
·  Finally, almost no correlation was observed between ITE and indicators related to hyperlinks, both for the number of referring domains (r=.26) and for MzRank (r=.22).
·     The PCA clearly shows the separation between performance in page count and visibility, and how the URL mention indicator seems closer to the page count than to the visibility indicators, when by their nature the opposite should be true.
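
The correlations reported above are Spearman coefficients, that is, Pearson correlations computed on ranks rather than raw values, which makes them robust to the skewed distribution of web data. The sketch below shows the calculation in plain Python. The two input lists are hypothetical values invented for illustration; they are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical indicator values for five repositories (illustration only)
ite = [120, 4500, 980, 15000, 60]    # items hosted
gpdf = [90, 1000, 2100, 7000, 10]    # PDF files indexed in Google
print(round(spearman(ite, gpdf), 2))
```

Because only the ordering of the repositories matters, a single outlier with millions of hits changes the coefficient no more than any other top-ranked value would.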


The complete lack of correspondence between the repository records and the data provided by Google and Google Scholar should be noted. Equally striking are the marked discrepancies between the search engines themselves: they only coincide in their extremely low indexing values for PDF documents.

This raises a preliminary question about the reliability and validity of the data search and recovery process (“site” command), the technical indexing mechanisms of the robots used by Google and Google Scholar and/or the deficient web architecture of the repositories themselves, which could well be the cause that lies behind the other aspects. Similarly, the design of the database of some of the repositories may prevent the accurate retrieval of indicators by search engine bots (a concept known as the invisible internet), although the development of applications such as DSpace (widely used in the installation of this study’s sample repositories) has eliminated this problem.

With regard to Google (which should, in principle, index everything to achieve its goal of making the world’s information universally accessible), the inordinately high page count data (well above real values) must be due to the counting of files that are not specifically items of the collection studied, i.e., files pertaining to the software itself or other information hosted by the server being analysed (easily verifiable by manually browsing through the results returned for the “site” query in the search engine).

Regarding the number of PDF documents, although exact figures for this document type in the repositories under study are not known, such a low indexing ratio is very strange. The pervasive use of the PDF format is an irrefutable fact in academia (Aguillo 2009), and it is very odd that academic repositories such as those studied here, which often contain scholarly output (theses, articles, reports) and other academic documents (course syllabi, teaching materials), have such a low percentage, save a few notable exceptions (<lume.ufrgs.br>, <repositorio.ufsc.br>). It is therefore plausible to conclude that Google underrepresents the scientific and academic content of the repositories.

In contrast to Google, the total number of documents indexed in Google Scholar is well below what was expected. The low item indexing ratios in Google Scholar (whose database is not the same as Google’s) are consistent with those obtained previously by Arlitsch and O’Brien (2012), who detected low indexing ratios in Google Scholar for repository articles in the United States: only 30% of the documents stored in the 21 repositories that formed their sample were indexed in Google Scholar. Using the same methodology (the query "site:repositoryURL") in this study, the indexing rate was only 34.2%. The indexing ratio we found in June 2014 for the documents in the World Bank's Open Knowledge Repository was lower still (17.1%) (Martín-Martín et al., 2014).

In any case, there are several reasons that may explain why the overall data should be viewed with some caution. Aaron Tay has sharply summarized them in a recent post on his blog entitled 8 surprising things I learnt about Google Scholar.

First, because the “site” operator does not return all the items that Google Scholar has indexed for a repository (special caution should be taken with URLs where the suffix PDF does not appear explicitly), which means it is not exhaustive. Second, because the system of grouping multiple versions of an article operates in such a way that one version is taken as the “primary” version. This process is done automatically, although authors may also manually select which is the main version of the article. The “site” command theoretically only returns data for the main version (though this does not always happen). This means that if an article is hosted on different platforms (e.g. journal and repository), and if the primary version is the one published in the journal, the “site” operator applied to the repository will not count the item, and vice versa, although it is indexed on both platforms.
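
The effect of primary versions on “site” counts can be illustrated with a toy model. This is a sketch under a strong simplifying assumption, namely that a “site” query counts an article only when its primary version lives on the queried host; the text notes that real behaviour is not always this strict. The class `Version`, both functions and the example hosts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Version:
    host: str      # domain where this copy of the article lives
    primary: bool  # True if this copy is treated as the primary version


def site_count(articles, host):
    """Articles a site: query would report, assuming only primary versions count."""
    return sum(
        any(v.host == host and v.primary for v in versions)
        for versions in articles
    )


def indexed_count(articles, host):
    """Articles that have any indexed copy on the host, primary or not."""
    return sum(any(v.host == host for v in versions) for versions in articles)


# Three articles: one lives only in the repository, two also appear in a journal
articles = [
    [Version("repo.example.edu", True)],
    [Version("journal.example.com", True), Version("repo.example.edu", False)],
    [Version("journal.example.com", True), Version("repo.example.edu", False)],
]

print(site_count(articles, "repo.example.edu"))     # reported by the site: query
print(indexed_count(articles, "repo.example.edu"))  # actually indexed copies
```

In this toy case the repository holds indexed copies of all three articles, yet the “site” count reports only one of them.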

This explains why the indexing rate of the World Bank's Open Knowledge Repository is much lower than that of the Latin American repositories. That repository contains a large number of articles published in journals (over 10%) and related documents, whose versions appear subsumed under other URLs.

Whenever the search strategy consists of a sample of papers searched individually, the indexing rate increases significantly. This corroborates the known issue that Google Scholar underrepresents the number of results when a “site” command search is conducted. According to the study conducted by Doemeland and Trevino (2014), which searched for each document individually, almost 75% of the sample was indexed in Google Scholar. In our previous Digest (Google Scholar Digest, n. 2) we already confirmed this.

This particularly affects the accuracy of Google Scholar in measuring the performance of repositories with this indicator, and largely explains the low values. It also opens up a future research line which should consider whether the repositories with better indexing ratios in Google Scholar are also those with higher numbers of primary versions amongst their items, which may explain the better results of some repositories compared to others.

What does stand to reason is that Google Scholar indexes far fewer PDF documents than Google, given the requirements and recommendations that this search engine provides institutional repository webmasters for indexing documents. These include the following:

·    “If you’re a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers.
·    To be included, your website must make either the full text of the articles or their complete author-written abstracts freely available and easy to see when users click on your URLs in Google search results.
·    Automatic crawlers need to be able to discover and fetch the URLs of all your articles, as well as to periodically refresh their content from your website. Browse interface is necessary for the search robots to discover the URLs of your articles. We recommend that the URL of every article is reachable from the homepage by following at most ten simple HTML links.
·    Your website must not require users (or search robots) to sign in, install special software, accept disclaimers, dismiss popup or interstitial advertisements, click on links or buttons, or scroll down the page before they can read the entire abstract of the paper.
·    Sites that show login pages, error pages, or bare bibliographic data without abstracts will not be considered for inclusion and may be removed from Google Scholar.
·    Since Google refers users to your website to read the papers, your webpages must be available to both users and crawlers at all times. The search robots will visit your webpages periodically in order to pick up the updates, as well as to ensure that your URLs are still available. If the search robots are unable to fetch your webpages, e.g., due to server errors, misconfiguration, or an overly slow response from your website, then some or all of your articles could drop out of Google and Google Scholar.
·    Your files need to be either in the HTML or in the PDF format. PDF files must have searchable text, i.e., you must be able to search for and find words in the document using Adobe Acrobat Reader.
·    Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.”

Arlitsch and O’Brien (2012), while noting the limitations of the “site” command, found that the main causes are the metadata schema used and the navigability and information architecture features, which do not help the search engine robots carry out the indexing processes correctly. Indeed, they applied various changes to the description schema (rejecting Dublin Core in favour of other schemas recommended by Google Scholar, such as Highwire Press), and indexing ratios then improved significantly over time.
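
To make the schema change concrete, the sketch below renders a repository record as Highwire Press style meta tags for the HTML head of an item page. The tag names (`citation_title`, `citation_author`, `citation_publication_date`, `citation_pdf_url`) are those listed in Google Scholar's inclusion guidelines. The record structure, its field names and the function `highwire_meta_tags` are hypothetical, and this is not the procedure followed by Arlitsch and O’Brien.

```python
from html import escape


def highwire_meta_tags(record: dict) -> str:
    """Render one record as Highwire Press style <meta> tags (one per line)."""
    tags = [("citation_title", record["title"])]
    # One citation_author tag per author, in the order they appear on the paper
    tags += [("citation_author", author) for author in record["authors"]]
    tags.append(("citation_publication_date", record["date"]))
    tags.append(("citation_pdf_url", record["pdf_url"]))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Hypothetical record, as it might be exported from a repository platform
record = {
    "title": "An example thesis on web visibility",
    "authors": ["Pérez, Ana", "Silva, João"],
    "date": "2013/10/01",
    "pdf_url": "http://repo.example.edu/bitstream/123/thesis.pdf",
}
print(highwire_meta_tags(record))
```

Escaping the content values matters because titles frequently contain quotation marks or ampersands that would otherwise break the attribute.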

These limitations of Google Scholar in measuring the presence of repository contents also contrast with the policy of certain products of this company, such as Google Scholar Metrics, which quantifies the scholarly impact of repositories (Delgado López-Cózar & Robinson-García 2012).

In short, it may be concluded that the low repository content indexing ratios are mainly due to these two limitations: the use of description schemas that are not compatible with Google Scholar (repository design) and the reliability of the web indicators (search engines).

Finally, it was found that the Google queries that combined the overall page count with the PDF file type achieved the best results and were the most similar to the data that the repositories themselves reported on the size of their collections. This may be because primary versions are not taken into account in the Google search, whereas in Scholar they are, which clearly underrepresents the presence of repositories when measured by the “site” command.

The final conclusions of this study highlight the insufficient dissemination of open access scholarly literature (crucially in terms of web visibility) in a medium (the Web) that is by definition its natural environment, and in a context (Latin America), in which scholarly production requires extra visibility because it lies outside the academic mainstream (i.e. not published in journals indexed in WoS or Scopus).

Given the weight of the green route in the dissemination of OA scholarly literature, and the importance of Google (and Google Scholar) to the search and use of academic information, the low visibility of the contents could significantly affect the real use of OA by end users. It would appear to be generating a great hidden mass of open access content, from institutional repositories, which neither Google, in the first instance, nor users, in the last instance, can locate.

The lack of web visibility of the analysed repositories is determined by the low indexing ratios of their content (both in Google and Google Scholar), since a low web presence determines a corresponding low web visibility.

These low indexing ratios are, in turn, determined by the use of description schemas that are ill-suited to Google and inadequate web navigability, factors already outlined by Arlitsch and O’Brien (2012). Additionally, this study has also identified certain technical limitations in the use of web indicators in Google and Google Scholar to measure this indexing.

Therefore, we consider that neither Google nor Google Scholar is accurate or representative of the actual page count of open access content published by Latin American repositories; this may indicate the existence of a hidden, non-indexed side of OA.

In any case, the technical limitations of Google Scholar, in only counting primary versions of articles, tilt the balance towards the use of Google to measure page count, despite the fact that the document noise is greater. However, a thorough analysis of the real influence of the primary version search and accuracy of the “site” command in repository performance in Google Scholar (which requires an item by item analysis of each collection) is deemed necessary.

Much of the solution to these problems is purely technical, and should be addressed in the short term to ensure the visibility of repositories, to which institutions are now devoting significant financial and human resources. This must include a rethinking of the goals that must be achieved to guarantee the success of a repository, for which presence and visibility in search engines must bear greater weight.

However, the results come from the analysis of a small sample of repositories, and should be widened in the future to larger samples in order to draw more definitive conclusions.


Aguillo, Isidro F. (2009). Measuring the institutions’ footprint in the web. Library Hi Tech, 27(4), 540-556.
Arlitsch, K. & O’Brien, P.S. (2012). Invisible institutional repositories: addressing the low indexing ratios of IRs in Google Scholar. Library Hi Tech, 30(1), 60-81.
Delgado López-Cózar, E. & Robinson-García, N. (2012). Repositories in Google Scholar Metrics: what is this document type doing in a place as such. Cybermetrics, v. 16.
Available at  http://cybermetrics.cindoc.csic.es/articles/v16i1p4.pdf (accessed 15 March 2014).
Doemeland, Doerte & Trevino, James (2014). Which World Bank reports are widely read? World Bank Policy Research Working Paper, n. 6851.
Martín-Martín, A.;  Ayllón, J.M.; Orduña-Malea, E.; Delgado López-Cózar, E. (2014). The World Bank’s policy reports in Google Scholar. Are they visible, cited, and downloaded?. EC3 Google Scholar Digest Reviews, n. 2.

Granada and Valencia, 25 June 2014

12 jun 2014

The World Bank’s policy reports in Google Scholar. Are they visible, cited, and downloaded?

Although the main goal of this work is to assess the use and impact of the World Bank’s reports, it is included in the Google Scholar Digest reviews because the authors not only analyse the visibility of those documents in Google Scholar but also use this database to measure the impact of these reports through their citations. In this review we address only the results that are directly associated with Google Scholar. Since the study reviewed analyses only a limited sample of certain document types (those classified as “Economic and Sector Work” or as “Technical Assistance”) and a very specific timespan (2008-2012), in the discussion section we try to find out to what degree the reports published by the World Bank are indexed and cited in Google Scholar. To do this, we analyse the contents of the World Bank’s Open Knowledge Repository (OKR), finding that only 17.1% of the 15,319 documents deposited in the OKR are indexed in Google Scholar, and that 60% of those indexed have received at least one citation.


§ Can we estimate the demand and use of the World Bank’s policy reports from their download and citation counts?
§  How many World Bank’s policy reports are covered by Google Scholar?
§  Are the World Bank’s policy reports cited in Google Scholar?
§  Can we identify how often (and when) policy reports were downloaded?

Unit of analysis

World Bank’s policy reports: those documents within the Documents & Reports database that have been published as Economic and Sector Work, or Technical Assistance reports.
1,611 policy reports.
§  Data were gathered for all policy reports which are part of the World Bank’s Documents & Reports (D&R) database.
§  Download counts were gathered using Omniture web analytics software.
§  The World Bank’s Open Knowledge Repository (OKR) was used to verify whether the policy reports were included in Google Scholar or not.
§  Number of times a PDF has been downloaded from outside the World Bank’s own website.
§  Number of times a policy report has been cited in Google Scholar.
Period analyzed:  2008-2012
Data collection date: Unknown.
1.  74.5% of the World Bank’s policy reports are indexed in Google Scholar (1,201 out of 1,611). No significant differences have been found in the range of years studied (Figure 1).
Figure 1. Percentage of  World Bank’s policy reports (Economic and Sector Work, or Technical Assistance) indexed on Google Scholar (2008-2012)
Data source: re-elaborated from Doemeland & Trevino (2014)
2.  88% of the policy reports (1,054 out of 1,201) in the sample were never cited. Of the 147 policy reports cited, 93 were cited between 1 and 5 times, and only 54 (3%) were cited more than 5 times (Table 1).

Table 1. World Bank’s policy reports (Economic and Sector Work, or Technical Assistance) cited and downloaded on Google Scholar (2008-2012)
Data source: re-elaborated from Doemeland & Trevino (2014)

3.  68% of the policy reports in the sample (1,093 out of 1,611) were downloaded (Table 1), although most of them relatively few times (40% were downloaded between 1 and 100 times). The policy reports downloaded more than 250 times make up 13% of the sample, and only 25 policy reports (2%) received more than 1,000 downloads during the period investigated.
4.  Citation counts are much lower than download counts (Figure 2). Only 12% received at least one citation.
Figure 2. Percentage of  World Bank’s policy reports (Economic and Sector Work, or Technical Assistance) cited and downloaded (2008-2012)
Data source: re-elaborated from Doemeland & Trevino (2014)
5.  Reports on middle-income countries with larger populations, using more expensive, complex, multi-sector, and core diagnostics, tend to be downloaded more frequently.
6.  Multi-sector reports also tend to be cited more frequently, but unlike downloads, costs are not a significant determinant of citations.
7.  The cross support provided by the World Bank’s Research Department plays an important role in increasing the demand and use of policy reports.
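
The headline percentages above use two different denominators: the full sample of 1,611 reports for indexing and downloads, and the 1,201 indexed reports for citations. The short sketch below recomputes them from the counts given in the text; the variable names and the helper `pct` are our own.

```python
sample = 1611       # policy reports in the sample
indexed = 1201      # reports found in Google Scholar
never_cited = 1054  # indexed reports with no citations
downloaded = 1093   # reports downloaded at least once


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


cited = indexed - never_cited

print("indexed / sample:     ", pct(indexed, sample))
print("never cited / indexed:", pct(never_cited, indexed))
print("cited / indexed:      ", pct(cited, indexed))
print("downloaded / sample:  ", pct(downloaded, sample))
```

Keeping the denominators explicit matters when comparing these figures with the repository-wide ratios discussed later.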

The most suggestive results of this work concerning our object of study (scientific knowledge about Google Scholar) are the empirical evidence it provides on the wide and diverse coverage of Google’s academic search engine. It confirms something well known: Google Scholar, unlike traditional bibliographic databases that are mainly focused on indexing journal articles and conference proceedings, collects all the types of documents produced in the scientific domain (articles, conference proceedings, books and book chapters), in academic circles (doctoral theses, master’s or undergraduate theses, teaching materials) and, of special interest in this work, in the professional world (patents, scientific/technical reports).

In this case, the documents at hand are technical reports, and specifically, those published by the World Bank. The study shows that almost 75% (74.5%) of the World Bank’s policy reports classified as “Economic and Sector Work” or as “Technical Assistance” between 2008 and 2012 are indexed by Google Scholar. The World Bank's Policy Research Report series brings to a broad audience the results of World Bank research on development policy. These reports are designed to contribute to the debate on appropriate public policies for developing economies (Figure 3).

Figure 3. World Bank’s policy research reports webpage
Source: http://econ.worldbank.org

The importance of this kind of document is justified not only by the institution that publishes it (the World Bank is an authoritative economic institution), but also by the influence that its research, carried out through these reports, may have on the economic policies and the economic development of the nations concerned.

Being documents written by policy makers rather than by researchers, they reflect points of view different to those to be found in strictly scientific articles. And, as the technical documents that they are, they contain abundant bibliographic references which allow measuring their professional, economic, and even social impact in a more comprehensive and complete way. As Google Scholar indexes these document types, its analysis makes possible, albeit indirectly, the measurement of other impact dimensions besides the academic one.

Nonetheless, since the study by Doemeland & Trevino analyses a limited sample of certain document types (those classified as “Economic and Sector Work”, or as “Technical Assistance”) and a very specific period of time (2008-2012), we feel compelled to ask:

Are all the reports published by the World Bank indexed in Google Scholar? 
Can we say that Google Scholar is exhaustive?

In order to answer these questions, we have analysed the contents of the World Bank’s Open Knowledge Repository (OKR), a repository launched by the World Bank in April 2012 with the purpose of enabling free and unrestricted access to most of its research and intellectual materials (books, articles, reports and research documents). The goal of this new open access policy for the bank’s information is that all documents are freely accessible to anybody who wants to reuse them, distribute them, or produce derivative works from them, even for commercial purposes. Emphasis is placed on the fact that documents in the OKR should be easy to find by search engines.

Currently (June 2014), the OKR contains 15,319 documents (Figure 4, upper image). According to the study by Doemeland and Trevino (2014), almost 75% of the sample was indexed in Google Scholar. Therefore, would it be safe to say that 75% of the total number of documents in the OKR is indexed in Google Scholar?

If we use the “site” operator in Google Scholar, it only retrieves about 2,760 results (Figure 4; bottom image). Although Google Scholar clearly states that this operator is not intended for checking the full coverage of a specific source, such a low result leads us to think that the inclusion rate for the entire OKR does not correspond with the results provided by Doemeland and Trevino in their reports sample.

Moreover, when we downloaded from Google Scholar all the records it holds for documents hosted in the OKR, we obtained a total of 2,620 documents. This means that the difference between the number of documents that Google Scholar reports for the query and the number of records it actually returns is small.

Figure 4. Documents indexed on the World Bank’s repository, and in Google Scholar

The data we have obtained in this quick inquiry tells us that only 17.1% of the documents in the OKR are indexed in Google Scholar. The number of documents per year is shown in Table 2.
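
The 17.1% figure follows directly from the counts reported above. The sketch below reproduces the arithmetic and also quantifies the gap between the “site” estimate and the records actually downloaded; the variable names are our own.

```python
okr_total = 15319        # documents in the OKR (June 2014)
gs_site_estimate = 2760  # approximate hit count reported for the "site" query
gs_downloaded = 2620     # records actually retrieved from Google Scholar

# Share of OKR documents found in Google Scholar
indexing_ratio = 100 * gs_downloaded / okr_total

# How far the reported hit count overshoots the retrievable records
estimate_gap = gs_site_estimate - gs_downloaded
gap_share = 100 * estimate_gap / gs_site_estimate

print(f"Indexing ratio: {indexing_ratio:.1f}%")
print(f"Estimate gap:   {estimate_gap} records ({gap_share:.1f}% of the estimate)")
```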

Table 2. Documents from OKR indexed in Google Scholar, and their citations per year

To get a clearer idea of the type of documents that Google Scholar has not indexed from this authoritative source, it is illustrative to look at the document types that make up the OKR: Working Papers (5,106), Economic and Sector Work (ESW) Studies (3,497), Knowledge Notes (2,599), Books (1,823), Journal articles (1,768), Annual Reports & Independent Evaluations (221), Serials (121), Technical papers (135). Whatever the reasons, these data suggest that the number of scientific/technical documents on the Web is much larger than we may think, and larger than what Google Scholar can show us.

As regards citations, the results we have obtained show that 60% of the 2,620 documents from the OKR indexed in Google Scholar have received at least one citation (Table 2), which differs greatly from the results obtained by Doemeland & Trevino (2014), who found that only 12% of the documents in their sample had been cited at least once.

In order to further illustrate the impact of the documents published by the World Bank in the OKR, Table 3 shows the Top 25 most cited OKR documents in Google Scholar.

Table 3. Top 25 most cited documents from the OKR in Google Scholar

Since all of them have received at least 100 citations (the most cited document even surpassing 800 citations), the scientific impact of the documents contained in the OKR is undeniably significant.

As a preliminary conclusion, we found that, even though Google Scholar gathers more document types than any other database, the visibility of World Bank reports in Google Scholar is far from complete. And this is considering only the material deposited in the official repository, not to mention the remaining material that may be located in other subdomains of the World Bank.

All these issues tie in directly with the subject of our previous “digest” (How many academic documents are visible and freely available on the Web?) and will pave the way for new working papers, which will be released soon:

-       The first of them will aim to measure with greater accuracy the proportion of documents published by the World Bank (not only in the repository) that are indexed in Google Scholar, as well as how many of them are cited.
-       The second one, of a more general nature, will focus on the size of Google Scholar.

Granada and Valencia, 12 June 2014