25 June 2014

Are Latin-American repositories invisible on Google and Google Scholar?



SUMMARY
In this issue, not without some embarrassment, we digest a contribution from our own group. The main objective of the study is to ascertain the presence and visibility of Latin American repositories in Google and Google Scholar through the application of page count and visibility indicators. For a sample of 127 repositories, the results indicate that the indexing ratio is low in Google and virtually non-existent in Google Scholar. A complete lack of correspondence between the repository records and the data produced by these two search tools is also found. These results are mainly attributable to the use of description schemas that are incompatible with Google Scholar (repository design) and to the limited reliability of web indicators (search engines). We conclude that neither Google nor Google Scholar accurately represents the actual size of the open access content published by Latin American repositories; this may point to a non-indexed, hidden side of OA, which could be limiting the dissemination and consumption of open access scholarly literature.

1. DIGEST


RESEARCH QUESTIONS
§  How many documents from Latin American institutional repositories are indexed by Google and Google Scholar?
§  What is the web impact of content published in Latin American institutional repositories?
§  Is there a correlation between page count and web visibility?
METHODOLOGY

Unit of analysis

Latin American institutional repositories listed in the Ranking Web of Repositories (July 2013 edition)
Sample
127 Latin American institutional repositories
Design
§ The size of each repository (number of items hosted) is obtained from the information provided by the platform itself.
§ The total number of items indexed in Google and Google Scholar is calculated with the site: command (<site:domain.com>; <site:domain.com filetype:pdf>).
§ Mention values were obtained from Open Site Explorer, a link analysis tool. It retrieves both the number of external links to each repository (measured at the aggregate domain level, i.e., all external links from the same domain are counted only once) and the MozRank indicator at subdomain level, which provides an estimated value for the popularity of the websites analysed.
§ Additionally, the number of mentions of each URL was obtained from Google, giving an estimated indicator of the number of external links (<"domain.com" -site:domain.com -inurl:domain.com>).
§ A correlation analysis was conducted for all indicators (given the skewed distribution of web data, the Spearman correlation coefficient was applied), as was a principal component analysis (PCA); a minimal sketch of these steps is given at the end of this Methodology block.
Measures
§  Number of documents hosted by the repository (ITE)
§  Number of files indexed in Google (Gtot)
§  Number of PDF files indexed in Google (Gpdf)
§  Number of files indexed in Google Scholar (GStot)
§  Number of PDF files indexed in Google Scholar (GSpdf)
§  Number of times the URL is mentioned (URL)
§  Number of external links grouped by domain (V)
§  MozRank link popularity score (0 to 10) (Mz)
Period analyzed:  All
Data collection date: October 2013
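
For illustration only, the query construction and statistical analysis described in the Design items above could be reproduced along the following lines. This is a minimal Python sketch, not code from the original study: the input file repositories.csv and its loading step are assumptions, the column names follow the Measures list, and the query templates follow the commands quoted in the Design items.

import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def page_count_queries(domain):
    """Build the search queries described above for a repository at `domain`."""
    return {
        "total":    f"site:{domain}",                              # all indexed files
        "pdf":      f"site:{domain} filetype:pdf",                 # PDF files only
        "mentions": f'"{domain}" -site:{domain} -inurl:{domain}',  # URL mentions
    }

# Hypothetical input file: one row per repository with the eight measures below.
df = pd.read_csv("repositories.csv")
measures = ["ITE", "Gtot", "Gpdf", "GStot", "GSpdf", "URL", "V", "Mz"]

# Spearman (rank-based) correlations, suited to the skewed distribution of web data.
rho, pvals = spearmanr(df[measures])
print(pd.DataFrame(rho, index=measures, columns=measures).round(2))

# Principal component analysis on standardised indicators.
pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(df[measures]))
print("Explained variance ratio:", pca.explained_variance_ratio_.round(2))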

RESULTS
Errors in the functionality of search engines
For some repositories, the page count values returned by the search engines exceed the number of items declared by the repository itself (the frequency of these errors varies according to the source); a minimal check of this kind is sketched after the points below.
· In the case of Google, 109 URLs were found whose reported size is greater than the number of items hosted. For PDF files, the number of URLs with this error drops to 47, which indicates that this query is more precise than the one for overall size. It therefore seems clear that the search engine retrieves not only items from the repository but also other files hosted on the domain (including those pertaining to the application used to manage the repository).
· In the case of Google Scholar, there are even fewer errors: the total page count yields 11 URLs with values greater than those reported by the repositories, and for PDF files there are only three. In this case, the errors are directly related to errors in the indexing of resources, but they are practically non-existent and are, in any case, detectable and easily controlled.
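
For illustration, the consistency check behind these figures (search-engine page counts exceeding the repository's own item count) could be expressed as follows; a minimal sketch that assumes the DataFrame and column names from the earlier methodology sketch.

# Count repositories where a search engine reports more indexed pages than the
# platform itself declares items (the error discussed in the points above).
for col in ["Gtot", "Gpdf", "GStot", "GSpdf"]:
    inflated = (df[col] > df["ITE"]).sum()
    print(f"{col}: {inflated} URLs report more pages than items hosted")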
Google and Google Scholar Coverage
· If we restrict the coverage analysis to documents in PDF format, coverage is low in Google (48.3%) and virtually non-existent in Google Scholar (2.5%) (Figure 1).

Figure 1. Percentage of documents (PDF files) from 127 Latin-American repositories indexed in Google and Google Scholar
Data source: re-elaborated from Orduña-Malea & Delgado López-Cózar (in press)

· If the search is extended to all file types indexed by Google and Google Scholar, the results indicate that Google indexes 100% of the documents and Google Scholar only 34.2% (the underlying calculation is sketched below).
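
The coverage percentages reported in this subsection reduce to a simple proportion of indexed documents over hosted items. A hedged sketch of the aggregate calculation, assuming the same DataFrame as above and a plain sum over the sample; the exact aggregation used in the study may differ.

# Aggregate indexing ratio: indexed documents as a share of items hosted.
for col in ["Gtot", "Gpdf", "GStot", "GSpdf"]:
    ratio = 100 * df[col].sum() / df["ITE"].sum()
    print(f"{col}: {ratio:.1f}% of hosted items indexed")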
 Impact of the repositories: mention indicators
· URL mentions: the values obtained are exceptionally high, especially for <tesis.usp.br> (5,380,000 hits). Although search engines round these values, it is evident that the noise is considerable, despite using the <-inurl> command to exclude certain types of spam. Even so, we detected some exceptions: URLs which, despite high page count values (both for items in the repository and for pages indexed by Google), made hardly any impact in terms of URL mentions.
· Referring domains: the impact achieved was very low: only 4 URLs attracted links from more than 100 domains, while 21 did not return any result at all. These data are consistent with the MozRank values (which depend directly on the quantity and quality of inbound external links to the analysed websites): no URL scored more than 5 points out of a maximum of 10, and 23 URLs obtained a value of 0.
Correlation between page count and impact
· The number of items retrieved directly from the platform (ITE) correlated significantly with several page count indicators, especially the PDF file page count in Google (r=.75) and the total page count in Google Scholar (r=.68). However, a very low correlation was obtained with the PDF page count in Google Scholar (r=.31), even though this indicator should, in principle, have been the most accurate in capturing the number of articles deposited in an institutional repository; its indexing ratios were in fact very low.
· With regard to the correlation of ITE with mention indicators, unexpectedly significant results were obtained for the number of URL mentions (r=.63), which shows that, despite the document noise of this indicator, its results do have a certain value.
· Finally, almost no correlation was observed between ITE and the hyperlink-based indicators, either for the number of referring domains (r=.26) or for MozRank (r=.22).
· The PCA clearly shows the separation between page count and visibility indicators, and how the URL mention indicator sits closer to the page count indicators than to the visibility indicators, when by its nature the opposite should be true.

2. DISCUSSION

The complete lack of correspondence between the repository records and the data provided by Google and Google Scholar should be noted. Equally striking are the marked discrepancies between the search engines themselves: they coincide only in their extremely low indexing values for PDF documents.

This raises a preliminary question about the reliability and validity of the data search and retrieval process (the "site" command), the technical indexing mechanisms of the robots used by Google and Google Scholar, and/or the deficient web architecture of the repositories themselves, which could well be the cause underlying the other issues. Similarly, the database design of some repositories may prevent the accurate retrieval of indicators by search engine robots (a problem known as the invisible web), although the development of applications such as DSpace (widely used by the repositories in this study's sample) has largely eliminated this problem.

With regard to Google (which should, in principle, index everything to achieve its goal of making the world’s information universally accessible), the inordinately high page count data (well above real values) must be due to the counting of files that are not specifically items of the collection studied, i.e., files pertaining to the software itself or other information hosted by the server being analysed (easily verifiable by manually browsing through the results returned for the “site” query in the search engine).

Regarding the number of PDF documents, although the exact figures for this document type in the repositories under study are not known, such a low indexing ratio is very strange. The pervasive use of the PDF format is an irrefutable fact in academia (Aguillo 2009), and it is very odd that academic repositories such as those studied here, which typically contain scholarly output (theses, articles, reports and other academic documents such as course syllabi and teaching materials), should have such a low percentage, save a few notable exceptions (<lume.ufrgs.br>, <repositorio.ufsc.br>). It is therefore plausible to conclude that Google underrepresents the scientific and academic content of the repositories.

By contrast, the total number of documents indexed in Google Scholar is well below what was expected. The low item indexing ratios in Google Scholar (whose database is not the same as Google's) are consistent with those obtained previously by Arlitsch and O'Brien (2012), who detected low indexing ratios for repository articles in Google Scholar in the United States: only 30% of the documents stored in the 21 repositories in their sample were indexed in Google Scholar. Using the same methodology (the query <site:repositoryURL>), the indexing rate found in this study was only 34.2%. The ratio we found in June 2014 for documents in the World Bank's Open Knowledge Repository was even lower, at 17.1% (Martín-Martín et al. 2014).

In any case, there are several reasons that may explain why the overall data should be viewed with some caution. Aaron Tay has sharply summarized them in a recent post on his blog entitled 8 surprising things I learnt about Google Scholar.

First, the "site" operator does not return all the items that Google Scholar has indexed for a repository (special caution should be taken with URLs where the PDF suffix does not appear explicitly), which means it is not exhaustive. Second, the system that groups multiple versions of an article takes one version as the "primary" version. This process is automatic, although authors may also manually select which version of the article is the main one. In theory, the "site" command returns data only for the primary version (though this does not always happen). This means that if an article is hosted on different platforms (e.g. journal and repository) and the primary version is the one published in the journal, the "site" operator applied to the repository will not count the item, and vice versa, even though it is indexed on both platforms.

This is the reason why the indexation rate of the World Bank's Open Knowledge Repository is much lower than that of the Latin American repositories: this repository contains a large number of articles published in journals (over 10%) and related documents, whose versions appear subsumed under other URLs.

Whenever the search strategy consists of searching a sample of papers individually, the indexing rate increases significantly. This corroborates the known underrepresentation of results in Google Scholar when a "site" command search is conducted. According to the study by Doemeland and Trevino (2014), which uses the former methodology, almost 75% of the sample was indexed in Google Scholar. We had already confirmed this in our previous Digest (Google Scholar Digest, no. 2).

This particularly affects the accuracy of Google Scholar in measuring the performance of repositories with this indicator, and largely explains the low values. It also opens up a future line of research: whether the repositories with better indexing ratios in Google Scholar are also those with a higher proportion of primary versions amongst their items, which may explain the better results of some repositories compared with others.

What does stand to reason is that Google Scholar indexes far fewer PDF documents than Google, given the requirements and recommendations that this search engine provides to institutional repository webmasters for getting documents indexed. These include the following:

v “If you’re a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers.
v To be included, your website must make either the full text of the articles or their complete author-written abstracts freely available and easy to see when users click on your URLs in Google search results.
v Automatic crawlers need to be able to discover and fetch the URLs of all your articles, as well as to periodically refresh their content from your website. Browse interface is necessary for the search robots to discover the URLs of your articles. We recommend that the URL of every article is reachable from the homepage by following at most ten simple HTML links.
v Your website must not require users (or search robots) to sign in, install special software, accept disclaimers, dismiss popup or interstitial advertisements, click on links or buttons, or scroll down the page before they can read the entire abstract of the paper.
v Sites that show login pages, error pages, or bare bibliographic data without abstracts will not be considered for inclusion and may be removed from Google Scholar.
v Since Google refers users to your website to read the papers, your webpages must be available to both users and crawlers at all times. The search robots will visit your webpages periodically in order to pick up the updates, as well as to ensure that your URLs are still available. If the search robots are unable to fetch your webpages, e.g., due to server errors, misconfiguration, or an overly slow response from your website, then some or all of your articles could drop out of Google and Google Scholar.
v Your files need to be either in the HTML or in the PDF format. PDF files must have searchable text, i.e., you must be able to search for and find words in the document using Adobe Acrobat Reader.
v Each file must not exceed 5MB in size. To index larger files, or to index scanned images of pages that require OCR, please upload them to Google Book Search.”

Arlitsch and O'Brien (2012), while noting the limitations of the "site" command, found that the main causes are the metadata schema used and the navigability and information architecture of the repositories, which do not help the search engine robots carry out the indexing processes correctly. Indeed, after they applied various changes to the description schema (rejecting Dublin Core in favour of schemas recommended by Google Scholar, such as Highwire Press), indexing ratios improved significantly over time.
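
For illustration, the Highwire Press schema mentioned here amounts to a handful of <meta> elements in the HTML head of each item page. Below is a minimal Python sketch that generates them; the tag names (citation_title, citation_author, citation_publication_date, citation_pdf_url) follow Google Scholar's published inclusion guidelines, while the function and the example values are hypothetical.

def highwire_meta_tags(title, authors, year, pdf_url):
    """Render the Highwire Press <meta> tags recommended by Google Scholar
    for the <head> of a repository item page."""
    tags = [f'<meta name="citation_title" content="{title}">']
    tags += [f'<meta name="citation_author" content="{a}">' for a in authors]
    tags += [
        f'<meta name="citation_publication_date" content="{year}">',
        f'<meta name="citation_pdf_url" content="{pdf_url}">',
    ]
    return "\n".join(tags)

# Hypothetical item from a DSpace-style repository.
print(highwire_meta_tags(
    "Sample thesis title",
    ["Pérez, Juan", "García, Ana"],
    2013,
    "https://repositorio.example.edu/bitstream/123/456/1/thesis.pdf",
))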

These limitations of Google Scholar in measuring the presence of repository contents also contrast with the policy of certain products of this company, such as Google Scholar Metrics, which quantifies the scholarly impact of repositories (Delgado López-Cózar & Robinson-García 2012).

In short, it may be concluded that the low repository content indexing ratios are mainly due to these two limitations: the use of description schemas that are not compatible with Google Scholar (repository design) and the reliability of the web indicators (search engines).

Finally, it was found that the queries combining the overall page count with the PDF file type in Google were those that achieved the best results, and the ones most similar to the data the repositories themselves report about the size of their collections. This may be explained by the fact that Google does not restrict the count to primary versions, whereas Google Scholar does, a mechanism which clearly underrepresents the presence of repositories when measured with the "site" command.

The final conclusions of this study highlight the insufficient dissemination of open access scholarly literature (crucially in terms of web visibility) in a medium (the Web) that is by definition its natural environment, and in a context (Latin America), in which scholarly production requires extra visibility because it lies outside the academic mainstream (i.e. not published in journals indexed in WoS or Scopus).

Given the weight of the green route in the dissemination of OA scholarly literature, and the importance of Google (and Google Scholar) in the search for and use of academic information, the low visibility of these contents could significantly affect the real use of OA by end users. A great hidden mass of open access content from institutional repositories appears to be building up, which neither Google, in the first instance, nor users, in the last instance, can locate.

The lack of web visibility of the analysed repositories is determined by the low indexing ratios of their content (both in Google and Google Scholar), since a low web presence determines a corresponding low web visibility.

These low indexing ratios are, in turn, determined by the use of description schemas that are ill-suited to Google and inadequate web navigability, factors already outlined by Arlitsch and O’Brien (2012). Additionally, this study has also identified certain technical limitations in the use of web indicators in Google and Google Scholar to measure this indexing.

Therefore, we consider that neither Google nor Google Scholar accurately represents the actual size of the open access content published by Latin American repositories; this may indicate the existence of a hidden, non-indexed side of OA.

In any case, the technical limitation of Google Scholar, which counts only primary versions of articles, tilts the balance towards the use of Google to measure page count, despite its greater document noise. However, a thorough analysis of the real influence of primary versions and of the accuracy of the "site" command on repository performance in Google Scholar (which requires an item-by-item analysis of each collection) is deemed necessary.

Much of the solution to these problems is purely technical and should be addressed in the short term to ensure the visibility of repositories, to which institutions are now devoting significant financial and human resources. This must include a rethinking of the goals that define the success of a repository, among which presence and visibility in search engines should carry greater weight.

However, the results come from the analysis of a small sample of repositories; future work should extend the analysis to larger samples in order to draw more definitive conclusions.

References

Aguillo, Isidro F. (2009). Measuring the institutions’ footprint in the web. Library Hi Tech, 27(4), 540-556.
Arlitsch, K. & O'Brien, P.S. (2012). Invisible institutional repositories: addressing the low indexing ratios of IRs in Google. Library Hi Tech, 30(1), 60-81.
Delgado López-Cózar, E. & Robinson-García, N. (2012). Repositories in Google Scholar Metrics: what is this document type doing in a place as such. Cybermetrics, v. 16.
Available at  http://cybermetrics.cindoc.csic.es/articles/v16i1p4.pdf (accessed 15 March 2014).
Doemeland, Doerte & Trevino, James (2014). Which World Bank reports are widely read? World Bank Policy Research Working Paper, n. 6851.
https://openknowledge.worldbank.org/bitstream/handle/10986/18346/WPS6851.pdf?sequence=1
Martín-Martín, A.; Ayllón, J.M.; Orduña-Malea, E. & Delgado López-Cózar, E. (2014). The World Bank's policy reports in Google Scholar: are they visible, cited, and downloaded? EC3 Google Scholar Digest Reviews, n. 2.

Granada and Valencia, 25 June 2014
