31 oct 2014

The most cited documents in Google Scholar & Web of Science: two sides of the same coin

As one more example of those unlikely but curiously recurrent cases of two different teams studying the same research problem at the same time, two studies on highly cited documents have been released this week, barely a few hours apart. One of them was published by Van Noorden, Maher & Nuzzo in Nature. The other was submitted by Martín, Ayllón, Orduña & Delgado López-Cózar to the arXiv repository.

The first one studies the top 100 most cited articles in Web of Science on the occasion of the 50th anniversary of the Science Citation Index (SCI), although it also offers an alternative list containing the top 100 most cited documents in Google Scholar. Additionally, the authors comment on many of these articles individually and on their contribution to the advancement of science.
Conversely, the second study, carried out on the occasion of Google Scholar's 10th anniversary, takes a different approach to the same issue, analysing the highly cited documents in Google Scholar (a sample of 64,000 documents published between 1950 and 2013). Among other things, we compare the ranking obtained from Google Scholar data to the most cited documents in Web of Science.
This coincidence has an interesting side effect, since it allows us to compare and verify the validity of the methodologies used and the results reached in both studies.

Our conclusion is that Google Scholar presents a different view from the one we were used to after many years of using the Web of Science (50% of the highly cited documents in GS are not indexed in WoS).

However, if we analyse only those documents that are indexed in both Google Scholar and the Web of Science (32,680 documents in our sample), Google Scholar presents a very similar portrait of the world of research to the one offered by Web of Science, with one significant difference: 91.6% of the documents have received more citations in GS than in WoS. Only 3,079 documents (9.4%) have more citations in WoS than in GS. Furthermore, the average number of citations per document is 1.79 in GS and 1.08 in WoS, which means that, on average, GS has 70% more citations per document than WoS.
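
For readers who want to retrace the arithmetic behind these figures, here is a minimal sketch in Python. It uses only the numbers quoted in this post; the function and variable names are ours, chosen for illustration. With the rounded averages given here, the relative difference works out at roughly 66%, a little below the rounded 70% quoted above.

```python
# Figures quoted in the post, for documents indexed in both GS and WoS.
overlap_docs = 32_680   # documents found in both sources
more_in_wos = 3_079     # documents with more citations in WoS than in GS
mean_gs = 1.79          # average citations per document, Google Scholar
mean_wos = 1.08         # average citations per document, Web of Science


def share(part, total):
    """Return part as a percentage of total."""
    return 100 * part / total


def relative_excess(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return 100 * (a - b) / b


wos_share = share(more_in_wos, overlap_docs)
gs_excess = relative_excess(mean_gs, mean_wos)

print(f"Documents with more citations in WoS: {wos_share:.1f}%")
print(f"GS average exceeds WoS average by: {gs_excess:.1f}%")
```

Because the two averages are themselves rounded to two decimals, the second result should be read as an approximation.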

Our recommendation: read and compare.

15 oct 2014

Academic Search Engines: A quantitative outlook, by José Luis Ortega

The recently published book “Academic Search Engines: a quantitative outlook” (Chandos Publishing) is the first monograph that deals jointly, generally, and exhaustively with the topic of academic search engines. This alone makes it immensely valuable. The book gives us a complete view of the past, present and even the future of the tools whose aim is to improve the search and discovery of scientific information on the Web. The novelty of this work lies both in the originality of the subject and in the perspective from which it is approached: a quantitative one, rather than a merely qualitative one.

By reading this book we'll learn not only the details of all the features, search functionalities, and information retrieval techniques used by these academic search engines, but also their coverage (the sources from which they feed, the number of documents that they cover and their typologies). In short, all the information needed to assess the quality of these information systems, namely:

  • the thoroughness and accuracy of their contents,
  • the effectiveness of the tools used to retrieve and extract content,
  • the quantity, variety and quality of the results that they display.
The products that the book analyses are the following:

Each one of these academic search engines is subjected to a detailed analysis in its own chapter of the book, except the last three (BASE, Q-Sensei Scholar, WorldWideScience), which are examined together. The analysis is not merely descriptive: the commentary is both incisive and critical, which allows the reader to quickly discern the strengths and weaknesses of each search engine.

After reading each one of these chapters, the conclusion is clear: academic search engines differ so much in concept, purpose, and design, and deliver such different features and results, that they cannot be easily compared.

Nonetheless, without any doubt, the most suggestive chapter is paradoxically the last one, where the author executes a comparative analysis of all search engines whose goal - in the author’s own words - is none other than “to analyze together these systems from different perspectives which contribute to have a multilayer view on the performance of each search service. In this way, this comparative approach would explain in a different form the advantages and shortcomings of each search engine in relation with the other ones, which would stress the significance of these facets. With that, it is not intended to do a competitive process to select the best search engine for scientific information, but to contextualize the performance of these search tools in relation with the other ones as a way to describe its advantages or to mark their weakness.”

Summarizing, these are the results of the comparative analysis:

  • Google Scholar is the most exhaustive and complete academic search engine, since it crawls the academic Web deeply and draws on a wide range of sources. Its duplicate management is rather satisfactory as well.
  • In terms of source types, Google Scholar is the search engine that feeds from the widest range of sources.
  • From a qualitative point of view, Microsoft Academic Search has proved to be the best profiling tool, because the structure of the entire site is built around profiles at different levels of aggregation.
  • As regards search interfaces, BASE and Scirus are the services that perform best, whereas AMiner could be considered the worst engine in this respect.
  • As regards exporting features, BASE is still the best solution in terms of the number of different formats, although WorldWideScience and CiteSeerx are the tools that allow the largest number of records to be downloaded.

As in all endeavours of life, “nobody is perfect”, and that's why it is not surprising that this comparative analysis hasn't been able to discover the ideal search engine: all of them stand out in at least one aspect, and at the same time, all of them are outperformed by the rest in many other aspects. As the author appropriately concluded, “As it has been seen throughout this benchmarking exercise, it has not been possible to appreciate what is the best system because it depends of each user’s needs.”

The principal problem of this book is the one that affects all books dealing with technological topics, especially those discussing individual software or hardware products: obsolescence. Since technology is constantly changing and products are updated incessantly, a product (or the version of it that was analysed) may differ greatly between the time the book is written and the time it is published, even to the point of being unrecognisable. It may even happen that a product analysed in a book is already dead by the time the book is released (such as Scirus), or is about to be (Microsoft Academic Search). This, however, doesn't diminish the interest of the book in the slightest, since from a scientific point of view it is essential to know the contributions and solutions that every product has implemented to solve the problem of the search, retrieval and evaluation of scientific information on the Web.

Portrait of Google Scholar and its derivative products Scholar Metrics & Scholar Citations

Of special interest to our blog is the empirical analysis the author carried out on Google Scholar and its derivative products Google Scholar Metrics and Google Scholar Citations. Apart from describing in detail how the search engine works, the most relevant information, from our point of view, is the empirical data about its coverage:
  • Google Scholar contains 109.3 million documents, of which 94.74 million (86.7%) are scientific documents and 14.5 million (13.2%) are court opinions, or case law.
  • The distribution of publications over time is quite irregular, marked by peaks and valleys. These results suggest a slow growth rate with significant freshness problems.
  • Apart from legal documents, the dominant document type in GS is the “academic paper” (although the text doesn't clarify it, we assume it is referring to journal papers), which represents 46.8% (44.4 million documents), followed by patents (19.6%; 18.55 million), and books (12.1%; 11.46 million). Moreover, 21.1% of the documents are included in a category that is not really a document type (citations), but just a bibliographic reference to a document that GS hasn't been able to locate on the Web, only in the reference lists of other documents. It could be a book, a book chapter, a report, a journal paper, conference proceedings… Therefore, there is a large fraction of documents whose type remains unknown.
  • Regarding the sources from which GS extracts information, it should be noted that 58.8% of the documents in GS come from publishers such as SpringerLink or ScienceDirect, 8.2% from Google Patents and Google Books, 28.1% from open access repositories (thematic repositories 16.9%; institutional repositories 11.8%) and finally 4.7% from bibliographic services. However, it is worth remembering that these data have been obtained through queries that used the “site:” query command, which is not entirely reliable since it only offers approximate hit counts and not an exact number of documents. Also, it only takes into account the primary versions of the documents, and not the rest of the versions, which may be stored in a great variety of hosts.
  • Google Scholar’s search system shows serious inconsistencies that put the reliability of the search engine into question, especially about the estimated number of results displayed in the queries, and the limitation of 1,000 results per query.
  • Over time, Google Scholar has significantly improved its management of duplicates, and has corrected problems with document parsing and with erroneous citation counting and assignment.
  • In December 2013, it is estimated that Google Scholar Citations contained 350,000 author profiles, with 18.3 million papers assigned to these profiles.
  • The countries with the highest presence of author pages are the United States (24.8%), the United Kingdom (6.4%) and Brazil (5.3%).
  • Three of the organizations with the most profiles are from Brazil: Universidade de São Paulo (1.3%), Universidade Estadual Paulista (0.6%) and Universidade Estadual de Campinas (0.4%). After these, the author finds the organizations that usually appear in research rankings, such as the University of Michigan (0.5%), Harvard University (0.4%) and the University of Washington (0.4%).
  • As regards the thematic coverage inside Google Scholar Citations, an overwhelming majority of the author profiles belong to researchers in the area of Computer Science. The absence of profiles with labels related to relevant scientific areas such as Medicine, Chemistry and Physics is surprising.
  • It is estimated that Google Scholar Metrics contained nearly 30,000 journals at the end of 2012.
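
The coverage figures in the list above pair absolute counts with percentages, so they can be cross-checked with simple arithmetic. The following is a minimal sketch in Python that uses only the counts and percentages quoted in this review; the names and the 0.1-point tolerance are our own assumptions, chosen to allow for rounding in the quoted figures.

```python
# Counts (in millions) and percentages as quoted in the list above.
TOTAL_DOCS = 109.3        # all documents in Google Scholar
SCIENTIFIC_DOCS = 94.74   # scientific documents only

# (label, count in millions, base in millions, quoted percentage)
quoted_figures = [
    ("scientific documents", 94.74, TOTAL_DOCS, 86.7),
    ("court opinions", 14.5, TOTAL_DOCS, 13.2),
    ("academic papers", 44.4, SCIENTIFIC_DOCS, 46.8),
    ("patents", 18.55, SCIENTIFIC_DOCS, 19.6),
    ("books", 11.46, SCIENTIFIC_DOCS, 12.1),
]


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


def check(figures, tolerance=0.1):
    """Recompute each percentage and compare it with the quoted value.

    Returns a list of (label, computed, quoted, within_tolerance) tuples.
    """
    results = []
    for label, part, whole, quoted in figures:
        computed = percent(part, whole)
        results.append((label, computed, quoted, abs(computed - quoted) <= tolerance))
    return results


for label, computed, quoted, ok in check(quoted_figures):
    status = "consistent" if ok else "check"
    print(f"{label}: computed {computed:.2f}%, quoted {quoted}% ({status})")
```

Under these assumptions, the document-type percentages appear to be calculated over the 94.74 million scientific documents rather than over the full 109.3 million.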

After this exhaustive quantitative overview of Google Scholar and its derivative products, we want to point out some discrepancies and contradictory data with respect to the results previously presented by our research group (EC3 Bibliography). These differences, from a technical and methodological point of view, open a fascinating scientific debate, and they demonstrate once again the difficulty of making size estimations for tools as opaque and dynamic as these search engines.

There is only one question left to ask ourselves: what future awaits these search engines? A search engine, regardless of all its sophisticated tools for searching scientific information, is designed to be used. Do we have data about the past and present use of these search engines? To zero in on this issue, we have tried to determine which are the most popular academic search engines according to users' queries, as measured by Google Trends. We have generated the charts both including Google Scholar and leaving it out, to be able to observe the differences. The results speak for themselves: there is ONE search engine above all the rest, which in practice have become irrelevant today.

To sum up, and going back to the study that is the object of this review, we must conclude as we started: we are facing an essential reference work in the history of the search for scientific information on the Web in general, and specifically on the academic search engines.