Towards Open Bibliometric Indicators: Report #1

Survey of scholarly open data

Authors

Simon Willemin

Dr. Teresa Kubacka

Knowledge Management Group, ETH Library, ETH Zurich

Last updated on

August 5, 2023

Preliminary version

This is a preliminary version of the report, released for initial feedback from the community. We are looking forward to hearing from you. See Section 7 for the version history of this document.

This report is part of the project “Towards Open Bibliometric Indicators (TOBI)”, whose goal is to assess the data quality of open bibliometric datasets for a subset related to Swiss Higher Education Institutions (HEIs). The project is co-financed by swissuniversities and the ETH Library.

Abstract

We have analyzed more than 200 datasets containing open scholarly metadata. Based on this initial assessment, we have selected a short-list of data sources for in-depth analysis. We have collected properties such as size, licenses, source datasets and publisher details, and gathered them in a single table. We compare and discuss our findings. The results give us a basis to decide which datasets we will use for an in-depth data quality analysis.

Introduction

Motivation for surveying open scholarly metadata data sources

Researchers, analysts and developers who want to use scholarly metadata are faced with a vast choice of databases, tools and search engines: Wikipedia alone lists over 100 academic databases and search engines [1]. This variety of data sources has motivated various efforts to systematically characterize their similarities and differences, for example a peer-reviewed comparison of the properties of 40+ datasets accessible via an API [2], or the website SearchSmart [3], which helps choose between 90+ databases based on more than 500 criteria.

Nevertheless, as a starting point of project TOBI, we had to undertake our own evaluation of the available datasets. For analytical tasks that go beyond the simple aggregations and calculations provided in a GUI, programmatic access to the data is necessary, for example via an unlimited API or through a static copy of the dataset. This functionality is not available for many databases and search engines included in various comparisons, e.g. Google Scholar. We also have our own specific criteria for choosing the datasets for in-depth evaluation, such as the availability of the attribution of a research output to a Swiss HEI, or a particular license for the data dump, which are not covered by tools like SearchSmart that were developed for a different purpose. Lastly, to compare metadata quality, we will need to deal with records whose metadata may contain mistakes, so it is crucial to understand the relationships between primary and derivative data sources. We have not found another survey undertaking this task.
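As an illustration of the kind of programmatic access we have in mind, the sketch below pages through works attributed to Swiss institutions using the OpenAlex REST API, one of the data sources discussed later in this report. It is only a sketch: the endpoint, filter names and cursor-based paging follow the public OpenAlex documentation, the contact address is a placeholder, and any other dataset offering a comparable API or a downloadable dump could be queried in a similar way.

```python
import requests

# Minimal sketch (not the project's actual pipeline): page through works
# with a Swiss institutional affiliation via the OpenAlex REST API.
BASE_URL = "https://api.openalex.org/works"
params = {
    # Filter syntax as documented by OpenAlex; country code CH = Switzerland.
    "filter": "institutions.country_code:CH,from_publication_date:2022-01-01",
    "per-page": 200,              # maximum page size allowed by the API
    "cursor": "*",                # '*' starts cursor-based paging
    "mailto": "you@example.org",  # placeholder contact address ("polite pool")
}

works = []
for _ in range(5):  # capped at 5 pages (1,000 records) for illustration only
    response = requests.get(BASE_URL, params=params, timeout=60)
    response.raise_for_status()
    payload = response.json()
    works.extend(payload["results"])
    next_cursor = payload["meta"].get("next_cursor")
    if not next_cursor:           # no further pages
        break
    params["cursor"] = next_cursor

print(f"Retrieved {len(works)} works with a Swiss affiliation")
```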

Motivated by this, we have compiled an overview of the most important datasets in the context of TOBI. In this report we summarize our findings.

Similar work on data quality analysis of scholarly metadata at the micro level

Similar efforts exist to track metadata quality at a lower level, e.g. at the country and institutional level. For Canada, Simon van Bellen matched publishers and RORs of institutions [Simon van Bellen: 04.Bellen.pdf (cais2023.ca)]; for Italy, see [4].

Metrics based on open bibliometric data are currently attracting attention in Switzerland and in other countries

CWTS will base its ranking on open data: “We are currently working on an ambitious project in which we explore the use of open data sources to create a fully transparent and reproducible version of the Leiden Ranking. We expect to share the outcomes of this project later this year.” (https://www.leidenmadtrics.nl/articles/the-cwts-leiden-ranking-2023)

Articles:

  • 07.2023: [2]
  • Other recent articles?

Courses:

  • CWTS Course Program, 31 October to 3 November: “Scientometrics Using Open Data”

Projects:

  • Project PFI (Germany, 2021-2025; not specifically on open data sources)

Events:

  • ISSI 2023 Conference: “Transitioning towards Open Scientometrics with Open Science Graphs” (https://www.conftool.pro/issi2023/index.php?page=browseSessions&form_session=3)
  • Schweizer Bibliothekskongress 2023 on “Offenheit und Verantwortung” (openness and responsibility)
  • WOOC 2023: “The value of open scholarly metadata for research assessment purposes”
  • SYoS (2023-2024)
  • STI 2023: OpenAlex is used as the main source for some analyses
  • DORA@10
  • SNSF conferences
  • Other events?

Long-list of data sources

Selection process

To identify potential priority data sources, we produced a list of data sources, focusing on those satisfying most of the following criteria:

  • open access to metadata, with the possibility to download full dumps of the datasets and to use them in a local infrastructure,
  • relevant for Swiss HEIs research output analysis,
  • containing data related to research (research publications, funding details, organizations etc.),
  • not specific to a particular research field and aiming to cover all (or most) of the scientific output,
  • still operating and receiving updates,
  • containing more than one million searchable records.

As starting points, we used existing comparisons, such as the list of potential replacement data sources for Microsoft Academic Graph [5] and a list of academic databases and search engines from Wikipedia [1]. We also undertook our own search to identify other data sources.

Overall, we have identified more than 200 resources, which are listed in Table 1 (“the long-list”). The table contains the name of the data source, the URL of the project’s main page, and a qualitative label we used when determining whether to include the dataset in the short-list for detailed analysis (see Section 3.2 for more details).

Table 1: Long-list of datasets with their respective reason for rejection
(Interactive table; columns: Name, URL, Reason for rejection.)

The majority of data sources on the long-list are publication metadata datasets, although some research data repositories such as Zenodo or Figshare and other repositories listed in [6] appear in the list as well. At this point, we decided to leave out data sources focused primarily on research data because of the bibliometric focus of the project.

Data from institutional repositories

As a rule, repositories of single institutions were not taken into account. First, because of the sheer number of research institutions hosting their own repositories. Second, because the de-duplication effort required to synchronize them would exceed our resources many times over. Third, because most of them are only accessible via an OAI-PMH feed, which is not convenient for downloading big datasets [7]. Fourth, because they are already being harvested, directly or indirectly, by bigger datasets such as OpenAlex or FatCat.
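To illustrate the third point, the sketch below harvests records batch by batch with the OAI-PMH ListRecords verb, where every follow-up request must carry the resumptionToken returned by the previous response. The repository URL is a hypothetical placeholder; the verb, parameters and XML namespace follow the OAI-PMH 2.0 specification. Because each request returns only a small batch, harvesting millions of records this way requires a correspondingly large number of sequential HTTP calls, which is why we consider the protocol inconvenient for bulk downloads.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint; real institutional repositories expose a
# similar URL, often ending in /oai or /oai/request.
OAI_BASE = "https://repository.example.org/oai"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(base_url, metadata_prefix="oai_dc", max_batches=3):
    """Harvest records batch by batch via the OAI-PMH ListRecords verb."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    records = []
    for _ in range(max_batches):          # capped for illustration only
        response = requests.get(base_url, params=params, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        records.extend(root.findall(".//oai:record", OAI_NS))
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break                          # last batch reached
        # Follow-up requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text}
    return records

records = harvest(OAI_BASE)
print(f"Harvested {len(records)} records in at most 3 batches")
```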

In particular, internal repositories of Swiss HEIs have not been included on the long-list. Although those repositories might be helpful to identify the full national output at a later stage of the project, from our discussions with the repository owners we concluded that it cannot be taken for granted that they fully cover the research output of the institution hosting them. Evidence from other countries confirms this conclusion: for example, a survey in Spain found that the repositories of top Spanish universities cover less than 15% of the overall scientific output produced [8]. Swiss HEIs do not yet have a shared standard for which metadata are collected in a repository, nor a shared process to ensure the quality of the metadata. Data might be partly ingested from other data sources and partly entered manually [9], and overall, synchronizing them would require a vast effort. Nevertheless, we recognize the work of the Swiss OAM Repository Monitor [10], which is starting to introduce such standardization rules [11]. We will use some of the Repository Monitor data [12] in a later stage of the project.

Reasons for rejection

We identified more than 200 datasets, most of which do not fulfill the criteria for inclusion in the short-list. The reasons for rejection are multiple and vary between datasets; they appear in Table 1 and are summarized in Table 2. In this section, we explain why these rejection criteria were adopted.

  • Limited coverage and/or primary source for datasets with wider coverage. This concerns data sources belonging to the following six categories:

    • Field specific
    • Country specific
    • Language specific
    • Preprint servers
    • Data from publishers
    • Obsolete

    They were excluded since we expect that they do not cover the whole research output relevant for Switzerland. We also noted that, if their content is freely and openly available, they (or their sources) are already being used as primary sources by secondary data sources aiming to cover a larger output.

  • Too wide coverage. This concerns data sources belonging to the following category:

    • Not focussed on research articles

    Data sources which contain other types of data, such as datasets or guidelines, besides research articles. We rejected them because we expect the effort of identifying the relevant entities to be too high compared to the added value.

  • Not easily exportable data. This concerns data sources belonging to the following category:

    • No free unlimited API, no free dump

    Data sources for which we could find neither a free and unlimited API nor a free dump will not be investigated, either because they are not open or because we expect extracting the data to require too much effort, e.g. the API is paid or has a very low rate limit.

  • No bibliographic metadata. This concerns data sources belonging to the following categories:

    • Policies, guidelines, patents
    • Not a scholarly metadata source

    Data sources which do not contain bibliographic information at all were generally rejected.

Even though the long-list identifies one reason for rejection for each data source, in reality there might be more than one reason why we did not keep a data source. On the other hand, we have included in the short-list some data sources that satisfy one of the rejection criteria, because we expect them to be useful for at least some of the next steps of the project. Among those are: data sources which are country specific with a focus on Switzerland; data sources which are language specific with a focus on Swiss national languages or English; sources of policies, guidelines and patents issued by Swiss institutions or organizations; and data sources with a strong relevance for analysing Swiss research output, e.g. funding data.

Table 2: Summary of reasons for rejection of the datasets in the long-list.
(Interactive table; columns: Reason for rejection, Number of datasets.)