Towards Open Bibliometric Indicators: Report #1

Survey of scholarly open data

Authors

Simon Willemin

Dr. Teresa Kubacka

Knowledge Management Group, ETH Library, ETH Zurich

Last updated on

August 5, 2023

Preliminary version

This is a preliminary version of the report, released for initial feedback from the community. We are looking forward to hearing from you. See Section 7 for the version history of this document.

This report is part of the project “Towards Open Bibliometric Indicators (TOBI)”, whose goal is to assess the data quality of open bibliometric datasets in a subset related to Swiss Higher Education Institutions (HEIs). The project is co-financed by swissuniversities and ETH Library.

Abstract

We have analyzed over 200 datasets containing open scholarly metadata. Based on this initial assessment, we have selected a short-list of data sources for in-depth analysis. For each of them, we have collected properties such as size, license, source datasets and publisher details in a single table. We compare and discuss our findings. The results give us a basis to decide which datasets we will use for the in-depth data quality analysis.

Introduction

Motivation for surveying open scholarly metadata sources

Researchers, analysts and developers who want to use scholarly metadata are faced with a vast choice of databases, tools and search engines: Wikipedia lists over 100 academic databases and search engines [1]. The variety of sources of scholarly metadata has motivated various efforts to systematically characterize their similarities and differences, for example a peer-reviewed comparison of the properties of 40+ datasets accessible via an API [2], or the website SearchSmart [3], which helps to choose between 90+ databases based on more than 500 criteria.

Nevertheless, as a starting point of project TOBI, we had to undertake our own evaluation of available datasets. For analytical tasks exceeding easy aggregations and calculations provided in the GUI, programmatic access to data is necessary – for example via an unlimited API or through a static copy of a dataset. This functionality is not available for many databases and search engines, e.g. for Google Scholar, which are included in various comparisons. We also have our own specific criteria for choosing the dataset for in-depth evaluation – such as availability of the attribution of a research output to a Swiss HEI, or a particular license for the data dump – which are not covered in tools like SearchSmart, which were developed for a different purpose. Lastly, to compare the metadata quality, we will need to deal with the records where the metadata may contain mistakes, so it is crucial to understand the relationships between the primary and derivative data sources. We haven’t found another survey undertaking this task.
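As an example of such programmatic access, the following minimal sketch pages through the OpenAlex works API using cursor-based paging. The filter expression, the ROR identifier and the contact email are illustrative assumptions to be adapted, and the endpoint details should be checked against the current OpenAlex documentation.

```python
# A minimal sketch of programmatic access to scholarly metadata, using the
# OpenAlex REST API as an example (cursor-based paging; endpoint and
# parameters as documented at the time of writing, to be re-checked).
import requests

BASE_URL = "https://api.openalex.org/works"

def fetch_works(filter_expr, max_pages=2, mailto="tobi@example.org"):
    """Download a few pages of work records matching a filter expression."""
    cursor = "*"  # "*" starts a new cursor-paged result set
    works = []
    for _ in range(max_pages):
        params = {"filter": filter_expr, "per-page": 200,
                  "cursor": cursor, "mailto": mailto}
        response = requests.get(BASE_URL, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        works.extend(payload["results"])
        cursor = payload["meta"].get("next_cursor")
        if not cursor:  # no further pages
            break
    return works

# Example: works attributed to a Swiss institution via its ROR ID
# (the filter name and ROR ID below are illustrative and should be verified).
sample = fetch_works("institutions.ror:https://ror.org/05a28rw58")
print(len(sample))
```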

Motivated by that, we have created a summary of the most important datasets in the context of TOBI. In this report we summarize our findings.

Similar work on data quality analysis of scholarly metadata on a micro-level

Similar efforts exist to track metadata quality at a lower level, e.g. at the country or institutional level. For Canada, publishers have been matched to the RORs of institutions [Simon van Bellen: 04.Bellen.pdf (cais2023.ca)]; for Italy, see [4].

Metrics based on open bibliometric data are currently attracting attention in Switzerland and in other countries

CWTS is working on basing its ranking on open data: “We are currently working on an ambitious project in which we explore the use of open data sources to create a fully transparent and reproducible version of the Leiden Ranking. We expect to share the outcomes of this project later this year.” (https://www.leidenmadtrics.nl/articles/the-cwts-leiden-ranking-2023)

Articles:

  • 07.2023: [2]
  • Other recent articles?

Courses:

  • CWTS Course Program, from 31st October to 3rd November: “Scientometrics Using Open Data”

Projects:

  • Project PFI (Germany, 2021–2025, not specifically about open data sources)

Events:

  • ISSI Conference “Transitioning towards Open Scientometrics with Open Science Graphs” (https://www.conftool.pro/issi2023/index.php?page=browseSessions&form_session=3)
  • Schweizer Bibliothekskongress 2023 on “Offenheit und Verantwortung” (“Openness and Responsibility”)
  • WOOC 2023: “The value of open scholarly metadata for research assessment purposes”
  • SYoS (2023–2024)
  • STI2023: OpenAlex is used as the main source for some analyses
  • Other events?
  • DORA@10
  • SNSF conferences

Long-list of data sources

Selection process

To identify potential priority data sources, we produced a list of data sources with a focus on sources satisfying most of the following criteria:

  • open access to metadata with possibility to download full dumps of the datasets and to use them in a local infrastructure,
  • relevant for Swiss HEIs research output analysis,
  • containing data related to research (research publications, funding details, organizations etc.),
  • not specific to a particular research field and aiming to cover all (or most) of the scientific output,
  • still operating and receiving updates,
  • containing more than one million searchable records.

As starting points, we used existing comparisons, such as the list of potential replacement data sources for Microsoft Academic Graph [5] and a list of academic databases and search engines from Wikipedia [1]. We also undertook our own search to identify other data sources.

Overall, we have identified more than 200 resources, which are listed in Table 1 (“the long-list”). The table contains the name of the data source, the URL of the project’s main page, and a qualitative label we used while determining whether to include the dataset in the short-list for detailed analysis (see Section 3.2 for more details).

Table 1: Long-list of datasets with their respective reason for rejection
Name URL Reason for rejection

The majority of data sources on the long-list are publication metadata datasets, although some research data repositories such as Zenodo or Figshare and other repositories listed in [6] appear in the list as well. At this point, we decided to leave out data sources focused primarily on research data because of the bibliometric focus of the project.

Data from institutional repositories

As a rule, repositories of single institutions were not taken into account. First, because of the sheer volume of research institutions hosting their own repositories. Second, because the de-duplication effort required to synchronize them would exceed our resources many times over. Third, because most of them are only accessible via an OAI-PMH feed, which is not convenient for downloading big datasets [7] (see the sketch below). Fourth, because they are already being harvested, directly or indirectly, by bigger datasets such as OpenAlex or Fatcat.
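For illustration, the following minimal sketch harvests a few batches of records from a generic OAI-PMH endpoint using plain HTTP requests; the endpoint URL is a placeholder. Because each batch must be requested sequentially via a resumption token, downloading a large repository this way is slow compared to a bulk dump.

```python
# A minimal sketch of OAI-PMH harvesting with plain HTTP requests, showing
# why the protocol is inconvenient for bulk downloads: records arrive in
# small, strictly sequential batches linked by resumption tokens.
import requests
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
ENDPOINT = "https://repository.example.org/oai"  # placeholder endpoint

def harvest(endpoint=ENDPOINT, metadata_prefix="oai_dc", max_batches=3):
    """Fetch a few batches of records; a full harvest may need thousands."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    records = []
    for _ in range(max_batches):
        response = requests.get(endpoint, params=params, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        records.extend(root.findall(".//oai:record", OAI_NS))
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break  # no more batches
        # subsequent requests must carry only the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return records
```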

In particular, internal repositories of Swiss HEIs have not been included on the long-list. Although those repositories might help to identify the full national output at a later stage of the project, from our discussions with the repository owners we concluded that it cannot be taken for granted that they fully cover the research output of the institution that hosts them. Evidence from other countries confirms this conclusion – for example, a survey in Spain found that repositories of top Spanish universities cover less than 15% of the overall scientific output produced [8]. Swiss HEIs do not yet have a shared standard for which metadata are collected in a repository, nor a shared process to ensure the quality of the metadata. Data might be partly ingested from other data sources and partly entered manually [9], and overall, synchronizing them would require a vast effort. Nevertheless, we recognize the work of the Swiss OAM Repository Monitor [10], which is starting to introduce such standardization rules [11]. We will use some of the Repository Monitor data [12] in a later stage of the project.

Reasons for the rejection

We identified more than 200 datasets, most of which do not fulfill the criteria to be included in the short-list. A data source may have been rejected for more than one reason. The reasons appear in Table 1 and are summarized in Table 2. In this section, we explain why those criteria for rejection were adopted.

  • Limited coverage and/or primary source of datasets with wider coverage. This concerns data sources belonging to the six following categories:

    • Field specific
    • Country specific
    • Language specific
    • Preprint servers
    • Data from publishers
    • Obsolete

    They were excluded since we can expect that they do not cover the whole research output relevant for Switzerland. We also noted that, if their content is freely and openly available, they (or their sources) are already being used as primary sources of secondary data sources aiming to cover a larger output.

  • Too wide coverage. This concerns data sources belonging to the following category:

    • Not focussed on research articles

    Data sources which, besides research articles, contain other types of data such as datasets or guidelines. We have rejected them because we expect the effort of identifying the relevant entities to exceed the added value.

  • Not easily exportable data. This concerns data sources belonging to the following category:

    • No free unlimited API, no free dump

    Data sources for which we could find neither a free and unlimited API nor a free dump will not be investigated, either because they are not open or because we expect extracting the data to require too much effort, e.g. when the API is paid or has a very low rate limit.

  • No bibliographic metadata. This concerns data sources belonging to the following category:

    • Policies, guidelines, patents
    • Not a scholarly metadata source

    Data sources which do not contain bibliographic information at all were generally rejected.

Even though the long-list identifies a single reason for rejection for each data source, in reality there might be more than one reason why we did not keep it. On the other hand, the short-list includes some data sources that satisfy one of the rejection criteria, because we expect them to be useful for at least some of the next steps of the project. Among those are: country-specific data sources with a focus on Switzerland; language-specific data sources focused on Swiss national languages or English; sources of policies, guidelines and patents issued by Swiss institutions or organizations; and data sources with a strong relevance when analysing the Swiss research output, e.g. funding data.

Table 2: Summary of reasons for rejection of the datasets in the long-list.
Reason for rejection Number of datasets

Figure 1: Data sources included in the long-list, grouped by their reason for rejection.

Short-list of data sources

The goal of the short-list of data sources is to enable us to choose priority data sources for the in-depth data quality analysis. For each data source on the short-list, its properties have been investigated and summarized: size (Section 4.1), license (Section 4.2), whether it is a primary or a derivative data source (Section 4.3), the category to which it belongs (Section 4.3.3), whether it is still active (Section 4.3.4), and whether it is freely available (Section 4.3.5). The results are to be found in Table 3.

Table 3: Short-list of datasets with additional metadata
Name License (name) Primary or derivative TOBI Dataset group Freely Available Active

Size

The first criterion for selecting the best candidates for priority data sources is the number of entries related to research publications. A direct comparison is challenging, because each data source has its own definition of what is included. Additionally, by design the data sources do not aim to cover the same content. For example, Crossref is restricted to publications with DOIs registered with them; Unpaywall to publications in open access; Base and OpenAlex are not restricted to research articles.

Among the analyzed data sources, the best candidates have more than 10 million entries, although their coverage varies a lot (Figure 2).

Figure 2: Comparison of sizes of data sources covering more than 10 million entries.

These largest data sources are Base, CORE, The Lens, OpenAlex, Semantic Scholar, OpenAire Graph, Crossref, BIP! Finder, Fatcat, The General Index, OpenCitations Meta, Science Open, OpenCitations COCI, Refcat, DataCite, Unpaywall and OpenCitations POCI. Although it does not appear in our short-list, we note that Microsoft Academic Graph, which stopped its activity at the end of 2021, contains about 238 million publications.
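Since the numbers in Figure 2 are snapshots, it can be useful to re-check the size of some sources programmatically. The following minimal sketch retrieves the current total number of works from Crossref and OpenAlex, assuming the publicly documented endpoints and count fields at the time of writing; the email address is a placeholder for the polite-pool contact.

```python
# A minimal sketch of how live record counts can be retrieved from two of
# the open APIs, for cross-checking the sizes reported in Figure 2.
import requests

def crossref_count(mailto="tobi@example.org"):
    """Total number of works registered in Crossref."""
    r = requests.get("https://api.crossref.org/works",
                     params={"rows": 0, "mailto": mailto}, timeout=30)
    r.raise_for_status()
    return r.json()["message"]["total-results"]

def openalex_count(mailto="tobi@example.org"):
    """Total number of works indexed in OpenAlex."""
    r = requests.get("https://api.openalex.org/works",
                     params={"mailto": mailto}, timeout=30)
    r.raise_for_status()
    return r.json()["meta"]["count"]

print("Crossref works:", crossref_count())
print("OpenAlex works:", openalex_count())
```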

Licenses

We note that the licenses of the analyzed open datasets vary, and that different copyright rules or additional guidelines may apply, especially when copyrighted abstracts have been integrated into the datasets. The type of license most often used by the datasets from the short-list is a Creative Commons license. Some datasets are under CC0 (OpenAlex, OpenCitations) or CC BY 4.0 (OpenAire Graph). Other data sources are released, at least for some of their content, under the Open Data Commons Attribution License ODC-BY (CORE; this was also the case for Microsoft Academic Graph), while still others are under their own specific license or under an unidentified license (Base, Semantic Scholar, Unpaywall).

We compare the most important licenses in Table 4. Their respective conditions are important when integrating and modifying the datasets, e.g. in the context of creating a unified database of metadata of Swiss research outputs.

Table 4: Comparison of licenses and their respective conditions.
TOBI? License Alternative terms Link Author Recommended types Publication year Link to License text Copy and publish Author can remain uncredited Commercial use Modify Change license

Interestingly, Crossref does not license its dataset, for the following reason:

Since 2000 Crossref has stated that it considers basic bibliographic metadata to be “facts.” And under US law (Crossref is registered in the US) these facts are not subject to copyright at all. Note also that, given that this data is not subject to copyright at all, there is no way Crossref can “waive the copyright” under CC0. In short, this metadata has no restrictions on reuse. [13]

More recently, some of our members have been submitting abstracts to Crossref. These are copyrighted. In the case of subscription publishers, the copyright usually belongs to the publisher. In the case of open access publishers, the copyright most often belongs to the authors. In both cases, Crossref cannot waive copyright under CC0 because the copyright is not ours to waive. However, we are allowed to redistribute the abstracts with our metadata because that is part of the terms and conditions we have with our members. [13]

Another aspect of copyright is that, although the metadata itself might not be copyrightable, a database gathering metadata may be:

The metadata for any one work likely would not be copyrightable, but a database of metadata could be [14]

Among the datasets from the short-list, we also note that the data of some of them (Retraction Watch DB, ETER) cannot be republished or redistributed.

Regarding databases, there might be significant differences between the United States and the European Union:

While the United States does not have any sui generis (or unique) protection for unoriginal databases, other countries do provide such protections. The European Union, for example, has a Database Directive, which provides for 15 years of protection for databases even if they do not reflect protectable expression. [14]

[Write something about the Berne Convention (?)]

[Difference between the copyright for the abstract/metadata, and copyright for the database as an assembly of publicly available data. If Crossref doesn’t license the data, it doesn’t mean that the DB law doesn’t apply? To be checked.]

[Permissions for derivative databases?]

Primary and derivative datasets

Categorization

Many of the bigger datasets are built on top of other datasets; for example, OpenAlex integrates Crossref, arXiv and many others. In the context of TOBI this means that some errors appearing in Crossref may be inherited by OpenAlex. To have a better overview of those relationships, we split the datasets into primary and derivative datasets.

By a primary dataset we understand a dataset in which the scholarly metadata are primarily created by collecting data from users, by web scraping, or by another manual or targeted process. For example, arXiv is categorized as a primary dataset, because the users upload the articles and their metadata themselves. A derivative dataset is a dataset which combines more than one primary dataset and builds on them. An example is OpenAlex, which combines data sources like Crossref and arXiv into a normalized database and enriches them with other metadata.

Those definitions are not sharp, as e.g. some primary datasets in our classification may have used external APIs to assist with data collection. In the attribution we tried to consider the inferred intent of the dataset authors. For example, OpenAPC contains data submitted by libraries, which combine scholarly metadata with data on the monetary value of APCs. We have classified it as a primary source, although some of its scholarly metadata might have originated from other databases like Crossref or Web of Science. A simple way to reason about how errors propagate through this classification is sketched below.
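The following minimal sketch encodes a small, illustrative subset of the relations from Figure 3 as a Python dictionary and computes, for a given derivative dataset, all datasets from which record errors could be inherited. The dataset names and relations shown are only examples, not the complete mapping.

```python
# A minimal sketch of how the primary/derivative classification can be used
# to reason about error propagation: an error in a primary source may be
# inherited by every derivative source that (transitively) ingests it.
SOURCES = {
    "OpenAlex": ["Crossref", "arXiv", "Unpaywall"],
    "Unpaywall": ["Crossref"],
    "Crossref": [],   # primary
    "arXiv": [],      # primary
}

def transitive_sources(dataset, relations=SOURCES):
    """Return all datasets whose errors could propagate into `dataset`."""
    seen = set()
    stack = list(relations.get(dataset, []))
    while stack:
        src = stack.pop()
        if src not in seen:
            seen.add(src)
            stack.extend(relations.get(src, []))
    return seen

print(transitive_sources("OpenAlex"))  # {'Crossref', 'arXiv', 'Unpaywall'}
```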

Identification of primary sources of derivative data sources

We traced which primary sources are used by derivative sources mostly through their online documentation and publications, where available.

Except for Crossref and DataCite, all datasets on our short-list with more than 10 million entries are classified as derivative datasets. Identifying their primary sources has not always been an easy task. Here are some examples of difficulties encountered when trying to determine which are the most important primary sources, or whether a data source is used as a primary source:

  • The Lens gives a list of “Data partners” as well as a list of “Data sources and collaborators”, which do not completely overlap, which might not all be primary sources for the data, and which might not represent all the primary sources used.

  • In OpenAlex, the sources are part of the dataset. One could expect this list to be complete, but on the presentation page we find a short list of “key sources” which includes Microsoft Academic Graph and Crossref, neither of which appears among the listed sources.

  • Base lists more than 10’000 sources. Although Crossref is used, it does not appear as a single source, but indirectly in parentheses indicating “via Crossref” next to the name of some publishers.

  • Semantic Scholar provides a list of “publisher partners” without explicitly indicating whether or how they contribute to the constitution of the data source. The preprint mentions over 50 primary sources, without naming them [15].

  • The General Index does not provide information on the sources used, but since it includes the content of paywalled papers, sources other than open bibliometric data sources must have been used to generate the whole index.

Figure 3: Illustration of which derivative data sources integrate which primary data sources.

The relations between the datasets are summarized in Figure 3. We note the following:

  • There is no primary source which is used by all the selected derivative data sources from the short-list.

  • Most of the derivative data sources use Crossref as a primary source. Other frequently used primary sources are PubMed, HAL, arXiv and DataCite.

  • It appears that Base uses far fewer sources than The Lens (although it contains more entries – see Figure 2). This might be due to the fact that Base mostly aggregates data from original sources, whereas The Lens also uses already aggregated data.

  • The main OpenCitations datasets are each derived from a single source: Crossref for COCI and PubMed for POCI. This follows from the definition of those projects.

  • Refcat is also built on a single source, Fatcat, which is not a pure primary source. Those data sources are special cases because they are citation indices.

  • Some of the derivative data sources also integrate other derivative data sources. For example, Unpaywall is a source for both OpenAire Graph and OpenAlex.

We also note that, at this point, we have not identified any circularity, that is, two derivative data sources that use each other as a primary data source. However, circularity might still occur, for example in cases where we could not identify the primary sources or for datasets outside of the list, or it may arise in the future.
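If the source relations are encoded as a directed graph, such circularity can be checked automatically. The minimal sketch below uses the networkx library on a small, illustrative edge list (not the complete set of relations from Figure 3); it reports whether the graph is acyclic and, if not, prints one cycle.

```python
# A minimal sketch of detecting circular dependencies between data sources
# once their relations are encoded as a directed graph (edges point from a
# derivative dataset to the datasets it ingests). The edge list is illustrative.
import networkx as nx

edges = [
    ("OpenAlex", "Crossref"),
    ("OpenAlex", "Unpaywall"),
    ("Unpaywall", "Crossref"),
    ("OpenAire Graph", "Unpaywall"),
    ("OpenAire Graph", "Crossref"),
]

graph = nx.DiGraph(edges)

if nx.is_directed_acyclic_graph(graph):
    print("No circular dependencies between the listed data sources.")
else:
    print("Circular dependency found:", nx.find_cycle(graph))
```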

Categories of data sources

We further classify the datasets on the short-list into the following categories:

  • datasets containing only classical metadata, that is bibliographic metadata of publications or single research outputs (type 1),
  • datasets with altmetric data (type 2),
  • datasets with funding-relevant data (type 3),
  • datasets helping with disambiguation of institutions and persons (type 4),
  • other datasets, such as educational datasets or citation indices (type 5).

The categorization is not always clear-cut: OpenCitations, for example, treats citations as entries and provides non-classical metadata (each citation is associated with dates). Although some datasets like ETER might be associated with more than one type, we chose to assign each dataset to a single type, selected according to the most likely way we might use it in the next steps of the project.

Figure 4 summarizes the counts per category. Since our selection process focussed on datasets of the first type, this type is well represented in the short-list. The other groups have far fewer representatives, except for the last type.

Figure 4: Count of datasets per category for the datasets included on the short-list.
Type 1: classical metadata

Almost all the datasets with more than 10 million bibliographic entries belong to the first category. In the same category appear smaller datasets such as SciELO.

Type 2: altmetric data

The short-list contains datasets such as: Wikidata, Retraction Watch, Paperbuzz, PubPeer, BIP! Finder.

Open datasets containing altmetric data are rare. Although the altmetrics manifesto [16] dates back more than ten years, we have discovered that most tools that were proposed no longer work [17], and bibliographic data sources still focus on classical bibliographic metadata, although there are exceptions such as OpenCitations. It seems that the most important tools for altmetrics are either no longer open (Publons) or never were [18]. Although OurResearch (formerly Impactstory) continues to develop open tools, the project of creating an altmetric data commons, an open-source altmetrics web app and an open altmetric data platform, described in a 2014 article [19], might not have had the highest priority. Nevertheless, OurResearch has developed tools such as Paperbuzz (active since 2018), which should enable us to compute altmetric indicators using open data sources.

We can also speculate that, in the future, more altmetric data will be integrated into the biggest datasets, similarly to how Dimensions indexes the Altmetric score for each publication when possible. OpenAlex has been progressively integrating non-standard metadata alongside classical scholarly metadata, such as APC prices.

Type 3: funding-relevant data

The short-list contains three datasets in this category: SNSF Data Portal, Cordis EU, Open APC.

Type 4: datasets helping with disambiguation of institutions and persons

The short-list contains four datasets: GRID, ORCID, ROR and ETER. Since GRID is no longer maintained, we expect to use ORCID (for persons) and ROR (for institutions). We also note that ORCID has recently announced that it is switching to ROR for institutional disambiguation, which will help in building systems integrating those two data sources.
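As an illustration of how ROR can support institutional disambiguation, the following minimal sketch queries the ROR API’s affiliation-matching endpoint for a free-text affiliation string and returns the record that ROR itself marks as the chosen match. The endpoint, parameters and response fields are assumptions based on the public ROR documentation at the time of writing and should be re-verified.

```python
# A minimal sketch of institution disambiguation via the ROR API's
# affiliation-matching endpoint (fields assumed from the public docs).
import requests

def match_affiliation(affiliation_string):
    """Return (ROR ID, name) of the match ROR marks as 'chosen', if any."""
    r = requests.get("https://api.ror.org/organizations",
                     params={"affiliation": affiliation_string}, timeout=30)
    r.raise_for_status()
    for item in r.json().get("items", []):
        if item.get("chosen"):
            org = item["organization"]
            return org["id"], org["name"]
    return None

print(match_affiliation("ETH Zürich, Zürich, Switzerland"))
```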

Type 5: other datasets

This category contains various datasets which do not fit into the previous categories. For example, some of them provide indicators that can be used as monitoring tools (COKI), others provide data for constructing knowledge graphs (Data Set Knowledge Graph, ORKG), some contain metadata at the level of journals instead of articles (DOAJ, Index Copernicus etc.), and some are mostly data repositories (Zenodo, figshare).

Active and expired data sources

The most prominent discontinued open dataset is Microsoft Academic Graph (MAG). Its main successor is OpenAlex, which received a grant during the period in which Microsoft announced MAG would be shut down. MAG is not the only resource that is no longer active. Other examples are Publons, which ceased its activity after being bought, and two OpenCitations indexes, for which better alternatives exist. The AMiner dataset seems to be stale and not accessible, and has therefore been left out.

Availability and pricing

One of the most important criteria for the selection of priority datasets is that a full dump of the data is freely available (or at most, that the download costs need to be paid). This is the case for most of the sources that we considered, although there might be some restrictions or limitations, such as the need to get a token or an account. In some cases, free access is only granted for non-profit personal use or non-profit institutional use (The Lens).

Summary

We have performed a survey of the most prominent sources for open scholarly metadata, as well as data sources which contain supplementary data such as information about the funding or affiliations. We have analyzed the suitability of the datasets for the project. The long-list of data sources is to be found in Table 1.

We have identified a short-list of the most promising datasets. We have collected the properties of those datasets in Table 3.

Additionally, we performed an analysis of the most important relationships between the data sources in an effort to track which dataset is based on which sources (Figure 3).

Based on the results of the short-list analysis, we will pick the most suitable datasets for the in-depth data quality analysis.

Appendix

Helper data

Collection of more than 200 data sources (WP1_collection.xlsx)

Name URL Reason for rejection

Short list of 49 data sources (WP1_shortlist.xlsx)

Name License (name) Primary or derivative TOBI Dataset group Freely Available Active
  • Registry of Swiss HEIs in ROR and OpenAlex [WP1_swissHEIs.xlsx or just link to NOAM sheet] [We might also give the following notebook in the appendix, to enrich the Excel file with OpenAlex institution ID: https://gitlab.ethz.ch/kom/projects/tobi/-/blob/main/CH_names/CH_institutions_enrichment.ipynb]

Open Metadata Tracker

In addition to the metadata analysis, we have developed an app to track open scholarly metadata: TOBI Open Metadata Tracker. The app allows the user to compare citation, reference and author counts for a set of DOIs in prominent open bibliometric datasets. It enables a self-service quality check of the scholarly metadata on a micro-level of an individual researcher.
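For illustration, the following minimal sketch performs the same kind of per-DOI comparison for two of the sources, Crossref and OpenAlex. The field names are taken from their public API documentation as we understand it at the time of writing; the Tracker itself may rely on different sources or fields.

```python
# A minimal sketch of comparing citation, reference and author counts for
# one DOI across Crossref and OpenAlex (example DOI taken from reference [7]).
import requests

def crossref_counts(doi, mailto="tobi@example.org"):
    r = requests.get(f"https://api.crossref.org/works/{doi}",
                     params={"mailto": mailto}, timeout=30)
    r.raise_for_status()
    msg = r.json()["message"]
    return {"citations": msg.get("is-referenced-by-count"),
            "references": msg.get("reference-count"),
            "authors": len(msg.get("author", []))}

def openalex_counts(doi, mailto="tobi@example.org"):
    r = requests.get(f"https://api.openalex.org/works/https://doi.org/{doi}",
                     params={"mailto": mailto}, timeout=30)
    r.raise_for_status()
    work = r.json()
    return {"citations": work.get("cited_by_count"),
            "references": len(work.get("referenced_works", [])),
            "authors": len(work.get("authorships", []))}

doi = "10.1038/s41597-023-02208-w"
print("Crossref:", crossref_counts(doi))
print("OpenAlex:", openalex_counts(doi))
```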

Version history

05.08.2023: first draft has been published for community feedback.

References

[1]
Wikipedia, “List of academic databases and search engines,” Wikipedia. Jul. 05, 2023. Accessed: Jul. 07, 2023. [Online]. Available: https://en.wikipedia.org/w/index.php?title=List_of_academic_databases_and_search_engines
[2]
A. Velez-Estevez, I. J. Perez, P. García-Sánchez, J. A. Moral-Munoz, and M. J. Cobo, “New trends in bibliometric APIs: A comparative analysis,” Information Processing & Management, vol. 60, no. 4, p. 103385, Jul. 2023, doi: 10.1016/j.ipm.2023.103385.
[3]
M. Gusenbauer, “A free online guide to researchers’ best search options,” Nature, vol. 615, no. 7953, pp. 586–586, Mar. 2023, doi: 10.1038/d41586-023-00845-0.
[4]
F. Bologna, A. Di Iorio, S. Peroni, and F. Poggi, “Open bibliographic data and the Italian national scientific qualification: Measuring coverage of academic fields.” arXiv, May 13, 2022. Accessed: Aug. 02, 2023. [Online]. Available: http://arxiv.org/abs/2110.02111
[5]
M. Weishuhn, “Testing replacements for Microsoft Academic Graph,” The Inciteful blog, Oct. 11, 2021. https://blog.inciteful.xyz/posts/testing-mag-replacements/ (accessed Jun. 15, 2023).
[6]
GFZ German Research Centre For Geosciences et al., “Registry of research data repositories,” 2013, doi: 10.17616/R3D.
[7]
P. Knoth et al., “CORE: A global aggregation service for open access papers,” Sci Data, vol. 10, no. 1, p. 366, Jun. 2023, doi: 10.1038/s41597-023-02208-w.
[8]
Á. Borrego, “Institutional repositories versus ResearchGate: The depositing habits of Spanish researchers,” Learned Publishing, vol. 30, no. 3, pp. 185–192, Jul. 2017, doi: 10.1002/leap.1099.
[9]
ETH-Bibliothek, “How to publish – Research Collection documentation.” https://documentation.library.ethz.ch/display/RC/How+to+publish (accessed Aug. 05, 2023).
[10]
Swiss Open Access Monitor, “Repository Monitor – Swiss Open Access Monitor.” https://oamonitor.ch/charts-data/repository-monitor/ (accessed Aug. 05, 2023).
[11]
Swiss Open Access Monitor, “Data sources and methodology – Swiss Open Access Monitor.” https://oamonitor.ch/wiki/methodology/ (accessed Aug. 05, 2023).
[12]
Swiss Open Access Monitor, “Participating institutions – Swiss Open Access Monitor.” https://oamonitor.ch/wiki/participating-institutions-repo-monitor/ (accessed Aug. 05, 2023).
[13]
Rosa-Clark, “REST API metadata license information,” Crossref. https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/ (accessed Jul. 07, 2023).
[14]
K. L. Cox, “Metadata and copyright: Should institutions license their data about scholarship?” 2017.
[15]
R. Kinney et al., “The semantic scholar open data platform.” arXiv, Jan. 24, 2023. Accessed: Aug. 05, 2023. [Online]. Available: http://arxiv.org/abs/2301.10140
[16]
“Altmetrics: A manifesto – altmetrics.org.” http://altmetrics.org/manifesto/ (accessed Jul. 07, 2023).
[17]
“Tools – altmetrics.org.” https://altmetrics.org/tools/ (accessed Jul. 07, 2023).
[18]
“Discover the attention surrounding your research. Altmetric.” https://www.altmetric.com/ (accessed Jul. 07, 2023).
[19]
S. Konkiel, H. Piwowar, and J. Priem, “The imperative for open altmetrics,” Journal of Electronic Publishing, vol. 17, no. 3, Sep. 2014, doi: 10.3998/3336451.0017.301.