Knowledge Graph Infrastructure

The Knowledge Graph (KG) infrastructure aims to link social science research data and resources across GESIS and to make them interoperable and findable on the Web. It is built on a Social Science Knowledge Graph that links the GESIS data collections to one another and to established vocabularies, social science data sources and established knowledge bases on the Web, such as Wikidata.

The KG will also be enriched with extracted entities, such as variables, and with links, for example between publications and research data. Both survey data and digital behavioral data are taken into account. The rich information in the Social Science Knowledge Graph will be integrated into GESIS services such as the GESIS-wide search to support users, for example when they search for research data.
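To make the idea of linked entities concrete, the following is a minimal sketch, not the GESIS implementation, of how links between publications, research data and variables can be held as subject-predicate-object triples and queried. All identifiers (`pub:1`, `data:study-A`, `var:age`), the predicate names and the helper functions `match` and `publications_citing` are hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical identifiers and predicates; none are taken from the actual graph.
TRIPLES: List[Triple] = [
    ("pub:1", "cites_dataset", "data:study-A"),
    ("pub:2", "cites_dataset", "data:study-A"),
    ("pub:2", "mentions_variable", "var:age"),
    ("var:age", "part_of", "data:study-A"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return all triples that agree with the given fields; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def publications_citing(dataset: str) -> List[str]:
    """List the subjects of all 'cites_dataset' triples pointing at the dataset."""
    return sorted(t[0] for t in match(TRIPLES, p="cites_dataset", o=dataset))


if __name__ == "__main__":
    print(publications_citing("data:study-A"))
```

A search service could use a lookup of this kind to show, next to a research dataset, the publications that cite it. A production knowledge graph would typically use RDF and a query language such as SPARQL rather than an in-memory list.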

Building on this, the infrastructure will provide and link additional knowledge graphs that hold data, entities and their relationships relevant to social science research topics. One example is ClaimsKG, a graph of annotated claims extracted from fact-checking websites.

For the development of the knowledge graph infrastructure in general and the social science knowledge graph in particular, methods of information extraction, entity interlinking, coreference resolution and data fusion are being investigated and applied.
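As a rough illustration of entity interlinking and duplicate detection, the sketch below groups records from different sources whose normalized title and year coincide. It is a deliberately simple exact-key approach under stated assumptions, not the method used in the infrastructure. The record fields (`id`, `title`, `year`), the source names and the functions `normalize` and `interlink` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    return " ".join(cleaned.split())


def interlink(records: List[dict]) -> Dict[str, List[str]]:
    """Group record ids whose normalized (title, year) key is identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        key = f"{normalize(rec['title'])}|{rec['year']}"
        groups[key].append(rec["id"])
    # Keep only keys shared by more than one record: these are candidate links.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records from two made-up sources.
RECORDS = [
    {"id": "sourceA:1", "title": "Survey Méthods: An Overview", "year": 2019},
    {"id": "sourceB:7", "title": "survey methods - an overview", "year": 2019},
    {"id": "sourceB:8", "title": "Another Study", "year": 2020},
]

if __name__ == "__main__":
    print(interlink(RECORDS))
```

Exact matching on a normalized key is cheap but misses near-duplicates such as abbreviated titles or differing name forms. Handling those cases would need fuzzy similarity measures or learned models.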

  • ClaimsKG: ClaimsKG is a knowledge graph that contains claims and their evaluations from fact-checking websites and links relevant entities to DBpedia concepts. The KG currently holds 28,383 claims from 6 English-language websites.
  • EXCITE: In the project EXCITE - Extraction of Citations from PDF Documents, procedures were developed to extract and structure literature citations from scientific publications. The extracted references (over 1 million) were delivered to the Open Citations Corpus (OCC). Of these, over 300,000 links to publications in GESIS data collections were identified, which will be integrated into the Social Science Knowledge Graph.
  • GESIS Research Graph: In the GESIS Research Graph project, a prototype graph has been developed that links publications, research data, projects and people. The GESIS Research Graph is based on the Knowledge Graph infrastructure and contains over 110,000 publications, over 6,200 research records, and over 53,000 research projects.
  • GESIS-wide search: The Knowledge Graph infrastructure is integrated into the backend of the GESIS-wide search and thus provides users with structured information on linked research data, publications, etc.
  • InFoLiS: In the project InFoLiS - Integration of Research Data and Literature, a method has been investigated and developed for detecting citations of research datasets in scientific publications. The resulting links between publications and research data are integrated into the Social Science Knowledge Graph.
  • MOVING: In the project MOVING, methods were investigated and developed to disambiguate authors. The methods are used to disambiguate person names from various data sources in the Knowledge Graph infrastructure, as well as to identify and resolve duplicates in the records.
  • OpenMinTeD: In the OpenMinTeD project, methods have been investigated and developed to identify mentions of variables in scientific publications. The 415 links generated between publications and variables will be integrated into the Social Science Knowledge Graph.
  • Question Feature Sample: A sample knowledge graph of GESIS survey questions annotated with question features, specifically the information type.
  • SoMeSci: SoMeSci is the most comprehensive gold standard corpus, exposed as an open knowledge graph, about software mentions in scientific articles. It provides training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. The data consists of 4,397,422 triples describing the metadata and context of 3,756 mentions in 1,367 articles.
  • SoftwareKG: SoftwareKG is a knowledge graph that contains information about software mention statements from more than 51,000 scientific articles from the social sciences. It enables analysis of the provenance of research results, the attribution of developers, and software citation in general. In addition, it records whether and how the software and its source code are available, which allows a general assessment of the state and role of open source software in science.
  • SoRa: In the project SoRa - Social Spatial Research Data Infrastructure, a knowledge graph is under development that describes social science survey data at study, variable and question level. So far, the graph represents two complementary datasets from different institutes and will be extended with links to spatial data.
  • TheSoz: The Thesaurus for the Social Sciences (TheSoz) is a controlled vocabulary which contains about 8,000 concepts (recommended keywords) from the Social Sciences. Topics from all social science disciplines are included.
  • TweetsCOV19: TweetsCOV19 is a semantically annotated corpus of tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims to capture online discourse about various aspects of the pandemic and its societal impact. The dataset consists of 20,112,480 tweets posted by 7,384,417 users and reflects the societal discourse about COVID-19 on Twitter from October 2019 to December 2020.
  • TweetsKB: TweetsKB is a knowledge graph hosted at GESIS that includes metadata about 1.5 billion tweets (Feb. 2013 - Mar. 2018) and serves as a resource for social science research. Using information extraction methods, sentiments, entities, hashtags, and user mentions were extracted and published as linked data through a structured RDF schema.

Publications

  • Schindler, David, Benjamin Zapilko, and Frank Krüger. 2020. "Investigating software usage in the social sciences: A knowledge graph approach." In The semantic web: 17th international conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, proceedings, edited by Andreas Harth, Sabrina Kirrane, and Axel-Cyrille Ngonga Ngomo, Lecture notes in computer science 12123, 271-286. Springer Cham. https://doi.org/10.1007/978-3-030-49461-2_16.
  • Tchechmedjiev, Andon, Pavlos Fafalios, Katarina Boland, Malo Gasquet, Matthäus Zloch, Benjamin Zapilko, Stefan Dietze, and Konstantin Todorov. 2019. "ClaimsKG: A Knowledge Graph of Fact-Checked Claims." In The Semantic Web – ISWC 2019. ISWC 2019, edited by Chiara Ghidini, Olaf Hartig, and Maria Maleshkova, Lecture Notes in Computer Science 11779, 309-324. Cham: Springer. https://doi.org/10.1007/978-3-030-30796-7_20.
  • Heling, Lars, Felix Bensmann, Benjamin Zapilko, Maribel Acosta, and York Sure-Vetter. 2019. "Building Knowledge Graphs from Survey Data: A Use Case in the Social Sciences (Extended Version)." In The Semantic Web: ESWC 2019 Satellite Events. ESWC 2019 Satellite Events, Portorož, Slovenia, June 2–6, 2019, Revised Selected Papers, edited by Pascal Hitzler, Sabrina Kirrane, and Olaf Hartig, Lecture Notes in Computer Science 11762, 285-299. Cham: Springer. https://doi.org/10.1007/978-3-030-32327-1_48.
  • Hienert, Daniel, Dagmar Kern, Katarina Boland, Benjamin Zapilko, and Peter Mutschke. 2019. "A digital library for research data and related information in the social sciences." In Proceedings of 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 148-157. Piscataway, NJ: IEEE. https://doi.org/10.1109/JCDL.2019.00030.
  • Zapilko, Benjamin, Katarina Boland, and Dagmar Kern. 2018. "A LOD backend infrastructure for scientific search portals." In The Semantic Web. 15th Extended Semantic Web Conference (ESWC) - Proceedings, 729-744. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-93417-4_47.
