Our Digital Behavioral Datasets
German Federal Elections
Topical Collection
Source: Twitter, Facebook
These datasets present results from the social media monitoring of Facebook and Twitter during the German federal election campaigns of 2013, 2017, and 2021. The project collects the tweets and Facebook posts of political candidates and organizations, along with user engagement with this content.
2013 Data | 2017 Data | 2013 Report (637 kB) | 2017 Report | 2021 Data | 2021 Report | Paper | MTE Talk
TweetsCOV19
Longitudinal Crawl
Source: Twitter
A semantically annotated corpus of tweets related to the COVID-19 pandemic, capturing online discourse about various aspects of the pandemic and its societal impact from October 2019 onwards.
The dataset contains precomputed entity and sentiment annotations and extracted tweet metadata. The data are publicly available.
Description | Report | Data
'Call me sexist but' (CMSB)
Topical Collection, Training Data
Source: Twitter, Crowdsourced
The 'Call me sexist but' dataset (CMSB) is part of our work to analyze different dimensions of sexism in social media, including overt hostile sexism, 'benevolent' sexism, and more subtle forms that pose a particular challenge for automatic detection techniques. With this dataset we aim to improve methods for tasks such as addressing sexism on online platforms.
Just Another Day on Twitter
Platform Data, Baseline Data
Source: Twitter
The dataset "Just another day on Twitter" presents a complete dump of one day on Twitter (September 20-21, 2022), generated by a globally coordinated effort of 80 scholars. Although chosen as "just another day", the 24 hours covered fall within a turbulent period, with Elon Musk about to acquire the platform.
VACOS-NLQ
Topical Data, Natural Language Data
Source: Crowdsourced
The natural language query dataset (VACOS-NLQ) is a collection of 3540 written queries for e-commerce product search (laptops and jackets). The queries are enriched with information about age, gender, and domain knowledge of the participants.
Data | Source code on GitLab | Paper
Incels Forum Data
Topical Data, Platform Data
The dataset consists of all publicly visible posts, and the data accompanying each post, from the online forum incels.is during one week in November 2022. The data allow researchers to investigate questions surrounding involuntary celibates, hate speech, communication in online forums, the emergence of acts of terrorism, and suicide prevention.
SocioPatterns
Social Network Data
RFID Sensor Data, Sociodemographic Data
The SocioPatterns infrastructure enables the collection of data on face-to-face interactions in social contexts via wearable sensors. GESIS has used it to obtain contact data at academic conferences and in other social settings. Data access is restricted to protect participants' anonymity. However, limited sets of conference contact data and sociodemographic metadata have been made available.
Contact Data | Meta Data | Data Paper | Paper | Book Chapter | Paper | SocioPatterns
Invasion@Ukraine
Topical Collection
Source: Twitter
A dataset of raw tweets collected via the Twitter Streaming API in the context of the onset of the war. In total, we gathered 8.7 million original tweets between February 17 and March 3, 2022, produced by 2.3 million individual user accounts.
In addition, the data have been annotated with availability tags obtained through rehydration. These tags may provide information on Twitter moderation policies.
TweetsKB
Longitudinal Crawl
Source: Twitter
TweetsKB is a public corpus of semantically annotated tweets based on a permanent Twitter crawl. The dataset currently contains data for more than 2.0 billion tweets, spanning from February 2013 to the present.
Metadata about the tweets, extracted entities, sentiments, hashtags, and user mentions are shared as a public knowledge base.
Description | Data | Paper
ClaimsKG
Claims, Truth Ratings
Source: Fact-checking Websites
ClaimsKG is a structured database that serves as a registry of claims. It provides an entry point for researchers to discover claims and the entities involved, and it links to the corresponding fact-checking sites. The database is built on a knowledge graph that provides data about claims, their metadata (e.g., the publishing site), involved entities (annotated with NLP techniques), and some normalized truth ratings.
Data | Paper | Description
SciTweets
Topical Collection, Training Data
Source: Twitter, Crowdsourced
SciTweets is an expert-annotated dataset of 1261 tweets from the following three categories of science-relatedness: (1) Scientific knowledge (scientifically verifiable claims), (2) Reference to scientific knowledge, and (3) Related to scientific research in general. Further, the annotations include the annotators' confidence scores as well as labels for compound claims and ironic tweets.
Historical Narratives
Topical Collection
Source: Wikipedia
These data allow mining timelines and detecting temporal focal points of written history across languages on Wikipedia. Articles related to the history of all UN member states were extracted and compared across 30 language editions. Our computational approach makes it possible to identify historical focal points quantitatively and to carve out different cultures of historiography.
Politicians in Wikipedia
Topical Collection
Source: Wikipedia, DBpedia
The dataset contains information about international politicians from DBpedia, including name, gender, nationality, and for many also their political party affiliation. The dataset is based on the English DBpedia dump from October 2015.
The data was used to create an interactive visualization of politicians' networks.
TokTrack Wikipedia
Platform Data
Source: Wikipedia
This dataset contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. We also offer "WikiWho" – a service tool for tracking collaborative knowledge production on Wikipedia.
Data | WikiWhoTool | WikiWhoTutorial | Report
Edit Histories of Wikipedia References
Scientometric Data
Source: Wikipedia
This dataset includes the historical versions of all individual references per article in the English Wikipedia until June 2019. Each reference object also contains information about its original creating editor, editors implementing changes to it, and timestamps of all actions. The dataset contains 55,503,998 references with 164,530,374 actions.
More on DBD?
Find research tools and guidelines for analyzing digital behavioral data; videos, tutorials, and other training materials for CSS capacity building; and upcoming courses from GESIS Training. Visit our digital behavioral data overview page for news and upcoming services.