The Data Knowledge Hub for Monitoring Online Discourse

Editorial – What, why, and how does it help you?

Cathleen Berger
Upgrade Democracy | Bertelsmann Stiftung
Charlotte Freihse
Upgrade Democracy | Bertelsmann Stiftung

The Data Knowledge Hub for Monitoring Online Discourse (Data Knowledge Hub) is an initiative that aims to provide a central resource for researchers, social and data scientists, journalists, policy makers, and other practitioners interested in monitoring social media and online discourse more broadly.

Why do we feel this is necessary?

Online discourse has changed how we inform ourselves, what and whom we trust, and how we access information in the first place. On online platforms and social media in particular, recommender systems and other design features can be gamed to fuel disinformation, hate speech, and outrage. In addition, messaging services and alternative platforms are increasingly at risk of exploitation and provide agitators with vast audiences for spreading falsehoods. But how and why exactly this is happening remains under-researched and is illustrated only anecdotally. If we want to strengthen our information ecosystem and increase each other’s ability to decide what is trustworthy and what is not, we need to move away from anecdotes towards broad, continuous, and ideally real-time data-driven insight.

The challenge

Due to the increasing number of social media and other digital platforms as well as the huge amounts of data to analyse, it is critical to enable and empower more researchers, social as well as data scientists, on two fronts:

  1. to conduct monitoring of social media and online discourse on a technical level, and
  2. to assess the data in its socio-political context.

There are already renowned, well-established organisations that do incredible work on social media monitoring, including CeMAS, Democracy Reporting International, the SPARTA Project of the Bundeswehr University Munich, and the Institute for Strategic Dialogue. Yet even these established players face several challenges, including:

  • the multitude of digital platforms;
  • the sheer amount of data and necessary server capacities;
  • fast-developing and constantly changing narratives;
  • new and changing actors and agitators.

Building a foundation for solving these challenges

To reduce the obstacles and lower the threshold to monitoring online discourse, we are launching this Data Knowledge Hub. Hosted open source and under a Creative Commons license on GitHub, it continuously welcomes contributions of new data, code, and written content, fostering a collaborative environment for all. Cooperation and collaboration on development, design, content, and scope among established actors is key to turning this Data Knowledge Hub into a useful tool and an enabler for future research.

For the first publication in September 2023, we gathered initial contributions on the legal basis and ethical standards, good practices and exemplary research for web scraping, data collection on Twitter and TikTok, as well as code samples for monitoring various platforms. The Data Knowledge Hub will be continuously updated and reviewed. With the help of community and crowdsourced contributions, we hope to include a broad range of samples and organic input and, over time, to provide all relevant information for monitoring and understanding the dynamics of online discourse.

You can help and contribute, too

We welcome additional contributions on a rolling basis. Right now, we would be particularly interested in including and discussing chapters on:

  • Social media usage: users worldwide, number of posts/messages, regional differences etc.
  • Data access and ethics:
    • How to deal with dark socials?
    • Data access rights beyond the European Union and the U.S.
  • Data collection: sock puppets, snowball sampling, and other innovative approaches
  • Examples of data collection: Facebook, Instagram, YouTube, Fediverse and others
  • Data analysis:
    • Topic modelling
    • Infrastructure as code
  • Additional aspects that benefit from monitoring as a research method
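To give a sense of what a contribution on snowball sampling might cover, here is a minimal sketch of the idea: start from a few seed accounts and follow their connections for a fixed number of waves. The function name `snowball_sample`, the toy `follows` dictionary, and the account names are hypothetical and only for illustration; a real project would replace the dictionary lookup with platform-specific data collection, subject to the legal and ethical considerations discussed in the Hub.

```python
from collections import deque


def snowball_sample(neighbours, seeds, max_waves):
    """Return {account: wave} for all accounts reached within max_waves.

    neighbours: dict mapping an account to the accounts it links to
                (toy stand-in for real platform data, an assumption here).
    seeds:      accounts that form wave 0.
    max_waves:  how many referral steps to follow from the seeds.
    """
    sampled = {}
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        account, wave = queue.popleft()
        if account in sampled:
            continue  # already reached in an earlier or equal wave
        sampled[account] = wave
        if wave < max_waves:
            for other in neighbours.get(account, []):
                if other not in sampled:
                    queue.append((other, wave + 1))
    return sampled


# Hypothetical follower relations between six accounts.
follows = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}

print(snowball_sample(follows, seeds=["a"], max_waves=2))
```

Limiting the number of waves keeps the sample bounded, at the cost of missing accounts that are only reachable through longer chains.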

Living Document - How to navigate the Knowledge Hub

Johannes Müller
&effect data solutions GmbH

The Data Knowledge Hub is hosted in a GitHub repository. For better usability, we use a documentation framework that lets users switch to a static website for easier reading and access the content as a digital book. This means that all text content is created using Markdown. Code projects are included as a single file (e.g. a Jupyter Notebook) or in folders that can be pulled from GitHub. We intend to continuously update content and invite contributions on additional aspects of monitoring social media and online discourse. A first version was published in September 2023. Chapters that are already in the pipeline are marked as “forthcoming”, and a list of invited contributions can be found in the “editorial”.

All contributors are listed here as well as named in their respective chapters.

🔜 Code Projects (coming soon)


The GitHub Page will be made public with the upcoming release.

Here is a table with all projects that are currently included in the Data Knowledge Hub. Click on the link to go to the project page.

| Project | Description | Language | Platform | Link |
| --- | --- | --- | --- | --- |
| tiktok-scraping | Collect data on TikTok using puppeteer | JavaScript | TikTok | Code |
| tiktok-hashtag-analysis | Analyse TikTok Hashtags | Python | TikTok | Code |
| blog-webscraping | Webscraping using rvest and selenium | R | Blogs | Code |
| twitter-streaming | Large-scale data collection on Twitter | Python | Twitter / X | Code |
| twitter-database | TBD | Python | Twitter / X | Code |
| twitter-social-network | Social Network Analysis with R | R | Twitter / X | Code |

Design Principles

The editorial team has adopted four guiding principles for content on the Data Knowledge Hub:

  • From general to specific: Cater to different target groups by starting each chapter with a general and easy-to-follow introduction. More specific topics such as content on use cases, projects, or code examples will be added throughout the project. We use three labels to indicate difficulty that will help users to orientate themselves: no code, beginners, advanced.
  • Rich links: Enable non-linear interaction with internal and external links, highlighting diverse initiatives, projects, or code libraries.
  • Reproducibility: For code examples, we focus on Python and R due to their widespread use in data science; use cases in other languages such as JavaScript, Julia, or Rust are also welcome. All code should be reproducible.
  • Open Source: Content and code will be accessible on GitHub under a CC BY License.
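As one possible reading of the reproducibility principle, the sketch below shows two habits that help others rerun a code example: drawing random samples from a seeded generator and recording the interpreter environment alongside the results. The function names `reproducible_sample` and `environment_record` and the default seed are assumptions made for this illustration, not conventions prescribed by the Data Knowledge Hub.

```python
import json
import platform
import random
import sys


def reproducible_sample(items, k, seed=42):
    """Draw k items with a local, seeded RNG so reruns give the same sample."""
    rng = random.Random(seed)  # local generator, global random state untouched
    return rng.sample(list(items), k)


def environment_record():
    """Collect basic interpreter details a reader needs to rerun the analysis."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
    }


# Placeholder post identifiers standing in for collected data.
posts = [f"post-{i}" for i in range(100)]

print(reproducible_sample(posts, 5))
print(json.dumps(environment_record(), indent=2))
```

Using a local `random.Random` instance rather than the module-level functions keeps the sample stable even when other parts of a notebook also draw random numbers.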

Structure of the Knowledge Hub

The content is structured around three main areas:

  • Ethical, Social, and Legal Context: Overview of key issues and challenges, including regulatory bases, privacy, bias, and transparency.
  • Data Collection: Summary of data collection methods and tools, along with challenges, limitations, and potential.
  • Methods and Analysis: Introduction to research designs and methods like natural language processing, network analysis, and machine learning.
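To illustrate the kind of material the Methods and Analysis area is meant to hold, here is a small sketch that combines very basic text processing with a first step towards network analysis: it counts hashtags and how often two hashtags appear in the same post. The function name `hashtag_cooccurrence`, the regular expression, and the example posts are assumptions for illustration; the Hub's own code projects may work quite differently.

```python
import re
from collections import Counter
from itertools import combinations

# Simple pattern for hashtags; real platforms have more detailed rules.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_cooccurrence(posts):
    """Count hashtags and hashtag pairs that occur together in a post.

    Returns (tag_counts, pair_counts). Pairs are alphabetically ordered
    tuples, so they can serve as undirected, weighted network edges.
    """
    tag_counts = Counter()
    pair_counts = Counter()
    for text in posts:
        # Lower-case and de-duplicate the tags within a single post.
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))
    return tag_counts, pair_counts


# Made-up example posts.
example_posts = [
    "Vote now #Election #news",
    "#election results #News #live",
    "A post without any tags",
]

tag_counts, pair_counts = hashtag_cooccurrence(example_posts)
print(tag_counts.most_common())
print(pair_counts.most_common())
```

The pair counts can be loaded into a graph library as edge weights to explore which topics cluster together.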

Questions and Improvements

If you have any questions or ideas, please do not hesitate to contact us at .


Creative Commons License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.