diff --git a/docs/docs/03_data-collection/03_00_platform-specific guidelines/03_00_data-collection_youtube.mdx b/docs/docs/03_data-collection/03_00_platform-specific guidelines/03_00_data-collection_youtube.mdx
new file mode 100644
index 0000000..dbf5a8c
--- /dev/null
+++ b/docs/docs/03_data-collection/03_00_platform-specific guidelines/03_00_data-collection_youtube.mdx
@@ -0,0 +1,558 @@
---
title: "Data Collection on YouTube"
sidebar_position: 2
---

# Data Collection on YouTube

## Disinformation on video platforms

With the visual turn on social media and the growing importance of audio-visual platforms as information spaces, researchers have long acknowledged YouTube’s central role as a conduit of disinformation, conspiracy and extremist discourse (Allgaier, 2019; Knüpfer et al., 2023).

In the rapidly developing nexus of disinformation and Artificial Intelligence (AI), YouTube already hosts a variety of manipulative synthetic content, exemplified by recent discoveries of Spanish-speaking, anti-European content disseminated massively by disinformation networks (Maldita.es, 2025).

This underlines the need for researchers to closely monitor activities on the platform and collect large-scale data for analyses. Fortunately, as YouTube has been the dominant video platform for over a decade and has long provided access to many features through different APIs, there are lots of brilliant resources out there to help you collect different types of data from YouTube (Richardson & Flannery O’Connor, 2023).

### What you will learn in this chapter

This tutorial focuses on three main data collection methods, equipping you to monitor:

- the prevalence of topic-specific videos through **search queries**;
- **comment sections** under specific videos;
- **YouTube channels**, their statistics and video output.


### Authentication
Before getting started, make sure to have all necessary authentication requirements.
Obtaining an **API key** or **OAuth 2.0 token** is the central requirement for making any valid request. However, in contrast to other platforms, there are no huge obstacles to gaining access (like vetting processes for researchers). All you need is a Google account with permission to create projects on the Google Cloud Console. **Step-by-step guides to get an API key** are provided in written form [here](https://medium.com/mcd-unison/youtube-data-api-v3-in-python-tutorial-with-examples-e829a25d2ebd) and in video form [here](https://www.youtube.com/watch?v=th5_9woFJmk). For all the data collection performed in this tutorial, the **OAuth 2.0 token is not required**.

With a key in hand, it is best to follow along using a [Jupyter Notebook](https://jupyter.org/) either in your browser or any common programming environment (IDE) like [Visual Studio Code](https://code.visualstudio.com/).

The tutorial proceeds as follows:
- Basic setup
- Construction of a single query
- Data collection methods:
  - Search
  - Video information
  - Comments
  - Channels

:::hub-note Note
If you want to expand your data collection beyond what is shown here, you can find the **extensive documentation** for this API provided by Google [here](https://developers.google.com/youtube/v3/docs).
There is a **quota** for the API, with standard projects being limited to 10,000 request units per day. Google provides an overview of quota unit calculation [here](https://developers.google.com/youtube/v3/determine_quota_cost).
:::

## Basic Setup

In our Python environment, we will use the following packages, all easily installable via pip (PyPI).
First, install them using the command below:

```python
!pip install jsonlines tqdm pandas google-api-python-client
# The exclamation point signals that this shell command ("pip install something") should be run outside the Python interpreter
```
Then, import them into your environment:

```python
import jsonlines
import json
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import os

import googleapiclient.discovery
from googleapiclient.discovery import build
import googleapiclient.errors
```

Here, you insert the necessary information for the authentication:

```python
# You can disable OAuthlib's HTTPS verification when running locally.
# Please *DO NOT* leave this option enabled in production.
os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

API_key = "YOUR_API_KEY_HERE"  ## Replace this string with your actual API key
API_service_name = "youtube"   ## Specify what Google API you want to use
API_version = "v3"             ## Specify the version
```

:::hub-note Note
Storing credentials like API keys or other sensitive information in plain sight may be acceptable when running your scripts locally. However, a more secure approach is to set them up as environment variables outside your script or to use a configuration file. An easy-to-follow tutorial on the different ways to store sensitive information securely can be accessed [here](https://saturncloud.io/blog/how-to-set-environment-variables-in-jupyter-notebooks-a-guide-for-data-scientists/).
:::

Now you can construct your **API client**. This object `youtube` is how you interact with and make calls to retrieve YouTube data.

```python
youtube = build(API_service_name, API_version, developerKey=API_key)
```
## Construction of a single query

In essence, all YouTube data is **categorized into different resources** (channels, videos, playlists, thumbnails, etc.).
Each resource affords different methods to retrieve data of interest but, luckily, the query structure is largely the same.

Let us look at a **single query example**. We want to query the YouTube Data API for the top 50 most viewed videos related to election integrity in the weeks leading up to the US election. We use our client object to access the `search()` resource. The `list()` function allows us to retrieve a collection of results that match the query – this can be videos but also channels or playlists. Inside the `list()` function, we specify some necessary and optional parameters.

```python
## Initial API request
request = youtube.search().list(
    part="snippet",  # necessary parameter, where snippet contains more detailed information
    maxResults=50,   # default value is 5, max value is 50
    publishedAfter="2024-10-01T00:00:00Z",
    publishedBefore="2024-11-06T00:00:00Z",  # timeframe of interest
    order="viewCount",  # alternative would be by "date" in reverse chronological order
    q="election fraud | stolen election | election lie",
    # Use the "|" OR separator or the NOT (-) operator to further specify your keywords of interest
    relevanceLanguage="en",  # returns videos most relevant to the specified language
    type="video"  # we only want videos as results
)


response = request.execute()  # Execute our API call
```
If the request was successful, the response contains a dictionary with the following keys:

```python
response.keys()
```

```bash
dict_keys(['kind', 'etag', 'nextPageToken', 'regionCode', 'pageInfo', 'items'])
```

The video results are stored inside `items`; you can see the exemplary information we retrieved for the first video here:

```python
response["items"][0]
```

```bash
{'kind': 'youtube#searchResult',
 'etag': '2PoKoFaYrW3QLMB7vec-RZEr_rM',
 'id': {'kind': 'youtube#video', 'videoId': 'IPUhRjAMCTo'},  # videoId is important for later!
 'snippet': {'publishedAt': '2024-10-08T03:00:17Z',
  'channelId': 'UCwWhs_6x42TyRM4Wstoq8HA',
  'title': "Jon Stewart on Elon Musk, Free Speech & Trump's Election Interference Claims | The Daily Show",
  'description': 'With less than a month until Election Day, Jon Stewart unpacks how Trump and his newest "dark MAGA" henchman, Elon Musk, ...',
  'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/IPUhRjAMCTo/default.jpg',
    'width': 120,
    'height': 90},
   'medium': {'url': 'https://i.ytimg.com/vi/IPUhRjAMCTo/mqdefault.jpg',
    'width': 320,
    'height': 180},
   'high': {'url': 'https://i.ytimg.com/vi/IPUhRjAMCTo/hqdefault.jpg',
    'width': 480,
    'height': 360}},
  'channelTitle': 'The Daily Show',
  'liveBroadcastContent': 'none',
  'publishTime': '2024-10-08T03:00:17Z'}}
```

## Data collection methods: Search

While we have already successfully run a YouTube search query, 50 results are hardly enough to obtain meaningful insights in any research effort. You can retrieve more data through the `nextPageToken`. This matters because most APIs rely on pagination to control the amount of data accessed from their servers.

If there are more results for your query, the response will include a `nextPageToken`, which you can include in your next query to get the next 50 results – a process we can repeat for as long as there are results left. Let us generalize our previous code to collect the first N pages of results for our query:

```python
## > Loop to retrieve videos related to search query from multiple pages <

N = 2  # We set N to 2 to define that we want the top 2 pages of results.

# The "next_page_token" variable stores the ID of the next page between iterations.
next_page_token = None
search_results = list()  # An empty list to store the query results

for i in tqdm(range(N)):  # The "tqdm" wrapper around "range(N)" allows us to see a progress bar

    # Retrieve a page of results
    if next_page_token is None:
        # i.e. if this is the request for the first page, we do not use it as a parameter
        request = youtube.search().list(
            part="snippet",
            maxResults=50,
            publishedAfter="2024-10-01T00:00:00Z",
            publishedBefore="2024-11-06T00:00:00Z",
            order="viewCount",
            q="election fraud | stolen election | election lie",
            relevanceLanguage="en",
            type="video"
        )
        page_response = request.execute()
        search_results.append(page_response)
    else:
        # If it is not None, however, we use "nextPageToken" to specify the "pageToken" query parameter
        request = youtube.search().list(
            part="snippet",
            maxResults=50,
            publishedAfter="2024-10-01T00:00:00Z",
            publishedBefore="2024-11-06T00:00:00Z",
            order="viewCount",
            q="election fraud | stolen election | election lie",
            relevanceLanguage="en",
            type="video",
            pageToken=next_page_token  # here goes our token
        )
        page_response = request.execute()
        search_results.append(page_response)

    # Try to retrieve the "nextPageToken" if there is one.
    try:
        next_page_token = page_response["nextPageToken"]

    # If the response does not have a "nextPageToken" field, we simply break out of the loop
    except KeyError:
        break

```

:::hub-note Note
If you are only interested in the video results, you can also extract the respective data inside this loop by extending the `search_results` list with `page_response["items"]`.
:::

## Data collection methods: Video information

As seen above, the information for each video we can collect directly from the search query is limited. To obtain more detailed data like the view or comment count, we turn to the `videos()` resource and again use the `list()` method. For instance, the same video shown above offers the following statistics:

```bash
{'viewCount': '5325494', 'likeCount': '143592', 'favoriteCount': '0', 'commentCount': '8695'}
```

Let’s write a function that takes a list of video IDs as input and calls the API to retrieve more detailed information.
Crucially, the `youtube.videos().list()` request can take multiple IDs as input, so we can speed up our data collection with batches.

```python
# Our function to get video details
def get_video_details(video_ids):
    if not video_ids:
        return []

    # define batch size (limit is 50 again)
    batch_size = 50
    videos = []

    # process video IDs in batches
    for i in range(0, len(video_ids), batch_size):
        batch = video_ids[i:i + batch_size]

        details_request = youtube.videos().list(
            part="snippet,statistics",
            id=",".join(batch)
        )
        details_response = details_request.execute()

        videos.extend([
            {
                "title": video["snippet"]["title"],
                "published_at": video["snippet"]["publishedAt"],
                "channel_title": video["snippet"]["channelTitle"],
                "view_count": video["statistics"].get("viewCount", 0),
                "like_count": video["statistics"].get("likeCount", 0),
                # dislikeCount has not been publicly available since late 2021, so this usually defaults to 0
                "dislike_count": video["statistics"].get("dislikeCount", 0),
                "comment_count": video["statistics"].get("commentCount", 0)
            }
            for video in details_response.get("items", [])
        ])

    return videos
```

Now we extract the IDs of our 100 most viewed videos related to election integrity and use the function above to retrieve more detailed information about the first 5 of them:

```python
# Extract unique video IDs from search results
video_ids = list(set(video["id"]["videoId"] for page in search_results for video in page.get("items", [])))

# limit to first 5 videos
N = 5
video_ids = video_ids[:N]

# use the function
latest_videos = get_video_details(video_ids)

# print (some) info
for video in latest_videos:
    print(f"Title: {video['title']}, Published: {video['published_at']}, "
          f"Channel: {video['channel_title']}, Views: {video['view_count']}, "
          f"Likes: {video['like_count']}, Dislikes: {video['dislike_count']}, "
          f"Comments: {video['comment_count']}")
```

```bash
Title: DEBUNKING The Latest Election Lies From MAGA Senator | Bulwark Takes, Published:
2024-10-07T02:19:12Z, Channel: The Bulwark, Views: 332857, Likes: 20952, Dislikes: 0, Comments: 2460

Title: Voter Registration Fraud Discovered in Pennsylvania, Published: 2024-10-25T18:55:56Z, Channel: The Michael Lofton Show, Views: 567625, Likes: 9243, Dislikes: 0, Comments: 3285

Title: Will Trump’s baseless stolen election claims spark another Capitol attack? | ABC News, Published: 2024-11-03T22:56:07Z, Channel: ABC News (Australia), Views: 3994, Likes: 47, Dislikes: 0, Comments: 0

Title: Can Kamala Harris defeat Trump’s election lies in battleground Georgia? | Anywhere but Washington, Published: 2024-10-03T11:54:15Z, Channel: The Guardian, Views: 99329, Likes: 1718, Dislikes: 0, Comments: 510

Title: "Trump's 2024 Election Strategy: Lies and Controversy!", Published: 2024-11-02T11:09:09Z, Channel: MJ News, Views: 6, Likes: 0, Dislikes: 0, Comments: 1
```

As of now, this data is stored in `latest_videos` as a list of dictionaries. To make it more manageable, we simply convert it to a pandas `DataFrame` object (a table, basically). This way, we can also easily export it to a CSV or Excel file.

```python
data = pd.DataFrame(latest_videos)
print(data)
```

_Table 1: Results for video detail collection of the first five videos_
|index | title | published_at | channel_title | view_count | like_count | dislike_count | comment_count |
|:---:|:---|:---|:---|:---|:---|:---|:---|
|0 | DEBUNKING The Latest Election Lies From MAGA S... | 2024-10-07T02:19:12Z | The Bulwark | 332857 | 20952 | 0 | 2460 |
|1 | Voter Registration Fraud Discovered in Pennsyl... | 2024-10-25T18:55:56Z | The Michael Lofton Show | 567625 | 9243 | 0 | 3285 |
|2 | Will Trump’s baseless stolen election claims s... | 2024-11-03T22:56:07Z | ABC News (Australia) | 3994 | 47 | 0 | 0 |
|3 | Can Kamala Harris defeat Trump’s election lies... | 2024-10-03T11:54:15Z | The Guardian | 99329 | 1718 | 0 | 510 |
|4 | "Trump's 2024 Election Strategy: Lies and Cont... | 2024-11-02T11:09:09Z | MJ News | 6 | 0 | 0 | 1 |


## Data collection methods: Comments

Comment sections can be collected via the `commentThreads()` resource and `list()` method. A single query for the example video looks like this:

```python
request = youtube.commentThreads().list(
    part="snippet,id,replies",
    maxResults=100,  # For this resource, the max amount of results is 100
    order="time",
    videoId="IPUhRjAMCTo"
)
comment_response = request.execute()
```

With the output data for a single comment in the thread:

```bash
{'kind': 'youtube#commentThread',
 'etag': 'FjHUXM2rssJNyLjk0GUPrIP1AeY',
 'id': 'Ugy8fFs_mIdgiaBW7wF4AaABAg',
 'snippet': {'channelId': 'UCwWhs_6x42TyRM4Wstoq8HA',
  'videoId': 'IPUhRjAMCTo',
  'topLevelComment': {'kind': 'youtube#comment',
   'etag': 'Kp3rpMeuZ1_egtsYT6K_GX9-rrU',
   'id': 'Ugy8fFs_mIdgiaBW7wF4AaABAg',
   'snippet': {'channelId': 'UCwWhs_6x42TyRM4Wstoq8HA',
    'videoId': 'IPUhRjAMCTo',
    'textDisplay': 'No do the same for Kamala and Biden. Much more material.',
    'textOriginal': 'No do the same for Kamala and Biden. Much more material.',
    'authorDisplayName': '@stevenberry3294',
    'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AIdro_mljzddy7jo9d1eT87Vxkf-wgEsl_KEIealLasN5hw=s48-c-k-c0x00ffffff-no-rj',
    'authorChannelUrl': 'http://www.youtube.com/@stevenberry3294',
    'authorChannelId': {'value': 'UCiJyUwZOM8CL07N7RjjmWdA'},
    'canRate': True,
    'viewerRating': 'none',
    'likeCount': 0,
    'publishedAt': '2025-03-04T02:12:13Z',
    'updatedAt': '2025-03-04T02:12:13Z'}},
  'canReply': True,
  'totalReplyCount': 0,
  'isPublic': True}}
```

Having constructed the comment collection query for a single video, we can write a loop to retrieve all comments for the videos we collected above.
This loop does two things:
- It iterates over each video ID
- It iterates through all comment results for each video ID with pagination

```python
comment_results = dict()  # This time, we create an empty dictionary to store the comment query results

# iterate over the video IDs
for video_id in tqdm(video_ids):

    # this initialises the comment results for this particular video ID to be an empty list
    comment_results[video_id] = list()

    # Try to retrieve the first page of comments for the video
    try:
        request = youtube.commentThreads().list(
            part="snippet,id,replies",
            maxResults=100,
            order="time",
            videoId=video_id
        )
        comment_response = request.execute()
        comment_results[video_id].append(comment_response)

    # Some videos might have disabled comments.
    # If so, these lines of code will catch the error and simply move on to the next video.
    except Exception as e:
        print(video_id, e)
        continue

    # Try to retrieve the "nextPageToken" if there is one.
    try:
        nextPageToken = comment_response["nextPageToken"]

    # If the response does not have a "nextPageToken" field, the loop moves on to the next video
    except KeyError:
        continue

    # Given a value was found, this retrieves the comments until a "nextPageToken" can’t be found
    while True:
        request = youtube.commentThreads().list(
            part="snippet,id,replies",
            maxResults=100,
            order="time",
            videoId=video_id,
            pageToken=nextPageToken
        )
        comment_response = request.execute()
        comment_results[video_id].append(comment_response)
        try:
            nextPageToken = comment_response["nextPageToken"]
        except KeyError:
            break

```

Now we retrieve the number of comment threads for each of the first three videos and the total number of comments we were able to collect for each.
```python
stats_list = list()

for video_id in comment_results:
    nb_threads = 0
    nb_comments = 0

    for result in comment_results[video_id]:
        nb_threads += len(result["items"])
        for item in result["items"]:
            nb_comments += 1
            if "replies" in item:
                nb_comments += len(item["replies"]["comments"])

    stats_list.append({"video_id": video_id, "nb_threads": nb_threads, "nb_comments": nb_comments})

stats_df = pd.DataFrame(stats_list)
```

_Table 2: Results for comment collection of the first three videos (the exact counts depend on your collection run)_

| video_id | nb_threads | nb_comments |
| :--- | :--- | :--- |
| … | … | … |
| … | … | … |
| … | … | … |

:::hub-note Note
We can use the comment data collected to analyze the networks that develop in these comment sections. This is covered in the chapter ["Social Network Analysis"](/docs/docs/04_data-analysis/04_03_social-network-analysis). For this, use the unique IDs we gathered from the comment authors and their replies and convert them to a simple edge list.
:::

## Data collection methods: Channels

Lastly, if we have a set of channels we want to monitor in terms of their impact and the contents they disseminate, we can retrieve this data with the `channels()` resource and `list()` method and subsequently utilize the methods we already learned to collect all the information we need.

We first write a function to retrieve a channel’s general information and statistics. Then, we write a function that retrieves the unique ID of that channel. Lastly, we write a function to get the latest videos this channel has published. In this example, we retrieve information about the channel “@AntiSpiegel” that is associated with the media outlet “Anti-Spiegel” run by the prominent Russian propagandist Thomas Röper.
![AntiSpiegel channel screenshot](../../static/img/platforms/youtube/roeper_page_example.png)
_Screenshot of the AntiSpiegel YouTube channel_

```python
# define function to get a channel's information
def get_channel_info(user_handle: str):

    request = youtube.channels().list(
        part="snippet,statistics",
        forHandle=user_handle
    )

    response = request.execute()

    info = response['items'][0]['snippet']
    statistics = response['items'][0]['statistics']

    return info, statistics

# define function to get a channel's ID
# (note: a search().list() call costs 100 quota units, so use it sparingly)
def get_channel_id(user_handle):
    request = youtube.search().list(
        part="snippet",
        q=user_handle,
        type="channel",
        maxResults=1
    )
    response = request.execute()

    if response["items"]:
        print(f"Channels found: {len(response['items'])}")
        return response["items"][0]["id"]["channelId"]
    else:
        print("No channel found with that username")
        return None


# define function to get the latest video IDs
def get_latest_videos(channel_id, max_results: int = 5, after: str = '2025-01-01', before: str = '2025-02-23'):
    request = youtube.search().list(
        part="id",
        channelId=channel_id,
        order="date",
        publishedAfter=f"{after}T00:00:00Z",
        publishedBefore=f"{before}T00:00:00Z",
        maxResults=max_results,
        type="video"
    )
    response = request.execute()

    video_ids = [video["id"]["videoId"] for video in response.get("items", [])]
    video_ids = video_ids[:max_results]

    return video_ids

```

Applying these functions to the "@AntiSpiegel" YouTube channel looks like this:

```python
channel_info, statistics = get_channel_info("@AntiSpiegel")
print("Channel info:", channel_info['title'], "\n\n", "Channel statistics:", statistics)

# retrieve the channel ID for any account
channel_id = get_channel_id("@AntiSpiegel")
print(f"\n Channel ID: {channel_id}")

# retrieve IDs of the latest videos on the channel
video_ids = get_latest_videos(channel_id)
print(video_ids)

```

```bash
Channel info: Anti Spiegel

Channel statistics: {'viewCount':
'14460636', 'subscriberCount': '144000', 'hiddenSubscriberCount': False, 'videoCount': '113'}

Channel ID: UC93mqUPbNmHZhl4fAVvZWpQ
['IJyCAsBJJEo', 'm1jAhFRq3YI', 'T37-ST2kkiI', 'RyxFcVMDJts', 'O9P1eAZ9Sc0']
```

To sum up, the combination of functions and methods provided in this tutorial equips you to closely monitor and retrieve a comprehensive set of datapoints from YouTube. You can now construct, edit and execute queries for any resource the API provides. Wrapping these queries in some Python code allows you to store and analyse data on channels, videos about topics of interest as well as discourses in the comment sections. Crucially, the steps in this tutorial prepare you to explore the vast landscape of content on YouTube and gain insights into the production and dissemination of disinformation across different geographical or societal contexts.

## References

- Allgaier, J. (2019). Science and environmental communication on YouTube: Strategically distorted communications in online videos on climate change and climate engineering. Frontiers in Communication, 4, 36.

- Knüpfer, C., Schwemmer, C., & Heft, A. (2023). Politicization and right-wing normalization on YouTube: A topic-based analysis of the “Alternative Influence Network”. International Journal of Communication, 17, 23. Retrieved from https://ijoc.org/index.php/ijoc/article/view/20369.

- Maldita.es. (2025). “European politician crushes Spanish politician in the European Parliament”: A network of disinformation channels on YouTube. Maldita.es. Retrieved April 09, 2025, from https://maldita.es/malditobulo/20250313/network-channels-youtube-disinformation-spanish-politics-eu/.

- Richardson, L., & Flannery O’Connor, J. (2023, August 24). Complying with the Digital Services Act. The Keyword.
Retrieved April 09, 2025, from https://blog.google/around-the-globe/google-europe/complying-with-the-digital-services-act/

diff --git a/docs/docs/03_data-collection/03_00_platform-specific guidelines/03_01_data-collection_rumble.mdx b/docs/docs/03_data-collection/03_00_platform-specific guidelines/03_01_data-collection_rumble.mdx
new file mode 100644
index 0000000..31c438f
--- /dev/null
+++ b/docs/docs/03_data-collection/03_00_platform-specific guidelines/03_01_data-collection_rumble.mdx
@@ -0,0 +1,355 @@
---
title: "Data Collection on Rumble"
sidebar_position: 3
---

# Data Collection on Rumble

## A video-hub for fringe discourse

In recent years, Rumble has emerged as one of the **central audiovisual platforms** within alternative media ecosystems (Balci et al., 2024). Initially founded in 2013 as a video-sharing site in Canada with a focus on free speech, Rumble surged in popularity beyond its core audience (mainly from the U.S.) around 2020, capitalizing on growing distrust toward mainstream social media platforms like YouTube, Facebook, or X (Shaughnessy et al., 2024).

Today, Rumble functions as a central discourse space for the **MAGA** (Make America Great Again) community, positioning itself as a champion of "uncensored" content, creating fertile ground for **extremist, conspiracist**, and other **fringe communities**. Users who were deplatformed from larger sites often find a new home on Rumble, enabling the platform to become an essential node in the broader disinformation ecosystem (Balci et al., 2024a; Mell-Taylor, 2021).

Prominent figures associated with conspiracy theories — ranging from COVID-19 denialism to election fraud narratives — have amassed large followings on Rumble.
Content that would be heavily moderated or banned on larger platforms is often allowed to thrive here, challenging researchers to gain insights into the networks and narratives permeating this online space (Balci et al., 2024b; Thompson, 2024).

## The challenges of website architectures

Rumble is a good example of the **heterogeneous landscape of website architectures**. Much of the relevant information is loaded dynamically via JavaScript, depending on specific trigger actions on the website, like the user clicking a button or scrolling something into view. This is where traditional solutions like `Beautiful Soup` in the Python world or `rvest` in the R world fall short, as they can’t fetch dynamically generated content.

[Selenium](https://www.selenium.dev/) is an open-source software framework widely used for automating web browsers. Originally designed for testing web applications, Selenium has become an essential tool for researchers, especially those involved in web scraping and data collection from online platforms.

At its core, Selenium allows a user to programmatically control a browser like Chrome just as a user would: clicking buttons, filling out forms, scrolling pages, and downloading content. This ability makes it invaluable for collecting data from dynamic websites that rely heavily on JavaScript and interactive elements, which traditional scraping methods often struggle to handle.

## Ethical considerations

Crucially, among all types of data collection, webscraping is the most intricate legally and ethically, with considerations constantly evolving through legislation such as the EU General Data Protection Regulation (GDPR) or the Digital Services Act (DSA). Responsible scraping practices should always include data minimisation, anonymisation, and a clear purpose aligned with public interest or research.
Be sure to always check the platform’s or website’s Terms of Service and look for structured alternatives such as APIs. See also the chapter on the [current state of webscraping](/docs/docs/03_data-collection/03_03_web-scraping-intro) on the hub.

:::hub-note Note
While Selenium offers considerable advantages regarding its adaptability, data collection with it is often resource-heavy with long compute and waiting times. Depending on your project and its demands, consider alternatives like those mentioned above. You can find great tutorials on Puppeteer and rvest on the hub. Be sure, however, to check out the programming language they use. If limited to Python, the main alternative to Selenium is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) — a Python library for extracting data from HTML and XML sources.
:::

## What you will learn in this chapter

This chapter teaches you how to use Selenium for webscraping, including:

- Basic setup of a WebDriver instance;
- Core functions to find, fetch and interact with web elements;
- Collection of video information from Rumble.

It will walk through central areas of the video platform – the trending page, search queries and video pages.

Along with this tutorial, a **custom Python package** was developed to help you collect more complex data from Rumble. More information about its capabilities is provided at the end.

For this tutorial it is best to follow along using a **Jupyter Notebook** (Peters, 2022) either in your browser or any common programming environment (IDE) like Visual Studio Code.

## Basic Setup

We will first install the selenium package, as well as other packages needed later, via pip.
```python
!pip install selenium requests
# The exclamation point signals that this shell command ("pip install something") should be run outside the Python interpreter
```

From the selenium package, we will import the **WebDriver module** and launch a web browser instance of Chrome. With the `driver` object, you can now control the browser. We use the `get()` function to navigate to the Rumble homepage.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Launch the browser (Chrome in this example)
driver = webdriver.Chrome()

# Navigate to Rumble's homepage
driver.get("https://rumble.com")
```

:::hub-note Note
While the selenium package should automatically download the necessary browser drivers, such as `geckodriver` for Firefox or `chromedriver` for Chrome, it is possible that your browser version is not compatible with any of those. If you want to use Chrome, check your browser version and visit the [chrome-for-testing dashboard](https://googlechromelabs.github.io/chrome-for-testing/) to download the corresponding driver for the platform (OS) of your computer. You can then point to the newly downloaded driver file with the Selenium Service module.
:::

```python
CHROME_DRIVER_PATH = "PATH TO YOUR DRIVER FILE"

# initialize the Service module and pass the path to the executable (the chromedriver file)
custom_driver_path = Service(CHROME_DRIVER_PATH)

# create a new instance of the Chrome driver with the specified path
driver = webdriver.Chrome(service=custom_driver_path)

# Navigate to Rumble's homepage
driver.get("https://rumble.com")
```

![Rumble_landing_page](../../static/img/platforms/rumble/rumble_landing_page.png)
_Screenshot of Rumble’s landing page with automation disclaimer at the top_

:::community Hint
At the top of the browser window, it reads “Chrome wird von automatisierter Testsoftware gesteuert” (“Chrome is being controlled by automated test software”) – this disclaimer signifies that your browser is controlled by Selenium. At the same time, this kind of information is also passed to the website servers whenever you make a request. Some pages explicitly prohibit the usage of automated browser clients to visit their pages and have put in place different blockers and CAPTCHA forms. Reacting to this, more advanced instances of the Selenium WebDriver have been developed. For an introduction to the “undetected chromedriver”, click [here](https://github.com/ultrafunkamsterdam/undetected-chromedriver).
:::

## Interacting with elements on a webpage

Before we head into the breadth of extremist content on Rumble, the following ad appears at the bottom of the page. While we could simply close it manually, this presents a great opportunity to learn the core concept of data collection with Selenium – it’s all about the elements.

![Rumble_ad_popup](../../static/img/platforms/rumble/rumble_ad_popup.png)
_Screenshot of Rumble’s ad popup page and the associated DOM element_

## Understanding Web Elements and the DOM

When a web browser loads a web page, it reads the HTML (HyperText Markup Language) and constructs a representation of the page in memory called the **Document Object Model (DOM)**.
The DOM is essentially a tree-like structure in which each piece of HTML (such as `<div>`, `<p>`, `<a>`, or `<button>`) is represented as an object or node in that tree. Selenium lets us find data on the page by explicitly pointing to a web element. More information about how to use the DOM and DevTools can be found [here](https://www.freecodecamp.org/news/chrome-devtools/).

Instead of manually clicking the “X” button, we find the corresponding element in the DOM, so we can point to it and let the WebDriver close it for us. We will use the `find_element()` function. Every web element has certain properties and attributes, and there are multiple viable ways to point to an element. Among them, CSS selectors and XPath are the most prominent, striking a good balance between ease of use and explicitness. In this tutorial, we will focus on an element’s XPath to find and interact with it.

:::hub-note Note
If you want to explore the usage of CSS selectors for Selenium, you can find great resources [here](https://selenium-python.readthedocs.io/locating-elements.html).
:::

XPath uses a path-like syntax. Upon inspecting the “X” button element, we find its node name to be “button”. However, for this ad pop-up alone, there are three button nodes. So, we look for more unique attributes, like classes, IDs or aria-labels. Much like a file path on your computer, the longer the path, the more specific the object it is pointing to. In this case, the element has an associated aria-label value “Close” we can use to find the element, store it in an object and trigger a mouse click action with the `click()` function.

```python
from selenium.webdriver.common.by import By

# Most common XPath syntax: //tagname[@attribute="value"]

# Find the "X" button and store it as a webdriver element
close_button = driver.find_element(By.XPATH, '//button[@aria-label="Close"]')

print(close_button)
```

This output shows how Selenium stores a web element:

```bash
<selenium.webdriver.remote.webelement.WebElement (session="...", element="...")>
```

Now we can click this button and see the pop-up being closed.
```python
# Click the "X" button to close the pop-up
close_button.click()
```

## Rumble's trending page

Rumble’s trending page is a good seismograph of the latest MAGA and extremist discourses in the US. After inspecting the menu bar on the left, you can see that the different sections are not represented as button elements but rather as hyperlinks with which we can navigate to the respective page. We could find and store such an element through its href attribute and click it, like any user would do on the website. Alternatively, you can navigate to the URL directly via the `get()` function.

:::hub-note Note
There is no general rule as to whether to click hyperlink elements or load the page directly. However, the `get()` function will load the whole page, whereas click actions on dynamic webpages will sometimes only reload a certain part of that page. This matters for large-scale data collection with time constraints but can be neglected within the scope of this tutorial.
:::

![Rumble_menubar](../../static/img/platforms/rumble/rumble_menubar.png)
_Screenshot of Rumble’s menu bar with `<a>` elements displaying the subpages_

```python
# Find the "Trending" link and store it as a webdriver element, then click it
trending_page = driver.find_element(By.XPATH, '//a[@href="/videos?sort=views&date=today"]')
trending_page.click()

# Alternatively, you can navigate to the URL directly
driver.get("https://rumble.com/videos?sort=views&date=today")
```

By default, the trending page lists the most viewed videos of the day. On the right side, there are many different filter options to choose from. Instead of today, we want to inspect the most viewed videos of the past week. Again, we could find the filter option as an element and click it, or we can slightly change our URL to navigate directly to that filtered page.
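Assembling such a filtered URL can be sketched with Python’s standard library. Note that the helper name `build_listing_url` and the `date=this-week` parameter value are assumptions, extrapolated from the `date=today` pattern visible in the URL above:

```python
from urllib.parse import urlencode

def build_listing_url(sort: str = "views", date: str = "today") -> str:
    """Build a Rumble video-listing URL from filter options (hypothetical helper)."""
    return "https://rumble.com/videos?" + urlencode({"sort": sort, "date": date})

# The driver from above could then navigate straight to the filtered page:
# driver.get(build_listing_url(date="this-week"))
print(build_listing_url(date="this-week"))
# https://rumble.com/videos?sort=views&date=this-week
```

Whether Rumble accepts a given parameter value is easiest to verify by clicking the filter once manually and reading the resulting URL off the address bar.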
![Rumble_trending_page](../../static/img/platforms/rumble/rumble_trending_page.png)
_Screenshot of Rumble’s trending page with filter options_

Looking at the video elements in the DOM, we can see clearly structured `<li>` items with the class “video-listing-entry”, which hold different child elements and attributes. On Rumble, there is no infinite scroll, as the content is separated into different pages. Each page contains 25 video items.

![Rumble_video_item](../../static/img/platforms/rumble/rumble_video_item.png)
_Screenshot of a video item with its associated DOM element_

We can identify and store these video items by utilising the `find_elements()` function. While there are already valuable data points visible from the listing view alone, we don’t yet have access to the video description, its caption and the comment section. Therefore, we will only fetch the ‘href’ of each video and later visit each video page to retrieve all data in detail.

To do so, we write a small function that not only finds the elements but also extracts the value of their ‘href’ attribute. This can be easily done with the `get_attribute()` function.

```python
def collect_video_links(driver):

    # Define an empty list to store the video links
    collected_links = []

    # Find all video elements that contain the video links
    links = driver.find_elements(By.XPATH, "//a[@class = 'video-item--a']")

    # For each link, get the href attribute and add it to the list (skipping duplicates)
    for link in links:
        href = link.get_attribute("href")
        if href not in collected_links:
            collected_links.append(href)

    return collected_links

video_links = collect_video_links(driver)
print(video_links[:3])
```

```bash
['https://rumble.com/v6ri2ut--01-04-2025-makeleio.gr.html', 'https://rumble.com/v6rjszf--why-the-medias-bombshell-deportation-story-is-one-big-lie.html', 'https://rumble.com/v6ro0xt-stock-market-bloodbath-after-china-places-34-tariff-on-us-trump-holds-firm-.html']
```

If we want to **retrieve the video items for multiple pages**, we can do so by utilising keyboard actions to simulate user behaviour like scrolling.
At the bottom of the trending page, there are page elements we can scroll to and click. Ultimately, we want to create a loop that goes through the pages and stores the video links. There are again multiple viable ways to do this: either specifying the number of pages we want to retrieve data from or the number of videos we want to retrieve. In this case, we loop over the pages until we reach the defined number of video links. We want to fetch the 50 most viewed videos of the last week.

In prior versions of the Selenium Python package, many user actions like scrolling were performed by letting the WebDriver execute some lines of JavaScript code. Now, most common actions are nicely wrapped and easily executable through Python directly. Here, we need the `ActionChains` class and its `scroll_to_element()` method.

```python
import time

from selenium.webdriver.common.action_chains import ActionChains

def collect_multiple_pages(driver, limit: int = 50):

    video_links = []

    # Iterate over as many pages as needed to fetch the desired number of video links
    while len(video_links) < limit:
        # Scroll to the "Next" button to ensure it's in view
        button = driver.find_element(By.XPATH, "//li[@class='paginator--li paginator--li--next']")
        ActionChains(driver)\
            .scroll_to_element(button)\
            .perform()
        # Use the collect_video_links function to get video links from the current page
        page_video_links = collect_video_links(driver)
        # Add the new video links to the list, skipping duplicates across pages
        for link in page_video_links:
            if link not in video_links:
                video_links.append(link)
        button.click()
        # Give the next page a moment to load before collecting again
        time.sleep(2)

    # Trim the list in case the last page pushed us past the limit
    return video_links[:limit]


video_items = collect_multiple_pages(driver, limit=50)
```

## Automate search queries

Before turning to the detailed video data, we introduce one last core method for interacting with web pages via Selenium: sending keys, or input. Whenever we have an input element like a search bar or a form to fill out, we can send user input automatically and thus automate a variety of processes.
In this case, we want to utilise the search bar, insert a search query, and simulate an “ENTER” key press to automate the search on Rumble.

![Rumble_search_bar](../../static/img/platforms/rumble/rumble_search_bar.png)
_Screenshot of Rumble’s search bar with its associated DOM element_

```python
from selenium.webdriver.common.keys import Keys

def search_query(driver, query: str):

    # Find the search input by its type attribute
    search_bar = driver.find_element(By.XPATH, '//input[@type="search"]')

    # Send your query
    search_bar.send_keys(query)

    # Press Enter to submit the search
    search_bar.send_keys(Keys.ENTER)

# Our search query is "trump tariffs"
search_query(driver, "trump tariffs")
```

By default, the resulting page will show the most relevant results based on your query. The same filter options we introduced on the trending page apply here. You can now use the `collect_multiple_pages()` function to iterate over the result pages and collect the video links.

## Collect video information

After we have collected video links from the trending page or through our search query, we can now inspect a single video page for relevant data points.

![Rumble_video](../../static/img/platforms/rumble/rumble_video.png)
_Screenshot of a single video page with relevant metrics and data points_

For every video, Rumble provides a breadth of information we can collect. However, unlike data points such as the title or the author channel, much of the information is presented in a user-friendly way that is unfit for analyses. For instance, the view count is not a classic integer but abbreviated with a capital K (e.g. “680K” for 680,000 views), and so are the likes and dislikes for that video. Even after inspecting the DOM, one realises that we need to convert this data to make it usable for analysis.

We can create a function that fetches the view count and, depending on the letter, converts it to its associated integer value.
```python
def collect_view_count(driver):
    # Find the view count element using its XPath
    view_count = driver.find_element(By.XPATH, '//div[@class="media-description-info-views"]').get_attribute('outerText')

    # Strip thousands separators so plain numbers like "1,234" can be parsed as well
    view_count = view_count.replace(",", "")

    # Values above 1,000 are abbreviated as e.g. "1K", so we look for the K in the string
    if "K" in view_count:
        converted_view_count = int(float(view_count.split('K')[0]) * 1000)
    # Values above 1,000,000 are abbreviated as e.g. "1M", so we look for the M in the string
    elif "M" in view_count:
        converted_view_count = int(float(view_count.split('M')[0]) * 1000000)
    # Any value below that is displayed as a classic integer and can be converted directly
    else:
        converted_view_count = int(view_count)

    return converted_view_count

single_video_view_count = collect_view_count(driver)
print(single_video_view_count)
```

```bash
680000
```

Even after retrieving all visible data from the video page, there is limited insight into the video’s content, especially if it’s an hour-long stream. Luckily, Rumble provides captions we can fetch and store as text for content analyses. We will create a function that finds the caption element and makes a request with the help of the ‘requests’ Python package. It then encodes the response as text if successful or prints out the error message.
```python
import requests

def retrieve_captions(driver):
    # Find the captions track element and store its src attribute
    src_path = driver.find_element(By.XPATH, '//track[@kind="captions"]').get_attribute('src')

    # Utilise the requests library to fetch the captions file
    caption_response = requests.get(src_path, timeout=30)

    # Check if the request was successful (status code 200)
    if caption_response.status_code == 200:
        return caption_response.text
    else:
        print(f"Error: {caption_response.status_code}")
        return None

captions = retrieve_captions(driver)
print(captions[:90])
```

```bash
WEBVTT
00:00:47.340 --> 00:00:49.340
Welcome, you're listening to the X-22 Report.
```

With the core methods shown in this tutorial, you are now able to navigate to Rumble or any other website (static or dynamic) with an automated browser and retrieve or interact with its web elements. Beyond this, you can expand the code to fetch more data points from the video page and automatically navigate through the different domains of interest on Rumble. Once your code is ready, you can switch to Python scripts and run them daily for monitoring purposes.

## Advanced usage

The platform’s web page structure is complex, with many metrics and elements posing challenges for data collection. For instance, the date format depends on whether the content is a video or a live stream, and the display of the description text depends on its length. Rumble channels can also restrict the comment sections of their videos to logged-in users.

Therefore, we at polisphere have created an open-source Python package called “rumble-scraper” to help you with Rumble’s complexity and data collection obstacles.
It includes the capabilities to
- Collect videos with all filter options from the trending and browse pages
- Collect all visible data points from video pages, including the video description and the comment section
- Log in with user credentials



## References

- Balci, U., Patel, J., Balci, B., & Blackburn, J. (2024a). iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022. Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media.

- Balci, U., Patel, J., Balci, B., & Blackburn, J. (2024b). Podcast Outcasts: Understanding Rumble’s Podcast Dynamics. arXiv preprint arXiv:2406.14460.

- Mell-Taylor, A. (2021). Rumble Is Still Where the Right Goes to Play. Medium. Retrieved April 09, 2025, from https://aninjusticemag.com/rumble-is-still-where-the-right-goes-to-play-d3fe7df98875.

- Shaughnessy, B., DuBosar, E., Hutchens, M. J., & Mann, I. (2024). An attack on free speech? Examining content moderation, (de-), and (re-) platforming on American right-wing alternative social media. New Media & Society, 14614448241228850. https://doi.org/10.1177/14614448241228850.

- Thompson, S. (2024). I Traded My News Apps for Rumble, the Right-Wing YouTube. Here’s What I Saw. The New York Times. Retrieved April 09, 2025, from https://www.nytimes.com/interactive/2024/12/13/business/rumble-trump-bongino-kirk.html.
+ diff --git a/docs/static/img/contributors/schwenn.jpg b/docs/static/img/contributors/schwenn.jpg new file mode 100644 index 0000000..71a387a Binary files /dev/null and b/docs/static/img/contributors/schwenn.jpg differ diff --git a/docs/static/img/platforms/rumble/rumble_ad_popup.png b/docs/static/img/platforms/rumble/rumble_ad_popup.png new file mode 100644 index 0000000..c76cef3 Binary files /dev/null and b/docs/static/img/platforms/rumble/rumble_ad_popup.png differ diff --git a/docs/static/img/platforms/rumble/rumble_landing_page.png b/docs/static/img/platforms/rumble/rumble_landing_page.png new file mode 100644 index 0000000..2e336ac Binary files /dev/null and b/docs/static/img/platforms/rumble/rumble_landing_page.png differ diff --git a/docs/static/img/platforms/rumble/rumble_menubar.png b/docs/static/img/platforms/rumble/rumble_menubar.png new file mode 100644 index 0000000..63b75a7 Binary files /dev/null and b/docs/static/img/platforms/rumble/rumble_menubar.png differ diff --git a/docs/static/img/platforms/rumble/rumble_search_bar.png b/docs/static/img/platforms/rumble/rumble_search_bar.png new file mode 100644 index 0000000..0735bbb Binary files /dev/null and b/docs/static/img/platforms/rumble/rumble_search_bar.png differ diff --git a/docs/static/img/platforms/rumble/rumble_trending_page.png b/docs/static/img/platforms/rumble/rumble_trending_page.png new file mode 100644 index 0000000..61fd6ba Binary files /dev/null and b/docs/static/img/platforms/rumble/rumble_trending_page.png differ diff --git a/docs/static/img/platforms/rumble/rumble_video.png b/docs/static/img/platforms/rumble/rumble_video.png new file mode 100644 index 0000000..05041f6 Binary files /dev/null and b/docs/static/img/platforms/rumble/rumble_video.png differ diff --git a/docs/static/img/platforms/rumble/rumble_video_item.png b/docs/static/img/platforms/rumble/rumble_video_item.png new file mode 100644 index 0000000..e87b002 Binary files /dev/null and 
b/docs/static/img/platforms/rumble/rumble_video_item.png differ diff --git a/docs/static/img/platforms/youtube/roeper_page_example.png b/docs/static/img/platforms/youtube/roeper_page_example.png new file mode 100644 index 0000000..8b709fb Binary files /dev/null and b/docs/static/img/platforms/youtube/roeper_page_example.png differ