- discuss word boundaries
- Common Crawl: filtering on a period restricts the crawling to that period, but the information found this way may itself be much older. On the one hand that is a problem; on the other hand we do not want to limit ourselves to only information newly published during this period either. Newly published info is probably more relevant for social media.
- evaluation criteria for relevance
- evaluation criteria for disinformation
- determine which data exactly to include for disinformation? a list of URLs?
- feedback VAF: limit and group the research questions. anything else?
- TikTok API: test done, technical connection is OK.
- EuroHPC: reminder sent; check the spam folder of the student.ou.nl account? => Servicedesk, ITF <servicedesk@ou.nl>
AF: make the introduction more elaborate, and the related literature more elaborate as well. Either one chapter per research question, or one chapter for data preparation, then one for the model, …
scenarios
list the decisions made in the AF
https://guides.lib.uni.edu/media-accuracy-media-bia-media-trends/logical-fallacies
filtering, evaluation, … perhaps limit to one specific source of migrants.
rumour detection: look into one-class classification?
points to discuss :
- tiktok status update and further steps. See the excel sheet for a (subset of the) list of videos filtered on the keywords we determined. Volume increases over time. Descriptions are often incoherent; voice2text is almost always empty. Quality? Keyword revision necessary?
=> additional keywords, 2-step strategy: candidate set, then additional filtering. ukraine / russia. => whisper to transcribe. => between 30s and 2m, or 1m maximum. => comments only afterwards, after the filtering. => same for reddit. Ukraine, Russia, Syria, Israel, Palestine
or Ukraine and Russia
- EuroHPC status update. I have a time budget of 800 node hours; one node = 4 A100 GPUs. This budget is allocated pro rata per month, no transfer possible. It took some time to get this going + to find some way to actually keep 4 GPUs busy.
=> at the start, process on one GPU. => huggingface.
- dataset update :
- did extensive work to further limit the candidates. 57.858 remained; 37.769 were filtered out as irrelevant (total_nr_hits < 3 OR link_percentage >= 70%), leaving 20.089 candidates.
- examined the FAISS vector store. difficulty: we still need to know how to query the remaining candidates. also: English bias.
- https://github.com/wietsedv/bertje https://huggingface.co/papluca/xlm-roberta-base-language-detection
- looked into keyword / topic extraction: chunkeyBert. Summarize in 5 keywords; my hope is to be able to easily spot relevant / irrelevant. We can still use the FAISS approach for an estimate of disinformation. difficulty: heavy bias towards English.
- => automatic translation from Dutch/French to English using Helsinki-NLP/opus-mt-*-en. major rabbit hole! batch_size, sampling/no-sampling, hyperparameters of huggingface transformers, parallelism, …
limit to 4000 tokens maximum
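The irrelevance pre-filter applied to the candidate set above (drop a candidate when total_nr_hits < 3 OR link_percentage >= 70%) can be sketched as follows. The dict layout is an assumption for illustration; only the thresholds come from the notes.

```python
# Sketch of the relevance pre-filter: a candidate is dropped when it has
# fewer than 3 keyword hits OR consists of >= 70% links.

def is_irrelevant(candidate: dict) -> bool:
    return candidate["total_nr_hits"] < 3 or candidate["link_percentage"] >= 70

def filter_candidates(candidates: list[dict]) -> list[dict]:
    return [c for c in candidates if not is_irrelevant(c)]

demo = [
    {"url": "a", "total_nr_hits": 5, "link_percentage": 10},  # kept
    {"url": "b", "total_nr_hits": 2, "link_percentage": 10},  # too few hits
    {"url": "c", "total_nr_hits": 9, "link_percentage": 80},  # mostly links
]
kept = filter_candidates(demo)
```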
scope :
- tiktok: extract text with whisper
- set up the questions for FAISS: 512 words, 2-3 questions
- first version of the database
- important to sample!
- only then experiments
summary of our meeting of today 2025-02-28 :
tiktok and reddit.
- I have found an initial list of videos / reddit posts containing at least one hit for our initial list of keywords. due to the number of hits, I will use a 2 step strategy to reduce that initial list even more. This will be done by :
- using additional keywords. Candidate additional keywords lists are : 1. ukraine, russia or 2. ukraine, russia, syria, israel, palestine.
- limiting tiktok videos to videos between 30s and 1-2 minutes.
- comments will only be downloaded and taken into account for the second stage video candidates
- text transcription can be done using whisper model on the second stage video candidates.
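The 2-step strategy above can be sketched as a toy filter: step 1 builds a broad candidate set from the initial keyword list, step 2 narrows it with the additional conflict keywords (here candidate list 1: ukraine, russia). The keyword sets and texts below are illustrative stand-ins, not the real lists.

```python
# Toy sketch of the 2-step keyword strategy; real keyword lists differ.
INITIAL_KEYWORDS = {"refugee", "migrant", "asylum"}   # stand-in initial list
ADDITIONAL_KEYWORDS = {"ukraine", "russia"}           # candidate list 1

def hits(text: str, keywords: set[str]) -> bool:
    # naive whole-word match; real matching may need stemming etc.
    return bool(set(text.lower().split()) & keywords)

def two_step_filter(texts: list[str]) -> list[str]:
    stage1 = [t for t in texts if hits(t, INITIAL_KEYWORDS)]      # broad set
    return [t for t in stage1 if hits(t, ADDITIONAL_KEYWORDS)]    # narrowed

demo = [
    "refugee arrivals from ukraine rise",   # passes both stages
    "migrant workers in agriculture",       # stage 1 only
    "local sports news",                    # neither
]
```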
internet articles (news outlets and political parties)
- about 20.000 articles remain after extensive filtering (initial keywords) and delta calculations (only new URLs during the timeframe taken into account)
- I have English translations + a 5-keyword summary produced by chunkeyBert for those 20.000 articles
- for each article, 2 judgements must be made: 1) is the article relevant for the research and 2) does it contain disinformation
- judging the relevance of the articles will be done using the same 2-step approach as for tiktok, meaning I will look for those keywords in the chunkeyBert keywords / summaries. We will limit our articles to max 4000 words. This will yield a second-stage list of articles.
- judging whether an article from the second-stage list contains disinformation will be done by computing a semantic embedding of each article. These embeddings will be stored in a vector store. The vector store will be queried to retrieve documents that are highly likely to contain disinformation. I will propose a list of possible queries for this.
scope for next meeting :
- setup the questions to be used to interrogate the vector store.
- finalize as much as possible the news article database, 2 judgements done for all 20.000 articles.
- experiment with whisper for some videos
Dear Clara and Stefano,
I have been looking at some of the most commonly used disinformation discourses. I have found the following broad categories :
- economic disinformation narratives
- labor market-related narratives
- security-related narratives
- public health-related narratives
- cultural assimilation narratives
- “Great Replacement” or other conspiracy narratives
- nationalist and far-right narratives
- fairness-based narratives
- fear-driven narratives
- apocalyptic narratives
Therefore, to detect each kind of narrative, I would propose the following questions (to be used to semantically query a vector store) :
- economic: Do immigrants pose a financial burden on the welfare system? Do immigrants drain public resources?
- labor market: Are immigrants taking jobs from citizens?
- security: Do immigrants increase crime rates?
- public health: Are immigrants linked to spreading diseases?
- cultural assimilation: Do immigrants refuse to integrate into society? Do immigrants refuse to integrate into local culture?
- “Great Replacement” or other conspiracy: Are immigration policies leading to the replacement of native populations? Is mass immigration a deliberate plot to replace native populations?
- nationalist and far-right: Is mass immigration a deliberate strategy to weaken national identity? Does immigration undermine national identity?
- fairness: Are immigrants receiving preferential treatment over citizens? Are immigrants given unfair preferential treatment?
- fear: Do open borders lead to uncontrolled waves of criminals and terrorists? Do open borders attract criminals and terrorists?
- apocalyptic: Is immigration an existential threat to Western society?
Instead of phrasing these as questions to be queried against the vector store, would it be better to drop the question form and formulate them more like "find documents that contain economic disinformation narratives, for example documents talking about immigrants being a financial burden on the welfare system, or immigrants draining public resources"?
Thank you for your feedback.
Dear Clara and Stefano,
as discussed, I have been working on semantic embeddings in order to help judge whether 1) documents are relevant and 2) documents contain disinformation.
First of all, filtering the summaries of the +/- 20.000 hits to those containing Russia or Ukraine results in 1259 candidate articles. Filtering to the broader set of conflicts results in 2147 candidate articles. In both cases, the majority of candidates after filtering come from news outlets (about 90%), not from political parties (about 10%). Before filtering, however, the candidates show an 80% - 20% distribution.
I have set up the necessary infrastructure to populate a FAISS vector store, and
- experimented with several embedding models (sentence-transformers/all-MiniLM-L6-v2 and nomic-ai/nomic-embed-text-v2-moe)
- experimented with embedding of both the original content and the translated content
- experimented with different FAISS distance metrics (Euclidean distance and cosine similarity)
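The two metrics above are closely related: after L2-normalising the embeddings, inner product equals cosine similarity, which is how cosine search is typically set up in FAISS (an inner-product index over normalised vectors). A FAISS-free numpy illustration with toy vectors:

```python
import numpy as np

# After L2-normalisation, inner product == cosine similarity.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8)).astype("float32")  # toy "document embeddings"

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

normed = l2_normalize(emb)
query = l2_normalize(emb[0:1])            # use document 0 as the query

inner = (normed @ query.T).ravel()        # inner product on normalised vectors
cosine = (emb @ emb[0]) / (
    np.linalg.norm(emb, axis=1) * np.linalg.norm(emb[0])
)                                         # cosine computed directly
```

The same trick applies when populating a FAISS index: normalise once at insert time and at query time, then use an inner-product index.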
My approach to make judgments by semantically querying the vector store has been as follows :
- generate a list of questions to make a judgment (i.e. relevant, irrelevant, disinformation)
- each question is sent to the vector store and the top-n most similar document ids are retrieved
- the union of all document ids found by answering all the questions is then generated
Please find attached the lists of questions I have used so far. The relevance of the results varies wildly: sometimes spot on, sometimes wildly off. For example, there is (only?) a 20% overlap of documents marked as both irrelevant and relevant.
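The query-then-union procedure described above can be sketched as follows; a plain numpy cosine search stands in for the FAISS index, and the embeddings are random toy data.

```python
import numpy as np

# Per question: retrieve the top-n nearest documents; then return the union
# of all retrieved document ids across all questions.
rng = np.random.default_rng(1)
doc_emb = rng.normal(size=(10, 16))                    # 10 toy documents
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)

def top_n_ids(question_emb: np.ndarray, n: int = 3) -> list[int]:
    scores = doc_emb @ question_emb                    # cosine (all normalised)
    return np.argsort(scores)[::-1][:n].tolist()

def union_of_hits(question_embs: list[np.ndarray], n: int = 3) -> set[int]:
    hit_ids: set[int] = set()
    for q in question_embs:
        hit_ids.update(top_n_ids(q, n))
    return hit_ids

questions = [rng.normal(size=16) for _ in range(3)]    # 3 toy "questions"
questions = [q / np.linalg.norm(q) for q in questions]
result = union_of_hits(questions)
```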
Questions :
- is the imbalance between the percentage of hits coming from media outlets vs political parties before (80%-20%) and after (90%-10%) filtering to specific conflicts something to worry about? Or do we just accept that, with the 2-stage setup we have used, news outlets will be more prevalent in the search results?
- do you see flaws in the methodology?
- do you have suggestions on how to improve the questions?
- do you have suggestions on the combination of embedding model, distance metric, and whether to use the original content or a translated (Helsinki-NLP models) version?
I have also attached the candidate set when applying the strict filtering of just the Russia - Ukraine conflict.
potential questions :
Do immigrants pose a financial burden on the welfare system? Targets economic disinformation narratives.
Are immigrants taking jobs from native-born citizens? Addresses labor market-related misinformation.
Do immigrants increase crime rates in host countries? Focuses on security-related disinformation.
Are immigrants linked to the spread of diseases? Examines public health-related misinformation.
Do immigrants refuse to integrate into society? Targets cultural assimilation myths.
Are immigration policies leading to the replacement of native populations? Checks for “Great Replacement” conspiracy rhetoric.
Is mass immigration a deliberate strategy to weaken national identity? Investigates nationalist and far-right narratives.
Are immigrants receiving preferential treatment over citizens? Looks at fairness-based misinformation.
Do open borders lead to uncontrolled waves of criminals and terrorists? Explores fear-driven disinformation.
Is immigration an existential threat to Western civilization? Targets apocalyptic-style disinformation claims.
- **Do immigrants drain public resources?** Targets economic disinformation.
- **Are immigrants taking jobs from citizens?** Addresses labor market myths.
- **Do immigrants increase crime rates?** Focuses on security narratives.
- **Are immigrants linked to spreading diseases?** Covers public health fears.
- **Do immigrants refuse to integrate into local culture?** Highlights cultural assimilation myths.
- **Is mass immigration a deliberate plot to replace native populations?** Examines “Great Replacement” rhetoric.
- **Does immigration undermine national identity?** Targets identity-based disinformation.
- **Are immigrants given unfair preferential treatment over citizens?** Focuses on perceived inequity claims.
- **Do open borders attract criminals and terrorists?** Explores security and border control fears.
- **Is immigration an existential threat to our society?** Captures apocalyptic disinformation narratives.
most commonly used disinformation discourses.
- **Economic Drain:** Immigrants are falsely portrayed as overburdening public resources.
- **Job Theft:** Claims suggest immigrants take jobs from native citizens.
- **Crime Increase:** Immigrants are often linked to higher crime rates without evidence.
- **Disease Spread:** Narratives falsely associate immigrants with epidemics.
- **Cultural Non-Integration:** Immigrants are depicted as unwilling to assimilate.
- **Great Replacement:** Conspiracy claims argue immigrants are replacing native populations.
- **National Identity Threat:** Immigrants are said to undermine traditional cultural values.
- **Preferential Treatment:** False beliefs claim immigrants receive undue benefits.
- **Security Risk:** Immigrants are accused of introducing terrorism or criminal elements.
- **Existential Threat:** Immigration is portrayed as a danger to the very fabric of society.
This is a great approach—framing the questions around semantic relevance instead of direct yes/no checks will make retrieval more effective. Below is a refined and expanded version of your lists with some added depth and nuance.
---
### **a) Questions to identify articles that are not relevant**

These questions focus on topics that may be loosely related to immigration but do not contribute to the research on disinformation surrounding immigration.
- What are the personal experiences of individual refugees or migrants regarding war and conflict?
- What grassroots or community-led initiatives exist to support refugees?
- How have refugees contributed to sports, arts, or culture in Belgium and the Netherlands?
- What historical events have led to past refugee movements unrelated to the current geopolitical situation?
- What are the psychological effects of war on individuals and families who experience displacement?
- What are the general causes and consequences of war and conflict worldwide, without reference to migration?
- How do humanitarian organizations provide aid in conflict zones without direct ties to migration policy?
- What are the key cultural contributions of immigrant communities in Belgium and the Netherlands?
- What is the role of international diplomacy in conflict resolution, without reference to refugee movements?
- How do host societies benefit culturally and economically from long-term immigration trends, without focusing on current migration debates?
---
### **b) Questions to identify articles that are relevant**

These questions focus on retrieving articles that discuss migration, policies, and public discourse around refugees.
- How is the European Union, and specifically Belgium and the Netherlands, responding to the Ukrainian refugee crisis?
- Are Ukrainian refugees receiving different treatment compared to refugees from Syria, Afghanistan, or Africa?
- How prepared are Belgium and the Netherlands for an increase in war refugees? Do they have enough shelter, jobs, and school capacity?
- What government policies exist for war refugees in Belgium and the Netherlands?
- What are the positions of different political parties in Belgium and the Netherlands regarding Ukrainian refugees?
- What are the positions of different political parties in Belgium and the Netherlands regarding refugees from Syria, Afghanistan, and Africa?
- What debates exist within the EU about asylum policies and border security?
- What are the known facts and statistics about immigration in Belgium, the Netherlands, and the EU?
- How does media coverage of migration influence public perception and political decision-making?
- What role do NGOs and international organizations play in managing refugee flows to Belgium and the Netherlands?
---
### **c) Questions to identify articles containing disinformation**

These questions aim to detect narratives that spread false or misleading information about immigration and refugees.
- Do immigrants place a financial burden on the welfare system?
- Are immigrants disproportionately dependent on public resources compared to native citizens?
- What is the impact of immigration on national economic stability?
- Are immigrants responsible for job losses among native-born workers?
- How does immigration affect wages for local workers?
- Do immigrants increase crime rates?
- Are there specific crimes that are disproportionately committed by immigrants compared to the native population?
- Are immigrants linked to the spread of infectious diseases?
- Do immigration policies fail to prevent the introduction of diseases into the country?
- Do immigrants resist integrating into society and local culture?
- Are current immigration policies leading to the replacement of native populations?
- Is mass immigration being used as a deliberate strategy to alter the demographics of Western nations?
- Do global elites or specific groups have a coordinated effort to change the population makeup of Europe through immigration?
- Does uncontrolled immigration weaken national identity and social cohesion?
- Are immigrants receiving unfair preferential treatment over native-born citizens?
- Do open borders result in an influx of criminals and terrorists?
- Is mass immigration an existential threat to Western nations?
- Does the influx of refugees and migrants lead to the collapse of social cohesion?
- Is there a deliberate effort to use migration as a tool to erode traditional values and cultural identity?
---
Would you like me to refine the phrasing further, or add any new perspectives based on specific concerns?
issues news outlets :
- very often a small article + lots of links / other irrelevant content
- very often liveblogs: a multitude of different topics discussed, most not relevant -> marked as not relevant
- very few clear cases of disinformation -> was to be expected, since news outlets are supposed to be more or less neutral
- lots of overlap in the content, as is to be expected
- "gratis onbeperkt toegang tot?" (paywall teaser text) -> remove everything after it
- writing the methodology part for the dataset. how elaborate should this be? diagrams? keyword list, subreddit lists, … (in appendices?)
diagram + description of the final data set (words, length, word cloud, general text description), 5 pages, overview of the platforms, list of keywords in appendices. separate chapter for the dataset. first deliverable. by manual analysis.
- is the database good enough as it is now? do more cleansing? split reddit discussions (big scope!)? take tiktok comments into account?
start with the dataset right now. look into reddit threading? size is OK.
- entity extraction and RDF experiments results.
TODO: send prompt to Stefano, use JSON instead. llama 3.2 70B for translations
- next steps? sources facts? scope graphRAG?
extract triples, queries to highlight disinformation (structure or content). deepseek distilled R1, new mistral 1, llama 3.2/4. few-shot with triples, and also textual triples. pruning: first extract everything, then refinement of concepts.
conclusions :
- the dataset must become a separate RQ1. put it explicitly in a separate chapter and make it a deliverable
- start with this core dataset and focus on the extraction pipeline
- tiktok/reddit: Clara mentions that in earlier research they saw a clear difference between the situation without and with tiktok comments. so maybe include them after all.
- reddit -> to be split out into threads
- reporting of results must be done per platform (so here web/tiktok/reddit)
- as for "is the dataset large enough?" (it is also imbalanced): we can mention in the text that we look at the reasoning / structure rather than at the actual content.
- for extracting RDF: possibly look at output in JSON, i.e. 3x key-value (subject, predicate, object).
- store quadruples of article id and rdf triples.
- for translations use Google Translate, or possibly an LLM such as Llama 3.2 70B.
- now extract triples for all documents, then define queries on these triples that highlight disinformation. sparql queries? reasoning? what are the properties of disinformation? link the same concepts together (bad logic, bad connections).
- check for disinformation patterns as they are reported in research, and see if we can find these patterns in our data. extract the info and highlight the patterns.
- too ambitious: Stefano stresses the importance of context and of opinions changing over time. so we may have to look at how opinions change over time, in particular for SMPs. limit to time, not to interactions between persons.
- connections between argumentations, between statements, and looking at the structure of the RDF extracted data
- extend RQ1 by introducing triples in the few-shot/one-shot. give some triple examples.
- RQ2: full rag and try to leverage the RAG database.
- RQ3: compare full article + sentiment score <-> triples + sentiment score.
- next time: check which queries we can do with sparql + zero-shot detection with and without triples using deepseek Qwen (so not the 70B!!)
- extract triples for all documents and store them in a database. then connect triples across different documents.
- triples pruning: first extract all triples, then try to clean them up in a loop, asking the model to refine the concepts. this yields an ontology. then ask once more to extract the triples, instructing the model to use that ontology. refine the ontology once.
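The "3x key-value" JSON output format for RDF extraction mentioned above can be sketched as follows; the sample string mimics what an LLM would be prompted to emit, and the field names (subject, predicate, object) come from the notes.

```python
import json

# Parse LLM output formatted as a JSON list of {subject, predicate, object}
# objects into plain triples; far more robust than parsing free-text RDF.
llm_output = """
[
  {"subject": "immigrants", "predicate": "burden", "object": "welfare system"},
  {"subject": "open borders", "predicate": "attract", "object": "criminals"}
]
"""

def parse_triples(raw: str) -> list[tuple[str, str, str]]:
    data = json.loads(raw)
    return [(t["subject"], t["predicate"], t["object"]) for t in data]

triples = parse_triples(llm_output)
```

Storing each parsed triple together with its article id then gives the quadruples (article_id, subject, predicate, object) mentioned in the notes.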
finalized:
- dataset creation chapter
- RQ1 base experiment
- RQ2 extract triples (llama and phi4)
- RQ2 extract knowledge base. Phi4 was unable to do this; it only worked with llama.
- RQ2/1 extract refined triples based on the knowledge base
- RQ1 triple experiment, still some articles to do due to max_token_length
- RQ1 refined triple experiment : llama and deepseek-qwen. todo for deepseek, very computationally expensive (>24 hours)
- nothing done yet with the phi4 triples (compute budget)!
- first draft of RQ1 methodology
- first part of recreating threads for reddit. incomplete data / deleted data / …
to discuss :
- still struggling with the mechanics of huge models, cuda, properly distributing over multiple GPUs, variability in runtimes, …
- structure of report. results? discussion?
- exploratory analysis of the data: percentages and statistics go into a new chapter "exploratory data analysis".
- introduction: also discuss the LLMs we will use (there are many families, some do context analysis; 1-2 lines per LLM). also include a motivation paragraph: disinformation campaigns increase, we've seen it with covid, we've seen it during wars. 2-3 pages.
- background: architecture of the LLMs, how they were trained, variants (e.g. plain deepseek vs qwen), "we use this variant in our research because …". new chapter 2.5 with the technical details: mention huggingface, number of params, reasoning, CoT, … then also in 2.5, or possibly a 2.6: what is a knowledge graph? show a nice small example of a KG, explain how the entities fit together, etc.
- related research: add one section (3.6) on disinformation detection solutions. a lot has been done with BERT, it is a bit of a struggle. Focus on 5 studies that focus on disinformation detection. Stefano suggests moving generic parts into the background (the parts about RAG and GraphRAG go into the background, in the new technical subchapter) and then focusing on aspects of classifying disinformation in the related research chapter.
- results chapter: explain the results.
- discussion and conclusions chapter: for each RQ, explain in one paragraph what you did, and at the end explain the answer to the main RQ. then also give 2-3 future research ideas, e.g. if we had more data, more CPU, or more models.
- RQ2 setup
- Stefano: the goal is to find similar triples via triples, but then inject the text of the articles into the LLM to improve the classification. so do not inject the triples themselves. in other words: use the structure that triples give us, but leverage the strength of the LLM with the full text.
- graphdb containing all triples? only disinformation? multiple dbs?
- exact setup of the RQ2 experiments: use both the full KG and the disinformation-only KG. Stefano suggests rather using the full KG. 80% training, 20% test; build the ontology on the 80%. pytorch geometric, stellargraph (keras), rdf2vec.
- triples seem to be very nuanced. not trivial to devise cypher queries that detect disinformation discourses. llama produces almost exclusively one-hop connections; phi4 a few more "few-hop" connections, i.e. no reasoning chains are present in the extracted triples / knowledge graphs.
- show the figure with threads in dataset creation
- redo RQ1 with reddit threads? I think this is needed, otherwise we are comparing apples and oranges.
- RQ2 embedding alignment failure. to be explained; include it in the text? small figures?
- RQ2: what do we want to inject as the additional context? the triples? or the texts of the most relevant articles (or chunks)? show the figure!
- naming: should we speak of chunks or of paragraphs? "paragraphs" is technically not what happens.
- text: mix of past tense / future tense in e.g. chapter 4 methodology. how to solve this?
- should we do hyperparameter experiments? chunking size? top-k? overlap? different embedding models? if yes, mention again in 4.3 (currently removed)
BM25 search algorithm
https://github.com/Florents-Tselai/WarcDB
https://github.com/iipc/jwarc
download the index list at https://medium.com/@samuel.schaffhauser/using-the-common-crawl-as-a-data-source-693a41b3baa9
https://pullpush.io
https://developers.tiktok.com/products/research-api/
https://sf16-va.tiktokcdn.com/obj/eden-va2/lapz_k4_rvarpa/ljhwZthlaukjlkulzlp/form/research-endorsement-letter.pdf
https://vast.ai/
https://www.shepbryan.com/blog/what-is-gguf
note: there is both EU DisinfoLab and EUvsDisinfo
https://www.disinfo.eu/publications/disinformation-landscape-in-the-netherlands/ https://www.disinfo.eu/publications/disinformation-landscape-in-belgium/
project that checked social media in Belgium for disinformation: https://crossover.social/
include? https://www.reddit.com/r/Antwerpen/
https://euvsdisinfo.eu/ukraine/
commonly used narratives : https://edmo.eu/publications/disinformers-use-similar-arguments-and-techniques-to-steer-hate-against-migrants-from-ukraine-or-the-global-south-2/ https://benedmo.eu/ https://belux.edmo.eu https://www.logicallyfacts.com/ https://www.migrationpolicy.org/article/disinformation-migration-how-fake-news-spreads https://www.europarl.europa.eu/RegData/etudes/IDAN/2021/653641/EXPO_IDA(2021)653641_EN.pdf
https://crisiscentrum.be/nl/risicos-belgie/veiligheidsrisicos/desinformatie/desinformatie -> https://www.mediawijs.be/nl https://www.mediawijsheid.nl/nepnieuws/ https://www.watwat.be/fake-news/hoe-weet-ik-online-tekst-fotos-videos-echt-fake-zijn https://www.isdatechtzo.nl/
EuroHPC support : We can also be contacted by email at: servicedesk@lxp.lu https://docs.lxp.lu/first-steps/handling_jobs/#viewing-jobs-in-the-queue
download python scripts and models in advance: https://docs.lxp.lu/howto/HFInference/
huggingface :
LLM VRAM memory calculator : https://huggingface.co/docs/transformers/main/en/llm_optims
dev account: getting started, credentials. research api: getting started, generate an access token. API reference: query videos, query video comments.
common crawl parquet
https://data.commoncrawl.org/cc-index/table/cc-main/index.html https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format
keyword extraction :
https://huggingface.co/Voicelab/vlt5-base-keywords https://huggingface.co/google/mt5-large
automatic speech recognition :
https://huggingface.co/openai/whisper-large-v3-turbo
65 minutes at batch size = 1:
translated_ids = model.generate(
    **inputs,
    max_length=max_output_length,
    do_sample=False,
    num_beams=3,
    no_repeat_ngram_size=3,
    early_stopping=True,
    repetition_penalty=2.0,
    length_penalty=1.1,
    pad_token_id=tokenizer.pad_token_id
)
no repetitions, reasonably good quality.
https://www.kaggle.com/code/akshayr009/fakenewsdetection https://www.kaggle.com/datasets/corrieaar/disinformation-articles https://www.kaggle.com/datasets/imuhammad/euvsdisinfo-disinformation-database
Dear Clara and Stefano,
I have been working with the TikTok research API and noticed that it is a severely limited API.
Some limitations :
- max 1000 calls per day, max 100.000 results per day
- even if you ask for the maximum of 100 videos per call, you almost always get fewer than that (due to deleted videos, …), meaning the number of videos you can retrieve will be substantially less than 100.000
- you need to throttle your calls; one call every 10 seconds works, but no faster than that.
- resuming the download of a specific day/query seems difficult, as results are returned in descending video_id order, but you are not allowed to specify that you want all videos with video_id < previous_last_video_id
- no decent keyword filtering is available in the API, so you need to download everything and filter afterwards
- you cannot use extended time periods and need to fetch in pieces.
- you do not know how many results a query will return in total
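The pacing constraints above (at most 1000 calls per day, one call every 10 seconds) can be wrapped in a small download loop. `fetch_page` below is a hypothetical stand-in for the actual API call; only the throttling logic is the point of the sketch.

```python
import time

# Enforce the observed limits: <= 1000 calls/day, >= 10s between calls.
MAX_CALLS_PER_DAY = 1000
MIN_SECONDS_BETWEEN_CALLS = 10

def paced_download(fetch_page, sleep=time.sleep):
    calls, results = 0, []
    while calls < MAX_CALLS_PER_DAY:
        page = fetch_page()          # hypothetical API call; returns a list
        calls += 1
        if page is None:             # no more data for this day/query
            break
        results.extend(page)
        sleep(MIN_SECONDS_BETWEEN_CALLS)
    return results

# usage with a fake fetcher that returns two pages, then signals exhaustion
pages = iter([[1, 2], [3], None])
got = paced_download(lambda: next(pages), sleep=lambda s: None)
```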
Since I don’t know how many videos are created per day and I can only do 1000 calls per day, this poses a problem.
I was wondering whether you have used this API yourselves? If you have, I would in particular be interested to know whether you succeeded in resuming from a last successful id.
There is also the VCE environment that should allow for batch submissions which I will look into.
In the meantime, looking at the fields we can use to filter (and thus restrict the number of possible videos), I think the region_code might be interesting. From the docs :
“A two digit code for the country where the video creator registered their account”
Could we limit the region_code to Belgium + the Netherlands + Ukraine + … ? If you see other countries that we should put on this list, please let me know. Perhaps we can play with the video_length field as well. From the docs :
“The duration of the video SHORT: <15s MID: 15 ~60s LONG: 1~5min EXTRA_LONG: >5min”
A priori I would rather keep all videos, also the short ones, but if needed, this could be used as a filter as well.
Dear Clara and Stefano, I have evaluated the reddit posts and comments using the 2-stage approach we discussed. I looked into 412 discussions; I marked 55 of those as relevant, and 52 of those 55 (almost 95%) as containing disinformation.
Some additional remarks :
- I have unnested all comments and discussions and concatenated them to get the full text of the article.
- these discussions often get comments from lots of different people and contain many different opinions. If one or more of the comments contained disinformation, I marked the article as containing disinformation.
- the unnesting makes the text field difficult to read; I included the URIs for easier reading on reddit itself
- the Belgium4 subreddit was actually banned and blocked a couple of weeks ago for "promoting hate"
- since there are so many opinions and people involved in each discussion, automatic classification / keyword detection generates nonsensical output, so I went through the list manually.
There is a lot of blatant and very open racism on these subreddits. The results so far confirm my gut feeling: news articles are the most neutral, political parties are a bit more biased, and social media contains the most unfiltered, raw disinformation.
Please find the results attached.
Given the previous results, this would yield a database of 394 articles, with 107 of them containing disinformation (27%).
Dear Clara and Stefano,
I have tested the VCE (batch environment) of the TikTok Research API. Unfortunately, this environment only allows fetching aggregated data, i.e. you can ask for counts of videos but not for individual video ids. Other aggregates like max/min date etc. are allowed, but this is not what we are looking for. Furthermore, filtering on countries is not supported.
That leaves us with the online API. To restate the main limitations: 1.000 API calls per day / max 100.000 items per day. Many API calls return only a fraction of the max 100 items permitted per call.
It took me 3 days' worth of API calls to download just the video ids (not even the comments yet) for a single day (2022-02-01), filtering on countries NL and BE. Given that we are talking about a time period of almost 2 years (2022-02-01 -> 2024-11-30) and a couple of months to actually do the work, this is not going to work. Also, most of the text is going to be in the comments, because a) most video descriptions are either absent or just a list of hashtags and b) voice2text content was only present in 5 out of 71.460 videos.
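A quick back-of-envelope check of why this is infeasible, using the observed rate of roughly 3 days of API calls per day of data over the full study window:

```python
from datetime import date

# Study window from the notes: 2022-02-01 .. 2024-11-30 (inclusive).
window_days = (date(2024, 11, 30) - date(2022, 2, 1)).days + 1

# Observed rate: ~3 days of API calls per single day of video ids.
api_days_needed = window_days * 3   # over 3000 days, i.e. years of downloading
```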
We need a more aggressive filtering strategy. We could do this by :
- limiting time period
- limiting countries to just one (NL or BE)
- limiting to videos containing (many) comments
- … ?
Limiting to one country does not seem like a good strategy. Even if we assume this would cut the number of videos in half, that would still mean one and a half days of downloading per day of data, just for the videos. We could consider limiting the time period to the start of the war (the first couple of weeks / months), when probably the most buzz around the war and its consequences was generated. Limiting to videos containing many comments might be interesting. The comment-count percentiles tell me p50 = 1, p80 = 5 and p90 = 12. Please note that if we focus on large comment counts, these comments have to be downloaded, and this will “cost” API calls as well.
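The percentile figures above can be reproduced along these lines; the comment counts here are placeholder values, not the real distribution.

```python
import numpy as np

# Hypothetical comment counts per video; in practice this would be the
# comment_count column of the downloaded video metadata.
comment_counts = np.array([0, 0, 1, 1, 2, 3, 5, 8, 12, 40])

# Percentiles of the comment-count distribution, mirroring the
# p50 / p80 / p90 figures quoted above.
percentiles = {p: np.percentile(comment_counts, p) for p in (50, 80, 90)}

# Keep only the videos above a chosen percentile threshold.
threshold = np.percentile(comment_counts, 80)
busy = comment_counts[comment_counts >= threshold]
```

A filter like this costs nothing extra in API calls for the videos themselves, but as noted, the comments of the retained videos still have to be downloaded call by call.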
We could go for a combination of the above or come up with other strategies.
The research API does not allow to download videos, but there are tools that are capable of downloading videos given an id. If this works out, we could perhaps try some automatic audio transcription.
Interested in hearing your thoughts on this.
Dear Clara and Stefano,
I was investigating pyrdf2vec, but wanted to double check methodology.
In rdf2vec, you train a model on a number of triples. Let's assume the following, simplified example, which we would get from the training part of our extracted triples :
Alice -> knows -> Bob
Bob -> knows -> Dean
You would train a model, passing a list of entities you want to train on. That would typically be the list of subjects (here Alice and Bob), but you could include the objects as well (Bob and Dean).
Now let's assume I have trained this model, stored it in a vector database and now take the testing part of the dataset. I encounter this triple :
Tom -> likes -> Alice
The goal is to generate an embedding for Tom, and look for similar vectors in the vector database, from there fetch the triples and inject them in the query to the LLM.
However, pyrdf2vec needs to know about all entities you want to generate embeddings for. So, asking for an embedding for “Tom” will fail, since no walks for Tom will have been done during training.
This presents a problem for methodology : we want to split the dataset in a training and testing part. The testing data should be unseen by the model. But unseen data cannot be embedded by the model. There appear to be at least 2 paths :
- train the model on the entire knowledge graph. No unseen entities means embeddings can always be generated, but it also means test triples will be a perfect match with themselves, which is not what we want.
- interpolate from the parts of the triple that are known to the trained model. I.e. for Tom -> likes -> Alice, you could ask for the embedding of Tom (does not exist), of likes (does not exist) and of Alice (does exist). Note this only works if at least one of the subject, predicate or object is in the trained model; Tom -> likes -> Elsa would not work.
Some other options include returning an average vector or a zero vector for unknown entities, but those are not great either.
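For what it's worth, the interpolation option can be sketched as follows; the trained dict is a toy stand-in for the pyrdf2vec model / vector database, and all names are illustrative.

```python
def embed_triple(subject, predicate, obj, embeddings):
    """Average the embeddings of the triple parts the model has seen.

    Returns None when none of subject/predicate/object occurred during
    training (e.g. Tom -> likes -> Elsa in the example above).
    """
    known = [embeddings[e] for e in (subject, predicate, obj) if e in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy stand-in for embeddings learned during training.
trained = {
    "Alice": [0.9, 0.1],
    "Bob": [0.2, 0.8],
    "knows": [0.5, 0.5],
}

# Tom and likes are unseen, Alice is known -> only Alice's vector is used.
vec = embed_triple("Tom", "likes", "Alice", trained)  # -> [0.9, 0.1]
```

Whether an embedding built from one known part out of three is still meaningful for the similarity search is exactly the methodological question raised above.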
Am I missing something very obvious here?
Thank you for any clarification!
Dear Clara and Stefano,
I have finalized reviewing the reddit threads. As a reminder, the dataset was composed of web articles, tiktok transcriptions and reddit discussions. I had already reviewed the reddit source for RQ1 and marked 55 discussions as relevant, 52 of which contained disinformation (94.54%). These results were obtained by concatenating all comments of a discussion and judging whether any comment contained disinformation. In other words, if one comment in a discussion contained disinformation, the entire discussion was marked as containing disinformation.
Clara suggested I split the discussions into threads, and I did just that. I recreated the threads and reviewed these as well. Please find the results attached, FYI. I consider all threads of a relevant discussion to be relevant. My analysis yielded 1519 threads, 328 of which were marked as containing disinformation (21.59%). This is more in line with the tiktok results (which contained about 24% disinformation, if I remember correctly).
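For reference, the thread reconstruction boils down to walking root-to-leaf paths through the comment tree. This is a minimal sketch; it assumes each comment carries an id and a parent_id (None for top-level comments), as reddit exports typically do.

```python
# Rebuild root-to-leaf threads from a flat list of comments.
# Assumed schema: each comment dict has "id", "parent_id" (None for
# top-level comments) and "text".

def build_threads(comments):
    children = {}
    for c in comments:
        children.setdefault(c["parent_id"], []).append(c)

    threads = []

    def walk(comment, path):
        path = path + [comment["text"]]
        kids = children.get(comment["id"], [])
        if not kids:                      # leaf: one complete thread
            threads.append(path)
        for kid in kids:
            walk(kid, path)

    for root in children.get(None, []):   # top-level comments
        walk(root, [])
    return threads

comments = [
    {"id": "a", "parent_id": None, "text": "c1"},
    {"id": "b", "parent_id": "a", "text": "c2"},
    {"id": "c", "parent_id": "a", "text": "c3"},
    {"id": "d", "parent_id": None, "text": "c4"},
]
threads = build_threads(comments)
# -> [['c1', 'c2'], ['c1', 'c3'], ['c4']]
```

One full discussion thus fans out into as many threads as it has leaf comments, which explains the jump from 55 discussions to 1519 threads.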
Since the threaded data gives a more nuanced view of disinformation, should I run the RQ1 analysis again on the threaded reddit data? As a reminder, RQ1 looked at results for 3 models, across 3 prompting techniques (zero/one/few-shot), across full text/triples/knowledge graph. Different approaches are possible :
- ignore threaded reddit for RQ1 (i.e. what we have now)
- ignore full discussion reddit for RQ1 (only look at 3x3x3 for threaded reddit data)
- compare the two, i.e. compare what we have now with an additional run using the threaded data.
I believe using the threaded data, at least for the further RQs, would be advantageous, as the entire discussions often get huge (10.000+ words), increasing runtime and the risk of out-of-memory errors.
Thank you for letting me know what you think would be appropriate.
wkr,
Dear Clara and Stefano,
here are some points I would like to discuss tomorrow. Please see the report included, we can use this for the tables and figures.
- regarding extra papers for the GRAG chapter in related research : is the existing set not enough? If not, should I search specifically for disinformation detection, or for disinformation detection with (G)RAG?
-> add them in 2.4.
- should we redo / replace RQ1 with the threaded reddit dataset? Otherwise we are comparing apples to oranges.
-> yes, also run the RQ1 results with the threaded data
- RQ2 embedding alignment failure. to include in AF text?
-> yes, in the methodology chapter, under a subsection on additional experimentation.
- RQ2, choice of context (done with all 3), see figure 4.2 and table 7.2
-> OK
- should we talk about chunks or paragraphs? technically, these are not paragraphs (i.e. not necessarily coherent)
-> chunks is OK
- text : mix past / present / future tense in e.g. chapter 4 methodology. how to solve?
-> only past / present tense
- should we do hyperparameter experiments? chunking size? top-k? overlap? different embedding models?
-> not now, possibly if there is time left