Deepfake Detection – Literature Review Protocol and Data

This repository documents the end-to-end process used to scope, retrieve, and clean a corpus for a PhD literature review on deepfake detection with a focus on facial content. It includes the exact query sets, per-database exports, and the de-duplication workflow that produces the final screening list.

1) Scope and sources

Primary databases:

Scopus
ScienceDirect
IEEE Xplore

Discovery only (not used for quantitative counts or inclusion decisions):

Google Scholar (useful for seeding, novelty spotting, and snowballing)

Rationale: the review prioritizes venues and indexes with stable metadata and peer review. Scholar often surfaces preprints, partial notes, or repositories. When Scholar points to a citable item, it is usually indexed in Scopus, ScienceDirect, or IEEE Xplore.

2) Query sets

Four Boolean sets were defined. Set 0 frames the domain; Sets A–C target specific method families.

Set 0 (domain scoping)
"deepfake detection" AND ("face" OR "faces")

Set A (physiological and geometric cues)
"deepfake detection" AND ("head pose" OR "eye blinking" OR "lip-sync" OR "facial landmarks" OR "physiological signals")

Set B (spectral domain and artifacts)
"deepfake detection" AND ("frequency domain" OR "spectral artifacts" OR "compression artifacts" OR "color filter array" OR "noise patterns")

Set C (video dynamics and graphs)
"deepfake detection" AND ("optical flow" OR "temporal consistency" OR "LSTM" OR "graph neural networks" OR "spatio-temporal")

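The four Boolean strings share one anchor phrase and differ only in their OR-group, so they can be generated from term lists. The following is a minimal sketch for keeping the sets consistent across databases; the names `ANCHOR`, `QUERY_TERMS`, and `build_query` are illustrative and are not part of the repository scripts.

```python
# Hypothetical helper: rebuilds the Boolean strings of Sets 0 and A-C from term lists.
ANCHOR = "deepfake detection"

QUERY_TERMS = {
    "0": ["face", "faces"],
    "A": ["head pose", "eye blinking", "lip-sync", "facial landmarks",
          "physiological signals"],
    "B": ["frequency domain", "spectral artifacts", "compression artifacts",
          "color filter array", "noise patterns"],
    "C": ["optical flow", "temporal consistency", "LSTM",
          "graph neural networks", "spatio-temporal"],
}


def build_query(terms, anchor=ANCHOR):
    """Return '"anchor" AND ("t1" OR "t2" ...)' with every phrase quoted."""
    alternatives = " OR ".join(f'"{t}"' for t in terms)
    return f'"{anchor}" AND ({alternatives})'


for name, terms in QUERY_TERMS.items():
    print(f"Set {name}: {build_query(terms)}")
```

Each database has its own fielded-search syntax, so the generated string would still need to be wrapped in the field restriction that the database expects.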
3) Scoping counts

These counts are used only to frame the size of the space, not for screening.

Engine	Set 0	Set A	Set B	Set C
Google Scholar	11,600	3,640	3,340	5,290

4) Targeted retrieval by database (fielded queries)

Matches are constrained to article keywords and related metadata.

Database	Set 0	Set A	Set B	Set C
Scopus	379	19	71	105
ScienceDirect	75	30	30	51
IEEE Xplore	395	41	161	243
Google Scholar	6,670	2,180	1,900	3,410

From the screening stage onward, Google Scholar is excluded from the primary corpus. It remains a discovery channel.

5) De-duplication workflow

Two levels:

Within each set (A, B, C) across Scopus, ScienceDirect, IEEE Xplore
Across sets by merging the three within-set lists

Identity keys

Primary: DOI, lowercased and normalized by stripping protocol prefixes and trailing punctuation.
Fallback: Title, normalized with Unicode NFKD, lowercased, with punctuation replaced by spaces and runs of whitespace collapsed to a single space.

Retention precedence

When duplicates refer to the same work, keep the record with the most stable metadata:

Scopus > IEEE Xplore > ScienceDirect
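The identity keys and the retention precedence combine into a single pass over the merged records. The sketch below is one possible way to express that pass, not the repository's deduplicate.py; the field names `source`, `doi`, and `title` and the sample records are assumptions made for illustration.

```python
import re
import unicodedata

# Lower rank wins; the order mirrors the retention precedence stated above.
PRECEDENCE = {"Scopus": 0, "IEEE Xplore": 1, "ScienceDirect": 2}


def record_key(rec):
    """DOI-based key when a DOI is present, otherwise a normalized-title key."""
    m = re.search(r"10\.\d{4,9}/\S+", (rec.get("doi") or "").strip().lower())
    if m:
        return "doi:" + m.group(0).rstrip(".,;)")
    title = unicodedata.normalize("NFKD", rec.get("title") or "")
    title = "".join(ch for ch in title if not unicodedata.combining(ch)).lower()
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return "title:" + re.sub(r"\s+", " ", title).strip()


def deduplicate(records):
    """Keep one record per key, preferring the higher-precedence source."""
    best = {}
    for rec in records:
        key = record_key(rec)
        kept = best.get(key)
        if kept is None or PRECEDENCE[rec["source"]] < PRECEDENCE[kept["source"]]:
            best[key] = rec
    return list(best.values())


# Hypothetical records: the first two describe the same work.
records = [
    {"source": "ScienceDirect", "doi": "https://doi.org/10.1000/EXAMPLE.1",
     "title": "A Hypothetical Paper"},
    {"source": "Scopus", "doi": "10.1000/example.1",
     "title": "A hypothetical paper."},
    {"source": "IEEE Xplore", "doi": "", "title": "Another Hypothetical Paper"},
]
unique = deduplicate(records)
print(len(unique), [r["source"] for r in unique])
```

In this sketch, records that have neither a DOI nor a title would all share one key, so they would need to be set aside before the pass.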

Results

Per set

Scope	Input	Duplicates removed	Unique
Query A (physio/geometry)	90	6	84
Query B (spectral/artifacts)	262	25	237
Query C (spatio-temporal)	399	76	323

Cross-set merge (A + B + C)

Initial combined input: 644
Cross-set duplicates removed: 26
Final unique records for screening: 618

Intersections between method families:

A ∩ B = 21
A ∩ C = 3
B ∩ C = 2
A ∩ B ∩ C = 0

These small overlaps are consistent with the targeted design of Sets A–C.
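The reported figures can be cross-checked with simple arithmetic: each per-set input is the sum of the three database exports, and the final count follows from inclusion-exclusion over the pairwise and triple intersections. The snippet below only re-derives numbers already stated in this document; the variable names are illustrative.

```python
# Per-database export counts for Sets A-C (Scopus, ScienceDirect, IEEE Xplore).
per_db = {"A": (19, 30, 41), "B": (71, 30, 161), "C": (105, 51, 243)}
removed_within = {"A": 6, "B": 25, "C": 76}

# Within-set de-duplication: input minus duplicates removed.
unique = {name: sum(counts) - removed_within[name]
          for name, counts in per_db.items()}

combined = sum(unique.values())

# Cross-set merge by inclusion-exclusion:
# |A u B u C| = |A| + |B| + |C| - (|A n B| + |A n C| + |B n C|) + |A n B n C|
pairwise = {"AB": 21, "AC": 3, "BC": 2}
triple = 0
final = combined - sum(pairwise.values()) + triple

print(unique)    # {'A': 84, 'B': 237, 'C': 323}
print(combined)  # 644
print(final)     # 618
```

Because the triple intersection is empty, the number of cross-set duplicates removed (26) equals the sum of the pairwise intersections.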

6) Other sources

A small number of items reached the author through professional activities and ancillary channels (seminars, collaborations, challenge or dataset sites that link to citable papers). These are tracked as Other in the PRISMA-style diagram and were admitted only if they satisfied the same inclusion and exclusion criteria as database-retrieved records. Preprints are retained if they are the authoritative reference for a dataset or benchmark, or until a peer-reviewed version appears.

7) Repository layout

Structure of this repo:

README.md
included-in-review_25.csv

/raw/
  query-0_Scopus_379.csv
  query-0_ScienceDirect_75.txt
  query-0_IEEEXplore_395.csv
  query-A_Scopus_19.csv
  query-A_ScienceDirect_30.txt
  query-A_IEEEXplore_41.csv
  query-B_Scopus_71.csv
  query-B_ScienceDirect_30.txt
  query-B_IEEEXplore_161.csv
  query-C_Scopus_105.csv
  query-C_ScienceDirect_51.txt
  query-C_IEEEXplore_243.csv

/processed/
  query-A_merged_deduplicated.csv
  query-B_merged_deduplicated.csv
  query-C_merged_deduplicated.csv
  query-ABC_merged_deduplicated.csv

/scripts/
  deduplicate.py
  parse_sciencedirect.py
  make_abc.py
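The raw file names appear to follow the pattern query-<set>_<database>_<count>.<ext>, with the trailing number matching the record counts in the retrieval table. Assuming that convention holds, a small parser can recover the expected input size of each set from the file names alone. `FILENAME_RE` and `parse_raw_name` are hypothetical names used only in this sketch.

```python
import re

# Assumed convention: query-<set>_<database>_<count>.<csv|txt>
FILENAME_RE = re.compile(
    r"^query-(?P<set>[0ABC])_(?P<db>[A-Za-z]+)_(?P<count>\d+)\.(?:csv|txt)$"
)


def parse_raw_name(name):
    """Return (set, database, expected count), or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    return m.group("set"), m.group("db"), int(m.group("count"))


names = [
    "query-A_Scopus_19.csv",
    "query-A_ScienceDirect_30.txt",
    "query-A_IEEEXplore_41.csv",
]
expected_input = sum(parse_raw_name(n)[2] for n in names)
print(expected_input)  # 90, the input size reported for Set A
```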

8) Reproduction notes

Normalization helpers (Python):

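import re
import unicodedata

def norm_doi(s):
    """Return a bare, lowercase DOI, or "" when none is found."""
    if not s:
        return ""
    s = s.strip().lower()
    # Drop resolver prefixes; the search below also skips any other leading text.
    s = s.replace("https://doi.org/", "").replace("http://doi.org/", "").strip().strip(".")
    m = re.search(r"(10\.\d{4,9}/\S+)", s)
    return m.group(1).rstrip(".,;)") if m else ""

def norm_title(s):
    """Return an ASCII, lowercase, single-spaced title key."""
    if not s:
        return ""
    s = unicodedata.normalize("NFKD", s)
    # Remove combining marks so that "Díaz" and "Diaz" give the same key
    # instead of the accent turning into a space inside the word.
    s = "".join(ch for ch in s if not unicodedata.combining(ch)).lower()
    s = re.sub(r"[^a-z0-9 ]+", " ", s)  # punctuation and symbols become spaces
    s = re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace
    return s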

Primary key: norm_doi(doi) or norm_title(title), so the title key is used only when no DOI can be extracted from the record.
Precedence for ties: Scopus > IEEE Xplore > ScienceDirect
ScienceDirect exports may need a simple text parser rather than a CSV reader, because some exports are TXT blocks. The repository includes parse_sciencedirect.py with a DOI-anchored extractor.
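To illustrate what a DOI-anchored extractor can look like, the sketch below splits a text export into blank-line-separated blocks and keeps only the blocks that contain a DOI. It is not the repository's parse_sciencedirect.py. The block layout, the sample text, and the names `DOI_RE` and `extract_blocks` are assumptions, and which line of a block holds the title depends on the actual export format.

```python
import re

DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def extract_blocks(text):
    """Split on blank lines and keep the blocks that contain a DOI."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        m = DOI_RE.search(block.lower())
        if m:
            records.append({
                "doi": m.group(0).rstrip(".,;)"),
                # Raw lines are kept so the caller can pick out the title.
                "lines": [ln.strip() for ln in block.splitlines() if ln.strip()],
            })
    return records


# Hypothetical export text, assuming one reference per blank-line-separated block.
sample = (
    "Hypothetical Author,\n"
    "A hypothetical title,\n"
    "https://doi.org/10.1000/example.1.\n"
    "\n"
    "A block without any identifier\n"
    "\n"
    "Another Author,\n"
    "Another hypothetical title,\n"
    "https://doi.org/10.1000/example.2.\n"
)
records = extract_blocks(sample)
print([r["doi"] for r in records])
```

Anchoring on the DOI means that blocks without one are dropped by this sketch; such records would have to be recovered through the title fallback instead.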

9) Using the outputs

Use processed/query-ABC_merged_deduplicated.csv (618 records) for screening and synthesis.
The PRISMA-style flow in the thesis references these counts. The final box in that diagram points to the evidence map table in the thesis.
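Before screening, it can be worth confirming that the local copy of the merged file has the expected number of records. The following is a minimal check, assuming the CSV has a single header row and is UTF-8 encoded (with or without a byte-order mark); `count_records` is an illustrative name.

```python
import csv
from pathlib import Path

EXPECTED = 618
PROCESSED = Path("processed/query-ABC_merged_deduplicated.csv")


def count_records(path):
    """Count data rows, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return sum(1 for _ in csv.DictReader(fh))


if PROCESSED.exists():
    n = count_records(PROCESSED)
    print(f"{n} records (expected {EXPECTED})")
```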

10) Citation

If you use this material, please cite the thesis and this repository. Suggested citation:

GitHub repository attribution

Stile, V. (2025). Deepfake Detection – Literature Review, Queries, and Raw Data.
GitHub repository, https://github.com/vstile/deepfake-detection-review
© 2025 Vittorio Stile - Licensed under CC BY 4.0.

or paper attribution

This material reuses data and methods from this paper:
Stile, V., Caldelli, R., Guerrero-Contreras, G., Balderas-Díaz, S., and Medina-Bulo, I. (2025). Analysis of DeepFake Detection through Semi-Supervised Facial Attribute Labeling. Proceedings of the 11th Spanish-German Symposium on Applied Computer Science (SGSOACS 2025), 2831, XX, 138, Wien, Austria. https://link.springer.com/book/9783032148155
© 2025 Vittorio Stile - Licensed under CC BY 4.0.

or Ph.D. thesis attribution

This material reuses data and methods from this Ph.D. Dissertation:
Stile, V. (2026). “AI-generated Deepfakes: Detection and Bias Analysis”. Ph.D. dissertation, Universitas Mercatorum, Roma, Italy.
© 2026 Vittorio Stile - Licensed under CC BY 4.0.

11) Privacy and reuse policy

This repository contains bibliographic metadata and exported references. No personal data are included.
Reuse is permitted provided that you cite the author and this work.
Recommended license: Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt the material for any purpose, even commercially, as long as appropriate credit is given, a link to the license is provided, and any changes are indicated.

Availability of materials. The complete search logs, per-database exports, deduplicated corpora for Sets A–C, the cross-set merged list, and the parsing and reconciliation scripts are publicly available in this repository.
