TFG_CYBER_AI

RL-based cybersecurity defender for binary PERMIT / BLOCK decisions on network flows.

The project is organised in two phases:

  • Phase 1: offline training and validation on historical datasets.
  • Phase 2: offline inference on flow features extracted from traffic captured in a private lab.

The current repository uses CICIDS2017 as the main dataset, a fixed canonical schema of 76 flow features, and a 152-dimensional observation vector once the missingness mask is appended.

Current Status

| Area | Status |
| --- | --- |
| Canonical schema | Implemented and frozen at 76 features |
| CICIDS2017 adapter | Implemented |
| NSL-KDD adapter | Implemented for historical Phase 1 benchmarking |
| RL algorithm | QRDQN |
| Validation suite | Checks A, B, C + leave-one-exact-CSV-out script |
| Phase 2 inference | Robust offline pipeline available (predict_real_traffic_v2.py) |
| Active blocking | Not implemented |

Documentation Map

Repository Structure

TFG_CYBER_AI/
├── .codex/                    # hooks.json for triggering knowledge graph updates
├── .github/                   # Agent guidance and coding/review agent instructions
├── datasets/                  # Local datasets (also tracked via git lfs)
├── docs/                      # Documentation, results, Phase 2 guides, defense material
│   └── Personal Research/     # Personal research notes for tracking and guidance
├── experiments/               # Experiment archive notes: historical and maintained timelines
├── lab/                       # Lab-related assets
├── models/                    # Trained model files (tracked)
├── pcaps/                     # Extracted flows and captures used for Phase 2 work (tracked)
├── report/                    # Thesis report and sources
├── runs/                      # Run artifacts: config.json, metrics.json, validation_results.json, etc. (tracked)
├── scripts/                   # Phase 2 and utility scripts
└── src/                       # Training, validation, adapters, environment, utilities

Core Technical Invariants

  • FEATURES_CANON contains 76 flow-based features.
  • The observation vector is always 152 dimensions:
    • 76 canonical feature values
    • 76 missingness-mask values
  • The missingness mask uses:
    • 1 for present/valid features
    • 0 for imputed or unavailable features
  • Labels are binary:
    • 0 = BENIGN
    • 1 = ATTACK
  • Leakage-prone fields must not enter the model:
    • IP addresses
    • absolute timestamps
    • Flow IDs or unique identifiers
    • ports used directly as label proxies
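
The observation and mask conventions above can be sketched as follows. This is a minimal illustration, not the repository's actual code; the function name and the use of NaN to mark missing values are assumptions:

```python
import numpy as np

N_FEATURES = 76  # size of FEATURES_CANON

def build_observation(values: np.ndarray) -> np.ndarray:
    """Build the 152-dim observation: 76 feature values + 76-dim mask.

    `values` is a length-76 float array where NaN marks an unavailable
    feature. Mask convention: 1 = present/valid, 0 = imputed/missing.
    """
    assert values.shape == (N_FEATURES,)
    mask = (~np.isnan(values)).astype(np.float32)          # 1 where valid
    filled = np.nan_to_num(values, nan=0.0).astype(np.float32)  # impute missing
    return np.concatenate([filled, mask])                  # shape (152,)
```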

Dataset Versions (CICIDS2017)

Two versions of the CICIDS2017 data exist locally:

| Version | Path | Tracked | Description |
| --- | --- | --- | --- |
| Curated | datasets/CICIDS2017/*.csv | Yes | Leakage-prone and redundant columns removed pre-ingestion. What the adapter loads. |
| Raw | datasets/CICIDS2017/Raw_dataset/ | No (gitignored) | Original CICFlowMeter CSV exports. All columns preserved. Local reference only. |

The adapter (src/load_cicids2017.py) applies further cleaning at load time regardless of which version is used. The anti-leakage policy in code is the authoritative gate.
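
A minimal sketch of the anti-leakage filtering idea. The column names here are assumptions for illustration; the authoritative list and logic live in src/load_cicids2017.py:

```python
# Illustrative anti-leakage filter; these column names are assumptions,
# not the adapter's actual policy.
LEAKY_COLUMNS = {"flow id", "src ip", "dst ip", "timestamp"}

def drop_leaky_columns(columns):
    """Return the column names allowed into the model, filtering out
    identifier and timestamp fields named above."""
    return [c for c in columns if c.strip().lower() not in LEAKY_COLUMNS]
```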

Quickstart

Install dependencies:

```bash
pip install -r requirements.txt
```

Train the RL model on CICIDS2017:

```bash
python src/train_rl_defender.py --smoke
python src/train_rl_defender.py --preset full
python src/train_rl_defender.py --split-mode day
```

Run the validation suite:

```bash
python src/validate_checks.py --model models/<MODEL>.zip --checks A B C
```

Run leave-one-exact-CSV-out validation:

```bash
python src/validate_leave_one_csv_out.py --timesteps 30000
python src/validate_leave_one_csv_out.py --timesteps 5000 --max-rows-per-csv 10000
```
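
The fold structure of this workflow can be sketched as follows (an illustration of the splitting scheme, not the script's actual implementation):

```python
from typing import Iterator, List, Tuple

def leave_one_csv_out_folds(csv_paths: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (train_csvs, held_out_csv) pairs: each fold holds out one
    exact CSV and trains on all the others."""
    for i, held_out in enumerate(csv_paths):
        yield csv_paths[:i] + csv_paths[i + 1:], held_out
```

With the eight CICIDS2017 CSVs, this produces eight folds, each training on the remaining seven files.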

Run robust Phase 2 offline inference:

```bash
python scripts/predict_real_traffic_v2.py \
  --flows pcaps/flows.csv \
  --model models/C03_qrdqn_cicids2017_canonical_full_random_20260223_232439.zip \
  --scaler runs/cicids2017/C03_qrdqn_cicids2017_canonical_full_random_20260223_232439/scaler.joblib \
  --percentiles runs/cicids2017/C03_qrdqn_cicids2017_canonical_full_random_20260223_232439/train_percentiles.npz \
  --clip-z 10.0 \
  --export-diagnostics
```
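
The --clip-z step amounts to standardising each feature with training-set statistics and clipping extreme z-scores so that out-of-distribution flows cannot blow up the observation. A sketch of that idea (parameter names and the zero-std guard are assumptions, not the pipeline's exact code):

```python
import numpy as np

def scale_and_clip(x: np.ndarray, mean: np.ndarray, std: np.ndarray,
                   clip_z: float = 10.0) -> np.ndarray:
    """Standardise with training-set mean/std, then clip z-scores to
    [-clip_z, clip_z], mirroring the --clip-z option above."""
    safe_std = np.where(std > 0, std, 1.0)  # guard against zero-variance features
    z = (x - mean) / safe_std
    return np.clip(z, -clip_z, clip_z)
```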

Validation Overview

The repository currently includes four validation workflows:

| Validation | Purpose |
| --- | --- |
| Check A | Direct prediction on X_test vs y_test without relying on the environment |
| Check B | Shuffled-label anti-leakage test |
| Check C | Hard CSV/day split generalisation test |
| Leave-one-exact-CSV-out | One held-out CICIDS2017 CSV per fold, train on the remaining seven |

The leave-one-exact-CSV-out workflow is implemented in code, but this repository does not currently contain a committed full run artifact for it under runs/validation/.
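
The intuition behind Check B can be illustrated in a few lines: permuting the labels destroys any genuine feature-label relationship, so a score computed against shuffled labels should collapse toward the majority-class rate. This toy stand-in only scores existing predictions against permuted labels; the actual Check B is implemented in src/validate_checks.py:

```python
import numpy as np

def shuffled_label_accuracy(y_true: np.ndarray, y_pred: np.ndarray,
                            seed: int = 0) -> float:
    """Score predictions against a random permutation of the labels.
    Without leakage, this sits near the majority-class rate rather
    than near the real test accuracy."""
    rng = np.random.default_rng(seed)
    y_shuffled = rng.permutation(y_true)
    return float((y_pred == y_shuffled).mean())
```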

Results Snapshot

Artifact-backed historical results are summarised in docs/results.md. Highlights:

  • Best committed CICIDS2017 run:
    • C03_qrdqn_cicids2017_canonical_full_random_20260223_232439
    • accuracy 0.99859
    • attack recall 0.99945
    • attack F1 0.99876
  • Validation Check C historical artifact:
    • accuracy 0.84135
    • train on Monday–Wednesday patterns
    • test on Thursday–Friday patterns
  • Phase 2:
    • robust offline inference pipeline exists
the latest committed benign-only v2 artifact shows that behaviour has changed across runs, so Phase 2 claims must always cite the exact run artifact they come from

The longer experiment-by-experiment narrative now lives in experiments/cicids2017_qrdqn_experiments.md for CICIDS2017 and experiments/nslkdd_experiments.md for the older NSL-KDD branch.

Notes for Submission and Defense

  • English is the default language for repository documentation.
  • The two defense-support documents remain in Spanish by design.
  • Historical results are preserved, but they must not be confused with the current code defaults.

Safety and Reproducibility

  • Every training or evaluation workflow should persist a RUN_ID and write artifacts under runs/<category>/<RUN_ID>/.
  • If documentation describes a result, it should reference an artifact that exists in runs/ or be clearly marked as planned or historical.
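
The RUN_ID convention might be implemented along these lines. This is a sketch: the directory layout follows the text above, but the timestamp-based RUN_ID format and the helper name are assumptions:

```python
import json
import time
from pathlib import Path

def start_run(category: str, base: str = "runs") -> Path:
    """Create runs/<category>/<RUN_ID>/ for this run's artifacts and
    persist the configuration there up front."""
    run_id = time.strftime("%Y%m%d_%H%M%S")  # illustrative RUN_ID format
    run_dir = Path(base) / category / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    # config.json sits alongside later metrics.json / validation_results.json.
    (run_dir / "config.json").write_text(json.dumps({"run_id": run_id}, indent=2))
    return run_dir
```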
