Article reproducibility – Classification of the Conciliation Profile in Initial Petitions in the Brazilian Judiciary
This repository organizes a pipeline to classify the conciliation profile (e.g., cases from Anonymous), combining:
- Preprocessing and feature generation (regex / tabular attributes / LLM-based features)
- Embeddings (e.g., Gemini / SBERT) with partial checkpointing
- Training and evaluation using classical models (XGBoost / RandomForest / LightGBM / CatBoost)
- A production test version (app + Docker) to serve the trained model
⚠️ Important (LGPD / sensitive data):
This repository is structured to avoid versioning real data (CSV/JSON), embeddings, and heavy/derived artifacts (PKL/NPY/checkpoints).
The documentation below clearly states what remains local versus what is pushed to GitHub.
Below is a view of the local environment, including data and artifacts.
It is provided for organizational reference only.
CLASSIFYING-CONCILIATION-PROFILE/
├─ Abordagem-com-LLM/
│ ├─ Dados/
│ │ └─ dados_processos_Anonymous_14052025.json
│ ├─ Conciliacao.ipynb
│ ├─ dados_processos_completo.csv
│ └─ embeddings_Anonymous.npy
├─ Analise-Estatistica-do-Corpus/
│ └─ Statistical-Corpus-Analysis.ipynb
├─ Conciliacao/
│ ├─ dados/
│ │ ├─ Anonymous_03072025/
│ │ │ ├─ dados_processos_Anonymous_03072025_V1.csv
│ │ │ ├─ dados_processos_Anonymous_03072025_V2.csv
│ │ │ ├─ dados_processos_Anonymous_03072025_V3.csv
│ │ │ ├─ dados_processos_Anonymous_03072025.json
│ │ │ ├─ dados_processos_Anonymous_03072025_V1_enriched.json
│ │ │ └─ artificial_token_frequency.txt
│ │ └─ Anonymous_14052025/
│ │ ├─ dados_processos_Anonymous_14052025_V1.csv
│ │ ├─ dados_processos_Anonymous_14052025_V2.csv
│ │ ├─ dados_processos_Anonymous_14052025_V3.csv
│ │ ├─ dados_processos_Anonymous_14052025.json
│ │ ├─ dados_processos_Anonymous_14052025_V1_enriched.json
│ │ └─ artificial_token_frequency.txt
│ └─ Embeddings/
│ ├─ partial_checkpoint.npy
│ ├─ checkpoint_partial_03072025.npy
│ ├─ dados_processos_Anonymous_14052025_rejected.json
│ ├─ dados_processos_Anonymous_14052025_V1_enriched_Embeddings.json
│ ├─ dados_processos_Anonymous_14052025_V1_enriched_Embeddings.npy
│ └─ ...
├─ Scripts/
│ ├─ gerar_embeddings_gemini.py
│ ├─ gerar_features_gemini.py
│ ├─ gerar_features_sentence_transformers.py
│ ├─ classificar_com_gemini.py
│ ├─ partial_checkpoint.json
│ └─ br-tjgo-cld-02-09b1b22e65b3.json
├─ Notebooks/
│ ├─ XGBoost/
│ │ ├─ XGBoost.ipynb
│ │ └─ xgb_classification_model.pkl
│ ├─ Random-Forest/
│ │ ├─ Random-Forest.ipynb
│ │ └─ rf_classification_model.pkl
│ ├─ LightGBM/
│ │ ├─ LightGBM.ipynb
│ │ └─ lightgbm_model_v1.pkl
│ ├─ CatBoost/
│ │ ├─ CatBoost.ipynb
│ │ ├─ CatBoost.pkl
│ │ └─ catboost_info/ (logs/artifacts)
│ ├─ pre-processing.ipynb
│ └─ regex_analysis.ipynb
└─ XGBoost-Production-Test/
├─ App/
│ ├─ app.py
│ ├─ xgboost_model.pkl
│ ├─ randomforest_model.pkl
│ ├─ templates/index.html
│ └─ static/Logo.png
├─ Dados/
│ ├─ Embeddings/text_vectors.npy
│ ├─ cleaned_Anonymous_processes.json
│ ├─ Anonymous_processes_features.json
│ └─ dados_processos_Anonymous_14052025.csv
├─ Notebooks/
│ ├─ pre-processing-data.ipynb
│ └─ xgboost.ipynb
├─ Dockerfile
├─ docker-compose.yml
└─ requirements.txt
❌ Do not version:
**/Dados/**/*.csv**/Dados/**/*.json**/Conciliacao/dados/****/artificial_token_frequency.txt(if derived from real data)
✅ Recommended approach:
- Commit only the folder structure (empty folders with
.gitkeep) and/or an anonymized sample (e.g.,sample_data.csv).
❌ Do not version:
**/*.npy**/*Embeddings*.json**/*Embeddings*.npy**/Embeddings/**
Reason: large files derived from data that do not add traceability in Git.
❌ Do not version:
partial_checkpoint.jsonpartial_checkpoint.npycheckpoint_partial_*.npy- any checkpoint files created for execution resumption
Reason: temporary, machine- and execution-specific artifacts.
❌ Recommended not to version:
**/*.pkl**/*.joblib
Alternatives:
- publish models via a Model Registry or GitHub Releases
- version only the training pipeline and instructions to reproduce
❌ Do not version:
catboost_info/events.out.tfevents*tmp/- training-generated
*.tsv .ipynb_checkpoints/__pycache__/,*.pyc
To facilitate reproducibility, keep the following folders in the repository (use .gitkeep):
Conciliacao/dados/
Conciliacao/Embeddings/
XGBoost-Production-Test/Dados/
XGBoost-Production-Test/Dados/Embeddings/
docs/ (optional, for figures/screenshots)
# =========================
# DATA (LGPD / do not version)
# =========================
**/Dados/**/*.csv
**/Dados/**/*.json
**/Conciliacao/dados/**
**/dados/**/*.csv
**/dados/**/*.json
# =========================
# EMBEDDINGS / VECTORS
# =========================
**/*.npy
**/Embeddings/**
**/*Embeddings*.json
**/*Embeddings*.npy
# =========================
# CHECKPOINTS
# =========================
**/checkpoint*.json
**/checkpoint*.npy
# =========================
# TRAINED MODELS
# =========================
**/*.pkl
**/*.joblib
# =========================
# LOGS / TRAINING ARTIFACTS
# =========================
**/catboost_info/
**/events.out.tfevents*
**/tmp/
**/*.tsv
# =========================
# JUPYTER / PYTHON
# =========================
.ipynb_checkpoints/
__pycache__/
*.pyc
.venv/
.env- Place the data locally in
Dados/orConciliacao/dados/Anonymous_*/. - Run scripts in
Scripts/to:- generate features (
gerar_features_*.py) - generate embeddings (
gerar_embeddings_*.py)
- generate features (
- Use notebooks in
Notebooks/to train and evaluate models (XGBoost / RF / LGBM / CatBoost). - To serve the model, use
XGBoost-Production-Test/(Docker + app).
Dataset used in this project:
https://huggingface.co/datasets/Anonymous/