Conciliation

Article reproducibility – Classification of the Conciliation Profile in Initial Petitions in the Brazilian Judiciary

CLASSIFYING-CONCILIATION-PROFILE

This repository organizes a pipeline to classify the conciliation profile (e.g., cases from Anonymous), combining:

Preprocessing and feature generation (regex / tabular attributes / LLM-based features)
Embeddings (e.g., Gemini / SBERT) with partial checkpointing
Training and evaluation using classical models (XGBoost / RandomForest / LightGBM / CatBoost)
A production test version (app + Docker) to serve the trained model

⚠️ Important (LGPD / sensitive data):
This repository is structured to avoid versioning real data (CSV/JSON), embeddings, and heavy/derived artifacts (PKL/NPY/checkpoints).
The documentation below clearly states what remains local versus what is pushed to GitHub.

Repository structure (local view)

Below is a view of the local environment, including data and artifacts.
It is provided for organizational reference only.

Structure (summary)

CLASSIFYING-CONCILIATION-PROFILE/
├─ Abordagem-com-LLM/
│  ├─ Dados/
│  │  └─ dados_processos_Anonymous_14052025.json
│  ├─ Conciliacao.ipynb
│  ├─ dados_processos_completo.csv
│  └─ embeddings_Anonymous.npy
├─ Analise-Estatistica-do-Corpus/
│  └─ Statistical-Corpus-Analysis.ipynb
├─ Conciliacao/
│  ├─ dados/
│  │  ├─ Anonymous_03072025/
│  │  │  ├─ dados_processos_Anonymous_03072025_V1.csv
│  │  │  ├─ dados_processos_Anonymous_03072025_V2.csv
│  │  │  ├─ dados_processos_Anonymous_03072025_V3.csv
│  │  │  ├─ dados_processos_Anonymous_03072025.json
│  │  │  ├─ dados_processos_Anonymous_03072025_V1_enriched.json
│  │  │  └─ artificial_token_frequency.txt
│  │  └─ Anonymous_14052025/
│  │     ├─ dados_processos_Anonymous_14052025_V1.csv
│  │     ├─ dados_processos_Anonymous_14052025_V2.csv
│  │     ├─ dados_processos_Anonymous_14052025_V3.csv
│  │     ├─ dados_processos_Anonymous_14052025.json
│  │     ├─ dados_processos_Anonymous_14052025_V1_enriched.json
│  │     └─ artificial_token_frequency.txt
│  └─ Embeddings/
│     ├─ partial_checkpoint.npy
│     ├─ checkpoint_partial_03072025.npy
│     ├─ dados_processos_Anonymous_14052025_rejected.json
│     ├─ dados_processos_Anonymous_14052025_V1_enriched_Embeddings.json
│     ├─ dados_processos_Anonymous_14052025_V1_enriched_Embeddings.npy
│     └─ ...
├─ Scripts/
│  ├─ gerar_embeddings_gemini.py
│  ├─ gerar_features_gemini.py
│  ├─ gerar_features_sentence_transformers.py
│  ├─ classificar_com_gemini.py
│  ├─ partial_checkpoint.json
│  └─ br-tjgo-cld-02-09b1b22e65b3.json
├─ Notebooks/
│  ├─ XGBoost/
│  │  ├─ XGBoost.ipynb
│  │  └─ xgb_classification_model.pkl
│  ├─ Random-Forest/
│  │  ├─ Random-Forest.ipynb
│  │  └─ rf_classification_model.pkl
│  ├─ LightGBM/
│  │  ├─ LightGBM.ipynb
│  │  └─ lightgbm_model_v1.pkl
│  ├─ CatBoost/
│  │  ├─ CatBoost.ipynb
│  │  ├─ CatBoost.pkl
│  │  └─ catboost_info/ (logs/artifacts)
│  ├─ pre-processing.ipynb
│  └─ regex_analysis.ipynb
└─ XGBoost-Production-Test/
   ├─ App/
   │  ├─ app.py
   │  ├─ xgboost_model.pkl
   │  ├─ randomforest_model.pkl
   │  ├─ templates/index.html
   │  └─ static/Logo.png
   ├─ Dados/
   │  ├─ Embeddings/text_vectors.npy
   │  ├─ cleaned_Anonymous_processes.json
   │  ├─ Anonymous_processes_features.json
   │  └─ dados_processos_Anonymous_14052025.csv
   ├─ Notebooks/
   │  ├─ pre-processing-data.ipynb
   │  └─ xgboost.ipynb
   ├─ Dockerfile
   ├─ docker-compose.yml
   └─ requirements.txt

What should NOT be pushed to GitHub (and why)

1) Raw and derived data (LGPD / volume / reproducibility)

❌ Do not version:

**/Dados/**/*.csv
**/Dados/**/*.json
**/Conciliacao/dados/**
**/artificial_token_frequency.txt (if derived from real data)

✅ Recommended approach:

Commit only the folder structure (empty folders with .gitkeep) and/or an anonymized sample (e.g., sample_data.csv).

2) Embeddings and vectors (heavy + derived)

❌ Do not version:

**/*.npy
**/*Embeddings*.json
**/*Embeddings*.npy
**/Embeddings/**

Reason: large files derived from data that do not add traceability in Git.

3) Checkpoints and partial states

❌ Do not version:

partial_checkpoint.json
partial_checkpoint.npy
checkpoint_partial_*.npy
any checkpoint files created for execution resumption

Reason: temporary, machine- and execution-specific artifacts.

4) Trained models (PKL / binaries)

❌ Recommended not to version:

**/*.pkl
**/*.joblib

Alternatives:

publish models via a Model Registry or GitHub Releases
version only the training pipeline and instructions to reproduce

5) Training logs / execution artifacts

❌ Do not version:

catboost_info/
events.out.tfevents*
tmp/
training-generated *.tsv
.ipynb_checkpoints/
__pycache__/, *.pyc

Folders that SHOULD exist (even if empty)

To facilitate reproducibility, keep the following folders in the repository (use .gitkeep):

Conciliacao/dados/
Conciliacao/Embeddings/
XGBoost-Production-Test/Dados/
XGBoost-Production-Test/Dados/Embeddings/
docs/   (optional, for figures/screenshots)

Suggested `.gitignore`

# =========================
# DATA (LGPD / do not version)
# =========================
**/Dados/**/*.csv
**/Dados/**/*.json
**/Conciliacao/dados/**
**/dados/**/*.csv
**/dados/**/*.json

# =========================
# EMBEDDINGS / VECTORS
# =========================
**/*.npy
**/Embeddings/**
**/*Embeddings*.json
**/*Embeddings*.npy

# =========================
# CHECKPOINTS
# =========================
**/checkpoint*.json
**/checkpoint*.npy

# =========================
# TRAINED MODELS
# =========================
**/*.pkl
**/*.joblib

# =========================
# LOGS / TRAINING ARTIFACTS
# =========================
**/catboost_info/
**/events.out.tfevents*
**/tmp/
**/*.tsv

# =========================
# JUPYTER / PYTHON
# =========================
.ipynb_checkpoints/
__pycache__/
*.pyc
.venv/
.env

How to reproduce

Place the data locally in Dados/ or Conciliacao/dados/Anonymous_*/.
Run scripts in Scripts/ to:
- generate features (gerar_features_*.py)
- generate embeddings (gerar_embeddings_*.py)
Use notebooks in Notebooks/ to train and evaluate models (XGBoost / RF / LGBM / CatBoost).
To serve the model, use XGBoost-Production-Test/ (Docker + app).

Dataset

Dataset used in this project:
https://huggingface.co/datasets/Anonymous/

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
Abordagem-com-LLM		Abordagem-com-LLM
Análise-Estatística-do-Corpus		Análise-Estatística-do-Corpus
Conciliação/Notebooks		Conciliação/Notebooks
Scripts		Scripts
XGBoost-Producao-Teste		XGBoost-Producao-Teste
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Conciliation

CLASSIFYING-CONCILIATION-PROFILE

Repository structure (local view)

Structure (summary)

What should NOT be pushed to GitHub (and why)

1) Raw and derived data (LGPD / volume / reproducibility)

2) Embeddings and vectors (heavy + derived)

3) Checkpoints and partial states

4) Trained models (PKL / binaries)

5) Training logs / execution artifacts

Folders that SHOULD exist (even if empty)

Suggested `.gitignore`

How to reproduce

Dataset

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Conciliation

CLASSIFYING-CONCILIATION-PROFILE

Repository structure (local view)

Structure (summary)

What should NOT be pushed to GitHub (and why)

1) Raw and derived data (LGPD / volume / reproducibility)

2) Embeddings and vectors (heavy + derived)

3) Checkpoints and partial states

4) Trained models (PKL / binaries)

5) Training logs / execution artifacts

Folders that SHOULD exist (even if empty)

Suggested .gitignore

How to reproduce

Dataset

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Suggested `.gitignore`

Packages