This project implements an automatic medical specialty classification system using advanced Natural Language Processing (NLP) and Artificial Intelligence techniques. The main objective is to analyze clinical histories in text format and predict their corresponding medical specialty, facilitating the classification and organization of medical records.
The system combines BioBERT (a language model pre-trained on biomedical literature) with classical machine learning algorithms to create a classifier that:
- 📋 Processes complete medical transcriptions
- 🧬 Extracts semantic features from medical language
- 🎯 Predicts medical specialty with confidence levels
- 📊 Distinguishes among multiple specialties (multi-class classification)
- BioBERT v1.1: Transformer model specialized in biomedical domain
- PyHealth: Framework for healthcare data analysis
- Scikit-learn: Machine learning model implementation
- PyTorch: Deep learning framework
- Transformers (Hugging Face): Library for pre-trained language models
- Pandas & NumPy: Data manipulation and analysis
Clinical History (Text)
↓
BioBERT Tokenizer → Embeddings (768 dimensions)
↓
Logistic Regression → Multi-class Classification
↓
Prediction + Probabilities per Specialty
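The inference path in the diagram above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: `predict_specialty` and `toy_embed` are hypothetical names, and `toy_embed` is a keyword-based stand-in for the BioBERT embedding step so that the example runs without downloading a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


def predict_specialty(note, embed_fn, clf, encoder):
    """Return (specialty, probability) pairs, most likely first."""
    vector = np.asarray(embed_fn(note)).reshape(1, -1)  # text -> (1, 768)
    probs = clf.predict_proba(vector)[0]                 # one probability per class
    names = encoder.inverse_transform(clf.classes_)      # encoded ids -> specialty names
    order = np.argsort(probs)[::-1]                      # highest probability first
    return [(str(names[i]), float(probs[i])) for i in order]


def toy_embed(note):
    # Hypothetical stand-in for BioBERT: a constant 768-d vector chosen by keyword.
    shift = 2.0 if "heart" in note.lower() else -2.0
    return np.full(768, shift)


# Synthetic training data: two well-separated clusters of 768-d vectors.
rng = np.random.default_rng(0)
labels = np.repeat(["Cardiovascular / Pulmonary", "Surgery"], 20)
X = rng.normal(size=(40, 768)) + np.where(labels == "Surgery", -2.0, 2.0)[:, None]

encoder = LabelEncoder()
clf = LogisticRegression(max_iter=1000).fit(X, encoder.fit_transform(labels))

ranked = predict_specialty("Heart failure with atrial fibrillation.", toy_embed, clf, encoder)
for name, prob in ranked:
    print(f"{name}: {prob:.1%}")
```

In the real pipeline, `embed_fn` would be replaced by a function that runs the BioBERT tokenizer and model on the note.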
- Dataset: Medical Transcriptions (Kaggle)
- Source: mtsamples.csv
- Content: Real medical transcriptions with labeled specialties
- Null data cleaning
- Selection of the 5 most frequent specialties
- Filtering of complete clinical histories
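The cleaning steps above might look like the following pandas sketch. The column names `transcription` and `medical_specialty` are assumptions about the layout of mtsamples.csv, `prepare_dataset` is a hypothetical helper, and "complete" is interpreted here as "non-empty transcription text"; the notebook may apply a different criterion.

```python
import pandas as pd

TEXT_COL = "transcription"        # assumed column name
LABEL_COL = "medical_specialty"   # assumed column name


def prepare_dataset(df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Drop incomplete rows and keep only the top_n most frequent specialties."""
    # Null data cleaning: a row needs both a transcription and a label.
    df = df.dropna(subset=[TEXT_COL, LABEL_COL]).copy()
    # Normalize surrounding whitespace (assumption: values may be padded).
    df[LABEL_COL] = df[LABEL_COL].str.strip()
    df[TEXT_COL] = df[TEXT_COL].str.strip()
    # Keep only rows whose transcription is non-empty after stripping.
    df = df[df[TEXT_COL].str.len() > 0]
    # Select the top_n most frequent specialties.
    top = df[LABEL_COL].value_counts().nlargest(top_n).index
    return df[df[LABEL_COL].isin(top)].reset_index(drop=True)


# Usage (assumes mtsamples.csv is in the working directory):
# clean = prepare_dataset(pd.read_csv("mtsamples.csv"))
# print(clean[LABEL_COL].value_counts())
```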
- Tokenization with BioBERT tokenizer
- Maximum length: 128 tokens
- Extraction of 768-dimensional vector per document
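A minimal sketch of the feature extraction step, assuming the `dmis-lab/biobert-v1.1` checkpoint from the Hugging Face Hub and [CLS]-token pooling. The README only states that one 768-dimensional vector is extracted per document, so the pooling strategy is an assumption (mean pooling over tokens is a common alternative), and the function names are illustrative.

```python
import numpy as np

MODEL_NAME = "dmis-lab/biobert-v1.1"  # assumed checkpoint name
MAX_LENGTH = 128                      # maximum length stated in the README


def cls_pool(last_hidden_state):
    """Keep the first ([CLS]) token vector: (batch, seq, hidden) -> (batch, hidden)."""
    return np.asarray(last_hidden_state)[:, 0, :]


def load_biobert(name=MODEL_NAME):
    """Load tokenizer and model; imported lazily because the download is large."""
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()  # inference mode, no dropout
    return tokenizer, model


def embed_texts(texts, tokenizer, model, max_length=MAX_LENGTH):
    """Tokenize (truncating to max_length) and return one vector per text."""
    import torch
    encoded = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():  # no gradients needed for feature extraction
        output = model(**encoded)
    return cls_pool(output.last_hidden_state.cpu().numpy())


# Usage (downloads the model on first call):
# tokenizer, model = load_biobert()
# vectors = embed_texts(["Patient with coronary artery disease."], tokenizer, model)
# vectors.shape  # (1, 768) for a BERT-base sized model
```

Note that truncating to 128 tokens means only the beginning of a long transcription contributes to its vector.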
- Algorithm: Logistic Regression with class_weight='balanced'
- Split: 80% training, 20% testing
- Label encoding with LabelEncoder
- Metrics: Precision, Recall, F1-Score
- Prediction with confidence levels per category
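The training and evaluation steps listed above could be sketched with scikit-learn as follows. The random vectors stand in for BioBERT embeddings so that the example runs on its own. The stratified split, the fixed `random_state`, and `max_iter=1000` are assumptions added for reproducibility and convergence; they are not stated in this README.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def train_and_evaluate(X, labels, seed=42):
    """Encode labels, split 80/20, fit the classifier, and build a metrics report."""
    encoder = LabelEncoder()
    y = encoder.fit_transform(labels)  # specialty names -> integer ids
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed  # stratify: assumption
    )
    # class_weight="balanced" reweights classes inversely to their frequency.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)
    report = classification_report(
        y_test, clf.predict(X_test), target_names=encoder.classes_, zero_division=0
    )
    return clf, encoder, report


# Synthetic stand-in for 768-d document embeddings (not real data).
rng = np.random.default_rng(0)
labels = np.repeat(["Radiology", "Surgery"], 50)
X = rng.normal(size=(100, 768)) + (labels == "Surgery")[:, None] * 2.0

clf, encoder, report = train_and_evaluate(X, labels)
print(report)                          # precision, recall, F1-score per specialty
print(clf.predict_proba(X[:1])[0])     # confidence per category for one document
```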
pip install pyhealth "pandas==1.5.3" "numpy<2.0.0"
pip install transformers torch
- Clone the repository:
git clone https://github.com/tulio3101/PTIA.git
cd PTIA
- Open the notebook:
jupyter notebook PTIA_PROYECTO_FINAL.ipynb
- Execute cells sequentially to:
- Download the dataset automatically
- Train the model
- Make predictions
The model is capable of classifying clinical histories into the following main specialties:
- Surgery
- Orthopedic
- Cardiovascular / Pulmonary
- Radiology
- Gastroenterology
# Sample medical note
note = """CHIEF COMPLAINT: Shortness of breath and palpitations.
HISTORY: Patient with coronary artery disease.
IMPRESSION: Atrial Fibrillation, Heart Failure."""
# System predicts: Cardiovascular / Pulmonary (with % confidence)
Institution: Escuela Colombiana de Ingeniería Julio Garavito
Course: Principles and Technologies of Artificial Intelligence
Period: 2025-2
Type: Final Project
✅ Develop an AI model for disease prediction
✅ Apply preprocessing and data cleaning techniques
✅ Implement machine learning with clinical data
✅ Evaluate and validate the model with standard metrics
✅ Reinforce theoretical and practical concepts from the course
- @tulio3101 - Main development
- @sebasPuentes - Project collaborator
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This project is for academic and research purposes. It should not be used for actual medical diagnoses without proper professional supervision.