Predicting TCS stock price movements using classical ML algorithms: regression, classification, and sentiment analysis combined.
- Overview
- Key Results
- Project Structure
- Dataset
- Models Implemented
- Sentiment Analysis Pipeline
- Feature Engineering
- Getting Started
- Model Comparison
- Key Findings
The stock market is highly volatile and nonlinear. Classical ML models struggle with it, but with the right feature engineering they can still extract meaningful signals.
This project applies 6 ML algorithms (3 regression + 3 classification) to TCS (Tata Consultancy Services) historical stock data, enriched with news sentiment scores from a fine-tuned FinBERT model. Each script is self-contained: load data → preprocess → train → evaluate.
- Predict the next day's opening price (regression)
- Predict whether price will go UP or DOWN the next day (classification)
- Measure the impact of lag features and sentiment on prediction quality
- Compare all models side-by-side with a visualizer script
| Model | R² Score | RMSE |
|---|---|---|
| Linear Regression | 0.9848 | 49.81 (Best) |
| Random Forest Regressor | 0.6836 | 227.71 |
| KNN Regressor | 0.3824 | 318.15 |
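For reference, the two regression metrics in the table above can be computed as in the sketch below. It uses only the standard library and made-up numbers. The helper names `rmse` and `r2_score` are illustrative and not taken from the project scripts.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical error size, in the target's own units."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values, not project data.
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [101.0, 101.0, 105.0, 105.0]
print(rmse(actual, predicted))
print(r2_score(actual, predicted))
```

RMSE is expressed in price units, so it should be read against the price level of the stock. R² compares the model with always predicting the mean of the test targets.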
| Model | Accuracy | Notes |
|---|---|---|
| Random Forest | 75.68% (Best) | Highest overall |
| Logistic Regression | 75.09% | Best precision/recall balance |
| KNN Classifier | 70.99% | Improved significantly with lag features |
Baseline: ~50% (coin flip). All models beat the baseline, with Random Forest and Logistic Regression reaching ~75%.
Stock-Market-ML/
│
├── Data
│   └── Stocks_TCS2.csv                           # Main dataset: OHLCV + sentiment scores
│
├── Models
│   ├── StockMarket_LinearRegression.py           # Linear regression for price prediction
│   ├── StockMarket_LogisticRegression.py         # Logistic regression for direction (+ ROC curve, confusion matrix)
│   ├── StockMarket_KNN.py                        # KNN classifier
│   ├── StockMarket_KNNReggresor.py               # KNN regressor
│   ├── StockMarket_RandomForestClassifier.py     # Random forest classifier with decision tree visualization
│   └── StockMarket_RandomForestRegressor.py      # Random forest regressor
│
├── Data Pipeline
│   ├── textb.py                                  # Downloads TCS stock data via yfinance
│   ├── merge.py                                  # Merges stock data with FinBERT sentiment scores
│   └── remover.py                                # Removes rows with zero sentiment values
│
├── Analysis
│   └── comparizer.py                             # Side-by-side bar charts for all model results
│
├── requirements.txt
└── README.md
File: Stocks_TCS2.csv
| Column | Description |
|---|---|
| Close | Closing price |
| High | Daily high price |
| Low | Daily low price |
| Open | Opening price |
| Volume | Trading volume |
| Sentiment | FinBERT sentiment score from TCS news headlines (-1 to +1) |
- Source: TCS.NS historical data via yfinance (2015 to 2025)
- News Source: CF-AN equities TCS headlines dataset
- Sentiment Model: ProsusAI/FinBERT, a BERT model fine-tuned on financial text
Positive headline → +score
Negative headline → -score
Neutral headline → 0
Daily sentiment = mean of all headline scores for that trading day
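The scoring rule above can be sketched with the standard library alone. This is a minimal illustration, assuming each headline has already been classified into a label plus a confidence value. The `(date, label, confidence)` tuple layout and the helper names are hypothetical, not taken from `merge.py`.

```python
from collections import defaultdict
from statistics import mean

def signed_score(label: str, confidence: float) -> float:
    """Map a (label, confidence) pair to a score in [-1, 1]."""
    label = label.lower()
    if label == "positive":
        return confidence
    if label == "negative":
        return -confidence
    return 0.0  # neutral headlines contribute zero

def daily_sentiment(rows):
    """rows: iterable of (date, label, confidence); returns {date: mean score}."""
    by_day = defaultdict(list)
    for date, label, confidence in rows:
        by_day[date].append(signed_score(label, confidence))
    return {date: mean(scores) for date, scores in by_day.items()}

# Made-up headlines for two trading days.
headlines = [
    ("2024-01-02", "positive", 0.90),
    ("2024-01-02", "negative", 0.50),
    ("2024-01-03", "neutral", 0.80),
]
daily = daily_sentiment(headlines)
print(daily)
```

Under this rule, a day with only neutral headlines averages to exactly 0, which is indistinguishable from a day with no news at all.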
All regression models predict the next day's opening price using Open.shift(-1) as the target.
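A minimal sketch of how such a target can be built with pandas, using a toy frame in place of `Stocks_TCS2.csv`. The chronological 75/25 split is an assumption for illustration; the project's actual split may differ.

```python
import pandas as pd

# Toy frame standing in for Stocks_TCS2.csv (values are made up).
stocks = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0, 105.0, 107.0],
    "Close": [101.0, 101.5, 104.0, 106.0, 108.0],
})

# Tomorrow's open, aligned to today's row.
stocks["target"] = stocks["Open"].shift(-1)

# The final row has no "next day", so its target is NaN and is dropped.
stocks = stocks.dropna(subset=["target"])

# Assumed chronological split: train on the past, test on the most recent rows.
split = int(len(stocks) * 0.75)
train, test = stocks.iloc[:split], stocks.iloc[split:]
print(train)
print(test)
```

Keeping the split in time order avoids training on rows that come after the rows being predicted.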
Linear Regression:
Target: Next day's Open price
Features: OHLCV columns
Result: R² = 0.9848, RMSE = 49.81
Note: Lag features slightly hurt LR performance, so they were dropped for this model
KNN Regressor:
Target: Next day's Open price
Features: OHLCV + lag features
n_neighbors: 10
Scaler: StandardScaler (required for KNN)
Result: R² = 0.3824, RMSE = 318.15
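KNN relies on distances, so a column with a large numeric range (such as Volume) would dominate unscaled features. The sketch below shows one way to combine `StandardScaler` and `KNeighborsRegressor(n_neighbors=10)` in a scikit-learn pipeline. The data is synthetic and the 150/50 split is an assumption. It is not the project's script.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: a price-like column (~3000) and a volume-like column (~1e6).
price = 3000 + rng.normal(0, 50, size=200)
volume = 1_000_000 + rng.normal(0, 200_000, size=200)
X = np.column_stack([price, volume])
y = price + rng.normal(0, 5, size=200)  # toy target that depends on price only

# Putting the scaler inside the pipeline means it is fit on training rows only.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
model.fit(X[:150], y[:150])
predictions = model.predict(X[150:])
print(predictions[:5])
```

Fitting the scaler inside the pipeline keeps test-set statistics out of the training step.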
Random Forest Regressor:
Target: Next day's Open price
Features: OHLCV + lag features
n_estimators: 100
Result: R² = 0.6836, RMSE = 227.71
All classification models predict 1 (price UP) or 0 (price DOWN) for the next day.
Target = (Open.shift(-1) > Open).astype(int)

KNN Classifier:
n_neighbors: 10
Scaler: StandardScaler
Accuracy: ~70.99%
Impact of lag features: about +20 percentage points of accuracy (from ~50% to ~70%)
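The direction target shared by the classifiers can be sketched as follows on a toy frame. One detail worth handling explicitly: comparing against the missing "next day" of the final row evaluates to False, so that row would silently be labelled 0 unless it is dropped. Dropping it here is a suggestion, not necessarily what the scripts do.

```python
import pandas as pd

# Toy opening prices (made up).
stocks = pd.DataFrame({"Open": [100.0, 102.0, 101.0, 105.0]})

next_open = stocks["Open"].shift(-1)

# 1 = next day's open is higher than today's, 0 = lower or equal.
stocks["target"] = (next_open > stocks["Open"]).astype(int)

# The last row has no next day: NaN > x is False, so it would be labelled 0.
stocks = stocks[next_open.notna()]
print(stocks)
```

With this definition, a day whose next open is unchanged also counts as DOWN.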
Logistic Regression:
max_iter: 1200
class_weight: balanced
Outputs: Confusion Matrix + ROC Curve
Accuracy: ~75.09%
Note: Lag features had minimal impact on LogReg
Random Forest Classifier:
n_estimators: 200
class_weight: balanced
Outputs: Decision tree visualization
Accuracy: ~75.68%
Impact of lag features: large, from ~55% to ~75%
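Both Logistic Regression and Random Forest use `class_weight: balanced`. In scikit-learn this mode weights each class by `n_samples / (n_classes * count)`, so the rarer class counts for more during training. The standard-library sketch below reproduces that arithmetic on toy labels. The helper name is illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weight n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels: 6 UP days (1) and 4 DOWN days (0).
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
weights = balanced_class_weights(y)
print(weights)
```

When UP and DOWN days are roughly equally frequent, both weights are close to 1 and the setting has little effect.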
The sentiment pipeline uses FinBERT to score TCS news headlines and append daily sentiment to the stock dataset.
textb.py   → Downloads TCS.NS price data from yfinance → saves TCS_STOCKS!2.csv
    ↓
merge.py   → Loads headlines CSV + stock CSV
           → Filters stocks to only dates with news coverage
           → Runs FinBERT on each headline
           → Averages daily sentiment scores
           → Saves Stocks_TCS2_date.csv
    ↓
remover.py → Removes rows where sentiment = 0 (no news day)
           → Saves Stocks_TCS4.csv
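The merge and cleaning steps can be sketched with pandas as below. The column names `Date` and `Sentiment` and the toy values are assumptions for illustration. The real scripts may use different names and file handling.

```python
import pandas as pd

# Toy inputs; column names "Date" and "Sentiment" are assumed.
prices = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Open": [100.0, 102.0, 101.0, 105.0],
})
daily_sentiment = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "Sentiment": [0.35, 0.0, -0.20],
})

# Inner join keeps only trading days that have news coverage.
merged = prices.merge(daily_sentiment, on="Date", how="inner")

# Drop rows whose daily sentiment is exactly zero, as remover.py is described as doing.
cleaned = merged[merged["Sentiment"] != 0].reset_index(drop=True)
print(cleaned)
```

After filtering, consecutive rows are no longer guaranteed to be consecutive trading days, so `shift(-1)` on the cleaned frame refers to the next retained row.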
⚠️ GPU recommended: merge.py runs FinBERT with device=0 (CUDA). Switch to device=-1 for CPU.
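Rather than editing the device by hand, the choice can be made at runtime. The sketch below is one possible approach, not the code in `merge.py`. The helper names are hypothetical, and `ProsusAI/finbert` is the model id as published on the Hugging Face Hub.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when available, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1

def build_sentiment_pipeline():
    """Create a FinBERT text-classification pipeline on the chosen device."""
    # Imported lazily; the model weights are downloaded on first use.
    from transformers import pipeline
    return pipeline("text-classification", model="ProsusAI/finbert", device=pick_device())

device = pick_device()
print("Using device:", device)
```

Calling `build_sentiment_pipeline()` returns a callable that takes a headline string, or a list of strings, and returns label and score dictionaries.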
Lag features are the most impactful preprocessing step in this project:
stocks['lag_1'] = stocks['Open'].shift(1) # Yesterday's open
stocks['lag_2'] = stocks['Open'].shift(2) # Two days ago open
stocks['rolling_3'] = stocks['Open'].rolling(3).mean() # 3-day moving average
stocks['per%_change'] = stocks['Close'].pct_change() # Daily % change

| Model | Without Lag Features | With Lag Features | Δ |
|---|---|---|---|
| KNN Classifier | ~50% | ~70% | +20% |
| Random Forest Classifier | ~55% | ~75% | +20% |
| Logistic Regression | ~75% | ~75% | ~0% |
| Linear Regression | 0.985 R² | 0.985 R² | Slightly worse |
| KNN Regressor | baseline | +0.01 R², -800 RMSE | Slight improvement |
| Random Forest Regressor | baseline | +0.003 R², -1000 RMSE | Marginal improvement |
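The lag and rolling columns leave missing values in the first rows, which many scikit-learn estimators reject. The sketch below applies the same four feature definitions to a toy frame and drops the incomplete rows. The `dropna` step is an assumption about how the scripts handle this.

```python
import pandas as pd

# Toy frame (made-up values) with the two columns the features need.
stocks = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0, 105.0, 107.0],
    "Close": [101.0, 101.5, 104.0, 106.0, 108.0],
})

stocks["lag_1"] = stocks["Open"].shift(1)               # yesterday's open
stocks["lag_2"] = stocks["Open"].shift(2)               # open two days ago
stocks["rolling_3"] = stocks["Open"].rolling(3).mean()  # mean of today's and the two previous opens
stocks["per%_change"] = stocks["Close"].pct_change()    # daily % change of the close

# The first two rows lack a full history, so they contain NaN and are removed.
features = stocks.dropna().reset_index(drop=True)
print(features)
```

All four features use only values known by the end of the current day, so they do not leak the next day's open into the inputs.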
Python >= 3.11
CUDA GPU (optional, for faster FinBERT inference)

git clone https://github.com/your-username/Stock-Market-ML.git
cd Stock-Market-ML
pip install -r requirements.txt

Dependencies:
pandas
numpy
scikit-learn
matplotlib
transformers
yfinance
statistics
torch # for FinBERT GPU inference

Note: The requirements.txt lists Transformer; install it as transformers (lowercase) from Hugging Face.
# Run any individual model
python StockMarket_LinearRegression.py
python StockMarket_RandomForestClassifier.py
python StockMarket_LogisticRegression.py
# Compare all results visually
python comparizer.py
# Re-build the dataset from scratch (requires news headlines CSV)
python textb.py # Download stock data
python merge.py # Merge with sentiment
python remover.py # Clean zeros

All model scripts expect Stocks_TCS2.csv to be present. The pre-built dataset is already included in the repo.
Run comparizer.py to generate side-by-side bar charts for both regression (RMSE) and classification (accuracy):
Regression:
| KNN | RF | Lin Reg |
|---|---|---|
| 318.15 | 227.7 | 49.81 (Best) |

Classification:
| KNN | Log Reg | Random Forest |
|---|---|---|
| 70.98% | 75.08% | 75.68% (Best) |
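A comparison chart of this kind can be sketched with matplotlib as below. This is not the code of `comparizer.py`. The numbers are copied from the Key Results tables, and the output file name `model_comparison.png` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt

# Values copied from the Key Results tables.
rmse = {"KNN": 318.15, "Random Forest": 227.71, "Linear Regression": 49.81}
accuracy = {"KNN": 70.99, "Logistic Regression": 75.09, "Random Forest": 75.68}

fig, (ax_reg, ax_clf) = plt.subplots(1, 2, figsize=(10, 4))

ax_reg.bar(list(rmse), list(rmse.values()))
ax_reg.set_title("Regression (RMSE, lower is better)")
ax_reg.set_ylabel("RMSE")

ax_clf.bar(list(accuracy), list(accuracy.values()))
ax_clf.set_title("Classification (accuracy, higher is better)")
ax_clf.set_ylabel("Accuracy (%)")
ax_clf.set_ylim(0, 100)

fig.tight_layout()
fig.savefig("model_comparison.png")
```

Separate axes are used because RMSE and accuracy have different units and opposite directions of "better".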
1. Linear Regression dominates regression tasks. With an R² of 0.985, linear regression is, surprisingly, the best price predictor. Stock prices on adjacent days are highly correlated, which is exactly what LR captures well. Lag features actually hurt it slightly, suggesting multicollinearity.
2. Lag features are critical for tree-based and KNN classifiers. Adding just two lag values and a rolling mean pushed KNN and Random Forest from ~50-55% (barely better than random) to ~70-75%. This is the single biggest lever in the project.
3. FinBERT sentiment adds signal. Incorporating news sentiment into the feature set provides additional predictive information beyond price history alone, particularly on high-news days.
4. SVM underperformed. The SVM (with RBF kernel, C=100, gamma=0.1) performed worse than random on this dataset, noted in the script itself as "worse than deciding on coin toss." Kernel and hyperparameter tuning would be needed for improvement.
5. The 75% ceiling. Both Logistic Regression and Random Forest plateau around 75% accuracy for direction prediction. Breaching this ceiling likely requires richer inputs (options data, order book depth) or LSTM-based sequence modeling.
- Add more technical indicators (RSI, MACD, Bollinger Bands)
- Try LSTM or Transformer-based sequence models
- Expand sentiment to include social media (Twitter/Reddit)
- Hyperparameter tuning via GridSearchCV
- Walk-forward validation instead of a single train/test split
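As an illustration of the last item, walk-forward validation with an expanding training window can be sketched in plain Python. The function name and parameters are hypothetical. scikit-learn's `TimeSeriesSplit` offers a ready-made splitter along similar lines.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) with an expanding training window.

    Every test block lies strictly after its training block, so no fold
    trains on data from the future.
    """
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        test_end = train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Averaging a metric over the folds gives an estimate that depends less on which single period happens to fall into the test set.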