A machine learning system that predicts whether a LinkedIn post will result in positive or negative PR, using Ridge, Logistic Regression, and XGBoost.
This project analyzes LinkedIn posts to classify them as generating positive or negative public relations outcomes, using a combination of:
- Gemini AI embeddings for semantic text understanding
- XGBoost classifier for robust prediction
- Engagement metrics & sentiment analysis for label generation
- Posts: ~1200 LinkedIn posts per company (primarily from 6 tech companies, including Google, Netflix, and Microsoft)
- Comments: ~6000 comments with engagement data
- Features: Text content, engagement metrics, media type, temporal patterns
Note: the sections below describe the original research/training workflow in attempt2.ipynb (Gemini embeddings + XGBoost). The deployed runtime backend is now services/ml_api, a provider-free TF-IDF recruiting-signal engine that needs no AI provider key. See docs/API_README.md for its endpoints (/health, /analyze, /analyze/compare, /history) and RENDER_DEPLOY.md for deployment. The Next.js app reaches it via the same-origin proxy app/api/analyze (ML_API_URL).
LinkedIn Posts → Label Generation → Feature Engineering → Model Training → Prediction
      ↓         (VADER + Engagement)         ↓                   ↓
   Comments                          Gemini Embeddings        XGBoost
                                             +
                                     Metadata Features
- Loaded posts and comments datasets
- Merged posts with comment sentiment
- Explored engagement patterns
- Positive PR: High engagement + positive reactions + positive sentiment
- Negative PR: Low engagement OR negative sentiment OR poor reaction ratio
- Uses VADER sentiment analysis on comments
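The labeling rule above can be sketched as follows. This is a minimal illustration, not the notebook's code: the names `PostStats` and `label_post`, the field names, and the engagement and reaction thresholds are all assumptions made for the example. The sentiment input is assumed to be the mean VADER compound score over a post's comments.

```python
from dataclasses import dataclass


@dataclass
class PostStats:
    """Aggregated signals for one post (field names are hypothetical)."""
    engagement: float               # e.g. reactions + comments + reposts
    positive_reaction_ratio: float  # share of reactions that are positive, 0..1
    mean_comment_compound: float    # mean VADER compound score of comments, -1..1


def label_post(
    stats: PostStats,
    min_engagement: float = 50.0,     # assumed threshold, tune on real data
    min_reaction_ratio: float = 0.6,  # assumed threshold, tune on real data
    min_compound: float = 0.05,       # VADER's conventional "positive" cutoff
) -> int:
    """Return 1 (positive PR) only if all three conditions hold, else 0.

    Failing any single condition yields 0, which mirrors the
    "low engagement OR negative sentiment OR poor reaction ratio" rule.
    """
    is_positive = (
        stats.engagement >= min_engagement
        and stats.positive_reaction_ratio >= min_reaction_ratio
        and stats.mean_comment_compound >= min_compound
    )
    return int(is_positive)
```

Because the labels come from thresholds like these rather than human annotation, the chosen cutoffs directly shape the class balance the model is trained on.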
Text Features via Gemini:
- 768-dimensional embeddings capturing semantic meaning
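A hedged sketch of an embedding helper is shown below. It assumes the `google-generativeai` package and the `models/text-embedding-004` model (the notebook may use a different model), and it falls back to a deterministic placeholder vector when `GEMINI_API_KEY` is not set. The placeholder scheme is an illustration only; its values carry no semantic meaning.

```python
import hashlib
import os
import random

EMBEDDING_DIM = 768  # dimensionality stated in this README


def get_gemini_embedding(text: str) -> list:
    """Return a 768-dim embedding, or a deterministic placeholder without a key."""
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        # Placeholder path: seed a PRNG from the text so the same post
        # always maps to the same vector. For demonstration only.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        return [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]

    # Imported lazily so the placeholder path works without the package.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    # Model name is an assumption; swap in whichever model the notebook uses.
    response = genai.embed_content(model="models/text-embedding-004", content=text)
    return list(response["embedding"])
```

Caching the returned vectors (as the repository does with post_embeddings.npy) avoids repeated API calls for the same posts.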
Metadata Features:
- Text characteristics: length, emojis, URLs, hashtags, mentions
- Temporal: posting hour, day of week, month
- Media: type (image/article/none), count
- Engagement: comment sentiment scores
- Author: follower count
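The text and temporal features listed above could be extracted roughly as follows. This is a sketch under assumptions: the regular expressions, the feature order, and the optional `posted_at` parameter are illustrative choices, and the media, engagement, and author features are omitted because they come from dataset columns rather than from the post text.

```python
import re
from datetime import datetime
from typing import List, Optional

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji match over common Unicode emoji blocks (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extract_metadata_features(
    text: str, posted_at: Optional[datetime] = None
) -> List[float]:
    """Return [length, emojis, urls, hashtags, mentions, hour, weekday, month]."""
    # Assumption: when no timestamp is given, use the current time.
    posted_at = posted_at or datetime.now()
    return [
        float(len(text)),
        float(len(EMOJI_RE.findall(text))),
        float(len(URL_RE.findall(text))),
        float(len(HASHTAG_RE.findall(text))),
        float(len(MENTION_RE.findall(text))),
        float(posted_at.hour),
        float(posted_at.weekday()),  # Monday = 0
        float(posted_at.month),
    ]
```

At prediction time the feature order must match the order used when the scaler and model were fitted, so a real implementation should take that order from the training code.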
- XGBoost binary classifier
- Ridge and Logistic Regression models
- 80/20 train-test split
- Feature scaling with StandardScaler
- Class weighting for imbalanced data
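The README does not say which class weighting scheme the notebook uses. The sketch below shows two common options in plain Python: the "balanced" per-class weights formula (the same one scikit-learn documents for `class_weight="balanced"`) and the negative-to-positive ratio that XGBoost accepts as `scale_pos_weight`. The function names are chosen for this example.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def scale_pos_weight(labels: Sequence[int]) -> float:
    """Ratio of negative to positive examples for binary 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]
```

For example, with 80 negative and 20 positive posts the balanced weights are 0.625 and 2.5, and the ratio is 4.0. Compute these on the training split only, so the held-out 20% does not influence the weights.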
- Classification metrics (accuracy, precision, recall, F1)
- Confusion matrix visualization
- Feature importance analysis
- Sample predictions with confidence scores
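For reference, the classification metrics listed above can be derived from the confusion matrix counts. The sketch below is a dependency-free illustration for binary 0/1 labels; the notebook most likely uses scikit-learn's metric functions instead, and the function names here are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating label 1 as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1, returning 0.0 on empty denominators."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With imbalanced labels, precision and recall on the negative-PR class are usually more informative than accuracy alone.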
lyra_hackathon/
├── attempt2.ipynb # Main notebook with full implementation
├── data/ # LinkedIn posts and comments datasets
├── pr_classifier_model.pkl # Trained XGBoost model
├── feature_scaler.pkl # Feature scaler for preprocessing
├── post_embeddings.npy # Cached Gemini embeddings
├── post_type_encoder.pkl # Categorical encoder for post types
├── media_type_encoder.pkl # Categorical encoder for media types
└── README.md # This file
- Set your Gemini API key:
export GEMINI_API_KEY="your-api-key-here"
- Run the notebook:
jupyter notebook attempt2.ipynb
import joblib
import numpy as np
import google.generativeai as genai  # used by the embedding helper
# Load model and preprocessors
model = joblib.load('pr_classifier_model.pkl')
scaler = joblib.load('feature_scaler.pkl')
# NOTE: get_gemini_embedding and extract_metadata_features are helper
# functions that are not defined in this snippet; they are assumed to be
# available from the notebook (attempt2.ipynb).
# Generate embedding for new post
new_post_text = "Your LinkedIn post text here..."
embedding = get_gemini_embedding(new_post_text)
# Extract metadata features (text_length, emoji_count, etc.)
metadata = extract_metadata_features(new_post_text)
# Combine into one feature vector, in the same order used during training
features = np.concatenate([embedding, metadata])
features_scaled = scaler.transform([features])
prediction = model.predict(features_scaled)
# predict_proba returns one probability per class for each row
confidence = model.predict_proba(features_scaled)
print(f"PR Prediction: {'Positive' if prediction[0] == 1 else 'Negative'}")
print(f"Confidence: {confidence[0][prediction[0]]:.2%}")
The model achieves:
- Binary classification of PR sentiment
- Feature importance insights showing which factors drive positive/negative PR
- Combines deep learning (embeddings) with traditional ML (XGBoost)
Key predictive factors typically include:
- Comment sentiment scores
- Engagement metrics (reactions, comments, reposts)
- Text characteristics (length, emojis, URLs)
- Temporal patterns (posting time)
- Media presence and type
google-generativeai
xgboost
pandas
numpy
scikit-learn
vaderSentiment
matplotlib
seaborn
- Text embeddings are powerful: Gemini embeddings capture semantic nuances in post content
- Engagement patterns matter: Low engagement often correlates with negative PR
- Comment sentiment is predictive: Negative comments are strong indicators of PR issues
- Media enhances engagement: Posts with images/videos tend to perform better
- Combined approach works: Text semantics + metadata features yield robust predictions
- Pre-posting analysis: Predict PR impact before publishing
- Content optimization: Identify what makes posts resonate positively
- Crisis detection: Flag posts likely to generate negative PR
- Strategy refinement: Understand drivers of positive engagement
- Labels are generated automatically from engagement and sentiment (not manually labeled)
- Gemini API key required for embedding generation
- Model can be retrained on domain-specific data for better performance
- Placeholder embeddings used if API key not set (for demonstration)
- Incorporate image/video content analysis
- Add time-series modeling for trend prediction
- Include competitor post analysis
- Real-time monitoring dashboard
- Multi-class classification (positive/neutral/negative/crisis)
Created for Lyra Hackathon | December 2025
- Create a Python venv and install ML API deps:
python -m venv .venv
# Windows: .venv\Scripts\activate
source .venv/bin/activate
pip install -r services/ml_api/requirements.txt
- Install Node deps:
npm install
- Start both FastAPI + Next.js:
npm run dev
Or run separately:
npm run dev:ml # FastAPI at http://localhost:8000
npm run dev:web # Next.js at http://localhost:3000
Environment variables (.env.local) needed for the new pipeline:
ML_API_URL=http://localhost:8000
NEXT_PUBLIC_SUPABASE_URL=YOUR_SUPABASE_URL
NEXT_PUBLIC_SUPABASE_ANON_KEY=YOUR_SUPABASE_ANON_KEY
SUPABASE_SERVICE_ROLE_KEY=YOUR_SUPABASE_SERVICE_ROLE_KEY
Supabase schema for logging requests/responses: docs/supabase.sql (table analyses).
The Next.js AI layer uses the Vercel AI SDK (ai + @ai-sdk/openai +
@ai-sdk/google + zod), so the model provider is swappable. lib/ai/provider.ts
exposes getModel(), which resolves a model from the AI_MODEL env var
(default openai/gpt-4o-mini). The persona-critique / variant-eval client at
lib/google-ai/client.ts (legacy dir name, now provider-agnostic) uses this
resolver with generateObject + zod.
How a model is resolved:
- If AI_GATEWAY_API_KEY is set (or, on Vercel, OIDC enables the Gateway), the provider/model string is routed through the Vercel AI Gateway, which adds failover and cost tracking.
- Otherwise it falls back to a direct provider key. For the default openai/... model the key is read from OPENAI_API_KEY (via @ai-sdk/openai). A google/... model still works via @ai-sdk/google, reading GOOGLE_GENERATIVE_AI_API_KEY (preferred), with GEMINI_API_KEY as a fallback.
Switching providers/models is a one-line change: set AI_MODEL and supply the
matching provider key (or use the Gateway).
AI_MODEL=openai/gpt-4o-mini # default
OPENAI_API_KEY=... # direct OpenAI key (used by the default model)
AI_GATEWAY_API_KEY=... # optional: route any provider/model via Vercel AI Gateway
GOOGLE_GENERATIVE_AI_API_KEY=... # only if you switch AI_MODEL to google/... (GEMINI_API_KEY is a fallback)
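The resolution order described above can be summarized in a short sketch. The real logic lives in lib/ai/provider.ts (TypeScript); the Python below only mirrors the documented decision order for illustration. The function name `resolve_route`, its return shape, and the error raised when no key is found are assumptions, and the Vercel OIDC path is left out.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_MODEL = "openai/gpt-4o-mini"  # default stated in this README


def resolve_route(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str, str]:
    """Return (route, model, key_env_var) following the documented order."""
    env = os.environ if env is None else env
    model = env.get("AI_MODEL") or DEFAULT_MODEL

    # 1. A Gateway key takes priority and handles any provider/model string.
    if env.get("AI_GATEWAY_API_KEY"):
        return ("gateway", model, "AI_GATEWAY_API_KEY")

    # 2. Otherwise fall back to a direct key for the model's provider.
    provider = model.split("/", 1)[0]
    if provider == "openai":
        candidates = ["OPENAI_API_KEY"]
    elif provider == "google":
        candidates = ["GOOGLE_GENERATIVE_AI_API_KEY", "GEMINI_API_KEY"]
    else:
        raise ValueError(f"unsupported provider in this sketch: {provider}")

    for name in candidates:  # first non-empty key wins
        if env.get(name):
            return ("direct", model, name)
    raise RuntimeError(f"no API key set for {model}; expected one of {candidates}")
```

Passing a plain dict instead of reading `os.environ` makes the order easy to check, for example `resolve_route({"OPENAI_API_KEY": "k"})` resolves to the direct OpenAI route with the default model.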
Two independent limiters protect the public surface:
Next.js inbound limiter (lib/ratelimit.ts): an in-memory, per-client-IP
limiter applied to the public POST routes, including /api/analyze,
/api/gemini, /api/analyze-with-images, /api/ab-tests, /api/personas, and
/api/drafts. Over-limit requests get a 429 with a Retry-After header.
Configure with:
RATE_LIMIT_MAX=30 # max requests per client per window
RATE_LIMIT_WINDOW_MS=60000 # window length (ms)
Caveat: the in-memory store does not span serverless instances. For production scale, back it with Upstash (shared Redis).
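To make the inbound limiter's behavior concrete, here is a minimal in-memory sketch. It assumes a fixed-window strategy; the README does not say whether lib/ratelimit.ts uses fixed or sliding windows. The class name, method name, and injectable clock are all hypothetical. The defaults match RATE_LIMIT_MAX and RATE_LIMIT_WINDOW_MS above.

```python
import math
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Per-client fixed-window counter held in process memory."""

    def __init__(
        self,
        max_requests: int = 30,
        window_ms: int = 60_000,
        now_ms: Callable[[], float] = lambda: time.monotonic() * 1000,
    ) -> None:
        self.max_requests = max_requests
        self.window_ms = window_ms
        self._now_ms = now_ms
        # client id -> (window start in ms, requests counted in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def check(self, client_ip: str) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) and count the request if allowed."""
        now = self._now_ms()
        start, count = self._windows.get(client_ip, (now, 0))
        if now - start >= self.window_ms:
            start, count = now, 0  # previous window expired, start a new one
        if count >= self.max_requests:
            remaining_ms = start + self.window_ms - now
            return False, max(math.ceil(remaining_ms / 1000), 1)
        self._windows[client_ip] = (start, count + 1)
        return True, 0
```

A route handler would call `check()` with the client IP and, when it returns `(False, n)`, respond with status 429 and a `Retry-After: n` header. Because the counters live in one process, each serverless instance keeps its own state, which is the limitation the caveat above describes.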
There is also a separate outbound throttle on calls to the AI provider,
configured with GEMINI_RATE_LIMIT_MAX_REQUESTS (default 15) and
GEMINI_RATE_LIMIT_WINDOW_MS (default 60000).
The services/ml_api backend has no built-in limiter of its own; it is reached
only through the Next.js proxy (app/api/analyze), so the inbound limiter above
covers it. See RENDER_DEPLOY.md for deployment details.