
Inference Engine

Streamlit UI for redacting personally identifiable information (PII) from emails by proxying requests to a remote vLLM server. The frontend sends each email to the model with a built‑in system prompt that replaces sensitive spans with [redacted] while keeping all other text verbatim.
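Under the hood this is a single OpenAI-compatible chat completion call per email. A minimal sketch in Python (the prompt wording, function name, and parameters here are illustrative assumptions, not the app's actual code; it assumes the openai package and the environment variables described under Configuration):

    import os

    from openai import OpenAI

    # vLLM exposes an OpenAI-compatible API, so the standard client works as-is.
    client = OpenAI(
        base_url=os.environ["VLLM_BASE_URL"],  # e.g. https://xxxx.proxy.runpod.net/v1
        api_key=os.environ["OPENAI_API_KEY"],  # vLLM only checks that a key is present
    )

    SYSTEM_PROMPT = (
        "Replace every span of personally identifiable information with "
        "[redacted]. Return all other text verbatim."
    )

    def redact(email_body: str) -> str:
        """Send one email to the model and return the redacted text."""
        response = client.chat.completions.create(
            model=os.environ["MODEL_ID"],
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": email_body},
            ],
            temperature=0.0,  # deterministic output suits redaction
        )
        return response.choices[0].message.content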

Prerequisites

  • Anaconda or Miniconda
  • Python 3.11 (installed via Conda)
  • (Optional) NVIDIA GPU + CUDA drivers if you want to host your own vLLM inference server
  • Hugging Face access token with permission to download the target model

Local Setup (Anaconda)

# 1. Clone the repo and enter it
git clone <your fork or repo URL>
cd "C:\Users\johna\OneDrive\Documents\Brainqub3\Inference Engine\inference_engine"

# 2. Create/activate the Conda env (only once)
conda create -n private-inference python=3.11 -y
conda activate private-inference

# 3. Install app dependencies
pip install -r requirements.txt

# 4. Configure environment variables
copy .env.example .env   # edit the values to point at your vLLM endpoint + key

# 5. Run the Streamlit UI
streamlit run app/streamlit_app.py --logger.level info

The Makefile mirrors these steps if you prefer make env, make deps, and make chat (the targets assume a Conda environment named llm-chat; make sure conda run can find it, or override the name with ENV_NAME=private-inference make chat).

Configuration

  • VLLM_BASE_URL – OpenAI-compatible base URL (e.g., https://xxxx.proxy.runpod.net/v1)
  • OPENAI_API_KEY – Any non-empty string; vLLM just requires the header
  • MODEL_ID – Model name known to the vLLM server (e.g., ibm-granite/granite-4.0-h-1b)

Copy .env.example to .env and edit those values before starting Streamlit.
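A filled-in .env might look like this (placeholder values only):

    VLLM_BASE_URL=https://<pod-id>-8000.proxy.runpod.net/v1
    OPENAI_API_KEY=anything-non-empty
    MODEL_ID=ibm-granite/granite-4.0-h-1b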

Running

  1. Activate the Conda environment: conda activate private-inference
  2. Optional: pip install -r requirements.txt if dependencies changed
  3. Launch Streamlit: streamlit run app/streamlit_app.py --logger.level info
  4. Watch the PowerShell window for logs confirming each request (HTTP 200 / errors)
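Before opening the UI you can confirm the endpoint is reachable. A minimal sketch with the same OpenAI client (assumes the values from .env are exported in your shell):

    import os

    from openai import OpenAI

    client = OpenAI(
        base_url=os.environ["VLLM_BASE_URL"],
        api_key=os.environ["OPENAI_API_KEY"],
    )

    # The vLLM OpenAI server lists its served models at /v1/models;
    # the id printed here should match your MODEL_ID.
    for model in client.models.list().data:
        print(model.id)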

Deploying vLLM on RunPod

Use the official RunPod vLLM inference template (preloaded with GPU drivers and vLLM):

Steps:

  1. Click the template link and choose an appropriate GPU pod type.
  2. Provide your HUGGING_FACE_HUB_TOKEN as an environment variable so the model can download.
  3. Set container command/args similar to:
    python3 -m vllm.entrypoints.openai.api_server \
      --model ibm-granite/granite-4.0-h-1b \
      --host 0.0.0.0 \
      --port 8000
  4. Start the pod and wait for the health indicator to turn green.
  5. Copy the forwarded URL (https://<pod-id>-8000.proxy.runpod.net/v1) into VLLM_BASE_URL in your local .env, then verify it with the sketch below.
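Once the health indicator is green, a quick end-to-end test from your machine (a hedged sketch; the URL is the same placeholder as above, and it assumes the model has finished downloading):

    from openai import OpenAI

    # Point straight at the forwarded RunPod URL (fill in your own pod id).
    client = OpenAI(
        base_url="https://<pod-id>-8000.proxy.runpod.net/v1",
        api_key="anything-non-empty",
    )

    reply = client.chat.completions.create(
        model="ibm-granite/granite-4.0-h-1b",
        messages=[{"role": "user", "content": "Reply with the word: ready"}],
        max_tokens=8,
    )
    print(reply.choices[0].message.content)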

You can also build/push your own container via the provided Dockerfile if you need custom dependencies, then point RunPod at that image instead of the template default.