
Inference Engine

Streamlit UI for redacting personally identifiable information (PII) from emails by proxying requests to a remote vLLM server. The frontend sends each email to the model with a built‑in system prompt that replaces sensitive spans with [redacted] while keeping all other text verbatim.
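Under the hood this is a single OpenAI-compatible chat completion call per email. A minimal sketch in Python (the prompt wording, function name, and parameters here are illustrative assumptions, not the app's actual code; it assumes the openai package and the environment variables described under Configuration):

    import os

    from openai import OpenAI

    # vLLM exposes an OpenAI-compatible API, so the standard client works as-is.
    client = OpenAI(
        base_url=os.environ["VLLM_BASE_URL"],  # e.g. https://xxxx.proxy.runpod.net/v1
        api_key=os.environ["OPENAI_API_KEY"],  # vLLM only checks that a key is present
    )

    SYSTEM_PROMPT = (
        "Replace every span of personally identifiable information with "
        "[redacted]. Return all other text verbatim."
    )

    def redact(email_body: str) -> str:
        """Send one email to the model and return the redacted text."""
        response = client.chat.completions.create(
            model=os.environ["MODEL_ID"],
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": email_body},
            ],
            temperature=0.0,  # deterministic output suits redaction
        )
        return response.choices[0].message.content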

Prerequisites

  • Anaconda or Miniconda
  • Python 3.11 (installed via Conda)
  • (Optional) NVIDIA GPU + CUDA drivers if you want to host your own vLLM inference server
  • Hugging Face access token with permission to download the target model

Local Setup (Anaconda)

# 1. Clone the repo and enter it
git clone <your fork or repo URL>
cd "C:\Users\johna\OneDrive\Documents\Brainqub3\Inference Engine\inference_engine"

# 2. Create/activate the Conda env (only once)
conda create -n private-inference python=3.11 -y
conda activate private-inference

# 3. Install app dependencies
pip install -r requirements.txt

# 4. Configure environment variables
copy .env.example .env   # edit the values to point at your vLLM endpoint + key

# 5. Run the Streamlit UI
streamlit run app/streamlit_app.py --logger.level info

The Makefile mirrors these steps if you prefer make env, make deps, and make chat (the targets assume a Conda environment named llm-chat; make sure conda run can find it, or override the name with ENV_NAME=private-inference make chat).

Configuration

  • VLLM_BASE_URL – OpenAI-compatible base URL (e.g., https://xxxx.proxy.runpod.net/v1)
  • OPENAI_API_KEY – Any non-empty string; vLLM just requires the header
  • MODEL_ID – Model name known to the vLLM server (e.g., ibm-granite/granite-4.0-h-1b)

Copy .env.example to .env and edit those values before starting Streamlit.
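A filled-in .env might look like this (placeholder values only):

    VLLM_BASE_URL=https://<pod-id>-8000.proxy.runpod.net/v1
    OPENAI_API_KEY=anything-non-empty
    MODEL_ID=ibm-granite/granite-4.0-h-1b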

Running

  1. Activate the Conda environment: conda activate private-inference
  2. Optional: pip install -r requirements.txt if dependencies changed
  3. Launch Streamlit: streamlit run app/streamlit_app.py --logger.level info
  4. Watch the PowerShell window for logs confirming each request (HTTP 200 / errors)
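Before opening the UI you can confirm the endpoint is reachable. A minimal sketch with the same OpenAI client (assumes the values from .env are exported in your shell):

    import os

    from openai import OpenAI

    client = OpenAI(
        base_url=os.environ["VLLM_BASE_URL"],
        api_key=os.environ["OPENAI_API_KEY"],
    )

    # The vLLM OpenAI server lists its served models at /v1/models;
    # the id printed here should match your MODEL_ID.
    for model in client.models.list().data:
        print(model.id)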

Deploying vLLM on RunPod

Use the official RunPod vLLM inference template (preloaded with GPU drivers and vLLM):

Steps:

  1. Click the template link and choose an appropriate GPU pod type.
  2. Provide your HUGGING_FACE_HUB_TOKEN as an environment variable so the model can download.
  3. Set container command/args similar to:
    python3 -m vllm.entrypoints.openai.api_server \
      --model ibm-granite/granite-4.0-h-1b \
      --host 0.0.0.0 \
      --port 8000
  4. Start the pod and wait for the health indicator to turn green.
  5. Copy the forwarded URL (https://<pod-id>-8000.proxy.runpod.net/v1) into VLLM_BASE_URL in your local .env, then verify it with the sketch below.
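Once the health indicator is green, a quick end-to-end test from your machine (a hedged sketch; the URL is the same placeholder as above, and it assumes the model has finished downloading):

    from openai import OpenAI

    # Point straight at the forwarded RunPod URL (fill in your own pod id).
    client = OpenAI(
        base_url="https://<pod-id>-8000.proxy.runpod.net/v1",
        api_key="anything-non-empty",
    )

    reply = client.chat.completions.create(
        model="ibm-granite/granite-4.0-h-1b",
        messages=[{"role": "user", "content": "Reply with the word: ready"}],
        max_tokens=8,
    )
    print(reply.choices[0].message.content)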

You can also build/push your own container via the provided Dockerfile if you need custom dependencies, then point RunPod at that image instead of the template default.