Benchmarking Reliability in AI-generated Legal Advice

By Team BenchLAWk:
Wenjie Gong, Simeng Wu, Carol Zhou, Franklin Zhou

Step 1:

Build the automated pipeline to generate the test dataset.
Approach: Scrape a few official "Tenant Rights" PDFs (e.g., New York City and California; see the resource folder). Generate synthetic user questions with "Ground Truth" answers based only on the PDF text. Output the result to a JSON file.
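A minimal sketch of how the Step 1 output could be assembled, assuming the PDF text has already been extracted into passages. The field names (`id`, `jurisdiction`, `source_passage`, `question`, `ground_truth`) and the functions `build_dataset`, `save_dataset`, and `stub_generate_qa` are hypothetical and are not taken from the repository. The question generator is passed in as a callable, so a real LLM call could replace the stub.

```python
import json
from typing import Callable

# Hypothetical signature: a passage goes in, (question, answer) pairs come out.
QAGenerator = Callable[[str], list[tuple[str, str]]]


def build_dataset(passages: dict[str, list[str]], generate_qa: QAGenerator) -> list[dict]:
    """Turn extracted passages into question records with ground-truth answers."""
    records = []
    for jurisdiction, texts in passages.items():
        for passage in texts:
            for question, answer in generate_qa(passage):
                records.append({
                    "id": f"q{len(records) + 1:04d}",
                    "jurisdiction": jurisdiction,
                    "source_passage": passage,   # kept so answers stay traceable to the text
                    "question": question,
                    "ground_truth": answer,
                })
    return records


def save_dataset(records: list[dict], path: str) -> None:
    """Write the test set to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


def stub_generate_qa(passage: str) -> list[tuple[str, str]]:
    """Placeholder generator (assumption): a real pipeline would call a model here."""
    return [("What does the following passage state?", passage)]
```

Calling `save_dataset(build_dataset(passages, stub_generate_qa), "test_set.json")` would write one record per generated question; the output file name is also an assumption.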

Step 2:

Run the models.
Approach: Through an automated pipeline, loop through the test set and send each question to several models. Record each answer and its latency, and compute each model's refusal rate (e.g., responses that fall back on "I am not a lawyer").
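The loop described above could look roughly like the following sketch. Each model is represented as a plain callable that takes a question and returns an answer, which is an assumption standing in for real API clients. The phrase list in `REFUSAL_MARKERS` and the names `run_models` and `refusal_rate` are illustrative only; simple phrase matching is one possible heuristic for spotting refusals, not necessarily the one the project uses.

```python
import time
from typing import Callable

# Assumed refusal phrases; a real list would need tuning against actual outputs.
REFUSAL_MARKERS = (
    "i am not a lawyer",
    "i'm not a lawyer",
    "cannot provide legal advice",
    "can't provide legal advice",
)


def is_refusal(answer: str) -> bool:
    """Flag an answer that contains any of the refusal phrases."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def run_models(test_set: list[dict], models: dict[str, Callable[[str], str]]) -> list[dict]:
    """Send every question to every model and record answer, latency, and refusal."""
    rows = []
    for model_name, ask in models.items():
        for item in test_set:
            start = time.perf_counter()
            answer = ask(item["question"])
            latency = time.perf_counter() - start
            rows.append({
                "model": model_name,
                "question": item["question"],
                "answer": answer,
                "latency_s": latency,
                "refused": is_refusal(answer),
            })
    return rows


def refusal_rate(rows: list[dict], model_name: str) -> float:
    """Fraction of a model's answers that were flagged as refusals."""
    flags = [row["refused"] for row in rows if row["model"] == model_name]
    return sum(flags) / len(flags) if flags else 0.0
```

Keeping one row per (model, question) pair means latency and refusal statistics can be aggregated later without re-running the models.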

Step 3:

Evaluate the model response.
Approach: Implement metrics that evaluate each AI response.
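The README does not name the metrics, so the following is only one possible example: a token-overlap F1 score between a model answer and the "Ground Truth" answer from Step 1. The function names `tokenize` and `token_f1` are hypothetical, and the project may rely on entirely different measures.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall against the reference answer."""
    pred_tokens = tokenize(prediction)
    gold_tokens = tokenize(ground_truth)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token overlap rewards shared wording rather than legal correctness, so a score like this would likely need to be paired with other checks.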

Step 4:

Visualize the results in the front end.
Approach: Use Streamlit to build a dashboard where you can paste a legal question, select a model, and see the real-time evaluation results.
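A rough sketch of such a dashboard is shown below. It is not the contents of `dashboard_app.py`; the names `evaluate_live` and `render`, the refusal phrases, and the idea of passing models in as callables are all assumptions made for illustration. Streamlit is imported inside `render` so that the evaluation helper can be used and tested without it.

```python
import time
from typing import Callable

# Assumed refusal phrases, for illustration only.
REFUSAL_MARKERS = ("i am not a lawyer", "i'm not a lawyer", "cannot provide legal advice")


def evaluate_live(question: str, ask: Callable[[str], str]) -> dict:
    """Query one model and return its answer, latency, and a refusal flag."""
    start = time.perf_counter()
    answer = ask(question)
    latency = time.perf_counter() - start
    refused = any(marker in answer.lower() for marker in REFUSAL_MARKERS)
    return {"answer": answer, "latency_s": latency, "refused": refused}


def render(models: dict[str, Callable[[str], str]]) -> None:
    """Draw the dashboard: question box, model picker, and evaluation results."""
    import streamlit as st  # lazy import: only needed when the dashboard is shown

    st.title("Benchmarking Reliability in AI-generated Legal Advice")
    question = st.text_area("Paste a legal question")
    model_name = st.selectbox("Model", list(models))
    if st.button("Evaluate") and question.strip():
        result = evaluate_live(question, models[model_name])
        st.metric("Latency (s)", f"{result['latency_s']:.2f}")
        st.metric("Refused", "yes" if result["refused"] else "no")
        st.write(result["answer"])
```

To try it, one would add a final line such as `render({"stub-model": lambda q: "I am not a lawyer."})` with real model clients in place of the stub, then launch the file with `streamlit run`.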

Further questions:

This project benchmarks AI for tenant-landlord law. How about AI for traffic law? AI for family law? AI for immigration law? This project is all about scalability: today it is tenants' rights, tomorrow it is immigration law!

Link to our slides:

https://docs.google.com/presentation/d/1nJkU3DylH-KSh4aq5t62-vs3PI1ns4BNiFfNyZXu5Pg/edit?usp=sharing

Folders and files:

data
resource
results
README.md
ai_leaderboard.py
ai_leaderboard_consolidated.md
ai_leaderboard_extended.py
dashboard_app.py
inference_engine.py
inference_engine_mega.py
red_teamer.py
red_teamer_adv.py
