Diversity in LLM-Simulated Survey Responses: A Research based on Stack Overflow Survey Questions

Steps to run this code

Download dataset: https://cdn.sanity.io/files/jo7n4k8s/production/262f04c41d99fea692e0125c342e446782233fe4.zip/stack-overflow-developer-survey-2024.zip
Clone the code repository and place survey_results_schema.csv and survey_results_public.csv from the dataset into the project directory.
Download the pre-built indexes (compressed): https://drive.google.com/file/d/1WfLOnpHwQibGF--W0VpoLhJO-aanIEMl/view?usp=drive_link. Uncompress the archive and place the indexes folder inside the dataset folder. The indexes are built from the processed RAG vector dataset mentioned below, so using them saves the time of rebuilding the index for RAG.
In main.py, update schema_path, public_path, and index_path, and set key to your OpenAI API key.
Create an output folder and update its path in main.py.
Choose a prompt strategy (zero-shot or RAG) and an LLM (Llama, DeepSeek, or ChatGPT) according to your needs. To use ChatGPT or DeepSeek, open the corresponding file and comment out or uncomment the code sections as indicated by the comments in that file. To use DeepSeek or Llama, also open the corresponding file and enter your API key for that LLM.
Choose which distributions to compare, according to your needs.
Run main.py.
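
Before running main.py, it may help to confirm that the files and folders from the steps above are where main.py expects them. The following is only a minimal sketch and is not part of the repository: the helper names missing_inputs and ensure_output_dir are hypothetical, and the example locations (CSV files in the project directory, dataset/indexes, output) are assumptions that should be replaced with the values you actually set for schema_path, public_path, index_path, and the output path.

```python
from pathlib import Path


def missing_inputs(schema_path, public_path, index_path):
    """Return a description of each configured input that does not exist on disk."""
    required = {
        "schema_path": Path(schema_path),
        "public_path": Path(public_path),
        "index_path": Path(index_path),
    }
    return [f"{name}: {path}" for name, path in required.items() if not path.exists()]


def ensure_output_dir(output_path):
    """Create the output folder (and any parent folders) if it is missing."""
    out = Path(output_path)
    out.mkdir(parents=True, exist_ok=True)
    return out


if __name__ == "__main__":
    # Assumed example locations; use the same values you configured in main.py.
    problems = missing_inputs(
        "survey_results_schema.csv",
        "survey_results_public.csv",
        "dataset/indexes",
    )
    for problem in problems:
        print("Missing input ->", problem)
    if not problems:
        print("Output folder ready:", ensure_output_dir("output"))
```

If any input is reported as missing, revisit the download and placement steps before running main.py.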

The RAG vector dataset was built from the following original dataset: https://archive.org/details/stackexchange_20240930
