awesome-chatgpt-dataset

Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

Dataset Name	Size	Languages	Source	Cost	License
cc_sbu_align	4K	English	MiniGPT-4 dataset	-	BSD 3-Clause License
ChatAlpaca	10K	English	The data currently contain a total of 10,000 conversations with 95,558 utterances.	-	Apache-2.0 license
Dolly	15K	English	databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT.	-	CC BY-SA 3.0
Code Alpaca	20K	English	Code generation task involving 20,022 samples	-	-
HC3	37K	English, Chinese	37,175 instructions generated by ChatGPT and human	-	-
Alpaca Dataset	52K	English	Generated from 175 seed instructions using the OpenAI API	<$500	CC BY-NC 4.0; OpenAI terms of use
Alpaca Data Cleaned	52K	English	Revised version of Alpaca Dataset	-	-
Alpaca GPT-4 Data	52K	English	Generated by GPT-4 using Alpaca prompts	-	-
Alpaca GPT-4 Data (Chinese)	52K	Chinese	Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT	-	-
Cabrita Dataset	52K	Portuguese	Translated from Alpaca Data	-	-
Japanese Alpaca Dataset	52K	Japanese	Translated from Alpaca Data by ChatGPT API	$45	CC BY-NC 4.0; OpenAI terms of use
Traditional Chinese Alpaca Dataset	52K	Traditional Chinese	Translated from Alpaca Data by ChatGPT API	$40	Apache-2.0 license
Finance	69K	English	68,912 financial related instructions	-	-
Vicuna Dataset	75K	English	~100k ShareGPT conversations	-	-
InstructionTranslation	80K	Multi-lingual	Translations were generated by M2M 12B, and the output generations were limited to 512 tokens due to the VRAM limit (40 GB).	-	MIT
OASST1	89K	Multi-lingual	A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.	-	Apache-2.0
HH-RLHF	91K	English	The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.	-	MIT
Guanaco Dataset	98K	English, Simplified Chinese, Traditional Chinese HK & TW, Japanese	175 tasks from the Alpaca model	$6K	GPLv3
InstructionWild	104K	English, Chinese	429 seed instructions, expanded following the Alpaca method to generate 52K	$880	Research only; OpenAI terms of use
Camel Dataset	107K	Multi-lingual	Role-playing between AIs (OpenAI API)	-	-
LLaVA Visual Instruct	150K	English	LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability.	-	cc-by-nc-4.0
Prosocial Dialog	166K	English	165,681 instructions produced from GPT-3-rewritten questions and human feedback	-	-
Unnatural Instructions	241K	English	A large dataset of creative and diverse instructions, collected with virtually no human labor.	-	MIT
SHP	385K	English	SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice.	-	Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license
ultrachat	404K	English	To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response.	-	cc-by-nc-4.0
ELI5	559K	English	The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers.	-	-
GPT4All Dataset	806K	Multi-lingual	Subsets of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset, answered by the OpenAI API.	-	-
Instruct	889K	English	888,969 English instructions, augmentation using AllenAI NLP tools	-	MIT
MOSS	1M	Chinese	Generated by gpt-3.5-turbo	-	Apache-2.0, AGPL-3.0 licenses
LaMini-Instruction	3M	English	a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts	-	cc-by-nc-4.0
Natural Instructions	5M	Multi-lingual	5,040,134 instructions collected from diverse NLP tasks	-	-
BELLE	10M	Chinese	The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields.	-	Research only; OpenAI terms of use
Firefly	1.6M	Chinese	1,649,398 Chinese instructions in 23 NLP tasks	-	-
OIG-43M Dataset	43M	Multi-lingual	Created by Together, LAION, and Ontocord.ai.	-	-
xP3	79M	Multi-lingual	78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks	-	-
Alpaca-CoT Dataset	-	Multi-lingual	Instruction Data Collection	-	ODC-By
stack-exchange-paired	-	English	This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.	-	cc-by-sa-4.0
CodeParrot	-	Python	The database was queried for all Python files less than 1 MB in size, resulting in a 180 GB dataset with over 20M files.	-	-
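
To shortlist datasets from the table by language and size, the rows can be parsed as tab-separated text. The snippet below is a minimal sketch, not part of the original list: the helper names (`parse_size`, `load_rows`, `filter_rows`) are hypothetical, and the sample holds only a handful of rows copied from the table, using its column order.

```python
import csv
import io

# A few rows copied from the table above; the full table would work the same way.
ROWS = [
    ("Dataset Name", "Size", "Languages", "Source", "Cost", "License"),
    ("cc_sbu_align", "4K", "English", "MiniGPT-4 dataset", "-", "BSD 3-Clause License"),
    ("Code Alpaca", "20K", "English", "Code generation task involving 20,022 samples", "-", "-"),
    ("HC3", "37K", "English, Chinese", "37,175 instructions generated by ChatGPT and human", "-", "-"),
    ("MOSS", "1M", "Chinese", "Generated by gpt-3.5-turbo", "-", "Apache-2.0, AGPL-3.0 licenses"),
    ("Alpaca-CoT Dataset", "-", "Multi-lingual", "Instruction Data Collection", "-", "ODC-By"),
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in ROWS)

# Suffixes used in the Size column.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_size(text):
    """Convert a Size cell such as '52K' or '1M' to an int; '-' means unknown (None)."""
    text = text.strip()
    if not text or text == "-":
        return None
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        return round(float(text[:-1]) * _MULTIPLIERS[suffix])
    return int(text)


def load_rows(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def filter_rows(rows, language=None, max_size=None):
    """Keep rows that list `language` and have a known size no larger than `max_size`."""
    kept = []
    for row in rows:
        languages = [part.strip() for part in row["Languages"].split(",")]
        if language is not None and language not in languages:
            continue
        if max_size is not None:
            size = parse_size(row["Size"])
            # Rows with an unknown size are skipped when a size limit is requested.
            if size is None or size > max_size:
                continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    rows = load_rows(SAMPLE_TSV)
    for row in filter_rows(rows, language="English", max_size=50_000):
        print(row["Dataset Name"], parse_size(row["Size"]), row["License"])
```

Matching on the exact language name is a deliberate simplification: rows marked "Multi-lingual" are not counted as English or Chinese, so they need to be requested explicitly.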
