| cc_sbu_align |
4K |
English |
MiniGPT-4 dataset |
- |
BSD 3-Clause License |
| ChatAlpaca |
10K |
English |
The data currently contain a total of 10,000 conversations with 95,558 utterances. |
- |
Apache-2.0 license |
| Dolly |
15K |
English |
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. |
- |
CC BY-SA 3.0 |
| Code Alpaca |
20K |
English |
Code generation task involving 20,022 samples |
- |
- |
| HC3 |
37K |
English, Chinese |
37,175 instructions generated by ChatGPT and humans |
- |
- |
| Alpaca Dataset |
52K |
English |
Generated from 175 seed instructions using the OpenAI API |
<$500 |
CC BY-NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned |
52K |
English |
Revised version of Alpaca Dataset |
- |
- |
| Alpaca GPT-4 Data |
52K |
English |
Generated by GPT-4 using Alpaca prompts |
- |
- |
| Alpaca GPT-4 Data (Chinese) |
52K |
Chinese |
Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT |
- |
- |
| Cabrita Dataset |
52K |
Portuguese |
Translated from Alpaca Data |
- |
|
| Japanese Alpaca Dataset |
52K |
Japanese |
Translated from Alpaca Data by ChatGPT API |
$45 |
CC BY-NC 4.0; OpenAI terms of use |
| Traditional Chinese Alpaca Dataset |
52K |
Traditional Chinese |
Translated from Alpaca Data by ChatGPT API |
$40 |
Apache-2.0 license |
| Finance |
69K |
English |
68,912 finance-related instructions |
- |
- |
| Vicuna Dataset |
75K |
English |
~100k ShareGPT conversations |
- |
- |
| InstructionTranslation |
80K |
Multi-lingual |
Translations were generated by M2M 12B, and output generations were limited to 512 tokens due to the VRAM limit (40 GB). |
- |
MIT |
| OASST1 |
89K |
Multi-lingual |
A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. |
- |
apache-2.0 |
| HH-RLHF |
91K |
English |
The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. |
- |
MIT |
| Guanaco Dataset |
98K |
English, Simplified Chinese, Traditional Chinese HK & TW, Japanese |
175 tasks from the Alpaca model |
$6K |
GPLv3 |
| InstructionWild |
104K |
English, Chinese |
Generates 52K instructions from 429 seed instructions, following the Alpaca method |
$880 |
Research only; OpenAI terms of use |
| Camel Dataset |
107K |
Multi-lingual |
Role-playing between AIs (OpenAI API) |
- |
|
| LLaVA Visual Instruct |
150K |
English |
LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. |
- |
cc-by-nc-4.0 |
| Prosocial Dialog |
166K |
English |
165,681 instructions produced from GPT-3 question rewrites and human feedback |
- |
- |
| Unnatural Instructions |
241K |
English |
A large dataset of creative and diverse instructions, collected with virtually no human labor. |
- |
MIT |
| SHP |
385K |
English |
SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. |
- |
Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
| ultrachat |
404K |
English |
To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. |
- |
cc-by-nc-4.0 |
| ELI5 |
559K |
English |
The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. |
|
|
| GPT4All Dataset |
806K |
Multi-lingual |
Subset of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset, answered by the OpenAI API. |
- |
|
| Instruct |
889K |
English |
888,969 English instructions, augmentation using AllenAI NLP tools |
- |
MIT |
| MOSS |
1M |
Chinese |
Generated by gpt-3.5-turbo |
|
Apache-2.0, AGPL-3.0 licenses |
| LaMini-Instruction |
3M |
English |
a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts |
- |
cc-by-nc-4.0 |
| Natural Instructions |
5M |
Multi-lingual |
5,040,134 instructions collected from diverse NLP tasks |
- |
- |
| BELLE |
10M |
Chinese |
The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. |
- |
Research only; OpenAI terms of use |
| Firefly |
1.6M |
Chinese |
1,649,398 Chinese instructions in 23 NLP tasks |
- |
- |
| OIG-43M Dataset |
43M |
Multi-lingual |
Released by Together, LAION, and Ontocord.ai. |
- |
|
| xP3 |
79M |
Multi-lingual |
78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks |
- |
- |
| Alpaca-CoT Dataset |
- |
Multi-lingual |
Instruction Data Collection |
- |
ODC-By |
| stack-exchange-paired |
- |
English |
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. |
- |
cc-by-sa-4.0 |
| CodeParrot |
- |
Python |
The database was queried for all Python files smaller than 1 MB, resulting in a 180 GB dataset with over 20M files. |
|
|