AdityaPatange1/synthlite

SynthLite 🌞

🚨 Disclaimer: SynthLite is a work in progress. Expect bugs, fun, and room for improvement.

SynthLite is more than just a tool. It's a spark in our larger SynthArt vision: to democratize synthetic data generation for everyone. By combining cutting-edge AI with deep research, SynthLite empowers you to create reliable, high-quality synthetic datasets in minutes. We believe the future of data solutions should be private, open source, scalable, accessible, and safe, and SynthLite is here to make that future a reality. 🔮

License: AGPL v3 · TypeScript · Node.js · Contributions Welcome

Introduction 💡

SynthLite ⚡️ is a synthetic data generation CLI tool and library written in TypeScript. It's designed to help you quickly produce high-quality synthetic datasets for development, testing, or even product features and experiments. 🥢

💬 Why? Because synthetic data opens new frontiers for experimentation, privacy-friendly testing, and robust model training, helping developers and researchers alike! 😎

Under the hood, synthlite demonstrates the speed and power of various large language models (LLMs), including those from OpenAI, Anthropic, Meta, and Groq, showcasing how seamlessly they can integrate for data generation. ⚙

Partnerships & Future Collaboration 🤝

🚦 synthlite is not affiliated with any of the mentioned organizations and is an independent "hacker" project. However, in this note, I wish to propose future partnerships or collaborations with any or all of OpenAI, Anthropic, Meta, and Groq.

Here at synthlite, we're always on the lookout for meaningful collaborations to take synthetic data generation to the next level. While our current setup already demonstrates the capabilities of various LLMs, we envision broader use cases and accelerated growth through strategic partnerships with:

  • OpenAI: Explore how advanced AI models can be utilized effectively across diverse tasks, moving closer to Artificial General Intelligence (AGI) by leveraging existing technologies.

  • Anthropic: Investigate the potential of AI models in creating nuanced synthetic data, contributing to the development of safe and reliable AI systems.

  • Meta: Examine how Llama 3.x and future Llama variants can seamlessly integrate with synthlite for more sophisticated data generation scenarios.

  • Groq: Further explore advanced hardware acceleration and develop cutting-edge benchmarks that highlight how synthlite combined with Groq can enhance synthetic data pipelines.

If you have any leads or are directly affiliated with these organizations (or similar), feel free to reach out! We believe that combining our open-source vision with innovative partners can push synthetic data tools even further, providing the community with faster, safer, and more adaptable ways to generate synthetic datasets. 🚀

💡 It would be amazing to see where our vision and hacker-like execution (which I picked up by studying Meta's culture) could take us.


Innovative Approaches 💡

In the development of SynthLite, we have employed several innovative approaches and techniques to enhance the synthetic data generation process. These methods draw from various computer science and engineering concepts, ensuring the generation of high-quality, diverse, and realistic synthetic data. Below are some of the key approaches:

Injecting Randomness via Mutation of Duplicates 🌱

To ensure the uniqueness and diversity of the generated data, we inject randomness by mutating duplicates. This involves making minor adjustments to existing data points to create new, unique entries. This technique helps in avoiding repetitive patterns and ensures that the synthetic data remains varied and realistic.

Mutation Concept and Genetic Programming

The concept of mutation used in SynthLite is inspired by genetic programming. In genetic programming, mutation is a genetic operator used to maintain genetic diversity within a population of solutions. Similarly, in SynthLite, mutation is applied to duplicate data points to introduce variations and prevent redundancy. This approach ensures that the generated data evolves and adapts, much like in genetic algorithms.
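The mutation idea can be illustrated with a minimal sketch. It assumes rows are flat objects with string or number fields; `mutateRow` and `mutateDuplicates` are hypothetical names chosen for this example and are not taken from SynthLite's source.

```typescript
// Hypothetical sketch: not SynthLite's actual implementation.
type Row = Record<string, string | number>;

// Apply one minor adjustment to a randomly chosen field of the row.
function mutateRow(row: Row, rng: () => number = Math.random): Row {
  const keys = Object.keys(row);
  if (keys.length === 0) return { ...row };
  const key = keys[Math.floor(rng() * keys.length)];
  const value = row[key];
  const mutated =
    typeof value === "number"
      ? value + 1 + Math.floor(rng() * 9) // nudge numbers by 1..9
      : `${value}-${Math.floor(rng() * 1e6).toString(36)}`; // append a short suffix
  return { ...row, [key]: mutated };
}

// Keep first occurrences as they are; mutate repeats until they are unique
// (or until maxTries is exhausted, in which case the row is kept anyway).
function mutateDuplicates(rows: Row[], maxTries = 20): Row[] {
  const seen = new Set<string>();
  const out: Row[] = [];
  for (const row of rows) {
    let candidate = row;
    let tries = 0;
    while (seen.has(JSON.stringify(candidate)) && tries < maxTries) {
      candidate = mutateRow(candidate);
      tries++;
    }
    seen.add(JSON.stringify(candidate));
    out.push(candidate);
  }
  return out;
}
```

As in genetic programming, the mutation operator here is deliberately small: it changes one field at a time, so a mutated row stays close to the row it was derived from.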

Schema-Driven Data Generation 🥤

SynthLite leverages JSON schemas to define the structure and constraints of the synthetic data. By converting JSON schemas to Zod schemas, we ensure that the generated data adheres to the specified format and validation rules. This schema-driven approach provides flexibility and precision in data generation.
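To show what "schema in, validator out" looks like, here is a dependency-free sketch. It is a hand-rolled stand-in for the JSON Schema to Zod conversion, not the conversion itself: it covers only a tiny subset of JSON Schema, and the `JsonSchema` type and `compile` function are hypothetical names for this example.

```typescript
// Hypothetical sketch: a hand-rolled stand-in for the JSON Schema -> Zod step.
type JsonSchema =
  | { type: "string"; minLength?: number }
  | { type: "number"; minimum?: number; maximum?: number }
  | { type: "boolean" }
  | { type: "object"; properties: Record<string, JsonSchema>; required?: string[] };

// A validator returns a list of error messages; an empty list means "valid".
type Validator = (value: unknown) => string[];

function compile(schema: JsonSchema, path = "$"): Validator {
  switch (schema.type) {
    case "string": {
      const { minLength } = schema;
      return (v) => {
        if (typeof v !== "string") return [`${path}: expected string`];
        if (minLength !== undefined && v.length < minLength)
          return [`${path}: shorter than ${minLength}`];
        return [];
      };
    }
    case "number": {
      const { minimum, maximum } = schema;
      return (v) => {
        if (typeof v !== "number") return [`${path}: expected number`];
        if (minimum !== undefined && v < minimum) return [`${path}: below ${minimum}`];
        if (maximum !== undefined && v > maximum) return [`${path}: above ${maximum}`];
        return [];
      };
    }
    case "boolean":
      return (v) => (typeof v === "boolean" ? [] : [`${path}: expected boolean`]);
    case "object": {
      const required = schema.required ?? [];
      // Compile each property once, up front, rather than on every validation.
      const props = Object.entries(schema.properties).map(
        ([key, sub]) => [key, compile(sub, `${path}.${key}`)] as const,
      );
      return (v) => {
        if (typeof v !== "object" || v === null || Array.isArray(v))
          return [`${path}: expected object`];
        const record = v as Record<string, unknown>;
        const errors: string[] = [];
        for (const key of required)
          if (!(key in record)) errors.push(`${path}.${key}: missing`);
        for (const [key, validate] of props)
          if (key in record) errors.push(...validate(record[key]));
        return errors;
      };
    }
  }
}
```

A generated row would be accepted only when the compiled validator returns an empty list, which is the same role a Zod schema's parse step plays.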

AI-Powered Data Generation 🪄

SynthLite integrates AI models to generate synthetic data. By providing prompts and schemas to the AI, we harness the power of language models to create realistic and contextually appropriate data points. This AI-driven approach enhances the quality and coherence of the generated data.
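A rough sketch of the "prompt plus schema" step follows. The prompt wording and the helper names `buildPrompt` and `parseModelReply` are assumptions made for illustration; the actual call to a model provider is left out on purpose, since it depends on which SDK you use.

```typescript
// Hypothetical helpers: names and prompt wording are illustrative only.
function buildPrompt(task: string, schema: object, count: number): string {
  return [
    `Generate ${count} synthetic records for: ${task}`,
    "Each record must conform to this JSON Schema:",
    JSON.stringify(schema, null, 2),
    "Reply with a JSON array only, with no commentary.",
  ].join("\n");
}

// Models often wrap JSON in a Markdown code fence; strip it before parsing.
function parseModelReply(reply: string): unknown[] {
  const unfenced = reply
    .trim()
    .replace(/^`{3}[a-zA-Z]*\s*/, "") // leading fence with optional language tag
    .replace(/\s*`{3}$/, ""); // trailing fence
  const parsed: unknown = JSON.parse(unfenced);
  if (!Array.isArray(parsed)) throw new Error("expected a JSON array");
  return parsed;
}
```

Each parsed record should still be checked against the schema before it is kept, because a model reply is not guaranteed to follow the instructions.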

Event-Driven Architecture 🎪

The use of an event-driven architecture with the SynthliteEmitter allows for efficient handling of data generation events. This architecture enables real-time processing and writing of generated data, ensuring a smooth and responsive data generation workflow.
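The pattern can be sketched with Node's built-in `EventEmitter`. The class name `SketchEmitter` and the event names `"row"` and `"done"` are assumptions for this example; the events that SynthliteEmitter really exposes are not documented in this README.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical event names; SynthliteEmitter's real events may differ.
type Row = Record<string, unknown>;

class SketchEmitter extends EventEmitter {
  // Announce one generated row to whoever is listening.
  emitRow(row: Row): void {
    this.emit("row", row);
  }
  // Announce that generation has finished.
  finish(): void {
    this.emit("done");
  }
}

// A consumer that reacts to events instead of being called by the generator.
// Here it only collects rows in memory; a real one might append to a file.
function attachCollector(emitter: SketchEmitter): { rows: Row[]; isDone: () => boolean } {
  const rows: Row[] = [];
  let done = false;
  emitter.on("row", (row: Row) => rows.push(row));
  emitter.once("done", () => {
    done = true;
  });
  return { rows, isDone: () => done };
}
```

The generator only emits and never needs to know who consumes the rows, so a file writer, a progress bar, or a test collector can be attached without changing the generation code.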

Performance Optimization ⚡️

To optimize performance, SynthLite processes data in batches and measures the time taken for each batch. This approach helps in identifying bottlenecks and ensures efficient utilization of resources during data generation.
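A minimal sketch of batch processing with per-batch timing is shown here. `processInBatches` and `BatchTiming` are hypothetical names, and `Date.now()` is used for simplicity even though it has only millisecond resolution.

```typescript
// Hypothetical sketch of batching with per-batch timing.
interface BatchTiming {
  batch: number; // zero-based batch index
  size: number; // number of items in the batch
  ms: number; // elapsed wall-clock time in milliseconds
}

function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  handle: (batch: T[]) => R[],
): { results: R[]; timings: BatchTiming[] } {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const results: R[] = [];
  const timings: BatchTiming[] = [];
  for (let start = 0, n = 0; start < items.length; start += batchSize, n++) {
    const batch = items.slice(start, start + batchSize);
    const t0 = Date.now();
    results.push(...handle(batch));
    timings.push({ batch: n, size: batch.length, ms: Date.now() - t0 });
  }
  return { results, timings };
}
```

Comparing the `ms` values across batches is what makes a slow batch stand out, for example one that triggered many retries.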

By combining these innovative approaches, SynthLite provides a robust and flexible solution for synthetic data generation, catering to a wide range of use cases and ensuring high-quality outputs.



Core Features 🔧

  • Optimized Generation: Harness the power of various LLMs for efficient synthetic data generation.

  • TypeScript Library & CLI: Use synthlite as a standalone CLI or integrate directly into your projects.

  • Schema-Based Datasets: Initialize a dataset with your jsonSchema for structured, valid data every time.

  • Flexible Output Formats: Save generated data in JSON or CSV, or just work with it in-memory as a JavaScript object.

  • LLM Integration: (Optional) Use the power of models like Llama 3.x to enhance realism and variety in your synthetic data.


Potential Problem Statements & Research Areas 🔎

  1. Privacy & Compliance: Generate synthetic datasets that mimic real-world data distributions without exposing sensitive information.

  2. High-Volume Testing: Rapidly create large datasets for load testing or performance benchmarking.

  3. AI Model Training: Explore how synthetic data can be used to train or fine-tune AI models while preserving privacy.

  4. Performance Research: Investigate how hardware acceleration can supercharge the synthetic data generation process.

  5. Multi-Modal Future: Potential exploration of text, image, or even audio synthetic data using advanced AI models.

Relevance: As the need for large, diverse, and privacy-friendly datasets grows, synthlite aims to deliver a swift, flexible solution that caters to the modern data-driven ecosystem.


How It Works ⚙️

  1. Create a Dataset

    const dataset = new SynthliteDataset(jsonSchema);

    This sets up your data structure based on the JSON schema you provide.

  2. Generate Data

    const generatedDataset = dataset.generate({ count: 1000 });

    Produces a GeneratedDataset object containing your synthetic samples.

  3. Save the Output

    generatedDataset.save("output.json", "json");

    Exports the generated data in JSON or CSV format, whichever you prefer.

All these steps leverage the efficiency of various LLMs and can optionally tap into models like Llama 3.x for enhanced generative capabilities.
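The save step boils down to serializing rows as JSON or CSV. The sketch below shows one way to do that under simple assumptions (flat rows, headers taken from the first row); `toCsv` and `serialize` are hypothetical helpers, and SynthLite's own `save()` may behave differently.

```typescript
// Hypothetical serializer; SynthLite's own save() may behave differently.
type Row = Record<string, string | number | boolean | null>;

// Quote a field when it contains a comma, a quote, or a line break.
function escapeCsv(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Assumes every row has the same keys as the first row.
function toCsv(rows: Row[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = [headers.map(escapeCsv).join(",")];
  for (const row of rows) {
    // null and missing values become empty cells.
    lines.push(headers.map((h) => escapeCsv(String(row[h] ?? ""))).join(","));
  }
  return lines.join("\n");
}

function serialize(rows: Row[], format: "json" | "csv"): string {
  return format === "json" ? JSON.stringify(rows, null, 2) : toCsv(rows);
}
```

Writing the result to disk is then a single call such as `fs.writeFileSync(path, serialize(rows, "csv"))`.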


Setup Instructions 🔧

You can install SynthLite using your favorite package manager: npm, yarn, or pnpm. Just pick one of the commands below:

npm

npm install synthlite

yarn

yarn add synthlite

Usage

npx synthlite <options>
Example: npx synthlite -sc schema.json -o output.json -env .env -r 20

Prerequisites

  • Node.js v16+

  • TypeScript 4.x

  • Access to relevant AI models.

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd synthlite
  2. Install dependencies:

    npm install

Build and Run

  1. Build the project:

    npm run build
  2. Use the CLI (example):

    npm start -- --schema ./mySchema.json --count 1000 --output data.json

    This will generate 1,000 samples using mySchema.json and save them to data.json.
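For readers wondering how the three flags in the command above could be read, here is a small hand-rolled sketch. It handles only `--schema`, `--count`, and `--output` as they appear in that command; `parseCliArgs` and `CliOptions` are hypothetical names, and the real CLI may accept other flags or parse them differently.

```typescript
// Hypothetical parser for the three flags shown above; the real CLI may differ.
interface CliOptions {
  schema: string;
  count: number;
  output?: string;
}

function parseCliArgs(argv: string[]): CliOptions {
  const values = new Map<string, string>();
  // Expect strict "--flag value" pairs.
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (!flag.startsWith("--") || value === undefined) {
      throw new Error(`expected "--flag value" pairs, got: ${flag}`);
    }
    values.set(flag.slice(2), value);
  }
  const schema = values.get("schema");
  if (!schema) throw new Error("--schema is required");
  const count = Number(values.get("count"));
  if (!Number.isInteger(count) || count < 1) {
    throw new Error("--count must be a positive integer");
  }
  return { schema, count, output: values.get("output") };
}
```

In a Node entry point this would typically be called as `parseCliArgs(process.argv.slice(2))`.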


Usage Instructions 🕵️‍♂️

  1. Library Usage (TypeScript)

    import { SynthliteDataset, SynthliteGeneratedDataset } from "synthlite";
    
    const jsonSchemaPath = "./schema.json";
    const dataset = await SynthliteDataset.fromSchemaFile(jsonSchemaPath);
    
    const generatedDataset: SynthliteGeneratedDataset = dataset.generate({
      count: 500,
    });
    await generatedDataset.save("output.csv", "csv");
  2. CLI Usage

    • Basic Command

      npm start -- --schema ./mySchema.json --count 500

      This generates 500 samples and prints them to stdout.

    • Save to File

      npm start -- --schema ./mySchema.json --count 1000 --output data.csv

      Exports 1,000 samples to a data.csv file.

  3. Optional Llama 3.x Hook

    If you have Llama 3.x integrated, you can configure your dataset to add advanced generative power to your fields. See our docs for usage examples (if available).


Examples 📊

Generating a Simple JSON File

npm start -- --schema ./mySchema.json --count 50 --output myData.json

Contributors 💖

โšก๏ธ synthlite is a product of AdiPat Labs.

  • Aditya Patange (Founder, AdiPat Labs)

We welcome contributions! Feel free to open issues, fork the repo, and submit pull requests.


License 📜

This project is licensed under the AGPL v3. See the LICENSE file for details.


✨ "It's not fake data dude, it's 'synthetic' data. 🥼" - Oen