SFaker is a data generator implemented with Spark DataSourceV2. It generates rows according to a specified schema.
| Feature | Status |
|---|---|
| Support Batch | ✅ |
| Support Stream | TBD |
| Support DataFrameReader API | ✅ |
| Support Spark SQL Create Statement | ✅ |
| Support Unsafe Row | ✅ |
| Support Codegen | ✅ |
| Support Limit Push Down | ✅ |
| Support Columns Pruning | ✅ |
SFaker supports the Spark SQL types listed below; more details about the types can be found here. A short sketch that combines the complex types follows the table.
| Spark Type | Status |
|---|---|
| Byte | ✅ |
| Short | ✅ |
| Integer | ✅ |
| Long | ✅ |
| Float | ✅ |
| Double | ✅ |
| Decimal | TBD |
| String | ✅ |
| Varchar | TBD |
| Char | TBD |
| Binary | TBD |
| Boolean | ✅ |
| Date | TBD |
| Timestamp | TBD |
| TimestampNTZ | TBD |
| YearMonthInterval | TBD |
| DayTimeInterval | TBD |
| Array | ✅ |
| Map | ✅ |
| Struct | ✅ |
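As a quick illustration of the complex types, here is a minimal sketch; the schema fields and row count are made up for illustration, while `FakeSource` is the format name used in the examples below:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructType}

val spark = SparkSession.builder().master("local[*]").appName("TypesDemo").getOrCreate()

// A schema combining the supported complex types: Array, Map, and Struct.
val schema = new StructType()
  .add("tags", DataTypes.createArrayType(DataTypes.StringType))
  .add("scores", DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType))
  .add("address", new StructType()
    .add("city", DataTypes.StringType)
    .add("zip", DataTypes.IntegerType))

// Generate fake rows for the schema with default options.
val df = spark.read
  .format("FakeSource")
  .schema(schema)
  .load()

df.printSchema()
df.show(truncate = false)
```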
| Conf | Type | Default | Description |
|---|---|---|---|
| spark.sql.fake.source.unsafe.row.enable | Boolean | false | If true, all generated rows are stored as UnsafeRow. |
| spark.sql.fake.source.unsafe.codegen.enable | Boolean | false | If true, the row-generation process, which produces rows according to the schema, is executed in JIT mode via code generation. |
| spark.sql.fake.source.partitions | Integer | 1 | Number of source partitions. |
| spark.sql.fake.source.rowsTotalSize | Integer | 8 | Total number of rows generated according to the schema. |
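These options can also be set per read. The example below generates rows through the DataFrameReader API, using the `FakeSourceProps` constants that correspond to the configuration keys above: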
```scala
import org.apache.spark.sql.types.{DataTypes, StructType}

// Schema of the rows to generate; `roles` is an array column.
val schema = new StructType()
  .add("id", DataTypes.IntegerType)
  .add("sex", DataTypes.BooleanType)
  .add("roles", DataTypes.createArrayType(DataTypes.StringType))

// Generate 100 rows in a single partition, with UnsafeRow storage and codegen enabled.
val df = spark.read
  .format("FakeSource")
  .schema(schema)
  .option(FakeSourceProps.CONF_ROWS_TOTAL_SIZE, 100)
  .option(FakeSourceProps.CONF_PARTITIONS, 1)
  .option(FakeSourceProps.CONF_UNSAFE_ROW_ENABLE, true)
  .option(FakeSourceProps.CONF_UNSAFE_CODEGEN_ENABLE, true)
  .load()
```
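Alternatively, the source can be exposed through a catalog and created with a Spark SQL `create table` statement, passing the same configuration keys as table properties: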
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Case0")
  .config(
    "spark.sql.catalog.spark_catalog",
    classOf[FakeSourceCatalog].getName
  )
  .getOrCreate()

// Register a table backed by FakeSource; the confs from the table above
// are passed as table properties.
spark.sql("""
  |create table fake (
  |  id int,
  |  sex boolean
  |)
  |using FakeSource
  |tblproperties (
  |  spark.sql.fake.source.rowsTotalSize = 10000000,
  |  spark.sql.fake.source.partitions = 1,
  |  spark.sql.fake.source.unsafe.row.enable = true,
  |  spark.sql.fake.source.unsafe.codegen.enable = true
  |)
  |""".stripMargin)

// Limit push-down: the explain output shows that only 10 rows are generated.
spark.sql("select id from fake limit 10").explain(true)
```