From 43522d7edffbe9b915fd12707f3c02e85a4d7e99 Mon Sep 17 00:00:00 2001 From: Lizy Date: Fri, 29 Mar 2019 18:56:14 +0530 Subject: [PATCH 1/3] Changes to the How-tos: * How to load data from External Data Stores - Added a stem sentence to explain that the data load from CSV file using SQL is done from the local file system. * How to Load Data into SnappyData Tables - Added a Troubleshooting tip as suggested by Trilok and Amogh. Organized content and edited based on suggestions from Grammarly. --- .../load_data_from_external_data_stores.md | 8 ++-- .../howto/load_data_into_snappydata_tables.md | 37 ++++++++++--------- 2 files changed, 25 insertions(+), 20 deletions(-) diff --git a/docs/howto/load_data_from_external_data_stores.md b/docs/howto/load_data_from_external_data_stores.md index 36c15b501b..13287bc179 100644 --- a/docs/howto/load_data_from_external_data_stores.md +++ b/docs/howto/load_data_from_external_data_stores.md @@ -3,7 +3,9 @@ SnappyData comes bundled with the libraries to access HDFS (Apache compatible). You can load your data using SQL or DataFrame API. -## Example - Loading data from CSV file using SQL +## Example - Loading Data from CSV File using SQL + +The following example demonstrates how you can load data from a CSV file in the local file system using SQL: ```pre // Create an external table based on CSV file @@ -14,7 +16,7 @@ CREATE TABLE CUSTOMER using column options() as (select * from CUSTOMER_STAGING_ ``` !!! Tip - Similarly, you can create an external table for all data sources and use SQL "insert into" query to load data. For more information on creating external tables refer to, [CREATE EXTERNAL TABLE](../reference/sql_reference/create-external-table/) + Similarly, you can create an external table for all data sources and use SQL "insert into" query to load data. For more information on creating external tables, refer to [CREATE EXTERNAL TABLE](../reference/sql_reference/create-external-table/). 
## Example - Loading CSV Files from HDFS using API @@ -73,7 +75,7 @@ val df = session.createDataFrame(rdd, ds.schema) df.write.format("column").saveAsTable("columnTable") ``` -## Importing Data using JDBC from a relational DB +## Importing Data using JDBC from Rrelational DB !!! Note Before you begin, you must install the corresponding JDBC driver. To do so, copy the JDBC driver jar file in **/jars** directory located in the home directory and then restart the cluster. diff --git a/docs/howto/load_data_into_snappydata_tables.md b/docs/howto/load_data_into_snappydata_tables.md index cbd0b7864c..23c59ece31 100644 --- a/docs/howto/load_data_into_snappydata_tables.md +++ b/docs/howto/load_data_into_snappydata_tables.md @@ -3,16 +3,13 @@ SnappyData relies on the Spark SQL Data Sources API to parallelly load data from a wide variety of sources. By integrating the loading mechanism with the Query engine (Catalyst optimizer) it is often possible to push down filters and projections all the way to the data source minimizing data transfer. Here is the list of important features: -**Support for many Sources**
There is built-in support for many data sources as well as data formats. Data can be accessed from S3, file system, HDFS, Hive, RDB, etc. And the loaders have built-in support to handle CSV, Parquet, ORC, Avro, JSON, Java/Scala Objects, etc as the data formats. +* **Support for many Sources**
There is built-in support for many data sources as well as data formats. Data can be accessed from S3, file system, HDFS, Hive, RDB, etc. Moreover, loaders have built-in support to handle CSV, Parquet, ORC, Avro, JSON, Java/Scala Objects, etc. as the data formats. +* **Access virtually any modern data store**
Virtually all major data providers have a native Spark connector that complies with the Data Sources API. For example, you can load data from an RDB like Amazon Redshift, or from stores such as Cassandra, Redis, Elastic Search, and Neo4J. While these connectors are not built-in, you can easily deploy them as dependencies into a SnappyData cluster. These connectors are typically registered on spark-packages.org. +* **Avoid Schema wrangling**
Spark supports schema inference, which means all you need to do is point to the external source in your 'create table' DDL (or Spark SQL API), and the schema definition is learned by reading in the data. There is no need to define each column and type explicitly. This is extremely useful when dealing with disparate, complex, and wide data sets. +* **Read nested, sparse data sets**
When data is accessed from a source, the schema inference occurs by not just reading a header but often by reading the entire data set. For instance, when reading JSON files, the structure could change from document to document. The inference engine builds up the schema as it reads each record and keeps unioning them to create a unified schema. This approach allows developers to become very productive with disparate data sets. -**Access virtually any modern data store**
Virtually all major data providers have a native Spark connector that complies with the Data Sources API. For e.g. you can load data from any RDB like Amazon Redshift, Cassandra, Redis, Elastic Search, Neo4J, etc. While these connectors are not built-in, you can easily deploy these connectors as dependencies into a SnappyData cluster. All the connectors are typically registered in spark-packages.org - -**Avoid Schema wrangling**
Spark supports schema inference. Which means, all you need to do is point to the external source in your 'create table' DDL (or Spark SQL API) and schema definition is learned by reading in the data. There is no need to explicitly define each column and type. This is extremely useful when dealing with disparate, complex and wide data sets. - -**Read nested, sparse data sets**
When data is accessed from a source, the schema inference occurs by not just reading a header but often by reading the entire data set. For instance, when reading JSON files the structure could change from document to document. The inference engine builds up the schema as it reads each record and keeps unioning them to create a unified schema. This approach allows developers to become very productive with disparate data sets. - -**Load using Spark API or SQL**
You can use SQL to point to any data source or use the native Spark Scala/Java API to load. -For instance, you can first [create an external table](../reference/sql_reference/create-external-table.md). +## Loading Data using Spark API or SQL +You can use SQL to point to any data source or use the native Spark Scala/Java API to load. For instance, you can first [create an external table](../reference/sql_reference/create-external-table.md). ```pre CREATE EXTERNAL TABLE USING OPTIONS @@ -20,15 +17,17 @@ CREATE EXTERNAL TABLE USING OPTIONS +For example, `snc.sparkContext.hadoopConfiguration.set("fs.s3a.connection.maximum", "1000")` \ No newline at end of file From 7cccce19a6476b8900f87b62c55fa0fe681738c7 Mon Sep 17 00:00:00 2001 From: Lizy Date: Fri, 29 Mar 2019 19:01:22 +0530 Subject: [PATCH 2/3] Minor edit as suggested by chandresh. --- docs/programming_guide/tables_in_snappydata.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/programming_guide/tables_in_snappydata.md b/docs/programming_guide/tables_in_snappydata.md index c80b890d9b..d4e4ecdd45 100644 --- a/docs/programming_guide/tables_in_snappydata.md +++ b/docs/programming_guide/tables_in_snappydata.md @@ -31,7 +31,7 @@ CREATE TABLE [IF NOT EXISTS] table_name ) [AS select_statement]; -DROP TABLE [IF EXISTS] table_name +DROP TABLE [IF EXISTS] table_name; ``` Refer to the [Best Practices](../best_practices/design_schema.md) section for more information on partitioning and colocating data and [CREATE TABLE](../reference/sql_reference/create-table.md) for information on creating a row/column table.
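[Editor's note] As a sketch of how the `fs.s3a.connection.maximum` tweak mentioned in the patch above fits into an actual S3 load (the bucket path and table names below are hypothetical, not taken from the docs), the setting would be applied on the session's Hadoop configuration before creating an external table over S3:

```pre
// Raise the S3A connection pool limit before reading from S3
snc.sparkContext.hadoopConfiguration.set("fs.s3a.connection.maximum", "1000")

// Stage the S3 data as an external table (bucket and file names are hypothetical)
snc.sql("CREATE EXTERNAL TABLE ORDERS_STAGING USING csv OPTIONS(path 's3a://my-bucket/orders.csv')")

// Load it into a SnappyData column table
snc.sql("CREATE TABLE ORDERS USING column OPTIONS() AS (SELECT * FROM ORDERS_STAGING)")
```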
From 26c2499f12f74b141c79f8c44798f0097408d067 Mon Sep 17 00:00:00 2001 From: Lizy Date: Thu, 4 Apr 2019 12:48:10 +0530 Subject: [PATCH 3/3] minor edits --- docs/howto/load_data_from_external_data_stores.md | 2 +- docs/reference/command_line_utilities/modify_disk_store.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/howto/load_data_from_external_data_stores.md b/docs/howto/load_data_from_external_data_stores.md index 13287bc179..6fa9616ab9 100644 --- a/docs/howto/load_data_from_external_data_stores.md +++ b/docs/howto/load_data_from_external_data_stores.md @@ -75,7 +75,7 @@ val df = session.createDataFrame(rdd, ds.schema) df.write.format("column").saveAsTable("columnTable") ``` -## Importing Data using JDBC from Rrelational DB +## Importing Data using JDBC from Relational DB !!! Note Before you begin, you must install the corresponding JDBC driver. To do so, copy the JDBC driver jar file in **/jars** directory located in the home directory and then restart the cluster. diff --git a/docs/reference/command_line_utilities/modify_disk_store.md b/docs/reference/command_line_utilities/modify_disk_store.md index 93aef166dc..fad38926d7 100644 --- a/docs/reference/command_line_utilities/modify_disk_store.md +++ b/docs/reference/command_line_utilities/modify_disk_store.md @@ -16,6 +16,8 @@ Snappy>create region --name=regionName --type=PARTITION_PERSISTENT_OVERFLOW **For non-secured cluster** +## Description + The following table describes the options used for `snappy modify-disk-store`: | Items | Description | @@ -27,8 +29,6 @@ The following table describes the options used for `snappy modify-disk-store`: !!! Note The name of the disk store, the directories its files are stored in, and the region to target are all required arguments. -## Description - ## Examples **Secured cluster**
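[Editor's note] The "Importing Data using JDBC from Relational DB" section renamed by the patch above shows only the driver-installation note here; a minimal sketch of the load itself, assuming a generic PostgreSQL source (the URL, driver class, and table names are hypothetical), could look like:

```pre
-- Register an external table over the JDBC source
CREATE EXTERNAL TABLE CUSTOMER_JDBC USING jdbc OPTIONS(
  url 'jdbc:postgresql://dbhost:5432/sales',
  driver 'org.postgresql.Driver',
  dbtable 'customer');

-- Copy the rows into a SnappyData column table
CREATE TABLE CUSTOMER USING column OPTIONS() AS (SELECT * FROM CUSTOMER_JDBC);
```

This mirrors the CSV example earlier in the series: stage the source as an external table, then load it into a managed table with `insert into` or `create table as select`.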