diff --git a/AdAnalytics_Architecture.png b/AdAnalytics_Architecture.png index b45c3d0..b3eae52 100644 Binary files a/AdAnalytics_Architecture.png and b/AdAnalytics_Architecture.png differ diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..c4ec2c3 --- /dev/null +++ b/LICENSE @@ -0,0 +1,32 @@ +Copyright 2019. TIBCO Software Inc. + +Apache License +Version 2.0, January 2004 +http://www.apache.org/licenses/ +TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION +1. Definitions. +"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. +"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. +"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. +"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. +"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. +"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. +"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). +"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. +"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." +"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. +2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. +3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. +4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: +You must give any other recipients of the Work or Derivative Works a copy of this License; and +You must cause any modified files to carry prominent notices stating that You changed the files; and +You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and +If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. + +You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. +5. Submission of Contributions. 
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. +6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. +7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. +8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. +9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. +END OF TERMS AND CONDITIONS diff --git a/README.md b/README.md index 5bb63dd..795e4f3 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,4 @@ -##### We benchmarked this code example against the [MemSQL Spark Connector](https://github.com/memsql/memsql-spark-connector) and the [Cassandra Spark Connector](https://github.com/datastax/spark-cassandra-connector). SnappyData outperformed Cassandra by 45x and MemSQL by 3x on query execution while concurrently ingesting. The benchmark is described [here](http://www.snappydata.io/blog/snappydata-memsql-cassandra-a-performance-benchmark). -##### There is a screencast associated with this repo [here](https://youtu.be/bXofwFtmHjE) ##### [Skip directly to instructions](#lets-get-this-going) @@ -14,170 +12,196 @@ 7. 
[Slack/Gitter/Stackoverflow discussion](#ask-questions-start-a-discussion) ### Introduction -[SnappyData](https://github.com/SnappyDataInc/snappydata) aims to deliver real time operational analytics at interactive speeds with commodity infrastructure and far less complexity than today. SnappyData fulfills this promise by -- Enabling streaming, transactions and interactive analytics in a single unifying system rather than stitching different solutions—and -- Delivering true interactive speeds via a state-of-the-art approximate query engine that leverages a multitude of synopses as well as the full dataset. SnappyData implements this by deeply integrating an in-memory database into Apache Spark. +[SnappyData](https://github.com/SnappyDataInc/snappydata) aims to deliver real time operational analytics at interactive +speeds with commodity infrastructure and far less complexity than today. +SnappyData fulfills this promise by +- Enabling streaming, transactions and interactive analytics in a single unifying system rather than stitching different +solutions +- Delivering true interactive speeds via a state-of-the-art approximate query engine that leverages a multitude of +synopses as well as the full dataset. SnappyData implements this by deeply integrating an in-memory database into +Apache Spark. ### Purpose -Here we use a simplified Ad Analytics example, which streams in [AdImpression](https://en.wikipedia.org/wiki/Impression_(online_media)) logs, pre-aggregating the logs and ingesting into the built-in in-memory columnar store (where the data is stored both in 'exact' form as well as a stratified sample). +Here we use a simplified Ad Analytics example, which streams in [AdImpression](https://en.wikipedia.org/wiki/Impression_(online_media)) +logs, pre-aggregating the logs and ingesting into the built-in in-memory columnar store (where the data is stored both +in 'exact' form as well as a stratified sample). We showcase the following aspects of this unified cluster: -- Simplicity of using SQL or the [DataFrame API](http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes) to model streams in spark. -- The use of SQL/SchemaDStream API (as continuous queries) to pre-aggregate AdImpression logs (it is faster and much more convenient to incorporate more complex analytics, rather than using map-reduce). -- Demonstrate storing the pre-aggregated logs into the SnappyData columnar store with high efficiency. While the store itself provides a rich set of features like hybrid row+column store, eager replication, WAN replicas, HA, choice of memory-only, HDFS, native disk persistence, eviction, etc we only work with a column table in this simple example. +- Simplicity of using the [DataFrame API](http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes) to model streams in Apache Spark. +- The use of the Structured Streaming API to pre-aggregate AdImpression logs (it is faster and much more convenient to incorporate more complex analytics, rather than using map-reduce). +- Demonstrate storing the pre-aggregated logs into the SnappyData columnar store with high efficiency. While the store itself provides a rich set of features like hybrid row+column store, eager replication, WAN replicas, high-availability, choice of memory-only, HDFS and native disk persistence, eviction, etc., we only work with a column table in this simple example. - Run OLAP queries from any SQL client both on the full data set as well as sampled data (showcasing sub-second interactive query speeds).
The stratified sample allows us to manage an infinitely growing data set at a fraction of the cost otherwise required. ### Ad Impression Analytics use case -We borrow our use case implementation from this [blog](https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/) - We more or less use the same data structure and aggregation logic and we have adapted this code to showcase the SnappyData programming model extensions to Spark. We retain the native Spark example for comparison. +We borrow our use case implementation from this [blog](https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/) +\- We more or less use the same data structure and aggregation logic and we have adapted this code to showcase the +SnappyData programming model extensions to Spark. For comparison, we also include the [native Spark example](src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala), which uses Structured Streaming. Our architecture is depicted in the figure below. -We consider an adnetwork where adservers log impressions in [Apache Kafka](http://kafka.apache.org/) (distributed publish-subscribe messaging system). These impressions are then aggregated by [Spark Streaming](http://spark.apache.org/streaming/) into the SnappyData Store. External clients connect to the same cluster using JDBC/ODBC and run arbitrary OLAP queries. -As AdServers can feed logs from many websites and given that each AdImpression log message represents a single Ad viewed by a user, one can expect thousands of messages every second. It is crucial that ingestion logic keeps up with the stream. To accomplish this, SnappyData collocates the store partitions with partitions created by Spark Streaming. i.e. a batch of data from the stream in each Spark executor is transformed into a compressed column batch and stored in the same JVM, avoiding redundant shuffles (except for HA). +We consider an ad network where AdServers log impressions in [Apache Kafka](http://kafka.apache.org/). These impressions +are then aggregated using [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) into the SnappyData Store. External clients connect to the same cluster using JDBC/ODBC and run arbitrary OLAP queries. +As AdServers can feed logs from many websites and given that each AdImpression log message represents a single Ad viewed +by a user, one can expect thousands of messages every second. It is crucial that the ingestion logic keeps up with the +stream. To accomplish this, SnappyData collocates the store partitions with partitions created by Spark Streaming, +i.e., a batch of data from the stream in each Spark executor is transformed into a compressed column batch and stored in +the same JVM, avoiding redundant shuffles (except for HA). ![Architecture Kinda](AdAnalytics_Architecture.png) The incoming AdImpression log is formatted as depicted below.
- -|timestamp |publisher |advertiser | website |geo|bid |cookie | -|-----------------------|-----------|------------|----------|---|--------|---------| -|2016-05-25 16:45:29.027|publisher44|advertiser11|website233|NJ |0.857122|cookie210| -|2016-05-25 16:45:29.027|publisher31|advertiser18|website642|WV |0.211305|cookie985| -|2016-05-25 16:45:29.027|publisher21|advertiser27|website966|ND |0.539119|cookie923| -|2016-05-25 16:45:29.027|publisher34|advertiser11|website284|WV |0.050856|cookie416| -|2016-05-25 16:45:29.027|publisher29|advertiser29|website836|WA |0.896101|cookie781| - - -We pre-aggregate these logs by publisher and geo, and compute the average bid, the number of impressions and the number of uniques (the number of unique users that viewed the Ad) every 2 seconds. We want to maintain the last day’s worth of data in memory for interactive analytics from external clients. + +|timestamp |publisher |advertiser |website |geo|bid |cookie | +|-----------------------|-----------|------------|----------|---|-------------------|---------| +|2020-02-17 16:37:59.289|publisher24|advertiser5 |website478|NJ |0.6682117005884909 |cookie649| +|2020-02-17 16:37:59.289|publisher31|advertiser19|website337|NE |0.5697320252959912 |cookie340| +|2020-02-17 16:37:59.289|publisher27|advertiser14|website364|OK |0.2685715410844016 |cookie536| +|2020-02-17 16:37:59.289|publisher23|advertiser26|website531|MT |0.7226818935272965 |cookie487| +|2020-02-17 16:37:59.289|publisher3 |advertiser15|website937|MT |0.48053937420374915|cookie605| + + +We pre-aggregate these logs by publisher and geo, and compute the average bid, the number of impressions and the number +of uniques (the number of unique users that viewed the Ad) every second. We want to maintain the last day’s worth of +data in memory for interactive analytics from external clients. Some examples of interactive queries: -- **Find total uniques for a certain AD grouped on geography;** -- **Impression trends for advertisers over time;** -- **Top ads based on uniques count for each Geo.** +- **Find total uniques for a certain AD grouped on geography** +- **Impression trends for advertisers over time** +- **Top ads based on uniques count for each Geo** So the aggregation will look something like: - -|timestamp |publisher |geo | avg_bid |imps|uniques| -|------------------------|-----------|----|------------------|----|-------| -|2016-05-25 16:45:01.026 |publisher10| UT |0.5725387931435979|30 |26 | -|2016-05-25 16:44:56.21 |publisher43| VA |0.5682680168342149|22 |20 | -|2016-05-25 16:44:59.024 |publisher19| OH |0.5619481767564926|5 |5 | -|2016-05-25 16:44:52.985 |publisher11| VA |0.4920346523303594|28 |21 | -|2016-05-25 16:44:56.803 |publisher38| WI |0.4585381957119518|40 |31 | + +|time_stamp |publisher |geo|avg_bid |imps|uniques| +|----------------------|-----------|---|-------------------|----|-------| +|2020-02-17 16:39:28.0 |publisher16|NY |0.4269814055107817 |190 |158 | +|2020-02-17 16:39:31.0 |publisher30|CT |0.4482890418617008 |19 |19 | +|2020-02-17 16:39:26.0 |publisher37|HI |0.21539768570303286|2 |2 | +|2020-02-17 16:39:33.0 |publisher38|ID |0.3639807522416625 |15 |15 | +|2020-02-17 16:39:27.0 |publisher37|OH |0.381703659839993 |25 |26 | ### Code highlights -We implemented the ingestion logic using 3 methods mentioned below but only describe the SQL approach for brevity here. -- [Vanilla Spark API](https://github.com/SnappyDataInc/snappy-poc/blob/master/src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala) (from the original blog). 
-- [Spark API with Snappy extensions](https://github.com/SnappyDataInc/snappy-poc/blob/master/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregator.scala) to work with the stream as a sequence of DataFrames. (btw, SQL based access to streams is also the theme behind [Structured streaming](https://issues.apache.org/jira/browse/SPARK-8360) being introduced in Spark 2.0 ) -- [SQL based](https://github.com/SnappyDataInc/snappy-poc/blob/master/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregator.scala) - described below. - -#### Generating the AdImpression logs -A [KafkaAdImpressionGenerator](src/main/scala/io/snappydata/adanalytics/KafkaAdImpressionProducer.scala) simulates Adservers and generates random [AdImpressionLogs](src/avro/adimpressionlog.avsc)(Avro formatted objects) in batches to Kafka. - ```scala +We implemented the ingestion logic using [Vanilla Spark Structured Streaming](src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala) +and [Spark Structured Streaming with Snappy Sink](src/main/scala/io/snappydata/adanalytics/SnappyLogAggregator.scala) +to work with the stream as a sequence of DataFrames. + +#### Generating the AdImpression logs +A [KafkaAdImpressionProducer](src/main/scala/io/snappydata/adanalytics/KafkaAdImpressionProducer.scala) simulates +Adservers and generates random [AdImpressionLogs](src/avro/adimpressionlog.avsc)(Avro formatted objects) in batches to Kafka. + +```scala val props = new Properties() - props.put("serializer.class", "io.snappydata.adanalytics.AdImpressionLogAvroEncoder") - props.put("partitioner.class", "kafka.producer.DefaultPartitioner") - props.put("key.serializer.class", "kafka.serializer.StringEncoder") - props.put("metadata.broker.list", brokerList) - val config = new ProducerConfig(props) - val producer = new Producer[String, AdImpressionLog](config) - sendToKafka(generateAdImpression()) - - def generateAdImpression(): AdImpressionLog = { - val random = new Random() - val timestamp = System.currentTimeMillis() - val publisher = Publishers(random.nextInt(NumPublishers)) - val advertiser = Advertisers(random.nextInt(NumAdvertisers)) - val website = s"website_${random.nextInt(Constants.NumWebsites)}.com" - val cookie = s"cookie_${random.nextInt(Constants.NumCookies)}" - val geo = Geos(random.nextInt(Geos.size)) - val bid = math.abs(random.nextDouble()) % 1 - val log = new AdImpressionLog() - } - - def sendToKafka(log: AdImpressionLog) = { - producer.send(new KeyedMessage[String, AdImpressionLog]( - Constants.kafkaTopic, log.getTimestamp.toString, log)) + props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer") + props.put("value.serializer", "io.snappydata.adanalytics.AdImpressionLogAVROSerializer") + props.put("bootstrap.servers", brokerList) + + val producer = new KafkaProducer[String, AdImpressionLog](props) + + def main(args: Array[String]) { + println("Sending Kafka messages of topic " + kafkaTopic + " to brokers " + brokerList) + val threads = new Array[Thread](numProducerThreads) + for (i <- 0 until numProducerThreads) { + val thread = new Thread(new Worker()) + thread.start() + threads(i) = thread + } + threads.foreach(_.join()) + println(s"Done sending $numLogsPerThread Kafka messages of topic $kafkaTopic") + System.exit(0) } - ``` -#### Spark stream as SQL table and Continuous query - [SnappySQLLogAggregator](src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregator.scala) creates a stream over the Kafka source. 
The messages are converted to [Row](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html) objects using [AdImpressionToRowsConverter](src/main/scala/io/snappydata/adanalytics/Codec.scala) to comply with the schema defined in the 'create stream table' below. -This is mostly just a SQL veneer over Spark Streaming. The stream table is also automatically registered with the SnappyData catalog so external clients can access this stream as a table. -Next, a continuous query is registered on the stream table that is used to create the aggregations we spoke about above. The query aggregates metrics for each publisher and geo every 1 second. This query runs every time a batch is emitted. It returns a SchemaDStream. - -```scala - val sc = new SparkContext(sparkConf) - val snsc = new SnappyStreamingContext(sc, batchDuration) - - /** - * AdImpressionStream presents the stream as a Table. It is registered with the Snappy catalog and hence queriable. - * Underneath the covers, this is an abstraction over a DStream. DStream batches are emitted as DataFrames here. - */ - snsc.sql("create stream table adImpressionStream (" + - " time_stamp timestamp," + - " publisher string," + - " advertiser string," + - " website string," + - " geo string," + - " bid double," + - " cookie string) " + - " using directkafka_stream options" + - " (storagelevel 'MEMORY_AND_DISK_SER_2'," + - " rowConverter 'io.snappydata.adanalytics.AdImpressionToRowsConverter' ," + - s" kafkaParams 'metadata.broker.list->$brokerList'," + - s" topics '$kafkaTopic'," + - " K 'java.lang.String'," + - " V 'io.snappydata.adanalytics.AdImpressionLog', " + - " KD 'kafka.serializer.StringDecoder', " + - " VD 'io.snappydata.adanalytics.AdImpressionLogAvroDecoder')") - - // Aggregate metrics for each publisher, geo every few seconds. Just 1 second in this example. - // With the stream registered as a table, we can execute arbitrary queries. - // These queries run each time a batch is emitted by the stream. A continuous query. - val resultStream: SchemaDStream = snsc.registerCQ( - "select min(time_stamp), publisher, geo, avg(bid) as avg_bid," + - " count(*) as imps , count(distinct(cookie)) as uniques" + - " from adImpressionStream window (duration 1 seconds, slide 1 seconds)" + - " where geo != 'unknown' group by publisher, geo") -``` -#### Ingesting into Column table -Next, create the Column table and ingest result of continuous query of aggregating AdImpressionLogs. Here we use the Spark Data Source API to write to the aggrAdImpressions table. This will automatically localize the partitions in the data store without shuffling the data. -```scala - snsc.sql("create table aggrAdImpressions(time_stamp timestamp, publisher string," + - " geo string, avg_bid double, imps long, uniques long) " + - "using column options(buckets '11')") - //Simple in-memory partitioned, columnar table with 11 partitions. 
- -//Other table types, options to replicate, persist, overflow, etc are defined - // here -> http://snappydatainc.github.io/snappydata/rowAndColumnTables/ - - //Persist using the Spark DataSource API - resultStream.foreachDataFrame(_.write.insertInto("aggrAdImpressions")) + def sendToKafka(log: AdImpressionLog): Future[RecordMetadata] = { + producer.send(new ProducerRecord[String, AdImpressionLog]( + Configs.kafkaTopic, log.getTimestamp.toString, log), new org.apache.kafka.clients.producer.Callback() { + override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = { + if (exception != null) { + if (exception.isInstanceOf[RetriableException]) { + println(s"Encountered a retriable exception while sending messages: $exception") + } else { + throw exception + } + } + } + } + ) + } ``` - -#### Ingesting into a Sample table -Finally, create a sample table that ingests from the column table specified above. This is the table that approximate queries will execute over. Here we create a query column set on the 'geo' column, specify how large of a sample we want relative to the column table (3%) and specify which table to ingest from: +#### Spark Structured Streaming With Snappysink + [SnappyLogAggregator](src/main/scala/io/snappydata/adanalytics/SnappyLogAggregator.scala) creates a stream over the + Kafka source and ingests data into a SnappyData table using [`Snappysink`](https://snappydatainc.github.io/snappydata/howto/use_stream_processing_with_snappydata/#structured-streaming). ```scala - snsc.sql("CREATE SAMPLE TABLE sampledAdImpressions" + - " OPTIONS(qcs 'geo', fraction '0.03', strataReservoirSize '50', baseTable 'aggrAdImpressions')") + +// The volumes are low. Optimize Spark shuffle by reducing the partition count +snappy.sql("set spark.sql.shuffle.partitions=8") + +snappy.sql("drop table if exists aggrAdImpressions") + +snappy.sql("create table aggrAdImpressions(time_stamp timestamp, publisher string," + + " geo string, avg_bid double, imps long, uniques long) " + + "using column options(buckets '11')") + +val schema = StructType(Seq(StructField("timestamp", TimestampType), StructField("publisher", + StringType), StructField("advertiser", StringType), + StructField("website", StringType), StructField("geo", StringType), + StructField("bid", DoubleType), StructField("cookie", StringType))) + +import snappy.implicits._ +val df = snappy.readStream + .format("kafka") + .option("kafka.bootstrap.servers", brokerList) + .option("value.deserializer", classOf[ByteArrayDeserializer].getName) + .option("startingOffsets", "earliest") + .option("subscribe", kafkaTopic) + .load() + // projecting only the value column of the Kafka data and deserializing it below + .select("value").as[Array[Byte]](Encoders.BINARY) + .mapPartitions(itr => { + // Reuse deserializer for each partition which will internally reuse decoder and data object + val deserializer = new AdImpressionLogAVRODeserializer + itr.map(data => { + // deserializing AVRO binary data and formulating Row out of it + val adImpressionLog = deserializer.deserialize(data) + Row(new java.sql.Timestamp(adImpressionLog.getTimestamp), adImpressionLog.getPublisher + .toString, adImpressionLog.getAdvertiser.toString, adImpressionLog.getWebsite.toString, + adImpressionLog.getGeo.toString, adImpressionLog.getBid, + adImpressionLog.getCookie.toString) + }) + })(RowEncoder.apply(schema)) + // filtering invalid records + .filter(s"geo != '${Configs.UnknownGeo}'") + +// Aggregating records over a one-second event-time window, grouped by publisher and geo +val windowedDF = df.withColumn("eventTime",
$"timestamp".cast("timestamp")) + .withWatermark("eventTime", "0 seconds") + .groupBy(window($"eventTime", "1 seconds", "1 seconds"), $"publisher", $"geo") + .agg(unix_timestamp(min("timestamp"), "MM-dd-yyyy HH:mm:ss").alias("timestamp"), + avg("bid").alias("avg_bid"), count("geo").alias("imps"), + approx_count_distinct("cookie").alias("uniques")) + .select("timestamp", "publisher", "geo", "avg_bid", "imps", "uniques") + +val logStream = windowedDF + .writeStream + .format("snappysink") // using snappysink as output sink + .queryName("log_aggregator") // name of the streaming query + .trigger(ProcessingTime("1 seconds")) // trigger the batch processing every second + .option("tableName", "aggrAdImpressions") // target table name where data will be ingested + //checkpoint location where the streaming query progress and intermediate aggregation state + // is stored. It should ideally be on some HDFS location. + .option("checkpointLocation", snappyLogAggregatorCheckpointDir) + // Only the rows that were updated since the last trigger will be output to the sink. + // More details about output mode: https://spark.apache.org/docs/2.1.1/structured-streaming-programming-guide.html#output-modes + .outputMode("update") + .start ``` ### Let's get this going In order to run this example, we need to install the following: -1. [Apache Kafka 2.11-0.8.2.1](http://kafka.apache.org/downloads.html) -2. [SnappyData 1.0.0 Enterprise Release](https://www.snappydata.io/download). Download the binary snappydata-1.0.0-bin.tar.gz and Unzip it. -The binaries will be inside "snappydata-1.0.0-bin" directory. +1. [Apache Kafka 2.11-0.10.2.2](https://archive.apache.org/dist/kafka/0.10.2.2/kafka_2.11-0.10.2.2.tgz) +2. [TIBCO ComputeDB 1.2.0](https://tap.tibco.com/storefront/trialware/tibco-computedb-developer-edition/prod15349.html) +or [SnappyData 1.2.0](https://github.com/SnappyDataInc/snappydata/releases/download/v1.2.0/snappydata-1.2.0-bin.tar.gz) 3. JDK 8 -Then checkout the Ad analytics example -``` -git clone https://github.com/SnappyDataInc/snappy-poc.git -``` - -Note that the instructions for kafka configuration below are for 2.11-0.8.2.1 version of Kafka. - -To setup kafka cluster, start Zookeeper first from the root kafka folder with default zookeeper.properties: +To set up the Kafka cluster, start Zookeeper first from the root Kafka folder with the default zookeeper.properties: ``` bin/zookeeper-server-start.sh config/zookeeper.properties ``` @@ -187,12 +211,17 @@ Start one Kafka broker with default properties: bin/kafka-server-start.sh config/server.properties ``` -From the root kafka folder, Create a topic "adImpressionsTopic": +From the root Kafka folder, create a topic "adImpressionsTopic": ``` bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 8 --topic adImpressionsTopic --replication-factor=1 ``` -Next from the checkout `/snappy-poc/` directory, build the example +Check out the Ad analytics example +``` +git clone https://github.com/SnappyDataInc/snappy-examples.git +``` + +Next, from the checked-out `/snappy-examples/` directory, build the example ``` -- Build and create a jar having all dependencies in assembly/build/libs ./gradlew assemble @@ -202,14 +231,14 @@ Next from the checkout `/snappy-poc/` directory, build the example ``` Go to the SnappyData product install home directory. -In conf subdirectory, create file "spark-env.sh"(copy spark-env.sh.template) and add this line ... +In the conf subdirectory, create the file "spark-env.sh" (copy spark-env.sh.template) and add this line ...
``` -SPARK_DIST_CLASSPATH=SNAPPY_POC_HOME/assembly/build/libs/snappy-poc-1.0.0-assembly.jar +SPARK_DIST_CLASSPATH=$SNAPPY_EXAMPLES_HOME/assembly/build/libs/snappy-examples-1.2.0-assembly.jar ``` -> Make sure you set the SNAPPY_POC_HOME directory appropriately above +> Make sure you set the `SNAPPY_EXAMPLES_HOME` directory appropriately above -Leave this file open as you will copy/paste the path for SNAPPY_POC_HOME shortly. +Leave this file open as you will copy/paste the path for `SNAPPY_EXAMPLES_HOME` shortly. Start SnappyData cluster using following command from installation directory. @@ -217,105 +246,122 @@ Start SnappyData cluster using following command from installation directory. ./sbin/snappy-start-all.sh ``` -This will start one locator, one server and a lead node. You can understand the roles of these nodes [here](https://github.com/SnappyDataInc/snappydata/blob/master/docs/GettingStarted.md#snappydata-cluster-explanation) - - +This will start one locator, one server and a lead node. You can understand the roles of these nodes [here](https://snappydatainc.github.io/snappydata/architecture/cluster_architecture/) Submit the streaming job to the cluster and start it (consume the stream, aggregate and store). > Make sure you copy/paste the SNAPPY_EXAMPLES_HOME path from above in the command below where indicated ``` -./bin/snappy-job.sh submit --lead localhost:8090 --app-name AdAnalytics --class io.snappydata.adanalytics.SnappySQLLogAggregatorJob --app-jar SNAPPY_POC_HOME/assembly/build/libs/snappy-poc-1.0.0-assembly.jar --stream +./bin/snappy-job.sh submit --lead localhost:8090 --app-name AdAnalytics --class io.snappydata.adanalytics.SnappyLogAggregator --app-jar $SNAPPY_EXAMPLES_HOME/assembly/build/libs/snappy-examples-1.2.0-assembly.jar ``` -SnappyData supports "Managed Spark Drivers" by running these in Lead nodes. So, if the driver were to fail, it can automatically re-start on a standby node. While the Lead node starts the streaming job, the actual work of parallel processing from kafka, etc is done in the SnappyData servers. Servers execute Spark Executors collocated with the data. +8090 is the default port of spark-jobserver, which is used to manage Snappy jobs. + +SnappyData supports "Managed Spark Drivers" by running these in Lead nodes. So, if the driver were to fail, it can +automatically re-start on a standby node. While the Lead node starts the streaming job, the actual work of parallel +processing from Apache Kafka, etc., is done in the SnappyData servers. Servers execute Spark Executors collocated with the data. -Start generating and publishing logs to Kafka from the `/snappy-poc/` folder +Start generating and publishing logs to Kafka from the `/snappy-examples/` folder ``` ./gradlew generateAdImpressions ``` -You can see the Spark streaming processing batches of data once every second in the [Spark console](http://localhost:4040/streaming/). It is important that our stream processing keeps up with the input rate. So, we note that the 'Scheduling Delay' doesn't keep increasing and 'Processing time' remains less than a second. +You can monitor the streaming query progress on the [Structured Streaming UI](http://localhost:5050/structuredstreaming/). It is +important that our stream processing keeps up with the input rate. So, we should monitor that the `Processing Rate` keeps +up with `Input Rate` and `Processing Time` remains less than the trigger interval, which is one second. ### Next, interact with the data. Fast. Now, we can run some interactive analytic queries on the pre-aggregated data.
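The walkthrough below uses the interactive `snappy-sql` shell, but the same queries can be issued from any JDBC client, since the cluster exposes the tables over JDBC/ODBC as noted earlier. Here is a minimal Scala sketch, assuming the SnappyData client (JDBC) jar is on the classpath and the locator is listening on the default client port 1527; the driver class and URL scheme follow the SnappyData JDBC documentation and may need adjusting for your install:

```scala
import java.sql.DriverManager

object AdAnalyticsJdbcClient {
  def main(args: Array[String]): Unit = {
    // Assumed SnappyData client driver class; JDBC 4 drivers usually also auto-register
    Class.forName("io.snappydata.jdbc.ClientDriver")
    // Connect through the locator's client port (1527 by default for a local cluster)
    val conn = DriverManager.getConnection("jdbc:snappydata://localhost:1527/")
    try {
      val stmt = conn.createStatement()
      // Same aggregate table that the streaming job ingests into
      val rs = stmt.executeQuery(
        "SELECT COUNT(*) AS adCount, geo FROM aggrAdImpressions " +
          "GROUP BY geo ORDER BY adCount DESC LIMIT 20")
      while (rs.next()) {
        // column 1 = adCount, column 2 = geo
        println(rs.getString(2) + " -> " + rs.getLong(1))
      }
    } finally {
      conn.close()
    }
  }
}
```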
From the root SnappyData folder, enter: ``` -./bin/snappy-shell +./bin/snappy-sql ``` Once this loads, connect to your running local cluster with: ``` -connect client 'localhost:1527'; +CONNECT CLIENT 'localhost:1527'; ``` -Set Spark shuffle partitions low since we don't have a lot of data; you can optionally view the members of the cluster as well: +Set Spark shuffle partitions to a lower number since we don't have a lot of data; you can optionally view the members of the cluster +as well: ``` -set spark.sql.shuffle.partitions=7; -show members; +SET spark.sql.shuffle.partitions=8; +SHOW members; ``` Let's do a quick count to make sure we have the ingested data: ```sql -select count(*) from aggrAdImpressions; +SELECT COUNT(*) FROM aggrAdImpressions; ``` -Let's also directly query the stream using SQL: +Now, let's run some OLAP queries on the column table of exact data. First, let's find the top 20 geographies with the most +ad impressions: ```sql -select count(*) from adImpressionStream; +SELECT COUNT(*) AS adCount, geo FROM aggrAdImpressions GROUP BY geo ORDER BY adCount DESC LIMIT 20; ``` -Now, lets run some OLAP queries on the column table of exact data. First, lets find the top 20 geographies with the most ad impressions: +Next, let's find the total uniques for a given ad, grouped by geography: ```sql -select count(*) AS adCount, geo from aggrAdImpressions group by geo order by adCount desc limit 20; +SELECT SUM(uniques) AS totalUniques, geo FROM aggrAdImpressions WHERE publisher='publisher11' GROUP BY geo ORDER BY totalUniques DESC LIMIT 20; ``` -Next, let's find the total uniques for a given ad, grouped by geography: +Note: The following instructions will only work with the TIBCO ComputeDB distribution and not with the SnappyData community distribution. + +Now that we've seen some standard OLAP queries over the exact data, let's execute the same queries on our sample tables +using SnappyData's [Approximate Query Processing techniques](https://github.com/SnappyDataInc/snappydata/blob/master/docs/aqp.md). +In most production situations, the latency difference here would be significant because the volume of data in the exact +table would be much higher than the sample tables. Since this is an example, there will not be a significant difference; +we are showcasing how easy AQP is to use. + +Create a sample table that ingests from the column table specified above. This is the table that approximate +queries will execute over. Here we create a query column set on the 'geo' column, specify how large of a sample we want +relative to the column table (3%) and specify which table to ingest from: ```sql -select sum(uniques) AS totalUniques, geo from aggrAdImpressions where publisher='publisher11' group by geo order by totalUniques desc limit 20; +CREATE SAMPLE TABLE sampledAdImpressions OPTIONS(qcs 'geo', fraction '0.03', strataReservoirSize '50', baseTable 'aggrAdImpressions'); ``` -Now that we've seen some standard OLAP queries over the exact data, let's execute the same queries on our sample tables using SnappyData's [Approximate Query Processing techinques](https://github.com/SnappyDataInc/snappydata/blob/master/docs/aqp.md). In most production situations, the latency difference here would be significant because the volume of data in the exact table would be much higher than the sample tables. Since this is an example, there will not be a significant difference; we are showcasing how easy AQP is to use.
- -We are asking for an error rate of 20% or below and a confidence interval of 0.95 (note the last two clauses on the query). The addition of these last two clauses route the query to the sample table despite the exact table being in the FROM clause. If the error rate exceeds 20% an exception will be produced: +We are asking for an error rate of 20% or below and a confidence interval of 0.95 (note the last two clauses on the query). +The addition of these last two clauses routes the query to the sample table despite the base table being in the FROM +clause. If the error rate exceeds 20%, an exception will be produced: ```sql -select count(*) AS adCount, geo from aggrAdImpressions group by geo order by adCount desc limit 20 with error 0.20 confidence 0.95 ; +SELECT COUNT(*) AS adCount, geo FROM aggrAdImpressions GROUP BY geo ORDER BY adCount DESC LIMIT 20 WITH ERROR 0.20 CONFIDENCE 0.95 ; ``` And the second query from above: ```sql -select sum(uniques) AS totalUniques, geo from aggrAdImpressions where publisher='publisher11' group by geo order by totalUniques desc limit 20 with error 0.20 confidence 0.95 ; +SELECT SUM(uniques) AS totalUniques, geo FROM aggrAdImpressions WHERE publisher='publisher11' GROUP BY geo ORDER BY totalUniques DESC LIMIT 20 WITH ERROR 0.20 CONFIDENCE 0.95 ; ``` -Note that you can still query the sample table without specifying error and confidence clauses by simply specifying the sample table in the FROM clause: +Note that you can still query the sample table without specifying error and confidence clauses by simply specifying the +sample table in the FROM clause: ```sql -select sum(uniques) AS totalUniques, geo from sampledAdImpressions where publisher='publisher11' group by geo order by totalUniques desc; +SELECT SUM(uniques) AS totalUniques, geo FROM sampledAdImpressions WHERE publisher='publisher11' GROUP BY geo ORDER BY totalUniques DESC; ``` Now, we check the size of the sample table: ```sql -select count(*) as sample_cnt from sampledAdImpressions; +SELECT COUNT(*) AS sample_cnt FROM sampledAdImpressions; ``` -Finally, stop the SnappyData cluser with: +Finally, stop the SnappyData cluster with: ``` ./sbin/snappy-stop-all.sh ``` ### So, what was the point again? -Hopefully we showed you how simple yet flexible it is to parallely ingest, process using SQL, run continuous queries, store data in column and sample tables and interactively query data. All in a single unified cluster. -We will soon release Part B of this exercise - a benchmark of this use case where we compare SnappyData to other alternatives. Coming soon. +Hopefully we showed you how simple yet flexible it is to ingest in parallel, process using SQL, run continuous queries, +store data in column and sample tables and interactively query data. All in a single unified cluster.
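As a closing note, one of the example questions listed earlier, impression trends over time, was not shown in the walkthrough above. The aggregate table keys on publisher and geo rather than advertiser, so a trend-style query has to work per publisher. Below is a minimal, hypothetical sketch of doing this programmatically; it assumes a SnappyData-aware SparkContext is available, for example when the code is submitted with snappy-job.sh like the aggregator job:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SnappySession

object ImpressionTrend {
  // Hypothetical helper: impression trend over time for one publisher,
  // read from the same aggrAdImpressions table the streaming job fills.
  def run(sc: SparkContext): Unit = {
    val snappy = new SnappySession(sc)
    snappy.sql(
      """SELECT time_stamp, SUM(imps) AS imps
        |FROM aggrAdImpressions
        |WHERE publisher = 'publisher11'
        |GROUP BY time_stamp
        |ORDER BY time_stamp""".stripMargin)
      .show(50, truncate = false)
  }
}
```

The same SELECT statement can also be pasted directly into `snappy-sql`.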
### Ask questions, start a Discussion @@ -326,7 +372,7 @@ We will soon release Part B of this exercise - a benchmark of this use case wher [SnappyData Docs](http://snappydatainc.github.io/snappydata/) -[This Example Source](https://github.com/SnappyDataInc/snappy-poc) +[This Example Source](https://github.com/SnappyDataInc/snappy-examples) [SnappyData Technical Paper](http://www.snappydata.io/snappy-industrial) diff --git a/assembly/build.gradle b/assembly/build.gradle index fce24ff..767a278 100644 --- a/assembly/build.gradle +++ b/assembly/build.gradle @@ -1,12 +1,9 @@ apply plugin: 'com.github.johnrengelman.shadow' -archivesBaseName = 'snappy-poc' +archivesBaseName = 'snappy-examples' dependencies { compile rootProject - compile project(':spark-memsql') - compile project(':spark-cassandra') - compile project(':rabbitmq-snappy') } shadowJar { diff --git a/build.gradle b/build.gradle index 6001302..0b5aa66 100644 --- a/build.gradle +++ b/build.gradle @@ -4,20 +4,15 @@ plugins { id 'com.commercehub.gradle.plugin.avro' version "0.5.0" } -archivesBaseName = 'snappy-poc' +archivesBaseName = 'snappy-examples' allprojects { - version = '1.0.0' + version = '1.2.0' repositories { + mavenLocal() mavenCentral() - maven { url "https://oss.sonatype.org/content/groups/public" } - maven { url "https://oss.sonatype.org/content/repositories/snapshots" } - maven { url "http://repository.snappydata.io/repository/internal" } - maven { url "http://repository.snappydata.io/repository/snapshots" } - maven { url "http://mvnrepository.com/artifact" } - maven { url 'https://clojars.org/repo' } } apply plugin: 'java' @@ -28,8 +23,8 @@ allprojects { apply plugin: "com.commercehub.gradle.plugin.avro" ext { - sparkVersion = '2.0.1-3' - snappyVersion = '1.0.0' + sparkVersion = '2.1.1.8' + snappyVersion = '1.2.0' } configurations.all { @@ -49,14 +44,14 @@ dependencies { compileOnly "io.snappydata:snappy-spark-core_2.11:${sparkVersion}" compileOnly "io.snappydata:snappy-spark-catalyst_2.11:${sparkVersion}" compileOnly "io.snappydata:snappy-spark-sql_2.11:${sparkVersion}" - // compileOnly "io.snappydata:snappydata-aqp_2.11:${snappyVersion}" - - compile 'com.miguno:kafka-avro-codec_2.10:0.1.1-SNAPSHOT' - compile 'org.apache.kafka:kafka_2.11:0.8.2.1' - compile 'com.twitter:algebird-core_2.10:0.1.11' - compile 'com.googlecode.javaewah:JavaEWAH:1.1.5' - compile 'org.joda:joda-convert:1.2' - compile 'com.opencsv:opencsv:3.3' + + compile 'org.apache.kafka:kafka-clients:0.10.0.1' + compile 'org.apache.kafka:kafka_2.11:0.10.0.1' +} + +sourceSets { + test.compileClasspath += configurations.compileOnly + test.runtimeClasspath += configurations.compileOnly } task generateAvro(type: com.commercehub.gradle.plugin.avro.GenerateAvroJavaTask) { @@ -68,7 +63,6 @@ compileJava.source(generateAvro.outputs) avro.stringType = "charSequence" - ext { assemblyJar = rootProject.tasks.getByPath(':assembly:shadowJar').outputs } @@ -79,84 +73,4 @@ task generateAdImpressions(type: JavaExec, dependsOn: classes) { main = 'io.snappydata.adanalytics.KafkaAdImpressionProducer' classpath sourceSets.test.runtimeClasspath environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task aggeregateAdImpressions_API(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.adanalytics.SnappyAPILogAggregator' - jvmArgs = ['-XX:MaxPermSize=512m'] - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task aggeregateAdImpressions_SQL(type: JavaExec, dependsOn: classes) { - main = 
'io.snappydata.adanalytics.SnappySQLLogAggregator' - jvmArgs = ['-XX:MaxPermSize=512m'] - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task generateAdImpressions_Socket(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.benchmark.SocketAdImpressionGenerator' - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath - maxHeapSize = "8196m" -} - -task startSnappyIngestionPerf_Socket(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.benchmark.SocketSnappyIngestionPerf' - jvmArgs = ['-XX:MaxPermSize=512m'] - maxHeapSize = "8196m" - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task startSnappyIngestionPerf_CustomReceiver(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.benchmark.CustomReceiverSnappyIngestionPerf' - jvmArgs = ['-XX:MaxPermSize=512m'] - maxHeapSize = "8196m" - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task startSnappyIngestionPerf_CSV(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.benchmark.CSVSnappyIngestionPerf' - jvmArgs = ['-XX:MaxPermSize=512m'] - maxHeapSize = "8196m" - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task startSnappyIngestionPerf_Kafka(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.benchmark.KafkaSnappyIngestionPerf' - jvmArgs = ['-XX:MaxPermSize=512m'] - maxHeapSize = "8196m" - classpath sourceSets.test.runtimeClasspath - environment 'PROJECT_ASSEMBLY_JAR', assemblyJar.files.asPath -} - -task product(type: Exec) { - dependsOn ':assembly:shadowJar' - - def productDir = "${rootProject.buildDir}/snappydata-poc" - def snappyData = System.env.SNAPPYDATA - if (snappyData == null || snappyData.length() == 0) { - snappyData = "${projectDir}/../snappydata" - } - - doFirst { - delete productDir - file("${productDir}/lib").mkdirs() - } - - // first execute the snappydata "product" target based on env var SNAPPYDATA - workingDir snappyData - commandLine './gradlew', 'copyProduct', "-PcopyToDir=${productDir}" - - // lastly copy own assembly fat jar in product lib dir - doLast { - copy { - from assemblyJar - into "${productDir}/lib" - } - } -} +} \ No newline at end of file diff --git a/rabbitmq-snappy/build.gradle b/rabbitmq-snappy/build.gradle deleted file mode 100644 index 616e646..0000000 --- a/rabbitmq-snappy/build.gradle +++ /dev/null @@ -1,23 +0,0 @@ -dependencies { - compileOnly "io.snappydata:snappydata-core_2.11:${snappyVersion}" - compileOnly "io.snappydata:snappydata-cluster_2.11:${snappyVersion}" - compileOnly "io.snappydata:snappy-spark-core_2.11:${sparkVersion}" - compileOnly "io.snappydata:snappy-spark-catalyst_2.11:${sparkVersion}" - compileOnly "io.snappydata:snappy-spark-sql_2.11:${sparkVersion}" - compileOnly "io.snappydata:snappy-spark-streaming_2.11:${sparkVersion}" - compile 'com.rabbitmq:amqp-client:3.5.7' - compile project(':') -} - -task publishRabbitMQMsgs(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.adanalytics.aggregator.RabbitMQPublisher' - classpath sourceSets.main.runtimeClasspath - classpath configurations.runtime -} - -task receiveRabbitMQMsgs(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.adanalytics.aggregator.RabbitMQSnappyStream' - classpath sourceSets.main.runtimeClasspath - classpath configurations.runtime - jvmArgs = 
['-XX:MaxPermSize=512m'] // for snappy -} diff --git a/rabbitmq-snappy/src/main/java/io/snappydata/rabbitmq/RabbitMQPublisher.java b/rabbitmq-snappy/src/main/java/io/snappydata/rabbitmq/RabbitMQPublisher.java deleted file mode 100644 index 6ed9ee4..0000000 --- a/rabbitmq-snappy/src/main/java/io/snappydata/rabbitmq/RabbitMQPublisher.java +++ /dev/null @@ -1,75 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.rabbitmq; - -import com.rabbitmq.client.Channel; -import com.rabbitmq.client.Connection; -import com.rabbitmq.client.ConnectionFactory; -import io.snappydata.adanalytics.AdImpressionGenerator; -import io.snappydata.adanalytics.AdImpressionLog; -import org.apache.avro.io.BinaryEncoder; -import org.apache.avro.io.DatumWriter; -import org.apache.avro.io.EncoderFactory; -import org.apache.avro.specific.SpecificDatumWriter; - -import java.io.ByteArrayOutputStream; -import java.io.IOException; - -/** - * A RabbitMQ client which publishes messages to RabbitMQ server - */ -public class RabbitMQPublisher { - - private final static String QUEUE_NAME = "rabbitmq-q"; - - public static void main(String[] argv) - throws Exception { - ConnectionFactory factory = new ConnectionFactory(); - factory.setHost("localhost"); - Connection connection = factory.newConnection(); - Channel channel = connection.createChannel(); - channel.queueDeclare(QUEUE_NAME, true, false, false, null); - - int logCount = 0; - int totalNumLogs = 1000000; - - while (logCount <= totalNumLogs) { - AdImpressionLog log = AdImpressionGenerator.nextRandomAdImpression(); - channel.basicPublish("", QUEUE_NAME, null, getLogBytes(log)); - logCount += 1; - if (logCount % 100000 == 0) { - System.out.println("RabbitMQPublisher published total " + - logCount + " messages"); - } - } - - channel.close(); - connection.close(); - } - - private static byte[] getLogBytes(AdImpressionLog log) throws IOException { - ByteArrayOutputStream out = new ByteArrayOutputStream(); - BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null); - SpecificDatumWriter writer = - new SpecificDatumWriter(AdImpressionLog.getClassSchema()); - writer.write(log, encoder); - encoder.flush(); - out.close(); - return out.toByteArray(); - } -} diff --git a/rabbitmq-snappy/src/main/scala/io/snappydata/rabbitmq/RabbitMQAvroDecoder.scala b/rabbitmq-snappy/src/main/scala/io/snappydata/rabbitmq/RabbitMQAvroDecoder.scala deleted file mode 100644 index 5a06f92..0000000 --- a/rabbitmq-snappy/src/main/scala/io/snappydata/rabbitmq/RabbitMQAvroDecoder.scala +++ /dev/null @@ -1,14 +0,0 @@ -package io.snappydata.rabbitmq - -import io.snappydata.adanalytics.AdImpressionLog -import org.apache.avro.io.DecoderFactory -import org.apache.avro.specific.SpecificDatumReader -import org.apache.spark.sql.streaming.RabbitMQDecoder - -class RabbitMQAvroDecoder extends RabbitMQDecoder[AdImpressionLog] { - def fromBytes(bytes: 
scala.Array[scala.Byte]): AdImpressionLog = { - val reader = new SpecificDatumReader[AdImpressionLog](AdImpressionLog.getClassSchema()) - val decoder = DecoderFactory.get().binaryDecoder(bytes, null) - reader.read(null, decoder) - } -} \ No newline at end of file diff --git a/rabbitmq-snappy/src/main/scala/io/snappydata/rabbitmq/RabbitMQSnappyApp.scala b/rabbitmq-snappy/src/main/scala/io/snappydata/rabbitmq/RabbitMQSnappyApp.scala deleted file mode 100644 index 5970f38..0000000 --- a/rabbitmq-snappy/src/main/scala/io/snappydata/rabbitmq/RabbitMQSnappyApp.scala +++ /dev/null @@ -1,70 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.rabbitmq - -import io.snappydata.adanalytics.Configs -import Configs._ -import org.apache.log4j.{Level, Logger} -import org.apache.spark.streaming.SnappyStreamingContext -import org.apache.spark.{SparkConf, SparkContext} - -/** - * A Snappy streaming program to receive RabbitMQ messages - */ -object RabbitMQSnappyApp extends App { - - Logger.getRootLogger().setLevel(Level.ERROR) - - val sparkConf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") - .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - - val sc = new SparkContext(sparkConf) - val snsc = new SnappyStreamingContext(sc, batchDuration) - - snsc.sql("drop table if exists adImpressions") - snsc.sql("drop table if exists adImpressionStream") - - snsc.sql("create stream table adImpressionStream (" + - " timestamp long," + - " publisher string," + - " advertiser string," + - " website string," + - " geo string," + - " bid double," + - " cookie string) " + - " using rabbitmq_stream options(" + - " rowConverter 'io.snappydata.adanalytics.aggregator.AdImpressionToRowsConverter' ," + - " host 'localhost',"+ - " queueName 'rabbitmq-q',"+ - " T 'io.snappydata.adanalytics.aggregator.AdImpressionLog'," + - " D 'io.snappydata.adanalytics.aggregator.RabbitMQAvroDecoder')") - - snsc.sql("create table adImpressions(timestamp long, publisher string, " + - "advertiser string, website string, geo string, bid double, cookie string) " + - "using column options ( BUCKETS '29')") - - snsc.getSchemaDStream("adImpressionStream").foreachDataFrame(df => { - df.show - df.write.insertInto("adImpressions") - }) - - snsc.start() - snsc.awaitTermination() -} diff --git a/settings.gradle b/settings.gradle index 553809b..7b093e8 100644 --- a/settings.gradle +++ b/settings.gradle @@ -1 +1 @@ -include 'spark-memsql', 'spark-cassandra', 'rabbitmq-snappy', 'assembly' +include 'assembly' diff --git a/spark-cassandra/build.gradle b/spark-cassandra/build.gradle deleted file mode 100644 index bdcb2ef..0000000 --- a/spark-cassandra/build.gradle +++ /dev/null @@ -1,37 +0,0 @@ -dependencies { - compileOnly "org.apache.spark:spark-core_2.11:1.6.0" - compileOnly "org.apache.spark:spark-streaming_2.11:1.6.0" - compileOnly 
"org.apache.spark:spark-streaming-kafka_2.11:1.6.0" - // compile "org.apache.spark:spark-streaming-kafka_2.10:1.6.0" //required in assemble jar in cluster mode - compileOnly "org.apache.spark:spark-sql_2.11:1.6.0" - - compile project(':') - compile 'com.datastax.spark:spark-cassandra-connector_2.11:1.6.0-M2' -} - -task startCassandraStreamIngestPerf(type: JavaExec, dependsOn: classes) { - dependsOn ':assembly:shadowJar' - main = 'io.snappydata.benchmark.CassandraStreamIngestPerf' - def filterGuava = sourceSets.test.runtimeClasspath.findAll { !it.getName().contains('guava-14') } - classpath filterGuava - environment 'PROJECT_ASSEMBLY_JAR', "~/guava-16.0.1.jar:~/AdImpressionLogAggr-0.3-assembly.jar" -} - -task startCassandraQueryPerf(type: JavaExec, dependsOn: classes) { - dependsOn ':assembly:shadowJar' - main = 'io.snappydata.benchmark.CassandraQueryPerf' - def filterGuava = sourceSets.test.runtimeClasspath.findAll { !it.getName().contains('guava-14') } - classpath filterGuava - environment 'PROJECT_ASSEMBLY_JAR', "~/guava-16.0.1.jar:~/AdImpressionLogAggr-0.3-assembly.jar" -} - -task startOLAPStreamingBenchmark(type: JavaExec, dependsOn: classes) { - dependsOn ':assembly:shadowJar' - main = 'io.snappydata.benchmark.chbench.OLAPStreamingBench' - maxHeapSize = "57344m" - def filterGuava = sourceSets.test.runtimeClasspath.findAll { !it.getName().contains('guava-14') } - classpath filterGuava - environment 'PROJECT_ASSEMBLY_JAR', "~/guava-16.0.1.jar:~/AdImpressionLogAggr-0.3-assembly.jar" -} - - diff --git a/spark-cassandra/src/main/scala/io/snappydata/benchmark/CassandraQueryPerf.scala b/spark-cassandra/src/main/scala/io/snappydata/benchmark/CassandraQueryPerf.scala deleted file mode 100644 index f2e0e45..0000000 --- a/spark-cassandra/src/main/scala/io/snappydata/benchmark/CassandraQueryPerf.scala +++ /dev/null @@ -1,50 +0,0 @@ -package io.snappydata.benchmark - -import io.snappydata.adanalytics.Configs -import Configs._ -import org.apache.log4j.{Level, Logger} -import org.apache.spark.sql.cassandra.CassandraSQLContext -import org.apache.spark.{SparkConf, SparkContext} - -object CassandraQueryPerf extends App { - - val rootLogger = Logger.getLogger("org"); - rootLogger.setLevel(Level.WARN); - - val conf = new SparkConf(true) - .setAppName(getClass.getSimpleName) - .set("spark.cassandra.connection.host", s"$cassandraHost") - .set("spark.cassandra.auth.username", "cassandra") - .set("spark.cassandra.auth.password", "cassandra") - .set("spark.cassandra.sql.keyspace", "adlogs") - // .set("spark.sql.shuffle.partitions", "8") - .setMaster("local[*]") - .set("spark.executor.cores", "2") - .set("spark.ui.port", "4041") - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - conf.set("spark.driver.extraClassPath", assemblyJar) - conf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(conf) - val msc = new CassandraSQLContext(sc) - msc.setKeyspace("adlogs") - var start = System.currentTimeMillis() - msc.sql("select count(*) AS adCount, geo from adimpressions group" + - " by geo order by adCount desc limit 20").collect() - println("Time for Q1 " + (System.currentTimeMillis() - start )) - - start = System.currentTimeMillis() - msc.sql("select sum (bid) as max_bid, geo from adimpressions group" + - " by geo order by max_bid desc limit 20").collect() - println("Time for Q2 " + (System.currentTimeMillis() - start )) - - start = System.currentTimeMillis() - msc.sql("select sum (bid) as max_bid, publisher from adimpressions" + - " group by 
publisher order by max_bid desc limit 20").collect() - println("Time for Q3 " + (System.currentTimeMillis() - start )) - - msc.sql("select count(*) as cnt from adimpressions").show() - -} diff --git a/spark-cassandra/src/main/scala/io/snappydata/benchmark/CassandraStreamIngestPerf.scala b/spark-cassandra/src/main/scala/io/snappydata/benchmark/CassandraStreamIngestPerf.scala deleted file mode 100644 index 923aeb1..0000000 --- a/spark-cassandra/src/main/scala/io/snappydata/benchmark/CassandraStreamIngestPerf.scala +++ /dev/null @@ -1,95 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.benchmark - -import com.datastax.spark.connector.SomeColumns -import com.datastax.spark.connector.cql.CassandraConnector -import io.snappydata.adanalytics.AdImpressionLogAvroDecoder -import io.snappydata.adanalytics.Configs._ -import io.snappydata.adanalytics.AdImpressionLog -import kafka.serializer.StringDecoder -import org.apache.log4j.{Level, Logger} -import org.apache.spark.sql.cassandra.CassandraSQLContext -import org.apache.spark.streaming.StreamingContext -import org.apache.spark.streaming.kafka.KafkaUtils -import org.apache.spark.{SparkConf, SparkContext} - -/** - * Simple direct kafka spark streaming program which pulls log messages - * from kafka broker and ingest those log messages to Cassandra using - * DataStax's Spark Cassandra Connector. To run this program you need to - * start a single instance of Cassandra and run Spark in local mode. 
- */ -object CassandraStreamIngestPerf extends App { - - val rootLogger = Logger.getLogger("org"); - rootLogger.setLevel(Level.WARN); - - val conf = new SparkConf(true) - .setAppName(getClass.getSimpleName) - .set("spark.cassandra.connection.host", s"$cassandraHost") - .set("spark.cassandra.auth.username", "cassandra") - .set("spark.cassandra.auth.password", "cassandra") - .set("spark.streaming.kafka.maxRatePerPartition", s"$maxRatePerPartition") - .set("spark.cassandra.output.batch.size.bytes", "5120") //8000 * 1024 - .set("spark.cassandra.output.concurrent.writes", "32") - .set("spark.cassandra.output.consistency.level", "ANY") - .set("spark.cassandra.output.batch.grouping.key", "none") ///replica_set/partition - .set("spark.cassandra.sql.keyspace", "adlogs") - .set("spark.executor.cores", "2") - //.set("spark.cassandra.output.batch.size.rows", "500") - //.set("spark.cassandra.output.batch.grouping.buffer.size", "1") //1000 - //.setMaster(s"$sparkMasterURL") - .setMaster("local[*]") - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - conf.set("spark.driver.extraClassPath", assemblyJar) - conf.set("spark.executor.extraClassPath", assemblyJar) - } - val sc = new SparkContext(conf) - val csc = new CassandraSQLContext(sc) - CassandraConnector(conf).withSessionDo { session => - // Create keysapce and table in Cassandra - session.execute(s"DROP KEYSPACE IF EXISTS adlogs") - session.execute(s"CREATE KEYSPACE IF NOT EXISTS adlogs " + - s"WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }") - session.execute(s"CREATE TABLE IF NOT EXISTS adlogs.adimpressions " + - s"(timestamp bigint, publisher text, advertiser text, " + - "website text, geo text, bid double, cookie text, primary key (timestamp, cookie))") - } - csc.setKeyspace("adlogs") - - // batchDuration of 1 second - val ssc = new StreamingContext(sc, batchDuration) - - // Creates a stream of AdImpressionLog using kafka direct that pulls - // messages from a Kafka Broker - val messages = KafkaUtils.createDirectStream - [String, AdImpressionLog, StringDecoder, AdImpressionLogAvroDecoder](ssc, kafkaParams, topics) - - import com.datastax.spark.connector.streaming._ - - messages.map(_._2).map(ad => - (ad.getTimestamp, ad.getPublisher, ad.getAdvertiser, ad.getWebsite, ad.getGeo, ad.getBid, ad.getCookie)) - .saveToCassandra("adlogs", "adimpressions", - SomeColumns("timestamp", "publisher", "advertiser", "website", "geo", "bid", "cookie")) - - ssc.start - ssc.awaitTermination -} diff --git a/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/BenchmarkingReceiver.scala b/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/BenchmarkingReceiver.scala deleted file mode 100644 index 82c1e3b..0000000 --- a/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/BenchmarkingReceiver.scala +++ /dev/null @@ -1,68 +0,0 @@ -package io.snappydata.benchmark.chbench - -import java.util.Random -import org.apache.spark.storage.StorageLevel -import org.apache.spark.streaming.receiver.Receiver - -class BenchmarkingReceiver (val maxRecPerSecond: Int, - val numWarehouses: Int, - val numDistrictsPerWarehouse: Int, - val numCustomersPerDistrict: Int, - val itemCount : Int) - extends Receiver[ClickStreamCustomer](StorageLevel.MEMORY_AND_DISK) { - - - var receiverThread: Thread = null - var stopThread = false; - override def onStart() { - receiverThread = new Thread("BenchmarkingReceiver") { - override def run() { - receive() - } - } - receiverThread.start() - } - - override def 
onStop(): Unit = { - receiverThread.interrupt() - } - - private def receive() { - while (!isStopped()) { - val start = System.currentTimeMillis() - var i = 0; - for (i <- 1 to maxRecPerSecond) { - store(generateClickStream()) - if (isStopped()) { - return - } - } - // If one second hasn't elapsed wait for the remaining time - // before queueing more. - val remainingtime = 1000 - (System.currentTimeMillis() - start) - if (remainingtime > 0) { - Thread.sleep(remainingtime) - } - } - } - - val rand = new Random(123) - - private def generateClickStream(): ClickStreamCustomer = { - - val warehouseID: Int = rand.nextInt(numWarehouses) - val districtID: Int = rand.nextInt(this.numDistrictsPerWarehouse) - val customerID: Int = rand.nextInt(this.numCustomersPerDistrict) - val itemId: Int = rand.nextInt(this.itemCount) - // timespent on website is 100 -500 seconds - val timespent: Int = rand.nextInt(400) + 100 - - new ClickStreamCustomer(warehouseID, districtID, customerID, itemId, timespent) - } -} - -class ClickStreamCustomer (val w_id: Int, - val d_id: Int, - val c_id: Int, - val i_id: Int, - val c_ts: Int) extends Serializable \ No newline at end of file diff --git a/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/HQueries.scala b/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/HQueries.scala deleted file mode 100644 index 8ec7b5a..0000000 --- a/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/HQueries.scala +++ /dev/null @@ -1,486 +0,0 @@ -package io.snappydata.benchmark.chbench - -object HQueries { - - val Q1: String = "SELECT ol_number, " + - " sum(ol_quantity) AS sum_qty, " + - " sum(ol_amount) AS sum_amount, " + - " avg(ol_quantity) AS avg_qty, " + - " avg(ol_amount) AS avg_amount, " + - " count(*) AS count_order " + - " FROM order_line" + - " WHERE ol_delivery_d > '2007-01-02 00:00:00.000000' " + - " GROUP BY ol_number " + " ORDER BY ol_number" - - val Q2: String = "SELECT su_suppkey, " + - " su_name, " + - " n_name, " + - " i_id, " + - " i_name, " + - " su_address, " + - " su_phone, " + - " su_comment " + - " FROM item, supplier, stock, nation, region, " + - " (SELECT s_i_id AS m_i_id, MIN(s_quantity) AS m_s_quantity " + - " FROM stock, " + " supplier, " + - " nation, " + " region " + - " WHERE PMOD((s_w_id*s_i_id), 10000)=su_suppkey " + - " AND su_nationkey=n_nationkey " + - " AND n_regionkey=r_regionkey " + - " AND r_name LIKE 'Europ%' " + - " GROUP BY s_i_id) m " + - " WHERE i_id = s_i_id " + - " AND PMOD((s_w_id * s_i_id), 10000) = su_suppkey " + - " AND su_nationkey = n_nationkey " + - " AND n_regionkey = r_regionkey " + - " AND i_data LIKE '%b' " + - " AND r_name LIKE 'Europ%' " + - " AND i_id=m_i_id " + - " AND s_quantity = m_s_quantity " + - " ORDER BY n_name, " + - " su_name, " + - " i_id" - val Q3: String = "SELECT ol_o_id, " + - " ol_w_id, " + - " ol_d_id, " + - " sum(ol_amount) AS revenue, " + - " o_entry_d " + - " FROM " + - " customer, " + - " orders, " + - " order_line," + - " new_order" + - " WHERE c_state LIKE 'A%' " + - " AND c_id = o_c_id " + - " AND c_w_id = o_w_id " + - " AND c_d_id = o_d_id " + - " AND no_w_id = o_w_id " + - " AND no_d_id = o_d_id " + - " AND no_o_id = o_id " + - " AND ol_w_id = o_w_id " + - " AND ol_d_id = o_d_id " + - " AND ol_o_id = o_id " + - " AND o_entry_d > '2007-01-02 00:00:00.000000' " + - " GROUP BY ol_o_id, " + - " ol_w_id, " + - " ol_d_id, " + - " o_entry_d " + - " ORDER BY revenue DESC , o_entry_d" - - /*val Q4: String = "SELECT o_ol_cnt, " + - "count(*) AS order_count " + - "FROM orders " + 
"WHERE exists " + - "(SELECT * " + "FROM order_line " + - "WHERE o_id = ol_o_id " + - "AND o_w_id = ol_w_id " + - "AND o_d_id = ol_d_id " + - "AND ol_delivery_d >= o_entry_d) " + - "GROUP BY o_ol_cnt " + - "ORDER BY o_ol_cnt" - */ - - val Q4: String = "SELECT o_ol_cnt, count(*) AS order_count " + - "FROM orders JOIN order_line ON o_id = ol_o_id " + - "AND o_w_id = ol_w_id " + - "AND o_d_id = ol_d_id " + - "AND ol_delivery_d >= o_entry_d " + - "GROUP BY o_ol_cnt " + - "ORDER BY o_ol_cnt " - - val Q5: String = "SELECT n_name, " + - "sum(ol_amount) AS revenue " + - "FROM customer, " + - "orders, " + - "order_line, " + - "stock, " + - "supplier, " + - "nation, " + - "region " + - "WHERE c_id = o_c_id " + - "AND c_w_id = o_w_id " + - "AND c_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "AND ol_w_id = o_w_id " + - "AND ol_d_id=o_d_id " + - "AND ol_w_id = s_w_id " + - "AND ol_i_id = s_i_id " + - "AND pMOD((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND ascii(substr(c_state, 1, 1)) = su_nationkey " + - "AND su_nationkey = n_nationkey " + - "AND n_regionkey = r_regionkey " + - "AND r_name = 'Europe' " + - "AND o_entry_d >= '2007-01-02 00:00:00.000000' " + - "GROUP BY n_name " + - "ORDER BY revenue DESC" - - val Q6: String = "SELECT sum(ol_amount) AS revenue " + - "FROM order_line " + - "WHERE ol_delivery_d >= '1999-01-01 00:00:00.000000' " + - "AND ol_delivery_d < '2020-01-01 00:00:00.000000' " + - "AND ol_quantity BETWEEN 1 AND 100000" - - // Select, GroupBY and OrderClause of Q7 has been MOdified - val Q7: String = "SELECT su_nationkey AS supp_nation, " + - "n2.n_nationkey AS cust_nation, " + - "YEAR(o_entry_d) AS l_year, " + - "sum(ol_amount) AS revenue " + - "FROM supplier, " + - "stock, " + - "order_line, " + - "orders, " + - "customer, " + - "nation n1, " + - "nation n2 " + - "WHERE ol_supply_w_id = s_w_id " + - "AND ol_i_id = s_i_id " + - "AND pMOD ((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND ol_w_id = o_w_id " + - "AND ol_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "AND c_id = o_c_id " + - "AND c_w_id = o_w_id " + - "AND c_d_id = o_d_id " + - "AND su_nationkey = n1.n_nationkey " + - "AND ascii(substr(c_state,1, 1)) = n2.n_nationkey " + - "AND ((n1.n_name = 'Germany' " + - "AND n2.n_name = 'Cambodia') " + - "OR (n1.n_name = 'Cambodia' " + - "AND n2.n_name = 'Germany')) " + - "GROUP BY su_nationkey, " + - "n2.n_nationkey, " + - "YEAR(o_entry_d) " + - "ORDER BY su_nationkey, " + - "n2.n_nationkey, " + - "YEAR(o_entry_d)" - // Modified the group by and order by clauses - val Q8 = "SELECT YEAR (o_entry_d) AS l_year, " + - "sum(CASE WHEN n2.n_name = 'Germany' THEN ol_amount ELSE 0 END) / sum(ol_amount) AS mkt_share " + - "FROM item, " + - "supplier, " + - "stock, " + - "order_line, " + - "orders, " + - "customer, " + - "nation n1, " + - "nation n2, " + - "region " + - "WHERE i_id = s_i_id " + - "AND ol_i_id = s_i_id " + - "AND ol_supply_w_id = s_w_id " + - "AND pMOD ((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND ol_w_id = o_w_id " + - "AND ol_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "AND c_id = o_c_id " + - "AND c_w_id = o_w_id " + - "AND c_d_id = o_d_id " + - "AND n1.n_nationkey = ascii(substr(c_state, 1, 1)) " + - "AND n1.n_regionkey = r_regionkey " + - "AND ol_i_id < 1000 " + - "AND r_name = 'Europe' " + - "AND su_nationkey = n2.n_nationkey " + - "AND i_data LIKE '%b' " + - "AND i_id = ol_i_id " + - "GROUP BY YEAR(o_entry_d) " + - " ORDER BY YEAR(o_entry_d)" - - // Modified the group by and order by clauses - val Q9 = "SELECT n_name, YEAR(o_entry_d) AS l_year, " + - 
"sum(ol_amount) AS sum_profit " + - "FROM item, stock, supplier, " + - "order_line, " + - "orders, " + - "nation " + - "WHERE ol_i_id = s_i_id " + - "AND ol_supply_w_id = s_w_id " + - "AND pMOD ((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND ol_w_id = o_w_id " + - "AND ol_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "AND ol_i_id = i_id " + - "AND su_nationkey = n_nationkey " + - "AND i_data LIKE '%bb' " + - "GROUP BY n_name, " + - "YEAR(o_entry_d) " + - "ORDER BY n_name, " + - "YEAR(o_entry_d) DESC" - - val Q10 = "SELECT c_id, " + - "c_last, " + - "sum(ol_amount) AS revenue, " + - "c_city, " + - "c_phone, " + - "n_name " + - "FROM customer, " + - "orders, " + - "order_line, " + - "nation " + - "WHERE c_id = o_c_id " + - "AND c_w_id = o_w_id " + - "AND c_d_id = o_d_id " + - "AND ol_w_id = o_w_id " + - "AND ol_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "AND o_entry_d >= '2007-01-02 00:00:00.000000' " + - "AND o_entry_d <= ol_delivery_d " + - "AND n_nationkey = ascii(substr(c_state, 1, 1)) " + - "GROUP BY c_id, " + - "c_last, " + - "c_city, " + - "c_phone, " + - "n_name " + - "ORDER BY revenue DESC" - - /*val Q11 = "SELECT s_i_id, " + - "sum(s_order_cnt) AS ordercount " + - "FROM stock, " + - "supplier, " + - "nation " + - "WHERE pmod((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND su_nationkey = n_nationkey " + - "AND n_name = 'Germany' " + - "GROUP BY s_i_id HAVING sum(s_order_cnt) > " + - "(SELECT sum(s_order_cnt) * .005 " + - "FROM stock, " + - "supplier, " + - "nation " + - "WHERE pmod((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND su_nationkey = n_nationkey " + - "AND n_name = 'Germany') " + - "ORDER BY ordercount DESC"*/ - - val Q11a = "SELECT sum(s_order_cnt) * .005 " + - "FROM stock, " + - "supplier, " + - "nation " + - "WHERE pmod((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND su_nationkey = n_nationkey " + - "AND trim(n_name) = 'Germany'" - - val Q11b = "SELECT s_i_id, " + - "sum(s_order_cnt) AS ordercount " + - "FROM stock, " + - "supplier, " + - "nation " + - "WHERE pmod((s_w_id * s_i_id), 10000) = su_suppkey " + - "AND su_nationkey = n_nationkey " + - "AND n_name = 'Germany' " + - "GROUP BY s_i_id HAVING sum(s_order_cnt) > ? 
" + - "ORDER BY ordercount DESC" - - val Q12 = "SELECT o_ol_cnt, " + - "sum(CASE WHEN o_carrier_id = 1 " + - "OR o_carrier_id = 2 THEN 1 ELSE 0 END) AS high_line_count, " + - "sum(CASE WHEN o_carrier_id <> 1 " + - "AND o_carrier_id <> 2 THEN 1 ELSE 0 END) AS low_line_count " + - "FROM orders, " + - "order_line " + - "WHERE ol_w_id = o_w_id " + - "AND ol_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "AND o_entry_d <= ol_delivery_d " + - "AND ol_delivery_d < '2020-01-01 00:00:00.000000' " + - "GROUP BY o_ol_cnt " + - "ORDER BY o_ol_cnt" - - val Q13 = "SELECT c_count, " + - "count(*) AS custdist " + - "FROM " + - "(SELECT c_id, " + - "count(o_id) AS c_count " + - "FROM customer " + - "LEFT OUTER JOIN orders ON (c_w_id = o_w_id " + - "AND c_d_id = o_d_id " + - "AND c_id = o_c_id " + - "AND o_carrier_id > 8) " + - "GROUP BY c_id) AS c_orders " + - "GROUP BY c_count " + - "ORDER BY custdist DESC, c_count DESC" - - val Q14 = " SELECT (100.00 * sum(CASE WHEN i_data LIKE 'PR%' THEN ol_amount ELSE 0 END) / " + - "(1 + sum(ol_amount))) AS promo_revenue " + - "FROM order_line, " + - "item " + - "WHERE ol_i_id = i_id " + - "AND ol_delivery_d >= '2007-01-02 00:00:00.000000' " + - "AND ol_delivery_d < '2020-01-02 00:00:00.000000'" - - val Q15a = "SELECT " + - " pmod((s_w_id * s_i_id),10000) as supplier_no, " + - " sum(ol_amount) as total_revenue " + - "FROM " + - " order_line, stock " + - "WHERE " + - " ol_i_id = s_i_id " + - " AND ol_supply_w_id = s_w_id " + - " AND ol_delivery_d >= '2007-01-02 00:00:00.000000' " + - "GROUP BY pmod((s_w_id * s_i_id),10000)" - - val Q15b = "select max(total_revenue) as mxRevenue from revenue " - - val Q15c = "SELECT su_suppkey, " + - " su_name, " + - " su_address, " + - " su_phone, " + - " total_revenue " + - "FROM supplier, revenue " + - "WHERE su_suppkey = supplier_no " + - " AND total_revenue = ? 
" + - "ORDER BY su_suppkey" - - val Q16 = "SELECT i_name, " + - "substr(i_data, 1, 3) AS brand, " + - "i_price, " + - "count(DISTINCT (pmod((s_w_id * s_i_id),10000))) AS supplier_cnt " + - "FROM stock, " + - "item " + - "WHERE i_id = s_i_id " + - "AND i_data NOT LIKE 'zz%' " + - "AND pmod((s_w_id * s_i_id),10000) NOT IN " + - "(SELECT SU_SUPPKEY " + - "FROM supplier " + - "WHERE su_comment LIKE '%bad%') " + - "" + - "GROUP BY i_name, " + - "substr(i_data, 1, 3), " + - "i_price " + - "ORDER BY supplier_cnt DESC" - - val Q17 = "SELECT SUM(ol_amount) / 2.0 AS avg_yearly " + - "FROM order_line, " + - "(SELECT i_id, AVG (ol_quantity) AS a " + - "FROM item, " + - "order_line " + - "WHERE i_data LIKE '%b' " + - "AND ol_i_id = i_id " + - "GROUP BY i_id) t " + - "WHERE ol_i_id = t.i_id " + - "AND ol_quantity < t.a" - val Q18 = " SELECT c_last, " + - "c_id, " + - "o_id, " + - "o_entry_d, " + - "o_ol_cnt, " + - "sum(ol_amount) AS amount_sum " + - "FROM customer, " + - "orders, " + - "order_line " + - "WHERE c_id = o_c_id " + - "AND c_w_id = o_w_id " + - "AND c_d_id = o_d_id " + - "AND ol_w_id = o_w_id " + - "AND ol_d_id = o_d_id " + - "AND ol_o_id = o_id " + - "GROUP BY o_id, " + - "o_w_id, " + - "o_d_id, " + - "c_id, " + - "c_last, " + - "o_entry_d, " + - "o_ol_cnt HAVING sum(ol_amount) > 200 " + - "ORDER BY amount_sum DESC, o_entry_d" - - val Q19 = " SELECT sum(ol_amount) AS revenue " + - "FROM order_line, " + - "item " + - "WHERE (ol_i_id = i_id " + - "AND i_data LIKE '%a' " + - "AND ol_quantity >= 1 " + - "AND ol_quantity <= 10 " + - "AND i_price BETWEEN 1 AND 400000 " + - "AND ol_w_id IN (1, 2, 3)) " + - "OR (ol_i_id = i_id " + - "AND i_data LIKE '%b' " + - "AND ol_quantity >= 1 " + - "AND ol_quantity <= 10 " + - "AND i_price BETWEEN 1 AND 400000 " + - "AND ol_w_id IN (1, 2, 4)) OR (ol_i_id = i_id AND i_data LIKE '%c' AND ol_quantity >= 1 " + - "AND ol_quantity <= 10 " + - "AND i_price BETWEEN 1 AND 400000 " + - "AND ol_w_id IN (1, 5, 3))" - - val Q20 = "SELECT su_name, su_address " + - "FROM supplier, " + - "nation " + - "WHERE su_suppkey IN " + - "(SELECT pmod(s_i_id * s_w_id, 10000) " + - "FROM stock " + - "INNER JOIN item ON i_id = s_i_id " + - "INNER JOIN order_line ON ol_i_id = s_i_id " + - "WHERE ol_delivery_d > '2010-05-23 12:00:00' " + - "AND i_data LIKE 'co%' " + - "GROUP BY s_i_id, " + - "s_w_id, " + - "s_quantity HAVING 2*s_quantity > sum(ol_quantity)) " + - "AND su_nationkey = n_nationkey " + - "AND n_name = 'Germany' " + - "ORDER BY su_name" - - val Q21 = "SELECT su_name, " + - " count(*) AS numwait " + - "FROM supplier, " + - " order_line l1, " + - " orders, " + - " stock, " + - " nation " + - "WHERE ol_o_id = o_id " + - " AND ol_w_id = o_w_id " + - " AND ol_d_id = o_d_id " + - " AND ol_w_id = s_w_id " + - " AND ol_i_id = s_i_id " + - " AND pmod((s_w_id * s_i_id),10000) = su_suppkey " + - " AND l1.ol_delivery_d > o_entry_d " + - " AND NOT EXISTS " + - " (SELECT * " + - " FROM order_line l2 " + - " WHERE l2.ol_o_id = l1.ol_o_id " + - " AND l2.ol_w_id = l1.ol_w_id " + - " AND l2.ol_d_id = l1.ol_d_id " + - " AND l2.ol_delivery_d > l1.ol_delivery_d) " + - " AND su_nationkey = n_nationkey " + - " AND n_name = 'Germany' " + - "GROUP BY su_name " + - "ORDER BY numwait DESC, su_name " - - val Q22a = "SELECT avg(c_balance) " + - "FROM customer " + - "WHERE c_balance > 0.00 " + - "AND substring(c_phone,1,1) IN ('1', '2', '3', '4', '5', '6', '7')" - - val Q22b = "SELECT substring(c_state,1,1) AS country, " + - "count(*) AS numcust, " + - "sum(c_balance) AS totacctbal " + - "FROM customer 
left outer join orders on o_c_id = c_id " + - " AND o_w_id = c_w_id " + - " AND o_d_id = c_d_id " + - "WHERE substring(c_phone,1,1) IN ('1','2','3','4','5','6','7') " + - " AND c_balance > ? " + - " AND o_id is null " + - "GROUP BY substring(c_state,1,1) " + - "ORDER BY substring(c_state,1,1)" - - val queries = Map( - "Q1" -> Q1, - "Q2" -> Q2, - "Q3" -> Q3, - "Q4" -> Q4, - "Q5" -> Q5, - "Q6" -> Q6, - "Q7" -> Q7, - "Q8" -> Q8, - "Q9" -> Q9, - "Q10" -> Q10, - "Q11" -> Q11b, - "Q12" -> Q12, - "Q13" -> Q13, - "Q14" -> Q14, - "Q15" -> Q15c, - "Q16" -> Q16, - "Q17" -> Q17, - "Q18" -> Q18, - "Q19" -> Q19, - "Q20" -> Q20, - "Q21" -> Q21, - "Q22" -> Q22b - ) -} diff --git a/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/OLAPStreamingBench.scala b/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/OLAPStreamingBench.scala deleted file mode 100644 index 1138389..0000000 --- a/spark-cassandra/src/main/scala/io/snappydata/benchmark/chbench/OLAPStreamingBench.scala +++ /dev/null @@ -1,185 +0,0 @@ -package io.snappydata.benchmark.chbench - -import java.io.PrintWriter - -import com.datastax.spark.connector.cql.CassandraConnector -import org.apache.log4j.{Level, Logger} -import org.apache.spark.sql.cassandra.CassandraSQLContext -import org.apache.spark.sql.types.{IntegerType, StructType, TimestampType} -import org.apache.spark.sql.{DataFrame, Row} -import org.apache.spark.streaming.{Duration, StreamingContext} -import org.apache.spark.{SparkConf, SparkContext} - -object OLAPStreamingBench extends App { - val rootLogger = Logger.getLogger("org"); - rootLogger.setLevel(Level.WARN); - - val host = "127.0.0.1" - val master = "local[*]" - val numWH = 10 - val memory = "8g" - - val conf = new SparkConf(true) - .setAppName(getClass.getSimpleName) - .set(s"spark.cassandra.connection.host", host) - .set("spark.cassandra.auth.username", "cassandra") - .set("spark.cassandra.auth.password", "cassandra") - .set("spark.cassandra.sql.keyspace", "tpcc") - .set("spark.driver.memory", memory) - .set("spark.executor.memory",memory) - .set("spark.executor.cores", "6") - .set("spark.driver.maxResultSize", "10g") - .set("spark.scheduler.mode", "FAIR") - .setMaster(master) - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - conf.set("spark.driver.extraClassPath", assemblyJar) - conf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(conf) - val cc = new CassandraSQLContext(sc) - val ssc = new StreamingContext(sc, Duration(2000)) - - cc.setKeyspace("tpcc") - CassandraConnector(conf).withSessionDo { session => - println("******* CONNECTED TO CASSANDRA **********") - } - cc.sql("set spark.sql.shuffle.partitions=64") - - val stream = ssc.receiverStream[ClickStreamCustomer]( - new BenchmarkingReceiver(10000, numWH, 10, 30000, 100000)) - - val schema = new StructType() - .add("cs_c_w_id", IntegerType) - .add("cs_c_d_id", IntegerType) - .add("cs_c_id", IntegerType) - .add("cs_i_id", IntegerType) - .add("cs_timespent", IntegerType) - .add("cs_click_d", TimestampType) - - val rows = stream.map(v => Row(v.w_id, - v.d_id, v.c_id, v.i_id, v.c_ts, new java.sql.Timestamp(System.currentTimeMillis))) - - val window_rows = rows.window(new Duration(60 * 1000), new Duration(60 * 1000)) - - window_rows.foreachRDD(rdd => { - val df = cc.createDataFrame(rdd, schema) - val outFileName = s"BenchmarkingStreamingJob-${System.currentTimeMillis()}.out" - val pw = new PrintWriter(outFileName) - val clickstreamlog = "benchmarking" + System.currentTimeMillis() 
- df.registerTempTable(clickstreamlog) - // Find out the items in the clickstream with - // price range greater than a particular amount. - var resultdfQ1: DataFrame = null - var resultdfQ2: DataFrame = null - cc.synchronized { - resultdfQ1 = cc.sql(s"select i_id, count(i_id) from " + - s" $clickstreamlog, item " + - " where i_id = cs_i_id " + - " AND i_price > 50 " + - " GROUP BY i_id "); - - // Find out which district's customer are currently more online active to - // stop tv commercials in those districts - resultdfQ2 = cc.sql("select avg(cs_timespent) as avgtimespent , cs_c_d_id " + - s"from $clickstreamlog group by cs_c_d_id order by avgtimespent") - } - - val sq1 = System.currentTimeMillis() - resultdfQ1.limit(10).collect().foreach(pw.println) - val endq1 = System.currentTimeMillis() - resultdfQ2.collect().foreach(pw.println) - val endq2 = System.currentTimeMillis() - val output = s"Q1 ${endq1 - sq1} Q2 ${endq2 - endq1}" - val tid = Thread.currentThread() - pw.println(s"$tid Time taken $output") - pw.close() - }) - - ssc.start - - def getCurrentDirectory = new java.io.File(".").getCanonicalPath - - // scalastyle:off println - - var i: Int = 0 - while (i < 4) { - val outFileName = s"HQueries-${i}.out" - - val pw = new PrintWriter(outFileName) - - i = i + 1 - for (q <- HQueries.queries) { - val start: Long = System.currentTimeMillis - val tid = Thread.currentThread() - try { - q._1 match { - case "Q11" => - var df : DataFrame = null - cc.synchronized { - df = cc.sql(HQueries.Q11a) - } - val ret = df.collect() - assert(ret.length == 1) - val paramVal = ret(0).getDecimal(0) - val qry = q._2.replace("?", paramVal.toString) - cc.synchronized { - df = cc.sql(qry) - } - df.collect() - case "Q15" => - var ret : DataFrame = null - cc.synchronized { - ret = cc.sql(HQueries.Q15a) - } - ret.registerTempTable("revenue") - var df : DataFrame = null - cc.synchronized { - df = cc.sql(HQueries.Q15b) - } - val maxV = df.collect() - val paramVal = maxV(0).getDouble(0) - val qry = q._2.replace("?", paramVal.toString) - cc.synchronized { - df = cc.sql(qry) - } - df.collect() - case "Q22" => - var df: DataFrame = null - cc.synchronized { - df = cc.sql(HQueries.Q22a) - } - val ret = df.collect() - assert(ret.length == 1) - val paramVal = ret(0).getDouble(0) - val qry = q._2.replace("?", paramVal.toString) - cc.synchronized { - df = cc.sql(qry) - } - df.collect() - case "Q16" | "Q20" | "Q21" => - pw.println(s"$tid Not running " + q._1) - pw.flush() - //cc.sql(q._2).collect() - case _ => - var df: DataFrame = null - cc.synchronized { - df = cc.sql(q._2) - } - df.collect() - } - } catch { - case e: Throwable => pw.println(s"$tid Exception for query ${q._1}: " + e) - } - val end: Long = System.currentTimeMillis - start - pw.println(s"${new java.util.Date(System.currentTimeMillis())} $tid Time taken by ${q._1} is $end") - pw.flush() - } - pw.close() - } - // // Return the output file name - // s"See ${getCurrentDirectory}" - ssc.awaitTermination -} diff --git a/spark-memsql/build.gradle b/spark-memsql/build.gradle deleted file mode 100644 index 32be270..0000000 --- a/spark-memsql/build.gradle +++ /dev/null @@ -1,18 +0,0 @@ -dependencies { - compileOnly "org.apache.spark:spark-core_2.10:1.5.2" - compileOnly "org.apache.spark:spark-streaming_2.10:1.5.2" - compileOnly "org.apache.spark:spark-streaming-kafka_2.10:1.5.2" - compileOnly "org.apache.spark:spark-sql_2.10:1.5.2" - compileOnly "org.apache.spark:spark-catalyst_2.10:1.5.2" - - compile project(':') - compile 'com.memsql:memsql-connector_2.10:1.3.2' -} - 
-task startMemSqlStreamIngestPerf(type: JavaExec, dependsOn: classes) { - main = 'io.snappydata.benchmark.MemSqlStreamIngestPerf' - def filterGuava = sourceSets.test.runtimeClasspath.findAll { !it.getName().contains('guava-14') } - def guava19 = sourceSets.test.runtimeClasspath.findAll { it.getName().contains('guava-19') } - classpath filterGuava - environment 'PROJECT_ASSEMBLY_JAR', "${guava19[0]}:${assemblyJar.files.asPath}" -} diff --git a/spark-memsql/src/main/scala/io/snappydata/benchmark/MemSqlQueryPerf.scala b/spark-memsql/src/main/scala/io/snappydata/benchmark/MemSqlQueryPerf.scala deleted file mode 100644 index 980d4bf..0000000 --- a/spark-memsql/src/main/scala/io/snappydata/benchmark/MemSqlQueryPerf.scala +++ /dev/null @@ -1,61 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.benchmark - -import com.memsql.spark.connector.MemSQLContext -import io.snappydata.adanalytics.Configs._ -import org.apache.log4j.{Level, Logger} -import org.apache.spark.{SparkConf, SparkContext} - -object MemSqlQueryPerf extends App { - - val rootLogger = Logger.getLogger("org"); - rootLogger.setLevel(Level.WARN); - - val conf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - conf.set("spark.driver.extraClassPath", assemblyJar) - conf.set("spark.executor.extraClassPath", assemblyJar) - } - conf.set("memsql.defaultDatabase", "adLogs") - - val sc = new SparkContext(conf) - val msc = new MemSQLContext(sc) - - var start = System.currentTimeMillis() - msc.sql("select count(*) AS adCount, geo from adImpressions group" + - " by geo order by adCount desc limit 20").collect() - println("Time for Q1 " + (System.currentTimeMillis() - start)) - - start = System.currentTimeMillis() - msc.sql("select sum (bid) as max_bid, geo from adImpressions group" + - " by geo order by max_bid desc limit 20").collect() - println("Time for Q2 " + (System.currentTimeMillis() - start)) - - start = System.currentTimeMillis() - msc.sql("select sum (bid) as max_bid, publisher from adImpressions" + - " group by publisher order by max_bid desc limit 20").collect() - println("Time for Q3 " + (System.currentTimeMillis() - start)) - -} - - diff --git a/spark-memsql/src/main/scala/io/snappydata/benchmark/MemSqlStreamIngestPerf.scala b/spark-memsql/src/main/scala/io/snappydata/benchmark/MemSqlStreamIngestPerf.scala deleted file mode 100644 index 5bfb569..0000000 --- a/spark-memsql/src/main/scala/io/snappydata/benchmark/MemSqlStreamIngestPerf.scala +++ /dev/null @@ -1,99 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. 
You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.benchmark - -import com.memsql.spark.connector.MemSQLContext -import io.snappydata.adanalytics.{Configs, AdImpressionLogToRowRDD, AdImpressionLogAvroDecoder} -import Configs._ -import io.snappydata.adanalytics.AdImpressionLog -import kafka.serializer.StringDecoder -import org.apache.spark.streaming.StreamingContext -import org.apache.spark.streaming.kafka.KafkaUtils -import org.apache.spark.{SparkConf, SparkContext} - -/** - * Simple direct kafka spark streaming program which pulls log messages - * from kafka broker and ingest those log messages to MemSql using - * Spark MemSql Connector. To run this program you need to - * start an aggregator and leaf node of MemSql and run Spark in local mode. - */ -object MemSqlStreamIngestPerf extends App { - - val conf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") - .set("spark.executor.cores", "6") - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - conf.set("spark.driver.extraClassPath", assemblyJar) - conf.set("spark.executor.extraClassPath", assemblyJar) - } - conf.set("memsql.defaultDatabase", "adLogs") - - val sc = new SparkContext(conf) - val msc = new MemSQLContext(sc) - - import com.memsql.spark.connector.util.JDBCImplicits._ - - msc.getMemSQLCluster.withMasterConn(conn => { - conn.withStatement(stmt => { - // Create database and table in MemSql - stmt.execute(s"CREATE DATABASE IF NOT EXISTS adLogs") - stmt.execute(s"DROP TABLE IF EXISTS adLogs.adImpressions") - stmt.execute( - s""" - CREATE TABLE adLogs.adImpressions - (timestamp bigint, - publisher varchar(15), - advertiser varchar(15), - website varchar(20), - geo varchar(4), - bid double, - cookie varchar(20), - KEY (`timestamp`) USING CLUSTERED COLUMNSTORE, - SHARD KEY (timestamp)) - """) - }) - }) - - // batchDuration of 1 second - val ssc = new StreamingContext(sc, batchDuration) - - val schema = msc.table("adLogs.adImpressions").schema - - val rowConverter = new AdImpressionLogToRowRDD - - import com.memsql.spark.connector._ - - // Creates a stream of AdImpressionLog using kafka direct that pulls - // messages from a Kafka Broker - val messages = KafkaUtils.createDirectStream - [String, AdImpressionLog, StringDecoder, AdImpressionLogAvroDecoder](ssc, kafkaParams, topics) - - // transform the Spark RDDs as per the table schema and save it to MemSql - messages.map(_._2).foreachRDD(rdd => { - msc.createDataFrame(rowConverter.convert(rdd), schema) - .saveToMemSQL("adLogs", "adImpressions") - }) - - ssc.start - ssc.awaitTermination -} - - diff --git a/src/main/scala/io/snappydata/adanalytics/AdImpressionGenerator.scala b/src/main/scala/io/snappydata/adanalytics/AdImpressionGenerator.scala index e420822..a956325 100644 --- a/src/main/scala/io/snappydata/adanalytics/AdImpressionGenerator.scala +++ b/src/main/scala/io/snappydata/adanalytics/AdImpressionGenerator.scala @@ -1,19 +1,8 @@ /* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. 
- * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ +* Copyright © 2019. TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. +*/ package io.snappydata.adanalytics import java.util.Random diff --git a/src/main/scala/io/snappydata/adanalytics/AdImpressionLogAVRODeserializer.scala b/src/main/scala/io/snappydata/adanalytics/AdImpressionLogAVRODeserializer.scala new file mode 100644 index 0000000..8c777a9 --- /dev/null +++ b/src/main/scala/io/snappydata/adanalytics/AdImpressionLogAVRODeserializer.scala @@ -0,0 +1,23 @@ +/* +* Copyright © 2019. TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. +*/ + +package io.snappydata.adanalytics + +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} +import org.apache.avro.specific.{SpecificData, SpecificDatumReader} + +class AdImpressionLogAVRODeserializer { + @transient private lazy val datumReader = new SpecificDatumReader[AdImpressionLog](AdImpressionLog.getClassSchema, + AdImpressionLog.getClassSchema, new SpecificData()) + @transient private var decoder: BinaryDecoder = _ + @transient private var result: AdImpressionLog = _ + + def deserialize(data: Array[Byte]): AdImpressionLog = { + decoder = DecoderFactory.get.binaryDecoder(data, decoder) + result = datumReader.read(result, decoder) + result + } +} \ No newline at end of file diff --git a/src/main/scala/io/snappydata/adanalytics/AdImpressionLogAVROSerializer.scala b/src/main/scala/io/snappydata/adanalytics/AdImpressionLogAVROSerializer.scala new file mode 100644 index 0000000..3310718 --- /dev/null +++ b/src/main/scala/io/snappydata/adanalytics/AdImpressionLogAVROSerializer.scala @@ -0,0 +1,33 @@ +/* +* Copyright © 2019. TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. 
+*/ +package io.snappydata.adanalytics + +import java.io.ByteArrayOutputStream +import java.util + +import io.snappydata.adanalytics.AdImpressionLogAVROSerializer.{binaryEncoder, datumWriter} +import org.apache.avro.generic.GenericDatumWriter +import org.apache.avro.io.{BinaryEncoder, EncoderFactory} + +class AdImpressionLogAVROSerializer extends org.apache.kafka.common.serialization.Serializer[AdImpressionLog] { + override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {} + + override def serialize(topic: String, data: AdImpressionLog): Array[Byte] = { + val byteArrayOutputStream = new ByteArrayOutputStream() + binaryEncoder.set(EncoderFactory.get.binaryEncoder(byteArrayOutputStream, binaryEncoder.get)) + datumWriter.write(data, binaryEncoder.get()) + binaryEncoder.get().flush() + byteArrayOutputStream.close() + byteArrayOutputStream.toByteArray + } + + override def close(): Unit = {} +} + +object AdImpressionLogAVROSerializer { + val binaryEncoder = new ThreadLocal[BinaryEncoder] + lazy val datumWriter = new GenericDatumWriter[AdImpressionLog](AdImpressionLog.getClassSchema) +} \ No newline at end of file diff --git a/src/main/scala/io/snappydata/adanalytics/Codec.scala b/src/main/scala/io/snappydata/adanalytics/Codec.scala deleted file mode 100644 index 97d3910..0000000 --- a/src/main/scala/io/snappydata/adanalytics/Codec.scala +++ /dev/null @@ -1,94 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. 
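For reference, a minimal round-trip sketch for the two AVRO codec classes added above, AdImpressionLogAVROSerializer and AdImpressionLogAVRODeserializer. This is illustrative only and not part of the change set; it assumes the project's generated AdImpressionLog Avro class and AdImpressionGenerator.nextRandomAdImpression(), and the object name AvroRoundTripCheck is invented purely for the example.

import io.snappydata.adanalytics.{AdImpressionGenerator, AdImpressionLog, AdImpressionLogAVRODeserializer, AdImpressionLogAVROSerializer}

object AvroRoundTripCheck extends App {
  // Serialize a random impression the same way the Kafka producer does ...
  val serializer = new AdImpressionLogAVROSerializer
  // ... and decode it the way the structured streaming job does.
  val deserializer = new AdImpressionLogAVRODeserializer

  val log: AdImpressionLog = AdImpressionGenerator.nextRandomAdImpression()
  // The topic argument is not used by this serializer, so any name works here.
  val bytes: Array[Byte] = serializer.serialize("adImpressionsTopic", log)
  val decoded: AdImpressionLog = deserializer.deserialize(bytes)

  // Avro string fields are CharSequence, so compare their string forms.
  assert(decoded.getCookie.toString == log.getCookie.toString)
  println(s"Round-tripped impression for publisher ${decoded.getPublisher}")
}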
- */ - -package io.snappydata.adanalytics - -import com.miguno.kafka.avro.{AvroDecoder, AvroEncoder} -import kafka.utils.VerifiableProperties -import org.apache.avro.io.DecoderFactory -import org.apache.avro.specific.SpecificDatumReader -import org.apache.spark.rdd.RDD -import org.apache.spark.sql.Row -import org.apache.spark.sql.streaming.{StreamConverter, StreamToRowsConverter} - -class AdImpressionLogAvroEncoder(props: VerifiableProperties = null) - extends AvroEncoder[AdImpressionLog](props, AdImpressionLog.getClassSchema) - -class AdImpressionLogAvroDecoder(props: VerifiableProperties = null) - extends AvroDecoder[AdImpressionLog](props, AdImpressionLog.getClassSchema) - -class AdImpressionToRowsConverter extends StreamToRowsConverter with Serializable { - - override def toRows(message: Any): Seq[Row] = { - val log = message.asInstanceOf[AdImpressionLog] - Seq(Row.fromSeq(Seq( - new java.sql.Timestamp(log.getTimestamp), - log.getPublisher.toString, - log.getAdvertiser.toString, - log.getWebsite.toString, - log.getGeo.toString, - log.getBid, - log.getCookie.toString))) - } -} - -/** - * Convertes Spark RDD[AdImpressionLog] to RDD[Row] - * to insert into table - */ -class AdImpressionLogToRowRDD extends Serializable { - - def convert(logRdd: RDD[AdImpressionLog]): RDD[Row] = { - logRdd.map(log => { - Row(log.getTimestamp, - log.getPublisher.toString, - log.getAdvertiser.toString, - log.getWebsite.toString, - log.getGeo.toString, - log.getBid, - log.getCookie.toString) - }) - } -} - - -class AvroSocketStreamConverter extends StreamConverter with Serializable { - override def convert(inputStream: java.io.InputStream): Iterator[AdImpressionLog] = { - val reader = new SpecificDatumReader[AdImpressionLog](AdImpressionLog.getClassSchema) - val decoder = DecoderFactory.get().directBinaryDecoder(inputStream, null) - new Iterator[AdImpressionLog] { - - val log: AdImpressionLog = new AdImpressionLog() - var nextVal = log - nextVal = reader.read(nextVal, decoder) - - override def hasNext: Boolean = nextVal != null - - override def next(): AdImpressionLog = { - val n = nextVal - if (n ne null) { - nextVal = reader.read(nextVal, decoder) - n - } else { - throw new NoSuchElementException() - } - } - } - } - - override def getTargetType: scala.Predef.Class[_] = classOf[AdImpressionLog] -} diff --git a/src/main/scala/io/snappydata/adanalytics/Configs.scala b/src/main/scala/io/snappydata/adanalytics/Configs.scala index a0a37be..a2ed950 100644 --- a/src/main/scala/io/snappydata/adanalytics/Configs.scala +++ b/src/main/scala/io/snappydata/adanalytics/Configs.scala @@ -1,37 +1,13 @@ /* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ +* Copyright © 2019. TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. 
+*/ package io.snappydata.adanalytics -import org.apache.spark.sql.types._ -import org.apache.spark.streaming.Seconds - object Configs { - val snappyMasterURL = "snappydata://localhost:10334" - - val sparkMasterURL = "spark://127.0.0.1:7077" - - val cassandraHost = "127.0.0.1" - - val snappyLocators = "localhost:10334" - - val maxRatePerPartition = 1000 - val kafkaTopic = "adImpressionsTopic" val brokerList = "localhost:9092" @@ -40,6 +16,10 @@ object Configs { "metadata.broker.list" -> brokerList ) + // Ideally checkpoint directory should be at some shared HDFS location accessible by all the nodes + val snappyLogAggregatorCheckpointDir = s"/tmp/snappyLogAggregator" + val sparkLogAggregatorCheckpointDir = s"/tmp/sparkLogAggregator" + val hostname = "localhost" val socketPort = 9000 @@ -56,13 +36,13 @@ object Configs { val UnknownGeo = "un" - val geos = Seq("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", + val geos: Seq[String] = Seq("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", UnknownGeo) - val numGeos = geos.size + val numGeos: Int = geos.size val numWebsites = 999 @@ -74,21 +54,5 @@ object Configs { val numLogsPerThread = 20000000 - val batchDuration = Seconds(1) - - val topics = Set(kafkaTopic) - val maxLogsPerSecPerThread = 5000 - - def getAdImpressionSchema: StructType = { - StructType(Array( - StructField("timestamp", TimestampType, true), - StructField("publisher", StringType, true), - StructField("advertiser", StringType, true), - StructField("website", StringType, true), - StructField("geo", StringType, true), - StructField("bid", DoubleType, true), - StructField("cookie", StringType, true))) - } - } diff --git a/src/main/scala/io/snappydata/adanalytics/KafkaAdImpressionProducer.scala b/src/main/scala/io/snappydata/adanalytics/KafkaAdImpressionProducer.scala index 066c7a2..10fd499 100644 --- a/src/main/scala/io/snappydata/adanalytics/KafkaAdImpressionProducer.scala +++ b/src/main/scala/io/snappydata/adanalytics/KafkaAdImpressionProducer.scala @@ -1,43 +1,34 @@ /* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ +* Copyright © 2019. TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. 
+*/ package io.snappydata.adanalytics import java.util.Properties +import java.util.concurrent.Future import io.snappydata.adanalytics.Configs._ -import kafka.producer.{KeyedMessage, Producer, ProducerConfig} import io.snappydata.adanalytics.KafkaAdImpressionProducer._ +import org.apache.hadoop.ipc.RetriableException +import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata} /** * A simple Kafka Producer program which randomly generates * ad impression log messages and sends it to Kafka broker. * This program generates and sends 10 million messages. + * + * Note that this producer sends messages on Kafka in async manner. */ -object KafkaAdImpressionProducer{ +object KafkaAdImpressionProducer { val props = new Properties() - props.put("serializer.class", "io.snappydata.adanalytics.AdImpressionLogAvroEncoder") - props.put("partitioner.class", "kafka.producer.DefaultPartitioner") - props.put("key.serializer.class", "kafka.serializer.StringEncoder") - props.put("metadata.broker.list", brokerList) + props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer") + props.put("value.serializer", "io.snappydata.adanalytics.AdImpressionLogAVROSerializer") + props.put("bootstrap.servers", brokerList) - val config = new ProducerConfig(props) - val producer = new Producer[String, AdImpressionLog](config) + val producer = new KafkaProducer[String, AdImpressionLog](props) def main(args: Array[String]) { println("Sending Kafka messages of topic " + kafkaTopic + " to brokers " + brokerList) @@ -52,9 +43,20 @@ object KafkaAdImpressionProducer{ System.exit(0) } - def sendToKafka(log: AdImpressionLog) = { - producer.send(new KeyedMessage[String, AdImpressionLog]( - Configs.kafkaTopic, log.getTimestamp.toString, log)) + def sendToKafka(log: AdImpressionLog): Future[RecordMetadata] = { + producer.send(new ProducerRecord[String, AdImpressionLog]( + Configs.kafkaTopic, log.getTimestamp.toString, log), new org.apache.kafka.clients.producer.Callback() { + override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = { + if (exception != null) { + if (exception.isInstanceOf[RetriableException]) { + println(s"Encountered a retriable exception while sending messages: $exception") + } else { + throw exception + } + } + } + } + ) } } @@ -71,7 +73,7 @@ final class Worker extends Runnable { if (timeRemaining > 0) { Thread.sleep(timeRemaining) } - if (j !=0 & (j % 200000) == 0) { + if (j != 0 & (j % 200000) == 0) { println(s" ${Thread.currentThread().getName} sent $j Kafka messages" + s" of topic $kafkaTopic to brokers $brokerList ") } diff --git a/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregator.scala b/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregator.scala deleted file mode 100644 index b33a161..0000000 --- a/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregator.scala +++ /dev/null @@ -1,90 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. 
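The reworked KafkaAdImpressionProducer above sends records asynchronously and surfaces failures through the producer callback. As an illustrative variant only (not part of this change set), a caller that needs per-record confirmation could block on the Future returned by send; this sketch assumes the producer, kafkaTopic and an AdImpressionLog instance named log from that file.

// Hypothetical synchronous send: wait for the broker acknowledgement of each record.
val metadata: RecordMetadata = producer
  .send(new ProducerRecord[String, AdImpressionLog](kafkaTopic, log.getTimestamp.toString, log))
  .get() // blocks until the record is acknowledged or the send fails with an exception
println(s"Wrote record to ${metadata.topic()}-${metadata.partition()} at offset ${metadata.offset()}")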
See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.adanalytics - -import io.snappydata.adanalytics.Configs._ -import kafka.serializer.StringDecoder -import org.apache.spark.sql.Row -import org.apache.spark.sql.streaming.SchemaDStream -import org.apache.spark.streaming.kafka.KafkaUtils -import org.apache.spark.streaming.{Duration, SnappyStreamingContext} -import org.apache.spark.{SparkConf, SparkContext} - -/** - * Example using Spark API + Snappy extension to model a Stream as a DataFrame. - * The Spark driver and executors run in local mode and simply use Snappy - * cluster as the data store. - */ -object SnappyAPILogAggregator extends App { - - val conf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") // local split - .set("snappydata.store.locators", s"$snappyLocators") - .set("spark.ui.port", "4041") - .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - .registerAvroSchemas(AdImpressionLog.getClassSchema) - - // add the "assembly" jar to executor classpath - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - conf.set("spark.driver.extraClassPath", assemblyJar) - conf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(conf) - val ssc = new SnappyStreamingContext(sc, batchDuration) - - // The volumes are low. Optimize Spark shuffle by reducing the partition count - ssc.sql("set spark.sql.shuffle.partitions=8") - - // stream of (topic, ImpressionLog) - val messages = KafkaUtils.createDirectStream - [String, AdImpressionLog, StringDecoder, AdImpressionLogAvroDecoder](ssc, kafkaParams, topics) - - // Filter out bad messages ...use a second window - val logs = messages.map(_._2).filter(_.getGeo != Configs.UnknownGeo) - .window(Duration(1000), Duration(1000)) - - // We want to process the stream as a DataFrame/Table ... easy to run - // analytics on stream ...will be standard part of Spark 2.0 (Structured - // streaming) - val rows = logs.map(v => Row(new java.sql.Timestamp(v.getTimestamp), v.getPublisher.toString, - v.getAdvertiser.toString, v.getWebsite.toString, v.getGeo.toString, v.getBid, v.getCookie.toString)) - - val logStreamAsTable : SchemaDStream = ssc.createSchemaDStream(rows, getAdImpressionSchema) - - import org.apache.spark.sql.functions._ - - /** - * We want to execute the following analytic query ... using the DataFrame - * API ... - * select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques - * from AdImpressionLog group by publisher, geo, timestamp" - */ - logStreamAsTable.foreachDataFrame(df => { - val df1 = df.groupBy("publisher", "geo", "timestamp") - .agg(avg("bid").alias("avg_bid"), count("geo").alias("imps"), - countDistinct("cookie").alias("uniques")) - df1.show() - }) - - // start rolling! - ssc.start - ssc.awaitTermination -} diff --git a/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregatorJob.scala b/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregatorJob.scala deleted file mode 100644 index 34bf0f5..0000000 --- a/src/main/scala/io/snappydata/adanalytics/SnappyAPILogAggregatorJob.scala +++ /dev/null @@ -1,85 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. 
You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.adanalytics - -import com.typesafe.config.Config -import io.snappydata.adanalytics.Configs._ -import kafka.serializer.StringDecoder -import org.apache.spark.sql.streaming.{SchemaDStream, SnappyStreamingJob} -import org.apache.spark.sql.{Row, SnappyJobValid, SnappyJobValidation} -import org.apache.spark.streaming.kafka.KafkaUtils -import org.apache.spark.streaming.{Seconds, SnappyStreamingContext} - -/** - * Same as SnappyAPILogAggregator except this streaming job runs in the data - * store cluster. By implementing a SnappyStreamingJob we allow this program - * to run managed in the snappy cluster. - * Here we use Snappy SQL extensions to process a stream as - * micro-batches of DataFrames instead of using the Spark Streaming API based - * on RDDs. This is similar to what we will see in Spark 2.0 (Structured - * streaming). - * - * Run this program using bin/snappy-job.sh - */ -class SnappyAPILogAggregatorJob extends SnappyStreamingJob { - - /** contains the implementation of the Job, Snappy uses this as - * an entry point to execute Snappy job - */ - override def runSnappyJob(snsc: SnappyStreamingContext, jobConfig: Config): Any = { - - // The volumes are low. Optimize Spark shuffle by reducing the partition count - snsc.sql("set spark.sql.shuffle.partitions=8") - - // stream of (topic, ImpressionLog) - val messages = KafkaUtils.createDirectStream - [String, AdImpressionLog, StringDecoder, AdImpressionLogAvroDecoder](snsc, kafkaParams, topics) - - // Filter out bad messages ...use a 1 second window - val logs = messages.map(_._2).filter(_.getGeo != Configs.UnknownGeo) - .window(Seconds(1), Seconds(1)) - - // Best to operate stream as a DataFrame/Table ... easy to run analytics on stream - val rows = logs.map(v => Row(new java.sql.Timestamp(v.getTimestamp), v.getPublisher.toString, - v.getAdvertiser.toString, v.getWebsite.toString, v.getGeo.toString, v.getBid, v.getCookie.toString)) - - val logStreamAsTable : SchemaDStream = snsc.createSchemaDStream(rows, getAdImpressionSchema) - - import org.apache.spark.sql.functions._ - - /** - * We want to execute the following analytic query ... using the DataFrame - * API ... - * select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques - * from AdImpressionLog group by publisher, geo, timestamp" - */ - logStreamAsTable.foreachDataFrame(df => { - val df1 = df.groupBy("publisher", "geo", "timestamp") - .agg(avg("bid").alias("avg_bid"), count("geo").alias("imps"), - countDistinct("cookie").alias("uniques")) - df1.show() - }) - - snsc.start() - snsc.awaitTermination() - } - - override def isValidJob(snsc: SnappyStreamingContext, config: Config): SnappyJobValidation = { - SnappyJobValid() - } -} \ No newline at end of file diff --git a/src/main/scala/io/snappydata/adanalytics/SnappyLogAggregator.scala b/src/main/scala/io/snappydata/adanalytics/SnappyLogAggregator.scala new file mode 100644 index 0000000..10f570a --- /dev/null +++ b/src/main/scala/io/snappydata/adanalytics/SnappyLogAggregator.scala @@ -0,0 +1,134 @@ +/* +* Copyright © 2019. 
TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. +*/ + +package io.snappydata.adanalytics + +import com.typesafe.config.{Config, ConfigFactory} +import io.snappydata.adanalytics.Configs._ +import org.apache.kafka.common.serialization.ByteArrayDeserializer +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.streaming.ProcessingTime +import org.apache.spark.sql.types._ +import org.apache.spark.{SparkConf, SparkContext} + +/** + * Example using Spark API + Snappy extension to model a Stream as a DataFrame. + * + * This example can be run either in local mode or can be submitted as a job + * to an already running SnappyData cluster. + * + * To run the job as a snappy-job, use the following command from the snappy product home: + * {{{ + * ./bin/snappy-job.sh submit --lead localhost:8090 --app-name AdAnalytics \ + * --class io.snappydata.adanalytics.SnappyLogAggregator --app-jar \ + * + * }}} + * + * To run the job as a smart connector application use the following command: + * {{{ + * ./bin/spark-submit --class io.snappydata.adanalytics.SnappyLogAggregator \ + * --conf spark.snappydata.connection=localhost:1527 --master \ + * + * }}} + * Note that in smart connector mode the application UI will be started on port 4041. + */ +object SnappyLogAggregator extends SnappySQLJob with App { + + val conf = new SparkConf() + .setAppName(getClass.getSimpleName) + .setMaster("local[*]") + .set("spark.ui.port", "4041") + .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") + .registerAvroSchemas(AdImpressionLog.getClassSchema) + + // add the "assembly" jar to executor classpath + val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") + if (assemblyJar != null) { + conf.set("spark.driver.extraClassPath", assemblyJar) + conf.set("spark.executor.extraClassPath", assemblyJar) + } + + val sc = new SparkContext(conf) + private val snappy = new SnappySession(sc) + + runSnappyJob(snappy, ConfigFactory.empty()) + + /** Contains the implementation of the Job; Snappy uses this as an entry point to execute + * the Snappy job + */ + override def runSnappyJob(snappy: SnappySession, jobConfig: Config): Any = { + + // The volumes are low. 
Optimize Spark shuffle by reducing the partition count + snappy.sql("set spark.sql.shuffle.partitions=8") + + snappy.sql("drop table if exists aggrAdImpressions") + + snappy.sql("create table aggrAdImpressions(time_stamp timestamp, publisher string," + + " geo string, avg_bid double, imps long, uniques long) " + + "using column options(buckets '11')") + + val schema = StructType(Seq(StructField("timestamp", TimestampType), + StructField("publisher", StringType), StructField("advertiser", StringType), + StructField("website", StringType), StructField("geo", StringType), + StructField("bid", DoubleType), StructField("cookie", StringType))) + + import snappy.implicits._ + val df = snappy.readStream + .format("kafka") + .option("kafka.bootstrap.servers", brokerList) + .option("value.deserializer", classOf[ByteArrayDeserializer].getName) + .option("startingOffsets", "earliest") + .option("subscribe", kafkaTopic) + .load() + // projecting only the value column of the Kafka data and using it as Array[Byte] + .select("value").as[Array[Byte]](Encoders.BINARY) + .mapPartitions(itr => { + // Reuse deserializer for each partition which will internally reuse decoder and data object + val deserializer = new AdImpressionLogAVRODeserializer + itr.map(data => { + // deserializing AVRO binary data and formulating Row out of it + val adImpressionLog = deserializer.deserialize(data) + Row(new java.sql.Timestamp(adImpressionLog.getTimestamp), adImpressionLog.getPublisher + .toString, adImpressionLog.getAdvertiser.toString, adImpressionLog.getWebsite.toString, + adImpressionLog.getGeo.toString, adImpressionLog.getBid, + adImpressionLog.getCookie.toString) + }) + })(RowEncoder.apply(schema)) + // filtering invalid records + .filter(s"geo != '${Configs.UnknownGeo}'") + + // Aggregating records over a 1 second window, grouped by publisher and geo + val windowedDF = df.withColumn("eventTime", $"timestamp".cast("timestamp")) + .withWatermark("eventTime", "0 seconds") + .groupBy(window($"eventTime", "1 seconds", "1 seconds"), $"publisher", $"geo") + .agg(unix_timestamp(min("timestamp"), "MM-dd-yyyy HH:mm:ss").alias("timestamp"), + avg("bid").alias("avg_bid"), count("geo").alias("imps"), + approx_count_distinct("cookie").alias("uniques")) + .select("timestamp", "publisher", "geo", "avg_bid", "imps", "uniques") + + val logStream = windowedDF + .writeStream + .format("snappysink") // using snappysink as output sink + .queryName("log_aggregator") // name of the streaming query + .trigger(ProcessingTime("1 seconds")) // trigger the batch processing every second + .option("tableName", "aggrAdImpressions") // target table name where data will be ingested + // checkpoint location where the streaming query progress and intermediate aggregation state + // is stored. Ideally it should be on an HDFS location. + .option("checkpointLocation", snappyLogAggregatorCheckpointDir) + // Only the rows that were updated since the last trigger will be written to the sink. 
+ // More details about output mode: https://spark.apache.org/docs/2.1.1/structured-streaming-programming-guide.html#output-modes + .outputMode("update") + .start + + logStream.awaitTermination() + } + + override def isValidJob(snappy: SnappySession, config: Config): SnappyJobValidation = { + SnappyJobValid() + } +} \ No newline at end of file diff --git a/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregator.scala b/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregator.scala deleted file mode 100644 index 43b5ce6..0000000 --- a/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregator.scala +++ /dev/null @@ -1,116 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.adanalytics - -import io.snappydata.adanalytics.Configs._ -import org.apache.spark.SparkContext -import org.apache.spark.sql.streaming.SchemaDStream -import org.apache.spark.streaming.SnappyStreamingContext - -/** - * We use Snappy SQL extensions to process a stream as - * micro-batches of DataFrames instead of using the Spark Streaming API based - * on RDDs. This is similar to what we will see in Spark 2.0 (Structured - * streaming). - * - * Not only does the use of SQL permit optimizations in the spark engine but - * we make the Stream visible as a Table to external clients. For instance, - * you can connect using JDBC and run a query on the stream table. - * - * This program will run in a standalong JVM and connect to the Snappy - * cluster as the data store. - */ -object SnappySQLLogAggregator extends App { - - val sparkConf = new org.apache.spark.SparkConf() - .setAppName(getClass.getSimpleName) - .set("spark.sql.inMemoryColumnarStorage.compressed", "false") - .set("spark.sql.inMemoryColumnarStorage.batchSize", "2000") - .setMaster(s"$sparkMasterURL") - .set("snappydata.store.locators", s"$snappyLocators") - .set("spark.ui.port", "4041") - .set("spark.streaming.kafka.maxRatePerPartition", s"$maxRatePerPartition") - - // add the "assembly" jar to executor classpath - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - sparkConf.set("spark.driver.extraClassPath", assemblyJar) - sparkConf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(sparkConf) - val snsc = new SnappyStreamingContext(sc, batchDuration) - - //Spark tip : Keep shuffle count low when data volume is low. - snsc.sql("set spark.sql.shuffle.partitions=8") - - snsc.sql("drop table if exists aggrAdImpressions") - snsc.sql("drop table if exists sampledAdImpressions") - snsc.sql("drop table if exists adImpressionStream") - - /** - * Create a stream over the Kafka source. The messages are converted to Row - * objects and comply with the schema defined in the 'create' below. - * This is mostly just a SQL veneer over Spark Streaming. 
The stream table - * is also automatically registered with the SnappyData catalog so external - * clients can see this stream as a table - */ - snsc.sql("create stream table adImpressionStream (" + - " time_stamp timestamp," + - " publisher string," + - " advertiser string," + - " website string," + - " geo string," + - " bid double," + - " cookie string) " + - " using directkafka_stream options(" + - " rowConverter 'io.snappydata.adanalytics.AdImpressionToRowsConverter' ," + - s" kafkaParams 'metadata.broker.list->$brokerList;auto.offset.reset->smallest'," + - s" topics '$kafkaTopic'," + - " K 'java.lang.String'," + - " V 'io.snappydata.adanalytics.AdImpressionLog', " + - " KD 'kafka.serializer.StringDecoder', " + - " VD 'io.snappydata.adanalytics.AdImpressionLogAvroDecoder')") - - // Next, create the Column table where we ingest all our data into. - snsc.sql("create table aggrAdImpressions(time_stamp timestamp, publisher string," + - " geo string, avg_bid double, imps long, uniques long) " + - "using column options(buckets '11')") - // You can make these tables persistent, add partitioned keys, replicate - // for HA, overflow to HDFS, etc, etc. ... Read the docs. - - snsc.sql("CREATE SAMPLE TABLE sampledAdImpressions" + - " OPTIONS(qcs 'geo', fraction '0.03', strataReservoirSize '50', baseTable 'aggrAdImpressions')") - - // Execute this query once every second. Output is a SchemaDStream. - val resultStream : SchemaDStream = snsc.registerCQ( - "select time_stamp, publisher, geo, avg(bid) as avg_bid," + - " count(*) as imps , count(distinct(cookie)) as uniques" + - " from adImpressionStream window (duration 1 seconds, slide 1 seconds)"+ - " where geo != 'unknown' group by publisher, geo, time_stamp") - - resultStream.foreachDataFrame( df => { - df.write.insertInto("aggrAdImpressions") - }) - // Above we use the Spark Data Source API to write to our Column table. - // This will automatically localize the partitions in the data store. No - // Shuffling. - - snsc.start() - snsc.awaitTermination() -} diff --git a/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregatorJob.scala b/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregatorJob.scala deleted file mode 100644 index 53d3337..0000000 --- a/src/main/scala/io/snappydata/adanalytics/SnappySQLLogAggregatorJob.scala +++ /dev/null @@ -1,92 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.adanalytics - -import com.typesafe.config.Config -import io.snappydata.adanalytics.Configs._ -import org.apache.spark.sql.streaming.{SchemaDStream, SnappyStreamingJob} -import org.apache.spark.sql.{SnappyJobValid, SnappyJobValidation} -import org.apache.spark.streaming.{Seconds, SnappyStreamingContext} - -/** - * Same as SnappySQLogAggregator except this streaming job runs in the data - * store cluster. By implementing a SnappyStreamingJob we allow this program - * to run managed in the snappy cluster. 
- * Here we use Snappy SQL to process a stream as - * micro-batches of DataFrames instead of using the Spark Streaming API based - * on RDDs. This is similar to what we will see in Spark 2.0 (Structured - * streaming). - * - * Run this program using bin/snappy-job.sh - */ -class SnappySQLLogAggregatorJob extends SnappyStreamingJob { - - override def runSnappyJob(snsc: SnappyStreamingContext, jobConfig: Config): Any = { - - //Spark tip : Keep shuffle count low when data volume is low. - snsc.sql("set spark.sql.shuffle.partitions=8") - - snsc.sql("drop table if exists adImpressionStream") - snsc.sql("drop table if exists sampledAdImpressions") - snsc.sql("drop table if exists aggrAdImpressions") - - snsc.sql("create stream table adImpressionStream (" + - " time_stamp timestamp," + - " publisher string," + - " advertiser string," + - " website string," + - " geo string," + - " bid double," + - " cookie string) " + - " using directkafka_stream options(" + - " rowConverter 'io.snappydata.adanalytics.AdImpressionToRowsConverter' ," + - s" kafkaParams 'metadata.broker.list->$brokerList;auto.offset.reset->smallest'," + - s" topics '$kafkaTopic'," + - " K 'java.lang.String'," + - " V 'io.snappydata.adanalytics.AdImpressionLog', " + - " KD 'kafka.serializer.StringDecoder', " + - " VD 'io.snappydata.adanalytics.AdImpressionLogAvroDecoder')") - - // Next, create the Column table where we ingest all our data into. - snsc.sql("create table aggrAdImpressions(time_stamp timestamp, publisher string," + - " geo string, avg_bid double, imps long, uniques long) " + - "using column options(buckets '11')") - - snsc.sql("CREATE SAMPLE TABLE sampledAdImpressions" + - " OPTIONS(qcs 'geo,publisher', fraction '0.03', strataReservoirSize '50', baseTable 'aggrAdImpressions')") - - // Execute this query once every second. Output is a SchemaDStream. - val resultStream: SchemaDStream = snsc.registerCQ( - "select min(time_stamp), publisher, geo, avg(bid) as avg_bid," + - " count(*) as imps , count(distinct(cookie)) as uniques" + - " from adImpressionStream window (duration 1 seconds, slide 1 seconds)" + - " where geo != 'unknown' group by publisher, geo") - - resultStream.foreachDataFrame(df => { - df.write.insertInto("aggrAdImpressions") - df.write.insertInto("sampledAdImpressions") - }) - - snsc.start() - snsc.awaitTermination() - } - - override def isValidJob(snsc: SnappyStreamingContext, config: Config): SnappyJobValidation = { - SnappyJobValid() - } -} diff --git a/src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala b/src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala index 7e7b4d0..23157ae 100644 --- a/src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala +++ b/src/main/scala/io/snappydata/adanalytics/SparkLogAggregator.scala @@ -1,72 +1,87 @@ +/* +* Copyright © 2019. TIBCO Software Inc. +* This file is subject to the license terms contained +* in the license file that is distributed with this file. 
+*/ + package io.snappydata.adanalytics -import com.twitter.algebird.{HLL, HyperLogLogMonoid} import io.snappydata.adanalytics.Configs._ -import kafka.serializer.StringDecoder +import org.apache.commons.lang.StringEscapeUtils +import org.apache.kafka.common.serialization.ByteArrayDeserializer import org.apache.spark.SparkConf -import org.apache.spark.streaming.kafka.KafkaUtils +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.{Encoders, Row, SparkSession} import org.apache.spark.streaming.{Seconds, StreamingContext} -import org.joda.time.DateTime /** - * Vanilla Spark implementation with no Snappy extensions being used. - * Code is from https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/ - * This implementation uses a HyperLogLog to find uniques. We skip this - * probabilistic structure in our implementation as we can easily extract the - * exact distinct count for such small time windows. - **/ + * Vanilla Spark implementation with no Snappy extensions being used. The aggregated data is + * written to a Kafka topic. + * + * The following command should be used to submit this job: + * + * {{{ + * ./bin/spark-submit --class io.snappydata.adanalytics.SparkLogAggregator \ + * --master + * }}} + */ object SparkLogAggregator extends App { val sc = new SparkConf() .setAppName(getClass.getName) - .setMaster("local[*]") + .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") val ssc = new StreamingContext(sc, Seconds(1)) + val schema = StructType(Seq(StructField("timestamp", TimestampType), StructField("publisher", StringType), + StructField("advertiser", StringType), StructField("website", StringType), StructField("geo", StringType), + StructField("bid", DoubleType), StructField("cookie", StringType))) - // stream of (topic, ImpressionLog) - val messages = KafkaUtils.createDirectStream - [String, AdImpressionLog, StringDecoder, AdImpressionLogAvroDecoder](ssc, kafkaParams, topics) + private val spark = SparkSession.builder().getOrCreate() - // to count uniques - lazy val hyperLogLog = new HyperLogLogMonoid(12) + import spark.implicits._ - // we filter out non resolved geo (unknown) and map (pub, geo) -> AggLog that will be reduced - val logsByPubGeo = messages.map(_._2).filter(_.getGeo != Configs.UnknownGeo).map { - log => - val key = PublisherGeoKey(log.getPublisher.toString, log.getGeo.toString) - val agg = AggregationLog( - timestamp = log.getTimestamp, - sumBids = log.getBid, - imps = 1, - uniquesHll = hyperLogLog(log.getCookie.toString.getBytes()) - ) - (key, agg) - } + val df = spark.readStream + .format("kafka") + .option("kafka.bootstrap.servers", brokerList) + .option("value.deserializer", classOf[ByteArrayDeserializer].getName) + .option("startingOffsets", "earliest") + .option("subscribe", kafkaTopic) + .load().select("value").as[Array[Byte]](Encoders.BINARY) + .mapPartitions(itr => { + val deserializer = new AdImpressionLogAVRODeserializer + itr.map(data => { + val adImpressionLog = deserializer.deserialize(data) + Row(new java.sql.Timestamp(adImpressionLog.getTimestamp), adImpressionLog.getPublisher.toString, + adImpressionLog.getAdvertiser.toString, adImpressionLog.getWebsite.toString, + adImpressionLog.getGeo.toString, adImpressionLog.getBid, adImpressionLog.getCookie.toString) + }) + })(RowEncoder.apply(schema)) + .filter(s"geo != '${Configs.UnknownGeo}'") // filtering invalid data - 
// Reduce to generate imps, uniques, sumBid per pub and geo per 2 seconds - val aggLogs = logsByPubGeo.reduceByKeyAndWindow(reduceAggregationLogs, Seconds(2)) + // Group by on a sliding window of 1 second + val windowedDF = df.withColumn("eventTime", $"timestamp".cast("timestamp")) + .withWatermark("eventTime", "0 seconds") + .groupBy(window($"eventTime", "1 seconds", "1 seconds"), $"publisher", $"geo") + .agg(min("timestamp").alias("timestamp"), avg("bid").alias("avg_bid"), count("geo").alias + ("imps"), approx_count_distinct("cookie").alias("uniques")) + .select("timestamp", "publisher", "geo", "avg_bid", "imps", "uniques") - aggLogs.foreachRDD(rdd => { - rdd.foreach(f => { - println("AggregationLog {timestamp=" + f._2.timestamp + " sumBids=" + f._2.sumBids + " imps=" + f._2.imps + "}") - }) - }) + // the content of the 'value' column will be written to the Kafka topic as the message value + private val targetSchema = StructType(Seq(StructField("value", StringType))) + implicit val encoder: ExpressionEncoder[Row] = RowEncoder(targetSchema) - // start rolling! - ssc.start - ssc.awaitTermination + // writing aggregated records to a Kafka topic in CSV format + val logStream = windowedDF + .map(r => Row(r.toSeq.map(f => StringEscapeUtils.escapeCsv(f.toString)).mkString(","))) + .writeStream + .queryName("spark_log_aggregator") + .option("checkpointLocation", sparkLogAggregatorCheckpointDir) + .outputMode("update") + .format("kafka") + .option("kafka.bootstrap.servers", brokerList) + .option("topic", "adImpressionsOut") + .start() - private def reduceAggregationLogs(aggLog1: AggregationLog, aggLog2: AggregationLog) = { - aggLog1.copy( - timestamp = math.min(aggLog1.timestamp, aggLog2.timestamp), - sumBids = aggLog1.sumBids + aggLog2.sumBids, - imps = aggLog1.imps + aggLog2.imps, - uniquesHll = aggLog1.uniquesHll + aggLog2.uniquesHll - ) - } + logStream.awaitTermination() } - -case class AggregationLog(timestamp: Long, sumBids: Double, imps: Int = 1, uniquesHll: HLL) - -case class AggregationResult(date: DateTime, publisher: String, geo: String, imps: Int, uniques: Int, avgBids: Double) - -case class PublisherGeoKey(publisher: String, geo: String) diff --git a/src/main/scala/io/snappydata/benchmark/CSVSnappyIngestionPerf.scala b/src/main/scala/io/snappydata/benchmark/CSVSnappyIngestionPerf.scala deleted file mode 100644 index bcd377f..0000000 --- a/src/main/scala/io/snappydata/benchmark/CSVSnappyIngestionPerf.scala +++ /dev/null @@ -1,99 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. 
- */ -package io.snappydata.benchmark - -import java.io.FileReader - -import com.opencsv.CSVReader -import io.snappydata.adanalytics.Configs -import io.snappydata.adanalytics.AdImpressionLog -import Configs._ -import org.apache.spark.rdd.RDD -import org.apache.spark.sql.Row -import org.apache.spark.streaming.SnappyStreamingContext -import org.apache.spark.{SparkConf, SparkContext} - -import scala.collection.mutable.Queue -import scala.concurrent.ExecutionContext.Implicits.global -import scala.concurrent.Future -import scala.util.{Failure, Success} - -object CSVSnappyIngestionPerf extends App { - - val sparkConf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") - .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - sparkConf.set("spark.driver.extraClassPath", assemblyJar) - sparkConf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(sparkConf) - - val snsc = new SnappyStreamingContext(sc, batchDuration) - - snsc.snappyContext.dropTable("adImpressions", ifExists = true) - - val rddQueue = Queue[RDD[AdImpressionLog]]() - - val logStream = snsc.queueStream(rddQueue) - - val rows = logStream.map(v => Row(new java.sql.Timestamp(v.getTimestamp), v.getPublisher.toString, - v.getAdvertiser.toString, v.getWebsite.toString, v.getGeo.toString, v.getBid, v.getCookie.toString)) - - val logStreamAsTable = snsc.createSchemaDStream(rows, getAdImpressionSchema) - - snsc.snappyContext.createTable("adImpressions", "column", getAdImpressionSchema, - Map("buckets" -> "29")) - - logStreamAsTable.foreachDataFrame(_.write.insertInto("adImpressions")) - - val csvReader = Future { - import collection.JavaConverters._ - - val csvFile = new CSVReader(new FileReader("adimpressions.csv")) - csvFile.iterator.asScala - .map { fields => { - val log = new AdImpressionLog() - log.setTimestamp(fields(0).toLong) - log.setPublisher(fields(1)) - log.setAdvertiser(fields(2)) - log.setWebsite(fields(3)) - log.setGeo(fields(4)) - log.setBid(fields(5).toDouble) - log.setCookie(fields(6)) - log - } - }.grouped(100000).foreach { logs => - val logRDD = sc.parallelize(logs, 8) - rddQueue += logRDD - } - } - - csvReader.onComplete { - case Success(value) => - case Failure(e) => e.printStackTrace - } - - snsc.start() - snsc.awaitTermination() - -} diff --git a/src/main/scala/io/snappydata/benchmark/CustomReceiverSnappyIngestionPerf.scala b/src/main/scala/io/snappydata/benchmark/CustomReceiverSnappyIngestionPerf.scala deleted file mode 100644 index 17dc435..0000000 --- a/src/main/scala/io/snappydata/benchmark/CustomReceiverSnappyIngestionPerf.scala +++ /dev/null @@ -1,82 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. 
- */ - -package io.snappydata.benchmark - -import io.snappydata.adanalytics.{Configs, AdImpressionGenerator} -import Configs._ -import io.snappydata.adanalytics.AdImpressionLog -import org.apache.spark.sql.Row -import org.apache.spark.storage.StorageLevel -import org.apache.spark.streaming.SnappyStreamingContext -import org.apache.spark.streaming.receiver.Receiver -import org.apache.spark.{SparkConf, SparkContext} - -object CustomReceiverSnappyIngestionPerf extends App { - - val sparkConf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") - .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - sparkConf.set("spark.driver.extraClassPath", assemblyJar) - sparkConf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(sparkConf) - val snsc = new SnappyStreamingContext(sc, batchDuration) - - snsc.snappyContext.dropTable("adImpressions", ifExists = true) - - val stream = snsc.receiverStream[AdImpressionLog](new AdImpressionReceiver) - - val rows = stream.map(v => Row(new java.sql.Timestamp(v.getTimestamp), v.getPublisher.toString, - v.getAdvertiser.toString, v.getWebsite.toString, v.getGeo.toString, v.getBid, v.getCookie.toString)) - - val logStreamAsTable = snsc.createSchemaDStream(rows, getAdImpressionSchema) - - snsc.snappyContext.createTable("adImpressions", "column", getAdImpressionSchema, - Map("buckets" -> "29")) - - logStreamAsTable.foreachDataFrame(_.write.insertInto("adImpressions")) - - snsc.start() - snsc.awaitTermination() - -} - - -final class AdImpressionReceiver extends Receiver[AdImpressionLog](StorageLevel.MEMORY_AND_DISK_2) { - override def onStart() { - new Thread("AdImpressionReceiver") { - override def run() { - receive() - } - }.start() - } - - override def onStop() { - } - - private def receive() { - while (!isStopped()) { - store(AdImpressionGenerator.nextRandomAdImpression()) - } - } -} \ No newline at end of file diff --git a/src/main/scala/io/snappydata/benchmark/KafkaAdImpressionAsyncProducer.scala b/src/main/scala/io/snappydata/benchmark/KafkaAdImpressionAsyncProducer.scala deleted file mode 100644 index e5f1571..0000000 --- a/src/main/scala/io/snappydata/benchmark/KafkaAdImpressionAsyncProducer.scala +++ /dev/null @@ -1,85 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.benchmark - -import java.util.Properties - -import io.snappydata.adanalytics.Configs._ -import io.snappydata.adanalytics.KafkaAdImpressionProducer._ -import io.snappydata.adanalytics.{AdImpressionGenerator, AdImpressionLog, Configs} -import kafka.producer.{KeyedMessage, Producer, ProducerConfig} - -/** - * A simple Kafka Producer program which randomly generates - * ad impression log messages and sends it to Kafka broker. - * This program generates and sends 10 million messages. 
- */ -object KafkaAdImpressionAsyncProducer{ - - val props = new Properties() - props.put("producer.type", "async") - props.put("request.required.acks", "0") - props.put("serializer.class", "io.snappydata.adanalytics.AdImpressionLogAvroEncoder") - props.put("queue.buffering.max.messages", "1000000") // 10000 - props.put("metadata.broker.list", brokerList) - props.put("partitioner.class", "kafka.producer.DefaultPartitioner") - props.put("key.serializer.class", "kafka.serializer.StringEncoder") - props.put("batch.size", "9000000") // bytes - props.put("linger.ms", "50") - - val config = new ProducerConfig(props) - val producer = new Producer[String, AdImpressionLog](config) - - def main(args: Array[String]) { - println("Sending Kafka messages of topic " + kafkaTopic + " to brokers " + brokerList) - val threads = new Array[Thread](numProducerThreads) - for (i <- 0 until numProducerThreads) { - val thread = new Thread(new Worker()) - thread.start() - threads(i) = thread - } - threads.foreach(_.join()) - println(s"Done sending $numLogsPerThread Kafka messages of topic $kafkaTopic") - System.exit(0) - } - - def sendToKafka(log: AdImpressionLog) = { - producer.send(new KeyedMessage[String, AdImpressionLog]( - Configs.kafkaTopic, log.getTimestamp.toString, log)) - } -} - -final class Worker extends Runnable { - def run() { - for (j <- 0 to numLogsPerThread by maxLogsPerSecPerThread) { - val start = System.currentTimeMillis() - for (i <- 0 to maxLogsPerSecPerThread) { - sendToKafka(AdImpressionGenerator.nextRandomAdImpression()) - } - // If one second hasn't elapsed wait for the remaining time - // before queueing more. - val timeRemaining = 1000 - (System.currentTimeMillis() - start) - if (timeRemaining > 0) { - Thread.sleep(timeRemaining) - } - if (j !=0 & (j % 200000) == 0) { - println(s"Sent $j Kafka messages of topic $kafkaTopic") - } - } - } -} diff --git a/src/main/scala/io/snappydata/benchmark/KafkaSnappyIngestionPerf.scala b/src/main/scala/io/snappydata/benchmark/KafkaSnappyIngestionPerf.scala deleted file mode 100644 index 44d8c95..0000000 --- a/src/main/scala/io/snappydata/benchmark/KafkaSnappyIngestionPerf.scala +++ /dev/null @@ -1,83 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.benchmark - -import io.snappydata.adanalytics.Configs._ -import org.apache.spark.SparkContext -import org.apache.spark.streaming.SnappyStreamingContext - -/** - * Simple direct kafka spark streaming program which pulls log messages - * from kafka broker and ingest those log messages to Snappy store. 
- */ -object KafkaSnappyIngestionPerf extends App { - - val sparkConf = new org.apache.spark.SparkConf() - .setAppName(getClass.getSimpleName) - .set("spark.sql.inMemoryColumnarStorage.compressed", "false") - .set("spark.sql.inMemoryColumnarStorage.batchSize", "2000") - .set("spark.streaming.kafka.maxRatePerPartition" , s"$maxRatePerPartition") - //.setMaster("local[*]") - .setMaster(s"$snappyMasterURL") - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - sparkConf.set("spark.driver.extraClassPath", assemblyJar) - sparkConf.set("spark.executor.extraClassPath", assemblyJar) - } - - sparkConf.set("spark.driver.extraJavaOptions", "-Dgemfire.tombstone-gc-threshold=5000") - sparkConf.set("spark.executor.extraJavaOptions", "-Dgemfire.tombstone-gc-threshold=5000") - - val sc = new SparkContext(sparkConf) - val snsc = new SnappyStreamingContext(sc, batchDuration) - - snsc.sql("drop table if exists adImpressions") - snsc.sql("drop table if exists adImpressionStream") - - // Create a stream of AdImpressionLog which will pull the log messages - // from Kafka broker - snsc.sql("create stream table adImpressionStream (" + - " time_stamp timestamp," + - " publisher string," + - " advertiser string," + - " website string," + - " geo string," + - " bid double," + - " cookie string) " + - " using directkafka_stream options (" + - " rowConverter 'io.snappydata.adanalytics.AdImpressionToRowsConverter' ," + - s" kafkaParams 'metadata.broker.list->$brokerList'," + - s" topics '$kafkaTopic'," + - " K 'java.lang.String'," + - " V 'io.snappydata.adanalytics.AdImpressionLog', " + - " KD 'kafka.serializer.StringDecoder', " + - " VD 'io.snappydata.adanalytics.AdImpressionLogAvroDecoder')") - - snsc.sql("create table adImpressions(times_tamp timestamp, publisher string, " + - "advertiser string, website string, geo string, bid double, cookie string) " + - "using column " + - "options ( buckets '29', persistent 'asynchronous')") - - // Save the streaming data to snappy store per second (btachDuration) - snsc.getSchemaDStream("adImpressionStream") - .foreachDataFrame(_.write.insertInto("adImpressions")) - - snsc.start - snsc.awaitTermination -} diff --git a/src/main/scala/io/snappydata/benchmark/SnappyQueryPerfJob.scala b/src/main/scala/io/snappydata/benchmark/SnappyQueryPerfJob.scala deleted file mode 100644 index d0bd9f5..0000000 --- a/src/main/scala/io/snappydata/benchmark/SnappyQueryPerfJob.scala +++ /dev/null @@ -1,38 +0,0 @@ -package io.snappydata.benchmark - -import java.io.PrintWriter - -import com.typesafe.config.Config -import org.apache.spark.sql._ - -class SnappyQueryPerfJob extends SnappySQLJob { - - override def runSnappyJob(sc: SnappySession, jobConfig: Config): Any = { - val outFileName = s"QueryPerf-${System.currentTimeMillis()}.out" - val pw = new PrintWriter(outFileName) - var start = System.currentTimeMillis() - sc.sql("select count(*) AS adCount, geo from adImpressions group by geo order by adCount desc limit 20").collect() - pw.println("Time for Q1 " + (System.currentTimeMillis() - start)) - pw.flush() - - start = System.currentTimeMillis() - sc.sql("select sum (bid) as max_bid, geo from adImpressions group by geo order by max_bid desc limit 20").collect() - pw.println("Time for Q2 " + (System.currentTimeMillis() - start)) - pw.flush() - - start = System.currentTimeMillis() - sc.sql("select sum (bid) as max_bid, publisher from adImpressions group by publisher order by max_bid desc limit 20").collect() - pw.println("Time for Q3 " + (System.currentTimeMillis() 
- start)) - pw.flush() - - start = System.currentTimeMillis() - val array = sc.sql("select count(*) from adImpressions").collect() - pw.println(array(0) +"Time for count(*) " + (System.currentTimeMillis() - start)) - pw.flush() - pw.close() - } - - override def isValidJob(sc: SnappySession, config: Config): SnappyJobValidation = { - SnappyJobValid() - } -} diff --git a/src/main/scala/io/snappydata/benchmark/SnappySampleQueryPerfJob.scala b/src/main/scala/io/snappydata/benchmark/SnappySampleQueryPerfJob.scala deleted file mode 100644 index ef48c80..0000000 --- a/src/main/scala/io/snappydata/benchmark/SnappySampleQueryPerfJob.scala +++ /dev/null @@ -1,43 +0,0 @@ -package io.snappydata.benchmark - -import java.io.PrintWriter - -import com.typesafe.config.Config -import org.apache.spark.sql._ - -class SnappySampleQueryPerfJob extends SnappySQLJob { - - override def runSnappyJob(sc: SnappySession, jobConfig: Config): Any = { - val outFileName = s"SampleQueryPerf-${System.currentTimeMillis()}.out" - val pw = new PrintWriter(outFileName) - var start = System.currentTimeMillis() - sc.sql("select count(*) AS adCount, geo from adImpressions group by geo" + - " order by adCount desc limit 20 with error 0.1").collect() - pw.println("Time for Sample Q1 " + (System.currentTimeMillis() - start)) - pw.flush() - - start = System.currentTimeMillis() - sc.sql("select sum (bid) as max_bid, geo from adImpressions group by geo" + - " order by max_bid desc limit 20 with error 0.1").collect() - pw.println("Time for Sample Q2 " + (System.currentTimeMillis() - start)) - pw.flush() - - start = System.currentTimeMillis() - val array = sc.sql("select count(*) as sample_cnt from" + - " adImpressions with error 0.1").collect() - pw.println(array(0) +"Time for sample count(*) " + (System.currentTimeMillis() - start)) - pw.flush() - - start = System.currentTimeMillis() - sc.sql("select sum (bid) as max_bid, publisher from adImpressions group by" + - " publisher order by max_bid desc limit 20 with error 0.5").collect() - pw.println("Time for Sample Q3 " + (System.currentTimeMillis() - start)) - pw.flush() - - pw.close() - } - - override def isValidJob(sc: SnappySession, config: Config): SnappyJobValidation = { - SnappyJobValid() - } -} diff --git a/src/main/scala/io/snappydata/benchmark/SnappyStreamIngestPerfJob.scala b/src/main/scala/io/snappydata/benchmark/SnappyStreamIngestPerfJob.scala deleted file mode 100644 index cd6b07d..0000000 --- a/src/main/scala/io/snappydata/benchmark/SnappyStreamIngestPerfJob.scala +++ /dev/null @@ -1,55 +0,0 @@ -package io.snappydata.benchmark - -import com.typesafe.config.Config -import io.snappydata.adanalytics.Configs._ -import org.apache.spark.sql.streaming.SnappyStreamingJob -import org.apache.spark.sql.{SnappyJobValid, SnappyJobValidation} -import org.apache.spark.streaming.SnappyStreamingContext - -class SnappyStreamIngestPerfJob extends SnappyStreamingJob { - - override def runSnappyJob(snsc: SnappyStreamingContext, jobConfig: Config): Any = { - //snsc.sql("drop table if exists adImpressions") - snsc.sql("drop table if exists adImpressionStream") - - // Create a stream of AdImpressionLog which will pull the log messages - // from Kafka broker - snsc.sql("create stream table adImpressionStream (" + - " time_stamp timestamp," + - " publisher string," + - " advertiser string," + - " website string," + - " geo string," + - " bid double," + - " cookie string) " + - " using directkafka_stream options (" + - " rowConverter 'io.snappydata.adanalytics.AdImpressionToRowsConverter' ," + - s" 
kafkaParams 'metadata.broker.list->$brokerList'," + - s" topics '$kafkaTopic'," + - " K 'java.lang.String'," + - " V 'io.snappydata.adanalytics.AdImpressionLog', " + - " KD 'kafka.serializer.StringDecoder', " + - " VD 'io.snappydata.adanalytics.AdImpressionLogAvroDecoder')") - - snsc.sql("create table adImpressions(times_tamp timestamp, publisher string, " + - "advertiser string, website string, geo string, bid double, cookie string) " + - "using column " + - "options ( buckets '29')") - - snsc.sql("CREATE SAMPLE TABLE sampledAdImpressions" + - " OPTIONS(qcs 'geo,publisher', fraction '0.02', strataReservoirSize '50', baseTable 'adImpressions')") - - // Save the streaming data to snappy store per second (btachDuration) - snsc.getSchemaDStream("adImpressionStream").foreachDataFrame( df => { - df.write.insertInto("adImpressions") - df.write.insertInto("sampledAdImpressions") - }) - - snsc.start - snsc.awaitTermination - } - - override def isValidJob(snsc: SnappyStreamingContext, config: Config): SnappyJobValidation = { - SnappyJobValid() - } -} \ No newline at end of file diff --git a/src/main/scala/io/snappydata/benchmark/SocketAdImpressionGenerator.scala b/src/main/scala/io/snappydata/benchmark/SocketAdImpressionGenerator.scala deleted file mode 100644 index a1c658b..0000000 --- a/src/main/scala/io/snappydata/benchmark/SocketAdImpressionGenerator.scala +++ /dev/null @@ -1,164 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. 
- */ -package io.snappydata.benchmark - -import java.io.{ByteArrayOutputStream, IOException} -import java.net.ServerSocket - -import io.snappydata.adanalytics.{Configs, AdImpressionGenerator} -import io.snappydata.adanalytics.AdImpressionLog -import org.apache.avro.io.EncoderFactory -import org.apache.avro.specific.SpecificDatumWriter -import org.apache.spark.streaming.StreamUtils -import Configs._ - -/** - * A Simple program which writes Avro objects to socket stream - */ -object SocketAdImpressionGenerator extends App { - - val bytesPerSec = 800000000 - val blockSize = bytesPerSec / 10 - val bufferStream = new ByteArrayOutputStream(blockSize + 1000) - val encoder = EncoderFactory.get.directBinaryEncoder(bufferStream, null) - val writer = new SpecificDatumWriter[AdImpressionLog]( - AdImpressionLog.getClassSchema) - while (bufferStream.size < blockSize) { - writer.write(AdImpressionGenerator.nextRandomAdImpression, encoder) - } -// encoder.flush -// bufferStream.close - - val serverSocket = new ServerSocket(socketPort) - println("Listening on port " + socketPort) - - while (true) { - val socket = serverSocket.accept() - println("Got a new connection") - val out = StreamUtils.getRateLimitedOutputStream(socket.getOutputStream, bytesPerSec) - try { - while (true) { - out.write(bufferStream.toByteArray) - //out.flush - } - } catch { - case e: IOException => - println("Client disconnected") - socket.close() - } - } -} - -/* -final class Server(port: Int) extends Runnable { - def run(): Unit = { - launchServerSockets(port) - } - - private def launchServerSockets(port: Int): Unit = { - val serverSocket = new ServerSocket(port, 50, InetAddress.getLocalHost) - println("Listening on port " + port) - - while (true) { - val socket = serverSocket.accept() - println("Got a new connection on "+ port) - try { - while (true) { - val threads = new Array[Thread](1) - for (i <- 0 until 1) { - val thread = new Thread(new Ingester(socket)) - thread.start() - threads(i) = thread - } - threads.foreach(_.join()) - } - } catch { - case e: IOException => - println("Client disconnected from port " + port) - socket.close() - } - } - } -} - -final class Ingester(socket: Socket) extends Runnable { - def run() { - for (i <- 0 until 100000) { - val out = new ByteArrayOutputStream(630000) //1000 AdImpressions - val encoder = EncoderFactory.get.directBinaryEncoder(out, null) - val writer = new SpecificDatumWriter[AdImpressionLog]( - AdImpressionLog.getClassSchema) - while (out.size < 630000) { - writer.write(generateAdImpression, encoder) - } - encoder.flush - out.close - writeToSocket(socket, out.toByteArray) - } - } - - private def writeToSocket(socket: Socket, bytes: Array[Byte]) = synchronized { - socket.getOutputStream.write(bytes) - socket.getOutputStream.flush() - } - - private def generateAdImpression(): AdImpressionLog = { - val numPublishers = 50 - val numAdvertisers = 30 - val publishers = (0 to numPublishers).map("publisher" +) - val advertisers = (0 to numAdvertisers).map("advertiser" +) - val unknownGeo = "un" - - val geos = Seq("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", - "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", - "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", - "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", - "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", unknownGeo) - - val numWebsites = 999 - val numCookies = 999 - val websites = (0 to numWebsites).map("website" +) - val cookies = (0 to numCookies).map("cookie" +) - - val random = 
new java.util.Random() - val timestamp = System.currentTimeMillis() - val publisher = publishers(random.nextInt(numPublishers - 10 + 1) + 10) - val advertiser = advertisers(random.nextInt(numAdvertisers - 10 + 1) + 10) - val website = websites(random.nextInt(numWebsites - 100 + 1) + 100) - val cookie = cookies(random.nextInt(numCookies - 100 + 1) + 100) - val geo = geos(random.nextInt(geos.size)) - val bid = math.abs(random.nextDouble()) % 1 - - val log = new AdImpressionLog() -// log.setTimestamp(1L) -// log.setPublisher("publisher") -// log.setAdvertiser("advertiser") -// log.setWebsite("website") -// log.setGeo("geo") -// log.setBid(1D) -// log.setCookie("cookie") - log.setTimestamp(timestamp) - log.setPublisher(publisher) - log.setAdvertiser(advertiser) - log.setWebsite(website) - log.setGeo(geo) - log.setBid(bid) - log.setCookie(cookie) - log - } -} -*/ diff --git a/src/main/scala/io/snappydata/benchmark/SocketSnappyIngestionPerf.scala b/src/main/scala/io/snappydata/benchmark/SocketSnappyIngestionPerf.scala deleted file mode 100644 index 1fbcd89..0000000 --- a/src/main/scala/io/snappydata/benchmark/SocketSnappyIngestionPerf.scala +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Copyright (c) 2016 SnappyData, Inc. All rights reserved. - * - * Licensed under the Apache License, Version 2.0 (the "License"); you - * may not use this file except in compliance with the License. You - * may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - * implied. See the License for the specific language governing - * permissions and limitations under the License. See accompanying - * LICENSE file. - */ - -package io.snappydata.benchmark - -import io.snappydata.adanalytics.{Configs, AvroSocketStreamConverter} -import Configs._ -import io.snappydata.adanalytics.AdImpressionLog -import org.apache.spark.sql.Row -import org.apache.spark.storage.StorageLevel -import org.apache.spark.streaming.SnappyStreamingContext -import org.apache.spark.{SparkConf, SparkContext} - -/** - * Simple Snappy streaming program which pulls log messages - * from socket and ingest those log messages to Snappy store. 
- */ -object SocketSnappyIngestionPerf extends App { - - val sparkConf = new SparkConf() - .setAppName(getClass.getSimpleName) - .setMaster(s"$sparkMasterURL") - //.setMaster("snappydata://localhost:10334") - .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - .set("spark.executor.extraJavaOptions", - " -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300 ") - .set("spark.streaming.blockInterval", "50") - - val assemblyJar = System.getenv("PROJECT_ASSEMBLY_JAR") - if (assemblyJar != null) { - sparkConf.set("spark.driver.extraClassPath", assemblyJar) - sparkConf.set("spark.executor.extraClassPath", assemblyJar) - } - - val sc = new SparkContext(sparkConf) - - val snsc = new SnappyStreamingContext(sc, batchDuration) - - snsc.snappyContext.dropTable("adImpressions", ifExists = true) - - val converter = new AvroSocketStreamConverter - - val logStream = snsc.socketStream[AdImpressionLog](hostname, socketPort, converter.convert, StorageLevel.MEMORY_ONLY) - - val rows = logStream.map(v => Row(new java.sql.Timestamp(v.getTimestamp), v.getPublisher.toString, - v.getAdvertiser.toString, v.getWebsite.toString, v.getGeo.toString, v.getBid, v.getCookie.toString)) - - val logStreamAsTable = snsc.createSchemaDStream(rows, getAdImpressionSchema) - - snsc.snappyContext.createTable("adImpressions", "column", getAdImpressionSchema, - Map("buckets" -> "29")) - - logStreamAsTable.foreachDataFrame(_.write.insertInto("adImpressions")) - - snsc.start() - snsc.awaitTermination() -}
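
The SnappyLogAggregator job introduced above lands its windowed aggregates in the aggrAdImpressions column table through the snappysink writer. Below is a minimal sketch, not part of this change set, of how those aggregates could be inspected with an ad-hoc SnappySession query; the connection property and locator address (localhost:1527) are assumptions carried over from the spark-submit example in the scaladoc, while the table and column names come from the job itself.

package io.snappydata.adanalytics

import org.apache.spark.sql.{SnappySession, SparkSession}

// Illustrative only: ad-hoc query against the aggrAdImpressions table populated by
// SnappyLogAggregator. The connection URL below is an assumption (local cluster, default port).
object AggrAdImpressionsQuery extends App {
  val spark = SparkSession.builder()
    .appName("AggrAdImpressionsQuery")
    .master("local[*]")
    .config("spark.snappydata.connection", "localhost:1527") // assumed locator host:port
    .getOrCreate()
  val snappy = new SnappySession(spark.sparkContext)

  // Top publisher/geo combinations by total impressions, using the columns created by the job:
  // time_stamp, publisher, geo, avg_bid, imps, uniques
  snappy.sql(
    """select publisher, geo, sum(imps) as total_imps, avg(avg_bid) as avg_bid
      |from aggrAdImpressions
      |group by publisher, geo
      |order by total_imps desc
      |limit 10""".stripMargin).show()
}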
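
SparkLogAggregator, by contrast, now publishes its per-second aggregates as CSV strings to the adImpressionsOut Kafka topic. The sketch below shows one way those records could be read back for a sanity check; the broker address is an assumption (the job itself takes it from Configs.brokerList), and the topic name is the one used by the job.

package io.snappydata.adanalytics

import org.apache.spark.sql.SparkSession

// Illustrative only: batch read of whatever SparkLogAggregator has produced so far on the
// adImpressionsOut topic. Each value is a CSV line of
// timestamp, publisher, geo, avg_bid, imps, uniques.
object AdImpressionsOutReader extends App {
  val spark = SparkSession.builder()
    .appName("AdImpressionsOutReader")
    .master("local[*]")
    .getOrCreate()

  val out = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
    .option("subscribe", "adImpressionsOut")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(value AS STRING) AS csv_row")

  out.show(20, truncate = false)
}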