
Commit edc812a

Author: Kong Mu
Merge pull request #2 from visualskyrim/develop
Finish the first version
2 parents 6641804 + 4ebc616 · commit edc812a

22 files changed

Lines changed: 1955 additions & 42 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+.idea
+project/target/
+target/

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# CHANGELOG
+
+# 1.0.0
+> 2020-04-07
+
+## Added
+- Basic functionality to sessionize the accesses based on IP.

README.md

Lines changed: 38 additions & 42 deletions
@@ -1,63 +1,59 @@
# DataEngineerChallenge

-This is an interview challenge for PayPay. Please feel free to fork. Pull Requests will be ignored.

-The challenge is to make make analytical observations about the data using the distributed tools below.
+## Overview

-## Processing & Analytical goals:
+This document describes the solution to https://github.com/Pay-Baymax/DataEngineerChallenge.
+This repo only shows the Spark solution.

-1. Sessionize the web log by IP. Sessionize = aggregrate all page hits by visitor/IP during a session.
-   https://en.wikipedia.org/wiki/Session_(web_analytics)

-2. Determine the average session time
+## Solution
+### Understand the input data
+This was my first step: I needed to look into the data to find out its size, schema and so on.
+So I used **Jupyter** to inspect the input data that I had uploaded to HDFS.

-3. Determine unique URL visits per session. To clarify, count a hit to a unique URL only once per session.
+From this inspection, I learned:
+- The potential duration of a session can be extremely long (up to 11 hours).
+- Normal traffic is around 100k requests per hour in this data set, with peak traffic around 300k.
+- Most sessions are likely to end within about 20 minutes, and plenty of sessions end after 15 minutes.
+- The data covers 15 hours.

-4. Find the most engaged users, ie the IPs with the longest session times
+For the details of this inspection, please refer to the [inspection notebook](./doc/Data%20Inspect/Data%20Inspect.md).
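As an aside (not part of this diff): a minimal sketch of what such an inspection could look like in Spark, assuming a hypothetical HDFS path for the uploaded log and relying on the AWS ELB access-log layout, where the first space-separated field is the timestamp and the third is `client:port`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object InspectLog {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-elb-log").getOrCreate()
    import spark.implicits._

    // Hypothetical HDFS path; the real location of the uploaded log may differ.
    val raw = spark.read.text("hdfs:///data/weblog/access.log.gz")

    // ELB access logs are space separated: field 0 is an ISO-8601 timestamp,
    // field 2 is "client-ip:port" (IPv4 assumed here).
    val parsed = raw
      .select(split($"value", " ").as("f"))
      .select(
        $"f".getItem(0).cast("timestamp").as("ts"),
        split($"f".getItem(2), ":").getItem(0).as("ip"))

    // Rough traffic profile: request count per hour.
    parsed.groupBy(window($"ts", "1 hour")).count().orderBy("window").show(24, truncate = false)
  }
}
```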

-## Additional questions for Machine Learning Engineer (MLE) candidates:
-1. Predict the expected load (requests/second) in the next minute
+### Design Consideration

-2. Predict the session length for a given IP
+#### Should I use streaming?

-3. Predict the number of unique URL visits by a given IP
+Absolutely, yes. If we think only about the mission, it makes perfect sense to process the traffic data in a streaming application.
+Normally I would set up Logstash to stream the access log from AWS to a Kafka topic, then build a streaming application to provide real-time analysis.
+However, given that the data actually comes as a packed file, I assume the scenario is more of a batch context.
+That's why I chose Spark to build a batch application.

-## Tools allowed (in no particular order):
-- Spark (any language, but prefer Scala or Java)
-- Pig
-- MapReduce (Hadoop 2.x only)
-- Flink
-- Cascading, Cascalog, or Scalding

-If you need Hadoop, we suggest
-HDP Sandbox:
-http://hortonworks.com/hdp/downloads/
-or
-CDH QuickStart VM:
-http://www.cloudera.com/content/cloudera/en/downloads.html
+#### How about the granularity of the batch?

+Based on the requirement, it would make little sense to compute the sessions of the first hour of the day only at the start of the next day in a daily batch.
+Why not show them within the next hour with an hourly batch?
+Besides, the timestamps in the data are in UTC, so introducing a concept of "day" would be confusing.

-### Additional notes:
-- You are allowed to use whatever libraries/parsers/solutions you can find provided you can explain the functions you are implementing in detail.
-- IP addresses do not guarantee distinct users, but this is the limitation of the data. As a bonus, consider what additional data would help make better analytical conclusions
-- For this dataset, complete the sessionization by time window rather than navigation. Feel free to determine the best session window time on your own, or start with 15 minutes.
-- The log file was taken from an AWS Elastic Load Balancer:
-  http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/access-log-collection.html#access-log-entry-format
+Another benefit of an hourly batch is that we can potentially reduce the cluster cost by using fewer resources to process hourly data instead of daily data.

+#### How should we deal with sessions that do not end within one hour?

+Since we need to calculate sessions in the next hour, and a session can theoretically last forever, we need two things:
+- Concatenate the accesses from the last hour that are not part of any ended session with the accesses of the current hour.
+- A limit on how long a session can last at most.

-## How to complete this challenge:
+We need the second one because, if some sessions last too long, we will run into a serious data skew problem.
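As an illustration (not part of this diff), here is a minimal sketch of that carry-over logic. The names, schema and thresholds are placeholders (a 15-minute inactivity gap and a 4-hour cap are assumptions, not necessarily the values used in this repo):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Simplified, hypothetical schema; the real job carries more fields from the ELB log.
case class Access(ip: String, ts: Long, url: String)                               // ts = epoch seconds
case class Session(ip: String, start: Long, end: Long, hits: Int, uniqueUrls: Int)

object Sessionize {
  val GapSeconds = 15 * 60          // inactivity gap that closes a session (assumed)
  val MaxSessionSeconds = 4 * 3600  // cap so a single session cannot span many batches (assumed)

  /** Split one IP's accesses into sessions, honouring the inactivity gap and the cap. */
  def cut(accesses: Seq[Access]): Seq[Vector[Access]] =
    accesses.sortBy(_.ts).foldLeft(Vector.empty[Vector[Access]]) { (acc, a) =>
      acc.lastOption match {
        case Some(cur) if a.ts - cur.last.ts <= GapSeconds &&
                          a.ts - cur.head.ts <= MaxSessionSeconds =>
          acc.init :+ (cur :+ a)
        case _ => acc :+ Vector(a)
      }
    }

  /** One hourly batch: pending accesses carried over from the previous hour plus this hour's accesses. */
  def run(spark: SparkSession, pending: Dataset[Access], current: Dataset[Access],
          hourEnd: Long): (Dataset[Session], Dataset[Access]) = {
    import spark.implicits._

    val grouped = pending.union(current).rdd
      .groupBy(_.ip)
      .mapValues(as => cut(as.toSeq))
      .cache()

    // A session has "ended" once the gap has elapsed after its last access by the end of the hour.
    val ended = grouped.flatMap { case (ip, sessions) =>
      sessions.filter(s => hourEnd - s.last.ts > GapSeconds).map { s =>
        Session(ip, s.head.ts, s.last.ts, s.size, s.map(_.url).distinct.size)
      }
    }.toDS()

    // Everything else is written out as pending accesses and becomes input for the next hourly batch.
    val stillPending = grouped.flatMap { case (_, sessions) =>
      sessions.filter(s => hourEnd - s.last.ts <= GapSeconds).flatten
    }.toDS()

    (ended, stillPending)
  }
}
```

The actual implementation in this repo may structure the job differently (for example with Spark SQL window functions); the sketch only makes the carry-over and capping idea concrete.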

-1. Fork this repo in github
-2. Complete the processing and analytics as defined first to the best of your ability with the time provided.
-3. Place notes in your code to help with clarity where appropriate. Make it readable enough to present to the PayPay interview team.
-4. Include the test code and data in your solution.
-5. Complete your work in your own github repo and send the results to us and/or present them during your interview.
+#### What should the output look like?

-## What are we looking for? What does this prove?
+According to the [Analytical goals](https://github.com/Pay-Baymax/DataEngineerChallenge#processing--analytical-goals),
+all the metrics of interest are defined on the **session**, not on individual accesses.
+That being so, it makes more sense to me to output sessions with these metrics directly, rather than outputting accesses with a session id attached to them.
+This gives us two benefits:
+- It is easier and faster to calculate the duration, the number of sessions and the average accesses per session, since they are already aggregated at the session level.
+- It avoids the confusion of "*If a session lasts for two hours and we then check the number of sessions for each hour, should this session count as one session in each hour?*"

-We want to see how you handle:
-- New technologies and frameworks
-- Messy (ie real) data
-- Understanding data transformation
-This is not a pass or fail test, we want to hear about your challenges and your successes with this particular problem.
+Other than that, we also output the pending accesses that have not yet been cut into a session. This result is used as the input for the next hour's batch.
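To make the session-level output concrete (again an illustration, not part of this diff), the requested metrics could then be read straight off that output, reusing the hypothetical `Session` schema from the sketch above:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

// Same hypothetical session-level schema as in the sketch above.
case class Session(ip: String, start: Long, end: Long, hits: Int, uniqueUrls: Int)

object SessionMetrics {
  def report(spark: SparkSession, sessions: Dataset[Session]): Unit = {
    import spark.implicits._

    // Average session time, in seconds.
    sessions.agg(avg($"end" - $"start").as("avg_session_seconds")).show()

    // Unique URL visits per session are already pre-aggregated into `uniqueUrls`.
    sessions.select($"ip", $"uniqueUrls").show(10, truncate = false)

    // Most engaged users: IPs with the longest total session time.
    sessions.groupBy($"ip")
      .agg(sum($"end" - $"start").as("total_session_seconds"))
      .orderBy(desc("total_session_seconds"))
      .show(10, truncate = false)
  }
}
```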

build.sbt

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+
+lazy val versions = new {
+  val sessionize = "1.0.0"
+  val jodaTime = "2.9.3"
+  val log4j = "1.2.17"
+  val scalatest = "3.0.3"
+  val sparkVersion = "2.3.3"
+  val typesafe = "1.3.1"
+}
+
+lazy val root = (project in file("."))
+  .configs(Test)
+  .settings(
+    inThisBuild(List(
+      organization := "com.rakuten.rat",
+      scalaVersion := "2.11.8",
+      version := versions.sessionize
+    )),
+    name := "sessionize",
+    fork := true,
+    parallelExecution := false,
+    libraryDependencies ++= Seq(
+      "org.apache.thrift" % "libthrift" % "0.11.0",
+      "org.apache.spark" %% "spark-core" % versions.sparkVersion % "provided",
+      "org.apache.spark" %% "spark-sql" % versions.sparkVersion % "provided",
+      "org.apache.spark" %% "spark-hive" % versions.sparkVersion % "provided",
+      "com.typesafe" % "config" % versions.typesafe,
+      "joda-time" % "joda-time" % versions.jodaTime,
+      "log4j" % "log4j" % versions.log4j,
+      "log4j" % "apache-log4j-extras" % versions.log4j,
+      "com.sksamuel.elastic4s" %% "elastic4s-core" % "5.6.0",
+      "com.sksamuel.elastic4s" %% "elastic4s-http" % "5.6.0",
+      "org.json4s" %% "json4s-native" % "3.2.11",
+      "org.json4s" %% "json4s-jackson" % "3.2.11",
+      "org.scalatest" %% "scalatest" % versions.scalatest % "test",
+      "com.github.scopt" %% "scopt" % "3.6.0"
+    )
+  )
+
+parallelExecution in Test := false
+parallelExecution := false
+
+//unmanagedSourceDirectories in Compile += baseDirectory.value / "src" / "main" / "thrift-java"
+//scroogeThriftSourceFolder in Compile := { baseDirectory.value / "whitelist-store-service/src/main/thrift" }
+
+lazy val compileScalastyle = taskKey[Unit]("compileScalastyle")
+
+/* scalastyle >= 0.9.0 */
+compileScalastyle := scalastyle.in(Compile).toTask("").value
+
+lazy val upgrade = TaskKey[Unit]("upgrade", "Upgrade version")
+
+(compile in Compile) := ((compile in Compile) dependsOn compileScalastyle).value
