# DataEngineerChallenge

## Overview

This document describes the solution to https://github.com/Pay-Baymax/DataEngineerChallenge.
This repo only shows the Spark solution.

## Solution
### Understand the input data
My first step was to look into the data to find out its size, schema, and so on.
I used **Jupyter** to inspect the input data that I had uploaded to HDFS.

From this inspection, I learned:
- The duration of a session can potentially be extremely long (up to 11 hours).
- Normal traffic is around 100k requests per hour in this data set, with peak traffic around 300k.
- Most sessions likely end within about 20 minutes, and plenty of sessions end after 15 minutes.
- The data covers 15 hours.

For the inspection of the data, please refer to the [inspection notebook](./doc/Data%20Inspect/Data%20Inspect.md).
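As a sketch of what the inspection works with: each record is an AWS ELB access-log line, space-separated with a quoted request field. A minimal plain-Python parser could look like the following (the sample line is made up in the documented format, not taken from the data set):

```python
import shlex
from datetime import datetime

def parse_elb_line(line):
    """Split one ELB access-log line; shlex honours the quoted fields."""
    parts = shlex.split(line)
    timestamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%fZ")
    client_ip = parts[2].rsplit(":", 1)[0]   # strip the client port
    url = parts[11].split()[1]               # request is "METHOD URL PROTOCOL"
    return timestamp, client_ip, url

# made-up example line in the ELB access-log format
line = ('2015-07-22T09:00:28.019143Z my-loadbalancer 123.242.248.130:54635 '
        '10.0.6.158:80 0.000022 0.026109 0.00002 200 200 0 699 '
        '"GET https://paytm.com:443/shop HTTP/1.1" "Mozilla/5.0" '
        'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2')
ts, ip, url = parse_elb_line(line)
```

In the actual Spark job the same splitting logic would run per line of the input file.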

### Design Considerations

#### Should I use streaming?

Absolutely, yes. If we think only about the mission, it makes perfect sense to process the traffic data in a streaming application.
Normally I would set up Logstash to stream the access log from AWS to a Kafka topic, then build a streaming application to provide realtime analysis.
However, given that the data actually comes as a packed file, I assume the scenario is more of a batch context.
That is why I chose to use Spark to build a batch application.

#### How about the granularity of the batch?

Based on the requirements, it would make little sense for a daily batch to calculate the sessions of the first hour of a day only at the beginning of the next day.
Why not show them in the next hour with an hourly batch?
Besides, the timestamps in the data are in UTC, so introducing a concept of a "day" would be very confusing.

Another benefit of an hourly batch is that we can potentially reduce cluster cost by using fewer resources to process hourly data instead of daily data.

#### How should we deal with sessions that do not end within one hour?

Since we need to calculate the sessions that continue into the next hour, and a session can theoretically last forever, we need two things:
- Concatenate the accesses from the last hour that are not in any ended session with the accesses in the current hour.
- A limit on how long a session can last at most.

We need the second because, if some sessions last too long, we will have a serious data skew problem.
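The two points above can be sketched in plain Python for a single IP (all names are mine; the 15-minute inactivity gap and the 2-hour cap are assumed example values, not given by the data):

```python
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=15)  # assumed session window
MAX_SESSION = timedelta(hours=2)        # assumed cap against data skew

def sessionize_hour(pending, current, hour_end):
    """Cut one IP's access timestamps into sessions for an hourly batch.

    pending: last hour's accesses not yet in any ended session
    current: this hour's accesses
    Returns (ended_sessions, still_pending)."""
    hits = sorted(pending + current)
    sessions, bucket = [], []
    for t in hits:
        if bucket and (t - bucket[-1] > INACTIVITY_GAP
                       or t - bucket[0] >= MAX_SESSION):
            sessions.append(bucket)  # gap or cap exceeded: session ends
            bucket = [t]
        else:
            bucket.append(t)
    # the last bucket only ends if the gap to the hour boundary already
    # exceeds the inactivity window; otherwise it stays pending
    if bucket and hour_end - bucket[-1] > INACTIVITY_GAP:
        sessions.append(bucket)
        bucket = []
    return sessions, bucket

pending = [datetime(2015, 7, 22, 8, 55)]
current = [datetime(2015, 7, 22, 9, 2),
           datetime(2015, 7, 22, 9, 40),
           datetime(2015, 7, 22, 9, 58)]
ended, still_pending = sessionize_hour(pending, current,
                                       datetime(2015, 7, 22, 10, 0))
```

In Spark the same per-IP logic would run inside a grouping by IP; the `still_pending` accesses are what the batch writes out for the next hour to pick up.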

#### What should the output look like?

According to the [Analytical goals](https://github.com/Pay-Baymax/DataEngineerChallenge#processing--analytical-goals),
all metrics of interest are about the **session** rather than individual accesses.
That being said, it makes more sense to me to output sessions with these metrics directly, instead of outputting the accesses with a session id attached to them.
This benefits us with:
- Easier and faster calculation of duration, session count, and average accesses per session, since they are already aggregated at the session level.
- Avoiding confusion about "*If a session lasts for two hours, and we then check the session count for each hour, should this session count as 1 session in each hour?*"
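As a sketch, one output row per session could carry the aggregated metrics directly (plain Python; the field names are my own, not from the source):

```python
from datetime import datetime

def session_record(ip, hits):
    """Aggregate one session's hits, given as (timestamp, url) pairs,
    into session-level metrics."""
    times = [t for t, _ in hits]
    return {
        "ip": ip,
        "start": min(times),
        "duration_s": (max(times) - min(times)).total_seconds(),
        "hits": len(hits),
        # a hit to a unique URL counts only once per session
        "unique_urls": len({u for _, u in hits}),
    }

hits = [(datetime(2015, 7, 22, 9, 0), "/shop"),
        (datetime(2015, 7, 22, 9, 5), "/cart"),
        (datetime(2015, 7, 22, 9, 5), "/shop")]
record = session_record("123.242.248.130", hits)
```

The average session time is then just the mean of `duration_s` over the output, and the most engaged IPs are the ones with the largest `duration_s`.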

Other than that, we will also output the pending accesses that have not yet been cut into a session. This result will be used as the input for the next hour's batch.