Welcome! We have prepared a little exercise for you.
The task is to build a miniature data pipeline. Lacking a streaming source, pull the sample data set via HTTPS, do some simplified validation, transformation, and splitting of the data, and write the output to disk.
In more detail, the program should do the following, not necessarily in that order:
- Fetch our zipped line-json file with sample data from here.
- Write the output as line-json files again, this time not gzipped.
- Process one line at a time, pretending that you're processing a stream.
- Whenever there is a `convvalue`, use the rates.json and `convvalueunit` to convert the value to USD. Write that value to a new field `convusdvalue`.
- Split records by values in the `type` field and create one output file for each `type`.
- Validate that the `linkid` is a valid UUID, using the standard library. Invalid records must go to their own output file, e.g. "deadletters.json".
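
To make the per-line steps concrete, here is a minimal Python sketch of the validation, conversion, and splitting. It is one possible shape, not a reference solution. Several things are assumptions: the helper names `is_valid_uuid` and `process_lines` are made up for illustration, rates.json is assumed to be a flat mapping from currency unit to a USD multiplier, `convvalue` is assumed to be numeric, and output files are assumed to be named after the `type` value. Fetching and decompressing the input is left out.

```python
import json
import uuid
from pathlib import Path


def is_valid_uuid(value):
    """Return True if the standard library can parse value as a UUID.

    Note: uuid.UUID is lenient and also accepts forms without hyphens,
    with braces, or with a "urn:uuid:" prefix.
    """
    try:
        uuid.UUID(value)
        return True
    except (ValueError, TypeError, AttributeError):
        return False


def process_lines(lines, rates, out_dir):
    """Process line-json records one at a time and split them by type.

    lines:   any iterable of text lines (hypothetical stand-in for a stream)
    rates:   assumed shape {"<unit>": <USD multiplier>}, e.g. from rates.json
    out_dir: directory for the output files
    Returns a dict mapping output name to the number of lines written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # one open file per output name, opened lazily
    counts = {}

    def write(name, text):
        if name not in handles:
            handles[name] = open(out_dir / f"{name}.json", "w", encoding="utf-8")
        handles[name].write(text + "\n")
        counts[name] = counts.get(name, 0) + 1

    try:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: unparseable lines are dead letters too.
                write("deadletters", line)
                continue
            if not isinstance(record, dict) or not is_valid_uuid(record.get("linkid")):
                write("deadletters", line)
                continue
            if record.get("convvalue") is not None:
                rate = rates.get(record.get("convvalueunit"))
                # Assumption: records with an unknown unit pass through unconverted.
                if rate is not None:
                    record["convusdvalue"] = record["convvalue"] * rate
            # Assumption: records without a type go to "unknown.json".
            write(str(record.get("type", "unknown")), json.dumps(record))
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Because `process_lines` only iterates over `lines`, it can be fed a decompressed text stream from the download without reading the whole file into memory, which keeps the "one line at a time" requirement intact.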