Fixes to API, added primitives to format data, benchmarking function, tuning a pipeline #3

skyeeiskowitz wants to merge 141 commits into master from
Conversation
in the subdirectories inside the [pyteller/pipelines](orion/pipelines) folder.

This is the list of pipelines available so far, which will grow over time:
# Quickstart
Does pyteller still support multiple input options? If yes, perhaps create a separate file (e.g., data_format.md) to introduce different input options and give some examples?
## Releases

In every release, we run the pyteller benchmark and maintain an up to-date [leaderboard](../README.md#leaderboard).
In every release, we run the pyteller benchmark and maintain an up to-date [leaderboard](leaderboard.md) which can also be found in this [summary Google Sheets document](https://docs.google.com/spreadsheets/d/1OPwAslqfpWvzpUgiGoeEq-Wk_yK-GYPGpmS7TwEaSbw/edit?usp=sharing).
It is worthwhile to report some aggregated numbers, such as mean runtime, number of failures, and average performance scores, in the summary file. We can refer to ORION's summary report (link).
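A minimal sketch of how such a summary could be produced from the per-task details CSV added in this PR (the column names follow the CSV header shown further down; the file names and the 'OK' status value are assumptions):

```python
import pandas as pd

# Assumption: the benchmark details CSV from this PR, with columns such as
# pipeline, status, elapsed and the metric columns (MAE, MASE, ...).
details = pd.read_csv('results.csv')

summary = details.groupby('pipeline').agg(
    runs=('status', 'size'),                            # number of tasks per pipeline
    failures=('status', lambda s: (s != 'OK').sum()),   # assumption: 'OK' marks a successful run
    mean_elapsed=('elapsed', 'mean'),                    # mean runtime
    mean_mase=('MASE', 'mean'),                          # average performance score
)
summary.to_csv('summary.csv')
```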
In [5]: scores = benchmark(pipelines=pipelines, datasets=datasets, metrics=metrics, rank='MAPE')
datasets = ['a10', 'gasoline', 'calls']

results = benchmark(pipelines=pipelines, metrics=metrics, output_path=output_path, datasets=datasets, rank='MASE')
Benchmark will actually create many tasks to run. Do we save the meta information of these tasks somewhere? (ideas from Cardea and LM project)
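One possible way to persist that metadata, sketched as a hypothetical `tasks.csv` written before the runs start (nothing like this exists in the PR yet):

```python
import itertools
import pandas as pd

# Assumption: the same pipelines/datasets used in the benchmark examples of this PR.
pipelines = ['pyteller.ARIMA.arima', 'pyteller.LSTM.LSTM', 'pyteller.persistence.persistence']
datasets = ['a10', 'gasoline', 'calls']

# One record per (pipeline, dataset) task, so each run can be traced back to its inputs.
tasks = pd.DataFrame([
    {'pipeline': p, 'dataset': d, 'rank': 'MASE'}
    for p, d in itertools.product(pipelines, datasets)
])
tasks.to_csv('tasks.csv', index=False)  # hypothetical location for the task metadata
```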
@@ -0,0 +1,8 @@
The current pyteller benchmark was run on 11 signals for 3 pipelines
It's more convenient to merge leaderboard.md into README.md, either in the root directory or the current benchmark directory.
@@ -0,0 +1,34 @@
dataset,pipeline,signal,prediction length,iteration,MAE,MSE,RMSE,MASE,sMAPE,MAPE,under,over,strategy,lstm_1_units,dropout_1_rate,lstm_2_units,dropout_2_rate,window_size,p,d,q,status,elapsed,run_id
why not just use the release number to name the file?
For example,
0.1.0_summary.csv and 0.1.0_details.csv
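A tiny sketch of deriving those names from the package version, assuming pyteller exposes `__version__` (not verified in this PR):

```python
import pyteller  # assumption: the package exposes __version__

summary_path = f'{pyteller.__version__}_summary.csv'   # e.g. 0.1.0_summary.csv
details_path = f'{pyteller.__version__}_details.csv'   # e.g. 0.1.0_details.csv
```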
@@ -0,0 +1,6 @@
pyteller.metrics.over\_pred
Are these files generated automatically? Should we include them in a PR?
===========

The User Guides section covers different topics about Orion usage:
The User Guides section covers different topics about python usage:
sarahmish
left a comment
Great work! A lot of changes are proposed in this PR; there are some comments that I would recommend addressing before merging. Most of my comments are of a format / style nature.
General thoughts I have about pyteller:
- rethink where data should be stored.
- how do we store pipelines? according to the PR we have the following path `pyteller/pipelines/pyteller/`, why not keep it simpler? `pyteller/pipelines/`
- edit the `.gitignore` to ignore all automatically generated files under `api/`.
Time series forecasting using MLPrimitives
I would add the following as well to be clear about where we are in the project
- License: [MIT](https://github.com/signals-dev/pyteller/blob/master/LICENSE)
- Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
| pipeline | Percentage of Times Beat ARIMA |
| -------------------------------- | ------------------------------ |
| pyteller.ARIMA.arima | 0 |
| pyteller.LSTM.LSTM | 36.3636364 |
| pyteller.persistence.persistence | 18.1818182 |
Interesting table for a leaderboard. What does the percentage mean in this case? I find it hard to interpret. In Orion we aggregate the score over all signals per dataset and compute in how many datasets the pipeline outperformed ARIMA. I gather you are using % here because you are computing it per signal?
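To make the two interpretations concrete, here is a hedged sketch of both computations, assuming a results frame with `pipeline`, `dataset`, `signal` and a MASE column (lower is better). The column names follow the details CSV in this PR, but the computation itself is only a guess at what the leaderboard script does:

```python
import pandas as pd

# Assumption: the benchmark details CSV from this PR.
results = pd.read_csv('results.csv')

# Per-signal interpretation (what the % in the table seems to be): fraction of
# signals on which a pipeline's error is lower than ARIMA's on the same signal.
signal_scores = results.groupby(['pipeline', 'dataset', 'signal'])['MASE'].mean().unstack('pipeline')
pct_beat_arima = signal_scores.lt(signal_scores['pyteller.ARIMA.arima'], axis=0).mean() * 100

# Orion-style interpretation: average the score over all signals per dataset,
# then count in how many datasets each pipeline outperformed ARIMA.
dataset_scores = results.groupby(['pipeline', 'dataset'])['MASE'].mean().unstack('pipeline')
datasets_won = dataset_scores.lt(dataset_scores['pyteller.ARIMA.arima'], axis=0).sum()
```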
@@ -0,0 +1,6 @@
pyteller.Pyteller.evaluate
we should not include any file from api/*; to avoid this from happening, add the following to .gitignore:
# Sphinx documentation
docs/_build/
docs/**/api
1. Load the data
----------------

Here is an example of loading the **Alabama Weather** demo data which has multiple entities in long form:
in data.rst we do not introduce the difference between flat form and long form. I would either omit "long form" here or add a section in data.rst explaining the supported data formats.
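For reference, a small illustration of long form versus a flat (wide) table; the column names here are made up and are not necessarily the ones pyteller expects:

```python
import pandas as pd

# Long form: one row per (entity, timestamp) observation; multiple entities share the same columns.
long_form = pd.DataFrame({
    'entity': ['station_1', 'station_1', 'station_2', 'station_2'],
    'timestamp': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02']),
    'value': [61.2, 59.8, 70.1, 68.4],
})

# Flat (wide) form: one column per entity, one row per timestamp.
flat_form = long_form.pivot(index='timestamp', columns='entity', values='value')
```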
from pyteller.core import Pyteller

pipeline = 'pyteller.LSTM.LSTM'
I'm wondering if we can simplify this to `pipeline = 'LSTM'`, and do the same in all other places as well.
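One way this could work is a small lookup inside pyteller that expands short aliases into the full pipeline names before loading them; the helper below is purely hypothetical:

```python
# Hypothetical helper: map a short alias to the full pipeline name used on disk.
PIPELINE_ALIASES = {
    'LSTM': 'pyteller.LSTM.LSTM',
    'ARIMA': 'pyteller.ARIMA.arima',
    'persistence': 'pyteller.persistence.persistence',
}

def resolve_pipeline(name):
    """Return the full pipeline name, accepting either a short alias or a full name."""
    return PIPELINE_ALIASES.get(name, name)

pipeline = resolve_pipeline('LSTM')  # -> 'pyteller.LSTM.LSTM'
```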
@@ -0,0 +1,46 @@
{
this file is a duplicate I think?
@@ -0,0 +1,61 @@
{
are we supposed to have 2 files of each?
# convert index to datetime
# if index.dtype == 'float' or index.dtype == 'int':
#     index = pd.to_datetime(index.values * 1e9)
# else:
#     index = pd.to_datetime(index)
#
# if actuals[time_column].dtypes == 'float' or actuals[time_column].dtypes == 'int':
#     actuals[time_column] = pd.to_datetime(actuals[time_column] * 1e9)
# else:
#     actuals[time_column] = pd.to_datetime(actuals[time_column])

def plot(dfs, output_path, labels=None):
def plot_forecast(dfs, output_path=None, labels=['actuals', 'predicted'], frequency=None):
we typically try to avoid mutable objects as the default arguments of a function; I recommend setting it to None and then adding the default if it is None:
def plot_forecast(dfs, output_path=None, labels=None, frequency=None):
    labels = labels or ['actuals', 'predicted']

'Keras>=2.1.6,<2.4',
'mlblocks>=0.4.0,<0.5',
'mlprimitives>=0.3.0,<0.4',
'pandas>=1,<2',
I think `'pandas>=1,<2'` is already specified in mlprimitives, so we can skip it here. The same goes for `'scikit-learn>=0.21'`.
Data Input
- Removed `ingest_data` and `egest_data` and replaced them with simpler preprocessing primitives `pyteller.primitives.preprocessing.format_data.json` and `pyteller/primitives/jsons/pyteller.primitives.postprocessing.reformat_data.json`

Pipeline Outputs
- Added a `default` outputs block to pipelines with outputs as `forecast` and `actual`
- `fit` and `forecast` outputs are dictionaries with the key being the output name specified in the pipeline json
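Based on that description, consuming the outputs might look roughly like this; the method signature below is an assumption, and only the dictionary keys (`forecast`, `actual`) come from the PR description:

```python
# Assumption: `pyteller` is an already-fitted Pyteller instance and `test_data`
# holds the data to forecast on; both come from a quickstart-style setup.
output = pyteller.forecast(data=test_data)

# Per the description above, outputs come back as a dictionary keyed by the
# output names declared in the pipeline json.
predictions = output['forecast']
actuals = output['actual']
```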
Benchmarking

Style