1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
---
title: "Research code workflow"
---

This page is designed to be a quick reference for the stages in building a research coding project. Please refer back to the main text for detailed guidance for any of these steps or stages.

::: {.panel-tabset group="workflow"}
## Single directory project

If you select this tab, the instructions below will show information relevant to organising your project in a single project folder: this folder will contain all of your code, application, analysis, figures, and notes.

## Dual directory project

If you select this tab, the instructions below will show information relevant to organising your project in two project folders: one folder for your core code as an installable package, the other containing the application of this code, analysis, notes, figures etc.

:::

## Part One: building the core code

### 1. Brainstorm and gather requirements

- Functional requirements: what must the software do?
- Non-functional requirements: how should it work/behave?
- Constraints: what are the limitations or assumptions?

### 2. Create your project directory structure in a repository

<label>Project name: &nbsp; <input id="name" value="example_name" type="text" placeholder="example_name" pattern="[a-z0-9_]*" style="font-family:monospace;"></label>

:::::{.callout-note collapse="true"}
## View commands to create structure

:::: {.panel-tabset group="workflow"}
## Single directory project


<pre id="output_a" style="background-color: #f8f9fa; padding: 10px; border-radius: 5px; border-left: 4px solid #007acc;"></pre>


## Dual directory project


<pre id="output" style="background-color: #f8f9fa; padding: 10px; border-radius: 5px; border-left: 4px solid #007acc;"></pre>


<script>
const name = document.getElementById('name');

function updateOutput() {
    const sanitized = (name.value || "").replace(/[^a-z0-9_]/g, "");
    output.textContent =
`mkdir -p tests src/${sanitized}
touch {pyproject.toml,environment.yml,README.md,CITATION.cff,src/${sanitized}/__init__.py,src/${sanitized}/example.py}
echo -e 'import sys\nsys.path.append("src")' > tests/__init__.py
echo -e 'name: ${sanitized}-env

channels:
  - conda-forge
  - nodefaults

dependencies:
  - python=3.13
  - pytest
  - blackd
  - isort

  # remove/modify these as needed
  - numpy
  - pandas

  # keep this to install your Python package locally
  - pip
  - pip:
    - --editable .' > environment.yml
echo -e '[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "${sanitized}"
version = "0.1.0"
description = "Brief description"
authors = [{name = "Your Name"}]
requires-python = ">=3.13"

[tool.setuptools.packages.find]
where = ["src"]' > pyproject.toml`

   output_a.textContent =
`mkdir -p tests src/${sanitized} data/raw data/results notebooks reports
touch {pyproject.toml,environment.yml,README.md,CITATION.cff,src/${sanitized}/__init__.py,src/${sanitized}/example.py}
echo -e 'import sys\nsys.path.append("src")' > tests/__init__.py
echo -e 'name: ${sanitized}-env

channels:
  - conda-forge
  - nodefaults

dependencies:
  - python=3.13
  - pytest
  - blackd
  - isort

  # remove/modify these as needed
  - numpy
  - pandas

  # keep this to install your Python package locally
  - pip
  - pip:
    - --editable .' > environment.yml
echo -e '[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "${sanitized}"
version = "0.1.0"
description = "Brief description"
authors = [{name = "Your Name"}]
requires-python = ">=3.13"

[tool.setuptools.packages.find]
where = ["src"]' > pyproject.toml`
    }

name.addEventListener('input', updateOutput);
updateOutput();

</script>


::::
:::::

### 3. Create a development environment

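Create the environment from the `environment.yml` file in your project folder:

```bash
conda env create -f environment.yml
```

If you later change `environment.yml`, update the existing environment to match:

```bash
conda env update --file environment.yml --prune
```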

### 4. Write your pseudocode, comments, and code

- Write code for yourself in a year: name variables and functions sensibly, and use comments to add context
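
As a minimal sketch of the difference this makes, the two functions below compute the same thing. The function names and the temperature example are hypothetical, chosen only for illustration.

```python
# Hard to follow in a year: what are f, a, b and c?
def f(a, b, c):
    return a + (b - a) * c


# Easier to follow: descriptive names, a docstring, and a comment giving context.
def interpolate_temperature(start_kelvin, end_kelvin, fraction):
    """Linearly interpolate between two temperatures.

    `fraction` is 0.0 at `start_kelvin` and 1.0 at `end_kelvin`.
    """
    # Linear interpolation is a hypothetical modelling choice for this example.
    temperature_change = end_kelvin - start_kelvin
    return start_kelvin + temperature_change * fraction
```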

### 5. Write a test suite

Create your tests in a folder called `tests`. Call each file `test_<YOUR-PYTHON-MODULE-NAME>.py`, and define the functions inside it as `def test_<YOUR-FUNCTION-NAME>():`.

Use this format:

```python
def test_example():
    '''Test for the example function'''

    # Arrange
    test_variable_1 = 0
    test_variable_2 = 1
    expected_output = 7

    # Act
    output = your_function(test_variable_1, test_variable_2)

    # Assert
    assert output == expected_output

    # No cleanup needed
```

All functions in your core code package should be tested.
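
As a concrete instance of the format above, here is a minimal sketch for a hypothetical function `add_offsets`. In a real project the function would live in your package (for example `src/<your_package>/example.py`) and be imported into the test file; it is defined inline here only to keep the example self-contained.

```python
# Hypothetical function under test. In a real project this would be imported,
# e.g. `from example_name.example import add_offsets` (names are assumptions).
def add_offsets(first_offset, second_offset):
    """Return the sum of two offsets."""
    return first_offset + second_offset


def test_add_offsets():
    """Test that add_offsets sums its two arguments."""

    # Arrange
    first_offset = 3
    second_offset = 4
    expected_output = 7

    # Act
    output = add_offsets(first_offset, second_offset)

    # Assert
    assert output == expected_output
```

Assuming the test is saved as `tests/test_example.py`, running `pytest` from the project root should discover and run it.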

### 6. Write documentation

::: {.panel-tabset group="workflow"}
## Single directory project

You'll want to add:

- Module and function level docstrings
- A README.md file
- A `pyproject.toml` file to document your core Python package
- Possibly some example notebooks running the code

## Dual directory project

In your core code/package repository:

- Module and function level docstrings
- A README.md file
- A `pyproject.toml` file to document your core Python package

In your analysis/application repository:

- Module and function level docstrings for any code
- Example notebooks
- Instructions in your README.md on how to use the code in the package repository.

:::
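
As a minimal sketch of module and function level docstrings, the example below uses NumPy-style sections. That style is an assumption here (other conventions work equally well), and the module and function are hypothetical.

```python
"""Tools for converting between temperature units.

This module-level docstring summarises what the module is for.
"""


def celsius_to_kelvin(temperature_celsius):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    temperature_celsius : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in kelvin.
    """
    return temperature_celsius + 273.15
```

Calling `help(celsius_to_kelvin)` in a Python session displays the function docstring.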

## Part Two: using the core code

### 1. Setting up your folder structure

::: {.panel-tabset group="workflow"}
## Single directory project

You've already sorted this back in [Step 2 of Part One](#create-your-project-directory-structure-in-a-repository).

## Dual directory project

This is more flexible, but it's generally still a good idea to keep things organised. This folder might look like this:

```text
pallasite-parent-body-evolution/    The project git repository
├── LICENSE
├── README.md
├── environment.yml                 The libraries I need for analysis (including the package from our package repository)
├── data                            I usually load in large data from storage elsewhere
│   ├── interim                     But sometimes do keep small summary datafiles in the repository
│   ├── processed
│   └── raw
├── docs                            Notes on analysis, process etc.
├── notebooks                       Jupyter notebooks used for analysis
├── reports                         For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│   └── figures                     Figures for the manuscript or reports
├── src                             Source code for this project
│   ├── data                        Scripts and programs to process data
│   ├── tools                       Any helper scripts go here
│   └── visualization               Scripts for visualisation of your results, e.g., matplotlib, ggplot2 related.
└── tests                           Test code for this project, benchmarking, comparison to analytical models
```
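
If you would rather script this layout than create it by hand, one possible approach is sketched below using `pathlib`. The function name `create_analysis_structure` is hypothetical, and the folder names are simply those from the tree above.

```python
from pathlib import Path

# Folder names taken from the example tree above; adjust to suit your project.
ANALYSIS_FOLDERS = [
    "data/interim",
    "data/processed",
    "data/raw",
    "docs",
    "notebooks",
    "reports/figures",
    "src/data",
    "src/tools",
    "src/visualization",
    "tests",
]


def create_analysis_structure(project_root):
    """Create the analysis folder layout under `project_root`."""
    project_root = Path(project_root)
    for folder in ANALYSIS_FOLDERS:
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
        (project_root / folder).mkdir(parents=True, exist_ok=True)
    # Create empty placeholder files; existing files are left untouched.
    for filename in ["LICENSE", "README.md", "environment.yml"]:
        (project_root / filename).touch(exist_ok=True)
    return project_root
```

For example, `create_analysis_structure("pallasite-parent-body-evolution")` creates the layout inside the current working directory.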

:::

### 2. Creating a research environment

::: {.panel-tabset group="workflow"}
## Single directory project

We have already done this back in [Step 3](#create-a-development-environment).

## Dual directory project

You have a few options:

### Just keep using your development environment from Step 3

As you have installed your package, you can continue to use your development environment from [Step 3](#create-a-development-environment) across your system, in other folders.

### Create a new environment and link to your local package

Copy the `environment.yml` file you created in [Step 3](#create-a-development-environment) into the analysis folder, and replace the following lines:

```yaml
  # keep this to install your Python package locally
  - pip
  - pip:
    - --editable .
```

with:

```yaml
  # keep this to install your Python package locally
  - pip
  - pip:
    - --editable /absolute/path/to/package-repo
```

This is still not easily reusable (the absolute path only exists on your machine), but at least it's now more obvious what environment you're using, and you can still easily make changes to the package as you work.
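
To confirm which copy of a package an environment is using, one option is to ask Python where it would import it from. This is a minimal sketch: `package_location` is a hypothetical helper, and `json` is only a standard-library stand-in for your own package name. For an editable install, the reported path should normally point into your package repository rather than into `site-packages`.

```python
import importlib.util


def package_location(package_name):
    """Return the file path a package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    return spec.origin


# `json` is a standard-library stand-in; replace it with your package name.
print(package_location("json"))
```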

### Install your package via GitHub

Copy the `environment.yml` file you created in [Step 3](#create-a-development-environment) into the analysis folder, and replace the following lines:

```yaml
  # keep this to install your Python package locally
  - pip
  - pip:
    - --editable .
```

with:

```yaml
  # install your Python package from GitHub
  - pip
  - pip:
    - <PACKAGE_NAME>@git+https://github.com/<USER_NAME>/<REPO_NAME>
```

Replace the placeholders in `<ANGLE_BRACKETS>` with the appropriate values.

This can be a bit tricky when you're in the early stages of the project and need to frequently update your core code package, as you'll need to push the package changes to GitHub and then update your environment every time. I often leave this until my package code is fairly stable. You'll definitely want to do this before releasing your code.

:::

### 3. Do your research!

#### Reloading libraries

When working with an editable install, you may need to force reload the module after changing it to ensure the changes carry through.

Let's say you have updated your package `amazing_project`, which is installed in your current Conda environment as an editable install. You don't need to update your environment, but a notebook that is already running will keep using the version it first imported, so you have to force a reload of the module. This is straightforward with `importlib` (part of the Python standard library, so it doesn't need to be added to your environment):

```python
import amazing_project as ap
import importlib
importlib.reload(ap)
```
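
To see what the reload does, and where it stops, here is a self-contained sketch. It uses a throwaway module written to a temporary folder as a stand-in for an editable install; the module name `reload_demo_module` is hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for an editable install: a throwaway module in a temporary folder.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "reload_demo_module.py"
module_file.write_text('def greeting():\n    return "old"\n')
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()  # make sure the new file is visible to the import system

import reload_demo_module as demo
from reload_demo_module import greeting as imported_greeting

before_edit = demo.greeting()

# Edit the source on disk, as you would while developing your package.
# The new source differs in length, so stale cached bytecode cannot be reused.
module_file.write_text('def greeting():\n    return "new version"\n')

still_stale = demo.greeting()      # the running session has not noticed the edit
importlib.reload(demo)
after_reload = demo.greeting()     # the module now reflects the edit
from_import = imported_greeting()  # names bound via `from ... import` are not refreshed

print(before_edit, still_stale, after_reload, from_import)
```

After the reload, `demo.greeting()` returns the edited value, while `imported_greeting` still points at the old function. This is one reason to prefer `import package as alias` over `from package import name` while you are actively editing a package.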

#### Using notebooks

Jupyter notebooks can be tricky when it comes to version control: they are filled with HTML formatting that updates when you rerun cells, and this can obscure actual code changes. There are a few different options if you are keen to use notebooks:

- [Jupytext](https://jupytext.readthedocs.io/en/latest/index.html): creating Jupyter notebooks in plain `.py` files
- [Marimo notebooks](https://marimo.io/): an alternative notebook framework, again in plain `.py`.
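
As a sketch of what a notebook stored as plain `.py` can look like, the example below uses the "percent" format supported by Jupytext, in which `# %%` starts a code cell and `# %% [markdown]` starts a Markdown cell. The measurements are made up for illustration.

```python
# %% [markdown]
# # Cooling curve
# A short notebook stored as a plain Python script.

# %%
import statistics

# Made-up measurements, for illustration only.
temperatures_kelvin = [300.0, 280.0, 260.0]

# %%
mean_temperature = statistics.mean(temperatures_kelvin)
print(mean_temperature)
```

Because the file is ordinary Python, it also runs as a script, and version control only sees changes to the code and comments.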

### 4. Export a record of your environment

::: {.panel-tabset group="workflow"}
## Single directory project

When you've created a "batch" of results that you're happy with, you should record your exact dependencies at that point:

```bash
conda env export > env-record.yml # from inside the activated env
```

## Dual directory project

When you've created a "batch" of results that you're happy with, record the exact dependencies of the environment used to produce them (your [research environment from Step 2](#creating-a-research-environment)) and save this record in your analysis folder:

```bash
conda env export > env-record.yml # from inside the activated env
```

It's often a good idea to create a release of your package (see [Step 5 below](#create-a-release-synced-with-zenodo)) and install it properly in your research environment using the version number and the GitHub URL, and *then* do your final analysis and export your research environment.

:::
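
If you also want a record from inside Python, for example to store alongside a results file, a minimal sketch using the standard library is shown below. This is a supplement to, not a replacement for, `conda env export`: it only sees Python packages, not other Conda dependencies. The function name `installed_versions` is hypothetical.

```python
from importlib.metadata import distributions


def installed_versions():
    """Return sorted `name==version` strings for packages visible to this Python."""
    pins = set()
    for dist in distributions():
        metadata = dist.metadata
        # Skip any distribution whose metadata is missing or has no name.
        name = metadata.get("Name") if metadata is not None else None
        if name:
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)
```

The returned list can be written to a text file next to your results, for example with `Path("python-packages.txt").write_text("\n".join(installed_versions()))` (the file name is an assumption).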

### 5. Create a release, synced with Zenodo

Once your release has a DOI, add it to your `CITATION.cff` file!