⚠️ This code is released under open government licence and is provided without warranty. It is currently under active development and has not been subjected to peer or user testing. Outputs should be interpreted with caution.
The Imaging Flow Cytometer (IFC) aboard RV Cefas Endeavour operates continuously, capturing high-resolution data on any small particles in the water. These may be harmful algae, microbes, sewage-related flocs, natural or synthetic fibres, etc. The instrument can also inject staining dyes to bind RNA, DNA, and other cellular components, enabling detailed insights into cellular processes. Unlike other imaging instruments such as the Plankton Imager, which targets larger trophic levels relevant to fisheries, the IFC focuses on smaller organisms, offering unique opportunities for water quality and ecosystem monitoring.
The societal need:
- It offers an alternative to scientists manually filtering water samples and counting objects under the microscope.
- It scales up in a way that manual processing cannot. In theory, we could get water samples diagnosed every nautical mile.
- In the same sense, it can scale up to "look across" different objects of interest. Rather than just living phytoplankton, we could also be looking at quantities of bacterial colonies known as "flocs" that are associated with river discharge and sewage.
The challenges:
- Data are not typically combined across instruments. We would like to interpret these datasets as complete time series. The challenge lies in calibration and standardisation.
- Labelling remains manually-focused: experts “read” instrument signals to classify particles. This leads to discontinuities whenever instruments, analysts, or surveys change. For example, a label might state something equivalent to: “Particle 9999 is species A with red laser voltage 5 and blue voltage 6.” Such machine-specific labels cannot transfer across surveys. A better approach would be to use calibration beads as reference points: “Particle 99 is species A, generating twice the red response of a 10 μm bead and one-tenth that of a 100 μm bead.” This relative labelling enables machine learning classifiers to build transferable models, creating robust time series across instruments and institutions.
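The bead-relative labelling idea above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function name, bead labels, and reference values are all hypothetical, standing in for responses measured in a real bead calibration run.

```python
def bead_relative_features(red_response, bead_refs):
    """Express a particle's red-laser response relative to calibration beads.

    bead_refs maps a bead label to that bead's mean red response
    (hypothetical values; in practice taken from a calibration run).
    The returned ratios are machine-independent, unlike raw voltages.
    """
    return {f"red_vs_{label}": red_response / ref
            for label, ref in bead_refs.items()}

# Hypothetical bead reference responses measured on this instrument:
refs = {"10um_bead": 5.0, "100um_bead": 100.0}

# A particle generating twice the 10 um bead response and one-tenth
# of the 100 um bead response, as in the example above:
features = bead_relative_features(10.0, refs)
```

Features expressed this way should remain comparable even when photomultiplier voltages or instruments change, provided the same bead set is run on each instrument.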
The data:
- The raw data generated by the flow cytometer are stored in .CYZ files, a proprietary format used by the CytoClus software. These files contain complex, many-dimensional data that must be decoded, structured, and interpreted before they can be used for scientific analysis or machine learning.
This repository exists for the development of Python tools to process, visualize, and classify flow cytometry data. It supports workflows such as:
- Converting .CYZ files to JSON using cyz2json
- Training machine learning models (e.g., random forests) to classify particles
- Visualizing and labeling data interactively
- Processing large datasets stored in Azure Blob containers
- Monitoring local directories for new data and applying trained models automatically
This Python code is being developed with marine research in mind, where understanding the composition of microscopic life in water samples can inform studies on biodiversity and environmental change. We also expect the analysis of flow cytometry data to feed into indicators of ecosystem health.
There are a few Python tools here in various states of development, which broadly resemble a minimal, localised MLOps toolkit for interfacing with a blob store. This repository is an attempt to put the various tools in one place, anticipating that a random forest model is our most likely candidate for automated classification of the flow cytometer data being generated on RV Cefas Endeavour.
It is developed around a handful of labelled flow cytometer files held in https://citprodflowcytosa.blob.core.windows.net/public/exampledata/ but you could export your own data from the CytoClus software and put it in flowcytometertools/exampledata/. To do this in the manufacturer's CytoClus software, first select your file (with sets defined), then under "Database" click Exports and check the box for CYZ file (for set).
This was developed on Windows, but a GitHub Actions workflow did (at one point) verify that the Download & Train tab would work on a Linux machine.
Ideally, users should be familiar with Python: in all likelihood you will be doing different specific experiments, will have a different model of CytoSense, something will break, etc.
The "Local Watcher" tab unpacks .cyz files as they are generated, and attempts to categorise them and send a report packet to an endpoint. QC of the data is needed if we are to automate the instrument, and checks so far include (values may differ):
- Sufficient volume analysed (>3 ml)
- Events (>5000)
- Laser balancing (symmetrical left:right 0.75-1.25)
- Laser temperature (<45 degrees)
- Sheath fluid temperature (<35 degrees)
- Photomultiplier tube temperature (<55 degrees)
- System temperature (<55 degrees)
- Red/Yellow/Orange response curve saturation check
- Instrument serial ID matches that of the trained model
- Photomultiplier tube gain settings match those of the trained model
The concept of protocols:
Different protocols exist depending on what the user may be looking for. An event is detected when a particle causes a "sufficient" response as it passes through a voltage gate. Set too high, small particles get missed; set too low, and lots of noise gets detected. Detecting the small species means operating in or close to the noise range, and to compensate, the flow rate has to be dropped or many coincidence events occur. In doing so, the volume analysed is very small, which does not provide a good estimate of the larger species. For that reason, at least two protocols should be defined to capture most phytoplankton: a nano and a pico protocol, where the priority is, respectively, total volume processed or nearing the limits of detection. A sample collected under the nano protocol is not directly comparable to a pico protocol sample. Therefore the reported packet name gets mutated to also contain the protocol name. Note that a protocol need not be limited to just pump speed and trigger level. Other candidates for dynamically changeable sample properties include smart trigger settings (essentially regions of interest), PMT settings, injection of calibration beads, etc.
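The QC checklist and the protocol concept above can be sketched together. This is a hypothetical illustration only: the field names, thresholds, and protocol settings are stand-ins mirroring the checklist (the README itself notes the values may differ), not the tool's actual data model.

```python
def qc_checks(sample):
    """Apply the QC gates from the checklist to one unpacked sample.

    `sample` is a hypothetical dict of instrument readings; thresholds
    mirror the README checklist and may differ on a real instrument.
    Returns (all_passed, per-check detail).
    """
    checks = {
        "volume_ok": sample["volume_ml"] > 3,
        "events_ok": sample["events"] > 5000,
        "laser_balance_ok": 0.75 <= sample["laser_left_right_ratio"] <= 1.25,
        "laser_temp_ok": sample["laser_temp_c"] < 45,
        "sheath_temp_ok": sample["sheath_temp_c"] < 35,
        "pmt_temp_ok": sample["pmt_temp_c"] < 55,
        "system_temp_ok": sample["system_temp_c"] < 55,
        "serial_ok": sample["serial"] == sample["model_serial"],
    }
    return all(checks.values()), checks

def packet_name(base, protocol):
    """Mutate the report packet name to carry the protocol it was run under."""
    return f"{base}_{protocol}"

# Two hypothetical protocols trading volume analysed against detection limit:
PROTOCOLS = {
    "nano": {"pump_speed": "high", "trigger_level": "high"},  # prioritise volume
    "pico": {"pump_speed": "low", "trigger_level": "low"},    # near the noise floor
}
```

Keeping the protocol name inside the packet name means downstream consumers never accidentally pool nano- and pico-protocol samples into one time series.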
** Please use only for development at this moment **
If present, check releases to download a "distributable". Whilst tested on linux, I have only made pyinstaller builds of the software for windows and I am not intending to make them public at this moment in case it triggers crown copyright obligations.
This was developed in miniforge prompt with conda on windows but other distributions could be used. Use conda to install environment.yml, failing that, install environment_generic.yml:
conda env create -f environment.yml
conda activate flowcytometertool
Navigate to the src directory and call the program script from there:
cd src
python flow_cytometer_tool.py
⚠️ Neither of these has been tested in a long time and the tool is almost guaranteed not to build. Instead, please git clone this repository and build an environment with environment_generic.yml.
This was developed minimally on Windows 10 in miniforge3 prompt. You may be able to compile the program with the command "pyinstaller flow_cytometer_tool.spec", however this is not guaranteed. The tool is in active development and I hope to eventually move towards versioned releases. The current solution, rather than versioning, is to grab the repository sha and save this alongside trained models with many other model training details.
For reproducibility, and to test whether this can run on another machine, we use GitHub Actions on an Ubuntu runner. In a git bash terminal you can trigger a test build on GitHub with a new VERSION, e.g.:
VERSION="0.0.0.1"; git tag -a v$VERSION -m "Release version $VERSION"; git push origin v$VERSION
You can get a public random forest .pkl (~2 GB) and its associated modeltrainsettings.json from: https://citprodflowcytosa.blob.core.windows.net/public/exampleselectedvalidappliedmodel/final_model_20260317_132404.pkl https://citprodflowcytosa.blob.core.windows.net/public/exampleselectedvalidappliedmodel/modeltrainsettings.json
And an example flowcytometertoolconfig.yaml here: https://citprodflowcytosa.blob.core.windows.net/public/exampleselectedvalidappliedmodel/flowcytometertoolconfig.yaml
The tool expects to find them in ~\Documents\flowcytometertool\selectedvalidappliedmodel\ and ~\Documents\flowcytometertool\flowcytometertoolconfig.yaml respectively.
The model was trained on the example dataset https://citprodflowcytosa.blob.core.windows.net/public/exampledata/ESTUARINEUK using the settings in modeltrainsettings.json. This json contains the git sha of flowcytometertool when the model was run, but this sha refers to the private mirror of this repository so you will not see it in the squashed public history.
Note these are examples only and should not be taken as accurate, quality-controlled labels and models; they will almost certainly not transfer across instruments (please train your own!).
⚠️ This overview is now badly out of date - issue #28
This tab wraps the model training functions developed during Lucinda Lanoy’s Masters research in a simplified tkinter GUI (replacing the original R Markdown interface). It allows users to train machine learning models, including Random Forests, using scikit-learn.
Click the buttons in sequence from top to bottom; they will:
- Download the data from a blob store. By default this should be set to the public data in https://citprodflowcytosa.blob.core.windows.net/public/exampledata/ which needs no SAS authentication. If you change it to a blob store that needs authenticating, put a path to your SAS key saved as a plain .txt file in the "blob tools" tab.
- Download the cyz2json executable you need.
- Apply cyz2json to the downloaded data. Your CYZ files should now be JSON files instead (this step is invoked via subprocess and no check is implemented to ensure it has worked).
- Convert your JSON files to listmode CSVs. "Listmode" is the part of the JSON file that pertains to the laser summaries; this step therefore leaves behind any images taken, and the full pulse shapes are not extracted from the JSON files either.
- Combine CSVs, specifying the Zone you want to train for. Training across multiple zones is not yet implemented. Note the expertise matrix, which assigns a level of expertise (1 being non-expert, 2 intermediate, 3 expert). If there is disagreement on a label, the expert is prioritised.
- Train model. A split of your data will be taken for training, and some will be retained for testing. This trains a LOT of models in sklearn, searching for the best model variables in your data and the best hyperparameters.
- Test classifier against the test dataset.
- You can run the training process on both Windows and Linux (tested via GitHub Actions on a Linux runner).
- Note: Building a release hasn’t been tested recently.
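The "Combine CSVs" and "Train model" steps above could be sketched as below. This is a minimal stand-in under stated assumptions: the expertise-resolution function, the synthetic two-class data, and the single forest (rather than the full model/hyperparameter search the tool runs) are all illustrative, not the repository's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def resolve_label(labels, expertise):
    """Pick the label of the most expert labeller (1 = non-expert, 3 = expert).

    A hypothetical stand-in for the expertise-matrix disagreement rule
    applied when CSVs are combined; `labels` maps labeller -> label.
    """
    return labels[max(labels, key=lambda who: expertise[who])]

# Synthetic listmode-like features: two well-separated particle classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(10, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

# Hold some data back for testing, as the "Train model" button does.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # the "Test classifier" step on held-out data
```

The held-out accuracy is what the "Test classifier" button reports; on real listmode data the classes are far less separable than in this toy example.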
Once a model is trained and tested, this tab allows users to explore and label the data interactively.
- Visualizes predictions and other data columns using scatter plots.
- Allows relabeling of data points directly in the interface.
- Supports loading additional CSV files (e.g., the mixfile) for exploration.
- Known issues:
- Crashes if you try to color by a column with too many unique values.
- Some functionality is currently broken, especially when loading external CSVs.
- You must select both X and Y axes before clicking "Update Plot"; otherwise a KeyError will occur.
This tool creates a representative training dataset by sampling from processed predictions.
- Takes a 1-in-1000 random subsample of particles from multiple processed files stored in the blob container.
- Saves the result as a CSV file that can be visualized in the previous tab.
- Intended to capture environmental variability by aggregating data across many samples.
- Known issues:
- Not recently tested.
- Relies on SAS token access, but how to do this may not be clear to the user. The app needs some signposting: an explanation of how a SAS token is generated and saved, and how to paste the path in. Alternatively, we could take an encrypted approach that does not persist across sessions?
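The 1-in-1000 sampling step could be sketched as below. This is an illustration under assumptions: plain rows stand in for particles read from processed blob-store CSVs, and the function name and seed are hypothetical.

```python
import random

def subsample_particles(rows, rate=1000, seed=0):
    """Keep roughly 1-in-`rate` rows from the concatenated files.

    A sketch of the 1-in-1000 sample the tab draws across many processed
    files to capture environmental variability; a fixed seed makes the
    draw reproducible.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.randrange(rate) == 0]

# 100,000 stand-in particles yield roughly 100 sampled rows:
sample = subsample_particles(range(100_000))
```

Because each row is kept independently with probability 1/rate, the sampled CSV stays representative of the pooled files without ever holding them all in memory at once.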
This tab automates the full pipeline for processing .CYZ files stored in an Azure Blob container.
- Downloads .CYZ files from the blob store.
- Converts them to JSON using cyz2json.
- Extracts listmode parameters and applies the trained model (using R Random Forest).
- Generates 3D plots and uploads results back to the blob store.
- Known issues:
- Requires manual input of source directory, destination directory, and SAS token. Not user-friendly due to SAS token handling.
This utility monitors a local directory for new .CYZ files and automatically processes them.
- Applies the trained model to each new file as it appears.
- Runs the same processing steps as the blob container tab, including 3D plotting.
- Outputs results to a specified destination directory.
- Known issues:
- Not recently tested.
This is the start of a labelling process which I hope could replace the current labelling approach. It has not yet been used on a real sample and requires further development to make it a "validation and correction" session rather than blind labelling.
- Requires a calibration sample and a labelling sample. Calibration sample needs associated "true micron sizes" and assumes you are using 8-peak rainbow beads.
- Calibration sample is clustered into 8 classes with k-means. Mean fluorescent responses and standard deviations are taken for each cluster to be preserved in your labelling dictionary.
- Extracts the labelling file so the user can choose a particle on a cytogram (currently only to 2 fixed axes).
- Labelling session gets saved in a json with associated sample metadata, and another form asking for basic info about the labeller to gauge their experience.
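The calibration step above could be sketched as follows. This is an illustrative sketch under assumptions: the function and field names are hypothetical, and the perfectly separated synthetic responses stand in for a real 8-peak rainbow bead sample.

```python
import numpy as np
from sklearn.cluster import KMeans

def calibrate_beads(responses, true_micron_sizes, n_peaks=8, seed=0):
    """Cluster a bead calibration sample into `n_peaks` classes.

    Records each cluster's mean and standard deviation of fluorescent
    response, then pairs the clusters (ordered dimmest to brightest)
    with the supplied true micron sizes for the labelling dictionary.
    """
    km = KMeans(n_clusters=n_peaks, n_init=10, random_state=seed).fit(responses)
    clusters = []
    for k in range(n_peaks):
        members = responses[km.labels_ == k]
        clusters.append({"mean": members.mean(axis=0),
                         "std": members.std(axis=0)})
    clusters.sort(key=lambda c: float(c["mean"].sum()))  # dimmest first
    for cluster, size in zip(clusters, true_micron_sizes):
        cluster["true_micron_size"] = size
    return clusters

# Synthetic stand-in for an 8-peak bead run: 20 particles per peak.
responses = np.repeat(np.arange(0.0, 80.0, 10.0), 20).reshape(-1, 1)
peaks = calibrate_beads(responses, [1, 2, 3, 4, 5, 6, 7, 10])
```

Sorting clusters by brightness before attaching sizes assumes larger beads fluoresce more strongly, which is what makes the pairing with "true micron sizes" unambiguous.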
Lucinda Lanoy for her Masters work on model training (custom_functions_for_python.py, https://github.com/CefasRepRes/lucinda-flow-cytometry).
Sebastien Galvagno, Eric Payne and Rob Blackwell for their parts played in cyz2json (flowcytometertool uses https://github.com/OBAMANEXT/cyz2json/releases/tag/v0.0.5)
OBAMA-NEXT Data labellers Veronique, Lumi, Zeline, Lotty and Clementine.
- Lotty = EXP1, "expert" level on Mediterranean data, considered "non expert" for the other zones
- Clementine = EXP2, "advanced" level on Mediterranean data, considered "non expert" for the other zones
- Lumi = EXP3, "expert" level on Baltic data, considered "non expert" for the other zones
- Zeline = EXP4, "expert" level on English Channel data, considered "non expert" for the other zones
- Veronique = EXP5, "expert" level on Celtic data, considered "non expert" for the other zones
- Joe = EXP6, "non expert" for all zones