Framework for exporting data #725

mohamedelabbas1996 · 2025-02-17T07:38:16Z

Summary

This PR implements a framework for defining different export formats and exporting data in the background. It also defines the API for creating & retrieving the history of a project's exports. Users can trigger exports via the API or the Admin Page with optional filtering parameters and track job progress asynchronously.

Related Issues

Closes #720

Detailed Description

This PR enables users to export filtered Occurrence data in CSV or JSON format via the API. The export process runs asynchronously in the background.

Triggering an Export

Users can initiate an export by making a POST request to:

POST http://localhost:8000/api/v2/exports/

Request Body

The request body specifies the project, the export format, and optional filters. Available fields:

project → ID of the project whose occurrences will be exported.
format → Export format, one of:
occurrences_simple_csv → Exports simplified tabular data in CSV format.
occurrences_simple_json → Exports structured JSON matching the API response.
filters → Optional. Currently supports filtering by collection.

Request Example (CSV Export with Collection Filter)

POST http://localhost:8000/api/v2/exports/
Content-Type: application/json


{
    "project": 105,
    "format": "occurrences_simple_csv",
    "filters": {
        "collection": 104
    }
}

Request Example (JSON Export)

POST http://localhost:8000/api/v2/exports/
Content-Type: application/json

{
    "project": 105,
    "format": "occurrences_simple_json",
    "filters": {
        "collection": 104
    }
}
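
As a minimal sketch, the two request examples above could be built from Python with only the standard library. The helper name `build_export_request` is hypothetical and not part of this PR, and the base URL is simply the local one used in the examples. A later commit in this PR renames the collection filter parameter to collection_id, so the filter key may differ in the merged version.

```python
import json
import urllib.request

# Assumed base URL, copied from the examples above; adjust for your deployment.
EXPORTS_URL = "http://localhost:8000/api/v2/exports/"


def build_export_request(project_id, export_format, collection_id=None):
    """Build (but do not send) the POST request that triggers an export."""
    payload = {"project": project_id, "format": export_format}
    if collection_id is not None:
        # Filter key as written in the PR description; it may have been renamed.
        payload["filters"] = {"collection": collection_id}
    return urllib.request.Request(
        EXPORTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_export_request(105, "occurrences_simple_csv", collection_id=104)
# Sending it needs a running server and, most likely, credentials:
# with urllib.request.urlopen(request) as response:
#     export = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before anything reaches the server.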

Checking Export Status

Once an export job is triggered, users can check the job status using:

GET http://localhost:8000/api/v2/exports/<export_id>

Fetching All Exports

To retrieve all exports:

GET http://localhost:8000/api/v2/exports/

Filtering Exports by Project ID

GET http://localhost:8000/api/v2/exports/?project_id=<project_id>

Screenshots

How to Test the Changes

Trigger an export by making a POST request to /api/v2/exports/ with a valid project_id and request body.
Track export progress with a GET request to /api/v2/exports/<export_id>, using the returned export_id.
Once completed, download the exported file from the provided file_url.
Test with different values for format and the collection filter to ensure correct filtering and export behavior.
Verify listing exports using /api/v2/exports/ and filtering with project_id.
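
The test steps above can be sketched as a small polling loop. This is an illustration under assumptions, not code from the PR: `wait_for_export` and `fetch_export` are hypothetical names, and the loop assumes the export detail response only carries a non-empty file_url once the export has finished.

```python
import time


def wait_for_export(fetch_export, export_id, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll an export until it exposes a file_url, then return the export record.

    fetch_export is any callable that takes an export ID and returns the parsed
    JSON of GET /api/v2/exports/<export_id> (assumed to be a dict).
    """
    for attempt in range(max_polls):
        export = fetch_export(export_id)
        if export.get("file_url"):
            return export
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"export {export_id} had no file_url after {max_polls} polls")


# Demo with a fake fetcher standing in for the real HTTP call.
_responses = iter(
    [
        {"id": 7, "file_url": None},
        {"id": 7, "file_url": "https://example.org/export.csv"},
    ]
)
finished = wait_for_export(lambda _export_id: next(_responses), 7, sleep=lambda _seconds: None)
print(finished["file_url"])
```

Passing the fetcher and the sleep function in as arguments keeps the loop testable without a server.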

netlify · 2025-02-17T07:38:30Z

✅ Deploy Preview for antenna-preview ready!

Name	Link
🔨 Latest commit	`6a50eed`
🔍 Latest deploy log	https://app.netlify.com/sites/antenna-preview/deploys/67f590c16508960008348644
😎 Deploy Preview	https://deploy-preview-725--antenna-preview.netlify.app
Lighthouse	1 path audited. Performance: 71 (🔴 down 9 from production); Accessibility: 89, Best Practices: 92, SEO: 100, PWA: 80 (no change from production)


annavik · 2025-02-17T11:35:25Z

Woho! Sounds very promising, looking forward to testing this out later 🤩 Some early thoughts on the user flow and related backend extensions:

I think it would make sense if the backend could list all export tasks for a specific project. Users may leave the app and come back later, so it would be helpful to have a way to know about ongoing export tasks. This would also make it possible to share export results in a team.
If we should make it possible to list tasks, then we should probably also make it possible to delete and update them. Similar to jobs? I suggest we handle this in a follow up PR though and keep the first version readonly.
Do you think we could get some progress info in the tasks response, not just the status? It would be nice to present some progress bar or similar for users... It doesn't have to be exact, but some approximate progress between 0-1?
From my experience, users are often interested in joining results from multiple or all pages, is this export feature providing that option? Sorry, I couldn't tell from quickly checking the code changes.
I see the code changes include functionality for sending emails when a task is done? For me, that is more of a "nice to have" feature. I think our focus should be to first integrate this on the web platform!

mohamedelabbas1996 · 2025-02-18T00:03:53Z

Do you think we could get some progress info in the tasks response, not just the status? It would be nice to present some progress bar or similar for users... It doesn't have to be exact, but some approximate progress between 0-1?

I think it's possible, but I'm not sure how to implement it. Maybe write the progress value in memory in Redis? @mihow, what do you think?

mohamedelabbas1996 · 2025-02-18T00:07:18Z

I see the code changes include functionality for sending emails when a task is done? For me, that is more of a "nice to have" feature. I think our focus should be to first integrate this on the web platform!

For the email feature, I think we can add it since it is just a small change and we already have email service integration (not working locally for me, though).

mohamedelabbas1996 · 2025-02-18T00:09:08Z

I think it would make sense if the backend could list all export tasks for a specific project. Users may leave the app and come back later, so it would be helpful to have a way to know about ongoing export tasks. This would also make it possible to share export results in a team.

If we should make it possible to list tasks, then we should probably also make it possible to delete and update them. Similar to jobs? I suggest we handle this in a follow up PR though and keep the first version readonly.

Yes, I think it would be easy to add CRUD for export tasks since we already have the ExportHistory model in place.

mohamedelabbas1996 · 2025-02-18T00:37:57Z

From my experience, users are often interested in joining results from multiple or all pages, is this export feature providing that option? Sorry, I couldn't tell from quickly checking the code changes.

Currently, pagination is disabled for the export action, but we can bring it back to let users export data for a selected limit and offset.

mihow · 2025-02-21T18:54:56Z

@mohamedelabbas1996 the implementation for non-ML job types isn't super clear right now, but it is flexible! I am curious whether you can create a new job type for exports, so we can use that to view status, logs & errors. The download link can even be shown in the logs for the time being. See the Data Sync Job type and Collection Populate job types as examples.

mihow · 2025-02-21T19:09:17Z

The main formats for now should be

a JSON export with nested data that mostly matches the current API list views
simplified & flattened CSV of each data type (see [Draft] Framework for exporting data (with initial data formats) #634)
a Darwin Core Archive zip of flattened TSVs with data about Occurrences, Sessions and Taxa

This PR can offer the simplest option. It should focus on a scalable background task & API endpoints.

mihow · 2025-02-21T19:11:15Z

ami/main/api/views.py

    def list(self, request, *args, **kwargs):
        return super().list(request, *args, **kwargs)

+    def paginate_queryset(self, queryset):


It may be more scalable to keep pagination but automatically loop through all the pages, rather than triggering a single huge database query. Or is there another way to break it apart? I can give you a large DB snapshot to test on.

I would very much appreciate access to the DB snapshot.
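
A minimal sketch of the "keep pagination, but loop through all the pages" idea from this thread, assuming a limit/offset style of paging. `iter_in_batches` and `fetch_page` are hypothetical names and not the PR's implementation.

```python
def iter_in_batches(fetch_page, batch_size=500):
    """Yield records one at a time while fetching them in limit/offset batches.

    fetch_page(limit, offset) is a hypothetical callable returning a list of
    records. With an ordered Django queryset it could be, for example,
    lambda limit, offset: list(queryset[offset:offset + limit]).
    """
    offset = 0
    while True:
        batch = fetch_page(batch_size, offset)
        if not batch:
            return
        yield from batch
        if len(batch) < batch_size:
            # A short batch means the last page has been reached.
            return
        offset += len(batch)


# Demo against an in-memory list standing in for the database.
records = list(range(1234))
exported = list(iter_in_batches(lambda limit, offset: records[offset:offset + limit]))
print(len(exported))
```

Offset-based batching only gives stable results when the underlying query has a deterministic ordering, so an explicit order (for example by primary key) is assumed here.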

mihow · 2025-03-03T05:40:54Z

@mohamedelabbas1996 Will you make the first export format be JSON from the API list view for occurrences? That's what people are currently exporting (@annavik @rhine3). They are using external Python scripts to loop over the paginated API views.

mohamedelabbas1996 · 2025-03-05T18:45:28Z

Occurrence Export Timing Results

The following table presents the export times for occurrences per project.
I tested these using a live database snapshot on my local machine.

project_id	Occurrence count	Export time
47	7	0 s
39	14	0 s
38	44	1 s
1	146	2 s
67	236	3 s
49	628	5 s
105	914	9 s
16	2249	17 s
45	2708	24 s
23	3109	1 m
4	3188	1 m
44	4464	45 s
46	36881	5 m
24	45070	9 m
18	51640	7 m
84	70243	20 m
20	108133	21 m
85	179468	1 h 32 m

- Moved export logic to run_export() for better encapsulation.
- Added file_size and record_count fields to DataExport for tracking export statistics.
- Added unit tests to ensure the number of exported records matches the number of occurrences in the collection for both CSV and JSON formats.


mihow · 2025-03-28T20:20:46Z

ami/exports/models.py

+        from ami.exports.registry import ExportRegistry
+
+        export_format = self.format
+        export_class = ExportRegistry.get_exporter(export_format)


I like this convention when getting a class dynamically

ExportClass = ExportRegistry.get_exporter(export_format)
exporter = ExportClass()
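
For context, a registry that supports this convention could look roughly like the sketch below. Only the name `ExportRegistry.get_exporter` comes from the diff above; the `register` decorator, `get_supported_formats`, and the two exporter classes are hypothetical and shown only to illustrate the pattern.

```python
class ExportRegistry:
    """Maps format keys to exporter classes (illustrative, not the PR's code)."""

    _exporters = {}

    @classmethod
    def register(cls, format_key):
        """Class decorator that records an exporter under the given format key."""

        def decorator(exporter_class):
            cls._exporters[format_key] = exporter_class
            return exporter_class

        return decorator

    @classmethod
    def get_exporter(cls, format_key):
        """Return the exporter class (not an instance) for a format key."""
        try:
            return cls._exporters[format_key]
        except KeyError:
            raise ValueError(f"Unknown export format: {format_key!r}") from None

    @classmethod
    def get_supported_formats(cls):
        return sorted(cls._exporters)


@ExportRegistry.register("occurrences_simple_csv")
class SimpleCSVExporter:
    file_extension = "csv"


@ExportRegistry.register("occurrences_simple_json")
class SimpleJSONExporter:
    file_extension = "json"


# Capitalized name for the class, lowercase name for the instance.
ExportClass = ExportRegistry.get_exporter("occurrences_simple_csv")
exporter = ExportClass()
```

The capitalized variable name signals that the lookup returns a class, which makes the later instantiation step explicit.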

mihow · 2025-03-28T20:34:32Z

ami/exports/models.py

+    file_size = models.PositiveBigIntegerField(default=0)
+
+    @cached_property
+    def filters_display(self):


It is preferable that we save these display filters in a field on the model itself, so we only calculate them once and avoid all N+1 queries when fetching the list of DataExports.

Change filters_display to a field, and rename the function to generate this to get_filters_display(). Then call that in the save() method.

See:
https://github.com/RolnickLab/antenna/blob/d0e0f382d90e6e270002a96c3831593b44c1d67c/ami/main/models.py#L1528C1-L1543C53

Hi @mohamedelabbas1996, I talked with Anna about this today. She tried displaying the extra detail in the UI and it ended up looking cluttered. We agreed that it's okay to revisit this later! Let's keep the filters schema with nested attributes: filters_display: {"collection": {"id": 1}}. But I think it can be generated on the fly as you have it. Just remove the extra queries and only use the ID from the related objects.
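
One way to read "generate it on the fly, using only the ID" is a pure function over the stored filters, sketched below. The function name matches the `get_filters_display()` suggested above, but the body is an assumption: it presumes the raw filters look like {"collection_id": 1}, which is not confirmed by the diff shown here.

```python
def get_filters_display(filters):
    """Build the nested display schema from raw filter IDs, with no DB queries.

    Assumes raw filters look like {"collection_id": 1}. Keys that do not end
    in "_id" are passed through unchanged.
    """
    display = {}
    for key, value in (filters or {}).items():
        if key.endswith("_id"):
            # "collection_id": 1  ->  "collection": {"id": 1}
            display[key[: -len("_id")]] = {"id": value}
        else:
            display[key] = value
    return display


print(get_filters_display({"collection_id": 1}))
```

Because nothing is looked up, listing many DataExports costs no extra queries. Richer details such as a collection name would need a lookup and are left out here.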

mihow · 2025-03-28T20:52:46Z

ami/exports/serializers.py

+    def get_file_url(self, obj):
+        return obj.get_absolute_url(request=self.context.get("request"))
+
+    def get_file_size(self, obj):


This is nice! As a pattern here, I usually return the raw value as file_size and the display value as file_size_display. Then you can still sort by the raw value.

There's also a Django util you should compare to see if it's good enough. We use it in a couple places
from django.template.defaultfilters import filesizeformat
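
The raw-plus-display pattern can be sketched without Django as follows. `format_file_size` is a hypothetical, simplified stand-in written for this illustration; inside the project the display value would presumably come from Django's `filesizeformat` instead, and its exact output format may differ.

```python
def format_file_size(num_bytes):
    """Simplified stand-in for a human-readable size, using 1024-based units."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.1f} {unit}"
        size /= 1024


def serialize_export(export):
    """Return both the sortable raw value and the readable display value."""
    raw = export["file_size"]
    return {
        "id": export["id"],
        "file_size": raw,
        "file_size_display": format_file_size(raw),
    }


exports = [{"id": 1, "file_size": 5 * 1024 * 1024}, {"id": 2, "file_size": 1536}]
# Sorting uses the raw integer; the display string is only for presentation.
rows = sorted((serialize_export(e) for e in exports), key=lambda row: row["file_size"])
print(rows)
```

Sorting on the display string would order "1.5 KB" and "5.0 MB" lexically rather than by size, which is why the raw value stays in the response.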

mihow · 2025-03-28T21:17:08Z

ami/exports/base.py

+        )
+        if self.job:
+            self.job.progress.add_stage_param(
+                self.job.job_type_key, "Number of records exported", f"{self.queryset.count()}"


Rename this to "Total records to export"
Add another stage parameter: "Number of records exported" that updates with the progress bar.
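
The two stage parameters requested above could be fed by a batch loop like the sketch below. It is an assumption-laden illustration: `export_with_progress`, `write_batch`, and `on_progress` are hypothetical names, and the job's real progress API is not used here beyond the idea of reporting a total and a running count.

```python
def export_with_progress(records, write_batch, on_progress, batch_size=100):
    """Write records in batches and report progress after every batch.

    on_progress receives the running count, the total, and a fraction in 0-1.
    Counts are passed as numbers, not strings, so a client can format them.
    """
    total = len(records)
    # Report the total ("Total records to export") before the first batch.
    on_progress(exported=0, total=total, fraction=0.0 if total else 1.0)
    exported = 0
    for start in range(0, total, batch_size):
        batch = records[start:start + batch_size]
        write_batch(batch)
        exported += len(batch)
        # Update the running count ("Number of records exported").
        on_progress(exported=exported, total=total, fraction=exported / total)
    return exported


updates = []
written = []
export_with_progress(
    list(range(250)),
    write_batch=written.extend,
    on_progress=lambda **state: updates.append(state),
)
print([u["exported"] for u in updates])
```

A smaller batch size gives more frequent updates at the cost of more round trips, which matches the later commit that lowers the batch size to increase update frequency.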

- Added 'Number of records exported' as a stage param to track the number of records during export.
- Introduced filters_display field in the DataExport model to precompute and optimize display-friendly filters, reducing unnecessary queries.
- Returned the raw file_size value in the API response to enable sorting, and used Django's filesizeformat to provide a more readable file size format.

annavik · 2025-04-01T15:20:52Z

Small detail, for stage values that represent numbers it's better if we return them as numbers, not strings. If we do so, the web app will automatically format them based on browser locale. So here, the value would be displayed as "70,243" for me instead of "70243".

mihow

I'm trying to update our existing filters in the preview env so that I can re-save them and generate filters_display. However the admin is looking for an old property.

AttributeError at /admin/exports/dataexport/
'DataExport' object has no attribute 'status'

Since we are creating the DataExport model for the first time in this PR, I don't mind migrating backwards then forwards, but in other cases you can add a migration step that calls save() on all existing DataExport instances.

ami/exports/views.py

mihow · 2025-04-03T00:16:22Z

@mohamedelabbas1996 I recreated all the migrations, we went from 9 migrations to 2!


mihow · 2025-04-08T21:22:51Z

ami/exports/serializers.py

            "updated_at",
        ]

+    def validate_format(self, value):


very clean! nice work

mihow

I'm closing this epic journey! Well done @mohamedelabbas1996

mohamedelabbas1996 added 3 commits February 17, 2025 02:22

feat: added celery export occurrence task

45c1d77

feat: added export & export_status endpoints

f6871ea

added migration files

b3e448d

mohamedelabbas1996 linked an issue Feb 17, 2025 that may be closed by this pull request

Exporting: Support for trigger background exports #720

Closed

mohamedelabbas1996 requested a review from mihow February 17, 2025 08:13

mohamedelabbas1996 self-assigned this Feb 17, 2025

fixed migration conflict

bb745f6

mohamedelabbas1996 marked this pull request as draft February 17, 2025 08:23

mohamedelabbas1996 changed the title ~~Support for Occurrence Data Exports~~ [Draft] Support for Occurrence Data Exports Feb 17, 2025

fix: disabled pagination for export action

518b8df

mohamedelabbas1996 added 2 commits February 17, 2025 19:38

Merge branch 'main' into feat/export-occurrences-data

b3b4369

fix: merged migrations

8d98759

mohamedelabbas1996 linked an issue Feb 18, 2025 that may be closed by this pull request

Exporting: Occurrences to Darwin Core Archive format (w/Camtrap DP) #298

Open

mihow reviewed Feb 21, 2025


mohamedelabbas1996 added 2 commits February 23, 2025 17:49

Merge branch 'main' into feat/export-occurrences-data

21470b9

feat: added DataExport Job Type

a8673af

mohamedelabbas1996 removed a link to an issue Feb 25, 2025

Exporting: Occurrences to Darwin Core Archive format (w/Camtrap DP) #298

Open

mohamedelabbas1996 added 3 commits March 4, 2025 09:34

Implemented JSON export for occurrence data

523d177

Merge branch 'main' into feat/export-occurrences-data

ac7cfbc

feat: Added support for csv file format

04ab2cf

mohamedelabbas1996 added 3 commits March 28, 2025 11:02

Merge branch 'main' into feat/export-occurrences-data

2478789

Merge branch 'feat/export-occurrences-data' of https://github.com/RolnickLab/antenna into feat/export-occurrences-data

4bae6c7

mihow reviewed Mar 28, 2025


mihow requested changes Apr 1, 2025


mihow added 4 commits April 1, 2025 15:24

fix: make summary count consistent with exports

eded961

feat: update and return total record count before starting export

02dd4b7

feat: update total record count before exporting first batch

058f93e

feat: lower batch size for exports to increase update frequency

b20a851

mihow reviewed Apr 2, 2025

ami/exports/views.py (outdated, resolved)

mihow reviewed Apr 2, 2025

ami/exports/views.py (outdated, resolved)

mihow added 2 commits April 2, 2025 17:12

chore: reset all migrations to main

a518a74

chore: recreate migrations

0b06579

mihow mentioned this pull request Apr 3, 2025

[Draft] Framework for exporting data (with initial data formats) #634

Closed

mihow changed the title ~~Support for Occurrence Data Exports~~ Framework for exporting data Apr 3, 2025

mohamedelabbas1996 and others added 5 commits April 4, 2025 09:43

chore: moved export format validation logic to the serializer

ee34d2c

chore: changed collection filter param name to collection_id

0900bb0

Merge branch 'feat/export-occurrences-data' of https://github.com/RolnickLab/antenna into feat/export-occurrences-data

a1eb605

Merge branch 'main' of github.com:RolnickLab/antenna into feat/export-occurrences-data

faeb081

chore: fix type hints

6a50eed

mihow reviewed Apr 8, 2025


mihow approved these changes Apr 8, 2025


mihow merged commit d664184 into main Apr 8, 2025
6 checks passed

mihow deleted the feat/export-occurrences-data branch April 8, 2025 21:24
