@mohamedelabbas1996
Contributor

@mohamedelabbas1996 mohamedelabbas1996 commented Feb 17, 2025

Summary

This PR implements a framework for defining different export formats and exporting data in the background. It also defines the API for creating & retrieving the history of a project's exports. Users can trigger exports via the API or the Admin Page with optional filtering parameters and track job progress asynchronously.

Related Issues

Closes #720

Detailed Description

This PR enables users to export filtered Occurrence data in CSV or JSON format via the API. The export process runs asynchronously in the background.

Triggering an Export

Users can initiate an export by making a POST request to:

POST http://localhost:8000/api/v2/exports/

Request Body

The request body specifies the project, the export format, and optional filters. Fields:

  • project → ID of the project whose occurrences are exported.
  • format → Export format, either occurrences_simple_csv (simplified tabular data in CSV format) or occurrences_simple_json (structured JSON matching the API response).
  • filters → Optional. Currently supports filtering by collection.

Request Example (CSV Export with Collection Filter)

POST http://localhost:8000/api/v2/exports/
Content-Type: application/json


{
    "project": 105,
    "format": "occurrences_simple_csv",
    "filters": {
        "collection": 104
    }
}

Request Example (JSON Export)

POST http://localhost:8000/api/v2/exports/
Content-Type: application/json

{
    "project": 105,
    "format": "occurrences_simple_json",
    "filters": {
        "collection": 104
    }
}
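
For reference, the two request examples above could be built from a script roughly like this. This is a minimal sketch using only the Python standard library: the helper name `build_export_request` is made up for illustration, the base URL is the local one from the examples, and authentication (which the API most likely requires) is not shown.

```python
import json
import urllib.request

# Base URL taken from the examples above; adjust for your deployment.
EXPORTS_URL = "http://localhost:8000/api/v2/exports/"


def build_export_request(project_id, export_format, collection_id=None):
    """Build (but do not send) a POST request that triggers an export.

    Hypothetical helper for illustration; not part of this PR.
    """
    payload = {"project": project_id, "format": export_format}
    if collection_id is not None:
        # The collection filter is optional.
        payload["filters"] = {"collection": collection_id}
    return urllib.request.Request(
        EXPORTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_export_request(105, "occurrences_simple_csv", collection_id=104)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it (credentials omitted here):
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the script at a real server.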

Checking Export Status

Once an export job is triggered, users can check the job status using:

GET http://localhost:8000/api/v2/exports/<export_id>

Fetching All Exports

To retrieve all exports:

GET http://localhost:8000/api/v2/exports/

Filtering Exports by Project ID

GET http://localhost:8000/api/v2/exports/?project_id=<project_id>

How to Test the Changes

  1. Trigger an export by making a POST request to /api/v2/exports/ with a request body that contains a valid project ID.
  2. Track export progress using the GET request to /api/v2/exports/<export_id> with the returned export_id.
  3. Once completed, download the exported file from the provided file_url.
  4. Test with different values for format and the collection filter to ensure correct filtering and export behavior.
  5. Verify listing exports using /api/v2/exports/ and filtering with project_id.
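
Steps 2 and 3 can be scripted as a small polling loop. The sketch below is an assumption-laden illustration, not part of the PR: `fetch_export` is a hypothetical callable wrapping GET /api/v2/exports/<export_id> that returns the decoded JSON as a dict, and the loop assumes `file_url` stays empty or missing until the file is ready.

```python
import time


def wait_for_export(fetch_export, export_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll an export until it reports a file URL or the timeout elapses.

    fetch_export: hypothetical callable wrapping GET /api/v2/exports/<export_id>
                  that returns the decoded JSON body as a dict.
    """
    waited = 0.0
    while True:
        export = fetch_export(export_id)
        # Assumption: `file_url` is empty or missing until the file is ready.
        if export.get("file_url"):
            return export
        if waited >= timeout:
            raise TimeoutError(f"export {export_id} not ready after {timeout} s")
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    # Fake responses stand in for the real API so the sketch runs offline.
    responses = iter([
        {"id": 1, "file_url": None},
        {"id": 1, "file_url": "http://example.org/export.csv"},
    ])
    result = wait_for_export(lambda _id: next(responses), 1, sleep=lambda s: None)
    print(result["file_url"])
```

A real client would also want to stop early when the job fails; how a failure is reported is not shown in this description, so that branch is left out.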

@mohamedelabbas1996 mohamedelabbas1996 linked an issue Feb 17, 2025 that may be closed by this pull request
@netlify

netlify bot commented Feb 17, 2025

Deploy Preview for antenna-preview ready!

Name Link
🔨 Latest commit 6a50eed
🔍 Latest deploy log https://app.netlify.com/sites/antenna-preview/deploys/67f590c16508960008348644
😎 Deploy Preview https://deploy-preview-725--antenna-preview.netlify.app
Lighthouse
1 paths audited
Performance: 71 (🔴 down 9 from production)
Accessibility: 89 (no change from production)
Best Practices: 92 (no change from production)
SEO: 100 (no change from production)
PWA: 80 (no change from production)

@mohamedelabbas1996 mohamedelabbas1996 marked this pull request as draft February 17, 2025 08:23
@mohamedelabbas1996 mohamedelabbas1996 changed the title Support for Occurrence Data Exports [Draft] Support for Occurrence Data Exports Feb 17, 2025
@annavik
Member

annavik commented Feb 17, 2025

Woho! Sounds very promising, looking forward to testing this out later 🤩 Some early thoughts on the user flow and related backend extensions:

  • I think it would make sense if the backend could list all export tasks for a specific project. I'm thinking maybe users will leave the app to come back later, then it would be helpful if there was a way to know about both ongoing and completed export tasks. This would also make it possible to share export results in a team.
  • If we should make it possible to list tasks, then we should probably also make it possible to delete and update them. Similar to jobs? I suggest we handle this in a follow up PR though and keep the first version readonly.
  • Do you think we could get some progress info in the tasks response, not just the status? It would be nice to present some progress bar or similar for users... It doesn't have to be exact, but some approximate progress between 0-1?
  • From my experience, users are often interested in joining results from multiple or all pages, is this export feature providing that option? Sorry, I couldn't tell from quickly checking the code changes.
  • I see the code changes include functionality for sending emails when a task is done? For me, that is more of a "nice to have" feature. I think our focus should be to first integrate this on the web platform!

@mohamedelabbas1996
Contributor Author

mohamedelabbas1996 commented Feb 18, 2025

  • Do you think we could get some progress info in the tasks response, not just the status? It would be nice to present some progress bar or similar for users... It doesn't have to be exact, but some approximate progress between 0-1?

I think it's possible, but I'm not sure how to implement it. Maybe write the progress value in memory in Redis? @mihow, what do you think?

@mohamedelabbas1996
Contributor Author

mohamedelabbas1996 commented Feb 18, 2025

  • I see the code changes include functionality for sending emails when a task is done? For me, that is more of a "nice to have" feature. I think our focus should be to first integrate this on the web platform!

For the email feature, I think we can add it, since it is just a small change and we already have an email service integration (though it is not working locally for me).

@mohamedelabbas1996
Contributor Author

mohamedelabbas1996 commented Feb 18, 2025

  • I think it would make sense if the backend could list all export tasks for a specific project. I'm thinking maybe users will leave the app to come back later, then it would be helpful if there was a way to know about both ongoing and completed export tasks. This would also make it possible to share export results in a team.

  • If we should make it possible to list tasks, then we should probably also make it possible to delete and update them. Similar to jobs? I suggest we handle this in a follow up PR though and keep the first version readonly.

Yes, I think it would be easy to add CRUD for export tasks, since we already have the ExportHistory model in place.

@mohamedelabbas1996
Contributor Author

mohamedelabbas1996 commented Feb 18, 2025

  • From my experience, users are often interested in joining results from multiple or all pages, is this export feature providing that option? Sorry, I couldn't tell from quickly checking the code changes.

Currently, pagination is disabled for the export action, but we can bring it back to let users get data for a selected limit and offset.

@mihow
Collaborator

mihow commented Feb 21, 2025

@mohamedelabbas1996 the implementation for non-ML job types isn't super clear right now, but it is flexible! I am curious whether you can create a new job type for exports so we can use that to view status, logs & errors. The download link can even be shown in the logs for the time being. See the Data Sync and Collection Populate job types as examples.

@mihow
Collaborator

mihow commented Feb 21, 2025

The main formats for now should be

  1. a JSON export with nested data that mostly matches the current API list views
  2. simplified & flattened CSV of each data type (see [Draft] Framework for exporting data (with initial data formats) #634)
  3. a Darwin Core Archive zip of flattened TSVs with data about Occurrences, Sessions and Taxa

This PR can offer the simplest option. It should focus on a scalable background task & API endpoints.

    def list(self, request, *args, **kwargs):
        return super().list(request, *args, **kwargs)

    def paginate_queryset(self, queryset):
Collaborator

It may be more scalable to keep pagination but automatically loop through all the pages, rather than triggering a single huge database query. Or is there another way to break it apart? I can give you a large DB snapshot to test on.
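
A rough sketch of the "loop through all the pages" idea, kept independent of Django so it runs on its own. `fetch_batch(offset, limit)` is a hypothetical callable; in the view it could slice an ordered queryset, for example `list(queryset[offset:offset + limit])`. This is one possible shape, not the approach the PR settled on.

```python
from typing import Callable, Iterator, Sequence


def iter_in_batches(
    fetch_batch: Callable[[int, int], Sequence[dict]],
    batch_size: int = 1000,
) -> Iterator[dict]:
    """Yield records batch by batch instead of loading everything at once.

    fetch_batch(offset, limit) is a hypothetical callable supplied by the
    caller; each call should return at most `limit` records.
    """
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break
        yield from batch
        if len(batch) < batch_size:
            break  # a short batch means this was the last page
        offset += batch_size


if __name__ == "__main__":
    # An in-memory list stands in for the database table.
    records = [{"id": i} for i in range(2500)]
    exported = list(iter_in_batches(lambda o, n: records[o:o + n], batch_size=1000))
    print(len(exported))
```

Offset-based slicing needs a stable ordering to avoid skipped or duplicated rows, and it gets slower for large offsets; paging by primary key is a common alternative.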

Contributor Author

@mohamedelabbas1996 mohamedelabbas1996 Feb 21, 2025

I would very much appreciate access to the DB snapshot.

@mihow
Collaborator

mihow commented Mar 3, 2025

@mohamedelabbas1996 Will you make the first export format JSON from the API list view for occurrences? That's what people are currently exporting (@annavik @rhine3). They are using external Python scripts to loop over the paginated API views.

@mohamedelabbas1996
Contributor Author

mohamedelabbas1996 commented Mar 5, 2025

Occurrence Export Timing Results

The following table presents the export times for occurrences per project.
I tested these using a live database snapshot on my local machine.

project_id   Occurrence count   Export time
47           7                  0 s
39           14                 0 s
38           44                 1 s
1            146                2 s
67           236                3 s
49           628                5 s
105          914                9 s
16           2249               17 s
45           2708               24 s
23           3109               1 m
4            3188               1 m
44           4464               45 s
46           36881              5 m
24           45070              9 m
18           51640              7 m
84           70243              20 m
20           108133             21 m
85           179468             1 h 32 m

- Moved export logic to run_export() for better encapsulation.
- Added file_size and record_count fields to DataExport for tracking export statistics.
- Added unit tests to ensure the number of exported records matches the number of occurrences in the collection for both CSV and JSON formats.

from ami.exports.registry import ExportRegistry

export_format = self.format
export_class = ExportRegistry.get_exporter(export_format)
Collaborator

I like this convention when getting a class dynamically

ExportClass = ExportRegistry.get_exporter(export_format)
exporter = ExportClass()
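
For readers unfamiliar with the pattern, a registry of this kind can be sketched as below. Only the `ExportRegistry.get_exporter(...)` call appears in this PR's diff; the `register` decorator, `get_supported_formats`, and the `CSVExporter` class are hypothetical names for illustration, and the real `ami.exports.registry` may be organised differently.

```python
class ExportRegistry:
    """Maps a format key (e.g. "occurrences_simple_csv") to an exporter class."""

    _exporters = {}

    @classmethod
    def register(cls, format_key):
        """Class decorator that registers an exporter under `format_key`."""
        def decorator(exporter_class):
            if format_key in cls._exporters:
                raise ValueError(f"format {format_key!r} is already registered")
            cls._exporters[format_key] = exporter_class
            return exporter_class
        return decorator

    @classmethod
    def get_exporter(cls, format_key):
        """Return the exporter class for `format_key`, or None if unknown."""
        return cls._exporters.get(format_key)

    @classmethod
    def get_supported_formats(cls):
        """Return the registered format keys in sorted order."""
        return sorted(cls._exporters)


@ExportRegistry.register("occurrences_simple_csv")
class CSVExporter:  # hypothetical exporter, for illustration only
    file_extension = "csv"


# CapWords name signals that the variable holds a class, not an instance.
ExportClass = ExportRegistry.get_exporter("occurrences_simple_csv")
exporter = ExportClass()
print(type(exporter).__name__, exporter.file_extension)
```

New formats then only need a decorated class; the lookup code and any format validation can read from the registry without changes.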

file_size = models.PositiveBigIntegerField(default=0)

@cached_property
def filters_display(self):
Collaborator

It is preferable to save these display filters in a field on the model itself, so we only calculate them once and avoid N+1 queries when fetching the list of DataExports.

Change filters_display to a field, and rename the function to generate this to get_filters_display(). Then call that in the save() method.

See:
https://github.com/RolnickLab/antenna/blob/d0e0f382d90e6e270002a96c3831593b44c1d67c/ami/main/models.py#L1528C1-L1543C53

Collaborator

Hi @mohamedelabbas1996, I talked with Anna about this today. She tried displaying the extra detail in the UI and it ended up looking cluttered. We agreed that it's okay to revisit this later! Let's keep the filters schema with nested attributes, filters_display: {"collection": {"id": 1}}. But I think it can be generated on the fly as you have it. Just remove the extra queries and only use the ID from the related objects.
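
A minimal sketch of generating the nested structure on the fly from IDs only. It assumes the stored filters are a flat dict such as {"collection": 104} where every value is the ID of a related object; the function below is illustrative and not the code in this PR.

```python
def get_filters_display(filters):
    """Build a nested, display-oriented view of export filters using IDs only.

    `filters` is assumed to be the stored filter dict, e.g. {"collection": 104}.
    No related objects are loaded, so listing many exports adds no extra queries.
    """
    display = {}
    for key, value in (filters or {}).items():
        display[key] = {"id": value}
    return display


if __name__ == "__main__":
    print(get_filters_display({"collection": 1}))
```

If names or other details are needed later, they can be added under the same nested key without changing the schema's shape.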

    def get_file_url(self, obj):
        return obj.get_absolute_url(request=self.context.get("request"))

    def get_file_size(self, obj):
Collaborator

This is nice! As a pattern here, I usually return the raw value as file_size and the display value as file_size_display. Then you can still sort by the raw value.

There's also a Django util you should compare against to see if it's good enough. We use it in a couple of places:
from django.template.defaultfilters import filesizeformat
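
A small sketch of the raw-plus-display pattern. `format_file_size` is a hypothetical stand-in written for this example so it runs without Django; in the project, Django's `filesizeformat` would take its place, and its exact output strings may differ. `serialize_export` is likewise an illustrative name, not the serializer in this PR.

```python
def format_file_size(num_bytes):
    """Stand-in for a human-readable size formatter (binary units).

    Hypothetical helper; Django's filesizeformat would be used in the project.
    """
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            break
        size /= 1024
    if unit == "bytes":
        return f"{int(size)} bytes"
    return f"{size:.1f} {unit}"


def serialize_export(file_size):
    """Return both the raw value (sortable) and a display string."""
    return {
        "file_size": file_size,  # raw integer, safe to sort and filter on
        "file_size_display": format_file_size(file_size),
    }


if __name__ == "__main__":
    rows = [serialize_export(n) for n in (2048, 10, 1536)]
    # Sorting uses the raw value, not the formatted string.
    for row in sorted(rows, key=lambda r: r["file_size"]):
        print(row)
```

Sorting on the display string would order "10 bytes" and "1.5 KB" lexicographically, which is why the raw integer is kept alongside it.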

        )
        if self.job:
            self.job.progress.add_stage_param(
                self.job.job_type_key, "Number of records exported", f"{self.queryset.count()}"
Collaborator

Rename this to "Total records to export"
Add another stage parameter: "Number of records exported" that updates with the progress bar.
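
The two counters could be maintained roughly as sketched below. `StageParams` is a hypothetical in-memory stand-in for the job's progress object (the real `job.progress` API is not reproduced here), and `export_with_progress` is an illustrative name. Values are stored as numbers rather than strings so a client can format them.

```python
class StageParams:
    """Minimal in-memory stand-in for a job's stage parameters (hypothetical)."""

    def __init__(self):
        self.values = {}

    def set(self, name, value):
        self.values[name] = value


def export_with_progress(records, write_record, params, update_every=100):
    """Write records one by one, keeping two counters up to date."""
    total = len(records)
    params.set("Total records to export", total)  # set once, before the loop
    params.set("Number of records exported", 0)
    for exported, record in enumerate(records, start=1):
        write_record(record)
        # Throttle updates so progress tracking does not dominate the export.
        if exported % update_every == 0 or exported == total:
            params.set("Number of records exported", exported)
            params.set("progress", exported / total)
    return total


if __name__ == "__main__":
    params = StageParams()
    written = []
    export_with_progress(list(range(250)), written.append, params, update_every=100)
    print(params.values)
```

Updating every N records instead of on every record keeps the number of progress writes small for large exports.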

- Added 'Number of records exported' as a stage param to track the number of records during export.
- Introduced filters_display field in the DataExport model to precompute and optimize display-friendly filters, reducing unnecessary queries.
- Returned the raw file_size value in the API response to enable sorting, and used Django's filesizeformat to provide a more readable file size format.
@annavik
Member

annavik commented Apr 1, 2025

Small detail, for stage values that represent numbers it's better if we return them as numbers, not strings. If we do so, the web app will automatically format them based on browser locale. So here, the value would be displayed as "70,243" for me instead of "70243".

Collaborator

@mihow mihow left a comment

I'm trying to update our existing filters in the preview env so that I can re-save them and generate filters_display. However the admin is looking for an old property.

AttributeError at /admin/exports/dataexport/
'DataExport' object has no attribute 'status'

Since we are creating the DataExport model for the first time in this PR, I don't mind migrating backwards and then forwards, but in other cases you can add a migration step that calls save() on all existing DataExport instances.

@mihow
Collaborator

mihow commented Apr 3, 2025

@mohamedelabbas1996 I recreated all the migrations, we went from 9 migrations to 2!

@mihow mihow changed the title Support for Occurrence Data Exports Framework for exporting data Apr 3, 2025
"updated_at",
]

def validate_format(self, value):
Collaborator

very clean! nice work

Collaborator

@mihow mihow left a comment

I'm closing this epic journey! Well done @mohamedelabbas1996

@mihow mihow merged commit d664184 into main Apr 8, 2025
6 checks passed
@mihow mihow deleted the feat/export-occurrences-data branch April 8, 2025 21:24
Development

Successfully merging this pull request may close these issues.

Exporting: Support for trigger background exports

4 participants