Framework for exporting data #725
Woho! Sounds very promising, looking forward to testing this out later 🤩 Some early thoughts on the user flow and related backend extensions:
I think it's possible, but I'm not sure how to implement it. Maybe write the progress value in memory on Redis? @mihow what do you think?
For the email feature, I think we can add it, as it is just a small change and we already have email service integration (not working locally for me, though).
Yes, I think it would be easy to add export tasks CRUD since we already have the
Currently pagination is disabled for the export action, but we can bring it back to allow users to get data for selected
@mohamedelabbas1996 the implementation for non-ML job types isn't super clear right now, but it is flexible! I am curious if you can create a new job type for exports, so we can use that to view status, logs & errors. The download link can even be shown in the logs for the time being. See the Data Sync and Collection Populate job types as examples.
The main formats for now should be
This PR can offer the simplest option. It should focus on a scalable background task & API endpoints.
ami/main/api/views.py
Outdated
def list(self, request, *args, **kwargs):
    return super().list(request, *args, **kwargs)

def paginate_queryset(self, queryset):
It may be more scalable to keep pagination but automatically loop through all the pages, rather than triggering a single huge database query. Or is there another way to break it apart? I can give you a large DB snapshot to test on.
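A minimal sketch of the "loop through all the pages" idea, written against any sliceable sequence so it runs without Django. The helper name `iter_in_pages` and the default `page_size` are hypothetical and not taken from this PR. With a Django QuerySet, each slice would translate into a separate LIMIT/OFFSET query, so the queryset would need a stable ordering for the pages to be consistent.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_in_pages(items: Sequence[T], page_size: int = 500) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `page_size` long.

    `iter_in_pages` is a hypothetical helper for illustration only.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        # Only one page worth of records is materialized at a time.
        page = items[offset : offset + page_size]
        if len(page) == 0:
            break
        yield page
        offset += page_size
```

An exporter could then write each page to the output file before requesting the next one, which keeps memory use bounded by the page size. Offset-based slicing tends to slow down on very deep pages, so this is one way to break the query apart rather than the only one.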
I would appreciate it very much if I could have access to the DB snapshot.
@mohamedelabbas1996 Will you make the first export format JSON from the API list view for occurrences? That's what people are currently exporting (@annavik @rhine3). They are using external Python scripts to loop over the paginated API views.
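For context, the kind of external script described above might look roughly like the sketch below. It assumes the list view returns a paginated body with `results` and `next` keys, which is the common Django REST Framework shape but is not confirmed in this thread. `iter_all_results` and `fetch_json` are hypothetical names; `fetch_json` stands in for whatever HTTP client the script uses.

```python
from typing import Callable, Iterator, Optional


def iter_all_results(first_url: str, fetch_json: Callable[[str], dict]) -> Iterator[dict]:
    """Follow `next` links until the last page, yielding every result.

    Assumes each page looks like {"results": [...], "next": <url or None>}.
    """
    url: Optional[str] = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("results", [])
        # A missing or null `next` ends the loop.
        url = page.get("next")
```

A background export replaces this client-side loop with a single request that produces one downloadable file.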
Occurrence Export Timing Results
The following table presents the export times for occurrences per project.
- Moved export logic to run_export() for better encapsulation.
- Added file_size and record_count fields to DataExport for tracking export statistics.
- Added unit tests to ensure the number of exported records matches the number of occurrences in the collection for both CSV and JSON formats.
…nickLab/antenna into feat/export-occurrences-data
ami/exports/models.py
Outdated
from ami.exports.registry import ExportRegistry

export_format = self.format
export_class = ExportRegistry.get_exporter(export_format)
I like this convention when getting a class dynamically
ExportClass = ExportRegistry.get_exporter(export_format)
exporter = ExportClass()
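To illustrate the convention, here is a small self-contained sketch of a registry that returns exporter classes by format name. Only `ExportRegistry.get_exporter` appears in this PR; the `register` decorator, the internal `_exporters` dict, and the `JSONExporter` class are assumptions made for the example and may differ from the real `ami.exports.registry` module.

```python
import json


class ExportRegistry:
    """Maps a format name to an exporter class (illustrative sketch)."""

    _exporters = {}

    @classmethod
    def register(cls, format_name):
        # Hypothetical decorator; the real registration API is not shown in the PR.
        def decorator(exporter_class):
            cls._exporters[format_name] = exporter_class
            return exporter_class

        return decorator

    @classmethod
    def get_exporter(cls, format_name):
        try:
            return cls._exporters[format_name]
        except KeyError:
            raise ValueError(f"Unknown export format: {format_name!r}") from None


@ExportRegistry.register("json")
class JSONExporter:
    """Hypothetical exporter used only to exercise the registry."""

    def export(self, records):
        return json.dumps(list(records))


# The CapWords name signals that the variable holds a class, not an instance.
ExportClass = ExportRegistry.get_exporter("json")
exporter = ExportClass()
```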
ami/exports/models.py
Outdated
file_size = models.PositiveBigIntegerField(default=0)

@cached_property
def filters_display(self):
It is preferable that we save these display filters in a field on the model itself, so we only calculate them once and avoid all N+1 queries when fetching the list of DataExports.
Change filters_display to a field, and rename the function that generates it to get_filters_display(). Then call that in the save() method.
Hi @mohamedelabbas1996, I talked with Anna about this today. She tried displaying the extra detail in the UI and it ended up looking cluttered. We agreed that it's okay to revisit this later! Let's keep the filters schema with nested attributes, filters_display: {"collection": {"id": 1}}. But I think it can be generated on the fly as you have it. Just remove the extra queries and only use the ID from the related objects.
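A rough sketch of generating that nested schema without any database lookups. It assumes the stored filters are a flat mapping from a filter name to a related object's ID, for example `{"collection": 1}`; the actual shape of the stored filters is not shown in this thread, so treat both the input format and this standalone function as hypothetical.

```python
def get_filters_display(filters):
    """Build the nested display schema from IDs alone, with no extra queries.

    Assumes `filters` maps a filter name to a related object's ID.
    """
    return {name: {"id": object_id} for name, object_id in (filters or {}).items()}
```

Because only the ID is used, the result can be computed on the fly for every row in a list response without triggering N+1 queries.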
ami/exports/serializers.py
Outdated
def get_file_url(self, obj):
    return obj.get_absolute_url(request=self.context.get("request"))

def get_file_size(self, obj):
This is nice! For a pattern here, I usually return the raw value as file_size and then the display value as file_size_display. Then you can still sort by the raw value.
There's also a Django util you should compare to see if it's good enough. We use it in a couple of places:
from django.template.defaultfilters import filesizeformat
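The raw-plus-display pattern can be sketched as follows. To keep the example runnable without a configured Django project, it uses a small hand-written formatter, `format_file_size`, as a stand-in for `filesizeformat`; its output format is an assumption and will not match Django's exactly. `serialize_file_size` is also a hypothetical helper, not the serializer from this PR.

```python
def format_file_size(num_bytes: int) -> str:
    """Stand-in for Django's filesizeformat (illustrative only)."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.1f} {unit}"
        size /= 1024


def serialize_file_size(num_bytes: int) -> dict:
    """Return the raw value for sorting and a display value for the UI."""
    return {
        "file_size": num_bytes,
        "file_size_display": format_file_size(num_bytes),
    }
```

Sorting on `file_size` stays numeric, whereas sorting on the display string would place "1.5 KB" before "2 bytes".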
ami/exports/base.py
Outdated
)
if self.job:
    self.job.progress.add_stage_param(
        self.job.job_type_key, "Number of records exported", f"{self.queryset.count()}"
Rename this to "Total records to export"
Add another stage parameter: "Number of records exported" that updates with the progress bar.
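One way the two parameters could be kept separate is sketched below. The job progress API is not reproduced here: `update_param` is a hypothetical callback standing in for whatever call updates a stage parameter, and `export_with_progress`, `write_record`, and `report_every` are likewise assumptions for illustration.

```python
def export_with_progress(records, write_record, update_param, report_every=100):
    """Write every record and report a running count alongside the fixed total."""
    total = len(records)
    # Set once, before the export starts.
    update_param("Total records to export", total)
    update_param("Number of records exported", 0)
    exported = 0
    for record in records:
        write_record(record)
        exported += 1
        # Report periodically, and always on the final record.
        if exported % report_every == 0 or exported == total:
            update_param("Number of records exported", exported)
    return exported
```

Reporting every `report_every` records rather than on each one limits how often the job record has to be written.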
- Added 'Number of records exported' as a stage param to track the number of records during export.
- Introduced filters_display field in the DataExport model to precompute and optimize display-friendly filters, reducing unnecessary queries.
- Returned the raw file_size value in the API response to enable sorting, and used Django's filesizeformat to provide a more readable file size format.
mihow left a comment
I'm trying to update our existing filters in the preview env so that I can re-save them and generate filters_display. However the admin is looking for an old property.
AttributeError at /admin/exports/dataexport/
'DataExport' object has no attribute 'status'
Since we are making the DataExport model for the first time in this PR, I don't mind migrating backwards then forwards, but in other cases you can add a migration step to call save() on all existing DataExport instances.
@mohamedelabbas1996 I recreated all the migrations; we went from 9 migrations to 2!
…nickLab/antenna into feat/export-occurrences-data
…-occurrences-data
    "updated_at",
]

def validate_format(self, value):
very clean! nice work
mihow left a comment
I'm closing this epic journey! Well done @mohamedelabbas1996


Summary
This PR implements a framework for defining different export formats and exporting data in the background. It also defines the API for creating & retrieving the history of a project's exports. Users can trigger exports via the API or the Admin Page with optional filtering parameters and track job progress asynchronously.
Related Issues
Closes #720
Detailed Description
This PR enables users to export filtered Occurrence data in CSV or JSON format via the API. The export process runs asynchronously in the background.
Triggering an Export
Users can initiate an export by making a POST request to:
Request Body
The request body should specify the export format and optional filters. Available formats:
Request Example (CSV Export with Collection Filter)
Request Example (JSON Export)
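As a rough sketch only, request bodies of the kind described above might be assembled as follows. The key names `format` and `filters`, the format values `"csv"` and `"json"`, and the `collection` filter key are all assumptions inferred from the discussion, not confirmed field names. How `project_id` is passed is not shown here, so it is left out.

```python
import json

# Assumed format identifiers; the description only names CSV and JSON.
ASSUMED_FORMATS = ("csv", "json")


def build_export_request(export_format, filters=None):
    """Build a hypothetical request body for an export."""
    if export_format not in ASSUMED_FORMATS:
        raise ValueError(f"Unsupported export format: {export_format!r}")
    body = {"format": export_format}
    if filters:
        # Filters are optional, so the key is omitted when none are given.
        body["filters"] = filters
    return body


# CSV export restricted to one collection (hypothetical filter key).
csv_body = json.dumps(build_export_request("csv", {"collection": 1}))
# JSON export with no filters.
json_body = json.dumps(build_export_request("json"))
```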
Checking Export Status
Once an export job is triggered, users can check the job status using:
Fetching All Exports
To retrieve all exports:
Filtering Exports by Project ID
Screenshots
How to Test the Changes
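- Trigger an export with a POST request to /api/v2/exports/ with a valid project_id and request body.
- Check the export status at /api/v2/exports/<export_id> with the returned export_id.
- Download the exported file using file_url.
- List exports at /api/v2/exports/ and filter with project_id.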