Simple aggregator for RSS & HTML feeds, run by a GitHub Action (free tier), that publishes the aggregated `aggregated_feed.xml` in your repo.
You can aggregate multiple RSS feeds into a single RSS feed using `rss_aggregator.py` (there is also an HTML parser option, `html_aggregator.py`).
Note that you have the option to choose between aggregating or appending to `aggregated_feed.xml`. Read all the steps!
**1. Create a new GitHub repository and upload files:**
Create a new public GitHub repository (e.g., "rss-aggregator"). You'll store the Python script, the aggregated RSS feed, and the workflow file in this repository.
**2. Set up GitHub Pages:**
Go to the repository settings, open the GitHub Pages section, and set Build and deployment → Source to "GitHub Actions". The workflow will generate the RSS/XML files and deploy them to GitHub Pages. Save the changes, and you'll get a URL for your GitHub Pages site (e.g., `https://<username>.github.io/rss-aggregator/`).
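For reference, a Pages deployment job in a workflow typically looks like the sketch below, using the standard Pages actions. The job layout, artifact path, and step order here are assumptions; the repo's actual `rss_aggregator.yml` may structure this differently.

```yaml
# Hypothetical job layout; only the script name comes from this README.
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    permissions:
      pages: write      # deploy to GitHub Pages
      id-token: write   # OIDC token used by the deployment
    environment:
      name: github-pages
    steps:
      - uses: actions/checkout@v4
      - run: python rss_aggregator.py        # writes aggregated_feed.xml
      - uses: actions/upload-pages-artifact@v3
        with:
          path: .                            # publish the repo root
      - uses: actions/deploy-pages@v4
```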
**3. Update the `link` field in the script:**
Replace the `link` field in the `rss_aggregator.py` or `html_aggregator.py` script with your GitHub Pages URL:

```python
etree.SubElement(channel, "link").text = "https://<username>.github.io/<repo name>/aggregated_feed.xml"
```
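For context, that line sits where the script builds the feed's channel header with `xml.etree.ElementTree`. A minimal sketch of the surrounding structure (the titles and element order are assumptions, not the script's exact code):

```python
import xml.etree.ElementTree as etree

# Assumed sketch of the channel header around the `link` line above.
rss = etree.Element("rss", version="2.0")
channel = etree.SubElement(rss, "channel")
etree.SubElement(channel, "title").text = "Aggregated Feed"
# The line to point at your own GitHub Pages URL:
etree.SubElement(channel, "link").text = "https://<username>.github.io/<repo name>/aggregated_feed.xml"
etree.SubElement(channel, "description").text = "Aggregated RSS items"
```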
**4. Choose and change to RSS or HTML feeds accordingly in `rss_aggregator.yml`** (a fuller workflow sketch follows below):

For HTML:

```yaml
run: |
  python html_aggregator.py
```

For RSS:

```yaml
run: |
  python rss_aggregator.py
```
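In context, that `run` line is one step of the workflow job; swapping scripts might look like this (the step name and Python setup are assumptions, adapt to the actual `rss_aggregator.yml`):

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.12"      # assumed version
  - name: Run aggregator
    run: |
      python rss_aggregator.py    # swap to html_aggregator.py for the HTML parser
```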
**5. Change the GitHub workflow timer accordingly in `rss_aggregator.yml`:**
The cron job is the main one (how often it runs here on GitHub Actions); a sketch follows below. One more such setting: links are only stored for 365 days, under `name: Update processed links file` in the yml, to prevent `processed_links.txt` from growing too big.
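The schedule lives in the workflow's `on:` trigger; the hourly interval mentioned later in this README (`'0 */1 * * *'`) would look like this (a sketch; the `workflow_dispatch` trigger is an assumption for manual testing):

```yaml
on:
  schedule:
    - cron: '0 */1 * * *'   # every hour, on the hour (UTC)
  workflow_dispatch:        # assumed: allows manual runs for testing
```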
**6. Choose if you want to aggregate or append to `aggregated_feed.xml`** by setting the `append_mode` value to True or False in the RSS or HTML `*.py` script (see the pruning sketch after this step).

**Aggregated**
This is used to ingest the latest news where the ingestion is triggered elsewhere and the ingestion reads the whole file (not only the latest).
`append_mode = False`

**Persistent/Appending**
This is to persist processed links in `aggregated_feed.xml` (used for feeds like Feedly or ingestors that "look for the latest RSS feed update").
To avoid `aggregated_feed.xml` growing too big, the default time to save the links is 365 days; you can adjust it according to your needs.
`append_mode = True`
`max_age_days = 365`
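One way the `max_age_days` cap could be enforced, as a minimal sketch: this assumes processed links are stored one per line as `<ISO timestamp> <url>`; the real `processed_links.txt` format and pruning code may differ.

```python
import datetime

max_age_days = 365  # the script's default, per the step above
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=max_age_days)

# Assumed line format: "2024-01-31T12:00:00 https://example.com/article"
with open("processed_links.txt") as f:
    lines = f.read().splitlines()

fresh = [
    line for line in lines
    if line and datetime.datetime.fromisoformat(line.split(" ", 1)[0]) >= cutoff
]

with open("processed_links.txt", "w") as f:
    f.write("\n".join(fresh) + "\n")
```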
**7. Then take the link generated earlier to `aggregated_feed.xml`** and paste it into your RSS hook or Power Automate flow (with an ingestion frequency to match your cron configuration in `rss_aggregator.yml`).
Example from Teams:

Example from Power Automate flows:

Example from Power Automate flows with `append_mode = True`:
Note: You may also need to set up a GitHub access token for the repo in question; otherwise the GitHub Actions workflow will not be allowed to check out the repo and make pull requests (and merge). By default it uses `GITHUB_TOKEN`, which can be configured on your repository project: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository#setting-the-permissions-of-the-github_token-for-your-repository

I set these permissions in the workflow yml file:
```yaml
permissions:
  contents: write
  pages: write
  id-token: write
```

Include as few permissions as possible, just what's needed for your project.
- Collection Window (`time_threshold`): Collects entries from the past 2 hours; it's set up in the main `*.py` script (see the sketch after this list): `time_threshold = datetime.datetime.utcnow() - datetime.timedelta(hours=2)`
- Cron Job Interval: Runs every hour (`'0 */1 * * *'`), ensuring regular updates; set up in the GitHub workflow `*.yml` file.
- Ingestion Frequency: Runs every hour, aligning with the cron job to process collected entries. (This is set by you, in the ingestion part, such as Teams hooks or a Power Automate flow.)
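To show how a `time_threshold` like the one above is typically applied when collecting entries, a sketch assuming `feedparser`-style entries with a `published_parsed` field (the script's actual parsing may differ):

```python
import datetime
import feedparser

time_threshold = datetime.datetime.utcnow() - datetime.timedelta(hours=2)

feed = feedparser.parse("https://example.com/feed.xml")  # placeholder URL
recent_entries = [
    entry for entry in feed.entries
    if entry.get("published_parsed")
    and datetime.datetime(*entry.published_parsed[:6]) >= time_threshold
]
```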
- Standard (Hourly Updates):
  - `time_threshold`: 2 hours
  - Cron Job Interval: 1 hour
  - Ingestion Frequency: 1 hour (for `append_mode = False`)
- Extended (Monthly Updates):
  - `time_threshold`: 60 days
  - Cron Job Interval: 30 days
  - Ingestion Frequency: 30 days (for `append_mode = False`)

Note: The first run will actually gather 60 days' worth of news (or whatever you set `time_threshold` to), but every subsequent run filters out links that are already present. `time_threshold` needs to overlap the cron job and ingestion intervals so you don't miss anything.
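For the Extended preset, plain cron has no exact "every 30 days" step, so a monthly schedule is the closest equivalent (an assumption on my part, not a setting from the repo):

```yaml
on:
  schedule:
    - cron: '0 0 1 * *'   # 00:00 UTC on the first day of each month
```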
Also note that this is not true for `append_mode = True`: if append mode is active it will only ingest whatever time window you set in the py script (default 2 hours), so you will not have a 365-day blob of items added on your first run, and no spam to channels will happen.
For your own sanity, if you follow this repo or deploy your own, make sure to tune these intervals: if you run every hour it will be very chatty :)