Simple aggregator for RSS & HTML feeds, run by a GitHub Action (free tier), that publishes the aggregated `aggregated_feed.xml` in your repo.
You can aggregate multiple RSS feeds into a single RSS feed using `rss_aggregator.py` (there is also an HTML parser option, `html_aggregator.py`).
Note that you have the option to choose between aggregating or appending to `aggregated_feed.xml`. Read all the steps!
**1. Create a new GitHub repository and upload files:**
Create a new public GitHub repository (e.g., "rss-aggregator"). You'll store the Python script, the aggregated RSS feed, and the workflow file in this repository.
**2. Set up GitHub Pages:**
Go to the repository settings, open the GitHub Pages section, and set Build and deployment → Source to "GitHub Actions". The workflow will generate the RSS/XML files and deploy them to GitHub Pages. Save the changes, and you'll get a URL for your GitHub Pages site (e.g., `https://<username>.github.io/rss-aggregator/`).
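For reference, a Pages deployment job in a workflow typically looks like the sketch below, using the standard Pages actions. The job layout, artifact path, and step order here are assumptions; the repo's actual `rss_aggregator.yml` may structure this differently.

```yaml
# Hypothetical job layout; only the script name comes from this README.
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    permissions:
      pages: write      # deploy to GitHub Pages
      id-token: write   # OIDC token used by the deployment
    environment:
      name: github-pages
    steps:
      - uses: actions/checkout@v4
      - run: python rss_aggregator.py        # writes aggregated_feed.xml
      - uses: actions/upload-pages-artifact@v3
        with:
          path: .                            # publish the repo root
      - uses: actions/deploy-pages@v4
```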
**3. Update the `link` field in the script:**
Replace the `link` field in the `rss_aggregator.py` or `html_aggregator.py` script with your GitHub Pages URL:

```python
etree.SubElement(channel, "link").text = "https://<username>.github.io/<repo name>/aggregated_feed.xml"
```
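For context, that line sits where the script builds the feed's channel header with `xml.etree.ElementTree`. A minimal sketch of the surrounding structure (the titles and element order are assumptions, not the script's exact code):

```python
import xml.etree.ElementTree as etree

# Assumed sketch of the channel header around the `link` line above.
rss = etree.Element("rss", version="2.0")
channel = etree.SubElement(rss, "channel")
etree.SubElement(channel, "title").text = "Aggregated Feed"
# The line to point at your own GitHub Pages URL:
etree.SubElement(channel, "link").text = "https://<username>.github.io/<repo name>/aggregated_feed.xml"
etree.SubElement(channel, "description").text = "Aggregated RSS items"
```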
**4. Choose and change to RSS or HTML feeds accordingly in `rss_aggregator.yml`** (a fuller workflow sketch follows below):

For HTML:

```yaml
run: |
  python html_aggregator.py
```

For RSS:

```yaml
run: |
  python rss_aggregator.py
```
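In context, that `run` line is one step of the workflow job; swapping scripts might look like this (the step name and Python setup are assumptions, adapt to the actual `rss_aggregator.yml`):

```yaml
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.12"      # assumed version
  - name: Run aggregator
    run: |
      python rss_aggregator.py    # swap to html_aggregator.py for the HTML parser
```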
**5. Change the GitHub workflow timer accordingly in `rss_aggregator.yml`:**
The cron job is the main one (how often it runs here on GitHub Actions); a sketch follows below. One more such setting: links are only stored for 365 days, under `name: Update processed links file` in the yml, to prevent `processed_links.txt` from growing too big.
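The schedule lives in the workflow's `on:` trigger; the hourly interval mentioned later in this README (`'0 */1 * * *'`) would look like this (a sketch; the `workflow_dispatch` trigger is an assumption for manual testing):

```yaml
on:
  schedule:
    - cron: '0 */1 * * *'   # every hour, on the hour (UTC)
  workflow_dispatch:        # assumed: allows manual runs for testing
```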
**6. Choose if you want to aggregate or append to `aggregated_feed.xml`** by setting the `append_mode` value to True or False in the RSS or HTML `*.py` script (see the pruning sketch after this step).

**Aggregated**
This is used to ingest the latest news where the ingestion is triggered elsewhere and the ingestion reads the whole file (not only the latest).
`append_mode = False`

**Persistent/Appending**
This is to persist processed links in `aggregated_feed.xml` (used for feeds like Feedly or ingestors that "look for the latest RSS feed update").
To avoid `aggregated_feed.xml` growing too big, the default time to save the links is 365 days; you can adjust it according to your needs.
`append_mode = True`
`max_age_days = 365`
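One way the `max_age_days` cap could be enforced, as a minimal sketch: this assumes processed links are stored one per line as `<ISO timestamp> <url>`; the real `processed_links.txt` format and pruning code may differ.

```python
import datetime

max_age_days = 365  # the script's default, per the step above
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=max_age_days)

# Assumed line format: "2024-01-31T12:00:00 https://example.com/article"
with open("processed_links.txt") as f:
    lines = f.read().splitlines()

fresh = [
    line for line in lines
    if line and datetime.datetime.fromisoformat(line.split(" ", 1)[0]) >= cutoff
]

with open("processed_links.txt", "w") as f:
    f.write("\n".join(fresh) + "\n")
```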
**7. Then take the link generated earlier to `aggregated_feed.xml`** and paste it into your RSS hook or Power Automate flow (with an ingestion frequency to match your cron configuration in `rss_aggregator.yml`).
Example from Teams:

Example from Power Automate flows:

Example from Power Automate flows with `append_mode = True`:
Note: You may also need to set up a GitHub access token for the repo in question; otherwise the GitHub Actions workflow will not be allowed to check out the repo and make pull requests (and merge). By default it uses `GITHUB_TOKEN`, which can be configured on your repository project: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository#setting-the-permissions-of-the-github_token-for-your-repository

I set these permissions in the workflow yml file:
```yaml
permissions:
  contents: write
  pages: write
  id-token: write
```

Include as few permissions as possible, just what's needed for your project.
- Collection Window (`time_threshold`): Collects entries from the past 2 hours; it's set up in the main `*.py` script (see the sketch after this list): `time_threshold = datetime.datetime.utcnow() - datetime.timedelta(hours=2)`
- Cron Job Interval: Runs every hour (`'0 */1 * * *'`), ensuring regular updates; set up in the GitHub workflow `*.yml` file.
- Ingestion Frequency: Runs every hour, aligning with the cron job to process collected entries. (This is set by you, in the ingestion part, such as Teams hooks or a Power Automate flow.)
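To show how a `time_threshold` like the one above is typically applied when collecting entries, a sketch assuming `feedparser`-style entries with a `published_parsed` field (the script's actual parsing may differ):

```python
import datetime
import feedparser

time_threshold = datetime.datetime.utcnow() - datetime.timedelta(hours=2)

feed = feedparser.parse("https://example.com/feed.xml")  # placeholder URL
recent_entries = [
    entry for entry in feed.entries
    if entry.get("published_parsed")
    and datetime.datetime(*entry.published_parsed[:6]) >= time_threshold
]
```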
- Standard (Hourly Updates):
  - `time_threshold`: 2 hours
  - Cron Job Interval: 1 hour
  - Ingestion Frequency: 1 hour (for `append_mode = False`)
- Extended (Monthly Updates):
  - `time_threshold`: 60 days
  - Cron Job Interval: 30 days
  - Ingestion Frequency: 30 days (for `append_mode = False`)

Note: The first run will actually gather 60 days' worth of news (or whatever you set `time_threshold` to), but every subsequent run filters out links that are already present. `time_threshold` needs to overlap the cron job and ingestion intervals so you don't miss anything.
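For the Extended preset, plain cron has no exact "every 30 days" step, so a monthly schedule is the closest equivalent (an assumption on my part, not a setting from the repo):

```yaml
on:
  schedule:
    - cron: '0 0 1 * *'   # 00:00 UTC on the first day of each month
```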
Also note that this is not true for `append_mode = True`: if append mode is active it will only ingest whatever time window you set in the py script (default 2 hours), so you will not have a 365-day blob of items added on your first run, and no spam to channels will happen.
For your own sanity, if you follow this repo or deploy your own, make sure to tune these intervals: if you run every hour it will be very chatty :)