Refactor PO generator for robust metadata and improved class discovery #3204

pareshjoshij wants to merge 5 commits into vacanza:dev from
Conversation
Pull request overview
This PR refactors the generate_po_files.py script to enhance the localization workflow with standardized metadata, license headers, and improved class discovery. The changes aim to improve compatibility with translation tools and reduce git diff noise.
Key changes:

- Added standard gettext headers (POT-Creation-Date, MIME-Version, Report-Msgid-Bugs-To) to generated PO files
- Implemented license header injection from docs/file_header.txt
- Enhanced class discovery logic with name matching and fallback selection based on docstring length
- Improved timestamp handling to minimize unnecessary updates when content hasn't changed
```python
if k not in po_file.metadata:
    po_file.metadata[k] = v
```
The metadata update only adds missing fields but never updates existing ones with potentially stale values. When content changes, fields like 'POT-Creation-Date', 'Generated-By', or 'Report-Msgid-Bugs-To' should be updated even if they already exist in the metadata. This could lead to inconsistent or outdated metadata across PO files.
Suggested change:

```python
po_file.metadata[k] = v
```
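A middle ground between the two behaviors discussed above is to fill in missing fields while always refreshing the ones expected to change between runs. A minimal sketch; the helper name and the exact field list are illustrative, not part of the PR:

```python
# Hypothetical helper: keep translator-maintained fields untouched, but
# always refresh fields whose values are expected to change between runs.
VOLATILE_FIELDS = {"POT-Creation-Date", "Generated-By", "Report-Msgid-Bugs-To"}

def merge_metadata(existing: dict, defaults: dict) -> dict:
    merged = dict(existing)
    for k, v in defaults.items():
        # Add the field if missing, or overwrite it if it is volatile.
        if k not in merged or k in VOLATILE_FIELDS:
            merged[k] = v
    return merged
```

With this, an existing `Language` entry survives regeneration while `POT-Creation-Date` is always brought up to date.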
```python
"MIME-Version": "1.0",
"Content-Type": "text/plain; charset=UTF-8",
"Content-Transfer-Encoding": "8bit",
"Generated-By": "Lingva 5.0.5",
```
The hardcoded version string "Lingva 5.0.5" in the metadata is a maintainability issue. This value will become outdated if the lingva library is updated. Consider dynamically retrieving the lingva version from the package or removing this field if not essential.
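If the field were kept, one way to avoid the hardcoded version is to look it up at runtime with the stdlib `importlib.metadata`, assuming `lingva` is installed as a distribution of that name:

```python
from importlib.metadata import PackageNotFoundError, version

# Look up the installed lingva version dynamically instead of hardcoding it.
try:
    lingva_version = version("lingva")
except PackageNotFoundError:
    # Fall back gracefully when the package is not installed.
    lingva_version = "unknown"

generated_by = f"Lingva {lingva_version}"
```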
@pareshjoshij Let's remove the `"Generated-By": "Lingva 5.0.5",` line; it should be added automatically by Lingva anyway.
```python
if not content.startswith("#"):
    new_parts.append("#")
```
The logic for determining whether to add headers is inconsistent. In the no-change path (lines 154-169), both license header and desc_line are checked before adding. In the changed path (lines 193-198), an extra condition checks if content doesn't start with "#". This inconsistency could lead to different header formatting between files that have changed and those that haven't.
Suggested change:

```python
new_parts.append("#")
```
```python
try:
    mod = importlib.import_module(module)
except ImportError:
    continue
```
ImportError is caught but silently ignored without any logging. This could hide genuine import problems and make debugging difficult. Consider adding at least a debug log message when a module fails to import so that issues can be identified during development.
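A minimal sketch of the suggested logging; the function name and logger setup are illustrative, not from the PR:

```python
import importlib
import logging

logger = logging.getLogger(__name__)

def iter_importable(module_names):
    """Yield each module that imports cleanly, logging the ones that don't."""
    for name in module_names:
        try:
            yield importlib.import_module(name)
        except ImportError as exc:
            # Keep going, but leave a trace for debugging instead of
            # silently swallowing the failure.
            logger.debug("Could not import %s: %s", name, exc)
```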
```python
content = po_path.read_text(encoding="utf-8")
new_parts = []

if license_header and "Authors: Vacanza Team" not in content:
```
The check 'if "Authors: Vacanza Team" not in content' is used as a proxy to determine whether the license header already exists. This is fragile because it only checks for one specific string within the header. If the header format changes or if this string appears elsewhere in the file, the logic will break. Consider using a more robust marker or checking for the complete header structure.
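One more robust option for the check discussed above, sketched with a hypothetical explicit sentinel line (not the project's actual header text) that is only searched for inside the leading comment block:

```python
# Hypothetical sentinel line appended to the injected license header.
LICENSE_MARKER = "# vacanza-license-header"

def has_license_header(content: str) -> bool:
    # Scan only the leading comment block, so the marker text appearing
    # later in the file (e.g. inside a msgid) cannot cause a false positive.
    for line in content.splitlines():
        if not line.startswith("#"):
            return False
        if line.strip() == LICENSE_MARKER:
            return True
    return False
```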
```python
"POT-Creation-Date": datetime.now().strftime("%Y-%m-%d %H:%M%z"),
"Last-Translator": "Vacanza Team <dr.prodigy.github@gmail.com>",
```
The timestamp format string '%z' may produce an empty string on some platforms where timezone information is not available. Consider using a fixed timezone or handling the case where timezone offset is not available to ensure consistent output across different environments.
Suggested change:

```python
"POT-Creation-Date": datetime.now().astimezone().strftime("%Y-%m-%d %H:%M%z"),
"Last-Translator": "Vacanza Team <dr-prodigy.github@gmail.com>",
```
I did double-check and the bot's astimezone() inclusion is correct, but let's revert to the placeholder name instead:
Suggested change:

```python
"POT-Creation-Date": datetime.now().astimezone().strftime("%Y-%m-%d %H:%M%z"),
"Last-Translator": "FULL NAME <EMAIL@ADDRESS>",
```
Alternatively, since we're doing this, may as well make the example email RFC 2606-compliant
Suggested change:

```python
"POT-Creation-Date": datetime.now().astimezone().strftime("%Y-%m-%d %H:%M%z"),
"Last-Translator": "FULL NAME <EMAIL@EXAMPLE.COM>",
```
```python
po_path.write_text(final_content, encoding="utf-8")
return

timestamp = datetime.now().strftime("%Y-%m-%d %H:%M%z")
```
The same timestamp format issue exists here. The '%z' format specifier may produce an empty string on some platforms where timezone information is not available. This could lead to inconsistent or invalid POT-Creation-Date values.
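A quick illustration of the `%z` behavior: on a naive `datetime` the specifier renders as an empty string, while an aware one (which `astimezone()` would produce) carries its offset:

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 2, 3, 4)           # no tzinfo attached
print(naive.strftime("%Y-%m-%d %H:%M%z"))    # "%z" renders as an empty string

aware = naive.replace(tzinfo=timezone.utc)   # an aware datetime, as from astimezone()
print(aware.strftime("%Y-%m-%d %H:%M%z"))    # prints "2024-01-02 03:04+0000"
```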
```python
all_po_update_tasks: list[tuple[str, str]] = []
```

```python
if not chosen_cls:
    candidates.sort(key=lambda c: len(c.__doc__ or ""), reverse=True)
```
The fallback logic for choosing the "best" class is flawed. Sorting by docstring length may select a less relevant class over the primary one. For example, if a file contains a main holiday class with a short docstring and helper classes with longer docstrings, the wrong class could be selected. Consider using additional heuristics such as class inheritance depth or checking if the class directly implements certain methods.
Suggested change:

```python
def _class_selection_key(cls) -> tuple[int, int]:
    """
    Heuristic for selecting the most relevant HolidayBase subclass
    when multiple candidates exist in a module.

    Higher score is better; we invert it for use in sort().
    """
    score = 0
    # Prefer classes that implement the core population logic themselves.
    if "_populate" in cls.__dict__:
        score += 2
    # Prefer classes that directly declare identifying attributes.
    if "country" in cls.__dict__ or "market" in cls.__dict__:
        score += 1
    doc_len = len(cls.__doc__ or "")
    # sort() is ascending, so use negatives to put best candidates first.
    return (-score, -doc_len)

candidates.sort(key=_class_selection_key)
```
```python
target_name = path.stem.replace("_", "").lower()

for cls in candidates:
    if cls.__name__.lower() == target_name:
```
The class name matching algorithm is case-insensitive and removes underscores from the filename, but doesn't account for potential edge cases. For a file named "united_states.py", target_name becomes "unitedstates", but if the class is named "UnitedStates", this will match. However, if there are naming variations or special characters, this logic may fail. Consider more robust matching strategies.
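One way to harden the matching, sketched as a hypothetical normalization helper (not in the PR):

```python
import re

def normalize(name: str) -> str:
    # Lowercase and drop every character that is not a letter or digit, so
    # "united_states", "UnitedStates" and "United-States" all compare equal.
    return re.sub(r"[^a-z0-9]", "", name.lower())
```

With this, comparing `normalize(path.stem)` against `normalize(cls.__name__)` tolerates underscores, hyphens, and case differences alike.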
```python
first_line = entity_docstring.strip().split("\n")[0].strip().rstrip(".")
if first_line.endswith(" holidays"):
    clean_name = first_line[:-9].strip()
else:
    clean_name = first_line
```
The description line generation has a specific hardcoded pattern where it strips " holidays" from the end of the first line (line 143). This assumes a specific docstring format. If the docstring doesn't follow this format exactly, the logic may produce unexpected results. Consider documenting this expectation or making the parsing more flexible.
Suggested change:

```python
# Use the first non-empty line of the docstring as a short description.
# Commonly, entity docstrings follow the "<entity> holidays." pattern.
# In that case we strip the trailing "holidays"/"holiday" keyword here
# to avoid duplicating it when building `desc_line` below. If the
# docstring uses a different format, we fall back to the first
# sentence unchanged.
first_line = entity_docstring.strip().split("\n", 1)[0].strip()
# Only consider the first sentence to keep the description concise.
first_sentence = first_line.split(".", 1)[0].strip()
lowered = first_sentence.lower()
if lowered.endswith(" holidays"):
    clean_name = first_sentence[: -len(" holidays")].strip()
elif lowered.endswith(" holiday"):
    clean_name = first_sentence[: -len(" holiday")].strip()
else:
    clean_name = first_sentence
```
I see the automated review suggestions (regarding robustness, timestamps, etc.) and will definitely address them in the next revision. However, before I apply those fixes and refactor scripts/l10n/l10n_helper.py, could you confirm you are happy with this overall approach for headers and metadata? I want to ensure the strategy is sound before doing the final integration.
The code looks overcomplicated.
I've included most of the formatting fixes, but I haven't figured out how this PR accidentally adds the l10n location trackers back in again after we removed them a few years back, e.g.:
```po
#. Monday following %s.
#: ./holidays/countries/spain.py:105
#, c-format
msgid "Lunes siguiente a %s"
msgstr ""

#. New Year's Day.
#: ./holidays/countries/spain.py:172 ./holidays/countries/spain.py:215
#: ./holidays/countries/spain.py:249 ./holidays/countries/spain.py:283
#: ./holidays/countries/spain.py:371 ./holidays/countries/spain.py:426
#: ./holidays/countries/spain.py:564 ./holidays/countries/spain.py:671
#: ./holidays/countries/spain.py:754
msgid "Año Nuevo"
msgstr ""
```

```python
"MIME-Version": "1.0",
"Content-Type": "text/plain; charset=UTF-8",
"Content-Transfer-Encoding": "8bit",
"Generated-By": "Lingva 5.0.5",
```
@pareshjoshij Let's remove the `"Generated-By": "Lingva 5.0.5",` line; it should be added automatically by Lingva anyway.
```python
def _get_standard_metadata(default_language: str = "en_US") -> dict:
    """Returns the standard metadata required for gettext."""
    return {
        "Report-Msgid-Bugs-To": "dr-prodigy@users.noreply.github.com",
```
Suggested change (remove this line):

```python
"Report-Msgid-Bugs-To": "dr-prodigy@users.noreply.github.com",
```
This wasn't included in any existing l10n files AFAIK; let's remove it for now.
TODO: Reminder for ME:

```python
"Report-Msgid-Bugs-To: l10n@vacanza.dev\n"
```
```python
content = HEADER_PATH.read_text(encoding="utf-8").strip()
if not content:
    return ""

lines = []
for line in content.splitlines():
    line = line.rstrip()
    if not line:
        lines.append("#")
    elif line.startswith("#"):
        lines.append(line)
    else:
        lines.append(f"# {line}")

return "\n".join(lines) + "\n"
Fix the first line not getting 2 spaces and simplify this up a bit:
Suggested change:

```python
content = HEADER_PATH.read_text(encoding="utf-8").rstrip("\n")
if not content:
    return ""

return "\n".join(
    "#" if not line.rstrip() else f"# {line.rstrip()}"
    for line in content.splitlines()
) + "\n"
```
```python
if not has_content_changed:
    if po_path.exists():
        content = po_path.read_text(encoding="utf-8")
        if "Authors: Vacanza Team" not in content:
            new_parts = []
            if license_header:
                new_parts.append(license_header)
            if desc_line and desc_line not in content:
                new_parts.append(desc_line)

            if new_parts:
                new_parts.append("#")
                final_content = "\n".join(new_parts) + "\n" + content
                if final_content.strip() != content.strip():
                    po_path.write_text(final_content, encoding="utf-8")
    return
```
Suggested change:

```python
if not has_content_changed:
    if po_path.exists():
        content = po_path.read_text(encoding="utf-8")
        content = POGenerator._strip_gettext_boilerplate(content)
        if "Authors: Vacanza Team" not in content:
            new_parts = []
            if license_header:
                new_parts.extend(license_header.rstrip("\n").splitlines())
                new_parts.append("#")
            if desc_line and desc_line not in content:
                new_parts.append(desc_line)
            if new_parts:
                new_parts.append("#")
                final_content = "\n".join(new_parts) + "\n" + content
                if final_content.strip() != content.strip():
                    po_path.write_text(final_content, encoding="utf-8")
    return
```
This, and Lingva's boilerplate stripper:

```python
@staticmethod
def _strip_gettext_boilerplate(content: str) -> str:
    if content.startswith("# SOME DESCRIPTIVE TITLE"):
        return content.split("#, fuzzy", 1)[1].lstrip()
    return content.lstrip()
```

```diff
@@ -58,50 +95,107 @@ def _process_entity_worker(
     allow_empty=True,
```
This should disable l10n location inclusion in the .po file
Suggested change:

```python
location=False,
allow_empty=True,
```
With this (and my other comments), it should now at least work as a proof-of-concept, though I can't get it to generate for non-default languages yet.
Here's the updated table of mismatching country names from the new implementation (read from individual

*Maybe
I will wait for the next revision for my review. :)
Would it be acceptable if I simply updated the docstrings in the actual country/market files to match the preferred names (the ones with ✅)?
@PPsyrius Thank you so much for the incredibly detailed review! ❤️ I really respect the time and effort you put into guiding me through this. Your suggestions gave me exactly the path I was looking for to make this easier and less complex. I am currently focused on my WOC task in this repo, so I might not update this right this second. However, I will definitely carve out some time to address all your suggestions very soon. Thanks again! 🚀
Proposed change

Closes #3180

This PR refactors the `generate_po_files.py` script to significantly improve the robustness and standardization of our localization workflow.

- Added `POT-Creation-Date`, `MIME-Version`, `Report-Msgid-Bugs-To`, and other standard gettext headers to ensure better compatibility with translation tools (like Weblate/Crowdin).
- Injected the license header from `docs/file_header.txt` into generated `.po` files.
- Added a description line (e.g. `# United States holidays.`) to the `.po` files.

Testing & Notes:

- Tested by modifying `holidays/countries/india.py` (adding/changing holidays) and creating dummy blank `.po` files.
- Ran `make check` locally and all checks passed.
- This likely affects `scripts/l10n/l10n_helper.py`; I plan to make necessary changes there accordingly.

Type of change

- Enhancement (improves `holidays` functionality in general)

Checklist

- [x] Ran `make check` locally; all checks and tests passed.