Improve error messages for entity extraction to aid debugging #40
devbm7 wants to merge 1 commit into junquera:master
- Enhanced all error print statements for URL/entity extraction (Skype, Discord, Telegram, WhatsApp, etc.) to include:
  - The entity type being processed
  - The problematic value that caused the error
  - The exception type and message
- This makes errors much more descriptive and actionable for debugging, addressing the open issue about insufficient error context.
Pull request overview
This PR enhances stdout error messages emitted during URL/entity extraction to include the problematic value and the exception type/message, improving debugging context as requested in issue #39.
Changes:
- Add `value=...` and `error=<ExceptionType>: <message>` context to URL parsing error prints in `extract_links`.
- Add similar contextual information to entity-specific URL extraction errors (I2P/Tor/WhatsApp/Discord/Telegram/Skype) in `_analyze_chunk`.
- Expand BeautifulSoup parsing error prints to include the relevant value being processed.
          urls.add(UUF(full_url).rebuild())
      except AttributeError:
    -     print("[*] AttributeError: Invalid attribute in URL")
    +     print(f"[*] AttributeError: Invalid attribute in URL, value={href!r}")
      except ValueError:
    -     print("[*] ValueError: Invalid URL format")
    +     print(f"[*] ValueError: Invalid URL format, value={href!r}")
In this BeautifulSoup link loop, the exception handlers interpolate href (e.g., value={href!r}), but href is assigned inside the try. If an exception is raised before href is set (e.g., during url.get(...)), the handler will raise UnboundLocalError and mask the original failure. Initialize href = None before the try (or use a safer fallback like printing url/full_url via locals().get(...)).
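The fix described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the project's code: `collect_links` is a hypothetical stand-in for the link loop, plain dicts stand in for BeautifulSoup tags, and `urljoin` replaces the `UUF(...).rebuild()` call. It also returns the messages instead of printing them so the behavior is easy to check.

```python
from urllib.parse import urljoin


def collect_links(anchors, base_url):
    """Collect absolute URLs from anchor-like objects (hypothetical helper).

    Returns the set of URLs plus the error messages that would be printed.
    """
    urls, errors = set(), []
    for anchor in anchors:
        # Bind href before the try block so the handlers below can always
        # format it, even if the failure happens before the assignment.
        href = None
        try:
            href = anchor.get("href")  # AttributeError if anchor has no .get
            if not isinstance(href, str):
                raise ValueError("missing or non-string href")
            urls.add(urljoin(base_url, href))
        except AttributeError as e:
            errors.append(
                f"[*] AttributeError: Invalid attribute in URL, "
                f"value={href!r}, error={type(e).__name__}: {e}"
            )
        except ValueError as e:
            errors.append(
                f"[*] ValueError: Invalid URL format, "
                f"value={href!r}, error={type(e).__name__}: {e}"
            )
    return urls, errors
```

Without the `href = None` line, the `None` entry in a call such as `collect_links([{"href": "/a"}, None], "http://example.com")` would raise `UnboundLocalError` inside the `except AttributeError` handler on the first failing iteration, or silently report the previous iteration's `href` on later ones.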
    +     print(f"[*] TypeError: Invalid input type for BeautifulSoup, value={body!r}")
      except Exception as e:
    -     print(f"[*] Error with HTML parsing: {e}")
    +     print(f"[*] Error with HTML parsing: value={body!r}, error={type(e).__name__}: {e}")
These error prints include value={body!r}. Since body is the full HTML/text chunk, this can be extremely large and may leak sensitive content into logs/stdout. Consider logging only type(body), len(body), and a truncated preview (or a hash) instead of the full repr.
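One way to follow this suggestion is a small summarizing helper. The sketch below is an assumption about how it could look, not part of the PR: `describe_value` and its `max_chars` parameter are hypothetical names, and the 12-character SHA-256 prefix is an arbitrary choice for correlating log lines.

```python
import hashlib


def describe_value(value, max_chars=80):
    """Summarize a value for logging without dumping it in full.

    Hypothetical helper: reports the type, the length, a short SHA-256
    prefix, and a truncated preview of the value.
    """
    text = value if isinstance(value, str) else repr(value)
    # errors="replace" keeps hashing from failing on unencodable characters.
    digest = hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()[:12]
    preview = text[:max_chars]
    if len(text) > max_chars:
        preview += "..."
    return (
        f"type={type(value).__name__}, len={len(text)}, "
        f"sha256={digest}, preview={preview!r}"
    )
```

An error print could then use `value=({describe_value(body)})` in place of `value={body!r}`. Note that even a truncated preview can still expose sensitive content, so dropping the preview entirely may be preferable for untrusted input.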
    + print(f"Error parsing text with BeautifulSoup: value={href!r}, error={type(e).__name__}: {e}")
    + print(f"[*] Unexpected error: value={href!r}, error={type(e).__name__}: {e}")
This exception path prints two separate lines for the same failure and uses a different prefix format than the other error messages. Consider consolidating to a single, consistently-formatted message to reduce noisy/duplicated output while preserving the new context fields.
    - print(f"Error parsing text with BeautifulSoup: value={href!r}, error={type(e).__name__}: {e}")
    - print(f"[*] Unexpected error: value={href!r}, error={type(e).__name__}: {e}")
    + print(f"[*] Unexpected error parsing text with BeautifulSoup: value={href!r}, error={type(e).__name__}: {e}")
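To keep the prefix and the context fields consistent across all handlers, the message could be built in one place. The following is a sketch under that assumption; `format_error` is a hypothetical helper, not something the PR defines.

```python
def format_error(context, value, exc, prefix="[*]"):
    """Build one consistently formatted error line (hypothetical helper).

    context: short description of what was being processed
    value:   the problematic input, shown via repr()
    exc:     the caught exception instance
    """
    return f"{prefix} {context}: value={value!r}, error={type(exc).__name__}: {exc}"
```

Each handler would then make a single call such as `print(format_error("Unexpected error parsing text with BeautifulSoup", href, e))`, which avoids both the duplicated output and the mismatched prefixes.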
This would help with #39