Crawler 101 #1

Merged
priyanshum143 merged 35 commits into main from crawler_101
Jan 14, 2026

Conversation

@priyanshum143
Owner

This PR contains the code for the crawler. As of now, the crawler runs without any known issues.

Contributor

@github-actions github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

black-format

[black-format] reported by reviewdog 🐶

return hashlib.sha256(content.encode('utf-8')).hexdigest()
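
For context, the flagged line computes a SHA-256 digest of page content. A minimal sketch of such a helper, written with the double quotes Black prefers, might look like the following. The function name `compute_content_hash` is hypothetical and not taken from the PR; only the return expression mirrors the flagged line.

```python
import hashlib


def compute_content_hash(content: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded content."""
    # Hypothetical helper name; only the return expression mirrors the PR.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(compute_content_hash("abc"))
```

Identical input always yields the same 64-character digest, which is presumably why the crawler hashes content (for example, to recognize pages it has already stored).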


[black-format] reported by reviewdog 🐶

logger.debug(f"Initializing URL frontier with seed URLs: {CommonVariables.SEED_URLS}")


[black-format] reported by reviewdog 🐶

logger.debug(f"Queue is full. Current size: {queue_size}, Max limit: {CommonVariables.MAX_LIMIT}")


[black-format] reported by reviewdog 🐶

logger.debug(f"New queue size: {self.url_frontier.qsize()}/{CommonVariables.MAX_LIMIT}")


[black-format] reported by reviewdog 🐶

logger.debug(f"Could not add {urls_len - urls_to_add} URLs due to queue capacity limit")
async def _parse_response_and_make_page_model(self, responses: List[httpx.Response]) -> None:
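
The queue-related log messages above suggest the frontier only accepts URLs while it is below `CommonVariables.MAX_LIMIT`. A rough sketch of that capacity check follows, under stated assumptions: the function name `add_urls_to_frontier` is hypothetical, and the standard-library `queue.Queue` stands in for whatever queue type the PR actually uses.

```python
import queue
from typing import Iterable


def add_urls_to_frontier(
    frontier: "queue.Queue[str]", urls: Iterable[str], max_limit: int
) -> int:
    """Add as many URLs as fit under max_limit; return how many were added."""
    urls = list(urls)
    # Free capacity is clamped at zero in case the queue is already over the limit.
    free_slots = max(max_limit - frontier.qsize(), 0)
    urls_to_add = min(len(urls), free_slots)
    for url in urls[:urls_to_add]:
        frontier.put_nowait(url)
    return urls_to_add
```

The difference between `len(urls)` and the returned count corresponds to the "Could not add ... URLs due to queue capacity limit" message.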


[black-format] reported by reviewdog 🐶

with open(CommonVariables.JSONL_FILE_PATH, 'a', encoding='utf-8') as f:


[black-format] reported by reviewdog 🐶

logger.debug(f"Status code for URL [{url}] is {response.status_code}\n content: {response.content}")


[black-format] reported by reviewdog 🐶

content_type = response.headers.get('Content-Type', '').lower()
if 'xml' in content_type:
    parser = 'xml'


[black-format] reported by reviewdog 🐶

parser = 'html.parser'
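
Taken together, the two flagged snippets above choose a parser name from the response's `Content-Type` header. A self-contained sketch of that selection is below. The function name `choose_parser` is hypothetical, and it is an assumption that the result is passed to BeautifulSoup, whose `"xml"` parser requires `lxml` to be installed.

```python
def choose_parser(content_type: str) -> str:
    """Pick a parser name from a Content-Type header value."""
    # Mirrors the flagged logic: any content type mentioning "xml" uses the XML parser.
    if "xml" in content_type.lower():
        return "xml"
    # Everything else, including a missing header, falls back to the HTML parser.
    return "html.parser"
```

Because this is a substring check, content types such as `application/xhtml+xml` also select the XML parser, which may or may not be what the crawler intends.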


[black-format] reported by reviewdog 🐶

f.write(json.dumps(asdict(page_model), ensure_ascii=False) + '\n')
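
The `open(...)` and `f.write(...)` lines flagged above append one JSON object per line to a JSONL file. A minimal sketch of that pattern, in Black's double-quote style, is shown here. The `PageModel` fields and the `append_page` name are hypothetical, since the PR's actual page model is not visible in this conversation.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PageModel:
    # Hypothetical fields; the PR's real page model is not shown.
    url: str
    title: str


def append_page(path: str, page_model: PageModel) -> None:
    """Append one page as a single JSON line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(page_model), ensure_ascii=False) + "\n")
```

With `ensure_ascii=False`, non-ASCII characters are written as-is rather than as `\uXXXX` escapes, so the file should be opened with an explicit UTF-8 encoding, as the flagged line does.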

priyanshum143 and others added 5 commits January 14, 2026 22:31
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@priyanshum143 priyanshum143 merged commit af56a53 into main Jan 14, 2026
1 check passed
@priyanshum143 priyanshum143 deleted the crawler_101 branch January 14, 2026 17:11