ChatAGH_DataCollecting

Data collection and indexing pipelines for the Chat AGH project.

Run instructions

Initialize a new Python virtual environment

python3 -m venv .venv

Activate the environment

source .venv/bin/activate

Data scraping/processing pipeline overview

1. For the given domains (e.g. agh.edu.pl, rekrutacja.agh.edu.pl), generate a graph of connections between the pages under these domains (connections are buttons, links, etc.).

This step is performed by graph_generator.py.
The output of this process is a JSON file with the following structure:

{
  "source": "<SOURCE URL>",
  "target": "<TARGET URL>"
}

The source value is always a URL available under the processed domain.
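As an illustration of how such edge records might be consumed, here is a minimal sketch that loads them and builds an adjacency map. It assumes the file holds a list of objects with the "source" and "target" keys shown above; the sample URLs and the build_adjacency helper are hypothetical and not part of graph_generator.py.

```python
import json
from collections import defaultdict

# Hypothetical sample data; the real file is produced by graph_generator.py.
raw = """[
  {"source": "https://agh.edu.pl/", "target": "https://rekrutacja.agh.edu.pl/"},
  {"source": "https://agh.edu.pl/", "target": "https://agh.edu.pl/en"}
]"""


def build_adjacency(edges):
    """Map every source URL to the set of URLs it links to."""
    adjacency = defaultdict(set)
    for edge in edges:
        adjacency[edge["source"]].add(edge["target"])
        # Register targets as nodes too, even if they have no outgoing links.
        adjacency.setdefault(edge["target"], set())
    return dict(adjacency)


edges = json.loads(raw)
graph = build_adjacency(edges)
```

Storing the targets as sets means duplicate links between the same two pages collapse into a single edge.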

2. Scrape and process all nodes (URLs) in the graphs.
   - Processing involves HTML parsing, text extraction, downloading files (.pdf, .docx, etc.) and filtering.
   - Nodes filtered out in the process (e.g. documents that are too short) are removed from the graphs.
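The text extraction and filtering part of this step can be sketched with the standard library alone. This is only an illustration under assumptions: the MIN_CHARS threshold, the TextExtractor class and the extract_text and filter_graph helpers are hypothetical names, and file downloading is left out entirely.

```python
from html.parser import HTMLParser

MIN_CHARS = 40  # hypothetical threshold for "too short" documents


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def filter_graph(edges, texts, min_chars=MIN_CHARS):
    """Drop nodes whose text is too short, and every edge touching them."""
    kept = {url for url, text in texts.items() if len(text) >= min_chars}
    return [e for e in edges if e["source"] in kept and e["target"] in kept]
```

Here texts is assumed to map each URL to its extracted text, so removing a node also removes all edges that point to or from it.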

3. Domain clustering. The result of the previous steps is a graph composed of multiple strongly connected components, which are not strongly connected to each other or are entirely separated from each other. We want to cluster these components so that the ones which are relatively small and highly correlated are merged. This step is presented in the web_graph_eda.ipynb notebook.
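To make the idea concrete, the sketch below finds strongly connected components with Kosaraju's algorithm and then merges small components into the neighbouring component they share the most links with. The merge rule, the max_size parameter and all function names are assumptions made for illustration; the notebook may use a different notion of "highly correlated".

```python
from collections import defaultdict


def strongly_connected_components(adjacency):
    """Kosaraju's algorithm; returns a dict mapping node -> component id."""
    # First pass: record nodes in order of DFS completion (iterative DFS).
    order, visited = [], set()
    for start in adjacency:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(adjacency[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adjacency.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    # Second pass: explore the reversed graph in reverse completion order.
    reverse = defaultdict(set)
    for src, targets in adjacency.items():
        for dst in targets:
            reverse[dst].add(src)
    component, next_id = {}, 0
    for root in reversed(order):
        if root in component:
            continue
        component[root] = next_id
        pending = [root]
        while pending:
            node = pending.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = next_id
                    pending.append(prev)
        next_id += 1
    return component


def merge_small_components(adjacency, component, max_size=1):
    """Merge components of at most max_size nodes (original size) into the
    neighbouring component they share the most edges with."""
    sizes = defaultdict(int)
    for cid in component.values():
        sizes[cid] += 1
    # Count edges between distinct components, ignoring direction.
    links = defaultdict(lambda: defaultdict(int))
    for src, targets in adjacency.items():
        for dst in targets:
            a, b = component[src], component[dst]
            if a != b:
                links[a][b] += 1
                links[b][a] += 1
    # Union-find keeps chained or mutual merges consistent.
    parent = {cid: cid for cid in sizes}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for cid, size in sizes.items():
        if size <= max_size and links[cid]:
            best = max(links[cid], key=links[cid].get)
            parent[find(cid)] = find(best)
    return {node: find(cid) for node, cid in component.items()}


# Toy graph with hypothetical node names.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"d"}, "d": {"c"},
             "x": {"a"}, "lonely": set()}
component = strongly_connected_components(adjacency)
clusters = merge_small_components(adjacency, component, max_size=1)
```

In the toy graph, the single-node component "x" is absorbed into the component of "a" and "b", while "lonely" has no links and stays on its own.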

4. Once the data has been grouped into clusters, we index it in the database.

The whole graph is saved in a separate collection in the database and contains information about the nodes, their metadata (cluster id, etc.) and the connections between them.
For each cluster separately, we chunk the scraped data, generate embeddings for the chunks, and save them in the vec
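The per-cluster chunking step could look roughly like the sketch below. It splits text into overlapping word windows and attaches an embedding to each chunk. The chunk sizes, the record fields, and the chunk_text and build_records helpers are assumptions, and embed stands for whatever embedding function the project actually uses; the toy function in the usage lines is only a placeholder.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size words that overlap by overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def build_records(cluster_id, documents, embed, **chunk_kwargs):
    """Create one record per chunk for all documents (url -> text) of a cluster."""
    records = []
    for url, text in documents.items():
        for position, chunk in enumerate(chunk_text(text, **chunk_kwargs)):
            records.append({
                "cluster_id": cluster_id,
                "url": url,
                "position": position,
                "text": chunk,
                "embedding": embed(chunk),
            })
    return records


# Placeholder embedding, for illustration only: NOT a real embedding model.
def toy_embed(chunk):
    return [float(len(chunk))]


sample_docs = {"https://agh.edu.pl/": " ".join(f"w{i}" for i in range(10))}
records = build_records(0, sample_docs, toy_embed, chunk_size=4, overlap=1)
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a chunk boundary still appears whole in at least one of them when it is short enough.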
