Important
The project is still incomplete and needs more work before a final release.
ACCESS (Automated Cataloging and Classification Engine for Storage and Search) is a Python-based application designed to provide a powerful and flexible solution for cataloging, classifying, and searching documents. This project aims to streamline document management through intelligent tagging and efficient retrieval mechanisms.
The primary goal of ACCESS is to offer a robust tool that can:
- Automate Document Organization: Automatically process and categorize documents based on their content.
- Enable Efficient Search: Allow users to quickly find relevant documents using a tag-based search system.
- Ensure Scalability: Handle a growing number of documents, from small personal collections to larger archives.
- Maintain High Code Quality: Adhere to professional coding standards, including linting, formatting, static typing, and testing.
- Be Modular and Extensible: Design components that can be easily evolved or replaced.
- SQLite-based Data Persistence: Secure and efficient storage of document metadata and tags using SQLite.
- Automatic Tagging (Work in Progress):
  - Custom NLP pipeline for document processing (tokenization, cleaning, lemmatization).
  - TF-IDF based scoring for tag relevance.
- Tag-Based Search (Planned): Powerful search mechanisms based on document tags and their scores.
- Command-Line Interface (CLI) (Planned): Initial interface for interacting with the system.
- Unique Document Storage: Internal mechanism to ensure uniquely stored filenames using UUIDs.
- Robust Database Management: Dedicated DatabaseManager class to handle all database interactions, schema creation, and CRUD operations.
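The unique-storage feature can be pictured with a minimal sketch. It assumes the stored name is a random UUID plus the original file extension; the helper name make_stored_filename is hypothetical and the project's actual naming scheme may differ.

```python
import uuid
from pathlib import Path


def make_stored_filename(original_filename: str) -> str:
    """Return a unique storage name that keeps the original file extension."""
    # Hypothetical helper: the extension is lowercased, "" if the file has none.
    suffix = Path(original_filename).suffix.lower()
    return f"{uuid.uuid4().hex}{suffix}"


# Two imports of the same file receive different stored names.
name_a = make_stored_filename("Report.PDF")
name_b = make_stored_filename("Report.PDF")
```

Keeping the original name in original_filename and the generated one in stored_filename lets two files with the same name coexist on disk.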
ACCESS is built with a modular architecture to promote maintainability and future enhancements. Key components include:
- access-app/core: Contains the core business logic.
  - database_manager.py: Manages all interactions with the SQLite database, including schema definition, document and tag storage, and retrieval.
  - (Planned) DocumentProcessor: Handles NLP tasks such as tokenization, stop-word removal, and lemmatization.
  - (Planned) TagExtractor: Calculates term frequencies from processed documents.
  - (Planned) ScoringEngine: Implements TF-IDF and potentially other scoring algorithms.
- access-app/io: (To be defined) Intended for input/output interactions, such as file handling.
- access-app/utils: (To be defined) For utility functions and helper classes.
- access-app/cli.py: (To be defined) The main entry point for the command-line interface.
- data/learned/access_main.db: The SQLite database file where all learned information and metadata are stored.
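To make the division of labor between the planned components concrete, here is a rough sketch of the three stages as plain functions. Everything in it is an assumption for illustration: the function names, the tiny stop-word set, and the tf * log(N / df) weighting are not taken from the project, and lemmatization is left out entirely.

```python
import math
import re
from collections import Counter
from typing import Dict, List

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # tiny illustrative set


def process(text: str) -> List[str]:
    """DocumentProcessor role: lowercase, tokenize, drop stop words (no lemmatization)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_frequencies(tokens: List[str]) -> Dict[str, float]:
    """TagExtractor role: relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()} if total else {}


def tf_idf(corpus: List[str]) -> List[Dict[str, float]]:
    """ScoringEngine role: score each term of each document with tf * log(N / df)."""
    processed = [process(doc) for doc in corpus]
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in processed for term in set(tokens))
    n_docs = len(corpus)
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in term_frequencies(tokens).items()}
        for tokens in processed
    ]


scores = tf_idf([
    "The invoice of the supplier",
    "The supplier contract",
    "Holiday photos in Rome",
])
```

In the first document, "invoice" scores higher than "supplier" because "supplier" also occurs in the second document, which lowers its inverse document frequency.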
The core data is organized into three main tables:
- Documents: Stores metadata about each document.
  - doc_id (INTEGER, PK, Autoincrement)
  - original_filename (TEXT, NOT NULL)
  - stored_filename (TEXT, NOT NULL, UNIQUE) - Internally generated unique filename.
  - import_date (TIMESTAMP, DEFAULT CURRENT_TIMESTAMP)
  - doc_length (INTEGER) - e.g., number of words or characters.
- Tags: Stores unique tags identified in the documents.
  - tag_id (INTEGER, PK, Autoincrement)
  - tag_text (TEXT, NOT NULL, UNIQUE)
- DocumentTags: A linking table associating documents with tags and their relevance scores.
  - doc_id (INTEGER, NOT NULL, FK to Documents)
  - tag_id (INTEGER, NOT NULL, FK to Tags)
  - tf_idf_score (REAL) - The TF-IDF score for the tag in relation to the document.
  - PRIMARY KEY (doc_id, tag_id)
  - Foreign keys have ON DELETE CASCADE to maintain data integrity.
Appropriate indexes are created on these tables to ensure efficient querying.
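The schema described above could be expressed roughly as follows. This is a sketch built from the column descriptions, not the project's actual DDL: the index name is made up, and the sample rows and query only illustrate how a tag lookup and the cascade behave. Note that SQLite enforces foreign keys only when PRAGMA foreign_keys is switched on for the connection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Documents (
    doc_id            INTEGER PRIMARY KEY AUTOINCREMENT,
    original_filename TEXT NOT NULL,
    stored_filename   TEXT NOT NULL UNIQUE,
    import_date       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    doc_length        INTEGER
);
CREATE TABLE IF NOT EXISTS Tags (
    tag_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    tag_text TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS DocumentTags (
    doc_id       INTEGER NOT NULL REFERENCES Documents(doc_id) ON DELETE CASCADE,
    tag_id       INTEGER NOT NULL REFERENCES Tags(tag_id) ON DELETE CASCADE,
    tf_idf_score REAL,
    PRIMARY KEY (doc_id, tag_id)
);
-- Illustrative index name; the composite primary key already covers doc_id lookups.
CREATE INDEX IF NOT EXISTS idx_documenttags_tag_id ON DocumentTags(tag_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(SCHEMA)

# Sample data (made up for the demonstration).
conn.execute(
    "INSERT INTO Documents (original_filename, stored_filename) VALUES (?, ?)",
    ("report.pdf", "3f2a.pdf"),
)
conn.execute("INSERT INTO Tags (tag_text) VALUES (?)", ("invoice",))
conn.execute("INSERT INTO DocumentTags VALUES (?, ?, ?)", (1, 1, 0.42))

# Documents carrying a given tag, best score first.
rows = conn.execute(
    "SELECT d.original_filename, dt.tf_idf_score "
    "FROM DocumentTags dt "
    "JOIN Documents d ON d.doc_id = dt.doc_id "
    "JOIN Tags t ON t.tag_id = dt.tag_id "
    "WHERE t.tag_text = ? ORDER BY dt.tf_idf_score DESC",
    ("invoice",),
).fetchall()

# Deleting the document removes its DocumentTags rows through the cascade.
conn.execute("DELETE FROM Documents WHERE doc_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM DocumentTags").fetchone()[0]
```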
Currently, ACCESS is in active development. To set up a development environment:
- Prerequisites:
  - Python 3.8 or higher.
- Clone the Repository:
    git clone <repository-url>
    cd access_v3  # Or your repository's root directory name
- Set up a Virtual Environment (Recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install Dependencies: The project uses pyproject.toml for project metadata. Core dependencies are minimal (built-in sqlite3). Development dependencies can be installed via:
    pip install -e .[dev]
  This installs the package in editable mode (-e .) and the development tools specified in [project.optional-dependencies].dev of pyproject.toml (like black, flake8, mypy, pytest).
- Initialize the Database (Automatic): Running any part of the application that instantiates the DatabaseManager (e.g., future CLI commands or scripts) will automatically create the data/learned/access_main.db file and its schema if they don't exist.
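The automatic initialization can be approximated with a short sketch. The class below is a hypothetical stand-in, not the project's DatabaseManager: its constructor signature is assumed, only one table is created to keep it short, and a temporary directory replaces the real project root.

```python
import sqlite3
import tempfile
from pathlib import Path


class DatabaseManager:
    """Hypothetical stand-in for the project's class; only the setup step is sketched."""

    def __init__(self, db_path: Path) -> None:
        # Create data/learned/ (or any missing parent directory) on first run.
        db_path.parent.mkdir(parents=True, exist_ok=True)
        # Connecting creates the database file if it does not exist yet.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("PRAGMA foreign_keys = ON")
        # IF NOT EXISTS makes repeated start-ups harmless.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS Tags ("
            "tag_id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "tag_text TEXT NOT NULL UNIQUE)"
        )
        self.conn.commit()

    def close(self) -> None:
        self.conn.close()


with tempfile.TemporaryDirectory() as root:
    db_file = Path(root) / "data" / "learned" / "access_main.db"
    DatabaseManager(db_file).close()  # first run creates directory, file, and table
    DatabaseManager(db_file).close()  # second run finds everything already in place
    created = db_file.exists()
```

Because every creation step is idempotent, instantiating the manager from several entry points does not require a separate setup command.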
We welcome contributions to ACCESS! Please follow these guidelines:
- Code Style:
  - The project uses black for code formatting, isort for import sorting, and flake8 for linting.
  - (Planned) Pre-commit hooks will be set up to automate these checks.
- Type Hinting:
  - mypy is used for static type checking. Please include type hints in your code.
- Branching:
  - Create a new branch for each feature or bug fix (e.g., feature/new-search-algorithm or fix/database-query-error).
- Commits:
  - Write clear and concise commit messages.
- Testing (Planned):
  - pytest will be used for unit and integration tests. Please write tests for new functionality.
- Documentation:
  - Keep inline comments and docstrings up to date.
  - Update the docs/dev_book.md file with significant architectural decisions or changes to the development plan.
- Pull Requests:
  - Ensure your code passes all linting and type checks before submitting a pull request.
  - (Planned) GitHub Actions will be set up for CI to automate checks on PRs.
Refer to docs/dev_book.md for a more detailed overview of the current development status, the architectural choices, and a roadmap of immediate next steps.
This project is licensed under the MIT License - see the LICENSE file for details.
- Implementation of the full learn and search commands.
- Development of the DocumentProcessor, TagExtractor, and ScoringEngine components.
- Strategies for tag expansion and synonyms.
- Support for various file types (PDF, DOCX, etc.).
- Potential GUI development.
- Enhanced scalability for very large document sets.