In the task of near-similar image search, features from a Deep Neural Network are often used to compare images and measure similarity. Our system searches this data by computing the k-nearest neighbors using both image and text (both derived from the image and provided by the dataset/user). We reduce the problem of searching over multiple modes to a single space by drawing apt correlations between the modes. Algorithmic details can be found in [1] and report.pdf.
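As a rough illustration of the single-space search, the sketch below concatenates a visual and a textual embedding into one vector and indexes it with Annoy. The dimensions, the plain-concatenation fusion, and the random placeholder data are assumptions for illustration only, not the repository's exact method.

```python
import numpy as np
from annoy import AnnoyIndex

# Illustrative dimensions and random placeholder data; the real visual and
# textual feature sizes depend on the chosen backbone and embedding scheme.
VISUAL_DIM, TEXT_DIM = 2048, 300
JOINT_DIM = VISUAL_DIM + TEXT_DIM

def joint_embedding(visual_vec, text_vec):
    """Fuse the two modalities into a single vector so one k-NN index
    can be searched (plain concatenation shown here as a sketch)."""
    return np.concatenate([visual_vec, text_vec]).astype("float32")

rng = np.random.default_rng(0)
database = [(rng.standard_normal(VISUAL_DIM), rng.standard_normal(TEXT_DIM))
            for _ in range(100)]

# Build an approximate nearest-neighbor index over the database items.
index = AnnoyIndex(JOINT_DIM, "angular")
for item_id, (v_vec, t_vec) in enumerate(database):
    index.add_item(item_id, joint_embedding(v_vec, t_vec))
index.build(10)  # number of trees: speed/accuracy trade-off

# Query with a fused image + text vector and retrieve the 5 nearest items.
query_visual, query_text = database[0]
neighbours = index.get_nns_by_vector(joint_embedding(query_visual, query_text), 5)
print(neighbours)
```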
- Install the `pytorch`, `torchvision`, `pickle`, `annoy` and `tkinter` libraries.
- Ensure the paths are correct, vis-à-vis the `model_path` and `pwd` variables in all the files of the relevant subdirectory (see the sketch after this list).
- When executing for the first time, the models will be saved in the `models/` directory. Later, they can be loaded from the corresponding `.pth` files.
- Update the execution environment in `user.py` and run `python user.py` to receive a prompt for uploading a query image and text, and voila!
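The `model_path` and `pwd` variables mentioned above are plain Python variables near the top of the scripts. The values below are only illustrative; point them at your local checkout and preferred checkpoint location (the actual filenames in the repository may differ).

```python
# Illustrative values only; adjust to your local checkout.
pwd = "/home/you/one-shot-item-search"       # repository root / working directory
model_path = pwd + "/models/encoder.pth"     # hypothetical checkpoint filename
```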
- `user.py` - top-level user interface; prompts the user for a query image and text, and presents the retrieved results
- `metric.py` - computes a similarity index to analyse the quality of the retrieved results (see the sketch below)
- `text_embedding.ipynb` - contains functionality to generate the text embeddings from a text file
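As a rough stand-in for what `metric.py` might compute, the sketch below scores a retrieved set by its average cosine similarity to the query in feature space; the repository's actual similarity index may be defined differently.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity_index(query_vec, retrieved_vecs):
    """Average query-to-result similarity, a rough quality score for a
    retrieved set (a stand-in, not necessarily metric.py's definition)."""
    return float(np.mean([cosine_similarity(query_vec, r) for r in retrieved_vecs]))
```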
Contains files for the DeepFashion dataset, trained on the inception_v3 model. w = 2000 is set as default. Here:
- `captions.json` - contains the per-image captions, generated from the label files
- `train.py` - houses functionality to generate the learnt database, compressing and storing visual and textual features
- `query.py` - houses functionality to read and query the database for an image-text pair
- `classifier.py` - implementation of the model for generating one-hot encodings and embeddings for features inferred from the image
- `visual.py` - implementation of the Inception framework for extracting visual features of the image (see the sketch below)
- `preprocess_words.py` - for stopword elimination and text -> vector conversion
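For reference, a minimal feature-extraction sketch along the lines of what `visual.py` is described as doing: a pretrained `inception_v3` with its classification head removed, yielding a 2048-d vector per image. The preprocessing pipeline and weight-loading call are standard torchvision usage, not necessarily the repository's exact code (older torchvision versions use `pretrained=True` instead of `weights=...`).

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load inception_v3 and drop the classification head, keeping pooled features.
model = models.inception_v3(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(342),
    transforms.CenterCrop(299),          # inception_v3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def visual_features(image_path):
    """Return a 2048-d visual feature vector for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)     # shape: (2048,)
```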
Contains files for the DeepFashion dataset, trained on the inception_v3 model. The caption file used here has colors appended to it; this had to be done manually, since the dataset does not provide color labels. A separate multi-label classifier was trained to extract 11 colors, and the colors generated per image were appended to the caption file. The files used for this are bundled in the color/ subdirectory. w = 200 is set by default. The other files are similar to inception_df/. This lets the user also query for terms like [blue] or [brown]!
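A minimal sketch of the color-appending step, assuming a sigmoid multi-label head over 11 color classes and a fixed threshold; the actual color vocabulary, model, and threshold used in `color/` may differ.

```python
import torch

# Hypothetical 11-color vocabulary; the color/ model's actual label set may differ.
COLORS = ["black", "white", "red", "blue", "green", "yellow",
          "brown", "grey", "pink", "purple", "orange"]

def colors_for_image(logits, threshold=0.5):
    """Turn multi-label logits into color words that can be appended
    to an image's caption."""
    probs = torch.sigmoid(logits)
    return [c for c, p in zip(COLORS, probs.tolist()) if p >= threshold]

# e.g.: caption = caption + " " + " ".join(colors_for_image(color_model(img)))
```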
Contains files for the DeepFashion dataset, trained on the VGG19 model. This was done to analyse the performance difference of the retrieval system by contrasting with the inception_v3 model, and to experiment with the vector size of the visual features used in the dataset. w = 500 is set by default.
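For contrast with the inception_v3 sketch above, one common VGG19 feature choice is the 4096-d penultimate fully connected layer; whether the repository keeps this layer or a different one (and hence a different vector size) is an assumption here.

```python
import torch
from torchvision import models

# Sketch of a VGG19 feature extractor: keep the 4096-d penultimate fc layer
# instead of inception_v3's 2048-d pooled features.
vgg = models.vgg19(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

with torch.no_grad():
    feats = vgg(torch.randn(1, 3, 224, 224))  # shape: (1, 4096)
```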
Contains files for the Fashion 200k dataset (w = 200 as default), where:
- `captions_200.json` - contains the per-image captions, generated from label files
- `train_200.py` - houses functionality to generate the learnt database, compressing and storing visual and textual features
- `query_200.py` - houses functionality to read and query the database for an image-text pair
- `models_200.py` - implementation of the visual encoder used for this dataset
- `preprocess_words.py` - for stopword elimination and text -> vector conversion (see the sketch below)
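A minimal sketch of stopword elimination and a bag-of-words text -> vector conversion in the spirit of `preprocess_words.py`; the stopword list, vocabulary, and vectorisation scheme here are placeholders, not the repository's actual ones.

```python
import re
from collections import Counter

# Placeholder stopword list; preprocess_words.py may use a different one.
STOPWORDS = {"a", "an", "the", "with", "and", "of", "in", "for"}

def tokenize(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def text_to_vector(text, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

# Example: text_to_vector("blue denim jacket with hood", ["blue", "denim", "hood"])
```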
[1] Jonghwa Yim, Junghun James Kim, Daekyu Shin, “One-Shot Item Search with Multimodal Data”
[2] The DeepFashion - Multimodal Dataset
[4] Spotify's Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
A team project by Atharv Dabli and Gurarmaan S. Panjeta.

