diff --git a/Python/CosmosDB-NoSQL_VectorSearch/CosmosDB-NoSQL-Vector_AzureOpenAI_Tutorial.ipynb b/Python/CosmosDB-NoSQL_VectorSearch/CosmosDB-NoSQL-Vector_AzureOpenAI_Tutorial.ipynb new file mode 100644 index 0000000..afe8fb0 --- /dev/null +++ b/Python/CosmosDB-NoSQL_VectorSearch/CosmosDB-NoSQL-Vector_AzureOpenAI_Tutorial.ipynb @@ -0,0 +1,595 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "936d13db-daa6-4dd1-9cbf-87e41de8eeb0", + "metadata": {}, + "source": [ + "# Building a RAG application with Azure Cosmos DB for NoSQL\n", + "In this notebook, we'll go step-by-step and build a RAG application. We'll demonstrate how you can use Azure Cosmos DB for NoSQL as the knowledgebase for your RAG application.\n", + "The tutorial is structured as below:\n", + "1. Pre setup - Provision Azure Cosmos DB and OpenAI resources\n", + "2. Get the OpenAI and Azure Cosmos DB Account keys and endpoints\n", + "3. Install the requisite python libraries\n", + "4. Load sample data (about Azure app service) into the notebook\n", + "5. Generate embeddings using OpenAI model and update the data with the embeddings\n", + "6. Create an Azure Cosmos DB database\n", + "7. Create an Azure Cosmos DB container with vector embeddings and indexing policy\n", + "8. Take user question input in natural language and perform a vector search on the data stored in Cosmos DB to filter the most relevant items to pass on to the LLM.\n", + "9. Use OpenAI GPT3.5 model to generate responses to the user questions based on the filtered data.\n" + ] + }, + { + "cell_type": "markdown", + "id": "632ebef7-03ce-4f98-a512-eb290818ca4a", + "metadata": {}, + "source": [ + "# Create an Azure Cosmos DB for NoSQL resource\n", + "\n", + "Let's start by creating an Azure Cosmos DB for NoSQL Resource (Cosmos DB Account) by following [this section in the Quickstart guide](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/quickstart-portal#create-account)\n", + "\n", + "\n", + "While creating the account, it is recommended that you select the **\"Serverless\" Capacity Mode** for this tutorial.\n", + "\n", + "\n", + "## Get Cosmos DB Account Key and Endpoint\n", + "Once the account is provisioned, head over to the provisioned account and navigate to **\"Settings > Keys\"** section in the left-side panel. From the Keys section, make a note of the **Primary Key and the URI** - these will be used later to connect to the cosmos DB account through the python client.\n", + "Store the Primary Key and URI in a .env file" + ] + }, + { + "cell_type": "markdown", + "id": "c934e505-7165-4f56-a9f3-e638206d9f5c", + "metadata": {}, + "source": [ + "# Provision Azure Open AI resource\n", + "Finally, let's setup our Azure OpenAI resource Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at [https://aka.ms/oai/access](https://aka.ms/oai/access)\n", + "\n", + "Once you have access, complete the following step:\n", + "1. Create an Azure OpenAI resource [following this quickstart](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=-eb-portal)\n", + "2. Deploy an embeddings model. For more information on embeddings, refer to [this article](https://learn.microsoft.com/azure/ai-services/openai/how-to/embeddings)\n", + "3. Deploy a completions model. For more information on completions, refer to [this article](https://learn.microsoft.com/azure/ai-services/openai/how-to/completions)\n", + "4. Make a note of the endpoint and key for your Azure OpenAI resource\n", + "5. Make a note of the **deployment names** of the embedding and completion models.\n", + "\n", + "Store the Endpoint, Key, and deployment names in the .env file\n" + ] + }, + { + "cell_type": "markdown", + "id": "044416e3-170c-480a-af28-28cd9cb979d4", + "metadata": {}, + "source": [ + "# Install the required libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a3cb0ff-64ca-46c4-931c-07e2310083a3", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "! pip install numpy\n", + "! pip install openai\n", + "! pip install python-dotenv\n", + "! pip install azure-core\n", + "! pip install azure-cosmos" + ] + }, + { + "cell_type": "markdown", + "id": "7f13616f-6ee9-47db-86d3-675d35017485", + "metadata": {}, + "source": [ + "# Necessary imports" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f63a30f2-bca4-4093-9d03-26e8bf176568", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import datetime\n", + "import time\n", + "import urllib \n", + "\n", + "from azure.core.exceptions import AzureError\n", + "from azure.core.credentials import AzureKeyCredential\n", + "\n", + "#Cosmos DB imports\n", + "from azure.cosmos import CosmosClient\n", + "from azure.cosmos.aio import CosmosClient as CosmosAsyncClient\n", + "from azure.cosmos import PartitionKey, exceptions\n", + "\n", + "from openai import AzureOpenAI\n", + "from dotenv import load_dotenv" + ] + }, + { + "cell_type": "markdown", + "id": "7b264a9c-c4b0-494a-9408-5e2e5d47ce6c", + "metadata": {}, + "source": [ + "# Load Keys, Endpoints, and other variables from the .env file" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "247252e7-905a-4eb7-a3ee-1296e7233cad", + "metadata": {}, + "outputs": [], + "source": [ + "from dotenv import dotenv_values\n", + "\n", + "# specify the name of the .env file name \n", + "env_name = \"sample_env_file.env\" # following sample_env.env template change to your own .env file name\n", + "config = dotenv_values(env_name)\n", + "\n", + "OPENAI_API_KEY = config['openai_key']\n", + "OPENAI_API_ENDPOINT = config['openai_endpoint']\n", + "OPENAI_API_VERSION = config['openai_api_version'] # at the time of authoring, the api version is 2024-02-01\n", + "COMPLETIONS_MODEL_DEPLOYMENT_NAME = config['openai_completions_deployment']\n", + "EMBEDDING_MODEL_DEPLOYMENT_NAME = config['openai_embeddings_model']\n", + "COSMOSDB_NOSQL_ACCOUNT_KEY = config['cosmos_key']\n", + "COSMOSDB_NOSQL_ACCOUNT_ENDPOINT = config['cosmos_uri']" + ] + }, + { + "cell_type": "markdown", + "id": "7d72da5a-97f3-4ca7-a108-03c589cb7316", + "metadata": {}, + "source": [ + "# Instantiate the Azure Open AI client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d9d8f16-97d8-483c-9683-dfbf08494696", + "metadata": {}, + "outputs": [], + "source": [ + "AOAI_client = AzureOpenAI(api_key=OPENAI_API_KEY, azure_endpoint=OPENAI_API_ENDPOINT, api_version=OPENAI_API_VERSION,)" + ] + }, + { + "cell_type": "markdown", + "id": "e35a3ecb-88a1-4960-8c1b-7a5aac58fcba", + "metadata": {}, + "source": [ + "# Generating Embedding\n", + "We'll use the deployed embeddings model to generate the embeddings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aed68e4a-b7dc-43a3-8b73-ff7af6b774a9", + "metadata": {}, + "outputs": [], + "source": [ + "def generate_embeddings(text):\n", + " '''\n", + " Generate embeddings from string of text.\n", + " This will be used to vectorize data and user input for interactions with Azure OpenAI.\n", + " '''\n", + " response = AOAI_client.embeddings.create(input=text, model=EMBEDDING_MODEL_DEPLOYMENT_NAME)\n", + " embeddings =response.model_dump()\n", + " time.sleep(0.5) \n", + " return embeddings['data'][0]['embedding']" + ] + }, + { + "cell_type": "markdown", + "id": "f3a01080-1bec-489a-b3cb-e8d98cdea7a8", + "metadata": {}, + "source": [ + "# Load the data with embeddings or generate embeddings\n", + "We have a sample data file with embeddings but you can generate the embeddings afresh before uploading the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5eb3dff-e7f4-41ad-9bb5-e2667be9fc40", + "metadata": {}, + "outputs": [], + "source": [ + "# Load text-sample_w_embeddings.json which has embeddings pre-computed\n", + "data_file = open(file=\"../../DataSet/AzureServices/text-sample_w_embeddings.json\", mode=\"r\") \n", + "\n", + "# OR Load text-sample.json data file. Embeddings will need to be generated using the function below.\n", + "# data_file = open(file=\"../../DataSet/AzureServices/text-sample.json\", mode=\"r\")\n", + "\n", + "data = json.load(data_file)\n", + "data_file.close()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c64529f-b09e-4d26-a816-27fe7b0b08b9", + "metadata": {}, + "outputs": [], + "source": [ + "# Take a peek at one data item\n", + "print(json.dumps(data[0], indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69070d49-04ea-45d2-8269-ac2597200255", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate embeddings for title and content fields\n", + "n = 0\n", + "for item in data:\n", + " n+=1\n", + " item['id'] = str(n)\n", + " title = item['title']\n", + " content = item['content']\n", + " title_embeddings = generate_embeddings(title)\n", + " content_embeddings = generate_embeddings(content)\n", + " item['titleVector'] = title_embeddings\n", + " item['contentVector'] = content_embeddings\n", + " item['@search.action'] = 'upload'\n", + " print(\"Creating embeddings for item:\", n, \"/\" ,len(data), end='\\r')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d3db5b1-5d3a-413e-9914-4f8eb486ae38", + "metadata": {}, + "outputs": [], + "source": [ + "# Save embeddings to sample_text_w_embeddings.json file\n", + "with open(\"../../DataSet/AzureServices/text-sample_w_embeddings.json\", \"w\") as f:\n", + " json.dump(data, f)" + ] + }, + { + "cell_type": "markdown", + "id": "708d72eb-522e-489c-8230-ad40aab7def9", + "metadata": {}, + "source": [ + "# Connect and setup Cosmos DB for NoSQL\n", + "Now that we have the data with embeddings ready, we need to upload this data to Azure Cosmos DB container with vector search capability. For this, we need to create a new container (as vector search is currently supported in new containers only) with vector embedding and indexing policy.\n", + "\n", + "## Set up the connection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7eadab3-b1ee-4246-b16a-7d0406afb3c8", + "metadata": {}, + "outputs": [], + "source": [ + "cosmos_client = CosmosClient(url=COSMOSDB_NOSQL_ACCOUNT_ENDPOINT, credential=COSMOSDB_NOSQL_ACCOUNT_KEY)" + ] + }, + { + "cell_type": "markdown", + "id": "e5c8a087-b9f1-4585-8800-dffae64ec07d", + "metadata": {}, + "source": [ + "## Create a new database or use existing one" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1da236ab-35c4-4ed1-8c88-01149290bcd5", + "metadata": {}, + "outputs": [], + "source": [ + "#create database\n", + "DATABASE_NAME = \"vector-nosql-db\"\n", + "db= cosmos_client.create_database_if_not_exists(\n", + " id=DATABASE_NAME\n", + ")\n", + "properties = db.read()\n", + "print(json.dumps(properties))" + ] + }, + { + "cell_type": "markdown", + "id": "7d3b706f-4f6e-4920-8e7e-a8194645cc78", + "metadata": {}, + "source": [ + "## Author the vector embedding policy\n", + "Vector embedding policy defines the necessary information for the vector search queries as detailed below: \n", + "* “path”: what properties contain vectors \n", + "* “datatype”: What type are the vector’s elements (default Float32) \n", + "* “dimensions”: The length of each vector in the path (default 1536) \n", + "* “distanceFunction”: The metric used to compute distance/similarity (default Cosine)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8773dd2f-e773-451a-84e6-8fffedc1b709", + "metadata": {}, + "outputs": [], + "source": [ + "vector_embedding_policy = {\n", + " \"vectorEmbeddings\": [\n", + " {\n", + " \"path\":\"/titleVector\",\n", + " \"dataType\":\"float32\",\n", + " \"distanceFunction\":\"dotproduct\",\n", + " \"dimensions\":1536\n", + " },\n", + " {\n", + " \"path\":\"/contentVector\",\n", + " \"dataType\":\"float32\",\n", + " \"distanceFunction\":\"cosine\",\n", + " \"dimensions\":1536\n", + " }\n", + " ]\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "676249b5-5f67-46d4-a583-bfcb603b993f", + "metadata": {}, + "source": [ + "## Add vector indexes to indexing policy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5ba1f4d-2881-4d3a-8670-aeb9656b8651", + "metadata": {}, + "outputs": [], + "source": [ + "indexing_policy = {\n", + " \"includedPaths\": [\n", + " {\n", + " \"path\": \"/*\"\n", + " }\n", + " ],\n", + " \"excludedPaths\": [\n", + " {\n", + " \"path\": \"/\\\"_etag\\\"/?\"\n", + " }\n", + " {\n", + " \"path\": \"/titleVector/*\"\n", + " }\n", + " {\n", + " \"path\": \"/contentVector/*\"\n", + " }\n", + " ],\n", + " \"vectorIndexes\": [\n", + " {\"path\": \"/titleVector\",\n", + " \"type\": \"quantizedFlat\"\n", + " },\n", + " {\"path\": \"/contentVector\",\n", + " \"type\": \"quantizedFlat\"\n", + " }\n", + " ]\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "id": "08a30e78-3016-4644-8708-84ec2cb9ad0b", + "metadata": {}, + "source": [ + "## Create container with the embedding and indexing policy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad03e59a-570a-45d4-9d19-1331c8b478ea", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "CONTAINER_NAME = \"vector-nosql-cont\"\n", + "try: \n", + " container = db.create_container_if_not_exists(\n", + " id=CONTAINER_NAME,\n", + " partition_key=PartitionKey(path='/id', kind='Hash'),\n", + " indexing_policy=indexing_policy,\n", + " vector_embedding_policy=vector_embedding_policy)\n", + "\n", + " print('Container with id \\'{0}\\' created'.format(id))\n", + "\n", + "except exceptions.CosmosResourceExistsError:\n", + " print('A container with id \\'{0}\\' already exists'.format(id))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cba1e8e7-b97f-40e4-9617-650cc6d29e15", + "metadata": {}, + "outputs": [], + "source": [ + "container = db.get_container_client(CONTAINER_NAME)" + ] + }, + { + "cell_type": "markdown", + "id": "c56e8954-0bd7-4d79-ab59-cd10d0e3b544", + "metadata": {}, + "source": [ + "## Upload data to the container\n", + "Azure Cosmos DB Python SDK does not currently support bulk inserts so we'll have to insert the items sequentially" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94994535-4d75-41c3-8d04-e6019998feed", + "metadata": {}, + "outputs": [], + "source": [ + "with open('../../DataSet/AzureServices/text-sample_w_embeddings.json') as f:\n", + " data = json.load(f)\n", + "\n", + "container_client = db.get_container_client(CONTAINER_NAME)\n", + "\n", + "for item in data:\n", + " print(\"writing item \", item['id'])\n", + " container_client.upsert_item(item)" + ] + }, + { + "cell_type": "markdown", + "id": "b716aea3-19e7-4023-b21a-57ed4e9ab084", + "metadata": {}, + "source": [ + "## Vector search in Azure Cosmos DB for NoSQL\n", + "Let's write a function that will take in user's query, generate embeddings for the query text and then use the embedding to run a vector search to find the similar items. The most similar items must be used as additional knowledgebase for the completions model to answer the user's query" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4735a6c-8eb5-4f35-bb2d-e2a69c2564d0", + "metadata": {}, + "outputs": [], + "source": [ + "# Simple function to assist with vector search\n", + "def vector_search(query, num_results=5):\n", + " query_embedding = generate_embeddings(query)\n", + " results = container.query_items(\n", + " query='SELECT TOP @num_results c.content, c.title, c.category, VectorDistance(c.contentVector,@embedding) AS SimilarityScore FROM c ORDER BY VectorDistance(c.contentVector,@embedding)',\n", + " parameters=[\n", + " {\"name\": \"@embedding\", \"value\": query_embedding}, \n", + " {\"name\": \"@num_results\", \"value\": num_results} \n", + " ],\n", + " enable_cross_partition_query=True)\n", + " #correct this\n", + " return results" + ] + }, + { + "cell_type": "markdown", + "id": "899f62a4-5c73-4362-9b0b-092534e0401a", + "metadata": {}, + "source": [ + "Let's run a test below" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e5edd1d-c7c8-44d2-a357-939237500bbb", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"What are some NoSQL databases in Azure?\"#\"What are the services for running ML models?\"\n", + "results = vector_search(query)\n", + "for result in results: \n", + "# print(result)\n", + " print(f\"Similarity Score: {result['similarityScore']}\") \n", + " print(f\"Title: {result['title']}\") \n", + " print(f\"Content: {result['content']}\") \n", + " print(f\"Category: {result['category']}\\n\") " + ] + }, + { + "cell_type": "markdown", + "id": "53d9941e-470e-4685-bea1-4ed690e6d652", + "metadata": {}, + "source": [ + "# Q&A over the data with GPT-3.5\n", + "Finally, we'll create a helper function to feed prompts into the Completions model. Then we'll create interactive loop where you can pose questions to the model and receive information grounded in your data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d24c8f2f-d537-4763-b161-734af946c7a9", + "metadata": {}, + "outputs": [], + "source": [ + "#This function helps to ground the model with prompts and system instructions.\n", + "\n", + "def generate_completion(vector_search_results, user_prompt):\n", + " system_prompt = '''\n", + " You are an intelligent assistant for Microsoft Azure services.\n", + " You are designed to provide helpful answers to user questions about Azure services given the information about to be provided.\n", + " - Only answer questions related to the information provided below, provide at least 3 clear suggestions in a list format.\n", + " - Write two lines of whitespace between each answer in the list.\n", + " - If you're unsure of an answer, you can say \"\"I don't know\"\" or \"\"I'm not sure\"\" and recommend users search themselves.\"\n", + " - Only provide answers that have products that are part of Microsoft Azure and part of these following prompts.\n", + " '''\n", + "\n", + " messages=[{\"role\": \"system\", \"content\": system_prompt}]\n", + " for item in vector_search_results:\n", + " messages.append({\"role\": \"system\", \"content\": item['content']})\n", + " messages.append({\"role\": \"user\", \"content\": user_prompt})\n", + " response = AOAI_client.chat.completions.create(model=COMPLETIONS_MODEL_DEPLOYMENT_NAME, messages=messages,temperature=0)\n", + " \n", + " return response.dict()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c406f424-4726-49c5-8cd7-c7174f861474", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a loop of user input and model output. You can now perform Q&A over the sample data!\n", + "\n", + "user_input = \"\"\n", + "print(\"*** Please ask your model questions about Azure services. Type 'end' to end the session.\\n\")\n", + "user_input = input(\"User prompt: \")\n", + "while user_input.lower() != \"end\":\n", + " search_results = dummy_vector_search()\n", + " completions_results = generate_dummy_completion(search_results, user_input)\n", + " print(\"\\n\")\n", + " print(completions_results['choices'][0]['message']['content'])\n", + " user_input = input(\"User prompt: \")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d6909f8-0cfa-4844-9765-f669721c7691", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}