Yoshio Terada for Microsoft Azure

How to use the Azure OpenAI Embedding model to find the most relevant documents

1. Introduction

When I first came across terms like "Embedding", "Integrated", or "Embedded", I was quite puzzled, because I had never worked with machine learning concepts before. I didn't understand what could be done with this model. Therefore, I decided to research the model and actually use it to grasp its usefulness. In this entry, I'd like to share my findings with you.

2. High-Accuracy and Affordable Models Have Arrived!

OpenAI has a model called text-embedding-ada-002 for text embedding purposes. By using this model, you can find the most relevant documents at a much lower cost.

Other embedding models are available, such as the Davinci model and a few others. However, text-embedding-ada-002 offers higher accuracy in most tasks and is considerably more affordable than the others, so it is the recommended choice for tasks like similarity search.

According to the Azure OpenAI Service Pricing, as of June 8, 2023, there were the following differences in pricing:

Embedding Models   Per 1,000 Tokens
Ada                $0.0004
Babbage            $0.005
Curie              $0.02
Davinci            $0.20

3. What is Embedding?

Before using embeddings, it is essential to understand the basic concept; otherwise, we won't know how to use them effectively. In short, embeddings are based on the concept of vectors, which we learned in high school.

In high school, we learned that two arrows represent the same vector if they point in the same direction and have the same length. Applying this idea, the vectors whose direction and length are closest to a given vector are the ones most similar to it. In the illustration below, the blue line represents the reference vector. Vector "1", which lies closest to the blue line, corresponds to the content most similar to it.

Vector Image

In practice, when you pass a natural language string to the text-embedding-ada-002 model, you will obtain a high-dimensional array (vector) consisting of 1536 floating-point numbers, as shown below.

Example: An excerpt of the result for the string "Please tell me about Azure Blob"

[-0.013197514, -0.025243968, 0.011384923, -0.015929632, -0.006410221, 0.031038966, 
-0.016921926, -0.010776317, -0.0019300125, -0.016300088, 0.01767607, -0.0047100903,
0.009691408, -0.014183193, -0.017001309, -0.014434575, 0.01902559, 0.010961545, 
0.013561356, -0.017371766, -0.007964816, 0.0026841562, 0.0019663966, -0.0019878964, 
-0.025614424, -0.0030298054, 0.020229574, -0.01455365, 0.022703694, -0.02033542, 
0.035696134, -0.002441044, -0.008057429, 0.0061191483, 0.004263558, -0.0025518502,
0.018046526, 0.011411385, 0.0063804523, -0.0021020102, 0.027572552, -0.017967142,
0.0077663567, 0.005361697, -0.0116693815, 0.004524862, -0.043581568, -0.01028017, 

.... Omitted

-0.0017265921, 0.083035186, -0.006205147, -0.008646191, 0.0070651355, -0.019052051, 
0.008374964, 0.024225213, 0.01522841, 0.019951731, -0.006516066, 0.017967142, 
0.0058082296, -0.0053253127, -0.009929558, -0.039109625, -0.031277116, -0.015863478, 
0.011040928, 0.012529369, 0.013012286, 0.022981536, -0.013706892, 0.012965979, 
0.011953839, -0.01903882, 0.015347485, 0.019052051, -0.0046538603, 0.012191989, 
-0.020983716, 0.0078722015, -0.0018605519, -0.02775778, -0.026739024, -0.010359553, 
-0.013918581, -0.011933993, 0.0066814483, 0.005196315, -0.0045744767, -2.7598185E-4, 
0.012251527, -0.018178832, -0.013276898, 0.011709073, -0.022928614, 0.002131779, 
-0.007462053, 0.0044554016]

The method using embeddings involves comparing the vector of the user's input string with pre-saved vectors in a database and searching for the closest ones.

There are several ways to find the closest match, but one of the most commonly used is cosine similarity. Passing two vectors to a cosine similarity calculation yields a result between -1 and 1. This calculation is performed against each vector stored in the database.

Ultimately, the content with the result closest to 1 is considered the most similar.

Example of a user's input text:
AAA BBB CCC DDD EEE FFF GGG HHH III JJJ

Calculation results of cosine similarity with stored strings in the database:
AAA BBB CCC DDD EEE FFF GGG HHH III JJ   = 0.9942949478994986  <---- Closest to 1 (highest similarity)
AAA BBB KKK LLL MMM NNN OOO PPP QQQ RRR  = 0.930036739659776   <---- Next closest to 1
Today, we are presenting at an IT event. = 0.7775105340227892  <---- Furthest from 1

In this way, the user's input query string is vectorized, and a similarity search is performed by comparing that vector with the pre-saved vectors (arrays) in the database.
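The comparison described above can be sketched in a few lines of Java. This is only an illustration, not the code used later in this article: the class name `CosineSimilaritySketch` and its methods are hypothetical, and the tiny three-dimensional vectors stand in for real 1536-dimensional embeddings. In a real system the database performs this search, so the in-memory loop here is just meant to show the arithmetic.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of any Azure or OpenAI library.
class CosineSimilaritySketch {

    /** Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|). */
    static double cosineSimilarity(List<Double> a, List<Double> b) {
        if (a.isEmpty() || a.size() != b.size()) {
            throw new IllegalArgumentException("Vectors must be non-empty and have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.size(); i++) {
            double x = a.get(i);
            double y = b.get(i);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the index of the stored vector whose similarity to the query is highest. */
    static int indexOfMostSimilar(List<Double> query, List<List<Double>> stored) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < stored.size(); i++) {
            double score = cosineSimilarity(query, stored.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny 3-dimensional stand-ins for real embeddings.
        List<Double> query = List.of(1.0, 2.0, 3.0);
        List<List<Double>> stored = List.of(
                List.of(-1.0, -2.0, -3.0),  // opposite direction
                List.of(2.0, 4.0, 6.0),     // same direction, different length
                List.of(3.0, -1.0, 0.5));   // different direction
        System.out.println("Most similar index: " + indexOfMostSimilar(query, stored));
    }
}
```

Note that cosine similarity only looks at direction, so the second stored vector scores highest even though its length differs from the query.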

4. Points to consider when handling strings

Token limit

The maximum input that text-embedding-ada-002 can handle is 8192 tokens. A token is not the same as a character: as a rough guide, English text averages about four characters per token, while Japanese and some other languages can need one token or more per character. Texts that exceed the limit must be split before they are sent to the model.
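One simple way to split long text is sketched below. It is an assumption-laden illustration rather than a recommended implementation: the class `TextChunkerSketch` and the method `splitByLength` are hypothetical, and the limit is expressed in characters as a stand-in for a token budget. For exact limits you would count tokens with a real tokenizer and choose the character budget conservatively.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class TextChunkerSketch {

    /**
     * Splits text into chunks of at most maxChars characters. When a chunk would
     * end in the middle of the text, the split is moved back to the last space
     * inside the window (if there is one) so that words stay intact.
     * NOTE: maxChars is a character budget, which only approximates a token budget.
     */
    static List<String> splitByLength(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                int lastSpace = text.lastIndexOf(' ', end);
                if (lastSpace > start) {
                    end = lastSpace;
                }
            }
            String chunk = text.substring(start, end).trim();
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : splitByLength("aaa bbb ccc", 7)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Each chunk would then be embedded and stored as its own row, so a search can return the most relevant part of a long document.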

Preparing strings for processing

As noted in the "Replace newlines with a single space" section of the Azure OpenAI documentation, it has been confirmed that "the presence of newline characters may result in unexpected outcomes."

Hence, it is recommended to replace newline characters (\n) with a space before sending the text to text-embedding-ada-002.

For example, if you have a text like the one below, replace all newline characters and convert it into a single line string (please refer to the sample code provided).

Visual Studio Code for Java is an open-source code editor provided by Microsoft.

It is an extension on Visual Studio Code that supports the Java programming language.

It provides an environment for Java developers to efficiently write, build, debug, test, and execute code. Visual Studio Code is a lightweight text editor that supports multiple languages and is known for its high extensibility. For Java developers, Visual Studio Code for Java can be used as an excellent development environment.

Visual Studio Code for Java can be used by installing the Java Development Kit (JDK), a basic toolset for compiling and running Java programs.

To start a Java project in Visual Studio Code, first install the JDK, and then install the Java Extension Pack from the Visual Studio Code extension marketplace.

The Java Extension Pack includes features such as Java language support, debugging, testing, and project management.

The main features of Visual Studio Code for Java are as follows:

1. Syntax highlighting: Color-coding for Java code to improve readability.
2. Code completion: Suggests possible code during input, allowing for efficient code writing.
3. Code navigation: Easy jumping to classes, methods, and variables, and searching for definitions.
4. Refactoring: Features to change code structure and names, improving code quality.
5. Debugging: Setting breakpoints, step execution, and variable monitoring.
6. Testing: Supports testing frameworks like JUnit and TestNG, creating, running, and displaying test results.
7. Project management: Supports build tools like Maven and Gradle, managing project configurations and dependencies.
8. Git integration: Integrated with Git for source code version control.

Visual Studio Code for Java offers a wealth of features to improve developer productivity and provides an environment suitable for Java project development. In addition, the Visual Studio Code extension marketplace offers various Java-related extensions that can be added as needed. With these extensions, Java developers can use Visual Studio Code as an integrated development environment.
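A minimal sketch of this preprocessing step in Java is shown below. The class name `NewlineNormalizerSketch` and the method `toSingleLine` are hypothetical names chosen for this example. It relies on the `\R` line-break matcher of `java.util.regex`, so Windows-style and Unix-style line endings are both handled, and a run of consecutive line breaks becomes one space.

```java
// Hypothetical helper for illustration only.
class NewlineNormalizerSketch {

    /** Replaces every run of line breaks with a single space and trims both ends. */
    static String toSingleLine(String text) {
        return text.replaceAll("\\R+", " ").trim();
    }

    public static void main(String[] args) {
        String multiLine = "Visual Studio Code for Java is an open-source code editor.\n\n"
                + "It is an extension on Visual Studio Code.\n";
        System.out.println(toSingleLine(multiLine));
    }
}
```

The resulting single-line string is what would be passed to the embedding call.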

4. Verification of Operation

4.1 Vector DB options available in Azure

As of June 8, 2023, there are several options for storing vectors in Azure. Please choose the appropriate database according to your needs.

  1. How to enable and use pgvector on Azure Database for PostgreSQL - Flexible Server
  2. Using vector search on embeddings in Azure Cosmos DB for MongoDB vCore
  3. Azure Cognitive Search (Private Preview)
  4. Azure Cache for Redis Enterprise

4.2 Vector Search with Azure PostgreSQL Flexible Server

As mentioned earlier, there are various options for Vector DB, and you can choose the one that suits your needs. However, for verification purposes, we have decided to use PostgreSQL Flexible Server in this case. The steps to set up PostgreSQL Flexible Server to handle vectors are described below. Please give it a try if you're interested. If you choose a different persistence destination, please skip this section.

4.2.1 Setting environment variables

To create resources on Azure, please modify and set the following environment variables accordingly.

export RESOURCE_GROUP=PostgreSQL
export LOCATION=eastus
export POSTGRES_SERVER_NAME=yoshiopgsql3
export POSTGRES_USER_NAME=yoterada
export POSTGRES_USER_PASS='!'$(head -c 12 /dev/urandom | base64 | tr -dc '[:alpha:]'| fold -w 8 | head -n 1)$RANDOM
echo "GENERATED PASSWORD: " $POSTGRES_USER_PASS
export POSTGRES_DB_NAME=VECTOR_DB
export SUBSCRIPTION_ID=********-****-****-****-************
export PUBLIC_IP=$(curl ifconfig.io -4)

In the above configuration example, the password is automatically generated and output to the standard output. Please enter your own password or make a note of the generated password.

4.2.2 Installing Azure PostgreSQL Flexible Server

Please execute the following three commands. When you run these commands, the following tasks will be performed:

  1. Install Azure PostgreSQL Flexible Server
  2. Configure the firewall
  3. Create a new database

az postgres flexible-server create --name $POSTGRES_SERVER_NAME \
    -g $RESOURCE_GROUP \
    --location $LOCATION \
    --admin-user $POSTGRES_USER_NAME \
    --admin-password $POSTGRES_USER_PASS \
    --public-access $PUBLIC_IP \
    --yes
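# NOTE: the range 0.0.0.0-255.255.255.255 below opens the server to every IPv4 address.
# For anything beyond a short-lived test, narrow this range. Setting both the start
# and the end address to 0.0.0.0 allows access from Azure services only.
az postgres flexible-server firewall-rule create \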
    -g $RESOURCE_GROUP \
    -n $POSTGRES_SERVER_NAME \
    -r AllowAllAzureIPs \
    --start-ip-address 0.0.0.0 \
    --end-ip-address 255.255.255.255
az postgres flexible-server db create \
    -g $RESOURCE_GROUP \
    -s $POSTGRES_SERVER_NAME \
    -d $POSTGRES_DB_NAME

4.2.3 Configuring Azure PostgreSQL Flexible Server for multi-language support

Since the data to be persisted in this case includes Japanese strings, please perform the following settings to enable handling of Japanese UTF-8 within the database.

az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name lc_monetary --value "ja_JP.utf-8"
az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name lc_numeric --value "ja_JP.utf-8"
az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name timezone --value "Asia/Tokyo"

4.2.4 Installing extensions on Azure PostgreSQL Flexible Server

To enable handling of UUID and Vector in PostgreSQL, we will make use of extensions. Please execute the following command.

Note:

Do not put a space after the comma in "VECTOR,UUID-OSSP".

az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name azure.extensions --value "VECTOR,UUID-OSSP"

4.3 Creating a table to handle Vector in PostgreSQL

Now that the PostgreSQL configuration is complete, execute the following command to connect.

> psql -U $POSTGRES_USER_NAME -d $POSTGRES_DB_NAME \
      -h $POSTGRES_SERVER_NAME.postgres.database.azure.com 

Once connected successfully, enable the use of the extensions added earlier within PostgreSQL. Please execute the CREATE EXTENSION command for each, as shown below.

SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

VECTOR_DB=>
VECTOR_DB=> CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION
VECTOR_DB=> CREATE EXTENSION IF NOT EXISTS "vector";
CREATE EXTENSION

Finally, create a table to store the vector data. The vector itself is stored in the embedding VECTOR(1536) column. For simplicity, the original text is saved alongside it, so that the text of the most similar entry can be displayed directly. In practice, you may prefer to store a URL that links to the document, or to join with another table later.

VECTOR_DB=> CREATE TABLE TBL_VECTOR_TEST(
    id uuid,
    embedding VECTOR(1536),
    origntext varchar(8192),
    PRIMARY KEY (id)
    );
CREATE TABLE
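Because the column is declared as VECTOR(1536), every value written to it needs exactly 1536 elements. pgvector accepts a vector as text in the form [x1,x2,...], so one possible way to prepare a value on the Java side is sketched below. The class `VectorLiteralSketch` and the method `toVectorLiteral` are hypothetical names used only for this illustration, and the dimension check is an extra safeguard assumed here, not something the extension requires from the client.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
class VectorLiteralSketch {

    // Matches the VECTOR(1536) column definition used in this article.
    static final int EXPECTED_DIMENSIONS = 1536;

    /**
     * Formats an embedding as a pgvector text literal such as "[0.5,-1.25,2.0]".
     * Fails early when the number of elements does not match the column definition.
     */
    static String toVectorLiteral(List<Double> embedding, int expectedDimensions) {
        if (embedding.size() != expectedDimensions) {
            throw new IllegalArgumentException(
                    "Expected " + expectedDimensions + " dimensions but got " + embedding.size());
        }
        return embedding.stream()
                .map(value -> Double.toString(value))
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        // A 3-element vector is used here only to keep the output readable.
        System.out.println(toVectorLiteral(List.of(0.5, -1.25, 2.0), 3));
    }
}
```

A string produced this way could be bound to a `?::vector` placeholder with `setString`, with `EXPECTED_DIMENSIONS` passed as the second argument for real embeddings.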

4.4 Creating a Java application

4.4.1 Adding dependencies to the Maven project

To use the Azure OpenAI library, connect to PostgreSQL, and perform data persistence, at least the following dependencies need to be added. Please add them to your pom.xml.

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-ai-openai</artifactId>
            <version>1.0.0-beta.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>42.6.0</version>
        </dependency>

4.4.2 Creating and setting the properties file

Please create an application.properties file in the src/main/resources directory and add the following property settings to it.

azure.openai.url=https://YOUR_OWN_AZURE_OPENAI.openai.azure.com
azure.openai.model.name=gpt-4
azure.openai.api.key=************************************

azure.postgresql.jdbcurl=jdbc:postgresql://YOUR_POSTGRESQL.postgres.database.azure.com:5432/VECTOR_DB
azure.postgresql.user=yoterada
azure.postgresql.password=************************************

logging.group.mycustomgroup=com.yoshio3
logging.level.mycustomgroup=DEBUG
logging.level.root=INFO
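A missing or misspelled key in this file only shows up later as a null value, so it can help to validate the settings when they are loaded. The following is a rough sketch under that assumption. The class `RequiredPropertiesSketch` and its `load` method are hypothetical, and the sample values in `main` are placeholders rather than real endpoints or keys.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper for illustration only.
class RequiredPropertiesSketch {

    /** Loads properties from the reader and fails fast if a required key is missing or blank. */
    static Properties load(Reader source, String... requiredKeys) throws IOException {
        Properties properties = new Properties();
        properties.load(source);
        for (String key : requiredKeys) {
            String value = properties.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalStateException("Missing required property: " + key);
            }
        }
        return properties;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; a real application would read application.properties instead.
        String sample = "azure.openai.url=https://example.invalid\n"
                + "azure.openai.api.key=dummy-key\n";
        Properties properties = load(new StringReader(sample),
                "azure.openai.url", "azure.openai.api.key");
        System.out.println(properties.getProperty("azure.openai.url"));
    }
}
```

In an application, the reader could wrap the stream returned by `getResourceAsStream("/application.properties")`.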

4.4.3 Implementing the Java program

Finally, implement the Java code.

package com.yoshio3;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.ai.openai.models.EmbeddingsOptions;
import com.azure.core.credential.AzureKeyCredential;

public class VectorTest {
    private final static Logger LOGGER = LoggerFactory.getLogger(VectorTest.class);

    // Azure OpenAI API KEY
    private String OPENAI_API_KEY = "";
    // Azure OpenAI Instance URL
    private String OPENAI_URL = "";

    private String POSTGRESQL_JDBC_URL = "";
    private String POSTGRESQL_USER = "";
    private String POSTGRESQL_PASSWORD = "";

    private final static List<String> INPUT_DATA = Arrays.asList(
            "Visual Studio Code for Java is an extension for the open-source code editor Visual Studio Code provided by Microsoft, which supports the Java programming language. It offers an environment for Java developers to efficiently write, build, debug, test, and execute code. Visual Studio Code is a lightweight text editor with multi-language support and high extensibility. For Java developers, Visual Studio Code for Java can be an excellent development environment. Visual Studio Code for Java can be used by installing the Java Development Kit (JDK), which is a basic toolset for compiling and running Java programs. To start a Java project in Visual Studio Code, install the JDK and then install the Java Extension Pack from the Visual Studio Code extension marketplace. The Java Extension Pack includes features such as Java language support, debugging, testing, and project management. The main features of Visual Studio Code for Java are as follows: 1. Syntax highlighting: Color-coding Java code to improve readability. 2. Code completion: Suggesting possible code while entering it, enabling efficient code writing. 3. Code navigation: Easy jumping to classes, methods, variables, and finding definitions. 4. Refactoring: Changing the structure and names of code to improve its quality. 5. Debugging: Setting breakpoints, stepping through code, and monitoring variables. 6. Testing: Supporting testing frameworks such as JUnit and TestNG, allowing test creation, execution, and result display. 7. Project management: Supporting build tools such as Maven and Gradle, enabling project configuration and dependency management. 8. Git integration: Integrating with Git for source code version control. Visual Studio Code for Java offers a wealth of productivity-enhancing features and provides an environment suitable for Java project development. Additionally, the Visual Studio Code extension marketplace has various Java-related extensions, which can be added as needed. "
                    + "With these extensions, Java developers can use Visual Studio Code as an integrated development environment.",
            "Azure App Service for Java is a fully managed platform on Microsoft's cloud platform Azure, designed for hosting, deploying, and managing Java applications. Azure App Service supports the development and execution of web applications, mobile apps, APIs, and other backend applications, allowing Java developers to quickly deploy and scale their applications. By abstracting infrastructure management, developers can focus on their application code. Azure App Service for Java includes the Java Development Kit (JDK) and a web server, supporting Java runtimes such as Tomcat, Jetty, and JBoss EAP. Furthermore, Azure App Service provides features to support the entire lifecycle of Java applications, such as building CI/CD (Continuous Integration/Continuous Delivery) pipelines, setting up custom domains, managing SSL certificates, and monitoring and diagnosing applications.",
            "Azure Container Apps is a fully managed service on Microsoft's cloud platform Azure for deploying, managing, and scaling container-based applications. Azure Container Apps is suitable for applications that implement microservices architecture, web applications, backend services, and job execution. This service abstracts container orchestration platforms like Kubernetes, freeing developers from infrastructure management and scaling, allowing them to focus on application code. Azure Container Apps deploys applications using Docker container images and provides features such as automatic scaling, rolling updates, and auto-recovery. Moreover, Azure Container Apps is platform-independent, supporting any programming language or framework. Developers can easily deploy applications using the Azure portal, Azure CLI, or CI/CD pipelines such as GitHub Actions. In terms of security, Azure Container Apps offers features like network isolation, private endpoints, and Azure Active Directory (AAD) integration, ensuring the safety of your applications. Additionally, it provides features to support application monitoring and diagnostics, enabling the identification of performance and issues using tools like Azure Monitor and Azure Application Insights.",
            "Azure Cosmos DB is a global, distributed multi-model database service from Microsoft, offering low latency, high availability, and high throughput. Cosmos DB is a NoSQL database that supports multiple data models, including key-value, document, column-family, and graph. This fully managed service caters to various applications and can be used for the development of web, mobile, IoT, and gaming solutions. The key features of Azure Cosmos DB include: Global distribution: Automatically replicates data across multiple geographic regions, providing high availability and low latency. Horizontal scaling: Utilizes partition keys to split data across multiple physical partitions, enabling flexible scaling of throughput and storage capacity. Five consistency models: Choose from five consistency models, ranging from strong to eventual consistency, depending on the consistency requirements of your globally distributed applications. Real-time analytics: Integrates with Azure Synapse Analytics and Azure Functions for real-time data processing and analysis. Additionally, Azure Cosmos DB offers multiple APIs, such as SQL API, MongoDB API, Cassandra API, Gremlin API, and Table API, enabling developers to build applications using familiar APIs. In terms of data security, Cosmos DB provides features like encryption, network isolation, and access control to ensure data protection. Furthermore, Azure Monitor and Azure Application Insights can be utilized to identify database performance and issues.",
            "Azure Kubernetes Service (AKS) is a Kubernetes cluster management service provided by Microsoft that simplifies the deployment, scaling, and operation of containerized applications. As a managed Kubernetes service, AKS automates tedious infrastructure management and update tasks, allowing developers to focus on application development. AKS offers enterprise-grade security, monitoring, and operational management features, and it easily integrates with DevOps pipelines. Additionally, it can work with other Azure services to support flexible application development. The main features and benefits of AKS are as follows: Cluster provisioning and scaling: AKS automates cluster infrastructure management, allowing you to add or remove nodes as needed, ensuring efficient resource usage and operations. Security and access control: AKS provides built-in Azure Active Directory (AD) integration and enables secure cluster access management using role-based access control (RBAC). Integration with CI/CD pipelines: AKS integrates with CI/CD tools such as Azure DevOps and Jenkins, automating application build, test, and deployment processes. Monitoring and logging: AKS integrates with monitoring tools like Azure Monitor, Prometheus, and Grafana, allowing you to monitor cluster performance and resource usage. Logs can be centrally managed through Azure Log Analytics. Networking and storage: AKS uses Azure Virtual Networks (VNet) to run clusters within private networks. It also provides persistent data storage using Azure Storage and Azure Disks. Collaboration with other Azure services: AKS can work with other Azure services such as Azure Functions and Azure Cosmos DB, extending the functionality of your applications. These features enable AKS to reduce the burden of infrastructure management and operations for developers, allowing them to focus on application development. "
                    + "The provision of enterprise-grade security, monitoring, and operational management features ensures a reliable user experience.",
            "Azure Cognitive Services is a cloud-based service that integrates AI capabilities provided by Microsoft, allowing you to easily add artificial intelligence features to applications, websites, and bots. Even without deep learning or machine learning expertise, you can use these features through APIs. Azure Cognitive Services is divided into the following five categories: 1. Vision: Analyzes images and videos, providing features such as facial recognition, image recognition, and Optical Character Recognition (OCR). It includes Computer Vision, Custom Vision, Face API, Form Recognizer, and Video Indexer. 2. Speech: Provides speech-related features such as speech recognition, speech synthesis, and speech translation. It includes Speech Services, Speech Translation, and Speaker Recognition. 3. Language: Offers natural language processing (NLP) capabilities, allowing text analysis, machine translation, document summarization, and keyword extraction. It includes Text Analytics, Language Understanding (LUIS), QnA Maker, and Translation. 4. Decision: Provides features to support decision-making and recommendations, enabling personalized content and actions for individual users. It includes Personalization, Anomaly Detector, and Content Moderator. 5. Web Search: Utilizes Bing's search engine to provide web search, image search, video search, news search, and map search capabilities. It includes Bing Web Search API, Bing Image Search API, Bing Video Search API, Bing News Search API, and Bing Maps API. By combining these AI features, you can improve user experience and enhance the value of applications and services. In addition, Azure Cognitive Services is designed with privacy and security in mind, allowing businesses and developers to use it with confidence.",
            "Azure Container Instances (ACI) is a service provided by Microsoft that allows quick and easy deployment of containers. With ACI, you can run application containers without managing virtual machines or orchestration, reducing the workload on infrastructure management for developers. ACI offers per-second billing, with costs generated based on resource usage. The main features and benefits are as follows: 1. Simple and fast deployment: ACI can deploy containers quickly using Docker container images. Containers running on ACI are also compatible with Docker commands and Kubernetes clusters. 2. No need to manage the operating system: ACI eliminates the need to manage and update the host OS, allowing developers to focus on application development and operations. 3. Seamless scaling: ACI enables flexible scaling of the number of containers, allowing you to increase or decrease resources according to load. 4. Security: ACI uses Azure's security features to protect containers and data, and also offers network isolation. 5. Flexible resource allocation based on requirements: With ACI, you can allocate CPU and memory individually, optimizing resources according to application requirements. 6. Event-driven container execution: ACI can be integrated with services like Azure Functions and Logic Apps to enable event-driven container execution. These features make Azure Container Instances effective for various scenarios, such as short-term workloads, batch processing, and development and testing environments. Additionally, by combining with AKS, you can also achieve container management with orchestration capabilities.",
            "Azure Data Lake Storage (ADLS) is a large-scale data lake solution provided by Microsoft, designed to efficiently store, process, and analyze petabyte-scale data. ADLS is part of Azure Storage and centrally manages unstructured, semi-structured, and structured data, enabling advanced analytics such as big data analysis, machine learning, and real-time analysis. The main features and benefits of ADLS are as follows: 1. Scalability: ADLS provides high scalability that can store petabyte-scale data, flexibly adapting to the growth of data. 2. Performance: ADLS offers optimized performance for large-scale data reads and writes, making it suitable for big data processing and real-time analysis. 3. Security and Compliance: ADLS provides security features such as data encryption, access control, and audit logs, addressing corporate compliance requirements. 4. High Compatibility: ADLS is compatible with the Hadoop Distributed File System (HDFS) and can be integrated with existing Hadoop ecosystems as well as big data analytics tools like Apache Spark and Azure Databricks. 5. Hierarchical Storage: ADLS offers three storage tiers - hot, cool, and archive - providing optimal cost-performance based on data access frequency. 6. Integration of Data Lake and Object Storage: Azure Data Lake Storage Gen2 integrates ADLS and Azure Blob Storage, offering a solution that combines the benefits of large-scale data lakes and object storage. Azure Data Lake Storage is a powerful platform for enterprises to efficiently manage large amounts of data and achieve advanced data processing, such as big data analysis and machine learning. This enables businesses to fully leverage the value of their data, gaining business insights and improving decision-making.",
            "Azure Blob Storage is an object storage service provided by Microsoft, offering a cloud-based solution for storing and managing large amounts of unstructured data. It can securely and scalably store various types of data, such as text, binary data, images, videos, and log files. The main features and benefits are as follows: 1. Scalability: Azure Blob Storage provides high scalability that can store petabyte-scale data, flexibly adapting to the growth of data. 2. Performance: Azure Blob Storage offers high performance for data reads and writes, enabling rapid processing of large amounts of data. 3. Hierarchical Storage: Azure Blob Storage provides three storage tiers - hot, cool, and archive - offering optimal cost-performance based on data access frequency. 4. Security and Compliance: Azure Blob Storage offers security features such as data encryption, access control, and audit logs, addressing corporate compliance requirements. 5. Global Access: Azure Blob Storage leverages Microsoft's Azure data centers, allowing fast and secure data access from anywhere in the world. 6. Integration and Compatibility: Azure Blob Storage can be integrated with Azure Data Lake Storage Gen2, other Azure services, and on-premises systems, enabling centralized data management and analysis. Azure Blob Storage is effectively used in various scenarios, such as web applications, backups, archives, big data analytics, and IoT devices. This enables businesses to fully leverage the value of their data, gaining business insights and improving decision-making.");

    private OpenAIClient client;

    public VectorTest() throws IOException {
        Properties properties = new Properties();
        properties.load(this.getClass().getResourceAsStream("/application.properties"));
        OPENAI_API_KEY = properties.getProperty("azure.openai.api.key");
        OPENAI_URL = properties.getProperty("azure.openai.url");
        POSTGRESQL_JDBC_URL = properties.getProperty("azure.postgresql.jdbcurl");
        POSTGRESQL_USER = properties.getProperty("azure.postgresql.user");
        POSTGRESQL_PASSWORD = properties.getProperty("azure.postgresql.password");

        client = new OpenAIClientBuilder()
                .credential(new AzureKeyCredential(OPENAI_API_KEY))
                .endpoint(OPENAI_URL)
                .buildClient();
    }


    public static void main(String[] args) {
        VectorTest test;
        try {
            test = new VectorTest();
            // Execute the insertion into the database only once.
            // test.insertDataToPostgreSQL();

            // Retrieve similar documents using vector search for the data registered in the DB
            test.findMostSimilarString("Please tell me about Azure Blob Storage");
        } catch (IOException e) {
            LOGGER.error("Error : ", e);
        }
    }

    /**
     * Invoke Azure OpenAI (text-embedding-ada-002)
     */
    private List<Double> invokeTextEmbedding(String originalText) {
        EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(Arrays.asList(originalText));
        var result = client.getEmbeddings("text-embedding-ada-002", embeddingsOptions);
        var embedding = result.getData().stream().findFirst().get().getEmbedding();
        return embedding;
    }

    private void insertDataToPostgreSQL() {
        var insertSql = "INSERT INTO TBL_VECTOR_TEST (id, embedding, origntext) VALUES (?, ?::vector, ?)";
        // try-with-resources closes both the connection and the statement
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD);
             PreparedStatement insertStatement = connection.prepareStatement(insertSql)) {

            for (String originText : INPUT_DATA) {
                // Call the text embedding and obtain the vector array
                List<Double> embedding = invokeTextEmbedding(originText);
                // Sleep for 10 seconds to prevent errors
                // due to sending a large number of requests in a short period of time
                TimeUnit.SECONDS.sleep(10);

                insertStatement.setObject(1, UUID.randomUUID());
                insertStatement.setArray(2, connection.createArrayOf("double", embedding.toArray()));
                insertStatement.setString(3, originText);
                insertStatement.executeUpdate();
            }
        } catch (SQLException e) {
            LOGGER.error("Database operation failed.", e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see the interruption
            Thread.currentThread().interrupt();
            LOGGER.error("Interrupted while waiting between embedding requests.", e);
        }
    }

    public void findMostSimilarString(String data) {
        // Search with the vector array (find the closest string to the user's input)
        String querySql = "SELECT origntext FROM TBL_VECTOR_TEST ORDER BY embedding <-> ?::vector LIMIT 1;";
        // try-with-resources closes the connection, statement, and result set
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD);
             PreparedStatement queryStatement = connection.prepareStatement(querySql)) {
            // Create a vector array by calling the text embedding for the string the user wants to search
            List<Double> embedding = invokeTextEmbedding(data);
            String array = embedding.toString();
            LOGGER.info("Embedding: \n" + array);

            queryStatement.setString(1, array);
            try (ResultSet resultSet = queryStatement.executeQuery()) {
                while (resultSet.next()) {
                    String origntext = resultSet.getString("origntext");
                    LOGGER.info("Origntext: " + origntext);
                }
            }
        } catch (SQLException e) {
            LOGGER.error("Database operation failed.", e);
        }
    }
}
Point 1

A list of strings, private final static List<String> INPUT_DATA, is defined. It contains descriptions of the following services, one string per service. As mentioned earlier, results may be less accurate if the text contains newline characters, so every newline is replaced with a space.

  1. Visual Studio Code
  2. Azure App Service for Java
  3. Azure Container Apps
  4. Azure Cosmos DB
  5. Azure Kubernetes Service
  6. Azure Cognitive Service
  7. Azure Container Instances
  8. Azure Data Lake Storage
  9. Azure Blob Storage
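
The newline replacement described in Point 1 could look roughly like the sketch below. It is only an illustration, not the code that was used to prepare INPUT_DATA; the class name TextNormalizer and the method toSingleLine are hypothetical.

```java
public class TextNormalizer {

    // Hypothetical helper: collapse every line break, together with the
    // whitespace around it, into a single space so the text becomes one line.
    public static String toSingleLine(String text) {
        return text.replaceAll("\\s*\\R\\s*", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "Azure Blob Storage is an object storage service.\n"
                + "  It stores unstructured data.\r\n";
        System.out.println(toSingleLine(raw));
    }
}
```

The pattern uses \R, which matches any line-break sequence, so Unix and Windows line endings are handled the same way.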
Point 2

The following invokeTextEmbedding method calls the Azure OpenAI embedding model. Given a string, it returns the embedding as a list of floating-point numbers.

    private List<Double> invokeTextEmbedding(String originalText) {
        EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(Arrays.asList(originalText));
        var result = client.getEmbeddings("text-embedding-ada-002", embeddingsOptions);
        var embedding = result.getData().stream().findFirst().get().getEmbedding();
        return embedding;
    }
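
Because the returned list is later stored in a vector column, it can be useful to verify its length before going any further. The sketch below is an optional guard, not part of the original sample; the class EmbeddingCheck and the method requireDimensions are hypothetical, and the value 1536 is the dimension stated earlier in this article for text-embedding-ada-002.

```java
import java.util.List;

public class EmbeddingCheck {

    // Assumption: text-embedding-ada-002 returns 1536 values per input string.
    public static final int EXPECTED_DIMENSIONS = 1536;

    // Hypothetical guard: fail fast before the vector reaches the database.
    public static List<Double> requireDimensions(List<Double> embedding, int expected) {
        if (embedding == null || embedding.size() != expected) {
            int actual = (embedding == null) ? 0 : embedding.size();
            throw new IllegalArgumentException(
                    "Expected " + expected + " dimensions but got " + actual);
        }
        return embedding;
    }

    public static void main(String[] args) {
        // Toy 3-dimensional vector, used only to exercise the check.
        List<Double> tiny = List.of(0.1, 0.2, 0.3);
        System.out.println(requireDimensions(tiny, 3).size());
    }
}
```

In the sample, such a check would wrap the return value of invokeTextEmbedding, for example requireDimensions(embedding, EXPECTED_DIMENSIONS).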
Point 3

The following method takes one element at a time from the prepared list of strings (INPUT_DATA), calls the Azure OpenAI embedding model, receives a high-dimensional vector, and saves it to the database. This method only inserts test data, so run it just once.

    private void insertDataToPostgreSQL() {
        var insertSql = "INSERT INTO TBL_VECTOR_TEST (id, embedding, origntext) VALUES (?, ?::vector, ?)";
        // try-with-resources closes both the connection and the statement
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD);
             PreparedStatement insertStatement = connection.prepareStatement(insertSql)) {

            for (String originText : INPUT_DATA) {
                // Call the text embedding and obtain the vector array
                List<Double> embedding = invokeTextEmbedding(originText);
                // Sleep for 10 seconds to prevent errors
                // due to sending a large number of requests in a short period of time
                TimeUnit.SECONDS.sleep(10);

                insertStatement.setObject(1, UUID.randomUUID());
                insertStatement.setArray(2, connection.createArrayOf("double", embedding.toArray()));
                insertStatement.setString(3, originText);
                insertStatement.executeUpdate();
            }
        } catch (SQLException e) {
            LOGGER.error("Database operation failed.", e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see the interruption
            Thread.currentThread().interrupt();
            LOGGER.error("Interrupted while waiting between embedding requests.", e);
        }
    }
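
As an alternative to createArrayOf, the embedding can also be bound as text and cast with ?::vector, which is what the search query in this sample does. The sketch below shows one way to build that text form. The class VectorLiteral and the method toVectorLiteral are hypothetical names; the bracketed, comma-separated format is pgvector's text representation of a vector.

```java
import java.util.List;
import java.util.stream.Collectors;

public class VectorLiteral {

    // Hypothetical helper: render an embedding in pgvector's text form,
    // for example [0.1,0.2,0.3], so it can be bound with setString and
    // cast on the SQL side with ?::vector.
    public static String toVectorLiteral(List<Double> embedding) {
        return embedding.stream()
                .map(d -> Double.toString(d))
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(toVectorLiteral(List.of(0.1, -0.25, 3.0)));
    }
}
```

Using the same text binding for both the insert and the search would keep the two code paths consistent, at the cost of building a fairly long string for each 1536-dimensional vector.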
Point 4

Finally, compare the information stored in the database with the information entered by the user, and find the most relevant document.

    public static void main(String[] args) {
        VectorTest test;
        try {
            test = new VectorTest();
            // Retrieve similar documents using vector search for the data registered in the DB
            test.findMostSimilarString("Please tell me about Azure Blob Storage");
        } catch (IOException e) {
            LOGGER.error("Error : ", e);
        }
    }

In this sample, the string "Please tell me about Azure Blob Storage" is passed in the main() method, as shown above. Note that invokeTextEmbedding(data) is called for this string as well, so the user's input is also converted into a vector.

That vector is passed to the following query:

SELECT origntext FROM TBL_VECTOR_TEST ORDER BY embedding <-> ?::vector LIMIT 1;

Since LIMIT 1 is specified above, only the most relevant document is output. If you want to return multiple results, change this value.
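
To make the effect of ORDER BY and LIMIT more concrete, here is an in-memory sketch of the same idea: sort the documents by their distance to the query vector and keep the first k. It is not how pgvector is implemented; the class TopKSketch, its methods, and the 2-dimensional toy vectors are all made up for illustration.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopKSketch {

    // Straight-line (Euclidean) distance, the measure behind the <-> operator.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // In-memory counterpart of "ORDER BY embedding <-> query LIMIT k".
    public static List<String> nearest(Map<String, double[]> docs, double[] query, int k) {
        return docs.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> euclidean(e.getValue(), query)))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy 2-dimensional vectors; real embeddings have 1536 dimensions.
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("Azure Blob Storage", new double[] {0.9, 0.1});
        docs.put("Azure Cosmos DB", new double[] {0.5, 0.5});
        docs.put("Visual Studio Code", new double[] {0.0, 1.0});

        System.out.println(nearest(docs, new double[] {1.0, 0.0}, 2));
    }
}
```

With k set to 1 this corresponds to the LIMIT 1 case in the sample; a larger k returns the next-closest documents in order.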

Furthermore, the <-> notation can be changed. In PostgreSQL's pgvector, you can calculate similarity by specifying one of the following three operators. Change the operator as needed.

Operator Description
<-> Euclidean (L2) distance: the straight-line distance between two vectors in n-dimensional space
<#> Negative inner product: the inner product multiplied by -1, so that smaller values mean greater similarity
<=> Cosine distance: 1 minus the cosine of the angle between two vectors

Euclidean distance takes both the direction and the length of the vectors into account, while cosine distance depends only on the angle between them. With all three operators, a smaller value means a closer match.
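
The three measures can be written out in plain Java to see how they differ. This is a sketch for intuition only; the class DistanceSketch and its method names are hypothetical, and pgvector computes these values inside the database. Note that pgvector documents <=> as cosine distance, that is, 1 minus the cosine similarity.

```java
public class DistanceSketch {

    // Inner (dot) product of two vectors of equal length.
    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Counterpart of <->: straight-line distance.
    public static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Counterpart of <#>: inner product with the sign flipped.
    public static double negativeInnerProduct(double[] a, double[] b) {
        return -dot(a, b);
    }

    // Counterpart of <=>: 1 minus the cosine of the angle between a and b.
    public static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        double[] a = {3.0, 4.0};
        double[] b = {4.0, 3.0};
        System.out.println("euclidean distance:     " + euclideanDistance(a, b));
        System.out.println("negative inner product: " + negativeInnerProduct(a, b));
        System.out.println("cosine distance:        " + cosineDistance(a, b));
    }
}
```

When all vectors have length 1, the three measures rank documents in the same order, so the choice of operator then matters mostly for index support and performance rather than for the results.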

The query returns the original text of the closest match, as implemented below.

    public void findMostSimilarString(String data) {
        // Search with the vector array (find the closest string to the user's input)
        String querySql = "SELECT origntext FROM TBL_VECTOR_TEST ORDER BY embedding <-> ?::vector LIMIT 1;";
        // try-with-resources closes the connection, statement, and result set
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD);
             PreparedStatement queryStatement = connection.prepareStatement(querySql)) {
            // Create a vector array by calling the text embedding for the string the user wants to search
            List<Double> embedding = invokeTextEmbedding(data);
            String array = embedding.toString();
            LOGGER.info("Embedding: \n" + array);

            queryStatement.setString(1, array);
            try (ResultSet resultSet = queryStatement.executeQuery()) {
                while (resultSet.next()) {
                    String origntext = resultSet.getString("origntext");
                    LOGGER.info("Origntext: " + origntext);
                }
            }
        } catch (SQLException e) {
            LOGGER.error("Database operation failed.", e);
        }
    }

By comparing high-dimensional vectors in this way, you can find the documents most relevant to a user's input.
If you have a use case like this, please give it a try.

Reference Information

The reference information is listed below. Please take a look as needed.
