How to Create a Custom Translation Model on Azure Machine Learning

#azure #translation

Azure Machine Learning is a cloud-based platform for building, training, deploying, and managing machine learning models at scale. One application of machine learning is natural language processing (NLP), the ability of computers to understand and generate natural language. A core NLP task is machine translation, the automatic translation of text or speech from one language to another.

Machine translation can be useful for many scenarios, such as:

Communicating with people who speak different languages
Accessing information that is not available in your native language
Localizing your products or services for different markets
Enhancing your learning or research by exploring diverse sources of knowledge

However, machine translation is not a one-size-fits-all solution. Different domains, industries, and styles may have specific terminology, jargon, or expressions that are not well captured by generic translation models. For example, a medical document may use technical terms that are not common in everyday language, or a literary text may use figurative language that is not easy to translate literally.

To address this challenge, you can use Azure Machine Learning to create a custom translation model that reflects your specific needs and preferences. A custom translation model is a machine learning model that is trained on your own data, such as previously translated documents, glossaries, or dictionaries. By using your own data, you can teach the model to learn the preferred translations for your domain, industry, or style. This way, you can improve the accuracy and fluency of your translations and provide a better experience for your users.

In this blog post, I will show you how to create a custom translation model on Azure Machine Learning in four steps:

Step 1: Prepare your data
Step 2: Create a workspace and a project
Step 3: Train your model
Step 4: Deploy and use your model

Step 1: Prepare your data

The first step to create a custom translation model is to prepare your data. You need two types of data:

Parallel data: These are pairs of documents where one (target) is the translation of the other (source). One document in the pair contains sentences in the source language and the other document contains sentences translated into the target language. For example, if you want to create a custom translation model from English to French, you need parallel data that consists of English documents and their corresponding French translations.
Dictionary data: These are pairs of words or phrases where one (target) is the translation of the other (source). For example, if you want to create a custom translation model from English to French, you need dictionary data that consists of English words or phrases and their corresponding French translations.

The quality and quantity of your data are important factors that affect the performance of your custom translation model. Ideally, you should have:

High-quality data: Your data should be accurate, consistent, and relevant to your domain, industry, or style. You should avoid using data that contains errors, inconsistencies, or irrelevant content. For example, if you want to create a custom translation model for medical documents, you should use data that contains medical terminology and follows medical standards and conventions.
Sufficient data: Your data should cover enough examples and variations of your domain, industry, or style. You should use as much data as possible to train your custom translation model. However, the minimum amount of data required depends on the complexity and diversity of your domain, industry, or style. For example, if you want to create a custom translation model for medical documents, you may need more data than if you want to create a custom translation model for general documents.
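
To make the quality criteria above more concrete, here is a minimal sketch of a cleaning pass over sentence pairs. It is only an illustration: the function name clean_parallel_pairs and the max_length_ratio threshold of 3.0 are my own assumptions, not requirements of the service, and real data usually needs domain-specific checks as well.

```python
def clean_parallel_pairs(pairs, max_length_ratio=3.0):
    """Return pairs with empty, misaligned-looking, and duplicate entries removed.

    max_length_ratio is an illustrative threshold, not an official requirement.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Drop pairs where either side is empty.
        if not source or not target:
            continue
        # Drop pairs whose lengths differ so much that they are probably misaligned.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


pairs = [
    ("Hello", "Bonjour"),
    ("Hello", "Bonjour"),                        # duplicate
    ("How are you?", ""),                        # missing translation
    ("Hi", "Je vais bien, merci beaucoup."),     # suspicious length mismatch
    ("I am fine, thank you.", "Je vais bien, merci."),
]
print(clean_parallel_pairs(pairs))
```

A length-ratio filter is a crude heuristic, so it is worth reviewing what it removes before discarding anything for good.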

You can obtain parallel data and dictionary data from various sources, such as:

Your own existing data: You may already have some translated documents or glossaries that you can use as parallel data or dictionary data. For example, if you have previously translated some medical documents from English to French using human translators or other tools, you can use them as parallel data for your custom translation model.
Publicly available data: You may find some translated documents or glossaries that are publicly available on the internet or other platforms that you can use as parallel data or dictionary data. For example, you can use this website to find parallel corpora for various languages and domains.
Data providers: You may purchase some translated documents or glossaries from professional data providers that offer high-quality and domain-specific data. For example, you can use this website to find data providers for various languages and domains.

Once you have obtained your parallel data and dictionary data, you need to format them according to the requirements of Azure Machine Learning. You need to:

Convert your parallel data into tab-separated values (TSV) files. Each line in the file should contain a source sentence and a target sentence separated by a single tab character (written as \t in the example below). For example, if you have a parallel document that contains English sentences and their corresponding French translations, you need to convert it into a TSV file that looks like this:

Hello\tBonjour
How are you?\tComment allez-vous?
I am fine, thank you.\tJe vais bien, merci.

Convert your dictionary data into comma-separated values (CSV) files. Each line in the file should contain a source word or phrase and a target word or phrase separated by a comma. For example, if you have a dictionary that contains English words or phrases and their corresponding French translations, you need to convert it into a CSV file that looks like this:

doctor,médecin
hospital,hôpital
prescription,ordonnance

Zip your parallel data and dictionary data into separate zip files. Each zip file should contain one or more TSV or CSV files. For example, if you have multiple parallel documents and dictionaries for different domains, you can zip them into separate zip files like this:

medical_parallel_data.zip
    medical_document_1.tsv
    medical_document_2.tsv
    ...
medical_dictionary_data.zip
    medical_dictionary_1.csv
    medical_dictionary_2.csv
    ...
general_parallel_data.zip
    general_document_1.tsv
    general_document_2.tsv
    ...
general_dictionary_data.zip
    general_dictionary_1.csv
    general_dictionary_2.csv
    ...
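
The formatting steps above can be scripted. The sketch below uses only the Python standard library and follows the TSV, CSV, and zip layout described in this step. The helper names (write_tsv, write_csv, zip_files, build_archives) are hypothetical, and the files are written to a temporary directory purely for demonstration.

```python
import csv
import tempfile
import zipfile
from pathlib import Path


def write_tsv(pairs, path):
    """Write (source, target) sentence pairs, one tab-separated pair per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for source, target in pairs:
            # Collapse stray tabs and newlines so each pair stays on one line.
            source = " ".join(source.split())
            target = " ".join(target.split())
            f.write(f"{source}\t{target}\n")


def write_csv(entries, path):
    """Write (source, target) dictionary entries as comma-separated lines."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        # csv.writer quotes any entry that itself contains a comma.
        csv.writer(f, lineterminator="\n").writerows(entries)


def zip_files(paths, zip_path):
    """Bundle the given files into one zip archive, without directory prefixes."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path, arcname=Path(path).name)


def build_archives(out_dir):
    """Create one parallel-data zip and one dictionary zip inside out_dir."""
    out_dir = Path(out_dir)
    tsv_path = out_dir / "medical_document_1.tsv"
    csv_path = out_dir / "medical_dictionary_1.csv"
    write_tsv(
        [
            ("Hello", "Bonjour"),
            ("How are you?", "Comment allez-vous?"),
            ("I am fine, thank you.", "Je vais bien, merci."),
        ],
        tsv_path,
    )
    write_csv(
        [("doctor", "médecin"), ("hospital", "hôpital"), ("prescription", "ordonnance")],
        csv_path,
    )
    parallel_zip = out_dir / "medical_parallel_data.zip"
    dictionary_zip = out_dir / "medical_dictionary_data.zip"
    zip_files([tsv_path], parallel_zip)
    zip_files([csv_path], dictionary_zip)
    return parallel_zip, dictionary_zip


parallel_zip, dictionary_zip = build_archives(tempfile.mkdtemp())
with zipfile.ZipFile(parallel_zip) as archive:
    print(archive.namelist())
```

Writing the files as UTF-8 keeps accented characters such as those in "médecin" and "hôpital" intact.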

Step 2: Create a workspace and a project

The second step is to create a workspace and a project. In this walkthrough, both are created in the Custom Translator portal, which is part of the Azure AI Translator service. A workspace is a work area for composing and building your custom translation system. A workspace can contain multiple projects, models, and documents. A project is a wrapper for models, documents, and tests. Each project includes all documents that are uploaded into that workspace with the correct language pair.

To create a workspace and a project, you need to:

Sign in to the Custom Translator portal using your Microsoft account and Azure subscription.
Click on the Create workspace button on the top right corner of the portal.
Enter a name for your workspace, such as "Custom Translation Workspace".
Select the source language and the target language for your custom translation model, such as "English" and "French".
Click on the Create button to create your workspace.
Click on the Create project button on the top right corner of the portal.
Enter a name for your project, such as "Custom Translation Project".
Select the category for your project, such as "General" or "Medical".
Click on the Create button to create your project.

Step 3: Train your model

The third step to create a custom translation model is to train your model on Azure Machine Learning. A model is the system that provides translation for a specific language pair. The outcome of a successful training is a model. When you train a model, three mutually exclusive document types are required: training, tuning, and testing.

To train your model, you need to:

Upload your parallel data and dictionary data to your project. You can do this by clicking on the Upload button on the top right corner of the portal and selecting the zip files that contain your data. You can also drag and drop the zip files to the portal.
Assign document types to your data. You can do this by clicking on the Documents tab on the left side of the portal and selecting the documents that you want to assign. You can choose from three document types: training, tuning, and testing. Training data is used to train your model. Tuning data is used to optimize your model's parameters. Testing data is used to evaluate your model's performance.
Queue a training run for your model. You can do this by clicking on the Train button on the top right corner of the portal. You can choose from two training modes: standard or advanced. Standard mode is recommended for most users as it automatically selects the best settings for your model. Advanced mode allows you to customize some settings for your model, such as training time or dictionary weight.
Wait for your training run to complete. You can monitor the progress of your training run by clicking on the Models tab on the left side of the portal and selecting the model that you are training. You can see information such as status, duration, BLEU score, and date of creation.
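
If you prefer to control the three document types yourself, you can split your sentence pairs before uploading them. Here is a minimal sketch of such a split. The function name split_pairs, the set sizes, and the fixed random seed are illustrative assumptions of mine, not values prescribed by the portal.

```python
import random


def split_pairs(pairs, tuning_size, testing_size, seed=0):
    """Split sentence pairs into disjoint training, tuning, and testing sets."""
    # Remove exact duplicates first so the same pair cannot land in two sets.
    unique = list(dict.fromkeys(pairs))
    if tuning_size + testing_size >= len(unique):
        raise ValueError("Not enough pairs left over for training")
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(unique)
    testing = unique[:testing_size]
    tuning = unique[testing_size:testing_size + tuning_size]
    training = unique[testing_size + tuning_size:]
    return training, tuning, testing


# Placeholder data standing in for real sentence pairs.
pairs = [(f"source sentence {i}", f"target sentence {i}") for i in range(100)]
training, tuning, testing = split_pairs(pairs, tuning_size=10, testing_size=10)
print(len(training), len(tuning), len(testing))
```

Keeping the sets disjoint matters because a testing pair that also appears in the training data would make the reported score look better than the model really is.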

Step 4: Deploy and use your model

The fourth and final step to create a custom translation model is to deploy and use your model on Azure Machine Learning. Deploying your model means making it available for use by other applications or users through an endpoint. Using your model means sending requests to translate text or speech from one language to another using your custom translation system.
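
As a rough sketch of what "using your model" can look like, the code below builds a request for the Translator Text REST API (version 3.0), where a deployed custom model is selected through the category query parameter. This post does not show the endpoint or the header names, so treat them as my assumptions and check them against the official documentation. YOUR-CATEGORY-ID, YOUR-TRANSLATOR-KEY, and YOUR-RESOURCE-REGION are placeholders, and the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed global endpoint of the Translator Text API; verify it for your resource.
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text, source_lang, target_lang, category_id, key, region):
    """Build (but do not send) a translation request that targets a custom model."""
    query = urllib.parse.urlencode({
        "api-version": "3.0",
        "from": source_lang,
        "to": target_lang,
        "category": category_id,  # identifies the deployed custom model
    })
    body = json.dumps([{"Text": text}]).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json; charset=UTF-8",
    }
    return urllib.request.Request(
        f"{ENDPOINT}?{query}", data=body, headers=headers, method="POST"
    )


request = build_translate_request(
    "The doctor wrote a prescription.",
    source_lang="en",
    target_lang="fr",
    category_id="YOUR-CATEGORY-ID",   # placeholder
    key="YOUR-TRANSLATOR-KEY",        # placeholder
    region="YOUR-RESOURCE-REGION",    # placeholder
)
print(request.full_url)

# To actually send it, replace the placeholders with real values and run:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes it easy to inspect the URL and body before any call is billed against your subscription.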

The GitHub repository for our model is available here:

https://github.com/karleeov/azure-custom-translator