Continuing with the series of articles on voice file management, we are going to see how we can convert text into audio and receive the file with the chosen voice.
We will also explore how a service from OpenAI can help us analyze a text and determine the mood expressed in it.
Let's analyze how you can create your own voice file and how it can “read” your feelings.
The voice
OpenAI offers a service that can create a voice file with your text. It is a lifesaver if you want to integrate a voice response service for visually impaired people.
I know that there are a lot of applications that can read a text and convert it to a voice, for instance, Loquendo, AWS, Text-to-speech by Google, etc.
However, the problem with most of them is that the voice does not sound as natural as you might expect. It happens because they typically render each phoneme as it sounds in spoken language, but they do not place emphasis or emotion on the text.
On the other hand, when you use AI, it understands the context of the text and tries to add the correct intonation accordingly, e.g., exclamation, question, sadness, joy, etc.
This service returns the file directly as a binary stream in the response object. It also lets you choose among different audio formats: MP3, Opus, etc.
I did not want to return the binary content directly; instead, I planned to return a Base64 string that the user could convert to a file, play directly, or pass through a file API.
The first problem that I faced was converting the %Stream.GlobalBinary file into a Base64 string.
My first attempt was to read the binary stream in chunks and encode each chunk to Base64:
do tHttpResponse.Data.Rewind()
set response = ""
while ('tHttpResponse.Data.AtEnd) {
    set temp = tHttpResponse.Data.Read(4000)
    set temp = $system.Encryption.Base64Encode(temp)
    set response = response_temp
}
However, the content was not correctly converted to Base64: since 4000 bytes is not a multiple of 3, every chunk except the last one was encoded with its own "=" padding, which corrupted the concatenated result.
Yet, as always, the community saved my life (again). The idea was to convert the %Stream.GlobalBinary into a %Stream.GlobalCharacter and then read the content as Base64. Thank you, Marc Mundt, for your amazing response.
After that, I created the following class to convert my GlobalBinary to GlobalCharacter stream.
Class St.OpenAi.B64.Util Extends %RegisteredObject
{

/// Be cautious if changing CHUNKSIZE. Incorrect values could cause the resulting encoded data to be invalid.
/// It should always be a multiple of 57 and needs to be less than ~2.4MB when MAXSTRING is 3641144
Parameter CHUNKSIZE = 2097144;

ClassMethod B64EncodeStream(pStream As %Stream.Object, pAddCRLF As %Boolean = 0) As %Stream.Object
{
    set tEncodedStream = ##class(%Stream.GlobalCharacter).%New()
    do pStream.Rewind()
    while ('pStream.AtEnd) {
        set tReadLen = ..#CHUNKSIZE
        set tChunk = pStream.Read(.tReadLen)
        do tEncodedStream.Write($System.Encryption.Base64Encode(tChunk, 'pAddCRLF))
        if (pAddCRLF && 'pStream.AtEnd) {
            do tEncodedStream.Write($c(13,10))
        }
    }
    do tEncodedStream.Rewind()
    quit tEncodedStream
}

}
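The "multiple of 57" rule in the CHUNKSIZE comment comes down to Base64 operating on 3-byte groups (57 bytes encode to exactly 76 characters, one classic MIME line). A quick Python sketch (illustrative only, not part of the production ObjectScript) shows why chunked encoding only concatenates cleanly when the chunk size is a multiple of 3:

```python
import base64
import io

def b64_encode_stream(stream, chunk_size):
    """Encode a binary stream to Base64 chunk by chunk, concatenating
    the encoded pieces, as the ObjectScript loop above does."""
    out = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        out.append(base64.b64encode(chunk).decode("ascii"))
    return "".join(out)

data = bytes(range(256)) * 100          # 25,600 bytes of sample "binary" data
expected = base64.b64encode(data).decode("ascii")

# 570 = 57 * 10 is a multiple of 3: each chunk encodes to whole Base64
# quanta with no intermediate '=' padding, so concatenation is valid.
good = b64_encode_stream(io.BytesIO(data), 570)

# 4000 (the first attempt) is NOT a multiple of 3: every chunk except the
# last gets its own '=' padding, producing a corrupted Base64 string.
bad = b64_encode_stream(io.BytesIO(data), 4000)
```

This is exactly why the class above caps CHUNKSIZE at a multiple of 57 rather than an arbitrary round number.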
My next step was to call it from the St.OpenAi.BO.Api.Connect class and get the Base64 string correctly.
set pResponse = ##class(St.OpenAi.Msg.Audio.SpeachResponse).%New()
do pResponseStream.Content.Rewind()
set contentfile = ##class(St.OpenAi.B64.Util).B64EncodeStream(pResponseStream.Content)
set content = ""
while ('contentfile.AtEnd) {
    set content = content_contentfile.Read()
}
set pResponse.Content = content
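On the client side, that Base64 string can be turned back into a playable file. A minimal Python sketch of such a consumer (the function name and target path are illustrative assumptions, not part of the article's code):

```python
import base64
import os
import tempfile

def base64_to_audio_file(b64_content: str, path: str) -> int:
    """Decode a Base64 payload (like the one in pResponse.Content) and
    write the raw audio bytes to disk. Returns the number of bytes written."""
    raw = base64.b64decode(b64_content)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Round-trip demo with dummy bytes standing in for real MP3 content.
payload = base64.b64encode(b"dummy mp3 content").decode("ascii")
target = os.path.join(tempfile.gettempdir(), "speech_demo.mp3")
written = base64_to_audio_file(payload, target)
```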
I am very lucky to have such a great community by my side!
What do you mean in your text?
What do you think the intention is here if you have the following text?
“My life has no meaning. I want to leave everything forever.”
You may feel that the speaker intends to do something bad to themselves. Indeed, since you are human, you understand the context, and you know how to "read between the lines".
To a computer, a word is just a collection of 0s and 1s that must be transformed into characters for us to understand. The same characters can appear in different orders, and the same concept has different words in different languages: "Hello", "Hola", and "Ciao".
We can create a relation between the word “Hello” and its counterpart in Spanish (“Hola”) or Italian (“Ciao”).
However, there is no easy way to relate the phrase "My life has no meaning" to something negative, or “I want to leave everything forever“ to your desire to harm yourself.
Following the same approach as before, we could decide that the phrase "I want to leave everything forever" means that you want to harm yourself. It might make sense, but the same line in the phrase "I don't like this city. I want to leave everything forever" is neutral.
Indeed, the context is crucial to determine if a text is neutral or contains violence, hate, or self-harm.
How can you train your AI in moderation?
Training an AI to moderate texts involves several steps and essential considerations:
Data Collection with Labels: It is critical to gather a sufficiently large and diverse dataset containing examples of texts you need to moderate, labeled as appropriate or inappropriate according to your moderation criteria.
Definition of Moderation Criteria: You need to clearly define what types of content you consider inappropriate and what actions should be taken regarding them (e.g., removing, hiding, or marking them as potentially offensive).
Selection of Algorithms and Models: You can use supervised machine learning techniques, where models are trained with labeled examples, or semi-supervised machine learning techniques, where both labeled and unlabeled examples are leveraged. Models like BERT, GPT, or specific text classification models can come in handy in this case.
Data Preprocessing: Before training the model, you should perform such tasks as tokenization, text normalization (e.g., converting all parts of the text to lowercase), removal of special characters, etc.
Model Training: Utilize your labeled data to train the selected model. During training, the model will learn to distinguish between appropriate and inappropriate texts based on the defined criteria.
Model Performance Evaluation: After training, evaluate the model's performance with the help of a separate test dataset. It will help you determine how well the model is generalizing and whether it needs additional adjustments.
Fine-Tuning and Continuous Improvement: You are likely to need to fine-tune and improve your model as you gather more data and observe its performance in the real world. It may involve retraining the model with updated data and parameter optimization.
Production Deployment: Once you are satisfied with the model's performance, you can deploy it in your online moderation system to help automate the text moderation process.
It is vital to note that no automated moderation system is perfect, so it is always advisable to combine AI with human supervision to address difficult cases or new types of inappropriate content.
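The pipeline above (labeled data, preprocessing, training, classification) can be sketched end to end with a toy Naive Bayes classifier. This is purely illustrative pure-Python code under invented labels; a real moderation system would use models like BERT or GPT and vastly more data:

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(text):
    # Data preprocessing step: lowercase, drop special characters, tokenize.
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesModerator:
    """Toy supervised classifier with labels 'appropriate' / 'inappropriate'."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()

    def train(self, samples):
        # Model training: count word occurrences per label.
        for text, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(preprocess(text))

    def classify(self, text):
        # Score each label with log prior + smoothed log likelihoods.
        tokens = preprocess(text)
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label in self.label_counts:
            logp = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(vocab)
            for t in tokens:
                logp += math.log((self.word_counts[label][t] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Data collection with labels: a tiny hand-labeled dataset.
train_data = [
    ("I hate you and everyone like you", "inappropriate"),
    ("you are worthless and should disappear", "inappropriate"),
    ("what a lovely sunny day", "appropriate"),
    ("I enjoyed the concert last night", "appropriate"),
]
model = NaiveBayesModerator()
model.train(train_data)
```

Evaluation and fine-tuning would then run this model against a held-out test set and adjust thresholds and data accordingly.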
The main purpose of OpenAI's "moderation" service is to provide artificial intelligence tools and models to help online platforms moderate and manage user-generated content. It includes detecting and mitigating such inappropriate content as spam, hate speech, harassment, violence, sexually explicit content, etc. The goal is to help develop a safer and healthier online environment for users by reducing the presence of harmful content. We can do it by identifying content that might be harmful and taking action.
When using OpenAI’s moderation, the previous text would return the following result:
{
"id": "modr-9FRqaywTVudh3Jk9FEYnxuRbrDmUH",
"model": "text-moderation-007",
"results": [
{
"flagged": true,
"categories": {
"sexual": false,
"hate": false,
"harassment": false,
"self-harm": true,
"sexual/minors": false,
"hate/threatening": false,
"violence/graphic": false,
"self-harm/intent": true,
"self-harm/instructions": false,
"harassment/threatening": false,
"violence": false
},
"category_scores": {
"sexual": 6.480985575763043e-6,
"hate": 0.00005180266089155339,
"harassment": 0.000108763859316241,
"self-harm": 0.861529529094696,
"sexual/minors": 6.210859737620922e-7,
"hate/threatening": 9.927841659873593e-8,
"violence/graphic": 0.000012115803656342905,
"self-harm/intent": 0.9326919317245483,
"self-harm/instructions": 0.00005927650636294857,
"harassment/threatening": 7.471672233805293e-6,
"violence": 0.0008052702760323882
}
}
]
}
It has detected an intent of self-harm, with the following scores:

- Self-harm: 86.15%
- Self-harm/intent: 93.26%
You can develop some filters using these percentages and detected categories and activate alerts or resend other types of responses more intelligently.
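For instance, a simple post-processing filter over the moderation response could look like this Python sketch (the threshold values are arbitrary assumptions to be tuned per application, not official OpenAI recommendations):

```python
import json

# Illustrative per-category alert thresholds; tune these for your use case.
THRESHOLDS = {"self-harm": 0.5, "self-harm/intent": 0.5,
              "violence": 0.7, "hate": 0.7}

def triggered_alerts(moderation_response: str):
    """Return the categories that the API flagged or whose score
    exceeds our local threshold."""
    result = json.loads(moderation_response)["results"][0]
    alerts = set()
    for category, flagged in result["categories"].items():
        if flagged:
            alerts.add(category)
    for category, score in result["category_scores"].items():
        if score >= THRESHOLDS.get(category, 1.0):
            alerts.add(category)
    return sorted(alerts)

# Trimmed-down version of the response shown above.
sample = json.dumps({
    "results": [{
        "flagged": True,
        "categories": {"self-harm": True, "self-harm/intent": True,
                       "violence": False},
        "category_scores": {"self-harm": 0.8615, "self-harm/intent": 0.9326,
                            "violence": 0.0008},
    }]
})
```

Here `triggered_alerts(sample)` would raise alerts for both self-harm categories while leaving "violence" untouched.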
The model classifies the following categories:
| CATEGORY | DESCRIPTION |
| --- | --- |
| hate | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. Hateful content aimed at non-protected groups (e.g., chess players) is harassment. |
| hate/threatening | Hateful content that includes violence or serious harm towards the targeted group based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. |
| harassment | Content that expresses, incites, or promotes harassing language towards any target. |
| harassment/threatening | Harassment content that additionally includes violence or serious harm towards any target. |
| self-harm | Content that promotes, encourages, or depicts such acts of self-harm as suicide, self-cutting/injury, and eating disorders. |
| self-harm/intent | Content where the speaker expresses that they are engaging or intend to engage in such acts of self-harm as suicide, self-cutting/injury, and eating disorders. |
| self-harm/instructions | Content that encourages performing such acts of self-harm as suicide, self-cutting/injury, and eating disorders, or that gives instructions or advice on how to commit such acts. |
| sexual | Content meant to arouse sexual excitement, e.g. the description of sexual activity, or that promotes sexual services (excluding sex education and wellness). |
| sexual/minors | Sexual content that includes an individual who is under 18 years old. |
| violence | Content that depicts death, violence, or physical injury. |
| violence/graphic | Content that depicts death, violence, or physical injury in graphic detail. |
The score value is between 0 and 1, where higher values denote higher confidence. The scores should not be interpreted as probabilities.
Speech
Endpoint: POST https://api.openai.com/v1/audio/speech
The Audio API provides a speech endpoint based on the TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to do the following:
- Narrate a written blog post.
- Produce spoken audio in multiple languages.
- Give real-time audio output using streaming.
The input parameters would be as mentioned below:
- model: Required.
- It is related to the available TTS models: tts-1 or tts-1-hd.
- input: Required.
- It is connected to the text used to generate audio. The maximum length is 4096 characters.
- voice: Required.
- It is linked to the voice to employ when generating the audio. Supported voices are “alloy”, “echo”, “fable”, “onyx”, “nova”, and “shimmer”. Previews of the voices are available in Text to speech guide.
- response_format: Optional.
- It is associated with the format of the audio. Supported formats are “mp3”, “opus”, “aac”, “flac”, “wav”, and “pcm”. If not indicated, the default value is “mp3”.
- speed: Optional.
- It sets the speed of the generated audio. Select a value from “0.25” to “4.0”. If not indicated, the default value is “1.0”.
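Putting the parameters together, here is a small Python helper that validates a speech request body locally before it is sent to the endpoint (an illustrative sketch; only the validation rules from the parameter list above are assumed):

```python
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def build_speech_request(input_text, model="tts-1", voice="alloy",
                         response_format="mp3", speed=1.0):
    """Build and validate the JSON body for
    POST https://api.openai.com/v1/audio/speech."""
    if not 0 < len(input_text) <= 4096:
        raise ValueError("input must be 1-4096 characters")
    if model not in ("tts-1", "tts-1-hd"):
        raise ValueError("unknown model")
    if voice not in VOICES:
        raise ValueError("unknown voice")
    if response_format not in FORMATS:
        raise ValueError("unknown response_format")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {"model": model, "input": input_text, "voice": voice,
            "response_format": response_format, "speed": speed}
```

Catching invalid values client-side like this avoids a round trip to the API just to receive a 400 error.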
Moderations
Endpoint: POST https://api.openai.com/v1/moderations
The input parameters would be as mentioned below:
- model: Optional.
- It is related to the available moderation models: text-moderation-stable or text-moderation-latest. If not indicated, the default value is “text-moderation-latest”.
- input: Required.
- It is connected to the text to classify.
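These parameters can likewise be assembled into a request body with some local validation (an illustrative Python sketch based only on the parameter list above):

```python
def build_moderation_request(input_text, model=None):
    """Build the JSON body for POST https://api.openai.com/v1/moderations.
    model is optional; the API defaults to text-moderation-latest."""
    if not input_text:
        raise ValueError("input is required")
    body = {"input": input_text}
    if model is not None:
        if model not in ("text-moderation-stable", "text-moderation-latest"):
            raise ValueError("unknown model")
        body["model"] = model
    return body
```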
What's next?
Since OpenAI is evolving continuously, nobody knows what feature they will release next.
Do not forget to mark the article with a “like” if you enjoyed it.