Thomas Künneth

On building a digital assistant for the rest of us (part 1)

Large language models (LLMs) have been in the headlines for quite a while now. Chatbots based on LLMs are not only capable of answering complex questions, but can also solve sophisticated tasks like writing code, creating pictures, and composing music. Sometimes their answers and results are so good that AI advocates credit LLMs with human capabilities like creativity and intuition. It has even become popular to claim that AI is going to take over a lot of work that is currently done by humans.

While LLMs have made remarkable progress, they still produce incorrect or misleading information quite often. Their output may be influenced by biases that are present in the data they were trained on. So, for the foreseeable future, it's crucial to use LLMs critically and to verify their results. Also, judgment and creativity remain, at least for now, human strengths. Artificial intelligence is a tool that can complement human expertise, but not replace it. It's important to approach the development of AI with a balanced perspective: while there is great potential to benefit society, it's also essential to address the challenges and risks associated with its use.

Defining some terms

The best way to understand the opportunities and challenges of AI is to work with it. That is, to use it in one of your projects. But before we dive in, let's clarify some terms first. What does Artificial intelligence mean? Here's what Gemini has to say:

Screenshot of Gemini defining the term artificial intelligence

I know what you are thinking. Why on earth is he starting the conversation with Please?

It certainly is not about being polite to an algorithm or computer program, but about remaining polite in conversations with other humans. Habit is very important for us: we do things because we are used to doing them. Now, what do you think will happen if we get used to not saying Please? That's right, we will stop using it in conversations with other humans, too. And that's not something we should strive for, right?

I already used the term Large language model, so let's ask Gemini for a definition.

Screenshot of Gemini defining the term Large language model

Finally, Chatbot:

Screenshot of Gemini defining the term Chatbot

While chatbots have become pretty popular in recent years, a conversational interaction with an LLM may not always be the best approach. For example, to increase the volume of your phone, you just want to type, or even better, speak a command like Please increase the volume. So there is a distinction between command and conversation.

Project viewfAInder

Executing voice commands is nothing new. We have been able to do this using the Google Assistant on Android and Siri on iOS for many years. For a lot of tasks, both technologies work really well. Still, both have been lacking conversational skills. This is where large language models can help. Consequently, both Apple and Google are incorporating LLMs into their digital assistants. A much enhanced Siri will debut in iOS 18. On Android, the transition from the Google Assistant to Gemini is in full swing. For now, the old assistant is lending Gemini a hand now and then. Time will tell when Gemini can do everything the Google Assistant is capable of.

Building a digital assistant that uses an LLM and integrating it into Android requires quite a few libraries and technologies. Consequently, it's a great way of learning how to use large language models in Android apps, as well as modern Android development libraries and best practices.

This series of articles is about implementing a small digital assistant on Android that uses Gemini 1.5 Pro. The app is called viewfAInder. Its source code is available on GitHub. The current version uses CameraX to get a continuous camera preview. When the user taps on the screen, the preview is sent to Gemini. The LLM returns a description of what's in the picture. Subsequent parts will explain how to turn the app into an assistant, how to recognise voice commands, and how to draw on screen so that the app can ask Gemini to focus on the highlighted area.

But let's take one step at a time. How can we grab the camera preview and send it to Gemini? The following screenshot shows what the app looks like after Gemini has returned an image description.

Screenshot of the viewfAInder app after Gemini has returned an image description
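
Before we dive into the code, a quick word about dependencies. The app relies on CameraX for the preview and image analysis, on Google's generative AI client SDK for talking to Gemini, and on the compose-markdown library (more on that later) for rendering the response. The following snippet is just a sketch in Gradle's Kotlin DSL: the coordinates are the publicly documented ones, but the versions are placeholders, and the actual project manages its dependencies through a libs.versions.toml version catalog. Note that compose-markdown is distributed via JitPack, so that repository has to be configured as well.

// Module-level build.gradle.kts (sketch; versions are placeholders)
dependencies {
  // CameraX: preview, lifecycle binding, and the PreviewView widget
  implementation("androidx.camera:camera-camera2:1.3.4")
  implementation("androidx.camera:camera-lifecycle:1.3.4")
  implementation("androidx.camera:camera-view:1.3.4")

  // Google AI client SDK for Android (Gemini)
  implementation("com.google.ai.client.generativeai:generativeai:0.9.0")

  // Markdown rendering in Compose
  implementation("com.github.jeziellago:compose-markdown:0.5.0")
}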

Here's the main activity:

class MainActivity : ComponentActivity() {

  private val cameraPermissionFlow: MutableStateFlow<Boolean>
                  = MutableStateFlow(false)
  private val launcher = registerForActivityResult(
    ActivityResultContracts.RequestPermission()
  ) { granted ->
    cameraPermissionFlow.update { granted }
  }

  override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    enableEdgeToEdge()
    setContent {
      MaterialTheme {
        Surface(
          modifier = Modifier.fillMaxSize(),
          color = MaterialTheme.colorScheme.background,
        ) {
          val hasCameraPermission
                  by cameraPermissionFlow.collectAsState()
          val mainViewModel: MainViewModel = viewModel()
          val uiState by mainViewModel.uiState.collectAsState()
          MainScreen(uiState = uiState,
            hasCameraPermission = hasCameraPermission,
            setBitmap = { mainViewModel.setBitmap(it) },
            askGemini = { mainViewModel.askGemini() },
            reset = { mainViewModel.reset() })
        }
      }
    }
  }

  override fun onStart() {
    super.onStart()
    if (ContextCompat.checkSelfPermission(
        this, Manifest.permission.CAMERA
      ) == PackageManager.PERMISSION_GRANTED
    ) {
      cameraPermissionFlow.update { true }
    } else {
      launcher.launch(Manifest.permission.CAMERA)
    }
  }
}

Besides enabling edge-to-edge and requesting the android.permission.CAMERA permission at runtime, nothing exciting is happening here, right? Just keep in mind that the permission must also be declared in the manifest, otherwise the runtime request cannot be granted. Next, let's look at the main screen.

@Composable
fun MainScreen(
  uiState: UiState,
  hasCameraPermission: Boolean,
  setBitmap: (Bitmap?) -> Unit,
  askGemini: () -> Unit,
  reset: () -> Unit
) {
  Box(contentAlignment = Alignment.Center) {
    if (hasCameraPermission) {
      CameraPreview(setBitmap = { setBitmap(it) },
        onClick = { if (uiState !is UiState.Success) askGemini() })
    }

    when (uiState) {
      is UiState.Success -> {
        Column(
          modifier = Modifier
            .fillMaxSize()
            .background(color = Color(0xa0000000))
            .safeContentPadding()
        ) {
          MarkdownText(
            markdown = uiState.outputText,
            modifier = Modifier
              .fillMaxWidth()
              .weight(1F)
              .verticalScroll(rememberScrollState()),
            style = MaterialTheme.typography.bodyLarge.merge(Color.White)
          )
          Button(
            onClick = { reset() },
            modifier = Modifier
              .padding(all = 32.dp)
              .align(Alignment.End)
          ) {
            Text(text = stringResource(id = R.string.done))
          }
        }
      }

      is UiState.Error -> {
        Text(
          text = uiState.errorMessage,
          style = MaterialTheme.typography.bodyLarge,
          color = Color.Red,
          modifier = Modifier
            .fillMaxWidth()
            .background(color = Color(0xa0000000))
        )
      }

      is UiState.Loading -> {
        CircularProgressIndicator()
      }

      else -> {}
    }
  }
}

Nothing unexpected here, either. The fancy Markdown output you may have noticed in the screenshot above is provided by Jeziel Lago's excellent compose-markdown library. We just need to invoke the MarkdownText() composable.
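
By the way, the when block in MainScreen branches on a UiState type that I haven't shown. Judging from the properties used there and in the ViewModel further down, a minimal sketch could look like this (the actual definition in the repository may differ slightly):

sealed interface UiState {
  // Nothing has happened yet; just show the camera preview
  data object Initial : UiState
  // A request to Gemini is in flight
  data object Loading : UiState
  // Gemini returned a description of the image
  data class Success(val outputText: String) : UiState
  // Something went wrong while talking to Gemini
  data class Error(val errorMessage: String) : UiState
}

Next, here's the CameraPreview() composable: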

@OptIn(ExperimentalGetImage::class)
@Composable
fun CameraPreview(setBitmap: (Bitmap?) -> Unit, onClick: () -> Unit) {
  val context = LocalContext.current
  val lifecycleOwner = LocalLifecycleOwner.current
  val cameraProviderFuture = remember {
          ProcessCameraProvider.getInstance(context)
  }

  AndroidView(modifier = Modifier
    .fillMaxSize()
    .clickable {
      onClick()
    }, factory = { ctx ->
    val previewView = PreviewView(ctx)
    val executor = ContextCompat.getMainExecutor(ctx)
    cameraProviderFuture.addListener({
      val cameraProvider = cameraProviderFuture.get()
      val preview = Preview.Builder().build().also {
        it.setSurfaceProvider(previewView.surfaceProvider)
      }

      val imageAnalyzer = ImageAnalysis.Builder()
              .setBackpressureStrategy(
                  ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
              .build().also {
          it.setAnalyzer(executor) { imageProxy ->
            setBitmap(imageProxy.toBitmap())
            imageProxy.close()
          }
        }

      try {
        cameraProvider.unbindAll()
        cameraProvider.bindToLifecycle(
          lifecycleOwner, CameraSelector.DEFAULT_BACK_CAMERA,
          preview, imageAnalyzer
        )
      } catch (e: Exception) {
        // Handle exceptions, e.g., log the error
      }
    }, executor)
    previewView
  })
}

The CameraPreview() composable uses an AndroidView() to render the camera preview (PreviewView). Two things are worth noting:

  • Preview.Builder() is used to get a continuous preview
  • ImageAnalysis.Builder() is used to feed each new frame into our ViewModel by invoking the setBitmap() callback (see the note right after this list)
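
One thing the analyzer glosses over is rotation: ImageProxy.toBitmap() does not apply the sensor rotation, so depending on how the device is held, the bitmap handed to the ViewModel (and later to Gemini) may be rotated relative to what the preview shows. The original code does not deal with this; here is just a minimal sketch of how the analyzer lambda inside CameraPreview() could compensate, using imageInfo.rotationDegrees and android.graphics.Matrix:

it.setAnalyzer(executor) { imageProxy ->
  // Rotate the frame so it matches the on-screen preview orientation
  val rotation = imageProxy.imageInfo.rotationDegrees
  val frame = imageProxy.toBitmap()
  val upright = if (rotation != 0) {
    val matrix = Matrix().apply { postRotate(rotation.toFloat()) }
    Bitmap.createBitmap(frame, 0, 0, frame.width, frame.height, matrix, true)
  } else {
    frame
  }
  setBitmap(upright)
  imageProxy.close()
}

Which brings us to the last piece of code I want to show you in this part.
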
class MainViewModel : ViewModel() {

  private val _uiState: MutableStateFlow<UiState> =
                  MutableStateFlow(UiState.Initial)
  val uiState: StateFlow<UiState> = _uiState.asStateFlow()

  private val _bitmap: MutableStateFlow<Bitmap?> =
                  MutableStateFlow(null)

  private val generativeModel = GenerativeModel(
    modelName = modelName, apiKey = BuildConfig.apiKey
  )

  fun setBitmap(bitmap: Bitmap?) {
    _bitmap.update { bitmap }
  }

  fun askGemini() {
    _bitmap.value?.let { bitmap ->
      sendPrompt(bitmap = bitmap)
    }
  }

  fun reset() {
    _uiState.update { UiState.Initial }
  }

  private fun sendPrompt(bitmap: Bitmap) {
    _uiState.update { UiState.Loading }
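    // generateContent() is a suspend function, so launch a coroutine off the main thread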
    viewModelScope.launch(Dispatchers.IO) {
      try {
        val response = generativeModel.generateContent(content {
          image(bitmap)
          text(prompt)
        })
        response.text?.let { outputContent ->
          _uiState.value = UiState.Success(outputContent)
        }
      } catch (e: Exception) {
        _uiState.value = UiState.Error(e.localizedMessage ?: "")
      }
    }
  }
}

As you can see, setBitmap() just updates a property. The communication with Gemini is started in askGemini(), which delegates the work to sendPrompt(), which in turn calls generateContent(). This function belongs to the com.google.ai.client.generativeai package. The corresponding library is included in our libs.versions.toml file.
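
The ViewModel also references modelName, prompt, and BuildConfig.apiKey, which are defined elsewhere in the project. Here is a rough sketch of what they might look like, assuming the Gemini 1.5 Pro model mentioned above, a hypothetical prompt text, and an API key injected at build time (for example via the Secrets Gradle plugin and an entry in local.properties that never gets checked into version control):

// Sketch only: the names come from the code above, the values are assumptions
// The article uses Gemini 1.5 Pro
const val modelName = "gemini-1.5-pro"

// A hypothetical instruction that is sent to Gemini together with the bitmap
const val prompt = "Describe what is visible in this picture."

// BuildConfig.apiKey is generated at build time, for example from an entry
// in local.properties, so the key itself stays out of version control

GenerativeModel then receives the model name and the API key in its constructor, as shown at the top of the ViewModel.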

This concludes the first part of this series. I hope you enjoyed reading. Please share your thoughts in the comments.
