Are you a big TV show binge-watcher? I sure am! As a dev, one thing I really enjoy is when a particular show highlights how technology interacts with the real world in believable ways, impacting us with unexpected—and often unintentionally hilarious—results.
For instance, back in 2017, HBO’s Silicon Valley aired an episode with the “Hot dog — Not hot dog” scene, in which Jian-Yang builds an app that recognizes hot dogs and classifies everything else as “not hot dog.” The scene depicts a classic first step in training an AI for visual recognition.
For this tutorial, I will train a custom AI model using IBM Watson, and then use that model to detect “hot dog” or “not hot dog” within a live camera view. I’ll use augmented reality to display the result to the user. Since everything is more fun with friends, we’ll add a live streaming component to it!
Prerequisites
- Basic understanding of Swift
- Basic understanding of ARKit
- Basic understanding of CoreML and AI
- Agora.io Developer Account
- IBM Cloud Account with Watson Studio
Please Note: While no deep CoreML/AI knowledge is needed to follow along, certain basic concepts won't be explained along the way.
Device Requirements
In this project, we’ll be using ARKit, so we have some device requirements:
- iPhone 6S or newer
- iPhone SE
- iPad (2017) or newer
- All iPad Pro models
Training the AI
Before we can build our iOS app, we first need to train the computer vision AI model. I chose IBM’s Watson Studio because it provides a very simple, drag-and-drop interface for training a computer vision model.
Create the Watson Project
Once you’ve created and logged into your Watson Studio account, click the Create Project button. Give your project a name and description, add storage, and then click to create the project.
Next, click Add Project and select Visual Recognition. Make sure to follow the prompts to add a Watson Visual Recognition service to the project. On the Custom Models screen, we’re going to Classify Images, so click the Create Model button. Now let’s name our model. I chose the name HotDog!
Source Training Images
Now that we've set up our Watson instance, we need images to train our model. Sourcing images for AI training may seem like a daunting undertaking. While it is quite a heavy lift, there are tools that help make this task easier.
I chose to use the Google Images Download Python script. The script makes it easy to scrape images from Google while still respecting the original owners' copyrights.
Once you have set up the Google Images Download script, let's open up the command line and run it using:
googleimagesdownload --keywords "hotdog" --usage_rights labeled-for-reuse
The results will include some images that aren’t actually hot dogs (illustrations, costumes, and the like). We need to remove any photos like these so we are left with only pictures of real hot dogs (images can include toppings). Once all of those images are removed, you’ll notice that we’re left with about 50 photos. That isn’t very many, considering you’d usually want thousands of photos to train your model. While Watson could probably work with only 50 or so photos, let’s run the script a few more times with other keywords. These are the commands I used:
googleimagesdownload --keywords "hotdog" --usage_rights labeled-for-reuse
googleimagesdownload --keywords "plain hotdog" --usage_rights labeled-for-reuse
googleimagesdownload --keywords "real hotdog" --usage_rights labeled-for-reuse
googleimagesdownload --keywords "hotdog no toppings" --usage_rights labeled-for-reuse
After running the script a few times and removing any non-real hot dog images from each set of results, I was able to source 170 images for my model.
Let's put all our hot dog images into a single folder and name it hotdog. Now that we have our hot dog images, we need to find some not-hotdog images. Again, use the Google Images Download script, but this time with a batch of keywords. I used:
googleimagesdownload --keywords "cake, pizza, hamburger, french fries, cup, plate, fork, glasses, computer, sandwich, table, dinner, meal, person, hand, keyboard" --usage_rights labeled-for-reuse
IBM Watson's free tier imposes a file size limit (250 MB per training round), so once we've downloaded all of our non-hotdog images, we need to remove any images with large file sizes. Let's move all the images into a single folder and name it nothotdog. Next, zip each folder so you have hotdog.zip and nothotdog.zip.
Now, go back to the Watson Studio project and upload the hotdog.zip file to our computer vision model. Once the zip finishes uploading, you'll notice that a new class, hotdog, has been created for us.
Next, upload your nothotdog.zip file. After it finishes uploading, you'll have two classes: hotdog and nothotdog. For this example, we only need one class, hotdog; the other class needs to be migrated into the existing Negative class. To do this, we need to open up the nothotdog class and select all the images. To do so, select the list view from the top, then scroll to the bottom and set the list length to 200, then scroll back to the top and click the select all button.
With all your images selected, click the Reclassify button, select the Negative class, and click Submit.
Once all the images have been reclassified, click back to the list of models to select and delete the nothotdog class. Now, we are ready to click the Train Model button to get Watson trained on our images.
Note: if you want to use a large data set you’ll have to break it up into pieces and repeat the training process (above) for each batch.
That's about it for collecting training images; all in all, it wasn't too bad.
Building the iOS App
Now that Watson is training the visual recognition model, we are ready to build our iOS app.
In this example, we’ll build an app that allows users to create or join a channel. Users who create a channel can then live stream themselves while the custom IBM Watson model infers “hot dog” or “not hot dog.”
Let’s start by creating a new single view app in Xcode.
Remove Scene Delegate
Since we are using the Storyboard interface, we can remove SceneDelegate.swift and the Scene Manifest entry from the Info.plist. Then we need to open AppDelegate.swift, remove the Scene Delegate methods, and add the window property. Your AppDelegate.swift should look like this:
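Here's a minimal sketch of what that can look like (assuming a stock single view app template; adjust to match your own project):

import UIKit

@UIApplicationMain
class AppDelegate: UIResponder, UIApplicationDelegate {

    // the window property that the Scene Delegate previously managed
    var window: UIWindow?

    func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        // Nothing extra is needed here; the Main storyboard loads the initial view controller
        return true
    }
}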
Since this project will implement ARKit and Agora, we’ll use the AgoraARKit library to simplify the implementation and UI for us.
Create a Podfile, open it, and add the AgoraARKit pod.
platform :ios, '12.2'

target 'Agora Watson ARKit Demo' do
  use_frameworks!
  # Pods for Agora Watson ARKit Demo
  pod 'AgoraARKit'

  target 'Agora Watson ARKit DemoTests' do
    inherit! :search_paths
    # Pods for testing
  end

  target 'Agora Watson ARKit DemoUITests' do
    # Pods for testing
  end
end
Then run the install:
pod install
Permissions
Add NSCameraUsageDescription, NSMicrophoneUsageDescription, NSPhotoLibraryAddUsageDescription, and NSPhotoLibraryUsageDescription to the Info.plist with a brief description for each. AgoraARKit uses the popular ARVideoKit framework, and the last two permissions are required by ARVideoKit because of its ability to store photos/videos.
Note: I'm not implementing on-device recording, so this demo doesn't actually use the photo library permissions; but if you plan to use this in production, you will need to include them because they are requirements of ARVideoKit. For more information, review Apple's guidelines on permissions.
Building the UI
We are ready to start building the UI. For this app, we will build two views: the initial view and the AR view.
Within any live streaming and communication app, you (as the developer) have two options for setting a channel name: do it for the user, or allow users to input their own. The latter is more flexible, so we're going to have our initial view inherit from AgoraLobbyVC and allow users to input a channel name. Open your ViewController.swift, add import AgoraARKit just below the import UIKit line, and set your ViewController class to inherit from AgoraLobbyVC.
Next, set your Agora App ID within the loadView method and also set a custom image for the bannerImage property.
Next, let's override the joinSession and createSession methods within our ViewController to set the images for the audience and broadcaster views.
import UIKit
import AgoraARKit

class ViewController: AgoraLobbyVC {

    override func loadView() {
        super.loadView()
        AgoraARKit.agoraAppId = ""
        // set the banner image within the initial view
        if let agoraLogo = UIImage(named: "watson_live_banner") {
            self.bannerImage = agoraLogo
        }
    }

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view.
    }

    // MARK: Button Actions
    @IBAction override func joinSession() {
        if let channelName = self.userInput.text {
            if channelName != "" {
                let arAudienceVC = ARAudience()
                if let exitBtnImage = UIImage(named: "exit") {
                    arAudienceVC.backBtnImage = exitBtnImage
                }
                arAudienceVC.channelName = channelName
                arAudienceVC.modalPresentationStyle = .fullScreen
                self.present(arAudienceVC, animated: true, completion: nil)
            } else {
                // TODO: add visible msg to user
                print("unable to join a broadcast without a channel name")
            }
        }
    }

    @IBAction override func createSession() {
        if let channelName = self.userInput.text {
            if channelName != "" {
                let arBroadcastVC = ARBroadcaster()
                if let exitBtnImage = UIImage(named: "exit") {
                    arBroadcastVC.backBtnImage = exitBtnImage
                }
                if let micBtnImage = UIImage(named: "mic"),
                   let muteBtnImage = UIImage(named: "mute") {
                    arBroadcastVC.micBtnImage = micBtnImage
                    arBroadcastVC.muteBtnImage = muteBtnImage
                }
                arBroadcastVC.channelName = channelName
                arBroadcastVC.modalPresentationStyle = .fullScreen
                self.present(arBroadcastVC, animated: true, completion: nil)
            } else {
                // TODO: add visible msg to user
                print("unable to launch a broadcast without a channel name")
            }
        }
    }
}
Adding in the AI
Once Watson has finished training your model, you’ll need to download the Core ML file. Open Watson Studio and select the Hotdog model. Within the model details, select the Implementation tab, then select the Core ML tab from the sub-menu on the left side of the screen. At the top of the Core ML section is the link to download the Core ML model file.
Once you’ve downloaded the Hotdog.mlmodel file, drag the file into your Xcode project.
The computer vision will be running within our AR view, which is also the camera view being streamed to Agora, so we'll extend the ARBroadcaster class. The ARBroadcaster class is a bare-bones ARSCNView that is set up as a custom video source for Agora's SDK.
Create a new class called arHotDogBroadcaster which inherits from ARBroadcaster (this file will need to import UIKit, ARKit, SceneKit, CoreML, Vision, and AgoraARKit for the code below). Next, we need to add properties for the Vision requests, the DispatchQueue, and the nodes used to display results, and then extend viewDidLoad to set up the Core ML model.
let mlModel: MLModel = Hotdog().model
var visionRequests = [VNRequest]()
let dispatchQueueML = DispatchQueue(label: "io.agora.dispatchqueue.ml") // A Serial Queue
let resultsRootNode = SCNNode()  // parent node that the AR result labels get attached to (see showResult below)
let textDepth: Float = 0.01      // extrusion depth for the 3D result text (a reasonable default; tune to taste)

override func viewDidLoad() {
    super.viewDidLoad()
    // Set up Vision Model
    guard let hotDogModel = try? VNCoreMLModel(for: mlModel) else {
        fatalError("Could not load model. Ensure the Core ML model is in your Xcode project and part of a target (see: https://stackoverflow.com/questions/45884085/model-is-not-part-of-any-target-add-the-model-to-a-target-to-enable-generation )")
    }
    // Set up Vision-CoreML Request
    let classificationRequest = VNCoreMLRequest(model: hotDogModel, completionHandler: classificationCompleteHandler)
    classificationRequest.imageCropAndScaleOption = VNImageCropAndScaleOption.centerCrop // Crop from centre of images and scale to appropriate size.
    visionRequests = [classificationRequest]
}
We'll use the currentFrame from the ARKit scene as the input for our computer vision. Use currentFrame.capturedImage to create a CIImage that will be used as the input for our VNImageRequestHandler.
func runCoreML() {
    // Get Camera Image as RGB
    guard let sceneView = self.sceneView else { return }
    guard let currentFrame = sceneView.session.currentFrame else { return }
    let pixbuff: CVPixelBuffer = currentFrame.capturedImage
    let ciImage = CIImage(cvPixelBuffer: pixbuff)
    // Prepare CoreML/Vision Request
    let imageRequestHandler = VNImageRequestHandler(ciImage: ciImage, options: [:])
    // Run Image Request
    do {
        try imageRequestHandler.perform(self.visionRequests)
    } catch {
        print(error)
    }
}
How do we know what the results are? If you look at the viewDidLoad snippet above, you'll notice we set classificationCompleteHandler as the completion block for any classification requests.
func classificationCompleteHandler(request: VNRequest, error: Error?) {
    // Catch Errors
    if let error = error {
        print("Error: \(error.localizedDescription)")
        return
    }
    // Get Classifications
    guard let observations = request.results,
          let classification = observations.first as? VNClassificationObservation else {
        print("No results")
        return
    }
    DispatchQueue.main.async {
        // Print Classifications
        print("--")
        // Display Debug Text on screen
        let debugText: String = "- \(classification.identifier) : \(classification.confidence)"
        print(debugText)
        // Display prediction
        var objectName: String = "Not Hotdog"
        if classification.confidence > 0.4 {
            objectName = "Hotdog"
        }
        // show the result
        self.showResult(objectName)
    }
}
Every time the CoreML engine returns a response, we need to parse it and display “Hotdog” or “Not Hotdog” to the user. You’ll notice in the snippet that once we have a result, we check its confidence level. I set the bar fairly low, at 40% confidence. That means the AI model only has to be 40% confident that it sees a hot dog.
During testing, 40% confidence proved adequate for the intents of this project, but you may want to adjust that value depending on how sensitive you want your AI to be.
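If you want to make that sensitivity easier to tweak, one option is to pull the value into a property on the broadcaster and reference it in the completion handler. A small sketch (the confidenceThreshold name is my own, not part of the original project):

// Hypothetical tuning property: raise it to reduce false positives, lower it to be more lenient
let confidenceThreshold: Float = 0.4

// ...then, inside classificationCompleteHandler:
var objectName: String = "Not Hotdog"
if classification.confidence > confidenceThreshold {
    objectName = "Hotdog"
}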
All that's left now is to display the result to the user using augmented reality. You'll notice in the classificationCompleteHandler, we call self.showResult and pass in a string with the value of either "Hotdog" or "Not Hotdog." Within showResult, we need to get the estimated position of the object and add an AR text label.
func showResult(_ result: String) {
    // Attach the results root node to the scene the first time a label is shown
    if resultsRootNode.parent == nil {
        self.sceneView.scene.rootNode.addChildNode(resultsRootNode)
    }
    // HIT TEST : REAL WORLD
    // Get Screen Centre
    let screenCentre: CGPoint = CGPoint(x: self.sceneView.bounds.midX, y: self.sceneView.bounds.midY)
    let arHitTestResults: [ARHitTestResult] = sceneView.hitTest(screenCentre, types: [.featurePoint]) // Alternatively, we could use '.existingPlaneUsingExtent' for more grounded hit-test-points.
    if let closestResult = arHitTestResults.first {
        // Get Coordinates of HitTest
        let transform: matrix_float4x4 = closestResult.worldTransform
        let worldCoord: SCNVector3 = SCNVector3Make(transform.columns.3.x, transform.columns.3.y, transform.columns.3.z)
        // Create 3D Text
        let node: SCNNode = createNewResultsNode(result)
        resultsRootNode.addChildNode(node)
        node.position = worldCoord
    }
}
func createNewResultsNode(_ text: String) -> SCNNode {
    // Warning: Programmatically generating 3D Text is susceptible to crashing. To reduce chances of crashing: reduce number of polygons, letters, smoothness, etc.
    print("show result: \(text)")
    // Billboard constraint to force the text to always face the user
    let billboardConstraint = SCNBillboardConstraint()
    billboardConstraint.freeAxes = SCNBillboardAxis.Y
    // SCN Text
    let scnText = SCNText(string: text, extrusionDepth: CGFloat(textDepth))
    scnText.font = UIFont(name: "Helvetica-Bold", size: 0.15) // small bold font; the text node is scaled down further below
    scnText.alignmentMode = CATextLayerAlignmentMode.center.rawValue
    scnText.firstMaterial?.diffuse.contents = UIColor.orange
    scnText.firstMaterial?.specular.contents = UIColor.white
    scnText.firstMaterial?.isDoubleSided = true
    scnText.chamferRadius = CGFloat(textDepth)
    // Text Node
    let (minBound, maxBound) = scnText.boundingBox
    let textNode = SCNNode(geometry: scnText)
    // Centre Node - to Centre-Bottom point
    textNode.pivot = SCNMatrix4MakeTranslation((maxBound.x - minBound.x)/2, minBound.y, textDepth/2)
    // Reduce default text size
    textNode.scale = SCNVector3Make(0.2, 0.2, 0.2)
    // Sphere Node
    let sphere = SCNSphere(radius: 0.005)
    sphere.firstMaterial?.diffuse.contents = UIColor.cyan
    let sphereNode = SCNNode(geometry: sphere)
    // Text Parent Node
    let textParentNode = SCNNode()
    textParentNode.addChildNode(textNode)
    textParentNode.addChildNode(sphereNode)
    textParentNode.constraints = [billboardConstraint]
    return textParentNode
}
Now that we have our model ready to run, we need to add a way for the user to invoke the computer vision model. Let's use the View's touchesBegan method to call the runCoreML method.
override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
    dispatchQueueML.async {
        self.runCoreML()
    }
}
Implement the New Broadcaster Class
We're almost done. The last step (before we can start testing) is to swap in the new broadcaster in ViewController.swift by updating the line in createSession that instantiates ARBroadcaster to:
let arBroadcastVC = arHotDogBroadcaster()
This will set the broadcaster to use our new arHotDogBroadcaster, and we are ready to start testing!
That's It!
The core application is done; I'll leave it up to you to customize the UI. Thanks for following along. If you have any questions or feedback, please leave a comment.
I've uploaded my complete code, with UI customizations, to GitHub, so feel free to fork the repo and make PRs for new features.