FiftyOne Computer Vision Tips and Tricks - Feb 23, 2024

#computervision #machinelearning #ai #datascience

Welcome to our weekly FiftyOne tips and tricks blog where we recap interesting questions and answers that have recently popped up on Slack, GitHub, Stack Overflow, and Reddit.

The FiftyOne community is open source and open to all. This means everyone is welcome to ask questions, and everyone is welcome to answer them. Continue reading to see the latest questions asked and answers provided!

Wait, what’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

If you like what you see on GitHub, give the project a star.
Get started! We’ve made it easy to get up and running in a few minutes.
Join the FiftyOne Slack community. We’re always happy to help.

Ok, let’s dive into this week’s tips and tricks!

Using dataset.export to specify arbitrary file paths

Community Slack member Dimitrios asked:

Is there a way to add an arbitrary file's path as a sample field, and have the file copied over to a new location when running dataset.export?

One approach is to modify one of the existing media exporters, or to write your own custom exporter that copies the extra file for each sample as it is exported. Check out the “Exporting FiftyOne Datasets” section in the Docs as well as the “Custom Formats” section.
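Whichever route you take, the core of the work is copying each referenced file into the export location and remembering where it landed. Here is a minimal sketch of that step using only the standard library. The helper name copy_aux_files and the field name aux_path are hypothetical, and the collision-handling scheme is just one possible choice, not something FiftyOne prescribes.

```python
import os
import shutil


def copy_aux_files(paths, export_dir):
    """Copy each file in `paths` into `export_dir`.

    Returns a dict mapping each original path to its new location.
    Empty values and repeated paths are skipped.
    """
    os.makedirs(export_dir, exist_ok=True)
    mapping = {}
    used_names = set()

    for path in paths:
        if not path or path in mapping:
            continue

        stem, ext = os.path.splitext(os.path.basename(path))

        # Avoid overwriting files that share a name but live in different folders
        name = stem + ext
        count = 1
        while name in used_names:
            name = f"{stem}-{count}{ext}"
            count += 1
        used_names.add(name)

        new_path = os.path.join(export_dir, name)
        shutil.copy2(path, new_path)
        mapping[path] = new_path

    return mapping


# Hypothetical usage with a FiftyOne dataset whose samples store the extra
# file's location in a string field named "aux_path":
#
#   mapping = copy_aux_files(dataset.values("aux_path"), "/path/to/export/aux")
```

Inside a custom exporter, the same copy could run once per sample as it is written out, with the returned mapping used to record the new location alongside the exported media.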

Clearing selected samples in the FiftyOne App programmatically

Community Slack member Nadav asked:

Is there SDK code that allows me to programmatically clear all the selected samples in the FiftyOne App?

Depending on your use case, you have two options:

Call session.clear_selected() if you have a session object.
Call ctx.trigger('clear_selected_samples') from within a plugin, where no session object is initialized.

Learn more in the Docs.
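As a rough sketch, the two routes can be wrapped in one small helper. The helper name clear_selection is hypothetical. It assumes session is an App session such as the one returned by fo.launch_app(), and ctx is the execution context that a plugin operator receives.

```python
def clear_selection(session=None, ctx=None):
    """Clear the currently selected samples in the FiftyOne App.

    Pass `session` when working from the Python SDK, or `ctx` when running
    inside a plugin operator. Returns which route was used.
    """
    if session is not None:
        # SDK route: operate on the session object directly
        session.clear_selected()
        return "session"

    if ctx is not None:
        # Plugin route: trigger the operator named in the answer above
        ctx.trigger("clear_selected_samples")
        return "plugin"

    raise ValueError("Provide either a session or a plugin ctx")


# Hypothetical usage from a script or notebook:
#
#   import fiftyone as fo
#   session = fo.launch_app(dataset)
#   clear_selection(session=session)
```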

Adding custom attributes

Community Slack member Villus asked:

I would like to extend FiftyOne’s metadata fields. Is there a way to do this?

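Yes, you can make use of FiftyOne’s dataset.add_dynamic_sample_fields() method, which adds the custom attributes found on your samples’ embedded documents to the dataset’s schema. You can learn more about working with arbitrary custom attributes in the “Dynamic Attributes” section of the Docs.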

Working with very long video samples

Community Slack member Daniel asked:

I'm using the FiftyOne App to retrieve a view of video samples that have a certain tag. Unfortunately, I get an error when I click on the tag. I suspect I am getting this error because the size of the frames field is likely too big (the video is around an hour long). Is there a way to solve this or should I just split the big videos into smaller chunks?

There are a few workarounds to consider here:

Raising MongoDB's maximum BSON document size is not a practical option: the 16 MB limit is fixed in the server rather than exposed as a setting, and it exists for good reasons related to performance and stability.
You can optimize your aggregation pipeline so that it is more efficient. This might involve filtering data earlier in the pipeline to reduce the amount of data being processed in the $lookup stage.
As you suggest, you can split up the videos. Here’s some code to get you started:

import os
import subprocess

import fiftyone as fo
from fiftyone import ViewField as F


def split_video(video_path, segment_duration, output_dir="split_videos"):
    """Split a video into fixed-length segments with ffmpeg, without re-encoding."""
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Use splitext so filenames that contain extra dots keep their full stem
    base_name = os.path.splitext(os.path.basename(video_path))[0]

    # Construct the ffmpeg command
    command = [
        "ffmpeg",
        "-i", video_path,
        "-c", "copy",  # copy streams as-is; cuts land on keyframes
        "-map", "0",
        "-segment_time", str(segment_duration),  # segment duration in seconds
        "-f", "segment",
        "-reset_timestamps", "1",
        os.path.join(output_dir, f"{base_name}_%03d.mp4"),
    ]

    # Run the command and raise an error if ffmpeg fails
    subprocess.run(command, check=True)


# Load your dataset
dataset = fo.load_dataset("your_dataset_name")


def create_frame_subsets(sample, frame_step=1000):
    """Walk through one video sample in windows of `frame_step` frames."""
    # Determine the range of frames (frame numbers are 1-based)
    frame_numbers = sorted(sample.frames.keys())
    if not frame_numbers:
        return

    max_frame = frame_numbers[-1]

    # Create subsets; the +1 ensures the final frame is always covered
    for start_frame in range(1, max_frame + 1, frame_step):
        end_frame = min(start_frame + frame_step - 1, max_frame)

        # Create a view with only this sample's frames in the range
        frame_view = dataset.select(sample.id).match_frames(
            (F("frame_number") >= start_frame) & (F("frame_number") <= end_frame)
        )

        # Process this frame_view or add it to a new dataset
        # ...


# Apply to each sample in the dataset
for sample in dataset:
    create_frame_subsets(sample)
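The frame ranges used when chunking a long video are easy to get wrong by one, so it can help to check the arithmetic in isolation. The sketch below is a standalone helper (frame_windows is a hypothetical name, not part of FiftyOne) that assumes 1-based frame numbers and inclusive ranges.

```python
def frame_windows(total_frames, frame_step=1000):
    """Yield inclusive (start, end) frame ranges that cover 1..total_frames."""
    if frame_step < 1:
        raise ValueError("frame_step must be at least 1")

    # Stop at total_frames + 1 so that the last frame always lands in a window
    for start in range(1, total_frames + 1, frame_step):
        yield start, min(start + frame_step - 1, total_frames)


print(list(frame_windows(2500)))
# [(1, 1000), (1001, 2000), (2001, 2500)]

print(list(frame_windows(1001)))
# [(1, 1000), (1001, 1001)]

# One hour of video at an assumed 30 frames per second
print(len(list(frame_windows(60 * 60 * 30))))
# 108
```

Each (start, end) pair can then drive one view or one export step, which keeps the amount of frame data handled at a time bounded.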

How to filter all unlabeled samples programmatically

Community Slack member Nadav asked:

I have a classification field in my data. What is the best way to filter all unlabeled samples? Currently I'm using ctx.dataset.match(F(source_field) != None) but PyCharm raises a warning about that.

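A few things to note:

PyCharm is most likely flagging the PEP 8 convention of comparing to None with is not rather than !=. That convention does not apply here: F(source_field) overloads != to build a query expression, whereas is not None would be evaluated by Python immediately and simply return True. The warning is safe to ignore or suppress.
You could also use something like dataset.match(F(source_field).exists()) to keep the labeled samples, or dataset.match(F(source_field).exists(False)) to keep the unlabeled ones.
And finally, you could try dataset.exists(source_field) or dataset.exists(source_field, bool=False) for the same two results. Check out the “exists” section on the “FiftyOne Core Collections” Docs.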
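To see why an identity check cannot stand in for != in an expression like F(source_field) != None, here is a toy sketch. The Field class below is a simplified stand-in written for illustration only. It is not FiftyOne's implementation, and the dictionary it returns is just a placeholder for a query expression.

```python
class Field:
    """Toy expression builder that overloads != (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __ne__(self, other):
        # Return a description of the comparison instead of a bool,
        # so it can be evaluated later by a database
        return {"$ne": [f"${self.name}", other]}


# != calls Field.__ne__, so we get an expression object back
expr = Field("label") != None  # noqa: E711
print(expr)
# {'$ne': ['$label', None]}

# `is not` cannot be overloaded; Python evaluates it on the spot
plain = Field("label") is not None
print(plain)
# True
```

Passing a plain True to a filter would no longer describe a per-sample condition, which is why the linter's suggestion should not be applied to expressions like this.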