Vladislav Radchenko

Practical Experience: Integrating Over 50 Neural Networks Into One Open-Source Project

A year and a half ago, I embarked on an open-source project that has since grown and evolved significantly. Inspired by the AUTOMATIC1111 project, which was just starting to gain traction at the time, I kept adding new features and capabilities. Today, my project integrates over 50 different neural networks, each handling a unique task. In this article, I want to share some practical tips and key takeaways from my journey. I hope they prove helpful to you and motivate you to refactor your code.

My open-source project focuses on creating and editing video, images, and audio using neural networks. Often, different methods can achieve similar outcomes, but ensuring consistency across the project has been a major challenge. As I integrated open-source solutions, optimized them, and added new functionality, maintaining a unified approach became essential. For instance, features like face swapping, lip synchronization, and portrait animation all require facial recognition. Rather than using separate methods for each, as was common in the original solutions, I opted for a single, shared model for facial recognition. Consequently, the 50+ neural networks are organized such that each one serves a unique purpose without redundancy.

One model - One task

During development, I made a key decision: to avoid TensorFlow and any related frameworks, focusing solely on PyTorch and ONNX Runtime.

For those curious about the specific features or the neural networks I used, I have included several links: a YouTube playlist documenting the project's evolution and a short video created using my software.

Each model in this project is diverse and complex, performing tasks like image and video generation, facial recognition, segmentation, and much more. There are no simple solutions; every neural network fulfills a distinct role.

Let’s dive into the insights

Disclaimer: I am sharing my personal experience, the life hacks I use, and the challenges I've encountered during development. Since libraries and frameworks are constantly being updated, please ensure that you verify compatibility and check for any changes in the latest versions. The tips provided here reflect my approach at the time of writing and may require adjustments as new updates and features become available.

Tip 1: One Model is One Task

One of the first surprises I encountered was that you can’t load a single model into VRAM and use it simultaneously for multiple tasks. Each model must be loaded separately for its specific task. This realization laid the groundwork for many of the strategies I developed later on.

Tip 2: Managing a Task Queue

My application is built on Flask, which means users don’t have to wait for tasks to finish processing. They can initiate multiple tasks simultaneously, potentially putting a heavy load on memory. To prevent memory overflow, I introduced artificial delays between task executions, with random intervals to minimize the chances of multiple tasks starting at the same time. This approach also ties into Tip 3.
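
Here is a minimal sketch of the idea (the names are illustrative, not my project's actual code): the Flask handler returns its response immediately, while the heavy work starts in a background thread after a random stagger.

import random
import threading
from time import sleep

def run_with_delay(task, min_delay=5, max_delay=30):
    # Random stagger so several heavy tasks don't hit memory at the same moment
    sleep(random.randint(min_delay, max_delay))
    task()

def submit(task):
    # The Flask route returns right away; the task runs in the background
    threading.Thread(target=run_with_delay, args=(task,), daemon=True).start()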

Tip 3: Monitoring Memory Usage

Before launching a task, I measure the available memory on the device. If the memory falls below the threshold needed for the model, I intentionally delay the task execution. This proactive approach helps ensure that the system remains stable and tasks don’t fail due to insufficient memory.

import torch
import psutil

def get_vram_gb(device="cuda"):
    if torch.cuda.is_available():
        properties = torch.cuda.get_device_properties(device)  # Properties of the specific GPU we are using
        total_vram_gb = properties.total_memory / (1024 ** 3)
        available_vram_gb = (properties.total_memory - torch.cuda.memory_allocated(device)) / (1024 ** 3)
        busy_vram_gb = total_vram_gb - available_vram_gb
        return total_vram_gb, available_vram_gb, busy_vram_gb
    return 0, 0, 0


def get_ram_gb():
    mem = psutil.virtual_memory()
    total_ram_gb = mem.total / (1024 ** 3)
    available_ram_gb = mem.available / (1024 ** 3)
    busy_ram_gb = total_ram_gb - available_ram_gb
    return total_ram_gb, available_ram_gb, busy_ram_gb

Tip 4: Handling “CUDA Out of Memory” Errors

In addition to delaying task execution, I implemented checks for the most common error: “CUDA out of memory.” The solution is straightforward: if this error occurs, the system clears unnecessary data from memory and retries the process. This approach ensures that tasks can still complete successfully, even under memory constraints.

import random
from time import sleep

min_delay = 20
max_delay = 180

try:
    ...  # Launch the method with a neural network
except RuntimeError as err:
    if 'CUDA out of memory' in str(err):
        ...  # Clear memory (see the cache-clearing snippet in Tip 6)
        sleep(random.randint(min_delay, max_delay))
        ...  # Clear memory again
        ...  # Launch the method again
    else:
        raise err

Tip 5: Organizing Backend Modules

The backend of my application is organized into modules, categorized by specific properties: altering videos or images, generating videos or images, and modifying audio. Each module is also classified based on whether it handles frontend or backend tasks. Models that need to provide immediate results to users, such as segmentation, txt2img, and img2img, are prioritized differently from those that process larger, time-intensive tasks in the background.

Frontend models, like those used with:

const session = await ort.InferenceSession.create(MODEL_DIR);
console.log("Model loaded");

...are not part of this backend task management. As a result, I have to preload models into memory for quick response times, ensuring that different users don’t simultaneously access the same model (as discussed in Tip 1). Additionally, these preloaded models are reserved for tasks requiring rapid feedback, and they are not used for long-running processes, to avoid violating the constraints outlined in Tip 1.

Tip 6: Managing Memory-Intensive Models

Models designed for long-running tasks can be highly demanding, often consuming all available VRAM. From an optimization standpoint, frequently loading and unloading such models is inefficient, though sometimes necessary. To mitigate this, I use a strategy involving "micro models" — lightweight models that take up less memory but still require time for loading and unloading.

When processing tasks, we group them based on the method’s processing duration. Tasks from the same group are handled using these micro models, forming a queue before loading into a larger, memory-intensive model. Remember Tips 3 and 4? We have two strategies: estimating memory usage before loading the model or launching the model and handling a "CUDA out of memory" error.

Need to clear the cache!


When we encounter this error, we clear VRAM of unnecessary models, including those used for rapid responses, and clean up any residual data. This approach ensures that memory-intensive models can run efficiently without disrupting other tasks.

import gc

import torch

if torch.cuda.is_available():  # If CUDA is available, because the application can work without CUDA
    torch.cuda.empty_cache()  # Frees unused memory in the CUDA cache
    torch.cuda.ipc_collect()  # Performs garbage collection on CUDA objects accessed via IPC (inter-process communication)
gc.collect()  # Calls Python's garbage collector to free memory occupied by unused objects

Tip 7: Clearing Memory After Task Completion

After each task is completed, it's crucial to free up memory by removing variables and unloading models that are no longer needed. This can be done using:

del ...

This practice helps maintain efficient memory usage and prevents unnecessary VRAM and RAM consumption, ensuring the system stays optimized for subsequent tasks.
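
For instance, a minimal sketch of the full cleanup step (the Linear layer here is just a stand-in for whatever model or tensors a finished task was holding):

import gc

import torch

# Stand-in for a model that has finished its task
model = torch.nn.Linear(4, 4).to("cuda" if torch.cuda.is_available() else "cpu")

# Drop the Python reference, then reclaim the memory it held
del model
gc.collect()  # collect the now-unreferenced objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # hand the freed VRAM back to the CUDA allocator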

Tip 8: Layer-Wise Model Loading

To manage limited VRAM, models can be loaded layer by layer, distributing them between the GPU and CPU or even across multiple GPUs. However, all components of a single layer must reside on the same GPU. This method is particularly useful for tasks like image and video generation but can also be applied to other resource-intensive processes. By strategically loading models in this manner, you can maximize memory efficiency while still enabling complex operations.

device_map = {
    'encoder.layer.0': 'cuda:0',
    'encoder.layer.1': 'cuda:1',
    'decoder.layer.0': 'cuda:0',
    'decoder.layer.1': 'cuda:1',
}
# Or
device_map = {
    'encoder.layer.0': 'cuda',
    'encoder.layer.1': 'cpu',
    'decoder.layer.0': 'cuda',
    'decoder.layer.1': 'cpu',
}
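
Such a map is then handed to whatever loader builds the model. As a sketch, Hugging Face-style from_pretrained calls accept a device_map when the accelerate stack is installed (the model name below is a placeholder, and the layer names in your map must match the architecture's actual module names):

from transformers import AutoModel

# "some/model-name" is a placeholder; adjust device_map keys to the real module names
model = AutoModel.from_pretrained("some/model-name", device_map=device_map)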

Tip 9: Memory Optimization Techniques

Don’t forget to use enable_xformers_memory_efficient_attention() if your model's pipeline supports it. This method can significantly reduce memory usage. Additionally, there are other optimization techniques detailed in the documentation, such as enable_model_cpu_offload(), enable_vae_tiling(), and enable_attention_slicing(). In my project, these methods are especially useful for tasks like video restyling. However, for video generation, I rely on different, more specialized optimization strategies.

# vram is the device's GPU memory in GB (for example, as measured by get_vram_gb() from Tip 3)
if vram < 12:
    pipe.enable_sequential_cpu_offload()
    print("VRAM below 12 GB: Using sequential CPU offloading for memory efficiency. Expect slower generation.")
elif vram < 20:
    print("VRAM between 12-20 GB: Medium generation speed enabled.")
elif vram < 30:
    # Load essential modules to GPU
    for module in [pipe.vae, pipe.dit, pipe.text_encoder]:
        module.to("cuda")
    cpu_offloading = False
    print("VRAM between 20-30 GB: Sufficient memory for faster generation.")
else:
    # Maximize performance by disabling memory-saving options
    for module in [pipe.vae, pipe.dit, pipe.text_encoder]:
        module.to("cuda")
    cpu_offloading = False
    save_memory = False
    print("VRAM above 30 GB: Maximum speed enabled for generation.")

Tip 10: Efficient Frame Handling

Storing frames in memory can be a double-edged sword. On powerful machines with constraints on resolution or content duration, keeping everything in memory can be beneficial for speed. However, many users of my project run it on lower-end devices, often processing hour-long, high-resolution videos. To accommodate this, I rewrote all methods to work with the current frame and values, saving data to the hard drive rather than keeping it in memory. By accessing data as needed and storing only file references in a list, I managed to make the process more efficient and hardware-friendly. Additionally, using generators or chunked processing helps manage large datasets, a strategy I leverage in modules like face replacement.
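
As an illustration, here is a minimal sketch of that pattern with OpenCV (the function and directory names are made up for the example): each frame is written to disk as soon as it is read, and only its file path is yielded to the caller.

import os

import cv2

def extract_frames(video_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        path = os.path.join(out_dir, f"frame_{idx:06d}.png")
        cv2.imwrite(path, frame)  # keep the frame on disk, not in RAM
        yield path                # the caller only ever holds file paths
        idx += 1
    cap.release()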

Tip 11: Frame Resolution Adjustments

Depending on the model, I sometimes need to resize frames to dimensions that the user's device can handle. After processing, I restore the frame size using basic resizing techniques or more advanced upscaling methods. This step is crucial for ensuring compatibility across a wide range of hardware setups.
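
A simple version of that downscale step might look like this (OpenCV again; max_side is an arbitrary limit chosen for the example), with the returned scale used afterwards to resize or upscale back to the original dimensions:

import cv2

def fit_to_limit(frame, max_side=1280):
    h, w = frame.shape[:2]
    scale = min(1.0, max_side / max(h, w))
    if scale < 1.0:
        new_size = (int(w * scale), int(h * scale))
        frame = cv2.resize(frame, new_size, interpolation=cv2.INTER_AREA)
    return frame, scale  # scale lets you restore the original size later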

Tip 12: Are Models Always Synchronous?

This statement isn’t set in stone, as the world of AI is ever-evolving, but here’s my experience: I haven’t seen significant benefits from using asynchronous methods with models. The exceptions are data processing operations not directly related to the model and requests for downloading or validating model versions. Otherwise, models operate synchronously, and that's been sufficient for most scenarios I’ve encountered.

Tip 13: Library Version Compatibility

Managing library versions, especially for packages like torch, torchvision, torchaudio, and xformers, is critical. Here’s how to ensure everything works seamlessly:

Check Your CUDA Version

Run:

   nvcc -V

Then visit the PyTorch download page to check version compatibility. For instance, if your CUDA version is 11.8 (cu118), it can also run older torch releases built for cu118, and even a newer CUDA installation such as 12.6 can usually work with a torch build targeting cu118.

Align Library Versions

Typically, torch and torchaudio share the same version (e.g., 2.4.1), while torchvision may differ (e.g., 0.19.1). You can infer version compatibility, like torch 2.2.2 with torchvision 0.17.2. Understanding these dependencies is essential.
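
A quick way to confirm what is actually installed and which CUDA toolkit your torch build targets (a small sketch):

import torch
import torchaudio
import torchvision

print(torch.__version__, torchvision.__version__, torchaudio.__version__)
print(torch.version.cuda)          # the CUDA version this torch build was compiled against
print(torch.cuda.is_available())   # whether the build can actually see your GPU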

Additionally, you can download the .whl files from the official sources and install them manually or with pip. This matters for my project because it ships with an installer: on Windows, it fetches torch, torchaudio, and torchvision according to the user's selected options and shows the download status before unpacking them.

Check xformers Compatibility

Visit the xformers GitHub repo to ensure compatibility with your torch and CUDA versions. Support for older versions can be dropped, so staying updated is vital, especially if you're running CUDA 11.8 and want to leverage xformers for limited VRAM.

Optional: Flash-Attn Installation

Flash-attention can boost performance, and you can install it efficiently using:

   MAX_JOBS=4 pip install flash-attn

Adjust the number of jobs to suit your setup. Here’s how I use it:

try:
    from flash_attn import flash_attn_qkvpacked_func, flash_attn_func
    from flash_attn.bert_padding import pad_input, unpad_input, index_first_axis
    from flash_attn.flash_attn_interface import flash_attn_varlen_func
except ImportError:
    flash_attn_func = None
    flash_attn_qkvpacked_func = None
    flash_attn_varlen_func = None

Tip 14: Ensuring CUDA is Available for ONNX Runtime

To verify CUDA support in ONNX Runtime, run this code:

# Assumes `import onnxruntime` and `import torch`; self.device is either "cuda" or "cpu"
access_providers = onnxruntime.get_available_providers()
if "CUDAExecutionProvider" in access_providers:
    provider = ["CUDAExecutionProvider"] if torch.cuda.is_available() and self.device == "cuda" else ["CPUExecutionProvider"]
else:
    provider = ["CPUExecutionProvider"]
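
The chosen providers are then passed when the session is created (model_path here is a placeholder for whatever ONNX file you are loading):

session = onnxruntime.InferenceSession(model_path, providers=provider)
print(session.get_providers())  # confirm that CUDAExecutionProvider is actually active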

For CUDA 12.x, unlike version 11.8, you’ll need to install cuDNN 9.x on Linux (though this might not be necessary on Windows). Be cautious: sometimes onnxruntime-gpu installs without CUDA support. Once you ensure your torch version is CUDA-compatible, it's a good idea to reinstall onnxruntime-gpu:

pip install -U onnxruntime-gpu

Tip 15: Handling Library Version Conflicts

What if some models work only with older libraries, while others need the latest ones? I ran into this with GFPGAN, which required an old torchvision version, while video generation needed the newest torch libraries. Here’s how I solved it:

try:
    # Check if `torchvision.transforms.functional_tensor` and `rgb_to_grayscale` are missing
    from torchvision.transforms.functional_tensor import rgb_to_grayscale
except ImportError:
    # Import `rgb_to_grayscale` from `functional` if it’s missing in `functional_tensor`
    from torchvision.transforms.functional import rgb_to_grayscale
    import types
    import sys

    # Create a module for `torchvision.transforms.functional_tensor`
    functional_tensor = types.ModuleType("torchvision.transforms.functional_tensor")
    functional_tensor.rgb_to_grayscale = rgb_to_grayscale

    # Add this module to `sys.modules` so other imports can access it
    sys.modules["torchvision.transforms.functional_tensor"] = functional_tensor

This approach re-registers the old module path and points it at the implementation from the newer version, keeping different models and libraries compatible within a single environment.

Tip 16: Watch Out for Warnings

Always pay attention to warnings in your project. These often hint at breaking changes in future library versions. Proactively address these warnings by updating or adding parameters as needed, preventing inconsistencies when you eventually upgrade. Keeping your codebase in sync with evolving libraries is crucial for long-term stability.
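
One simple way to make such warnings impossible to miss during development is to escalate them to errors (a sketch; choose the warning categories that matter for your stack):

import warnings

# Turn deprecation-style warnings into exceptions while developing,
# so upcoming breaking changes in libraries surface immediately.
warnings.filterwarnings("error", category=DeprecationWarning)
warnings.filterwarnings("error", category=FutureWarning)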


Tip 17: GPU Management in a Cluster

When working with a cluster of multiple machines, remember that you can't pool VRAM across separate nodes. However, if your GPUs are on the same local network, libraries like Ray allow centralized GPU management from a single controller. Within a single machine with multiple GPUs, the layer-wise techniques from Tip 8 apply, but even there the VRAM of the individual cards doesn't merge into one large pool.
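
A minimal sketch of scheduling GPU work through Ray, assuming a cluster has already been started with the ray CLI (the task body and the tasks list are placeholders):

import ray

ray.init(address="auto")  # connect to the already-running cluster

@ray.remote(num_gpus=1)   # Ray places each call on a node with a free GPU
def process(task):
    # load the model and handle the task on whichever node Ray picked
    ...

futures = [process.remote(task) for task in tasks]
results = ray.get(futures)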

Tip 18: Model Compilation with torch.jit

Using torch.jit to compile models can greatly speed up execution. Try torch.jit.trace() or torch.jit.script() to convert your model into an optimized format, ideal for repeated calls:

import torch
import torchvision

# Tracing a model example (resnet18 is just a stand-in for your own model)
model = torchvision.models.resnet18().eval()
example_input = torch.randn(1, 3, 224, 224)  # an input sample with the shape your model expects
traced_model = torch.jit.trace(model, example_input)

# Use traced_model for faster execution
output = traced_model(example_input)

This method shines when the same model is used repeatedly across various tasks.

Tip 19: Profiling for Performance Optimization

Tools like torch.profiler are invaluable for pinpointing bottlenecks in your model's performance. By profiling, you can see which operations consume the most time or memory and adjust your code for efficiency:

import torch
from torch.profiler import profile, record_function

with profile(profile_memory=True) as prof:
    with record_function("model_inference"):
        output = model(input_data)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This helps allocate resources better and focus on optimizing the right sections of your code.


A Heartfelt Conclusion

And there we have it: 19 tips to supercharge your neural network projects! But I believe there's room for one more: your Tip 20. Drop your favorite optimization or development trick in the comments to complete this list together!

I have a dream: to see 4,096 stars on my GitHub project. Your support fuels my passion, drives me to improve code, develop new techniques, and share my experiences. If my work has been helpful, please star the project. Your encouragement means the world and inspires me to keep creating. And don’t forget to share your neural network projects on GitHub in the comments! 🖐
