Tarana Murtuzova for API4AI

How to Detect NSFW Content in PDFs with Python and the NSFW API

Introduction

In the current digital landscape, the proliferation of NSFW (Not Safe For Work) material in digital documents is a growing issue. Research indicates that around 20% of documents circulated in professional settings contain some form of inappropriate content, including NSFW images. This highlights the pressing need for effective methods to detect and filter such content to maintain a secure and productive work environment.

Identifying NSFW images in PDFs is essential for multiple reasons. For companies, it aids in avoiding legal troubles, upholding workplace professionalism, and safeguarding the company's reputation. Educational institutions must ensure that the content provided to students is suitable and devoid of harmful material. Digital content platforms, which manage a vast array of user-generated content, need to monitor for NSFW images rigorously to maintain community standards and ensure a safe user experience. Failure to detect and control NSFW content can result in severe consequences, such as legal risks, reputation damage, and loss of user trust.

In this blog post, we will delve into using AI technology to effectively detect NSFW images in PDFs. We will specifically focus on the NSFW API from API4AI, an efficient tool that streamlines the identification of inappropriate content. We will offer a step-by-step guide on creating a Python script that sends a PDF to the NSFW API and evaluates the NSFW probability returned for each page. By the end of this post, you will have a thorough understanding of how to implement this solution in your workflows, ensuring that your digital documents remain safe and compliant.

Understanding NSFW Content

Definition of NSFW: What Constitutes NSFW Material?

NSFW (Not Safe For Work) content includes any material unsuitable for viewing in professional or public environments. This encompasses explicit content such as pornography, graphic violence, and other obscene images that may be offensive or disturbing. The term NSFW is widely used to label content that should be confined to private viewing to prevent discomfort, offense, or disruption in professional settings.

Examples of NSFW Content

NSFW content can appear in various forms within digital documents. Common types of NSFW images found in PDFs include:

  • Pornographic Images: Visual representations of explicit sexual acts or nudity.
  • Graphic Violence: Images depicting severe violence, blood, and gore.
  • Obscene Material: Depictions of lewd acts, gestures, or other forms of explicit content.
  • Hate Symbols: Images containing hate speech, symbols, or gestures that promote discrimination or hostility.

These forms of content are not only inappropriate but can also create a hostile environment. Therefore, it is crucial to detect and eliminate them from professional and educational materials.

Risks and Consequences

Failing to identify NSFW content in PDFs can lead to several significant risks and adverse outcomes:

  • Legal Issues: Organizations may face legal consequences if they distribute or allow NSFW content in their documents. This can include lawsuits for disseminating obscene material or creating a hostile work environment.
  • Workplace Harassment: NSFW content can foster an environment of workplace harassment, making employees feel uncomfortable or unsafe. This can lead to complaints, increased employee turnover, and reduced productivity.
  • Reputation Damage: The presence of NSFW content in publicly accessible documents can severely harm an organization's reputation. Clients, partners, and the public may lose trust in an organization that fails to manage inappropriate content properly.
  • Compliance Violations: Various industries have stringent regulations regarding content distribution. Organizations must adhere to these regulations to avoid penalties and maintain their good standing within the industry.

By understanding what NSFW content entails and recognizing the associated risks and consequences, organizations can take proactive measures to ensure their digital documents are safe and suitable for all audiences. In the following sections, we will explore how to utilize AI technology, specifically the NSFW API, to effectively detect and manage such content in PDFs.

The Challenge of Detecting NSFW Content in PDFs

Technical Difficulties

Identifying NSFW images in PDFs involves several technical obstacles that make the process quite complex:

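  • Varying Image Encodings: Images inside a PDF are stored as streams with different compression filters (for example, JPEG data via DCTDecode or deflate-compressed raster data via FlateDecode), and pages can also contain vector graphics. A detection system must be adaptable enough to manage these different encodings and compression techniques.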
  • Embedded Images: Images within PDFs can be embedded in multiple ways. They might be directly embedded in the document's content stream or referenced from external sources, making image extraction from a PDF a challenging task.
  • Encryption: Many PDFs are encrypted or password-protected to safeguard their content. This encryption must be appropriately handled to access the images within the document.
  • Complex Layouts: PDFs can feature intricate layouts with overlapping text and images, complicating the accurate extraction and identification of images without capturing unwanted elements.
  • Scalability: Processing large volumes of PDFs quickly and efficiently demands significant computational resources, which can be a constraint for many organizations.
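Some of these obstacles can be spotted before a document is sent anywhere. The following is a minimal sketch, using only the standard library, of a pre-flight check that looks for the `%PDF-` header and for an `/Encrypt` entry, which encrypted PDFs carry in their trailer. The function name `preflight` is hypothetical, and the byte scan is a heuristic rather than a real PDF parser, so it can misfire on unusual files.

```python
from pathlib import Path


def preflight(pdf_path):
    """Return (looks_like_pdf, maybe_encrypted) from a raw byte scan."""
    data = Path(pdf_path).read_bytes()
    # A PDF normally starts with a "%PDF-" header near the top of the file.
    looks_like_pdf = b'%PDF-' in data[:1024]
    # Encrypted PDFs reference an /Encrypt dictionary in the trailer.
    # This substring check is only a heuristic, not a parser.
    maybe_encrypted = b'/Encrypt' in data
    return looks_like_pdf, maybe_encrypted
```

A file flagged as possibly encrypted could be routed to a decryption step or to manual review first. How any particular detection service treats encrypted input is not covered here.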

Manual vs. Automated Detection

Comparing manual methods with automated AI-based solutions highlights the benefits of using technology:

  • Manual Detection: Involves individuals examining each page of a PDF to identify NSFW content. This approach is time-consuming, labor-intensive, and susceptible to human error. It is impractical for managing large volumes of documents and can result in inconsistent outcomes.
  • Automated AI-based Detection: Employs machine learning algorithms to analyze images within PDFs. Automated systems, such as those utilizing the NSFW API, can swiftly and accurately process a large number of documents. These systems can be trained to recognize various types of NSFW content with high accuracy, thereby reducing the workload on human reviewers and enhancing overall efficiency.

Importance of Accuracy

Precision is crucial when detecting NSFW content to prevent the issues associated with false positives and negatives:

  • False Positives: Incorrectly flagging safe content as NSFW can result in unnecessary censorship, loss of valuable information, and frustration for users who must manually review and rectify these mistakes.
  • False Negatives: Missing actual NSFW content carries significant risks, including the spread of inappropriate material, potential legal consequences, and harm to an organization's reputation.
  • Balancing Sensitivity and Specificity: Achieving high accuracy requires balancing sensitivity (detecting all NSFW content) and specificity (correctly identifying non-NSFW content). Automated systems can be fine-tuned to optimize this balance, ensuring reliable and robust detection.
  • Continuous Improvement: AI models can continuously learn and improve from new data, enhancing their detection capabilities over time. This adaptability is vital for maintaining high accuracy as new forms of NSFW content emerge.
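The trade-off between sensitivity and specificity can be made concrete with a small calculation. The sketch below assumes you have a set of manually reviewed pages with known labels. The scores and labels are invented for illustration and are not real API output, and `sensitivity_specificity` is a hypothetical helper name.

```python
def sensitivity_specificity(scores, labels, threshold=0.5):
    """Return (sensitivity, specificity) for NSFW scores at a threshold.

    scores: NSFW probabilities in [0, 1].
    labels: True where the page really is NSFW.
    """
    tp = fn = tn = fp = 0
    for score, is_nsfw in zip(scores, labels):
        flagged = score > threshold
        if is_nsfw and flagged:
            tp += 1
        elif is_nsfw:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical scores for six reviewed pages (not real API output).
scores = [0.95, 0.70, 0.40, 0.30, 0.10, 0.60]
labels = [True, True, True, False, False, False]

for threshold in (0.3, 0.5, 0.8):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f'threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}')
```

On this toy data, lowering the threshold to 0.3 catches every NSFW page but still flags one safe page, while raising it to 0.8 flags no safe pages but misses two NSFW ones.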

By understanding and addressing these challenges, organizations can implement effective strategies for detecting NSFW images in PDFs. Leveraging automated AI solutions like the NSFW API ensures digital content remains appropriate and compliant.

AI-Powered Solutions for Detecting NSFW Images

Overview of AI and Machine Learning

Artificial Intelligence (AI) and Machine Learning (ML) have transformed image detection methodologies. AI encompasses the development of algorithms capable of performing tasks that typically require human intelligence. ML, a branch of AI, allows these algorithms to learn from data and enhance their performance over time. For image detection, AI and ML utilize techniques such as convolutional neural networks (CNNs) to analyze visual data and identify patterns.

These advanced technologies can be trained on extensive datasets of images to recognize various types of content, including NSFW material. By processing and learning from labeled examples, AI models can accurately classify new, unseen images based on the features they have learned.

Introduction to the NSFW API

The NSFW API, offered by API4AI, is a specialized tool designed to streamline the detection of NSFW images. This API utilizes advanced AI models specifically trained to identify inappropriate content in images. By incorporating this API into your applications, you can automate the process of scanning and classifying images within PDFs, ensuring that any NSFW content is detected and flagged accordingly.

The NSFW API simplifies the implementation process, eliminating the need for organizations to develop and train their own AI models. It provides a ready-to-use solution that can be seamlessly integrated into various applications and workflows, enabling businesses to concentrate on their core activities while ensuring a safe and compliant digital environment.

Features of NSFW API

The NSFW API offers several important features and advantages that make it a powerful tool for detecting NSFW content:

  • High Accuracy: Powered by advanced AI models trained on diverse datasets, the API ensures high precision in identifying various types of NSFW content, reducing the occurrence of false positives and negatives.
  • Ease of Integration: Designed for seamless integration, the API provides clear documentation and simple endpoints, enabling developers to quickly add NSFW detection to their applications using standard HTTP requests.
  • Scalability: Capable of handling large volumes of images, the NSFW API is suitable for businesses and platforms of all sizes. Whether processing a few documents or thousands, the API scales efficiently without sacrificing performance.
  • Real-Time Processing: Offering real-time processing capabilities, the API allows for immediate detection and response. This is essential for applications that need quick identification of NSFW content, such as content moderation platforms.
  • Versatility: The NSFW API can be applied in various scenarios, from scanning images within PDFs to monitoring user-uploaded content on social media platforms. Its flexibility makes it a valuable asset for any organization dealing with digital images.
  • Continuous Updates: The team at API4AI continually updates and enhances the models behind the NSFW API, ensuring it remains effective against new types of NSFW content and adapts to changing standards and requirements.

By utilizing the NSFW API, organizations can leverage AI to efficiently and accurately detect NSFW images in their digital documents. This not only helps maintain a professional and safe environment but also reduces the burden on human reviewers, allowing them to focus on more complex tasks that require human judgment.

Developing a Python Script for NSFW Detection

In this section, we will guide you through the steps to develop a Python script for detecting NSFW images in PDFs using the NSFW API. We'll cover setting up your environment, parsing command-line arguments, sending PDFs to the NSFW API, and testing the finalized script.

Setting Up the NSFW API

To use the NSFW API, you'll need to create an account on Rapid API Hub, where API4AI's solutions are hosted, and obtain an API key for authenticating your requests.

  1. Create an Account: Visit the Rapid API Hub website and sign up for an account.
  2. Find the NSFW API: After logging in, search for the NSFW API on the platform or go directly to https://rapidapi.com/api4ai-api4ai-default/api/nsfw3/pricing.
  3. Subscribe to the API: Navigate to the subscription plans on the NSFW API page and subscribe to the appropriate plan.
  4. Obtain Your API Key:
       • After subscribing, go to the Dashboard.
       • On the left side of the screen, look for an entry like "default-application_xxxxxx."
       • Click on it, then go to the Authorization section and copy your Application Key.

[Figure: Rapid API Hub: Obtaining API Key]

Prerequisites

Before starting with the code, ensure you have the following prerequisites:

  • Python: Confirm that Python is installed on your system. You can download it from the official Python website.
  • Required Libraries: Install the requests library with pip (pip install requests). The other modules the script uses (argparse, sys, and pathlib) ship with Python.

Parsing Command-line Arguments

To enhance the script's flexibility, we will use the argparse library to manage command-line arguments. This enables users to specify the PDF file path and the API key directly from the command line, improving the script’s usability.

First, create a function to parse these arguments. This function will initialize an argument parser, define the arguments, and return the parsed arguments. By setting up command-line arguments, the script becomes adaptable to different input files and configurations without needing code modifications.



import argparse
from pathlib import Path


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser()
    # Get your key at https://rapidapi.com/api4ai-api4ai-default/api/nsfw3/pricing
    parser.add_argument('--api-key', help='Rapid API key.', required=True)
    parser.add_argument('pdf', type=Path,
                        help='Path to a pdf.')
    return parser.parse_args()
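
Passing the key on the command line can leave it in shell history. As one possible variation, the sketch below falls back to an environment variable when `--api-key` is omitted. The variable name `RAPIDAPI_KEY` and the optional `argv` parameter are assumptions made for this example, not part of the original script.

```python
import argparse
import os
from pathlib import Path


def parse_args(argv=None):
    """Parse arguments, falling back to an environment variable for the key."""
    # RAPIDAPI_KEY is a hypothetical variable name chosen for this sketch.
    env_key = os.environ.get('RAPIDAPI_KEY')
    parser = argparse.ArgumentParser()
    # The flag is only mandatory when the environment variable is unset.
    parser.add_argument('--api-key', default=env_key,
                        required=not env_key,
                        help='Rapid API key (default: $RAPIDAPI_KEY).')
    parser.add_argument('pdf', type=Path, help='Path to a pdf.')
    return parser.parse_args(argv)
```

With the variable exported, the script could then be run with only the PDF path, and an explicit `--api-key` would still take precedence.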



Obtaining NSFW Probabilities for Each PDF Page

With the command-line arguments set up, the next step is to create a function that sends the PDF to the NSFW API and collects the result for each page.

This step is straightforward because the NSFW API natively supports PDFs. We can directly send the PDF for processing and then parse the results, which will include the NSFW probability for each page.




def get_nsfw_probs(pdf_path: Path, api_key: str) -> list:
    """
    Get probabilities of NSFW content in PDF using NSFW API.
    Returns a list with one dict of class probabilities per PDF page.
    """
    # We strongly recommend you use exponential backoff.
    error_statuses = (408, 409, 429, 500, 502, 503, 504)
    s = requests.Session()
    # Retry skips POST by default, so allow it explicitly (urllib3 >= 1.26).
    # raise_on_status=False returns the last response once retries run out.
    retries = Retry(backoff_factor=1.5, status_forcelist=error_statuses,
                    allowed_methods=('POST',), raise_on_status=False)

    s.mount('https://', HTTPAdapter(max_retries=retries))

    # The PDF is uploaded as a multipart file under the 'image' field.
    with pdf_path.open('rb') as f:
        api_res = s.post(API_URL, files={'image': f},
                         headers={'X-RapidAPI-Key': api_key}, timeout=20)

    # Check the HTTP status before parsing: an error body may not be JSON.
    if api_res.status_code != 200:
        print(f'Request failed with HTTP status {api_res.status_code}.')
        sys.exit(1)
    results = api_res.json()['results']

    # Handle processing failure on any page.
    if any(result['status']['code'] == 'failure' for result in results):
        print('Image processing failed.')
        sys.exit(1)

    # Each page is a different result.
    return [result['entities'][0]['classes'] for result in results]
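
To make the response handling easier to follow, here is a small offline sketch that parses a mock payload. The payload shape is inferred only from the fields the function above reads (`results`, `status.code`, `entities[0].classes`, and the `nsfw` class). The `'ok'` status value, the `sfw` key, and the helper name `extract_page_probs` are assumptions for illustration.

```python
def extract_page_probs(payload: dict) -> list:
    """Return one NSFW probability per page, or None for pages that failed."""
    probs = []
    for result in payload.get('results', []):
        if result.get('status', {}).get('code') == 'failure':
            # Keep a placeholder so list positions still match page order.
            probs.append(None)
            continue
        classes = result['entities'][0]['classes']
        probs.append(classes['nsfw'])
    return probs


# Hypothetical two-page response (not captured from the real API).
sample = {
    'results': [
        {'status': {'code': 'ok'},
         'entities': [{'classes': {'nsfw': 0.02, 'sfw': 0.98}}]},
        {'status': {'code': 'failure'}, 'entities': []},
    ]
}
print(extract_page_probs(sample))  # [0.02, None]
```

Unlike the function above, this variant records a failed page as `None` instead of exiting, which may suit pipelines that prefer partial results.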



Main Function

The next step is to evaluate the results. The NSFW API provides a probability for each page, indicating the likelihood of NSFW content. Depending on your business requirements, you can adjust the threshold for classifying a page as NSFW: a lower threshold is stricter and flags more pages at the cost of more false positives, while a higher threshold is more lenient and risks missing some NSFW pages. In our test script, we will use a threshold value of 0.5.

The value 0.5 is a baseline for further fine-tuning. In the script, the whole PDF is reported as NSFW if any single page exceeds the threshold.



def main():
    """
    Script entry function.
    """
    args = parse_args()
    probs = get_nsfw_probs(args.pdf, args.api_key)
    # Flag the PDF if any page exceeds the 0.5 threshold.
    if any(prob['nsfw'] > 0.5 for prob in probs):
        print('PDF contains NSFW content.')
    else:
        print('PDF is safe for work.')


if __name__ == '__main__':
    main()
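
If reviewers need to know which pages triggered the verdict, the threshold can be applied page by page. The following is a minimal sketch. `flagged_pages` is a hypothetical helper, and the probabilities are made-up values rather than API output.

```python
def flagged_pages(nsfw_probs, threshold=0.5):
    """Return 1-based page numbers whose NSFW probability exceeds threshold."""
    return [page for page, prob in enumerate(nsfw_probs, start=1)
            if prob > threshold]


# Hypothetical per-page probabilities for a four-page PDF.
page_probs = [0.01, 0.73, 0.08, 0.55]
print(flagged_pages(page_probs))       # [2, 4]
print(flagged_pages(page_probs, 0.7))  # [2]
print(flagged_pages(page_probs, 0.9))  # []
```

In the script, the input could be built from the per-page results with `[prob['nsfw'] for prob in probs]`.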



Complete Python Script

Combining all the components gives a complete Python script for detecting NSFW images in PDFs. The script is deliberately simple and concise, which makes it easy to customize and to integrate into existing business pipelines.

Whether you need to adjust thresholds, modify processing steps, or add new features, the script provides a foundation that can be adapted to your specific use case.



```python
"""
Check NSFW content in PDF using NSFW API.

Run script:
`python3 main.py --api-key <RAPID_API_KEY> <PATH_TO_PDF>`
"""

import argparse
import sys
from pathlib import Path

import requests
from requests.adapters import Retry, HTTPAdapter


API_URL = 'https://nsfw3.p.rapidapi.com/v1/results'


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser()
    # Get your key at https://rapidapi.com/api4ai-api4ai-default/api/nsfw3/pricing
    parser.add_argument('--api-key', help='Rapid API key.', required=True)
    parser.add_argument('pdf', type=Path,
                        help='Path to a pdf.')
    return parser.parse_args()


def get_nsfw_probs(pdf_path: Path, api_key: str) -> list:
    """
    Get probabilities of NSFW content in PDF using NSFW API.
    Returns a list with one dict of class probabilities per PDF page.
    """
    # We strongly recommend you use exponential backoff.
    error_statuses = (408, 409, 429, 500, 502, 503, 504)
    s = requests.Session()
    # Retry skips POST by default, so allow it explicitly (urllib3 >= 1.26).
    # raise_on_status=False returns the last response once retries run out.
    retries = Retry(backoff_factor=1.5, status_forcelist=error_statuses,
                    allowed_methods=('POST',), raise_on_status=False)

    s.mount('https://', HTTPAdapter(max_retries=retries))

    # The PDF is uploaded as a multipart file under the 'image' field.
    with pdf_path.open('rb') as f:
        api_res = s.post(API_URL, files={'image': f},
                         headers={'X-RapidAPI-Key': api_key}, timeout=20)

    # Check the HTTP status before parsing: an error body may not be JSON.
    if api_res.status_code != 200:
        print(f'Request failed with HTTP status {api_res.status_code}.')
        sys.exit(1)
    results = api_res.json()['results']

    # Handle processing failure on any page.
    if any(result['status']['code'] == 'failure' for result in results):
        print('Image processing failed.')
        sys.exit(1)

    # Each page is a different result.
    return [result['entities'][0]['classes'] for result in results]


def main():
    """
    Script entry function.
    """
    args = parse_args()
    probs = get_nsfw_probs(args.pdf, args.api_key)
    # Flag the PDF if any page exceeds the 0.5 threshold.
    if any(prob['nsfw'] > 0.5 for prob in probs):
        print('PDF contains NSFW content.')
    else:
        print('PDF is safe for work.')


if __name__ == '__main__':
    main()
```

**Testing the Script**

To test the script, follow these steps:

- **Prepare a PDF**: Obtain a sample PDF file with embedded images for testing. This will allow you to verify the script's functionality in a realistic scenario. You can [download a sample PDF](https://storage.googleapis.com/api4ai-static/rapidapi/tutorial_nsfw_pdf/nsfw.pdf) to get started.
- **Run the Script**: Execute the script from your terminal, supplying the path to the PDF file and your API key as arguments.

```bash
python3 main.py --api-key YOUR_API_KEY ./nsfw.pdf
PDF contains NSFW content.
```
By following these steps, you can efficiently create and test a Python script to detect NSFW images in PDFs using the NSFW API. This automated method not only saves time but also ensures greater accuracy and consistency in identifying inappropriate content.
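
For more than a handful of documents, the same check can be applied to a whole folder. The sketch below is one possible approach using a thread pool from the standard library. `scan_folder` and its `check_pdf` parameter are hypothetical names: `check_pdf` stands for any callable that takes a PDF path and returns a result, such as a wrapper around `get_nsfw_probs`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def scan_folder(folder, check_pdf, max_workers=4):
    """Apply check_pdf to every *.pdf in folder; return {filename: result}."""
    pdfs = sorted(Path(folder).glob('*.pdf'))
    # Threads suit this workload because each check mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check_pdf, pdfs))
    return {pdf.name: result for pdf, result in zip(pdfs, results)}
```

A call such as `scan_folder('./incoming', lambda pdf: get_nsfw_probs(pdf, api_key))` would return the per-page results keyed by file name. Keep `max_workers` modest so that parallel requests stay within the rate limits of your subscription plan.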

## Conclusion

In this blog post, we examined the vital importance of detecting NSFW images in PDFs and demonstrated how to implement an effective solution using Python and the NSFW API. We began by defining NSFW content and providing examples, highlighting the risks and consequences of failing to identify such material. We then addressed the technical challenges of NSFW detection, compared manual and automated methods, and emphasized the need for accuracy in these processes.

We offered a detailed guide to developing a Python script for NSFW detection, including setting up the NSFW API, parsing command-line arguments, sending PDFs to the API and evaluating the per-page results, and testing the script. By following these steps, you can automate the detection of inappropriate content, ensuring a safer and more professional environment.

Leveraging the power of AI and the NSFW API can significantly enhance your content moderation processes, reduce the workload on human reviewers, and ensure that your digital documents remain appropriate and safe for all audiences. We encourage you to incorporate these techniques into your workflows and stay updated on the latest developments in AI-powered content detection.

For additional resources, tutorials, and tools, explore the documentation and support provided by API4AI and other AI technology providers. With ongoing advancements in AI and machine learning, content detection systems will continue to improve, offering even more robust solutions for maintaining safe digital environments.

[More stories about Image Processing Solutions](https://api4.ai/blog)