Rhythm Saha

Posted on Aug 5

My Favorite Failure of the Month (And What I Learned)

#webdev #failureislearning #serverless #nextjs

Hey everyone! As a fullstack web developer and the founder of NovexiQ, my days are usually packed building cool, modern web apps for clients. I'm really deep into the MERN stack, Next.js, and all those awesome tools. It's a super rewarding journey, for sure! But let's be real, right? It's not always smooth sailing and perfect deployments. Sometimes, things just break. And sometimes, they totally crash and burn in a really spectacular way, usually when you least expect it. This past month? Yeah, I had one of those 'spectacular breaks.' And honestly? It's probably my favorite failure yet, all because of what it taught me.

You know how we often see those highlight reels – all the successful projects, sleek UIs, amazing performance? Well, what we don't always see are the countless hours of debugging, refactoring, and yeah, even the outright failures that go into every 'flawless' launch. For me, learning to embrace these failures, really dig into them, and pull out valuable lessons? That's probably the most crucial skill any developer, especially a solo founder like me, can cultivate. So, come on, let's chat about a recent incident that seriously tested my limits, taught me a ton, and ultimately made NovexiQ even stronger.

The Scenario: A Flash Sale Gone Sideways

So, here's the deal: I had this super cool fashion boutique client, right here in Kolkata, who wanted to launch their very first flash sale on their brand-new e-commerce platform. I'd built that platform from scratch, obviously using Next.js 14 (App Router), TypeScript, and Tailwind CSS for the frontend. On the backend, it was a Node.js API with Prisma ORM, hooked up to a PostgreSQL database on Neon. And for deployment? Vercel handled everything seamlessly, using its serverless functions for all our API routes. This flash sale was a huge deal for them, a really big moment! It meant a limited stock of highly anticipated items would drop at a specific time, creating that classic high-traffic, burst-load situation we all know. My team (which, for now, is mostly just me, haha!) had meticulously tested the order placement flow under normal conditions. We'd even done some basic load testing using ab (ApacheBench) on a few endpoints. Everything seemed totally fine, you know? Famous last words, right?

The Moment of Truth (and Collapse)

The sale kicked off exactly at 8 PM. At first, things were looking pretty smooth. A few orders trickled in, no problem at all. But then, boom! That first massive wave of eager shoppers hit the 'Buy Now' button all at once, and the whole system just completely buckled. Instead of getting order confirmations, users were seeing generic error messages or just endless loading spinners. My client's WhatsApp messages? They started flooding in – panicked, frustrated, and totally understandably so. Sales had pretty much ground to a halt. My heart just sank, you know? I immediately dove into the Vercel logs. The issue was crystal clear, painfully so: the API endpoint for creating orders was timing out over and over again. We're talking not just once in a while, but almost every single request under that heavy load. Vercel's serverless functions were spinning up, but those requests simply weren't finishing within the allowed time.

Initial Debugging Whirlwind

My mind was absolutely racing. Was it a database issue? Was my Prisma query just too slow? Or was Vercel’s cold start problem acting up again? I quickly jumped into action, going through my mental checklist:

First, I checked Neon's database metrics. Connections *were* spiking, sure, but they weren't totally maxing out just yet. CPU and memory seemed fine.
Next up, I reviewed the API endpoint code itself. The Prisma query was pretty standard – just a transaction to create an order, update stock, and log the sale. Nothing inherently complex there, really.
Then, I dug into the Vercel function logs. And boom! The timeout errors were incredibly consistent. Some requests were still running at the 10-second mark, which is the default timeout for free-tier Vercel functions, so they were getting cut off before they could finish. Not good!

// Simplified API endpoint for order creation
import { NextRequest, NextResponse } from 'next/server';
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

export async function POST(req: NextRequest) {
  try {
    const { items, userId, shippingAddress } = await req.json();

    const result = await prisma.$transaction(async (tx) => {
      // 1. Create the order
      const order = await tx.order.create({
        data: {
          userId,
          shippingAddress,
          status: 'PENDING',
          totalAmount: 0, // Will update later
        },
      });

      let totalAmount = 0;
      const orderItemsData = [];

      // 2. Process each item, check stock, and create order items
      for (const item of items) {
        const product = await tx.product.findUnique({
          where: { id: item.productId },
        });

        if (!product || product.stock < item.quantity) {
          throw new Error(`Insufficient stock for product ${item.productId}`);
        }

        await tx.product.update({
          where: { id: item.productId },
          data: { stock: { decrement: item.quantity } },
        });

        orderItemsData.push({
          orderId: order.id,
          productId: item.productId,
          quantity: item.quantity,
          price: product.price,
        });
        totalAmount += product.price * item.quantity;
      }

      await tx.orderItem.createMany({ data: orderItemsData });

      // 3. Update the total amount and return the updated order
      //    (returning the original `order` would still show totalAmount: 0)
      return tx.order.update({
        where: { id: order.id },
        data: { totalAmount },
      });
    });

    return NextResponse.json({ order: result }, { status: 201 });
  } catch (error: any) {
    console.error('Order creation failed:', error.message);
    return NextResponse.json({ error: error.message }, { status: 500 });
  }
}

The Root Cause: Prisma, Serverless, and Connection Pooling

After about an hour of frantic digging, the real culprit finally became crystal clear: it was Prisma’s connection management struggling big time in a serverless environment under high concurrency.

Here’s the full breakdown of what was actually going on:

Cold Starts & Connection Sprawl: So, when a Next.js serverless function on Vercel gets a request after being idle, it hits a 'cold start,' right? That means the function basically has to spin up from scratch. And every single time it did, my code was busy creating a new PrismaClient(). Now, each PrismaClient comes with its own internal connection pool, so with function instances constantly spinning up and tearing down, this can quickly lead to a ton of open, unused database connections, especially with PostgreSQL. It's kinda like leaving dozens of doors open when you only need a few – super inefficient!
Database Connection Limits: My Neon database instance, since it was on a cost-effective tier (trying to save a buck, you know?), had a pretty low maximum connection limit. When that sudden burst of requests hit, dozens of serverless functions were all trying to establish new connections at the same time. This quickly just ate up the database’s connection limit.
Connection Queueing & Timeouts: Once that connection limit was hit, new connection requests just started queuing up. My serverless functions, waiting for a database connection that either never showed up or took way too long, would eventually hit their execution timeout and fail. That's exactly what caused all those frustrating errors our customers were seeing.

So, it wasn't really that the queries themselves were slow; it was that the connections were either slow or completely unavailable because of that specific serverless burst-traffic pattern. My PrismaClient instance just wasn't optimized for this kind of unique environment.
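To make the arithmetic behind that breakdown concrete, here's a rough back-of-the-envelope sketch. The instance count, CPU count, and database limit are made-up numbers for illustration, not my actual Neon figures. The pool-size formula is Prisma's documented default (physical CPUs × 2 + 1) when no connection_limit is set.

```typescript
// Rough estimate of how many database connections a traffic burst can demand.
// All concrete numbers below are hypothetical, for illustration only.

// Prisma's documented default pool size when connection_limit is not set.
function defaultPoolSize(physicalCpus: number): number {
  return physicalCpus * 2 + 1;
}

// Worst case: every concurrent function instance fills its own pool.
function worstCaseConnections(instances: number, poolSize: number): number {
  return instances * poolSize;
}

const poolSize = defaultPoolSize(2);               // pool per function instance
const demand = worstCaseConnections(40, poolSize); // 40 concurrent instances (assumed)
const databaseLimit = 100;                         // hypothetical plan limit

console.log(`worst-case demand: ${demand}, limit: ${databaseLimit}`);
console.log(demand > databaseLimit ? 'over the limit' : 'within the limit');
```

Even with modest per-instance pools, the total scales with the number of concurrent instances, which is exactly what a flash sale multiplies.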

The Hard-Earned Lessons (and Fixes)

This 'favorite failure' really forced me to deep-dive into serverless architecture best practices, especially when it comes to database interactions. And trust me, these were some hard-earned lessons! Here’s what I learned, and what I immediately put into action:

1. Singleton PrismaClient Instance

When you're working with Next.js API routes deployed on Vercel, the go-to approach, which I totally missed initially, is to make sure you're using a single PrismaClient instance across all requests handled by a given serverless function instance. That way a warm instance keeps reusing one client and one connection pool instead of opening new ones, and in development it stops hot reloading from piling up extra clients. The common pattern involves using a global variable to store that client – super important for performance!

// utils/prisma.ts
import { PrismaClient } from '@prisma/client';

declare global {
  var prisma: PrismaClient | undefined;
}

let prisma: PrismaClient;

if (process.env.NODE_ENV === 'production') {
  prisma = new PrismaClient();
} else {
  if (!global.prisma) {
    global.prisma = new PrismaClient();
  }
  prisma = global.prisma;
}

export default prisma;

Then, in my API routes:

// pages/api/order.ts (or app/api/order/route.ts)
import prisma from '@/utils/prisma';
// ... rest of the code using 'prisma' instance

2. External Connection Pooling (e.g., PgBouncer via Neon/Supabase)

Even with that singleton pattern, you know, serverless functions spinning up concurrently can still totally overwhelm a database. That's where external connection pooling comes in. But here's the cool part: services like Neon actually offer built-in connection pooling (often using PgBouncer). It's *super* important to understand how to use these for serverless setups. For Neon, it means making sure your connection string points to their pooled endpoint. This basically abstracts away all the tricky connection management, letting the database efficiently reuse connections instead of opening brand-new ones for every single function invocation. It's a game-changer! I also took a closer look at my Neon plan. I ended up upgrading it slightly to get more connections and better burst capacity during peak times. I totally realized that my initial cost-saving approach was actually a false economy for such a critical e-commerce app. Live and learn, right?
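Here's a minimal sketch of what "pointing at the pooled endpoint" can look like on the Prisma side. The helper name toServerlessUrl, the host, the credentials, and the database name are all placeholders I made up; copy the real pooled connection string from your Neon dashboard. The connection_limit and pgbouncer query parameters are Prisma connection-string options, and pgbouncer=true may not be needed on newer Prisma and PgBouncer versions.

```typescript
// Adds Prisma's pooling-related query parameters to a Postgres connection URL.
// The URL used below is a placeholder, not a real Neon endpoint.
function toServerlessUrl(baseUrl: string, connectionLimit: number): string {
  const url = new URL(baseUrl);
  // Cap the pool that each function instance opens.
  url.searchParams.set('connection_limit', String(connectionLimit));
  // Tell Prisma it is talking to PgBouncer (mainly for older Prisma versions).
  url.searchParams.set('pgbouncer', 'true');
  return url.toString();
}

const pooledUrl = toServerlessUrl(
  'postgresql://user:password@example-pooler.example.com/shopdb?sslmode=require',
  1, // assumed value: one connection per function instance
);
console.log(pooledUrl);
```

The resulting string would typically go into the DATABASE_URL environment variable that the Prisma datasource reads. A small per-instance limit leans on the external pooler to do the sharing, which is the whole point of using it.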

3. Rigorous Load Testing for Critical Paths

My simple ab tests? Yeah, they were totally not enough! For high-stakes events like flash sales, really thorough load testing is just non-negotiable. Period. Now, I make sure to incorporate more robust tools like k6 or Artillery into my pre-launch checklist for any client project that’s expecting a lot of traffic. These tools can simulate thousands of concurrent users, which is awesome for spotting bottlenecks *before* they ever hit production. Trust me, it's worth every bit of effort!

// Example k6 script snippet for order creation
import http from 'k6/http';
import { check } from 'k6';

export default function () {
  const url = 'https://your-app.vercel.app/api/order';
  const payload = JSON.stringify({
    items: [{ productId: 'prod_xyz', quantity: 1 }],
    userId: 'user_abc',
    shippingAddress: '123 Main St',
  });

  const params = {
    headers: { 'Content-Type': 'application/json' },
  };

  const res = http.post(url, payload, params);

  check(res, {
    'status is 201': (r) => r.status === 201,
    // Only parse the body on success, so an error response can't throw here
    'response has order ID': (r) => r.status === 201 && r.json('order.id') !== undefined,
  });
}
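On its own, the script above runs with k6's defaults, which means a single virtual user. To actually simulate a burst, the script also needs a load profile. Here's a sketch of one using k6's stages and thresholds options; the durations, user counts, and threshold values are illustrative assumptions, not the numbers from my real test plan. It's written as TypeScript with a local Stage type so it stands alone; in a plain JavaScript k6 script, drop the type annotations.

```typescript
// Load profile for a k6 script: ramp up fast, hold the peak, then ramp down.
// Durations, targets, and threshold values are illustrative assumptions.
interface Stage {
  duration: string; // e.g. '30s'
  target: number;   // virtual users to reach by the end of the stage
}

export const options: { stages: Stage[]; thresholds: Record<string, string[]> } = {
  stages: [
    { duration: '30s', target: 500 }, // sudden burst, like a sale opening
    { duration: '2m', target: 500 },  // sustained peak
    { duration: '30s', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],    // run fails unless under 1% of requests error
    http_req_duration: ['p(95)<2000'], // 95% of requests must finish under 2 seconds
  },
};

// Peak virtual users across all stages, handy for a quick sanity check.
export const peakUsers = Math.max(...options.stages.map((s) => s.target));
console.log(`peak virtual users: ${peakUsers}`);
```

Thresholds turn the run into a pass/fail gate, so a regression shows up as a failed test rather than something you have to spot in a chart.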

4. Enhanced Observability and Alerting

Vercel logs are great, don't get me wrong. They're super helpful. But for truly critical applications, I've now integrated more comprehensive logging and monitoring. Setting up alerts for things like API timeouts, high error rates, or database connection spikes? That's absolutely crucial. It really lets me be proactive instead of just reactive when issues pop up. You can't just cross your fingers and hope!
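To show the idea behind an error-rate alert, here's a tiny sliding-window sketch. It's a stand-in for what a monitoring service does for you, not the tooling I actually wired up. The class name ErrorRateMonitor, the one-minute window, and the 5% threshold are all arbitrary example choices.

```typescript
// Sliding-window error-rate check; a simplified stand-in for a monitoring service.
// Note: this sketch never prunes old events, which a real implementation should do.
class ErrorRateMonitor {
  private events: { at: number; ok: boolean }[] = [];
  private windowMs: number;
  private maxErrorRate: number;

  constructor(windowMs: number, maxErrorRate: number) {
    this.windowMs = windowMs;
    this.maxErrorRate = maxErrorRate;
  }

  // Record one request outcome at the given timestamp (milliseconds).
  record(ok: boolean, at: number): void {
    this.events.push({ at, ok });
  }

  // Fraction of failed requests inside the window ending at `now`.
  errorRate(now: number): number {
    const recent = this.events.filter((e) => e.at > now - this.windowMs && e.at <= now);
    if (recent.length === 0) return 0;
    return recent.filter((e) => !e.ok).length / recent.length;
  }

  // True when the recent error rate is above the configured threshold.
  shouldAlert(now: number): boolean {
    return this.errorRate(now) > this.maxErrorRate;
  }
}

const monitor = new ErrorRateMonitor(60_000, 0.05); // 1-minute window, 5% threshold
monitor.record(true, 1_000);
monitor.record(false, 2_000);
monitor.record(false, 3_000);
monitor.record(true, 4_000);
console.log(monitor.shouldAlert(5_000)); // 2 of 4 requests failed, so this prints true
```

Passing timestamps in explicitly keeps the logic easy to test; in a real handler you'd feed it Date.now() and hook shouldAlert up to whatever notification channel you use.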

5. Transparent Client Communication

During the whole incident, my absolute top priority was staying totally transparent with the client. I immediately told them about the issue, explained we were working on a fix, and kept them updated every 15-30 minutes. Even when things are going sideways, clear and honest communication *really* builds trust. It's tough, but it pays off big time. After we finally resolved it, I gave them a full post-mortem explanation and detailed all the steps we took to make sure it wouldn't happen again. They really appreciated that.

Moving Forward: Embracing the Process

That flash sale incident? Man, it was a tough pill to swallow, no doubt about it. But those lessons? They were just invaluable. It really pushed me to deepen my understanding of serverless architecture, database connection management, and just how critically important realistic load testing is. Because of this whole experience, NovexiQ is now running with much more robust development and deployment processes – and I'm super proud of that! So, to all my fellow developers out there, especially if you're just starting out or building your own ventures: don't ever fear failure. Seriously, embrace it! Every single bug, every timeout, every broken feature? They're all teachers, trust me. Dig into them, understand them, fix them, and share what you've learned. That's how we truly grow, not just as developers, but as effective problem-solvers. This wasn't just a failure for me; it was a huge investment in my skill set and in making NovexiQ even more resilient. It was a spectacular failure that helped me fail forward, you know? What's your favorite failure story? I'd genuinely love to hear about it in the comments below!

DEV Community