Jaeyoun Nam

Posted on Aug 18 • Edited on Oct 24

NestJS + OpenTelemetry (Sampling)

#webdev #javascript #nestjs

Grafana Cloud

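In the previous post, I sent OpenTelemetry data to Grafana Cloud, stored it, and viewed it there.

The free tier of Grafana Cloud gives you roughly 50 GB per month for logs and traces. If a service has few users and does not accumulate many traces (or does not log much), you can use it as is. When adopting it on a service with some scale, though, I worry that the data will pile up and blow past that limit.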

Sampling

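Sampling means picking a subset out of the whole. The end result is a reduction in the amount of telemetry data that gets stored.

Why Sampling Is Needed

Why is sampling necessary?

In the figure above, there is no need to store every circle (trace). Storing the important traces (errors, or those that take too long) plus a representative subset of the rest (some of the OK traces) is enough.

Types of Sampling

Sampling falls into two broad categories: head sampling and tail sampling.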
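The rule described here can be sketched as a small predicate. This is only an illustration: `TraceSummary`, `SLOW_THRESHOLD_MS`, and `OK_SAMPLE_RATIO` are hypothetical names and values, not part of OpenTelemetry.

```typescript
// Hypothetical summary of a finished trace; not an OpenTelemetry type.
interface TraceSummary {
  hasError: boolean;
  durationMs: number;
}

// Assumed thresholds, chosen for illustration only.
const SLOW_THRESHOLD_MS = 1000;
const OK_SAMPLE_RATIO = 0.1;

/**
 * Decide whether a trace is worth storing.
 * `roll` is a number in [0, 1), for example Math.random(); it is passed in
 * so the decision stays deterministic and easy to test.
 */
function shouldKeep(trace: TraceSummary, roll: number): boolean {
  if (trace.hasError) return true; // always keep errors
  if (trace.durationMs >= SLOW_THRESHOLD_MS) return true; // always keep slow traces
  return roll < OK_SAMPLE_RATIO; // keep a sample of the remaining OK traces
}
```

In real code the caller would pass `Math.random()` as `roll`.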

Head Sampling

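Head sampling makes the decision at the very start of a trace. The typical example is plain probabilistic sampling: keep only 10 percent of all traces and do not trace the rest at all.

JavaScript

The SDK provides TraceIdRatioBasedSampler out of the box.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

// A ratio between 0 and 1: 0.1 keeps about 10% of traces
const samplePercentage = 0.1;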

const sdk = new NodeSDK({
  // Other SDK configuration parameters go here
  sampler: new TraceIdRatioBasedSampler(samplePercentage),
});
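To give a feel for how a ratio-based decision can be derived from the trace ID alone, here is a simplified sketch. It is not the SDK's actual implementation, which differs in detail; `sampledByRatio` is a hypothetical helper and it assumes the trace ID is a hex string as in OpenTelemetry.

```typescript
/**
 * Simplified trace-ID ratio decision (illustrative only).
 * Reads the first 8 hex characters of the trace ID as an unsigned 32-bit
 * integer and keeps the trace when that value is below ratio * 2^32.
 */
function sampledByRatio(traceId: string, ratio: number): boolean {
  const value = parseInt(traceId.slice(0, 8), 16); // 0 .. 0xffffffff
  const upperBound = ratio * 0x100000000; // 2^32 scaled by the ratio
  return value < upperBound;
}
```

Because the result depends only on the trace ID and the ratio, the same trace always gets the same decision, with no random number needed at decision time.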

Drawbacks

Because head sampling drops traces unconditionally, without looking at what happened in them, important traces can be dropped.

Tail Sampling

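Tail sampling makes the decision at the end of a trace. Much more information is available at that point, so you can filter with whatever logic you want.
For example, you can always sample error traces.
Usually, the collector receives all traces first and samples afterwards.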
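A minimal sketch of that idea: buffer the spans of each trace, then decide once the trace is considered complete. `BufferedSpan` and `TailSamplingBuffer` are hypothetical names for illustration and do not correspond to any collector API; a real collector also has to decide when a trace is complete and cap memory use.

```typescript
// Hypothetical minimal span shape; not an OpenTelemetry type.
interface BufferedSpan {
  traceId: string;
  name: string;
  isError: boolean;
}

/** Buffers spans per trace and decides once the trace is complete. */
class TailSamplingBuffer {
  private readonly spansByTrace = new Map<string, BufferedSpan[]>();
  private readonly keepOkTrace: (traceId: string) => boolean;

  constructor(keepOkTrace: (traceId: string) => boolean) {
    this.keepOkTrace = keepOkTrace;
  }

  /** Store a finished span until its trace is complete. */
  add(span: BufferedSpan): void {
    const spans = this.spansByTrace.get(span.traceId) ?? [];
    spans.push(span);
    this.spansByTrace.set(span.traceId, spans);
  }

  /** Returns the spans to export for this trace and clears its state. */
  complete(traceId: string): BufferedSpan[] {
    const spans = this.spansByTrace.get(traceId) ?? [];
    this.spansByTrace.delete(traceId);
    const hasError = spans.some((s) => s.isError);
    // Keep the whole trace if any span failed, otherwise defer to the policy.
    return hasError || this.keepOkTrace(traceId) ? spans : [];
  }
}
```

The per-trace map is the state that has to be kept around, which is where much of the operational cost of tail sampling comes from.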

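Drawbacks

It can be hard to implement. The rules have to change whenever the system or its conditions change.
Sampling requires keeping state, which makes it hard to operate.
Some tail samplers are vendor-specific.

Implementation

Let's implement tail sampling with a custom span processor.

Implementing SamplingSpanProcessor

Create the file sampling-span-processor.ts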

import { Context, SpanStatusCode } from "@opentelemetry/api";
import {
  SpanProcessor,
  ReadableSpan,
  Span,
} from "@opentelemetry/sdk-trace-node";

/**
 * Span processor that forwards every error span, plus a fixed ratio of all
 * other spans, to the wrapped processor.
 */
export class SamplingSpanProcessor implements SpanProcessor {
  constructor(
    private _spanProcessor: SpanProcessor,
    private _ratio: number
  ) {}

  /**
   * Forces to export all finished spans
   */
  forceFlush(): Promise<void> {
    return this._spanProcessor.forceFlush();
  }

  onStart(span: Span, parentContext: Context): void {
    this._spanProcessor.onStart(span, parentContext);
  }

  /**
   * Maps the trace ID to a value in [0, 1) by summing its character codes,
   * so every span of the same trace gets the same decision.
   */
  shouldSample(traceId: string): boolean {
    let accumulation = 0;
    for (let idx = 0; idx < traceId.length; idx++) {
      accumulation += traceId.charCodeAt(idx);
    }
    const cmp = (accumulation % 100) / 100;
    return cmp < this._ratio;
  }

  /**
   * Called when a {@link ReadableSpan} is ended, if the `span.isRecording()`
   * returns true.
   * @param span the Span that just ended.
   */
  onEnd(span: ReadableSpan): void {
    // Status codes: UNSET = 0, OK = 1, ERROR = 2
    if (span.status.code === SpanStatusCode.ERROR) {
      // Error spans are always forwarded
      this._spanProcessor.onEnd(span);
    } else if (this.shouldSample(span.spanContext().traceId)) {
      // Other spans are forwarded only when their trace falls within the ratio
      this._spanProcessor.onEnd(span);
    }
  }

  /**
   * Shuts down the processor. Called when SDK is shut down. This is an
   * opportunity for processor to do any cleanup required.
   */
  async shutdown(): Promise<void> {
    return this._spanProcessor.shutdown();
  }
}

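this._spanProcessor.onEnd(span); is called, and the span exported, only when status.code is ERROR (2) or the trace falls within the ratio. Note that the decision is made per span: for a trace outside the ratio, only its error spans are exported, not the whole trace.

Updating the OTel SDK

Update spanProcessors in main.ts.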
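To see which spans that rule lets through, the decision can be pulled out into a standalone function and tried on a few trace IDs. This is a sketch outside of OpenTelemetry: `traceIdFraction` and `wouldExport` are hypothetical helpers that copy the logic of shouldSample and onEnd, and the trace IDs are made up.

```typescript
const STATUS_ERROR = 2; // same numeric value as SpanStatusCode.ERROR

// Same hash as shouldSample above: sum of character codes, reduced to [0, 1).
function traceIdFraction(traceId: string): number {
  let accumulation = 0;
  for (let idx = 0; idx < traceId.length; idx++) {
    accumulation += traceId.charCodeAt(idx);
  }
  return (accumulation % 100) / 100;
}

// Hypothetical helper mirroring the branch in onEnd.
function wouldExport(statusCode: number, traceId: string, ratio: number): boolean {
  return statusCode === STATUS_ERROR || traceIdFraction(traceId) < ratio;
}

const allZeros = "0".repeat(32); // 32 * 48 = 1536, fraction 0.36
const allAs = "a".repeat(32); // 32 * 97 = 3104, fraction 0.04

console.log(wouldExport(2, allZeros, 0.1)); // true: error spans are always exported
console.log(wouldExport(1, allZeros, 0.1)); // false: 0.36 is not below 0.1
console.log(wouldExport(1, allAs, 0.1)); // true: 0.04 is below 0.1
```

Since the hash only has 100 buckets, ratios finer than 0.01 cannot be expressed with it.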

  spanProcessors: [
    new SamplingSpanProcessor(
      new BatchSpanProcessor(traceExporter),
      samplePercentage
    ),
  ],
