DEV Community

Mati
Run Llama2-7b at 100+ TPS on an A10

Hi folks.

I've been working on setting up and managing TensorRT-LLM and Triton backend scripts to build the Llama2-7b model in FP16, int8, and int4 formats.
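To give a rough sense of why the three formats matter on a single GPU, here is a minimal sketch that estimates the weight-only memory footprint at each precision. It assumes a nominal 7 billion parameters and ignores the KV cache, activations, and quantization scales, so real engine sizes will differ. The function name `weight_memory_gib` is made up for illustration and is not part of TensorRT-LLM or the linked repo.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Estimate weight-only memory in GiB for a given precision."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumption: nominal parameter count for a "7b" model.
NUM_PARAMS = 7e9

for name, bits in [("FP16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB of weights")
```

Halving the bits per weight halves the bytes that have to be stored and read, which is one reason lower-precision builds tend to decode faster.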

I ran a benchmark with the int4 build on an A10 and achieved an inference speed of approximately 100 tokens per second.
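For anyone who wants to reproduce a tokens-per-second figure, here is a minimal sketch of how such a number can be measured. It is not the benchmark script from the repo: `tokens_per_second` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in that you would replace with a real call to your deployed model.

```python
import time
from typing import Callable, List


def tokens_per_second(
    generate: Callable[[str, int], List[int]],
    prompt: str,
    max_new_tokens: int,
    runs: int = 3,
) -> float:
    """Average generated tokens per second over several runs."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        output_ids = generate(prompt, max_new_tokens)
        total_seconds += time.perf_counter() - start
        # Count only the tokens the model actually produced.
        total_tokens += len(output_ids)
    return total_tokens / total_seconds


def fake_generate(prompt: str, max_new_tokens: int) -> List[int]:
    """Hypothetical stand-in for a real inference call."""
    time.sleep(0.01)  # pretend the model takes 10 ms
    return list(range(max_new_tokens))


if __name__ == "__main__":
    tps = tokens_per_second(fake_generate, "Hello", max_new_tokens=10)
    print(f"{tps:.1f} tokens/s")
```

Averaging over a few runs smooths out one-off latency spikes; for a real measurement you would likely also want a warm-up run before timing.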

GitHub: Llama7b-TensorRT-LLM

Top comments (1)

Lionel Tchami ♾️☁️

Hello @mattick27, great work. I love the hard work you put into this. Above all, thanks for the link; it helps to be able to understand what you are referring to.