Hi folks.
I've been working on setting up and managing the TensorRT-LLM and Triton backend scripts to build the Llama2-7b model in FP16, INT8, and INT4.
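For reference, here's a minimal sketch of how the three engine builds can be driven from Python. The flag names (`--use_weight_only`, `--weight_only_precision`, etc.) follow the `examples/llama/build.py` script from the TensorRT-LLM release I worked with; they change between versions, so treat the exact arguments and paths as assumptions and check them against your local checkout.

```python
# Sketch: drive TensorRT-LLM's examples/llama/build.py for three precisions.
# Flag names are assumptions based on one TensorRT-LLM release; verify them
# against the build.py in your checkout before running.
import subprocess

MODEL_DIR = "./llama-2-7b-hf"  # hypothetical path to the HF checkpoint

builds = {
    "fp16": ["--dtype", "float16"],
    "int8": ["--dtype", "float16",
             "--use_weight_only", "--weight_only_precision", "int8"],
    "int4": ["--dtype", "float16",
             "--use_weight_only", "--weight_only_precision", "int4"],
}

for name, extra_flags in builds.items():
    cmd = [
        "python", "build.py",
        "--model_dir", MODEL_DIR,
        "--output_dir", f"./engines/llama2-7b-{name}",
        *extra_flags,
    ]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop early if any build fails
```

Each engine lands in its own output directory, which keeps the Triton model repository layout straightforward later on.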
I benchmarked the INT4 build and measured an inference speed of approximately 100 tokens per second.
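To make that number concrete, this is roughly how I'd compute decode throughput. The `run_generation` callable here is a hypothetical stand-in for whatever you actually benchmark through (a Triton client request, the TensorRT-LLM Python runner, etc.), not a real API:

```python
# Sketch: measure throughput in generated tokens per second.
# `run_generation` is a hypothetical stand-in for the real inference call.
import time

def tokens_per_second(run_generation, prompt: str, max_new_tokens: int) -> float:
    start = time.perf_counter()
    output_token_ids = run_generation(prompt, max_new_tokens)  # generated token ids
    elapsed = time.perf_counter() - start
    return len(output_token_ids) / elapsed

if __name__ == "__main__":
    # Dummy generator just to show the call shape; swap in the real client.
    fake = lambda prompt, n: list(range(n))
    print(f"{tokens_per_second(fake, 'Hello', 128):.1f} tok/s")
```

In practice you'd want a few warmup runs and an average over several requests, since the first request after engine load is usually slower.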
Comments
Hello @mattick27, great work. I love the hard work you put into this. Above all, thanks for the link; it helps to understand what you are referring to.