DEV Community

Benchmarking CSV File Processing: Golang vs NestJS vs PHP vs Python

Linda Sebastian on August 12, 2024

Introduction Processing large CSV files efficiently is a common requirement in many applications, from data analysis to ETL (Extract, Tr...
 
Aman Kumar • Edited

I couldn't help but notice abnormalities in the results, and Node.js doesn't get a fair showing there.

I cloned your repository and ran the tests with my own JavaScript version, using both Node.js and Bun.
The results look good.

Processor: Ryzen 7 5800H
RAM: 16GB DDR4

Nodejs and Bunjs:

PS C:\aman\csv-parsing-battle\read-csv-nodejs> node .\index.js
Nodejs Execution Time: 0.5335701
Total Sales: $ 274836733.6899998
Top Product: 1067 with sales $1067

PS C:\aman\csv-parsing-battle\read-csv-nodejs> bun index.js
Nodejs Execution Time: 0.4552085
Total Sales: $ 274836733.6899998
Top Product: 1067 with sales $1067

The same golang code for reference:

PS C:\aman\csv-parsing-battle\read-csv-go> go run .\sales.go
Golang Execution time: 245.1086ms
Total Sales: $274836733.69
Top Product: 1067 with sales $308326.83

EDIT:
Added a PR github.com/rocklinda/csv-parsing-b...
Used node 22

Marcelloh

Why the second loop in Go? You could have looked for the topProduct in the first loop.
(then you don't need a map, saving memory and such...)

I think this might be faster :-)

Linda Sebastian

You are right, I just realized it now. Thank you for your sharp eye.

Marcelloh

Can you check if the outcome is a bit more in favour of Go? (and let us know)

Linda Sebastian • Edited

I made some changes, which you can see in my article above. Changing from two loops to one loop doesn't make any difference; I still need a map for storing the values and getting the product with the highest price. But I think that in real-world cases with a lot of logic it will make a difference. Do you have any thoughts about this? Let me know...

Marcelloh

Linda, please have a look at this (to understand more about what I meant)
goplay.space/#Ny0OT89_zNP
I can't test it, since I don't have the file.

The trick is: no map, no loop for the lookup. This must be slightly better.

Linda Sebastian

I think you misunderstood the case: topProduct is the product with the highest sum of sales, not the product with the highest price. There's no way I can get the sum for each product with a single integer variable. I still need a map to collect all the sums and get the highest total sales value. My approach is the same as storing values in an array in PHP/NodeJS or a list in Python.

𒎏Wii 🏳️‍⚧️ • Edited

You're missing the case of duplicate products.

If a product appears more than once in the list, you need to add the values.
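
To make the point of this thread concrete, here is a minimal Python sketch of a single pass that keeps the map (needed because of duplicate products) but tracks the top product in the same loop, so no second lookup loop is required. The column names `product_id`, `quantity` and `price` and the sample rows are assumptions for illustration, not taken from the repository. The running-maximum trick also assumes every sale amount is non-negative; with refunds you would need a final pass over the totals.

```python
import csv
import io
from collections import defaultdict


def summarize_sales(rows):
    """Return (total_sales, top_product, top_product_sales) in one pass."""
    totals = defaultdict(float)  # per-product sums; duplicates are added up
    total_sales = 0.0
    top_product, top_sales = None, 0.0
    for row in rows:
        product = row["product_id"]
        amount = float(row["quantity"]) * float(row["price"])
        total_sales += amount
        totals[product] += amount
        # Totals only grow (amounts assumed >= 0), so a running max is safe.
        if totals[product] > top_sales:
            top_product, top_sales = product, totals[product]
    return total_sales, top_product, top_sales


# Hypothetical sample data; product 1 appears twice.
sample = io.StringIO("product_id,quantity,price\n1,2,10.0\n2,1,15.0\n1,1,10.0\n")
total, top, top_total = summarize_sales(csv.DictReader(sample))
print(total, top, top_total)
```

The map is still there, which matches Linda's point, but the separate loop over the map goes away, which is closer to what Marcelloh was after.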

salah eddine

Using Polars with Python is more flexible and powerful for handling large datasets and performing complex data manipulation. While Golang is powerful, it isn't designed for working with data-frame-style datasets. However, Golang could dominate if a framework similar to Polars is developed for it.

Linda Sebastian

Thank you for your insight. In this case, I just wanted to try plain Python. Next time I will try Polars with a larger dataset.

Kamil Yesil • Edited

I checked GitHub. Why does the NestJS version run a web server? The other tests run directly. It's not fair. Maybe run the JS with plain Node.js or Bun only.

Linda Sebastian

I thought about it as well, you are right. I will fix it later.

grant horwood

just an fyi: if you're benchmarking a php script, hrtime stopwatch is a better option.

php.net/manual/en/class.hrtime-sto...
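
The same idea (time with a monotonic, high-resolution clock rather than wall-clock time) carries over to the other languages in the benchmark. As a rough illustration in Python, not the PHP API linked above, a sketch using the standard library's `time.perf_counter_ns` could look like this; `time_call` is a hypothetical helper name.

```python
import time


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter_ns()
    result = fn(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns / 1e9


# Stand-in workload; in the benchmark this would be the CSV processing function.
result, seconds = time_call(sum, range(1_000_000))
print(f"Execution time: {seconds:.6f} s")
```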

Jeroen Deviaene (Jerodev)

In Go, you are converting the quantity to an int and then converting it to a float directly afterwards. Converting to a float immediately should noticeably improve performance.

Steve McDougall

You could probably refactor the PHP code to use generators for better speed. 1 million rows is enough data to see an improvement. There's a tipping point: if there isn't enough data, it's actually slower. Go figure.

Linda Sebastian

I've never used generators before. Yeah, I will refactor this later to see the difference.
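
For anyone else who hasn't used generators: the idea is to yield one row at a time instead of building the whole array in memory first. PHP's `yield` works along the same lines; the sketch below shows the pattern in Python. The column names and sample rows are assumptions for illustration, and this says nothing about how much faster (or slower) the PHP version would actually be.

```python
import csv
import io


def iter_sales(stream):
    """Lazily yield (product_id, amount) pairs, one row at a time."""
    for row in csv.DictReader(stream):
        yield row["product_id"], float(row["quantity"]) * float(row["price"])


# Hypothetical sample data standing in for the real CSV file.
sample = io.StringIO("product_id,quantity,price\n1,2,10.0\n2,1,15.0\n1,1,10.0\n")

# Only one row is held in memory at any time while summing.
total = sum(amount for _, amount in iter_sales(sample))
print(total)
```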

Konstantin

JIT enabled?

Lucas Sproule

Why use a library for JavaScript if you won't use a library for Python?

Linda Sebastian

The initial idea was that I really wanted to know whether the NestJS framework is fast enough, because I used this framework at my previous workplace. However, the Node.js version has already been updated on GitHub with help from @amankrokx, so you can take a look.

Aman Kumar

I think the huge execution time difference is not because of using NestJS but rather because of the streaming way of reading and parsing the CSV file. The function call made for each row can also contribute to it.
If we read the CSV all at once and then process it, I believe it will run well within 6 seconds.

Then again, I haven't really used NestJS, so some other factors might be involved.
But overall, processing is processing: as long as the compiler/interpreter produces similar machine code, it should perform similarly.
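
To illustrate the two strategies being contrasted here, a small Python sketch (not the NestJS code, and making no claim about its actual timings) that computes the same total once row by row through a streaming parser and once by reading everything and splitting it in bulk. The column names and data are assumptions, and the naive `split(",")` only works when no field contains quoted commas.

```python
import csv
import io


def total_streaming(stream):
    """Parse and process one row at a time through the csv module."""
    total = 0.0
    for row in csv.DictReader(stream):
        total += float(row["quantity"]) * float(row["price"])
    return total


def total_read_all(stream):
    """Read the whole input at once, then split and process it in bulk."""
    lines = stream.read().splitlines()
    header = lines[0].split(",")
    q, p = header.index("quantity"), header.index("price")
    total = 0.0
    for line in lines[1:]:
        fields = line.split(",")  # assumes no quoted commas in any field
        total += float(fields[q]) * float(fields[p])
    return total


# Hypothetical sample data standing in for the real CSV file.
data = "product_id,quantity,price\n1,2,10.0\n2,1,15.0\n"
streamed = total_streaming(io.StringIO(data))
bulk = total_read_all(io.StringIO(data))
print(streamed, bulk)
```

Both give the same answer; wrapping each call in a timer on the real file would show how much the per-row overhead matters on a given runtime.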

Iván Vitta

So compiled languages are faster than interpreted languages?

Linda Sebastian

I can't say whether that is correct or incorrect, because to know that we would need to compare many compiled and interpreted languages. In this experiment, the only compiled language I used was Go.

ChooKing

I would like to see Golang compared to other native code. All the other languages you tested are interpreted.

Linda Sebastian

Yeah, that's a good idea. The reason I used those languages is their popularity in backend technology; I didn't consider whether they are interpreted or compiled.