Thanks for the article!
Did you consider processing the file using LINQ methods? They are lazily evaluated, which means there is no memory allocation for the data until you try to materialize the results (e.g. with ToArray), and during processing the only memory in use is for the current iteration. So aside from the IEnumerable overhead, the only memory usage should be what is in the file read buffer plus any locals you are using.
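To make the laziness concrete, here is a minimal sketch (with a made-up `Range`/`Select`/`Where` pipeline, not the article's code) showing that declaring the query runs nothing, and that enumeration pulls one element at a time:

```csharp
using System;
using System.Linq;

class LazyDemo
{
    public static void Main()
    {
        // Building the query only allocates the small iterator objects --
        // no element is produced and nothing is printed yet.
        var query = Enumerable.Range(1, 5)
            .Select(n => { Console.WriteLine($"producing {n}"); return n * n; })
            .Where(sq => sq % 2 == 1);

        Console.WriteLine("query declared, nothing has run yet");

        // Only now, on enumeration, do elements flow through one at a time.
        foreach (var sq in query)
            Console.WriteLine($"consumed {sq}");
    }
}
```

Running this prints "query declared, nothing has run yet" first, then interleaved "producing"/"consumed" lines, confirming nothing executed at declaration time.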
Here is a good example. However, rather than using the LINQ syntax, I tend to use the extension methods.
Instead of

```csharp
var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;
```

I like this better

```csharp
var typedSequence = ReadFrom(path)
    .Select(line => ParseLine(line))
    .Where(record => record.Active) // for example
    .Select(record => record.Key);
```
It is admittedly a bit uglier with all the extra symbols, but it does not feel as foreign with the rest of my code as the LINQ syntax.
If you debug through this, you will see that one line goes through all of the steps in order. Maybe it does not match and gets stopped at the `Where` clause; then the next line is read and goes through the steps.

I find that (once you get used to the syntax) LINQ is much more understandable than solutions using imperative loops. I see `where` and I know that all it is doing is filtering. I see `select` and I know it is only converting the item to a different form. Imperative loops, on the other hand, have to be read carefully to make sure of exactly what is being performed.

Your hand-coded optimization at the bottom is clearly going to be superior in resource usage, but it trades off being harder to understand and maintain. Using LINQ properly can get you a large percentage of the same gains in memory usage (and consequently runtime) while still being pretty expressive.
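As a sketch of that contrast (with a hypothetical `Record` type and `ParseLine` helper standing in for the article's row type), the same filter-and-project logic side by side:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ReadabilityDemo
{
    record Record(string Key, bool Active);   // hypothetical row type

    static Record ParseLine(string line)
    {
        var parts = line.Split(',');
        return new Record(parts[0], parts[1] == "1");
    }

    public static void Main()
    {
        var lines = new[] { "a,1", "b,0", "c,1" };

        // LINQ version: each operator names its intent (parse, filter, project).
        var activeKeys = lines.Select(ParseLine)
                              .Where(r => r.Active)
                              .Select(r => r.Key);

        // Imperative equivalent: the same logic, but the reader has to trace
        // the loop body to see that it only filters and projects.
        var keys = new List<string>();
        foreach (var line in lines)
        {
            var record = ParseLine(line);
            if (record.Active)
                keys.Add(record.Key);
        }

        Console.WriteLine(string.Join(",", activeKeys)); // a,c
        Console.WriteLine(string.Join(",", keys));       // a,c
    }
}
```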
In any case, thank you for posting this!
Hi Kasey,
Thanks for the detailed reply. No, I did not consider LINQ. I would be very interested in any sample code that showed a large reduction in allocations.

I do agree that having a LINQ solution would greatly improve readability. Glad you enjoyed the article!
Cheers,
Indy
"no memory allocation ... until you materialize the results"
In this case, aren't you going to perform exactly the same allocations, then?
Unless... LINQ designers may have gone through the same optimization process @Indy has :) and reused buffers all along.
The point I was making there was that LINQ methods do not load or run anything when they are declared (a common mistaken assumption) -- they only "do something" when they are enumerated.
The main issue with the original solution was runaway memory usage, because all the rows are loaded into memory at once, and then a new intermediate object is allocated at each step for every row. So memory usage is roughly: number of rows * (row size + intermediate object sizes).

Using LINQ as I mentioned, only one row is fully processed at a time before the next row is fetched, so at most you have one set of allocations for the row and each intermediate object. So memory usage is roughly: row size + intermediate object sizes.
Any solution processing files would probably also do well to stream its results to an IO output, to avoid collecting large result sets in memory.
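A sketch of that end-to-end streaming shape (hypothetical file paths and CSV layout; the key point is that `File.ReadLines` is lazy, unlike `File.ReadAllLines`, and `File.WriteAllLines` consumes an IEnumerable as it writes, so neither the rows nor the results are ever all in memory at once):

```csharp
using System.IO;
using System.Linq;

class StreamingPipeline
{
    public static void Run(string inputPath, string outputPath)
    {
        // File.ReadLines yields one line at a time from the read buffer...
        var keys = File.ReadLines(inputPath)
            .Select(line => line.Split(','))
            .Where(parts => parts[1] == "1")   // hypothetical "Active" flag column
            .Select(parts => parts[0]);        // hypothetical "Key" column

        // ...and File.WriteAllLines pulls from that lazy sequence as it
        // writes, buffering results to disk instead of collecting them.
        File.WriteAllLines(outputPath, keys);
    }
}
```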
If Garbage Collector performance is an issue, that can be optimized separately. Common strategies are: value types (allocated on the stack frame and copied when passed into or returned from other stack frames), a pre-allocated object pool, or -- if you need the same consistent set of objects for each row -- a set of singleton objects, which is equivalent to an object pool of size 1. Just remember to reset them between iterations.
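A minimal sketch of that "object pool of size 1" variant (hypothetical `RowBuffer` type; the same instance is refilled for every line, so no row object is allocated per iteration):

```csharp
using System;

class RowBuffer                        // hypothetical reusable row object
{
    public string Key = "";
    public bool Active;

    public void Reset() { Key = ""; Active = false; }

    // Refill the same instance instead of allocating a new record per row.
    public void Fill(string line)
    {
        Reset();                       // important: clear state between rows
        var parts = line.Split(',');
        Key = parts[0];
        Active = parts[1] == "1";
    }
}

class PoolDemo
{
    public static void Main()
    {
        var row = new RowBuffer();     // the singleton, allocated once
        foreach (var line in new[] { "a,1", "b,0" })
        {
            row.Fill(line);
            if (row.Active)
                Console.WriteLine(row.Key);
        }
    }
}
```

(The `Split` call here still allocates; a fully allocation-free version would parse spans instead, but that is beyond this sketch.)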
I am glad you posted this response detailing the usage of LINQ. When I started reading this article, I was thinking that it would be some interesting LINQ wizardry. However, it is always nice to see optimizations in any form.