Benchmarking dict updates in Python
After playing around for several days with HackerRank exercises, I found myself thinking about ways to update a dict that has lists as values.
The plot twist here is that in the case I was entertaining, the list should be created with the single new item if its key was not present in the dict, or have the new item appended to it if the key was already there.
Let me illustrate. This is the initial approach, checking if the key is there before touching it:
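The original snippet isn't shown here, so this is a minimal sketch of that first approach; the `add_item` helper and sample keys are my own illustration:

```python
data = {}

def add_item(data, key, value):
    # Check membership first: create the list on the first
    # occurrence of a key, append on every later one.
    if key in data:
        data[key].append(value)
    else:
        data[key] = [value]

add_item(data, "a", 1)
add_item(data, "a", 2)
add_item(data, "b", 3)
print(data)  # {'a': [1, 2], 'b': [3]}
```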
But in the pursuit of a more succinct take on this, I gave the defaultdict type, from Python's collections module, a try:
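Again the original code is not reproduced here, so this is a sketch of the defaultdict variant, reusing the same hypothetical `add_item` helper and sample keys as above:

```python
from collections import defaultdict

# defaultdict(list) creates an empty list for any missing key,
# so no membership check is needed.
data = defaultdict(list)

def add_item(data, key, value):
    data[key].append(value)

add_item(data, "a", 1)
add_item(data, "a", 2)
add_item(data, "b", 3)
print(dict(data))  # {'a': [1, 2], 'b': [3]}
```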
Both return exactly the same result at this point, so we are at a fair baseline to start thinking about which one we should pick and why.
At this point I tried putting the little snippets into a loop (excluding the import in the second version) and ran it a large number of times.
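The post doesn't show the harness itself, but a benchmark along these lines could be written with the standard timeit module; the key set and iteration counts below are illustrative, not the author's:

```python
import timeit
from collections import defaultdict

# Hypothetical workload: repeated keys so both branches get exercised
KEYS = [str(i % 10) for i in range(100)]

def plain_dict():
    data = {}
    for k in KEYS:
        if k in data:
            data[k].append(1)
        else:
            data[k] = [1]

def with_defaultdict():
    data = defaultdict(list)
    for k in KEYS:
        data[k].append(1)

# Bump `number` (e.g. to 500_000) to reproduce the larger run
n = 5000
t_plain = timeit.timeit(plain_dict, number=n)
t_default = timeit.timeit(with_defaultdict, number=n)
print(f"plain dict:  {t_plain:.3f}s")
print(f"defaultdict: {t_default:.3f}s")
```

Measuring both variants in the same process, averaged over a few runs, keeps the comparison reasonably fair.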
The difference in performance when running it 5,000 times was about 7% in total time (averaged over several runs of each test), not huge, yet not insignificant.
So, what about kicking it up a notch? I ran it half a million times instead of 5,000, and the results got quite a bit more interesting.
The difference grew to around 40% this time. Not at all irrelevant.
Final notes
Imagine now that this piece belongs to a Python backend under heavy load. It wouldn't be unexpected to have it run a couple million times a day, perhaps even more. The difference in performance would then mean a difference in hosting costs!
The second take-away for me, as someone who tends to prefer the less verbose forms of coding expressions, was to always be mindful of the runtime cost you may add when reaching for libraries or non-language-native code and datatypes. Even in this case, with Python's own defaultdict type, the performance impact in a heavy-load scenario ended up being quite noticeable.