Read Large Files Efficiently with Python

With the rate at which data is growing, the size of the files we are expected to process seems to grow exponentially. These increasingly large files mean we need to do everything we can to optimize memory usage and processor time – especially when working with Python (I love you, Python, but you’re definitely not the fastest!).
Anyhow, here is our sample data (multiply this by a couple billion for a large file):
Joe
bill
cindy
mary
henry
joe
mary
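The sample above is obviously tiny. To “multiply it by a couple billion,” you can generate a large test file yourself. Here is a throwaway sketch – the ten-million-line count and the reuse of the names_data.txt file name are just my choices for illustration:

import random

names = ['joe', 'bill', 'cindy', 'mary', 'henry']
with open('names_data.txt', 'w') as test_file:
    # write ten million random names, one per line
    for _ in range(10_000_000):
        test_file.write(random.choice(names) + '\n')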
In most tutorials and books on reading large files, you will see something like this:
name_counts = {}
file_name = 'names_data.txt'
with open(file_name) as names_file:
    names = names_file.read().splitlines()
    for name in names:
        name = name.lower()  # deal with different casing
        if name in name_counts:
            name_counts[name] += 1
        else:
            name_counts[name] = 1

print(name_counts)
# output is: {'joe': 2, 'bill': 1, 'cindy': 1, 'mary': 2, 'henry': 1}
While this works, read().splitlines() loads the whole file into a list – all at once! As a result, this approach has O(n) space complexity, where n is the size of the file – needless to say, this is not memory efficient. The alternative is to read one line at a time:
name_counts = {}
file_name = 'names_data.txt'
with open(file_name) as names_file:
    for name in names_file:
        name = name.strip().lower()  # remember to strip the newline
        if name in name_counts:
            name_counts[name] += 1
        else:
            name_counts[name] = 1

print(name_counts)
# output is: {'joe': 2, 'bill': 1, 'cindy': 1, 'mary': 2, 'henry': 1}
By changing just two lines of code, the file reading now uses O(1) memory – iterating over the file object yields one line at a time, so we never hold more than a single line in memory. (The name_counts dictionary still grows with the number of distinct names, but not with the length of the file.) You just need to remember to strip the newline character off each line!
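As an aside, the standard library can handle the counting bookkeeping for us. Here is a minimal sketch of the same streaming approach using collections.Counter (same assumed names_data.txt file):

from collections import Counter

name_counts = Counter()
file_name = 'names_data.txt'
with open(file_name) as names_file:
    for name in names_file:
        # strip the newline and normalize casing before counting
        name_counts[name.strip().lower()] += 1

print(dict(name_counts))
# output is: {'joe': 2, 'bill': 1, 'cindy': 1, 'mary': 2, 'henry': 1}

Counter returns 0 for missing keys, so the if/else disappears entirely.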
With a large enough file, that one change could mean the difference between getting things done and bringing your machine to a grinding halt!
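Don’t take my word for it – Python’s built-in tracemalloc module lets you measure the peak memory of each approach yourself. A rough sketch, where count_all_at_once and count_line_by_line are just the two snippets above wrapped in functions (with our tiny sample file the difference is negligible, but it grows with file size):

import tracemalloc

def count_all_at_once(file_name):
    # first approach: read().splitlines() materializes every line in memory
    name_counts = {}
    with open(file_name) as names_file:
        for name in names_file.read().splitlines():
            name = name.lower()
            name_counts[name] = name_counts.get(name, 0) + 1
    return name_counts

def count_line_by_line(file_name):
    # second approach: iterate the file object one line at a time
    name_counts = {}
    with open(file_name) as names_file:
        for name in names_file:
            name = name.strip().lower()
            name_counts[name] = name_counts.get(name, 0) + 1
    return name_counts

for counter in (count_all_at_once, count_line_by_line):
    tracemalloc.start()
    counter('names_data.txt')
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    print(f'{counter.__name__}: peak memory {peak:,} bytes')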
Did you like this post? We are working on a series of posts to help readers go from newbies in Python to pros. Join our email list to make sure you don’t miss that series.