Performance matters, don't use string manipulation, use math wherever possible
June 28th 2023 | ~ 4 minute read
Preface
Ah, performance. Always a hotly debated topic among programmers. Does it even matter when computers are so fast these days? I'd argue that it does, and that you're doing everyone a disservice if you don't take ample care to ensure smooth sailing for everyone involved.
The problem
Here's an issue I recently came across while writing some python code. I had to do some work on an integer representing a bank account number that is structured in the following fashion:
AAA-BBBBBBBBBBBBB-CC
AAA -> Bank identifier (3 digits)
BBBBBBBBBBBBB -> Account number (13 digits)
CC -> checksum (2 digits)
Notice the checksum at the end? It's used to validate the integrity of the whole number. It's calculated using the following formula:
98 - bank_identifier-account_number * 100 % 97
You take the first 16 digits of the whole thing and then subtract the modulus of that x 100 with 97 from the number 98. This is your checksum.
It's then simple to verify that the resulting checksum is the same as is written at the end of the bank account number.
The solution
Let's take a moment to think about how we might implement this in python. The first thing that comes to mind to a good chunk of people is something like this:
def validate_account_no_str(account_no: int) -> bool:
account_no_str = str(account_no)
csum = int(account_no_str[-2:])
account_no_without_csum = int(account_no_str[0:16])
calculated_csum = 98 - account_no_without_csum * 100 % 97
return csum == calculated_csum
Cast the number to a string and then using string manipulation split it into parts we need and then cast those bits back into integers to do the calculation.
While this works, it's wasteful, you end up doing all sorts of unnecessary allocations and type conversions, all of which is slowing the code down significantly.
A better way to do this is to think about the problem in a different way. We're dealing with integers here. Is there a simple method to do all of this with some math without resorting to type conversion? Yes, there is.
def validate_account_no(account_no: int) -> bool:
csum = account_no % 100
account_no_without_csum = account_no // 100
calculated_csum = 98 - account_no_without_csum * 100 % 97
return csum == calculated_csum
To get the last two digits representing the checksum, just do a simple modulus operation with 100. Likewise, to get the first 16 digits, just do an integer division with 100. Then, just apply the same calculation as before.
The performance difference
Let's pit these two solutions against each other with a little test. We'll use the timeit
python module to run them each 10 million times, and then we'll compare how long each one took to complete. Here's the complete code:
#!/usr/bin/env python3
import timeit
def validate_account_no_str(account_no: int) -> bool:
account_no_str = str(account_no)
csum = int(account_no_str[-2:])
account_no_without_csum = int(account_no_str[0:16])
calculated_csum = 98 - account_no_without_csum * 100 % 97
return csum == calculated_csum
def validate_account_no(account_no: int) -> bool:
csum = account_no % 100
account_no_without_csum = account_no // 100
calculated_csum = 98 - account_no_without_csum * 100 % 97
return csum == calculated_csum
def main():
print(
timeit.timeit(
lambda: validate_account_no(200000000012345600),
number=10_000_000,
)
)
print(
timeit.timeit(
lambda: validate_account_no_str(200000000012345600),
number=10_000_000,
)
)
if __name__ == '__main__':
main()
And, now for the results:
1.7433777059995919
4.486554025999794
The integer version took a little over 1.7s to complete, while the string version took a staggering 4.5s! The integer version is almost 2.6 times faster with 10 million iterations.
Conclusion
The key takeaway is that sometimes, the "obvious" solution just isn't as optimal as what you can get if you put just a little more thought into the solution.
In a world where dwindling attention spans are a given, it's that much more imperative to write faster, more performant code instead of relying on hardware improvements that many are still unable to access. Our users deserve better!