How to Speed Up Data Processing with Numpy Vectorization


When dealing with smaller datasets it is easy to assume that normal Python methods are quick enough to process data. However, with the increase in the volume of data produced, and generally available for analysis, it is becoming more important than ever to optimise code to be as fast as possible.

We will therefore look into how using vectorization, and the numpy library, can help you speed up numerical data processing.

Why is python slow? #

Python is well known as an excellent language for data processing and exploration. The main attraction is that it is a high level language, so it is easy and intuitive to understand and learn, and quick to write and iterate. These are all the features you would want if your focus is data analysis or processing rather than writing mountains of code.

However, this ease of use comes with a downside. Python is much slower at processing calculations than lower level languages such as C.


Fortunately, as python is one of the chosen languages of the data analysis and data science communities (among many others), there are extensive libraries and tools available to mitigate the inherent 'slowness' of python when it comes to processing large amounts of data.

What exactly is vectorization? #

You will often see the term "vectorization" when talking about speeding calculations up with numpy. Numpy even has a method called "vectorize", as we will see later.

A general Google search will result in a whole lot of confusing and contradictory information about what vectorization actually is, or just generalised statements that don't tell you a great deal:

The concept of vectorized operations on NumPy allows the use of more optimal and pre-compiled functions and mathematical operations on NumPy array objects and data sequences. The Output and Operations will speed up when compared to simple non-vectorized operations.

- GeeksforGeeks.org, the first Google result when searching "what is numpy vectorization?"

It just doesn't say much more than: it will get faster due to optimisations. What optimisations?

What optimisations? #

The trouble is that numpy is a very powerful, optimised tool. When implementing something like vectorization, the implementation in numpy includes a lot of well thought out optimisations, on top of just plain old vectorization. I think this is where a lot of the confusion comes from, and breaking down what is going on (to some degree at least) would help to make things clearer.

Breaking down vectorization in numpy #

The subsequent sections will break down what is typically included under the generalised "vectorization" umbrella as used in the numpy library.

Knowing what each does, and how it contributes to the speed of numpy "vectorized" operations, should hopefully help with any confusion.

Actual vectorization #

Vectorization is a term used outside of numpy, and in very basic terms is parallelisation of calculations.

If you have a 1D array (or vector as they are also known):

[1, 2, 3, 4]

...and multiply each element in that vector by the scalar value 2, you end up with:

[2, 4, 6, 8]

In normal python this would be done element by element using something like a for loop, so four calculations one after the other. If each calculation takes 1 second, that would be 4 seconds to complete the calculation and issue the result.

However, numpy will actually multiply two vectors together, [2, 2, 2, 2] and [1, 2, 3, 4] (numpy 'stretches' the scalar value 2 into a vector using something called broadcasting, see the next section for more on that). Each of the four separate calculations is done all at once in parallel. So in terms of time, the calculation is completed in 1 second (each calculation takes 1 second, but they are all completed at the same time).

That is a fourfold improvement in speed just through 'vectorization' of the calculation (or, if you like, a form of parallel processing). Please bear in mind that the example I have given is very simplified, but it does help to illustrate what is going on at a basic level.

You can see how this could equate to a very large difference if you are dealing with datasets with thousands, if not millions, of elements.
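As a rough illustration of the idea, the sketch below multiplies every element by 2 first with a plain Python loop and then with a single numpy operation. The function names are made up for this example, and the timings will vary from machine to machine. The gap you measure also comes from everything discussed in the following sections (compiled C code, homogeneous data types), not only from parallel execution.

```python
import timeit

import numpy as np

values = list(range(100_000))   # plain Python list
array = np.arange(100_000)      # numpy array holding the same values


def double_with_loop(data):
    """Multiply every element by 2, one element at a time, in Python."""
    result = []
    for x in data:
        result.append(x * 2)
    return result


def double_with_numpy(data):
    """Multiply every element by 2 in a single numpy operation."""
    return data * 2


# Both approaches produce the same numbers.
assert double_with_loop(values) == double_with_numpy(array).tolist()

# Time 10 runs of each approach (numbers depend on your hardware).
loop_time = timeit.timeit(lambda: double_with_loop(values), number=10)
numpy_time = timeit.timeit(lambda: double_with_numpy(array), number=10)
print(f"python loop: {loop_time:.4f} s for 10 runs")
print(f"numpy:       {numpy_time:.4f} s for 10 runs")
```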

Just be aware that the parallelisation is not unlimited, and depends on the hardware to some degree. Numpy is not able to parallelise 100 million calculations all at once, but it can significantly reduce the number of serial calculations required, especially when dealing with a large amount of data.

If you want a more detailed explanation, then I recommend this stackoverflow post, which does a great job of explaining in more detail. If you want even more detail then this article, and this article are excellent.

Broadcasting #

Broadcasting is a feature of numpy that enables mathematical operations to be carried out between arrays of different sizes. We actually did just that in the previous section.

The scalar value 2 was "stretched" into an array full of 2s. That is broadcasting, and it is one of the ways in which numpy prepares data for much more efficient calculations. Saying "it just creates an array of 2s" is a gross oversimplification, but it is not worth getting into the detail here.

Numpy's own documentation is actually quite clear here:

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.

- numpy.org
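A small sketch of broadcasting in action may help here. It uses np.broadcast_to to make the "stretched" scalar visible. The variable names are only for illustration. Note that the stretched array is a read-only view with a stride of 0, so the value 2 is reused rather than copied four times, which is part of why "an array of 2s" is an oversimplification.

```python
import numpy as np

vector = np.array([1, 2, 3, 4])

# Multiplying by a scalar: numpy broadcasts 2 across the whole vector.
doubled = vector * 2
print(doubled)                       # [2 4 6 8]

# np.broadcast_to makes the "stretched" scalar visible.
stretched = np.broadcast_to(2, vector.shape)
print(stretched)                     # [2 2 2 2]

# A stride of 0 means every position points at the same single value,
# so no array of four separate 2s is ever stored.
print(stretched.strides)             # (0,)
print(stretched.flags.writeable)     # False

# Broadcasting also works between arrays of different shapes:
# the row of shape (3,) is applied to each row of the (2, 3) matrix.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
summed = matrix + row
print(summed.tolist())               # [[11, 22, 33], [14, 25, 36]]
```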

A faster language #


As detailed in the quote from numpy's own documentation in the previous section, numpy uses pre-compiled and optimised C functions to execute calculations.

As C is a lower level language, there is much more scope for optimisation of calculations. This is not something you need to think about, as the numpy library does that for you, but it is something you benefit from.

Homogeneous data types #

In python you have the flexibility to specify lists with a mixture of different datatypes (strings, ints, floats etc.). When dealing with data in numpy, the data is homogeneous (i.e. all the same type). This helps speed up calculations as the data type does not need to be figured out on the fly like in a python list.

This can of course also be seen as a limitation, as it makes working with mixed data types more difficult.
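The difference is easy to see with a short sketch (the variable names are purely illustrative). A python list keeps a separate type for every element, whereas a numpy array has a single dtype and converts its inputs to fit.

```python
import numpy as np

# A python list can hold a mixture of types, tracked per element.
mixed_list = [1, 2.5, "three"]
type_names = [type(item).__name__ for item in mixed_list]
print(type_names)            # ['int', 'float', 'str']

# A numpy array has exactly one dtype for all of its elements.
ints = np.array([1, 2, 3])
print(ints.dtype.kind)       # i (an integer type)

# Mixing ints and floats converts everything to a float type.
floats = np.array([1, 2.5, 3])
print(floats.dtype)          # float64

# Mixing numbers and text converts everything to strings,
# which is why mixed data is awkward to work with in numpy.
strings = np.array([1, 2.5, "three"])
print(strings.dtype.kind)    # U (a unicode string type)
```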

Putting it all together #

As previously mentioned, it is quite common for all of the above (and more) to be grouped together when talking about vectorization in numpy. However, as vectorization is also used in other contexts to describe more specific operations, this can be quite confusing.

Hopefully, it is all a little bit clearer as to what we are dealing with, and now we can move on to the practical side.

How much difference does numpy's implementation of vectorization really make?

A practical example #

To demonstrate the effectiveness of vectorization in numpy we will compare a few different commonly used methods to apply mathematical functions, and also logic, using the pandas library.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

- pydata.org

Pandas is widely used when dealing with tabular data, and is also built on top of numpy, so I think it serves as a great medium for demonstrating the effectiveness of vectorization.

All the calculations that follow are available in a Colab notebook.

The data #

The data will be a simple dataframe with two columns. Each column will contain 1 million random numbers drawn from a standard normal distribution.

import numpy as np
import pandas as pd

df = pd.DataFrame({'series1': np.random.randn(1000000), 'series2': np.random.randn(1000000)})

which results in:

row    | series1   | series2
0      | 2.024360  | -0.304465
1      | -0.294511 | -0.585608
2      | -0.580776 | -0.987834
3      | 1.403553  | 1.553986
4      | -2.004211 | -0.263476
...    | ...       | ...
999995 | -0.448020 | 0.040024
999996 | 0.325896  | 0.574605
999997 | 1.679847  | -1.103830
999998 | 0.568573  | -1.695838
999999 | -0.362537 | 1.556493

1000000 rows × 2 columns

The manipulation #

Then the above dataframe will be manipulated by two different functions to create a third column 'series3'. This is a very common operation in pandas, for example, when creating new features for machine or deep learning:

Function 1 - a simple summation

def sum_nums(a, b):
    return a + b

Function 2 - logic and arithmetic

def categorise(a, b):
    if a < 0:
        return a * 2 + b
    elif b < 0:
        return a + 2 * b
    else:
        return None

Each of the above functions will be applied using different methods (some vectorized, some not) to see which performs the calculations over the 1 million rows the quickest.
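To make the behaviour of the two functions concrete before timing them, here is a small worked example on hand-picked values. The functions are the ones defined above; the input values are chosen only to hit each branch.

```python
def sum_nums(a, b):
    return a + b


def categorise(a, b):
    if a < 0:
        return a * 2 + b
    elif b < 0:
        return a + 2 * b
    else:
        return None


# Function 1: a simple summation.
print(sum_nums(1.5, 2.0))        # 3.5

# Function 2: the result depends on the signs of the inputs.
print(categorise(-1.0, 3.0))     # 1.0   (a < 0, so a * 2 + b)
print(categorise(1.0, -3.0))     # -5.0  (b < 0, so a + 2 * b)
print(categorise(1.0, 3.0))      # None  (neither input is negative)
```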

The methods and the results #

The processing methods that follow are arranged in order of speed. Slowest first.

Each method was timed with the timeit library for both of the functions described in the previous section. The slower methods were run once and the faster methods up to 1000 times, so that the slow calculations do not run for too long while the fast ones get enough iterations to average out the run time per iteration.
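The exact timing code is not reproduced in this article, so the following is only a minimal sketch of how such a measurement could be set up with timeit. The helper name time_method is hypothetical, and a smaller dataframe of 10,000 rows is used so the sketch runs quickly.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1 million rows used in the article, to keep this quick.
df = pd.DataFrame({'series1': np.random.randn(10_000),
                   'series2': np.random.randn(10_000)})


def time_method(func, iterations):
    """Run func `iterations` times; return (total time, time per iteration)."""
    total = timeit.timeit(func, number=iterations)
    return total, total / iterations


# A slow method is run once...
slow_total, slow_per_iteration = time_method(
    lambda: df.apply(lambda row: row['series1'] + row['series2'], axis=1),
    iterations=1,
)

# ...while a fast method is run many times and averaged.
fast_total, fast_per_iteration = time_method(
    lambda: df['series1'] + df['series2'],
    iterations=100,
)

print(f"apply:      {slow_per_iteration:.6f} s per iteration")
print(f"vectorized: {fast_per_iteration:.6f} s per iteration")
print(f"improvement in speed: x{slow_per_iteration / fast_per_iteration:.0f}")
```

Dividing the baseline time per iteration by the time per iteration of another method gives the "improvement in speed" figure reported in the tables.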

The pandas apply method

The pandas apply method is very simple and intuitive. However, it is also one of the slowest ways for applying calculations on large datasets.

There is no optimisation of the calculation. It is basically performing a simple for loop. This method should be avoided unless the requirements of the function rule out all other methods.

# Function 1
series3 = df.apply(lambda row: sum_nums(row['series1'], row['series2']), axis=1)
# Function 2
series3 = df.apply(lambda row: categorise(row['series1'], row['series2']), axis=1)

Function | Iterations | Total Time (s) | Time per iteration (s) | Improvement in speed
1        | 1          | 11.60          | 11.60                  | Baseline
2        | 1          | 11.58          | 11.58                  | Baseline

Itertuples

Itertuples, in some simple implementations, is even slower than the apply method, but in this case it is used with list comprehension, so achieves almost a 20 times improvement in speed over the apply method.

Itertuples removes the overhead of dealing with a pandas Series and instead uses named tuples for the iteration[1]. As previously mentioned, this particular implementation also benefits from the speed up list comprehension provides, by removing the overhead of appending to a list[2].

Note: there is also a function called iterrows, but it is always slower, and therefore ignored for brevity.
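To show what itertuples actually yields, here is a short sketch on a two-row dataframe (small_df is just an illustrative name). With index=False each named tuple contains only the column values, which is what allows it to be unpacked directly into a and b.

```python
import pandas as pd

small_df = pd.DataFrame({'series1': [1.0, 2.0], 'series2': [3.0, 4.0]})

# By default each row is a named tuple that includes the index.
first_default = next(small_df.itertuples())
print(first_default)       # Pandas(Index=0, series1=1.0, series2=3.0)

# With index=False only the column values remain.
first_no_index = next(small_df.itertuples(index=False))
print(first_no_index)      # Pandas(series1=1.0, series2=3.0)

# A two-field tuple unpacks straight into two variables.
a, b = first_no_index
print(a + b)               # 4.0
```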

# Function 1
series3 = [sum_nums(a, b) for a, b in df.itertuples(index=False)]
# Function 2
series3 = [categorise(a, b) for a, b in df.itertuples(index=False)]

Function | Iterations | Total Time (s) | Time per iteration (s) | Improvement in speed
1        | 10         | 6.12           | 0.612                  | x19
2        | 10         | 6.41           | 0.641                  | x18

List comprehension

The previous itertuples example also used list comprehension, but it seems this particular solution using 'zip' instead of itertuples is about twice as fast.

The main reason for this is the additional overhead introduced by the itertuples method. Itertuples actually uses zip internally, so any additional code to get to the point where zip is applied is just unnecessary overhead.

A great investigation into this can be found in this article. Incidentally, it also explains why iterrows is slower than itertuples.

# Function 1
series3 = [sum_nums(a, b) for a, b in zip(df['series1'],df['series2'])]
# Function 2
series3 = [categorise(a, b) for a, b in zip(df['series1'],df['series2'])]

Function | Iterations | Total Time (s) | Time per iteration (s) | Improvement in speed
1        | 100        | 29.31          | 0.293                  | x40
2        | 100        | 31.13          | 0.311                  | x37

Numpy vectorize method

This is a bit of an odd one. The method itself is called 'vectorize', but the truth is it is nowhere near as fast as the full-on optimised vectorization that we will see in the methods that follow. Even numpy's own documentation states:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

- numpy.org

However, it is true that the syntax used to implement this function is extremely simple and clear. On top of that, the method actually does a great job of speeding up the calculation, more so than any method we have tried up till now.

It is also more flexible than the methods that follow, and so is easier to implement in lots of situations without any messing about. Numpy vectorize is therefore a great method to use, and highly recommended.

It is just worth bearing in mind that although this method is quick, it is not even close to what is achievable with the fully optimised methods we are about to see, so it should not just be your go-to method in all situations.
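One way to see the "essentially a for loop" behaviour for yourself is to record every call made to the wrapped function. The sketch below is only an illustration: counted_sum and the calls list are hypothetical helpers, and otypes is passed so that numpy does not need an extra trial call to work out the output type.

```python
import numpy as np

calls = []  # records every time python calls the wrapped function


def counted_sum(a, b):
    """Add two numbers and record that this function was called."""
    calls.append((a, b))
    return a + b


x = np.arange(5, dtype=float)   # [0. 1. 2. 3. 4.]
y = np.ones(5)                  # [1. 1. 1. 1. 1.]

# otypes tells np.vectorize the output type up front.
vectorized_sum = np.vectorize(counted_sum, otypes=[float])
result = vectorized_sum(x, y)

print(result)        # [1. 2. 3. 4. 5.]
print(len(calls))    # 5, one python-level call per element
```

The python function is still called once for every element, so the interpreter overhead per element does not go away.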

# Function 1
series3 = np.vectorize(sum_nums)(df['series1'],df['series2'])
# Function 2
series3 = np.vectorize(categorise)(df['series1'],df['series2'])

Function | Iterations | Total Time (s) | Time per iteration (s) | Improvement in speed
1        | 100        | 22.13          | 0.221                  | x52
2        | 100        | 21.41          | 0.214                  | x54

Pandas vectorization

Now we come to full on optimised vectorization.

The difference in speed is night and day compared to any method before, and a prime example of all the optimisations discussed in earlier sections of this article working together.

The pandas implementation is still an implementation of numpy under the hood, but the syntax is very, very straightforward. If you can express your desired calculation this way, you can't do much better in terms of speed without coming up with a significantly more complicated implementation.

That is approximately 7000 times faster than the apply method, and 130 times faster than the numpy vectorize method!

The downside is that such simple syntax does not allow for more complicated logic statements to be processed.

# Function 1
series3 = df['series1'] + df['series2']
# Function 2
# N/A as a simple operation is not possible due to the included logic in the function.

Function | Iterations | Total Time (s) | Time per iteration (s) | Improvement in speed
1        | 1000       | 1.66           | 0.00166                | x7000
-        | -          | -              | -                      | -

Numpy vectorization

The final implementation is as close as we can get to implementing raw numpy whilst still having the inputs from a pandas dataframe. Even so, by stripping away any pandas overhead in the calculation, a 15% reduction in processing time is achieved when compared to the pandas implementation.

That is 8000 times faster than the apply method.

# Function 1
series3 = np.add(df['series1'].to_numpy(),df['series2'].to_numpy())
# Function 2
# N/A as a simple operation is not possible due to the included logic in the function.

Function | Iterations | Total Time (s) | Time per iteration (s) | Improvement in speed
1        | 1000       | 1.42           | 0.00142                | x8000
-        | -          | -              | -                      | -
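As a quick sanity check, the sketch below confirms that the pandas and numpy versions of Function 1 produce the same numbers and differ only in the type of object they return. It uses a small dataframe with made-up sizes purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'series1': np.random.randn(1000),
                   'series2': np.random.randn(1000)})

# Pandas vectorization returns a pandas Series (index included).
pandas_result = df['series1'] + df['series2']

# Numpy vectorization works on the raw arrays and returns an ndarray.
numpy_result = np.add(df['series1'].to_numpy(), df['series2'].to_numpy())

print(type(pandas_result).__name__)   # Series
print(type(numpy_result).__name__)    # ndarray

# The values themselves are identical.
print(np.array_equal(pandas_result.to_numpy(), numpy_result))   # True

# Either result can be assigned back as the new column.
df['series3'] = numpy_result
```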

Conclusion #

I hope this article has helped to clarify some of the jargon that exists especially in relation to vectorization, and therefore allowed you to have a better understanding of which methods would be most appropriate depending on your particular situation.

As a general rule of thumb if you are dealing with large datasets of numerical data, vectorized methods in pandas and numpy are your friend:

  1. If the calculation allows, try to use numpy's inbuilt mathematical functions[3]
  2. Pandas' mathematical operations are also a good choice
  3. If you require more complicated logic, use numpy's vectorize[4] method
  4. Failing all of the above, it is a case of deciding exactly what functionality you need, and choosing one of the slower methods as appropriate (list comprehension, itertuples, apply)

If you find yourself in a situation where you need both speed and more flexibility, then you are in a particularly niche situation. You may need to start looking into implementing your own parallelisation, or writing your own bespoke numpy functions[5]. All of which is possible.

References #

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html

[2] Mazdak, Why is a list comprehension so much faster than appending to a list? (2015), stackoverflow.com

[3] https://numpy.org/doc/stable/reference/routines.math.html

[4] https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html

[5] https://numpy.org/doc/stable/user/c-info.ufunc-tutorial.html
