Spark 3 with Pandas Vectorised UDFs - apache-spark

I'm looking at using Pandas UDFs in PySpark (v3). For a number of reasons, I understand that iterating and UDFs in general are bad, and I understand that the simple examples I show here can be done in PySpark using SQL functions - all of that is beside the point!
I've been following this guide: https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
I have a simple example working from the docs:
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame(([1, 2, 3], [4, 5, 6], [8, 9, 0]), columns=["x", "y", "z"])
df = spark.createDataFrame(pdf)
@pandas_udf('long')
def test1(x: pd.Series, y: pd.Series) -> pd.Series:
    return x + y

df.select(test1(col("x"), col("y"))).show()
And this works well for performing basic arithmetic - if I want to add, multiply, etc., this is straightforward (but it is also straightforward in PySpark without UDFs).
I want to do a comparison between the values, for example:
@pandas_udf('long')
def test2(x: pd.Series, y: pd.Series) -> pd.Series:
    return x if x > y else y

df.select(test2(col("x"), col("y"))).show()
This errors with ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I understand that it is evaluating the whole Series rather than the row value.
So there is an iterator example. Again this works fine for the basic arithmetic example they provide. But if I try to apply logic:
#pandas_udf("long")
def test3(batch_iter: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
for x, y in batch_iter:
yield x if x > y else y
df.select(test3(col("x"), col("y"))).show()
I get the same ValueError as before.
So my question is: how should I perform row-by-row comparisons like this? Is it possible in a vectorised function? And if not, what are the use cases for them?

I figured this out. So simple after you write it down and publish the problem to the world.
All that needs to happen is to build the result as a list and then convert it to a Pandas Series:
@pandas_udf('long')
def test4(x: pd.Series, y: pd.Series) -> pd.Series:
    return pd.Series([a if a > b else b for a, b in zip(x, y)])

df.select(test4(col("x"), col("y"))).show()
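For completeness, the same comparison can also be done without a Python-level loop. This is just a sketch (the UDF name test5 is mine), assuming numpy is available alongside pandas:
import numpy as np

@pandas_udf('long')
def test5(x: pd.Series, y: pd.Series) -> pd.Series:
    # element-wise maximum of the two series, no explicit row iteration
    return pd.Series(np.where(x > y, x, y))

df.select(test5(col("x"), col("y"))).show()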

I've spent the last two days looking for this answer, thank you simon_dmorias!
I needed a slightly modified example here. I'm breaking out the single pandas_udf into multiple components for easier management. Here is an example of what I'm using for others to reference:
import pyspark.sql.functions as sf  # needed for sf.col below

xdf = pd.DataFrame(([1, 2, 3, 'Fixed'], [4, 5, 6, 'Variable'], [8, 9, 0, 'Adjustable']), columns=["x", "y", "z", "Description"])
df = spark.createDataFrame(xdf)

def fnRate(x):
    return pd.Series(['Fixed' if 'Fixed' in str(v) else 'Variable' if 'Variable' in str(v) else 'Other' for v in x])

@pandas_udf('string')
def fnRateRecommended(Description: pd.Series) -> pd.Series:
    varProduct = fnRate(Description)
    return varProduct

# call the UDF
df.withColumn("Recommendation", fnRateRecommended(sf.col("Description"))).show()

Related

How to get the variable t-stat of multiple univariate OLS in statsmodels without looping?

The code below works for getting the t-stats and r-squared of the OLS regressions I run. However, I have millions of regressions to run and looping is taking ages. I'd appreciate any alternative solutions. Thanks.
import statsmodels.api as sm
import pandas as pd
import numpy as np

x = np.cumsum(np.ones(5))
df_y = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                     "B": [20, 30, 10, 40, 50],
                     "C": [32, 234, 23, 23, 42523]})
df_result = pd.DataFrame({"t-stats": np.ones(3) * np.nan,
                          "r2": np.ones(3) * np.nan})
for i in range(0, df_y.shape[1]):
    y = df_y.iloc[:, i]
    results = sm.OLS(y, x).fit()
    df_result.loc[i, "t-stats"] = (results.params / results.bse).values
    df_result.loc[i, "r2"] = results.rsquared
Sorry, adding on to my question: what if there are missing values in y? How can this be handled? Thanks
Unfortunately, you cannot do this with statsmodels since y must be a single variable. If you really need performance here, you can easily write your own version using numpy. Assuming x never changes:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Assume y is n by m where m is 1,000,000, use 1,000 here for speed
y = pd.DataFrame(np.random.standard_normal((20,1000)))
x = pd.DataFrame(sm.add_constant(np.random.standard_normal((20,3))))
_x = np.asarray(x)
_y = np.asarray(y)
b = np.linalg.lstsq(_x, _y, rcond=None)[0]
e = _y - _x @ b
err_var = (e**2).mean(0)
# the correct formula depends on whether x has a constant; here I assume it does
r2 = 1.0 - err_var / ((_y - _y.mean(0))**2).mean(0)
xpxi = np.linalg.inv(_x.T @ _x)
se = np.sqrt(np.diag(xpxi)[:,None]*err_var)
tstats = b / se
This approach is less robust than using statsmodels but will be an order of magnitude faster precisely because it avoids lots of checks, edge case protections, and calculating statistics you do not want.
On my machine this code runs in 1.9ms, so with 1,000,000 series it would take around 2 seconds.
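As a quick sanity check (a sketch reusing the arrays above), the coefficients and R-squared can be compared against statsmodels for a single column. Note that err_var above divides by n rather than n - k, so the t-stats are proportional to statsmodels' tvalues rather than identical:
# compare the vectorised results for the first column against statsmodels
res0 = sm.OLS(_y[:, 0], _x).fit()
print(np.allclose(b[:, 0], res0.params))   # True
print(np.isclose(r2[0], res0.rsquared))    # True

# the standard errors use the MLE residual variance (divide by n, not n - k),
# so rescaling recovers statsmodels' t-stats exactly
n, k = _x.shape
print(np.allclose(tstats[:, 0] * np.sqrt((n - k) / n), res0.tvalues))  # True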

Matrix Sum logic

I am working on a 2D matrix and finding the sum of elements; below is my logic:
def calculateSum(a, x, y):
    s = 0
    for i in range(0, x + 1):
        for j in range(0, y + 1):
            s = s + a[i][j]
    print(s)
    return s

def check(a):
    arr = []
    x = 0
    y = 0
    for i in range(len(a)):
        row = []
        y = 0
        for j in range(len(a[i])):
            row.append(calculateSum(a, x, y))
            y = y + 1
        x = x + 1
        print(row)

check([[1, 2], [3, 4]])
calculateSum is the function that calculates the sum of elements.
Now my question is: if the matrix size is huge, is there a way to improve the performance of the above program?
Update:
import numpy as np
def calculateSum(a, x, y):
    return np.sum(a[x:, y:])
After switching to numpy I get the error TypeError: list indices must be integers or slices, not tuple.
As the matrix dimensions increase, efficiency will fall. The efficient way to deal with this is to parallelise the task of summing the values, which is possible because addition is associative.
Luckily for you this parallelization is already implemented in a library known as numpy.
To get started with numpy, use pip install numpy. For an overview of the library visit: https://www.geeksforgeeks.org/numpy-in-python-set-1-introduction/
For your question you will need the function numpy.sum().
Edit:
Also, as @Mad Physicist pointed out, numpy has a packed memory layout and its routines are implemented in C, which boosts its speed even further.
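To make that concrete (a sketch of the idea above): the TypeError comes from slicing a plain Python list with a tuple, so the input has to be converted to a numpy array first. And for the "sum every top-left submatrix" pattern in check, a cumulative sum along both axes produces all the results at once:
import numpy as np

a = np.array([[1, 2], [3, 4]])

# prefix[x, y] == a[:x + 1, :y + 1].sum(), i.e. the same values
# calculateSum produces, but computed in one vectorised pass
prefix = a.cumsum(axis=0).cumsum(axis=1)
print(prefix)
# [[ 1  3]
#  [ 4 10]]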

computing the mean for python datetime

I have a datetime attribute:
import datetime
import numpy as np
import pandas as pd

d = {
    'DOB': pd.Series([
        datetime.datetime(2014, 7, 9),
        datetime.datetime(2014, 7, 15),
        np.datetime64('NaT')
    ], index=['a', 'b', 'c'])
}
df_test = pd.DataFrame(d)
I would like to compute the mean for that attribute. Running mean() causes an error:
TypeError: reduction operation 'mean' not allowed for this dtype
I also tried the solution proposed elsewhere. It doesn't work, as running the function proposed there causes
OverflowError: Python int too large to convert to C long
What would you propose? The result for the above dataframe should be equivalent to
datetime.datetime(2014, 7, 12).
You can take the mean of Timedeltas. So find the minimum value and subtract it from the series to get a series of Timedeltas. Then take the mean and add it back to the minimum.
dob = df_test.DOB
m = dob.min()
(m + (dob - m).mean()).to_pydatetime()
datetime.datetime(2014, 7, 12, 0, 0)
One-liner:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(d.min())).to_pydatetime()
To @ALollz's point, I use the epoch pd.Timestamp(0) instead of the minimum:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(pd.Timestamp(0))).to_pydatetime()
You can convert to epoch time using astype with np.int64 and then convert back to datetime with pd.to_datetime:
pd.to_datetime(df_test.DOB.dropna().astype(np.int64).mean())
Output:
Timestamp('2014-07-12 00:00:00')
You could work with unix time if you want. This is defined as the total number of seconds (for instance) since 1970-01-01. With that, all of your times are simply floats, so it's very easy to do simple math on the columns.
import pandas as pd
df_test['unix_time'] = (df_test.DOB - pd.to_datetime('1970-01-01')).dt.total_seconds()
df_test['unix_time'].mean()
#1405123200.0
# You want it in date, so just convert back
pd.to_datetime(df_test['unix_time'].mean(), origin='unix', unit='s')
#Timestamp('2014-07-12 00:00:00')
Datetime math supports some standard operations:
a = datetime.datetime(2014, 7, 9)
b = datetime.datetime(2014, 7, 15)
c = (b - a)/2
# here c will be datetime.timedelta(3)
a + c
Out[7]: datetime.datetime(2014, 7, 12, 0, 0)
So you can write a function that, given two datetimes, subtracts the lesser from the greater and adds half of the difference to the lesser. Apply this function to your dataframe, and shazam!
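A minimal sketch of that idea for two datetimes (the helper name midpoint is just for illustration):
import datetime

def midpoint(a, b):
    # add half of the (possibly negative) difference to the first value
    return a + (b - a) / 2

print(midpoint(datetime.datetime(2014, 7, 9), datetime.datetime(2014, 7, 15)))
# 2014-07-12 00:00:00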
As of pandas 0.25, it is possible to compute the mean of a datetime series.
In [1]: import pandas as pd
...: import numpy as np
In [2]: s = pd.Series([
...: pd.datetime(2014, 7, 9),
...: pd.datetime(2014, 7, 15),
...: np.datetime64('NaT')])
In [3]: s.mean()
Out[3]: Timestamp('2014-07-12 00:00:00')
However, note that applying mean to a pandas dataframe currently ignores columns with a datetime series.
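If you do need the datetime columns included when aggregating a whole dataframe, one workaround (just a sketch) is to take the mean of those columns explicitly:
# compute the mean of each datetime column explicitly,
# since DataFrame.mean() skips them here
datetime_cols = df_test.select_dtypes(include='datetime').columns
means = {c: df_test[c].mean() for c in datetime_cols}
# {'DOB': Timestamp('2014-07-12 00:00:00')}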

RDD foreach method provides no results

I am trying to understand how the foreach method works. In my Jupyter notebook, I tried:
def f(x): print(x)
a = sc.parallelize([1, 2, 3, 4, 5])
b = a.foreach(f)
print(type(b))
<class 'NoneType'>
I can execute that without any problem, but I don't have any output except for the print(type(b)) part. foreach doesn't return anything, just a None type. I don't know what foreach is supposed to do or how to use it. Can you explain what it is used for?
foreach is an action and does not return anything, so you cannot use it as you do, i.e. assigning its result to another variable like b = a.foreach(f) (see e.g. Learning Spark, pp. 41-42).
Adapting the simple example from the docs, run in a PySpark terminal:
>>> def f(x): print(x)
>>> a = sc.parallelize([1, 2, 3, 4, 5])
>>> a.foreach(f)
5
4
3
1
2
(NOTE: not sure about Jupyter, but the above code will not produce any print results in a Databricks notebook.)
You may also find the answers in this thread helpful.
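If the aim is simply to bring the elements back to the driver and look at them, an action that returns data is usually what you want instead of foreach; a minimal sketch:
a = sc.parallelize([1, 2, 3, 4, 5])
print(a.collect())  # [1, 2, 3, 4, 5] - all elements returned to the driver
print(a.take(3))    # [1, 2, 3] - just the first few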
I just use the following method and it works perfectly under Jupyter Notebook with PySpark:
for row in RDD.toLocalIterator():
    print(row)
Actually it will convert your RDD into a generator object, and then by using this generator object you can easily iterate over each element. Or you can first create the generator object and then use it in your loop, like below:
genobj = data.toLocalIterator()
for row in genobj:
    print(row)

Insert field into structured array at a specific column index

I'm currently using np.loadtxt to load some mixed data into a structured numpy array. I do some calculations on a few of the columns to output later. For compatibility reasons I need to maintain a specific output format, so I'd like to insert those columns at specific points and use np.savetxt to export the array in one shot.
A simple setup:
import numpy as np
x = np.zeros((2,),dtype=('i4,f4,a10'))
x[:] = [(1,2.,'Hello'),(2,3.,'World')]
newcol = ['abc','def']
For this example I'd like to make newcol the 2nd column. I'm very new to Python (coming from MATLAB). From my searching all I've been able to find so far are ways to append newcol to the end of x to make it the last column, or x to newcol to make it the first column. I also turned up np.insert but it doesn't seem to work on a structured array because it's technically a 1D array (from my understanding).
What's the most efficient way to accomplish this?
EDIT1:
I investigated np.savetxt a little further and it seems like it can't be used with a structured array, so I'm assuming I would need to loop through and write each row with f.write. I could specify each column explicitly (by field name) with that approach and not have to worry about the order in my structured array, but that doesn't seem like a very generic solution.
For the above example my desired output would be:
1, abc, 2.0, Hello
2, def, 3.0, World
This is a way to add a field to the array, at the position you require:
from numpy import zeros, empty

def insert_dtype(x, position, new_dtype, new_column):
    if x.dtype.fields is None:
        raise ValueError("`x' must be a structured numpy array")
    new_desc = x.dtype.descr
    new_desc.insert(position, new_dtype)
    y = empty(x.shape, dtype=new_desc)
    for name in x.dtype.names:
        y[name] = x[name]
    y[new_dtype[0]] = new_column
    return y
x = zeros((2,), dtype='i4,f4,a10')
x[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
new_dt = ('my_alphabet', '|S3')
new_col = ['abc', 'def']
x = insert_dtype(x, 1, new_dt, new_col)
Now x looks like
array([(1, 'abc', 2.0, 'Hello'), (2, 'def', 3.0, 'World')],
      dtype=[('f0', '<i4'), ('my_alphabet', 'S3'), ('f1', '<f4'), ('f2', 'S10')])
The solution is adapted from here.
To print the recarray to file, you could use something like:
from matplotlib.mlab import rec2csv
rec2csv(x,'foo.txt')
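If rec2csv is not available (it has been removed from recent matplotlib releases), here is a sketch of the plain row-by-row write mentioned in the edit above, which reproduces the desired output without any extra dependency:
# write the structured array row by row, decoding bytes fields so they print cleanly
with open('foo.txt', 'w') as fh:
    for row in x:
        vals = [row[name] for name in x.dtype.names]
        vals = [v.decode() if isinstance(v, bytes) else v for v in vals]
        fh.write(', '.join(str(v) for v in vals) + '\n')
# foo.txt now contains:
# 1, abc, 2.0, Hello
# 2, def, 3.0, World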
