Windowing Data into Rows in Pyspark - apache-spark

I'm preparing a dataset to develop a supervised model that predicts a value given the 5 values before it. For example, given the sample data below, I would predict the 6th column from columns 1:5, or the 8th column from columns 3:7.
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
a 150 110 130 80 136 150 190 110 150 110 130 136 100 150 190 110
b 100 100 130 100 136 100 160 230 122 130 15 200 100 100 136 100
c 130 122 140 140 122 130 15 200 100 100 130 100 136 100 160 230
To that end, I want to reorganize the sample data above into rows of 6 columns, taking every slice/window of 6 values possible (e.g. 1:6, 2:7, 3:8). How can I do that? Is it possible in PySpark/SQL? Example of output below, index just for clarification:
1 2 3 4 5 6
a[1:6] 150 110 130 80 136 150
a[2:7] 110 130 80 136 150 190
a[3:8] 130 80 136 150 190 110
...
c[1:6] 130 122 140 140 122 130
c[2:7] 122 140 140 122 130 15
...
c[10:16] 130 100 136 100 160 230
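For reference, the sample data above can be loaded into a toy PySpark DataFrame roughly as follows (a minimal sketch; it assumes an active SparkSession bound to the name spark and uses the string column names '1' through '16'):
schema_cols = ['id'] + [str(i + 1) for i in range(16)]
rows = [
    ('a', 150, 110, 130, 80, 136, 150, 190, 110, 150, 110, 130, 136, 100, 150, 190, 110),
    ('b', 100, 100, 130, 100, 136, 100, 160, 230, 122, 130, 15, 200, 100, 100, 136, 100),
    ('c', 130, 122, 140, 140, 122, 130, 15, 200, 100, 100, 130, 100, 136, 100, 160, 230),
]
df = spark.createDataFrame(rows, schema_cols)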

You can convert your columns into an array of arrays or array of structs and then explode, for example:
from pyspark.sql.functions import struct, explode, array, col
# all columns except the first
cols = df.columns[1:]
# size of the splits
N = 6
Use array of arrays:
df_new = df.withColumn('dta', explode(array(*[array(*cols[i:i+N]) for i in range(len(cols)-N+1)]))) \
    .select('id', *[col('dta')[i].alias(str(i+1)) for i in range(N)])
df_new.show()
+---+---+---+---+---+---+---+
| id| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+
| a|150|110|130| 80|136|150|
| a|110|130| 80|136|150|190|
| a|130| 80|136|150|190|110|
| a| 80|136|150|190|110|150|
| a|136|150|190|110|150|110|
| a|150|190|110|150|110|130|
| a|190|110|150|110|130|136|
| a|110|150|110|130|136|100|
| a|150|110|130|136|100|150|
| a|110|130|136|100|150|190|
| a|130|136|100|150|190|110|
| b|100|100|130|100|136|100|
+---+---+---+---+---+---+---+
Use an array of structs (Spark 2.4+):
df_new = df.withColumn('dta', array(*cols)) \
    .selectExpr("id", f"""
        inline(transform(sequence(0,{len(cols)-N}), i -> ({','.join(f'dta[i+{j}] as `{j+1}`' for j in range(N))})))
    """)
The Spark SQL built inside the f-string above is the same as the following for N=6 (with 16 data columns, hence sequence(0,10)):
inline(transform(sequence(0,10), i -> struct(dta[i] as `1`, dta[i+1] as `2`, dta[i+2] as `3`, dta[i+3] as `4`, dta[i+4] as `5`, dta[i+5] as `6`)))
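Equivalently, the expanded expression can be passed to selectExpr as a literal string; a sketch for N=6 and 16 data columns:
df_new = df.withColumn('dta', array(*cols)).selectExpr(
    "id",
    "inline(transform(sequence(0,10), i -> struct(dta[i] as `1`, dta[i+1] as `2`, dta[i+2] as `3`, dta[i+3] as `4`, dta[i+4] as `5`, dta[i+5] as `6`)))",
)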

Yes, you can use this code (and modify it to get what you need):
partitions = []
num_elements = 6
# note: toLocalIterator() streams every row through the driver, so this does not scale like the array/explode approach
for row in df.rdd.toLocalIterator():
    row_list = list(row)            # row_list[0] is the id, the rest are the values
    for i in range(1, len(row_list) - num_elements + 1):
        partition = [row_list[0]] + row_list[i : i + num_elements]
        partitions.append(partition)
output_df = spark.createDataFrame(partitions, ['id'] + [str(j + 1) for j in range(num_elements)])

Related

How to ignore plotting the feature which have all zeros in subplot using groupby

I am trying to use groupby to plot all the features of a dataframe in different subplots, one per serial number, while ignoring any feature 'Ft' of a serial number whose values are all zeros. For example, we should ignore 'Ft1' for S/N 'KLM10015' because all the data in this feature are zeros. The dataframe is 5514 rows by 565 columns, but the code should be able to handle dataframes of different sizes.
The x-axis of each subplot represents the "Date", the y-axis represents the feature values (Ft), and the title is the serial number (S/N).
This is an example of the dataframe which I have:
df =
S/N Ft1 Ft12 Ft17 ---- Ft1130 Ft1140 Ft1150
DATE
2021-01-06 KLM10015 0 12 14 ---- 17 52 47
2021-01-07 KLM10015 0 10 48 ---- 19 20 21
2021-01-11 KLM10015 0 0 45 ---- 0 19 0
2021-01-17 KLM10015 0 1 0 ---- 16 44 66
| | | | | | | |
| | | | | | | |
| | | | | | | |
2021-02-09 KLM10018 1 11 0 ---- 25 27 19
2021-12-13 KLM10018 12 0 19 ---- 78 77 18
2021-12-16 kLM10018 14 17 14 ---- 63 19 0
2021-07-09 KLM10018 18 0 77 ---- 65 34 98
2021-07-15 KLM10018 0 88 82 ---- 63 31 22
Code:
import matplotlib.pyplot as plt

list_ID = ["Ft1", "Ft12", "Ft17", ......, "Ft1130", "Ft1140", "Ft1150"]

def plot_fun(dataframe):
    for item in list_ID:
        fig = plt.figure(figsize=(35, 20))
        ax1 = fig.add_subplot(221)
        ax2 = fig.add_subplot(222)
        ax3 = fig.add_subplot(223)
        ax4 = fig.add_subplot(224)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax1)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax2)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax3)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax4)
        plt.show()

plot_fun(df)
I really need your help. Thanks a lot.
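The part about dropping the per-serial-number features whose values are all zeros can be sketched roughly as follows (an illustration only, assuming the layout shown above with 'S/N' as a column and the dates as the index; plot_nonzero_features is a hypothetical helper name):
import matplotlib.pyplot as plt

def plot_nonzero_features(dataframe):
    # one figure per serial number, one line per feature that is not entirely zero
    for sn, group in dataframe.groupby('S/N'):
        features = group.drop(columns='S/N')
        features = features.loc[:, (features != 0).any(axis=0)]  # drop all-zero columns
        ax = features.plot(figsize=(35, 20), title=sn)
        ax.set_xlabel('Date')
        plt.show()

plot_nonzero_features(df)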

Count the number of labels on IOB corpus with Pandas

From my IOB corpus such as:
mention Tag
170
171 467 O
172
173 Vincennes B-LOCATION
174 . O
175
176 Confirmation O
177 des O
178 privilèges O
179 de O
180 la O
181 ville B-ORGANISATION
182 de I-ORGANISATION
183 Tournai I-ORGANISATION
184 1 O
185 ( O
186 cf O
187 . O
188 infra O
189 , O
I am trying to compute simple statistics, such as the total number of annotated mentions, totals by label, etc.
After loading my dataset with pandas I got this:
df = pd.Series(data['Tag'].value_counts(), name="Total").to_frame().reset_index()
df.columns = ['Label', 'Total']
df
Output :
Label Total
0 O 438528
1 36235
2 B-LOCATION 378
3 I-LOCATION 259
4 I-PERSON 234
5 I-INSTALLATION 156
6 I-ORGANISATION 150
7 B-PERSON 144
8 B-TITLE 94
9 I-TITLE 89
10 B-ORGANISATION 68
11 B-INSTALLATION 62
12 I-EVENT 8
13 B-EVENT 2
First of all, how could I get a representation similar to the one above, but with the B- and I- prefixes merged into a single label per entity type, for example:
Label, Total
PERSON, 300
LOCATION, 154
ORGANISATION, 67
etc.
And secondly, how can I exclude the "O" and empty-string labels from my output? I tested .mask() and .where() on my Series but it fails.
Thank you for your leads.
Remove the B-/I- prefixes, then group by label and sum:
df['Label'] = df['Label'].str[2:]
df.groupby('Label').sum()
For the second part, just keep the rows in which the Label string is longer than 2 characters, which drops both "O" and the empty labels:
df.loc[df['Label'].str.len() > 2]
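Putting both parts together, a minimal combined sketch (assuming the counts frame built in the question, with columns 'Label' and 'Total'):
# filter first, so that "O" does not turn into an empty string after stripping
counts = df[df['Label'].str.len() > 2].copy()
counts['Label'] = counts['Label'].str.replace(r'^[BI]-', '', regex=True)
print(counts.groupby('Label', as_index=False)['Total'].sum())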

Get the names of Top 'n' columns based on a threshold for values across a row

Let's say that I have the following data:
In [1]: df
Out[1]:
Student_Name Maths Physics Chemistry Biology English
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
I want to add a column to this dataframe which tells me the students' top 'n' subjects that are above a threshold, where the subject names are available in the column names. Let's assume n=3 and threshold=80.
The output would look like the following:
In [3]: df
Out[3]:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 John Doe 90 87 81 65 70 Maths, Physics, Chemistry
1 Jane Doe 82 84 75 73 77 Physics, Maths
2 Mary Lim 40 65 55 60 70 nan
3 Lisa Ray 55 52 77 62 90 English
I tried to use the solution written by @jezrael for this question, where they use numpy.argsort to get the positions of the sorted values for the top 'n' columns, but I am unable to set a threshold value below which nothing should be considered.
The idea is to first replace the non-matching values with missing values using DataFrame.where, then apply the numpy.argsort solution. The number of True values per row (the count of non-missing values) is used with numpy.where to replace the non-matching positions with empty strings.
Last, the values are joined in a list comprehension, and rows with no match at all are kept as missing values:
import numpy as np

df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
# column names ordered by descending value per row; non-matches (NaN) sort last
arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
m = np.arange(arr.shape[1]) < count[:, None]
a = np.where(m, arr, '')
L = [', '.join(x).strip(', ') for x in a]
df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
print(df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
If performance is not important, use Series.nlargest per row, but it is really slow for a large DataFrame:
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
df['Top_3_above_80'] = (df1.where(m)
                           .apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)
def f1(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
    m = np.arange(arr.shape[1]) < count[:, None]
    a = np.where(m, arr, '')
    L = [', '.join(x).strip(', ') for x in a]
    df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
    return df

def f2(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    df['Top_3_above_80'] = (df1.where(m).apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
    return df
In [210]: %timeit (f1(df.copy()))
19.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [211]: %timeit (f2(df.copy()))
2.43 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
An alternative:
res = []
tmp = df.set_index('Student_Name').T
for col in list(tmp):
    top = tmp[col].nlargest(3)
    res.append(top[top > 80].index.tolist())
res = [x if len(x) > 0 else np.NaN for x in res]
df['Top_3_above_80'] = res
Output:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 JohnDoe 90 87 81 65 70 [Maths, Physics, Chemistry]
1 JaneDoe 82 84 75 73 77 [Physics, Maths]
2 MaryLim 40 65 55 60 70 NaN
3 LisaRay 55 52 77 62 90 [English]
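For completeness, a compact per-row variant along the same lines (a sketch, not benchmarked; it will be roughly as slow as the nlargest approach above and assumes the five subject columns listed explicitly):
import numpy as np

subjects = ['Maths', 'Physics', 'Chemistry', 'Biology', 'English']
df['Top_3_above_80'] = df[subjects].apply(
    lambda row: ', '.join(row[row > 80].nlargest(3).index) or np.nan,
    axis=1,
)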

Efficient way to perform iterative subtraction and division operations on pandas columns

I have the following dataframe:
A B C Result
0 232 120 9 91
1 243 546 1 12
2 12 120 5 53
I want to perform operations of the following kind:
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
which I am doing using
df['A-B/A+B']=(df['A']-df['B'])/(df['A']+df['B'])
df['A-C/A+C']=(df['A']-df['C'])/(df['A']+df['C'])
df['B-C/B+C']=(df['B']-df['C'])/(df['B']+df['C'])
which I believe is a very crude and ugly way to do it.
How can I do this in a more correct way?
You can do the following:
# take all columns except the last one (Result)
colnames = df.columns.tolist()[:-1]
# compute (x - y) / (x + y) for every pair of columns
for i, c in enumerate(colnames):
    for k in range(i + 1, len(colnames)):
        df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])
# check result
print(df)
A B C Result A_B A_C B_C
0 232 120 9 91 0.318182 0.925311 0.860465
1 243 546 1 12 -0.384030 0.991803 0.996344
2 12 120 5 53 -0.818182 0.411765 0.920000
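The same pairwise loop can be written more compactly with itertools.combinations, and with new column names matching the question's A-B/A+B style (a sketch; it assumes the original frame and pairs the value columns explicitly, with behavior otherwise identical to the loop above):
from itertools import combinations

for a, b in combinations(['A', 'B', 'C'], 2):
    df[f'{a}-{b}/{a}+{b}'] = (df[a] - df[b]) / (df[a] + df[b])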
This is a perfect case to use DataFrame.eval:
cols = ['A-B/A+B','A-C/A+C','B-C/B+C']
x = pd.DataFrame({col: df.eval(col) for col in cols})
df.assign(**x)
A B C Result A-B/A+B A-C/A+C B-C/B+C
0 232 120 9 91 351.482759 240.961207 128.925000
1 243 546 1 12 786.753086 243.995885 546.998168
2 12 120 5 53 122.000000 16.583333 124.958333
Note that eval follows normal operator precedence, so 'A-B/A+B' is evaluated as A - (B/A) + B rather than (A-B)/(A+B), which is why the values differ from the desired output shown in the question. The advantage of this method with respect to the other solution is that it does not depend on parsing the operation signs that appear in the column names; rather, as mentioned in the documentation, eval is used to:
Evaluate a string describing operations on DataFrame columns.
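If the intended result really is (A-B)/(A+B), the same eval pattern can be used with explicit parentheses; a sketch assuming the three column pairs from the question:
pairs = [('A', 'B'), ('A', 'C'), ('B', 'C')]
df = df.assign(**{f'{a}-{b}/{a}+{b}': df.eval(f'({a}-{b})/({a}+{b})') for a, b in pairs})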

Get unique values of a column in between a timeperiod in pandas after groupby

I have a requirement where I need to find all the unique merchant_store_id values for a user on the same stamp card within a specific time period. I grouped by stamp card id and user id to get a dataframe per group. Now I need to find the unique merchant_store_ids of this dataframe within an interval of 10 mins from each entry.
My approach is to loop over that grouped dataframe, find all the indexes in each group, create a new dataframe from the time of each index to that time + 60 mins, and then find the unique merchant_store_ids in it. If the number of unique merchant_store_ids is > 1, I append that slice to a final dataframe. The problem with this approach is that it works fine for small data, but for data of about 20,000 rows it raises a memory error on Linux and keeps running forever on Windows. Below is my code:
fi_df = pd.DataFrame()
for i in df.groupby(["stamp_card_id", "merchant_id", "user_id"]):
    user_df = i[1]
    if len(user_df) > 1:
        # get list of unique indexes in that groupby df
        index = user_df.index.values
        for ind in index:
            fdf = user_df[ind : ind + np.timedelta64(1, 'h')]
            if len(fdf.merchant_store_id.unique()) > 1:
                fi_df = fi_df.append(fdf)
fi_df.drop_duplicates(keep="first").to_csv(csv_export_path)
Sample Data after group by is:
((117, 209, 'oZOfOgAgnO'), stamp_card_id stamp_time stamps_record_id user_id \
0 117 2018-10-14 16:48:03 1756 oZOfOgAgnO
1 117 2018-10-14 16:54:03 1759 oZOfOgAgnO
2 117 2018-10-14 16:58:03 1760 oZOfOgAgnO
3 117 2018-10-14 17:48:03 1763 oZOfOgAgnO
4 117 2018-10-14 18:48:03 1765 oZOfOgAgnO
5 117 2018-10-14 19:48:03 1767 oZOfOgAgnO
6 117 2018-10-14 20:48:03 1769 oZOfOgAgnO
7 117 2018-10-14 21:48:03 1771 oZOfOgAgnO
8 117 2018-10-15 22:48:03 1773 oZOfOgAgnO
9 117 2018-10-15 23:08:03 1774 oZOfOgAgnO
10 117 2018-10-15 23:34:03 1777 oZOfOgAgnO
merchant_id merchant_store_id
0 209 662
1 209 662
2 209 662
3 209 662
4 209 662
5 209 662
6 209 663
7 209 664
8 209 662
9 209 664
10 209 663 )
I have also tried the resampling method, but then I get the data cut into fixed time bins, so the case of a user hitting multiple merchant_store_ids is missed around the end of each hour.
Any help would be appreciated. Thanks.
If those are datetimes, you can filter with the following:
filtered_set = set(df[df["stamp_time"] >= x][df["stamp_time"] <= y]["merchant_store_id"])
df[df["stamp_time"] >= x] filters the df,
adding [df["stamp_time"] <= y] filters the filtered df,
["merchant_store_id"] captures just the specified column (a Series),
and finally set() returns the unique values (a set).
Specific to your code:
x = datetime(lowerbound) #pseudo-code
y = datetime(upperbound) #pseudo-code
filtered_set = set(fi_df[fi_df["stamp_time"] >= x][fi_df["stamp_time"] <= y]["merchant_store_id"])
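An equivalent form uses a single combined boolean mask, which avoids the chained indexing above (a sketch; newer pandas versions may warn about the chained form):
mask = (fi_df["stamp_time"] >= x) & (fi_df["stamp_time"] <= y)
filtered_set = set(fi_df.loc[mask, "merchant_store_id"])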
