How to write Pyspark UDAF on multiple columns? - apache-spark

I have the following data in a pyspark dataframe called end_stats_df:
values start end cat1 cat2
10 1 2 A B
11 1 2 C B
12 1 2 D B
510 1 2 D C
550 1 2 C B
500 1 2 A B
80 1 3 A B
And I want to aggregate it in the following way:
I want to use the "start" and "end" columns as the aggregate keys
For each group of rows, I need to do the following:
Compute the number of unique values across cat1 and cat2 for that group. E.g., for the group with start=1 and end=2, this number would be 4 because there's A, B, C, D. This number will be stored as n (n=4 in this example).
For the values field, for each group I need to sort the values and then select every (n-1)th value, where n is the number computed in the first operation above.
At the end of the aggregation, I don't really care what is in cat1 and cat2 after the operations above.
An example output from the example above is:
values start end cat1 cat2
12 1 2 D B
550 1 2 C B
80 1 3 A B
How do I accomplish this using pyspark dataframes? I assume I need to use a custom UDAF, right?

PySpark does not support UDAFs directly, so we have to do the aggregation manually.
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

def func(values, cat1, cat2):
    # n = number of distinct categories across cat1 and cat2 for the group
    n = len(set(cat1 + cat2))
    # pick the (n-1)th sorted value (0-indexed: n-2)
    return sorted(values)[n - 2]

df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', sep='\t', header=True)
df = df.groupBy(df['start'], df['end']).agg(f.collect_list(df['values']).alias('values'),
                                            f.collect_set(df['cat1']).alias('cat1'),
                                            f.collect_set(df['cat2']).alias('cat2'))
df = df.select(df['start'], df['end'],
               f.UserDefinedFunction(func, StringType())(df['values'], df['cat1'], df['cat2']))
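The UDF above returns a single value per group. A minimal sketch of how the same pattern could return every (n-1)th sorted value and explode it into separate rows, as in the example output from the question (assuming the values column holds strings, as when the CSV is read without a schema, and that n is always at least 2):

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

def pick_every(values, cat1, cat2):
    # n = number of distinct categories across cat1 and cat2 for the group
    n = len(set(cat1 + cat2))
    s = sorted(values, key=float)               # numeric sort even if the values arrived as strings
    return [str(v) for v in s[n - 2::n - 1]]    # every (n-1)th value: indices n-2, 2n-3, ...

pick_udf = f.udf(pick_every, ArrayType(StringType()))

result = (end_stats_df
          .groupBy('start', 'end')
          .agg(f.collect_list('values').alias('values'),
               f.collect_set('cat1').alias('cat1'),
               f.collect_set('cat2').alias('cat2'))
          .select('start', 'end',
                  f.explode(pick_udf('values', 'cat1', 'cat2')).alias('values')))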

Related

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of indices of the rows in dataframe B whose position column falls within an interval (specified by the start_coordinate and end_coordinate columns) of dataframe A.
The result for this task will be:
lst = [0,1]. ### because row 0 of B falls in interval of row 1 in A and row 1 of B falls in interval of row 3 of A.
Using the indices I get from task 1, I want to keep those rows from dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here:
# interval bounds as 1-D arrays, positions as a column vector so the comparison broadcasts
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
# keep the rows of B whose position falls inside at least one interval of A
dfB[((p >= s) & (p <= e)).any(1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
You could use a pandas IntervalIndex to get the positions, and afterwards use a boolean mask to pull the relevant rows from B:
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples([*zip(A['start_coordinate'],
                                               A['end_coordinate'])],
                                         closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
This should work. Less elegant but easier to comprehend.
import pandas as pd

data = [['Name.', 'gender', 'start_coordinate', 'end_coordinate', 'ID'],
        ['Peter', 'M', 30, 150, 1],
        ['Hugo', 'M', 4500, 6000, 2],
        ['Jennie', 'F', 300, 700, 3]]
data2 = [['ID_sim.', 'position', 'string'],
         ['1', 89, 'aa'],
         ['4', 568, 'bb'],
         ['5', 938437, 'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
# cross join (pandas >= 1.2) so every position is checked against every interval
merged = pd.merge(df1, df2, how='cross')
print(merged[(merged['position'] > merged['start_coordinate']) & (merged['position'] < merged['end_coordinate'])])

Fill Null values in Data-Frame with Column names

I have a data-frame with 55 columns and 2 million rows, with a mix of categorical and numeric fields. There are null/na values in the data-set. I want to fill the null values with the column names.
The data-set I have is:
A B C D .....
1 na na 3 .....
na 3 4 na .....
........................
The output the I am trying to get is:
A B C D .....
1 B C 3 .....
A 3 4 D .....
........................
I am trying to use:
df.fillna(method='ffill')
Is there another way?
Python: 3.6.5
Use DataFrame.fillna with the column names converted to a Series by Index.to_series:
df = df.fillna(df.columns.to_series())
print (df)
A B C D
0 1 B C 3
1 A 3 4 D
EDIT: If there are categorical columns in the DataFrame, select those columns and add the missing values as categories with cat.add_categories:
for c in df.select_dtypes('category'):
    df[c] = df[c].cat.add_categories(c)

df = df.fillna(df.columns.to_series())
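A minimal, self-contained sketch of the categorical case, using the A/B/C/D example from the question and pretending column B is categorical (the exact dtypes are an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 3],
                   'C': [np.nan, 4], 'D': [3, np.nan]})
df['B'] = df['B'].astype('category')

# a categorical column can only be filled with a registered category,
# so the column name is added as a category first
for c in df.select_dtypes('category'):
    df[c] = df[c].cat.add_categories(c)

df = df.fillna(df.columns.to_series())
print(df)   # NaNs in each column are replaced by that column's name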

Upsert function in Dataframe - Python

I am trying to update one dataframe with another dataframe with respect to the first column. If there is an extra row in the second dataframe, it should be inserted into the first dataframe. If there is a row with the same data in the first column but different data in the other columns, that row should be updated. Also, any row which has no value in the first column should be dropped.
Code used -
df = df_1.combine_first(df_2)\
         .reset_index()\
         .reindex(columns=df_1.columns)
df = df.drop_duplicates(subset='A', keep='last', inplace=False)
df.dropna(subset=['A'])
print ("Final Data")
print (df)
First Dataframe -
A B C
0 45 a b
1 98 c d
2 67 bn k
Second Dataframe -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4
Final should look like -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
The final dataframe that I get -
A B C
0 45.0 a b
1 98.0 c d
2 67.0 bn k
3 90.0 x z
4
So neither is the data getting updated, nor is the row with null values being removed. What am I missing?
Based on my understanding of your question, your second dataframe basically supersedes the first if there is a matching index. If there isn't, the difference is added to the first dataframe. I am also assuming that there are no duplicate keys in the first column, A.
Framing this requirement a little differently: the final output should contain all the rows and values from the second dataframe (since they are meant to overwrite the first dataframe where there is a match). Therefore, we start off with the second dataframe as it is, and then add back the rows that exist in the first dataframe but not in the second. See the example below. (I'm also using a slightly different first dataframe to highlight the effect.)
import pandas as pd
df1 = pd.DataFrame({'A':[45,98,67,91],'B':['a','c','bn','y'],'C':['b','d','k','oo']})
df2 = pd.DataFrame({'A':[45,98,67,90,''],'B':['a','c','bn','x',''],'C':['d','d','k','z','']})
# Remove rows with empty values in first column. This should be whatever conditions applicable to you i.e. checking for np.nan instead of str('')
df2 = df2.loc[df2['A'] != '']
df1.set_index('A', inplace=True)
df2.set_index('A', inplace=True)
# Find keys in dataframe 1 that are not in dataframe 2
idx_diff = df1.index.difference(df2.index)
# Append these rows to dataframe 2
df_ins = df1.loc[idx_diff]
df3 = df2.append(df_ins)
df3.reset_index(inplace=True)
>>>df3
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4 91 y oo
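Note that DataFrame.append has since been deprecated (and removed in pandas 2.0); on newer versions the same step can be written with pd.concat:

# equivalent to df2.append(df_ins) on current pandas versions
df3 = pd.concat([df2, df_ins])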

Selecting data from multiple dataframes

My workbook Rule.xlsx has the following data.
sheet1:
group ordercode quantity
0 A 1
B 3
1 C 1
E 2
D 1
sheet2:
group ordercode quantity
0 x 1
y 3
1 x 1
y 2
z 1
I have created the dataframes using the method below:
df1 = data.parse('sheet1')
df2 = data.parse('sheet2')
My desired result is to write a sequence using these two dataframes:
df3:
group ordercode quantity
0 A 1
B 3
0 x 1
y 3
1 C 1
E 2
D 1
1 x 1
y 2
z 1
i.e. one group block from df1 followed by the same group block from df2.
I wish to know how I can print the data by selecting a group number (e.g. group 0, group 1, etc.).
Any suggestions?
After some comments, the solution is:
# create an OrderedDict of DataFrames, one per sheet
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)
# ordering of the DataFrames
order = 'SWC_1380_81,SWC_1382,SWC_1390,SWC_1391,SWM_1380_81'.split(',')
# loop over the dictionary, forward-fill NaNs and create a helper column with the sheet position
L = [dfs[x].ffill().assign(g=i) for i, x in enumerate(order)]
# finally concatenate, sort by group and helper column, then the helper column can be removed
df = pd.concat(L).sort_values(['group', 'g'])
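Applied to the two-sheet example in the question (sheet1 and sheet2 only), a minimal sketch of the same idea, assuming the blank group cells should inherit the value above them:

import pandas as pd

# dict of DataFrames, one per sheet
dfs = pd.read_excel('Rule.xlsx', sheet_name=None)

# forward-fill the blank group cells and tag each sheet with its position
L = [dfs[name].ffill().assign(g=i) for i, name in enumerate(['sheet1', 'sheet2'])]

# interleave: within each group, sheet1 rows come first, then sheet2 rows (stable sort)
df3 = pd.concat(L).sort_values(['group', 'g'], kind='mergesort')

# select a single group by its number, e.g. group 0
print(df3[df3['group'] == 0])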

Apache Spark group sums by field

I have a dataframe with three columns:
amount type id
12 A 1
10 C 1
21 B 2
10 A 2
2 B 3
44 B 3
I need to sum the amounts of each type and group them by id. My solution looks like
GroupedData result = dataFrame.agg(
        when(dataFrame.col("type").like("A%")
                .or(dataFrame.col("type").like("C%")),
             sum("amount"))
        .otherwise(0))
    .agg(
        when(dataFrame.col("type").like("B%"), sum("amount"))
        .otherwise(0))
    .groupBy(dataFrame.col("id"));
which doesn't look right to me. As a result I need to get back a DataFrame with the data
amount type id
22 A or C 1
21 B 2
10 A 2
46 B 3
I cannot use a double groupBy because two different types may end up in one sum. What can you suggest?
I am using Java and Apache Spark 1.6.2.
Why don't you group by two columns?
df.groupBy($"id", $"type").sum()
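To also get the "A or C" bucket from the question, the type column can be mapped to a bucket before grouping. A sketch in PySpark for brevity (the equivalent when/otherwise, like and groupBy calls exist in the Java API; note it labels the bucket "A or C" even when only one of the two types is present for an id):

from pyspark.sql import functions as f

bucketed = df.withColumn(
    'type',
    f.when(f.col('type').like('A%') | f.col('type').like('C%'), 'A or C')
     .otherwise(f.col('type')))

result = bucketed.groupBy('id', 'type').agg(f.sum('amount').alias('amount'))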
