Apache Spark group sums by field - apache-spark

I have a dataframe with three columns:
amount  type  id
12      A     1
10      C     1
21      B     2
10      A     2
2       B     3
44      B     3
I need to sum the amounts of each type and group them by id. My solution looks like this:
GroupedData result = dataFrame.agg(
        when(dataFrame.col("type").like("A%")
                .or(dataFrame.col("type").like("C%")),
            sum("amount"))
        .otherwise(0)
    ).agg(
        when(dataFrame.col("type").like("B%"), sum("amount"))
        .otherwise(0)
    )
    .groupBy(dataFrame.col("id"));
which doesn't look right to me. I need to return a DataFrame with the following data:
amount  type    id
22      A or C  1
21      B       2
10      A       2
46      B       3
I cannot use a double groupBy because two different types may end up in the same sum. What can you suggest?
I use Java and Apache Spark 1.6.2.

Why don't you groupBy on two columns?
df.groupBy($"id", $"type").sum()

Related

pandas: get rows from one dataframe which exist in another dataframe

I have two dataframes. The dataframes are as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
In df1, the numbers column holds unique numbers. These numbers also appear in df2, where they are repeated. I want to get all rows of df2 whose numbers are present in df1.
Here is the code I've tried, but I'm not able to figure out how to solve this using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique numbers that are in df2, but I need all the data, including the other columns from the second dataframe.
It would be great if anyone could help me with this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
df1 = pd.DataFrame({"numbers": [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({"numbers": [123, 1234, 12345, 123, 123, 45643], "B": [1, 2, 3, 4, 5, 6], "C": [2, 3, 4, 5, 6, 7]})
final_df = df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned:
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
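For completeness, a runnable version of the same idea, plus a merge-based alternative (the merge variant is my addition, not part of the original answer):

import pandas as pd

df1 = pd.DataFrame({"numbers": [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({"numbers": [123, 1234, 12345, 123, 123, 45643],
                    "B": [1, 2, 3, 4, 5, 6],
                    "C": [2, 3, 4, 5, 6, 7]})

# isin keeps every column of df2 for rows whose number appears in df1
final_df = df2[df2["numbers"].isin(df1["numbers"])]

# equivalent inner merge; drop_duplicates avoids multiplying rows
# in case df1 ever contained repeated numbers
merged = df2.merge(df1.drop_duplicates("numbers"), on="numbers", how="inner")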

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria here are first readability of the code, then time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast:
# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
Another simple way is using add and radd
df1.astype(str).add(df2.astype(str).radd('-'))
A B
0 1-10 5-50
1 2-20 4-40
2 3-30 3-30
3 4-40 2-20
4 5-50 1-10
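Note that this uses '-' as the separator while the required output uses '_'. The same add/radd idea with the underscore, assuming the df1/df2 defined in the question (plain element-wise string concatenation is an equally readable alternative):

# radd with the underscore separator from the required output
required = df1.astype(str).add(df2.astype(str).radd('_'))

# element-wise string concatenation gives the same result
required = df1.astype(str) + '_' + df2.astype(str)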

Averaging duplicates values in dataframe [duplicate]

I have a dataframe like this:
cluster org time
1 a 8
1 a 6
2 h 34
1 c 23
2 d 74
3 w 6
I would like to calculate the average of time per org per cluster.
Expected result:
cluster mean(time)
1 15 #=((8 + 6) / 2 + 23) / 2
2 54 #=(74 + 34) / 2
3 6
I do not know how to do it in Pandas, can anybody help?
If you want to first take mean on the combination of ['cluster', 'org'] and then take mean on cluster groups, you can use:
In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
.groupby('cluster')['time'].mean())
Out[59]:
cluster
1 15
2 54
3 6
Name: time, dtype: int64
If you want the mean of cluster groups only, then you can use:
In [58]: df.groupby(['cluster']).mean()
Out[58]:
time
cluster
1 12.333333
2 54.000000
3 6.000000
You can also use groupby on ['cluster', 'org'] and then use mean():
In [57]: df.groupby(['cluster', 'org']).mean()
Out[57]:
             time
cluster org
1       a       7
        c      23
2       d      74
        h      34
3       w       6
I would simply do this, which literally follows your desired logic:
df.groupby(['org']).mean().groupby(['cluster']).mean()
Another possible solution is to reshape the dataframe using pivot_table(), then take mean(). Passing aggfunc='mean' (which is also pivot_table's default) is what averages time by cluster and org.
df.pivot_table(index='org', columns='cluster', values='time', aggfunc='mean').mean()
Another possibility is to use the level parameter of mean() after the first groupby() to aggregate:
df.groupby(['cluster', 'org']).mean().mean(level='cluster')
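Putting the first approach together into a runnable sketch on the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    "cluster": [1, 1, 2, 1, 2, 3],
    "org":     ["a", "a", "h", "c", "d", "w"],
    "time":    [8, 6, 34, 23, 74, 6],
})

# mean per (cluster, org) first, then mean of those means per cluster
result = (df.groupby(["cluster", "org"], as_index=False)["time"].mean()
            .groupby("cluster")["time"].mean())
print(result)
# cluster
# 1    15.0
# 2    54.0
# 3     6.0
# Name: time, dtype: float64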

How to write Pyspark UDAF on multiple columns?

I have the following data in a pyspark dataframe called end_stats_df:
values start end cat1 cat2
10 1 2 A B
11 1 2 C B
12 1 2 D B
510 1 2 D C
550 1 2 C B
500 1 2 A B
80 1 3 A B
And I want to aggregate it in the following way:
I want to use the "start" and "end" columns as the aggregate keys
For each group of rows, I need to do the following:
Compute the number of unique values across cat1 and cat2 for that group. E.g., for the group with start=1 and end=2, this number would be 4 because there are A, B, C, D. This number will be stored as n (n=4 in this example).
For the values field, for each group I need to sort the values and then select every (n-1)-th value, where n is the number stored from the first operation above.
At the end of the aggregation, I don't really care what is in cat1 and cat2 after the operations above.
An example output from the example above is:
values start end cat1 cat2
12 1 2 D B
550 1 2 C B
80 1 3 A B
How do I accomplish this using pyspark dataframes? I assume I need a custom UDAF, right?
Pyspark does not support UDAFs directly, so we have to do the aggregation manually.
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

def func(values, cat1, cat2):
    # n = number of unique values across cat1 and cat2 for this group
    n = len(set(cat1 + cat2))
    return sorted(values)[n - 2]

df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt',
                     format='csv', sep='\t', header=True)
df = df.groupBy(df['start'], df['end']).agg(f.collect_list(df['values']).alias('values'),
                                            f.collect_set(df['cat1']).alias('cat1'),
                                            f.collect_set(df['cat2']).alias('cat2'))
df = df.select(df['start'], df['end'],
               f.UserDefinedFunction(func, StringType())(df['values'], df['cat1'], df['cat2']))

How to efficiently concatenate data frames with different column sets in Spark?

I have two tables with different but overlapping column sets. I want to concatenate them the way pandas does, but that is very inefficient in Spark.
X:
A B
0 1 3
1 2 4
Y:
A C
0 5 7
1 6 8
pd.concat(X, Y):
A B C
0 1 3 NaN
1 2 4 NaN
0 5 NaN 7
1 6 NaN 8
I tried to use Spark SQL to do it...
select A, B, null as C from X union all select A, null as B, C from Y
... and it is extremely slow. I applied this query to two tables with sizes (79 rows, 17330 columns) and (92 rows, 16 columns). It took 129 s running on Spark 1.6.2, 319 s on Spark 2.0.1 and 1.2 s in pandas.
Why is it so slow? Is this some kind of bug? Can it be done faster using spark?
EDIT:
I tried to do it programmatically as in here: how to union 2 spark dataframes with different amounts of columns - it's even slower.
It seems that the problem is adding the null columns. Maybe this can be solved differently, or this part could be made faster?
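For reference, a minimal PySpark sketch of what the programmatic variant looks like (the helper name and structure are mine, not taken from the linked answer); it pads each side with typed nulls for the columns it lacks and then unions. Whether it is faster than the SQL version will depend mostly on the number of columns:

from pyspark.sql import functions as F

def union_with_missing_columns(x, y):
    # all columns from either frame, x's columns first for a stable order
    all_cols = x.columns + [c for c in y.columns if c not in set(x.columns)]

    def padded(df, other):
        # for columns the frame lacks, add a null cast to the other frame's type
        return df.select([
            F.col(c) if c in df.columns
            else F.lit(None).cast(other.schema[c].dataType).alias(c)
            for c in all_cols
        ])

    return padded(x, y).unionAll(padded(y, x))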
