Spark simpler value_counts - apache-spark

Something similar to "Spark - Group by Key then Count by Value" would let me emulate the Pandas df.series.value_counts() functionality in Spark. The Pandas documentation describes it as follows:
The resulting object will be in descending order so that the first
element is the most frequently-occurring element. Excludes NA values
by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be achieved in a nicer / simpler way for data frames in Spark.

It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2    3
3    2
4    1
1    1
dtype: int64
Spark SQL:
import spark.implicits._  // needed outside spark-shell for toDF and the $"..." syntax
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
  .groupBy($"value").count.orderBy($"count".desc)
  .show()
+-----+-----+
|value|count|
+-----+-----+
|    2|    3|
|    3|    2|
|    1|    1|
|    4|    1|
+-----+-----+
If you want to include additional grouping columns (like "key") just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)
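To sanity-check the Spark result on a small sample, the same aggregation can be modelled in plain Python. This is only a sketch: value_counts below is a hypothetical helper, not a Spark or Pandas API, and the PySpark line in the comment is an untested counterpart of the Scala snippet above.

```python
from collections import Counter


def value_counts(values, dropna=True):
    """Return (value, count) pairs, most frequent first.

    Hypothetical helper mirroring groupBy(value).count().orderBy(count desc).
    """
    if dropna:
        # Pandas excludes NA by default; here None stands in for NA.
        values = [v for v in values if v is not None]
    # most_common() sorts by count in descending order.
    return Counter(values).most_common()


# PySpark counterpart (not run here; assumes a DataFrame df with a "value" column):
#   df.groupBy("value").count().orderBy("count", ascending=False)
counts = value_counts([1, 2, 2, 2, 3, 3, 4, None])
print(counts)
```

One difference to keep in mind: Spark's groupBy keeps null as its own group, so matching the Pandas default of excluding NA values would need a filter on the column before grouping.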

Related

get distinct count from an array of each rows using pyspark

I am looking for the distinct count of the array in each row of a PySpark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it gives me the length of each array:
output:
3
3
4
Please help me achieve this with a PySpark dataframe. This is the code I used:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
Thanks in advance!
For Spark 2.4+ you can use array_distinct and then take the size of the result to get the count of distinct values in each array. A Python UDF is usually much slower than the built-in functions on large data, so prefer Spark's built-in functions whenever one fits.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
from pyspark.sql import functions as F
df.show()
+------------+
|        col1|
+------------+
|   [1, 1, 1]|
|   [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
|        col1|count|
+------------+-----+
|   [1, 1, 1]|    1|
|   [3, 4, 5]|    3|
|[1, 2, 1, 2]|    2|
+------------+-----+
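As a quick cross-check of the expected numbers, the per-row logic can be written in plain Python. This is a sketch of the semantics only, not of how Spark executes it; rows is a hypothetical stand-in for the col1 values.

```python
# Each inner list stands in for one row's col1 array.
rows = [[1, 1, 1], [3, 4, 5], [1, 2, 1, 2]]

# set() drops duplicates (the role of array_distinct),
# len() counts what is left (the role of size).
distinct_counts = [len(set(arr)) for arr in rows]
print(distinct_counts)
```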

parse values from list in dataframe column

I have a pyspark dataframe like the input dataframe below. It has a column colA that contains lists of numbers as each value. I would like to create a new column colC that parses each number from the list in colA, like the example output dataframe below. Can anyone suggest how to do this?
input dataframe:
colA colB
[1,2] 1
[3,2,4] 2
output dataframe:
colA colB colC
[1,2] 1 1
[1,2] 1 2
[3,2,4] 2 3
[3,2,4] 2 2
[3,2,4] 2 4
This can be done with the explode function:
from pyspark.sql.functions import explode
df.withColumn("colC", explode(df.colA)).show()
Output:
+---------+----+----+
|     colA|colB|colC|
+---------+----+----+
|   [1, 2]|   1|   1|
|   [1, 2]|   1|   2|
|[3, 2, 4]|   2|   3|
|[3, 2, 4]|   2|   2|
|[3, 2, 4]|   2|   4|
+---------+----+----+
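What explode does to each row can be sketched in plain Python. The helper explode_rows and the two extra sample rows (an empty list and a None) are assumptions added for illustration; they show that explode produces no output row for an empty or null array, which is where explode_outer behaves differently.

```python
def explode_rows(rows):
    """Yield (colA, colB, colC) with one output row per element of colA."""
    for col_a, col_b in rows:
        # None or empty lists produce no output rows, as with explode.
        for item in col_a or []:
            yield (col_a, col_b, item)


rows = [([1, 2], 1), ([3, 2, 4], 2), ([], 3), (None, 4)]
exploded = list(explode_rows(rows))
for row in exploded:
    print(row)
```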

Can we do a groupby on one column in spark using pyspark and get list of values of other columns (raw values without an aggregation) [duplicate]

This question already has answers here:
Apply a function to groupBy data with pyspark
(2 answers)
Closed 4 years ago.
I have a dataframe where there are more than 10 columns. I would like to groupby on one of the columns in the dataframe, suppose "Column1" and get the list of all possible values in "Column1" and "Column2" corresponding to the grouped "Column1".
Is there a way to do that using pyspark groupby or any other function?
You can groupBy your Column1 and then use the agg function to collect the values from Column2 and Column3:
from pyspark.sql.functions import col
import pyspark.sql.functions as F
df = spark.createDataFrame([[1, 'r1', 1],
[1, 'r2', 0],
[2, 'r2', 1],
[2, 'r1', 1],
[3, 'r1', 1]], schema=['col1', 'col2', 'col3'])
df.show()
#output
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| 1| r1| 1|
#| 1| r2| 0|
#| 2| r2| 1|
#| 2| r1| 1|
#| 3| r1| 1|
#+----+----+----+
# wrap the chain in parentheses so it can span several lines
(df.groupby(col('col1'))
   .agg(F.collect_list(F.struct(col('col2'), col('col3'))).alias('merged'))
   .show())
You should see output like this:
+----+----------------+
|col1|          merged|
+----+----------------+
|   1|[[r1,1], [r2,0]]|
|   3|        [[r1,1]]|
|   2|[[r2,1], [r1,1]]|
+----+----------------+
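The shape of that result can be reproduced in plain Python to see what collect_list over a struct amounts to. This is a sketch using the same sample rows; merged is just a dict here, not a Spark object.

```python
from collections import defaultdict

rows = [(1, 'r1', 1), (1, 'r2', 0), (2, 'r2', 1), (2, 'r1', 1), (3, 'r1', 1)]

# Group on col1 and keep the raw (col2, col3) pairs, without aggregating them.
merged = defaultdict(list)
for col1, col2, col3 in rows:
    merged[col1].append((col2, col3))

for key, pairs in merged.items():
    print(key, pairs)
```

In Spark the order of the collected elements is not guaranteed after a shuffle, so the lists may come back in a different order than in this sketch.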

Pyspark find friendship pairs from friendship lists

I currently have data describing one-directional friendships, as shown below.
The first line means that 1 added 3, 4 and 8 as friends without knowing their responses; if 3 also added 1 as a friend, they become a pair.
ID friendsList
1 [3, 4, 8]
2 [8]
3 [1]
4 [1]
5 [6]
6 [7]
7 [1]
8 [1, 2, 4]
How can I use PySpark and PySpark SQL to generate the friendship pairs in which both sides have added each other? Sample output (distinct or not doesn't matter):
(1, 4)
(1, 8)
(1, 3)
(2, 8)
(3, 1)
(4, 1)
(8, 1)
(8, 2)
Thanks!
This can be achieved with the explode function and a self join, as shown below.
from pyspark.sql.functions import explode
df = spark.createDataFrame(((1,[3, 4, 8]),(2,[8]),(3,[1]),(4,[1]),(5,[6]),(6,[7]),(7,[1]),(8,[1, 2, 4])),["c1",'c2'])
df.withColumn('c2',explode(df['c2'])).createOrReplaceTempView('table1')
spark.sql("SELECT t0.c1,t0.c2 FROM table1 t0 INNER JOIN table1 t1 ON t0.c1 = t1.c2 AND t0.c2 = t1.c1").show()
+---+---+
| c1| c2|
+---+---+
|  1|  3|
|  8|  1|
|  1|  4|
|  2|  8|
|  4|  1|
|  8|  2|
|  3|  1|
|  1|  8|
+---+---+
Use the following if the DataFrame API is preferred over Spark SQL:
from pyspark.sql.functions import col, explode
df = df.withColumn('c2', explode(df['c2']))
df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df2.c1') == col('df1.c2'))) \
    .select(col('df1.c1'), col('df1.c2'))
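The explode-then-self-join logic can be checked against the sample data with a plain-Python model. This is a sketch of the semantics only; friends, edges, mutual and unique_pairs are illustrative names, and the final step (one row per unordered pair) goes beyond what the question asked for.

```python
friends = {
    1: [3, 4, 8], 2: [8], 3: [1], 4: [1],
    5: [6], 6: [7], 7: [1], 8: [1, 2, 4],
}

# "explode": one directed edge (id, friend) per list element.
edges = {(a, b) for a, friend_list in friends.items() for b in friend_list}

# "self join": keep an edge only if the reverse edge exists as well.
mutual = sorted((a, b) for (a, b) in edges if (b, a) in edges)
print(mutual)

# Optional: collapse (a, b) and (b, a) into a single unordered pair.
unique_pairs = sorted({(min(a, b), max(a, b)) for a, b in mutual})
print(unique_pairs)
```

In the SQL version, adding a condition such as t0.c1 < t0.c2 to the join should give the same de-duplicated pairs.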

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark sql dataframe, consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id column is unique, whereas dat1 ... datn may contain duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this is basically correct, but it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found e.g. this SO post. But this would require constructing a huge "string join condition".
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3
If the number of ids per group is relatively small, you can groupBy and collect_list. Required imports:
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
(1, "a", "b", 3),
(2, None, "f", None),
(3, "g", "h", 4),
(4, None, "f", None),
(5, "a", "b", 3)
]).toDF(["id"])
query:
(df
.groupBy(df.columns[1:])
.agg(collect_list("id").alias("ids"))
.where(size("ids") > 1))
and the result:
+----+---+----+------+
|  _2| _3|  _4|   ids|
+----+---+----+------+
|null|  f|null|[2, 4]|
|   a|  b|   3|[1, 5]|
+----+---+----+------+
You can apply explode twice (or use a udf) to get an output equivalent to the one returned from the join.
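The reason this sidesteps the null problem is that grouping puts rows with null in the same column into the same group, as the result above shows for ids 2 and 4, whereas a plain equality join never matches null to null. A plain-Python model of the grouping, using None for null (a sketch; groups, duplicates and duplicate_ids are illustrative names):

```python
from collections import defaultdict

rows = [
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3),
]

# Group ids by the tuple of all data columns; None compares equal to None
# inside a tuple key, which mirrors how groupBy groups nulls together.
groups = defaultdict(list)
for row_id, *data in rows:
    groups[tuple(data)].append(row_id)

# Keep only the groups that contain more than one id.
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
duplicate_ids = sorted(i for ids in duplicates.values() for i in ids)
print(duplicates)
print(duplicate_ids)
```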
You can also identify groups using minimal id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
.select(
"*",
count("*").over(w).alias("_cnt"),
min("id").over(w).alias("group"))
.where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id|  _2| _3|  _4|_cnt|group|
+---+----+---+----+----+-----+
|  2|null|  f|null|   2|    2|
|  4|null|  f|null|   2|    2|
|  1|   a|  b|   3|   2|    1|
|  5|   a|  b|   3|   2|    1|
+---+----+---+----+----+-----+
You can then use the group column for a self join.
