parse values from list in dataframe column - apache-spark

I have a pyspark dataframe like the input dataframe below. It has a column colA whose values are lists of numbers. I would like to create a new column colC that holds each number from the list in colA as its own row, like the example output dataframe below. Can anyone suggest how to do this?
input dataframe:
colA     colB
[1,2]    1
[3,2,4]  2
output dataframe:
colA     colB  colC
[1,2]    1     1
[1,2]    1     2
[3,2,4]  2     3
[3,2,4]  2     2
[3,2,4]  2     4

This can be done with the explode function:
from pyspark.sql.functions import explode
df.withColumn("colC", explode(df.colA)).show()
Output:
+---------+----+----+
| colA|colB|colC|
+---------+----+----+
| [1, 2]| 1| 1|
| [1, 2]| 1| 2|
|[3, 2, 4]| 2| 3|
|[3, 2, 4]| 2| 2|
|[3, 2, 4]| 2| 4|
+---------+----+----+
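Note that explode drops rows whose colA is null or an empty array. If those rows should be kept (with null in colC), explode_outer is a drop-in replacement; a minimal sketch, assuming Spark 2.3+:
from pyspark.sql.functions import explode_outer
# rows with a null or empty colA are kept, with null in colC
df.withColumn("colC", explode_outer(df.colA)).show()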

Related

How to apply an RDD-style reduceByKey on DataFrames [duplicate]

So I have a spark dataframe that looks like:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:
a | b_list
5 | (2,4)
2 | (4,3)
How would I go about doing this with a pyspark sql dataframe?
Thank you! :)
Here are the steps to get that DataFrame:
>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 5| 2| 1|
| 5| 4| 3|
| 2| 4| 2|
| 2| 3| 7|
+---+---+---+
>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 5| [2, 4]|
| 2| [4, 3]|
+---+---------------+
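If the collected column should be named b_list, as in the desired output, an alias on the aggregation does it; a minimal variation of the step above:
>>> df1 = df.groupBy('a').agg(F.collect_list('b').alias('b_list'))
>>> df1.show()
+---+------+
|  a|b_list|
+---+------+
|  5|[2, 4]|
|  2|[4, 3]|
+---+------+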

Can we do a groupby on one column in spark using pyspark and get list of values of other columns (raw values without an aggregation) [duplicate]

This question already has answers here:
Apply a function to groupBy data with pyspark
(2 answers)
Closed 4 years ago.
I have a dataframe with more than 10 columns. I would like to group by one of the columns, say "Column1", and get, for each value of "Column1", the list of the corresponding raw values from "Column2".
Is there a way to do that using pyspark groupby or any other function?
You can groupBy Column1 and then use the agg function to collect the values from Column2 and Column3:
from pyspark.sql.functions import col
import pyspark.sql.functions as F
df = spark.createDataFrame([[1, 'r1', 1],
[1, 'r2', 0],
[2, 'r2', 1],
[2, 'r1', 1],
[3, 'r1', 1]], schema=['col1', 'col2', 'col3'])
df.show()
#output
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| 1| r1| 1|
#| 1| r2| 0|
#| 2| r2| 1|
#| 2| r1| 1|
#| 3| r1| 1|
#+----+----+----+
df.groupby(col('col1')) \
    .agg(F.collect_list(F.struct(col('col2'), col('col3'))).alias('merged')) \
    .show()
You should see output like this:
+----+----------------+
|col1| merged|
+----+----------------+
| 1|[[r1,1], [r2,0]]|
| 3| [[r1,1]]|
| 2|[[r2,1], [r1,1]]|
+----+----------------+
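If only the raw values of a single column are needed per group (rather than structs), each column can be collected separately; a small sketch along the same lines:
df.groupby(col('col1')) \
    .agg(F.collect_list(col('col2')).alias('col2_values'),
         F.collect_list(col('col3')).alias('col3_values')) \
    .show()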

Pyspark find friendship pairs from friendship lists

I currently have data describing one-directional friendships, such as below:
For the first line, it means user 1 added 3, 4 and 8 as friends but doesn't know their responses; if 3 also added 1 as a friend, the two become a pair.
ID friendsList
1 [3, 4, 8]
2 [8]
3 [1]
4 [1]
5 [6]
6 [7]
7 [1]
8 [1, 2, 4]
How can I use PySpark and Spark SQL to generate the friendship pairs where both users are friends with each other (bi-directional)? Sample output (distinct or not doesn't matter):
(1, 4)
(1, 8)
(1, 3)
(2, 8)
(3, 1)
(4, 1)
(8, 1)
(8, 2)
Thanks!
This can be achieved with the explode function and a self join, as shown below.
from pyspark.sql.functions import explode
df = spark.createDataFrame(((1,[3, 4, 8]),(2,[8]),(3,[1]),(4,[1]),(5,[6]),(6,[7]),(7,[1]),(8,[1, 2, 4])),["c1",'c2'])
df.withColumn('c2',explode(df['c2'])).createOrReplaceTempView('table1')
spark.sql("SELECT t0.c1,t0.c2 FROM table1 t0 INNER JOIN table1 t1 ON t0.c1 = t1.c2 AND t0.c2 = t1.c1").show()
+---+---+
| c1| c2|
+---+---+
| 1| 3|
| 8| 1|
| 1| 4|
| 2| 8|
| 4| 1|
| 8| 2|
| 3| 1|
| 1| 8|
+---+---+
Use the version below if the DataFrame API is preferred over Spark SQL (note that col must be imported as well):
from pyspark.sql.functions import col, explode
df = df.withColumn('c2', explode(df['c2']))
df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df1.c2') == col('df2.c1'))) \
    .select(col('df1.c1'), col('df1.c2')) \
    .show()
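The question notes that distinct output is optional; if each pair should appear only once, one possible tweak (assuming the IDs are numbers, as in the sample data) is to keep only the rows where c1 < c2:
# keep each bi-directional pair once, e.g. (1, 3) but not also (3, 1)
df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df1.c2') == col('df2.c1'))) \
    .where(col('df1.c1') < col('df1.c2')) \
    .select(col('df1.c1'), col('df1.c2')) \
    .show()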

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate the Pandas df.series.value_counts() functionality in Spark:
The resulting object will be in descending order so that the first
element is the most frequently-occurring element. Excludes NA values
by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be achieved in a nicer / simpler way for DataFrames in Spark.
It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
If you want to include additional grouping columns (like "key") just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)
