I have a pyspark dataframe like the input dataframe below. It has a column colA that contains lists of numbers as each value. I would like to create a new column colC that parses each number from the list in colA, like the example output dataframe below. Can anyone suggest how to do this?
input dataframe:
colA colB
[1,2] 1
[3,2,4] 2
output dataframe:
colA colB colC
[1,2] 1 1
[1,2] 1 2
[3,2,4] 2 3
[3,2,4] 2 2
[3,2,4] 2 4
It can be done with the explode function:
from pyspark.sql.functions import explode
df.withColumn("colC", explode(df.colA)).show()
Output:
+---------+----+----+
| colA|colB|colC|
+---------+----+----+
| [1, 2]| 1| 1|
| [1, 2]| 1| 2|
|[3, 2, 4]| 2| 3|
|[3, 2, 4]| 2| 2|
|[3, 2, 4]| 2| 4|
+---------+----+----+
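For completeness, here is a minimal self-contained sketch (an assumption-laden reconstruction, not part of the original answer) that rebuilds the input dataframe from the question and applies explode. The column names colA/colB and the integer element type are taken from the example above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Recreate the input dataframe from the question (integer arrays assumed)
df = spark.createDataFrame([([1, 2], 1), ([3, 2, 4], 2)], ["colA", "colB"])

# explode produces one output row per element of the array in colA
df.withColumn("colC", explode(df.colA)).show()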
So I have a spark dataframe that looks like:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:
a | b_list
5 | (2,4)
2 | (4,3)
How would I go about doing this with a pyspark sql dataframe?
Thank you! :)
Here are the steps to get that DataFrame.
>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 5| 2| 1|
| 5| 4| 3|
| 2| 4| 2|
| 2| 3| 7|
+---+---+---+
>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 5| [2, 4]|
| 2| [4, 3]|
+---+---------------+
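If you want the column named b_list, as in the question, an alias can be added to the aggregation; this is a small variation on the step above:

>>> df1 = df.groupBy('a').agg(F.collect_list('b').alias('b_list'))
>>> df1.show()

This shows the same lists under the column name b_list instead of collect_list(b).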
I have a dataframe with more than 10 columns. I would like to group by one of the columns, say "Column1", and get the list of all possible values of "Column1" and "Column2" corresponding to the grouped "Column1".
Is there a way to do that using pyspark groupBy or any other function?
You can groupBy your Column1 and then use the agg function to collect the values from Column2 and Column3:
from pyspark.sql.functions import col
import pyspark.sql.functions as F
df = spark.createDataFrame([[1, 'r1', 1],
[1, 'r2', 0],
[2, 'r2', 1],
[2, 'r1', 1],
[3, 'r1', 1]], schema=['col1', 'col2', 'col3'])
df.show()
#output
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| 1| r1| 1|
#| 1| r2| 0|
#| 2| r2| 1|
#| 2| r1| 1|
#| 3| r1| 1|
#+----+----+----+
df.groupby(col('col1')) \
  .agg(F.collect_list(F.struct(col('col2'), col('col3'))).alias('merged')) \
  .show()
You should see the output as:
+----+----------------+
|col1| merged|
+----+----------------+
| 1|[[r1,1], [r2,0]]|
| 3| [[r1,1]]|
| 2|[[r2,1], [r1,1]]|
+----+----------------+
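If you prefer separate lists for col2 and col3 rather than a single list of structs, a sketch like this (using the same df and imports as above) should also work:

df.groupby(col('col1')) \
  .agg(F.collect_list(col('col2')).alias('col2_list'),
       F.collect_list(col('col3')).alias('col3_list')) \
  .show()

Note that with two independent collect_list calls there is no guarantee that the two lists stay aligned element by element, which is why collecting a struct, as in the answer above, is the safer choice when the pairing of col2 and col3 matters.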
I currently have data describing single-directional friendships, such as below:
For the first line, it means 1 added 3, 4, and 8 as friends but doesn't know their responses; if 3 added 1 as a friend as well, they become a pair.
ID friendsList
1 [3, 4, 8]
2 [8]
3 [1]
4 [1]
5 [6]
6 [7]
7 [1]
8 [1, 2, 4]
How can I use PySpark and PySpark SQL to generate the friendship pairs where both sides are bi-directional friends? Sample output (distinct or not doesn't matter):
(1, 4)
(1, 8)
(1, 3)
(2, 8)
(3, 1)
(4, 1)
(8, 1)
(8, 2)
Thanks!
This can be achieved with the explode function and a self join, as shown below.
from pyspark.sql.functions import explode
df = spark.createDataFrame(((1,[3, 4, 8]),(2,[8]),(3,[1]),(4,[1]),(5,[6]),(6,[7]),(7,[1]),(8,[1, 2, 4])),["c1",'c2'])
df.withColumn('c2',explode(df['c2'])).createOrReplaceTempView('table1')
spark.sql("SELECT t0.c1,t0.c2 FROM table1 t0 INNER JOIN table1 t1 ON t0.c1 = t1.c2 AND t0.c2 = t1.c1").show()
+---+---+
| c1| c2|
+---+---+
| 1| 3|
| 8| 1|
| 1| 4|
| 2| 8|
| 4| 1|
| 8| 2|
| 3| 1|
| 1| 8|
+---+---+
Use the following if the DataFrame API is preferred over Spark SQL.
from pyspark.sql.functions import col

df = df.withColumn('c2', explode(df['c2']))
df.alias('df1') \
  .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df2.c1') == col('df1.c2'))) \
  .select(col('df1.c1'), col('df1.c2')) \
  .show()
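Since each mutual friendship appears twice (for example (1, 3) and (3, 1)), a keep-one-direction filter can be added if distinct pairs are wanted; this is just a sketch on top of the join above:

df.alias('df1') \
  .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df2.c1') == col('df1.c2'))) \
  .where(col('df1.c1') < col('df1.c2')) \
  .select(col('df1.c1'), col('df1.c2')) \
  .show()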
Something similar to Spark - Group by Key then Count by Value would allow me to emulate the df.series.value_counts() functionality of Pandas in Spark:
The resulting object will be in descending order so that the first
element is the most frequently-occurring element. Excludes NA values
by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be achieved in a nicer / simpler way for DataFrames in Spark.
It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
If you want to include additional grouping columns (like "key") just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)