combining multiple rows in Spark dataframe column based on condition - apache-spark

I am trying to combine multiple rows in a spark dataframe based on a condition:
This is the dataframe I have(df):
|username | qid | row_no | text |
---------------------------------
| a | 1 | 1 | this |
| a | 1 | 2 | is |
| d | 2 | 1 | the |
| a | 1 | 3 | text |
| d | 2 | 2 | ball |
I want it to look like this
|username | qid | row_no | text |
---------------------------------------
| a | 1 | 1,2,3 | This is text|
| b | 2 | 1,2 | The ball |
I am using spark 1.5.2 it does not have collect_list function

collect_list showed up only in 1.6.
I'd go through the underlying RDD. Here's how:
data_df.show()
+--------+---+------+----+
|username|qid|row_no|text|
+--------+---+------+----+
| d| 2| 2|ball|
| a| 1| 1|this|
| a| 1| 3|text|
| a| 1| 2| is|
| d| 2| 1| the|
+--------+---+------+----+
Then this
reduced = data_df\
.rdd\
.map(lambda row: ((row[0], row[1]), [(row[2], row[3])]))\
.reduceByKey(lambda x,y: x+y)\
.map(lambda row: (row[0], sorted(row[1], key=lambda text: text[0]))) \
.map(lambda row: (
row[0][0],
row[0][1],
','.join([str(e[0]) for e in row[1]]),
' '.join([str(e[1]) for e in row[1]])
)
)
schema_red = typ.StructType([
typ.StructField('username', typ.StringType(), False),
typ.StructField('qid', typ.IntegerType(), False),
typ.StructField('row_no', typ.StringType(), False),
typ.StructField('text', typ.StringType(), False)
])
df_red = sqlContext.createDataFrame(reduced, schema_red)
df_red.show()
The above produced the following:
+--------+---+------+------------+
|username|qid|row_no| text|
+--------+---+------+------------+
| d| 2| 1,2| the ball|
| a| 1| 1,2,3|this is text|
+--------+---+------+------------+
In pandas
df4 = pd.DataFrame([
['a', 1, 1, 'this'],
['a', 1, 2, 'is'],
['d', 2, 1, 'the'],
['a', 1, 3, 'text'],
['d', 2, 2, 'ball']
], columns=['username', 'qid', 'row_no', 'text'])
df_groupped=df4.sort_values(by=['qid', 'row_no']).groupby(['username', 'qid'])
df3 = pd.DataFrame()
df3['row_no'] = df_groupped.apply(lambda row: ','.join([str(e) for e in row['row_no']]))
df3['text'] = df_groupped.apply(lambda row: ' '.join(row['text']))
df3 = df3.reset_index()

You can apply groupBy on username and qid column then follow by agg() method you can use collect_list() method like this
import pyspark.sql.functions as func
then you will have collect_list()or some other important functions
for detail abput groupBy and agg you can follow this URL.
Hope this solves your problem
Thanks

Related

Apache Spark group by DF, collect values into list and then group by list

I have the following Apache Spark DataFrame (DF1):
function_name | param1 | param2 | param3 | result
---------------------------------------------------
f1 | a | b | c | 1
f1 | b | d | m | 0
f2 | a | b | c | 0
f2 | b | d | m | 0
f3 | a | b | c | 1
f3 | b | d | m | 1
f4 | a | b | c | 0
f4 | b | d | m | 0
First of all, I'd like to group DataFrame by function_name, collect results into the ArrayType and receive the new DataFrame (DF2):
function_name | result_list
--------------------------------
f1 | [1,0]
f2 | [0,0]
f3 | [1,1]
f4 | [0,0]
rigth after that, I need to collect function_name into ArrayType by grouping result_list and I'll receive new DataFrame like the following (DF3):
result_list | function_name_lists
------------------------------------
[1,0] | [f1]
[0,0] | [f2,f4]
[1,1] | [f3]
So, I have a question - first of all, can I use grouping by ArrayType column in Apache Spark? If so, I can potentially have tens of millions values in result_list ArrayType single field. Will Apache Spark be able to group by result_list column in this case?
Yes you can do that.
Creating your data frame:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.types import *
list=[['f1','a','b','c',1],
['f1','b','d','m',0],
['f2','a','b','c',0],
['f2','b','d','m',0],
['f3','a','b','c',1],
['f3','b','d','m',1],
['f4','a','b','c',0],
['f4','b','d','m',0]]
df= spark.createDataFrame(list,['function_name','param1','param2','param3','result'])
df.show()
+-------------+------+------+------+------+
|function_name|param1|param2|param3|result|
+-------------+------+------+------+------+
| f1| a| b| c| 1|
| f1| b| d| m| 0|
| f2| a| b| c| 0|
| f2| b| d| m| 0|
| f3| a| b| c| 1|
| f3| b| d| m| 1|
| f4| a| b| c| 0|
| f4| b| d| m| 0|
+-------------+------+------+------+------+
Grouping by function_name, then grouping by result_list(using collect_list), using order of param1,param2,param3:
w=Window().partitionBy("function_name").orderBy(F.col("param1"),F.col("param2"),F.col("param3"))
w1=Window().partitionBy("function_name")
df1=df.withColumn("result_list", F.collect_list("result").over(w)).withColumn("result2",F.row_number().over(w))\
.withColumn("result3",F.max("result2").over(w1))\
.filter(F.col("result2")==F.col("result3")).drop("param1","param2","param3","result","result2","result3")
df1.groupBy("result_list")\
.agg(F.collect_list("function_name").alias("function_name_list")).show()
+-----------+------------------+
|result_list|function_name_list|
+-----------+------------------+
| [1, 0]| [f1]|
| [1, 1]| [f3]|
| [0, 0]| [f2, f4]|
+-----------+------------------+
For doing further anaylsis, transformation or cleaning on array type columns I would recommend you check out the new higher order functions in spark2.4 and above.
(collect_list will work for spark1.6 and above)
Higher order functions in open source:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_contains onwards
Databricks releases:
Link:https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html

Combine dataframes columns consisting of multiple values - Spark

I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary amount of keys.
My attempt in PySpark:
def get_keys(id):
x = df2.where(df2.ID == id).select('key')
return x
df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a dataframe. Since the second argument of the .withColumn function needs to be an Column type variable, I am not sure how to mutate x correctly.
You are looking for collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2| keys|
#+---+-----+-----+------+
#| 1| A| B|[w, x]|
#| 3| C| F| [z]|
#| 2| B| D| [y]|
#+---+-----+-----+------+
If you want only unique keys you can use collect_set

Randomly Split DataFrame by Unique Values in One Column

I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is the good way to do it on PySpark? Can I use the randomSplit method somehow?
You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
3+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+

How can I check if an instance is in a dataframe in Pyspark?

I have an instance extracted from a dataframe df1 and I want to check if that instance is in another dataframe df2 in Pyspark. Is there way to face it?
For example:
Instance:
+------+------+------+
| Atr1 | Atr2 | Atr3 |
+------+------+------+
| 'A' | 2 | 'B' |
+------+------+------+
Dataframe:
+------+------+------+
| Atr1 | Atr2 | Atr3 |
+------+------+------+
| 'C' | 1 | 'B' |
+------+------+------+
| 'D' | 2 | 'A' |
+------+------+------+
| 'E' | 2 | 'C' |
+------+------+------+
| 'A' | 2 | 'B' |
+------+------+------+
This way, I want to get true because the instance is in the dataframe (4th row).
Thanks.
You can take the intersection of df1 and df2 and compare if the count of df1 is equal to that of the intersection as follows:
>>> df1 = spark.createDataFrame(sc.parallelize([['A', 2, 'B']]), ['Atr1', 'Atr2', 'Atr3'])
>>> df2 = spark.createDataFrame(sc.parallelize([['C',1,'B'],['D',2,'A'],['E',2,'C'],['A',2,'B']]), ['Atr1', 'Atr2', 'Atr3'])
>>> df1.show()
+----+----+----+
|Atr1|Atr2|Atr3|
+----+----+----+
| A| 2| B|
+----+----+----+
>>> df2.show()
+----+----+----+
|Atr1|Atr2|Atr3|
+----+----+----+
| C| 1| B|
| D| 2| A|
| E| 2| C|
| A| 2| B|
+----+----+----+
>>> df2.intersect(df1).count() == df1.count()
True
>>>
For information on pyspark.sql.DataFrame.intersect check the documentation here.
Pyspark is not the right language to do this, but still:
First let's create our dataframes:
df1 = spark.createDataFrame(sc.parallelize([['A', 2, 'B']]), ['Atr1', 'Atr2', 'Atr3'])
df2 = spark.createDataFrame(sc.parallelize([['C',1,'B'],['D',2,'A'],['E',2,'C'],['A',2,'B']]), ['Atr1', 'Atr2', 'Atr3'])
you can use:
subtract
df1.subtract(df2).count() == 0
a join
df2.join(df1, ['Atr1', 'Atr2', 'Atr3']).count() > 0
a filter
df2.filter((df2.Atr1 == 'A') & (df2.Atr2 == 2) & (df2.Atr3 == 'B')).count() > 0
Hope this helps!

PySpark: Compare array values in one dataFrame with array values in another dataFrame to get the intersection

I have the following two DataFrames:
l1 = [(['hello','world'],), (['stack','overflow'],), (['hello', 'alice'],), (['sample', 'text'],)]
df1 = spark.createDataFrame(l1)
l2 = [(['big','world'],), (['sample','overflow', 'alice', 'text', 'bob'],), (['hello', 'sample'],)]
df2 = spark.createDataFrame(l2)
df1:
["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]
df2:
["big","world"]
["sample","overflow","alice","text","bob"]
["hello", "sample"]
For every row in df1, I want to calculate the number of times all the words in the array occur in df2.
For example, the first row in df1 is ["hello","world"]. Now, I want to check df2 for the intersection of ["hello","world"] with every row in df2.
| ARRAY | INTERSECTION | LEN(INTERSECTION)|
|["big","world"] |["world"] | 1 |
|["sample","overflow","alice","text","bob"] |[] | 0 |
|["hello","sample"] |["hello"] | 1 |
Now, I want to return the sum(len(interesection)). Ultimately I want the resulting df1 to look like this:
df1 result:
ARRAY INTERSECTION_TOTAL
| ["hello","world"] | 2 |
| ["stack","overflow"] | 1 |
| ["hello","alice"] | 2 |
| ["sample","text"] | 3 |
How do I solve this?
I'd focus on avoiding Cartesian product first. I'd try to explode and join
from pyspark.sql.functions import explode, monotonically_increasing_id
df1_ = (df1.toDF("words")
.withColumn("id_1", monotonically_increasing_id())
.select("*", explode("words").alias("word")))
df2_ = (df2.toDF("words")
.withColumn("id_2", monotonically_increasing_id())
.select("id_2", explode("words").alias("word")))
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
.groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+
| words|sum(count)|
+-----------------+----------+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+----------+
If intermediate values are not needed it could be simplified to:
df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+
| words|count|
+-----------------+-----+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+-----+
and you could omit adding ids.

Resources