This question already has answers here:
Explode in PySpark
(2 answers)
Closed 5 years ago.
Write a HiveQL query that transforms the table below into the output shown.
input:
id name dob
-------------------------
1 anjan 10-16-1989
output:
id name dob
-------------------------
1 a 10-16-1989
1 n 10-16-1989
1 j 10-16-1989
1 a 10-16-1989
1 n 10-16-1989
How can the above scenario be solved in Spark, producing the same output as above?
Assuming you have a dataframe (name it data) that comes from Hive like this:
+---+-----+----------+
| id| name| dob|
+---+-----+----------+
| 1|anjan|10-16-1989|
+---+-----+----------+
you can define a user-defined function (UDF) in Spark that transforms a string into an array of one-character strings:
import org.apache.spark.sql.functions.{explode, udf}
val toArray = udf((name: String) => name.toArray.map(_.toString))
With that in place, we can apply it to the name column:
val df = data.withColumn("name", toArray(data("name")))
+---+---------------+----------+
| id| name| dob|
+---+---------------+----------+
| 1|[a, n, j, a, n]|10-16-1989|
+---+---------------+----------+
We can now use the explode function on the name column:
df.withColumn("name", explode(df("name")))
+---+----+----------+
| id|name| dob|
+---+----+----------+
| 1| a|10-16-1989|
| 1| n|10-16-1989|
| 1| j|10-16-1989|
| 1| a|10-16-1989|
| 1| n|10-16-1989|
+---+----+----------+
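If you would rather avoid the UDF, the built-in split function can produce the same result. This is only a sketch: splitting on an empty pattern yields one-character strings, but depending on the Spark version an empty string may also appear in the array, which the filter removes.
import org.apache.spark.sql.functions.{col, explode, split}
// Sketch: split the name on an empty pattern into one-character strings,
// explode, and drop any empty string the split may leave behind.
val exploded = data
  .withColumn("name", explode(split(col("name"), "")))
  .filter(col("name") =!= "")
exploded.show()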
Related
I have three DataFrames, df1 (EMPLOYEE_INFO), df2 (DEPARTMENT_INFO), and df3 (COMPANY_INFO), and I want to update a column in df1 by joining all three DataFrames. The column is FLAG_DEPARTMENT, which is in df1, and I need to set FLAG_DEPARTMENT='POLITICS'. In SQL the query would look like this:
UPDATE [COMPANY_INFO] INNER JOIN ([DEPARTMENT_INFO]
INNER JOIN [EMPLOYEE_INFO] ON [DEPARTMENT_INFO].DEPT_ID = [EMPLOYEE_INFO].DEPT_ID)
ON [COMPANY_INFO].[COMPANY_DEPT_ID] = [DEPARTMENT_INFO].[DEP_COMPANYID]
SET EMPLOYEE_INFO.FLAG_DEPARTMENT = "POLITICS";
If the values in the join columns of these three tables match, I need to set FLAG_DEPARTMENT='POLITICS' in my EMPLOYEE_INFO table.
How can I achieve the same thing in PySpark? I have just started learning PySpark and don't have much in-depth knowledge.
You can use a chain of joins with a select on top of it.
Suppose that you have the following pyspark DataFrames:
employee_df
+---------+-------+
| Name|dept_id|
+---------+-------+
| John| dept_a|
| Liù| dept_b|
| Luke| dept_a|
| Michail| dept_a|
| Noe| dept_e|
|Shinchaku| dept_c|
| Vlad| dept_e|
+---------+-------+
department_df
+-------+----------+------------+
|dept_id|company_id| description|
+-------+----------+------------+
| dept_a| company1|Department A|
| dept_b| company2|Department B|
| dept_c| company5|Department C|
| dept_d| company3|Department D|
+-------+----------+------------+
company_df
+----------+-----------+
|company_id|description|
+----------+-----------+
| company1| Company 1|
| company2| Company 2|
| company3| Company 3|
| company4| Company 4|
+----------+-----------+
Then you can run the following code to add the flag_department column to your employee_df:
from pyspark.sql import functions as F
employee_df = (
    employee_df.alias('a')
    .join(
        department_df.alias('b'),
        on='dept_id',
        how='left',
    )
    .join(
        company_df.alias('c'),
        on=F.col('b.company_id') == F.col('c.company_id'),
        how='left',
    )
    .select(
        *[F.col(f'a.{c}') for c in employee_df.columns],
        F.when(
            F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
            F.lit('POLITICS')
        ).alias('flag_department')
    )
)
The new employee_df will be:
+---------+-------+---------------+
| Name|dept_id|flag_department|
+---------+-------+---------------+
| John| dept_a| POLITICS|
| Liù| dept_b| POLITICS|
| Luke| dept_a| POLITICS|
| Michail| dept_a| POLITICS|
| Noe| dept_e| null|
|Shinchaku| dept_c| null|
| Vlad| dept_e| null|
+---------+-------+---------------+
This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 3 years ago.
I want to give the aggregate column a name that contains the value of one of the groupBy columns:
dataset
  .groupBy("user", "action")
  .agg(collect_list("timestamp").name($"action" + "timestamps"))
This part, .name($"action"), does not work because name expects a String, not a Column.
Based on: How to pivot Spark DataFrame?
import org.apache.spark.sql.functions.{col, collect_list}
val df = spark.createDataFrame(Seq(("U1","a",1), ("U2","b",2))).toDF("user", "action", "timestamp")
val res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
res.show()
+----+------+---+---+
|user|action| a| b|
+----+------+---+---+
| U1| a|[1]| []|
| U2| b| []|[2]|
+----+------+---+---+
The fun part is the column renaming. We should rename all but the first two columns:
val renames = res.schema.names.drop(2).map (n => col(n).as(n + "_timestamp"))
res.select((col("user") +: renames): _*).show
+----+-----------+-----------+
|user|a_timestamp|b_timestamp|
+----+-----------+-----------+
| U1| [1]| []|
| U2| []| [2]|
+----+-----------+-----------+
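If you prefer withColumnRenamed for the renaming step, a foldLeft over the pivoted column names does the same thing; this is a sketch against the same res dataframe as above.
// Rename every column after the first two in place, then drop the
// original action column to match the output above.
val renamed = res.schema.names.drop(2).foldLeft(res) { (acc, n) =>
  acc.withColumnRenamed(n, n + "_timestamp")
}
renamed.drop("action").show()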
This question already has answers here:
Explode (transpose?) multiple columns in Spark SQL table
(3 answers)
Closed 4 years ago.
I have a dataframe in spark:
id | itemid | itemquant | itemprice
-------------------------------------------------
A | 1,2,3 | 2,2,1 | 30,19,10
B | 3,5 | 5,8 | 18,40
Here all the columns are of string datatype.
How can I use the explode function across multiple columns and create the new dataframe shown below:
id | itemid | itemquant | itemprice
-------------------------------------------------
A | 1 | 2 | 30
A | 2 | 2 | 19
A | 3 | 1 | 10
B | 3 | 5 | 18
B | 5 | 8 | 40
In the new dataframe, too, all the columns are of string datatype.
You need a UDF for that:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._
val df = Seq(
("A","1,2,3","2,2,1","30,19,10"),
("B","3,5","5,8","18,40")
).toDF("id","itemid","itemquant","itemprice")
val splitAndZip = udf((col1:String,col2:String,col3:String) => {
col1.split(',').zip(col2.split(',')).zip(col3.split(',')).map{case ((a,b),c) => (a,b,c)}
})
df
.withColumn("tmp",explode(splitAndZip($"itemId",$"itemquant",$"itemprice")))
.select(
$"id",
$"tmp._1".as("itemid"),
$"tmp._2".as("itemquant"),
$"tmp._3".as("itemprice")
)
.show()
+---+------+---------+---------+
| id|itemid|itemquant|itemprice|
+---+------+---------+---------+
| A| 1| 2| 30|
| A| 2| 2| 19|
| A| 3| 1| 10|
| B| 3| 5| 18|
| B| 5| 8| 40|
+---+------+---------+---------+
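If you are on Spark 2.4 or later, you can likely drop the UDF entirely and let the built-in arrays_zip pair the split arrays for you; this is a sketch using the same df as above.
import org.apache.spark.sql.functions.{arrays_zip, col, explode, split}
// Turn each CSV string into an array, zip the arrays element-wise into
// an array of structs, then explode one struct per row.
df
  .withColumn("itemid", split(col("itemid"), ","))
  .withColumn("itemquant", split(col("itemquant"), ","))
  .withColumn("itemprice", split(col("itemprice"), ","))
  .withColumn("tmp", explode(arrays_zip(col("itemid"), col("itemquant"), col("itemprice"))))
  .select(col("id"), col("tmp.itemid"), col("tmp.itemquant"), col("tmp.itemprice"))
  .show()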
Sorry for the vague title, I can't think of a better way to put it. I understand a bit of Python and have some experience with Pandas dataframes, but recently I have been tasked to look at something involving Spark and I'm struggling to get my head around it.
I suppose the best way to explain this is with a small example. Imagine I have dataframe A:
id | Name |
--------------
1 | Random |
2 | Random |
3 | Random |
As well as dataframe B:
id | Fruit |
-------------
1 | Pear |
2 | Pear |
2 | Apple |
2 | Banana |
3 | Pear |
3 | Banana |
Now what I'm trying to do is match dataframe A with B (based on matching id) and iterate through the Fruit column in dataframe B. If a value comes up (say Banana), I want to add it as a column to dataframe A. It could be a simple count (every time Banana comes up, add 1 to the column), or just a flag if it comes up at least once. For example, an output could look like this:
id | Name | Banana
---------------------
1 | Random | 0
2 | Random | 1
3 | Random | 1
My issue is iterating through Spark dataframes, and how to connect the two when a match does occur. I was trying to do something to this effect:
def fruit(input):
    fruits = {"Banana": "B"}
    return fruits[input]
fruits = df.withColumn("Output", fruit("Fruit"))
But it's not really working. Any ideas? Apologies in advance, my experience with Spark is very limited.
Hope this helps!
#sample data
A = sc.parallelize([(1,"Random"), (2,"Random"), (3,"Random")]).toDF(["id", "Name"])
B = sc.parallelize([(1,"Pear"), (2,"Pear"), (2,"Apple"), (2,"Banana"), (3,"Pear"), (3,"Banana")]).toDF(["id", "Fruit"])
df_temp = A.join(B, A.id==B.id, 'inner').drop(B.id)
df = df_temp.groupby(df_temp.id, df_temp.Name).\
    pivot("Fruit").\
    count().\
    na.fill(0)
df.show()
Output is
+---+------+-----+------+----+
| id| Name|Apple|Banana|Pear|
+---+------+-----+------+----+
| 1|Random| 0| 0| 1|
| 3|Random| 0| 1| 1|
| 2|Random| 1| 1| 1|
+---+------+-----+------+----+
Edit: in case you are only interested in a few fruits:
from pyspark.sql.functions import col
#list of fruits you are interested in
fruit_list = ["Pear", "Banana"]
df = df_temp.\
    filter(col('Fruit').isin(fruit_list)).\
    groupby(df_temp.id, df_temp.Name).\
    pivot("Fruit").\
    count().\
    na.fill(0)
df.show()
+---+------+------+----+
| id| Name|Banana|Pear|
+---+------+------+----+
| 1|Random| 0| 1|
| 3|Random| 1| 1|
| 2|Random| 1| 1|
+---+------+------+----+
I have the following two DataFrames:
l1 = [(['hello','world'],), (['stack','overflow'],), (['hello', 'alice'],), (['sample', 'text'],)]
df1 = spark.createDataFrame(l1)
l2 = [(['big','world'],), (['sample','overflow', 'alice', 'text', 'bob'],), (['hello', 'sample'],)]
df2 = spark.createDataFrame(l2)
df1:
["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]
df2:
["big","world"]
["sample","overflow","alice","text","bob"]
["hello", "sample"]
For every row in df1, I want to calculate the number of times all the words in the array occur in df2.
For example, the first row in df1 is ["hello","world"]. Now, I want to check df2 for the intersection of ["hello","world"] with every row in df2.
| ARRAY | INTERSECTION | LEN(INTERSECTION)|
|["big","world"] |["world"] | 1 |
|["sample","overflow","alice","text","bob"] |[] | 0 |
|["hello","sample"] |["hello"] | 1 |
Now, I want to return the sum(len(intersection)). Ultimately I want the resulting df1 to look like this:
df1 result:
ARRAY INTERSECTION_TOTAL
| ["hello","world"] | 2 |
| ["stack","overflow"] | 1 |
| ["hello","alice"] | 2 |
| ["sample","text"] | 3 |
How do I solve this?
I'd focus on avoiding a Cartesian product first. I'd try to explode and join:
from pyspark.sql.functions import explode, monotonically_increasing_id
df1_ = (df1.toDF("words")
    .withColumn("id_1", monotonically_increasing_id())
    .select("*", explode("words").alias("word")))
df2_ = (df2.toDF("words")
    .withColumn("id_2", monotonically_increasing_id())
    .select("id_2", explode("words").alias("word")))
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
    .groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+
| words|sum(count)|
+-----------------+----------+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+----------+
If intermediate values are not needed it could be simplified to:
df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+
| words|count|
+-----------------+-----+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+-----+
and you could omit adding ids.