PySpark data frame schema passing dynamically

PySpark data frame schema passing dynamically - apache-spark

I have a spark data frame like below:
>>> df2.show()
+--------------------+
| value|
+--------------------+
|[Name, Number, Sa...|
| [A, 1, 2000]|
| [B, 2, 3000]|
| [C, 3, 4000]|
+--------------------+
Now i am trying to remove the column name as Value and get the schema as below
+--------+--------+--------+
|Name |Number | Salary|
+--------+--------+--------+
| A| 1| 2000|
| B| 2| 3000|
| C| 3| 4000|
+--------+--------+--------+
The code i used as below:
length = len(df2.select('value').take(1)[0][0])
df2.select([df2.value[i] for i in range(length)]).show()
and the output i am getting as below which is not correct
+--------+--------+--------+
|value[0]|value[1]|value[2]|
+--------+--------+--------+
| Name| Number| Salary|
| A| 1| 2000|
| B| 2| 3000|
| C| 3| 4000|
+--------+--------+--------+
Have used spark version - 2.4.0 , Python Version - 3.6.12
Appreciate your support

Related

How can I achieve following spark behaviour using replaceWhere clause

I want to write data in delta tables incrementally while replacing (overwriting) partitions already present in sink. Example:
Consider this data inside my delta table already partionned by id column:
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id| x|
+---+---------+
| 2| NEW|
| 2| NEW|
| 4| D|
| 5| E|
+---+---------+
The desired output is this
+---+---------+
| id| x|
+---+---------+
| 1| A|
| 2| NEW|
| 2| NEW|
| 3| C|
| 4| D|
| 5| E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids=[x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
df.filter(df.id==Id).write.format("delta").option("mergeSchema", "true").partitionBy("id").option("replaceWhere", "id == '$i'".format(i=Id)).mode("append").save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id| x|
+---+---------+
| 2| B|
| 1| A|
| 5| E|
| 2| NEW|
| 2|NEW AUSSI|
| 3| C|
| 4| D|
+---+---------+
As you can see it appended all value without replacing the partition id=2 which was already present in table.
I think it is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want please ?
Thank you.

I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=idd))
with
.option("replaceWhere", "id == '{i}'".format(i=idd))
and it worked.
Thanks to #ggordon who noticed me about the error on another question.

How to concatenate data frame column pyspark?

I have created data frame using below code:
df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80"),("A", "120"),("c", "20"),("Null", "20")],["Let", "Num"])
df.show()
+----+---+
| Let|Num|
+----+---+
| A| 20|
| B| 30|
| D| 80|
| A|120|
| c| 20|
|Null| 20|
+----+---+
I want create data frame like below:
+----+-------+
| Let|Num |
+----+-------+
| A| 20,120|
| B| 30 |
| D| 80 |
| c| 20 |
|Null| 20 |
+----+-------+
how to achieve this?

You can groupBy Let and collect as list with collect_list
from pyspark.sql import functions as F
df.groupBy("Let").agg(F.collect_list("Num")).show()
Output as List:
+----+-----------------+
| Let|collect_list(Num)|
+----+-----------------+
| B| [30]|
| D| [80]|
| A| [20, 120]|
| c| [20]|
|Null| [20]|
+----+-----------------+
df.groupBy("Let").agg(F.concat_ws(",", F.collect_list("Num"))).show()
Output as String
+----+-------------------------------+
| Let|concat_ws(,, collect_list(Num))|
+----+-------------------------------+
| B| 30|
| D| 80|
| A| 20,120|
| c| 20|
|Null| 20|
+----+-------------------------------+

pyspark : fetch common data from dataframe when comparing values of given columns

I have two pyspark dataframes like this.
data_frame A
+----+---+
|name1| id1|
+----+---+
| a| 3|
| b| 5|
| c| 7|
+----+---+
data_frame B
+----+---+
|name2| id2|
+----+---+
| a| 13|
| b| 15|
| c| 17|
| d| 6|
| e| 0|
| f| 3|
+----+---+
I want to fetch dataframe B contents if values of name1 (from df a) and name2 (from df b) matches. which is as shown below.
o/p dataframe
+----+---+
|name2| id2|
+----+---+
| a| 13|
| b| 15|
| c| 17|
+----+---+
I want to avoid computationally expensive methods such as collect() etc.
How this can be done in apache spark?

from pyspark.sql.functions import *
df1.join(df2, df1.name1 == df2.name2).select('df2.*')
OR (using sql)
df1.registerTempTable("tableA")
df2.registerTempTable("tableB")
val result = sqlContext.sql("select b.name2, b.id2 from tableA a join tableB b on a.name1=b.name2")
result.show()
+----+----+
|name2| id2|
+----+----+
| a| 13|
| b| 15|
| c| 17|
+----+---+

Divide aggregate value using values from data frame in PySpark

I have a data frame like below in pyspark.
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to populate some columns based on the below lists
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done like below.
import pyspark.sql.functions as F
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to do some change to the code I want to populate the column value after I divide the cat column with the value in the data frame for that id.
I tried something like below but didn't get the correct result
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ df.val).show()
How can I get what I want?
edit
Expected result
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 0.5| 1| 0.5|
| 3| 1| null| 2|
| 2|null| 0.33| 0.33|
+---+----+------+--------+

Aggregation would need an aggregation function, a simple column would not be identified
Since val column contains same value for each group of id column, you can use first inbuilt function as
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')/ F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id| pc| phones| security|
+---+----+------------------+------------------+
| 3| 1.0| null| 2.0|
| 1| 0.5| 1.0| 0.5|
| 2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all the three data frames. I have done like below.
full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
Final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want but want to simplify my code.
1) I want to create phones_df, pc_df, security_df in a better way because I am using the same code while creating these data frames I want to reduce this.
2) I want to simplify the join statements to one statement
How can I do this? Could anyone explain.

Here is one way using when.otherwise to map column to categories, and then pivot it to the desired output:
import pyspark.sql.functions as F
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

PySpark data frame schema passing dynamically - apache-spark

Related

How can I achieve following spark behaviour using replaceWhere clause

How to concatenate data frame column pyspark?

pyspark : fetch common data from dataframe when comparing values of given columns

Divide aggregate value using values from data frame in PySpark

Simplify code and reduce join statements in pyspark data frames

Categories

Resources