How to rename an existing Spark SQL function - apache-spark

I am using Spark to call functions on data submitted by the user.
How can I rename an already existing function to a different name, for example REGEXP_REPLACE to REPLACE?
I tried the following code:
ss.udf.register("REPLACE", REGEXP_REPLACE) // This doesn't work
ss.udf.register("sum_in_all", sumInAll)
ss.udf.register("mod", mod)
ss.udf.register("average_in_all", averageInAll)

Import it with an alias:
import org.apache.spark.sql.functions.{regexp_replace => replace }
df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
df.withColumn("replaced", replace($"id", "(\\d)" , "$1+1") ).show
+---+--------+
| id|replaced|
+---+--------+
| 0| 0+1|
| 1| 1+1|
| 2| 2+1|
| 3| 3+1|
| 4| 4+1|
| 5| 5+1|
| 6| 6+1|
| 7| 7+1|
| 8| 8+1|
| 9| 9+1|
+---+--------+
To do it with Spark SQL, you'll have to re-register the function in Hive under a different name:
sqlContext.sql("""
  create temporary function replace
  as 'org.apache.hadoop.hive.ql.udf.UDFRegExpReplace'
""")
sqlContext.sql(""" select replace("a,b,c", ",", ".") """).show
+-----+
| _c0|
+-----+
|a.b.c|
+-----+
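An alternative that avoids the Hive registration is to register a plain Scala UDF under the desired name. This is a minimal sketch, not from the original answer; it assumes the SparkSession is available as ss, as in the question, and that simple java.util.regex replaceAll semantics are enough for your patterns:
// Hedged sketch: a UDF named REPLACE that mirrors regexp_replace for simple cases.
// Note: recent Spark versions already ship a built-in replace() string function,
// so the name may clash there; pick another name if needed.
ss.udf.register("REPLACE", (text: String, pattern: String, replacement: String) =>
  if (text == null) null else text.replaceAll(pattern, replacement)
)
ss.sql(""" select REPLACE("a,b,c", ",", ".") """).show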

Related

How can I achieve the following Spark behaviour using the replaceWhere clause

I want to write data into Delta tables incrementally while replacing (overwriting) partitions already present in the sink. Example:
Consider this data already inside my Delta table, partitioned by the id column:
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id| x|
+---+---------+
| 2| NEW|
| 2|NEW AUSSI|
| 4| D|
| 5| E|
+---+---------+
The desired output is this
+---+---------+
| id| x|
+---+---------+
| 1| A|
| 2| NEW|
| 2|NEW AUSSI|
| 3| C|
| 4| D|
| 5| E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids = [x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
    df.filter(df.id == Id).write.format("delta").option("mergeSchema", "true").partitionBy("id") \
        .option("replaceWhere", "id == '$i'".format(i=Id)).mode("append") \
        .save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id| x|
+---+---------+
| 2| B|
| 1| A|
| 5| E|
| 2| NEW|
| 2|NEW AUSSI|
| 3| C|
| 4| D|
+---+---------+
As you can see, it appended all the values without replacing the partition id=2, which was already present in the table.
I think it is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want, please?
Thank you.
I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=Id))
with
.option("replaceWhere", "id == '{i}'".format(i=Id))
and it worked.
Thanks to @ggordon, who pointed out the error to me on another question.
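For reference, the per-id loop can usually be collapsed into a single write by building one replaceWhere predicate that covers every incoming partition. Below is a rough Scala sketch of that idea (the thread itself is PySpark); the paths are taken from the question, and the Delta table at the target path is assumed to already be partitioned by id:
// Rough sketch: overwrite only the id partitions present in the incoming data, in one write.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val targetPath = "/mnt/blob/datafinance/bronze/simba/test/res/"

val incoming = spark.read.format("csv")
  .option("sep", ";").option("header", "true")
  .load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")

// Build a predicate such as: id IN ('2', '4', '5')
val ids = incoming.select("id").distinct().collect().map(_.getString(0))
val predicate = ids.map(id => s"'$id'").mkString("id IN (", ", ", ")")

incoming.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("id")
  .option("replaceWhere", predicate) // only the matching id partitions are replaced
  .save(targetPath)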

Generate an ID in Spark Scala as per the below logic

I have a dataframe with parent_id, service_id, product_relation_id, and product_name fields as given below, and I want to assign an id field as shown in the table below.
Please note that
one parent_id has many service_ids
one service_id has many product_names
ID generation should follow the pattern below:
Parent -- 1.n
Child 1 -- 1.n.1
Child 2 -- 1.n.2
Child 3 -- 1.n.3
Child 4 -- 1.n.4
How do we implement this logic in a way that also takes performance on big data into account?
Scala Implementation
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val parentWindowSpec = Window.orderBy("parent_id")
val childWindowSpec = Window.partitionBy("parent_version", "service_id")
  .orderBy("product_relation_id")

val df = spark.read.options(
  Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true")
).csv("product.csv")

val df2 = df
  .withColumn("parent_version", dense_rank().over(parentWindowSpec))
  .withColumn("child_version", row_number().over(childWindowSpec) - 1)

val df3 = df2
  .withColumn("id",
    when(col("product_name") === lit("Parent"), concat(lit("1."), col("parent_version")))
      .otherwise(concat(lit("1."), col("parent_version"), lit("."), col("child_version"))))
  .drop("parent_version")
  .drop("child_version")
Output:
scala> df3.show
21/03/26 11:55:01 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---------+----------+-------------------+------------+-----+
|parent_id|service_id|product_relation_id|product_name| id|
+---------+----------+-------------------+------------+-----+
| 100| 1| 1-A| Parent| 1.1|
| 100| 1| 1-A| Child1|1.1.1|
| 100| 1| 1-A| Child2|1.1.2|
| 100| 1| 1-A| Child3|1.1.3|
| 100| 1| 1-A| Child4|1.1.4|
| 100| 2| 1-B| Parent| 1.1|
| 100| 2| 1-B| Child1|1.1.1|
| 100| 2| 1-B| Child2|1.1.2|
| 100| 2| 1-B| Child3|1.1.3|
| 100| 2| 1-B| Child4|1.1.4|
| 100| 3| 1-C| Parent| 1.1|
| 100| 3| 1-C| Child1|1.1.1|
| 100| 3| 1-C| Child2|1.1.2|
| 100| 3| 1-C| Child3|1.1.3|
| 100| 3| 1-C| Child4|1.1.4|
| 200| 5| 1-D| Parent| 1.2|
| 200| 5| 1-D| Child1|1.2.1|
| 200| 5| 1-D| Child2|1.2.2|
| 200| 5| 1-D| Child3|1.2.3|
| 200| 5| 1-D| Child4|1.2.4|
+---------+----------+-------------------+------------+-----+
only showing top 20 rows
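The WARN above comes from Window.orderBy("parent_id") having no partitionBy, which forces all rows onto a single partition. One possible workaround, sketched below and not part of the original answer, is to rank the (small) set of distinct parent_ids separately and join that lookup back, keeping the child window (which is partitioned) as is:
// Hedged sketch: derive parent_version without a global, unpartitioned window.
// Assumes `spark` and `df` from the answer above, with parent_id inferred as an integer column.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val parentVersions = df.select("parent_id").distinct()
  .orderBy("parent_id")
  .rdd.zipWithIndex()
  .map { case (row, idx) => (row.getInt(0), idx + 1) }
  .toDF("parent_id", "parent_version")

val withParentVersion = df.join(broadcast(parentVersions), Seq("parent_id"))
// childWindowSpec and the id expression from df2/df3 can then be applied to withParentVersion.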

Show full results for Spark streaming batch using console output format

For a Spark structured streaming read process:
sdf.writeStream
  .outputMode(outputMode)
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()
The format("console") sink is correctly writing its output, as shown:
Batch: 3
+----------+------+-------+-----------------+
|OnTimeRank|Origin|Carrier| OnTimePct|
+----------+------+-------+-----------------+
| 1| BWI| EV| 90.0|
| 2| BWI| US|88.54072251715655|
| 3| BWI| CO|88.52097130242826|
| 4| BWI| YV| 87.2168284789644|
| 5| BWI| DL|86.21888471700737|
| 6| BWI| NW|86.04866030181707|
| 7| BWI| 9E|85.83545377438507|
| 8| BWI| AA|85.71428571428571|
| 9| BWI| FL|83.25366684127816|
| 10| BWI| UA|81.32427843803056|
| 1| CMI| MQ|81.92159607980399|
| 1| IAH| NW| 91.6242895602752|
| 2| IAH| F9|88.62350722815839|
| 3| IAH| US|87.54764930114358|
| 4| IAH| 9E|84.33613445378151|
| 5| IAH| OO| 84.2836946277097|
| 6| IAH| DL|83.46420323325636|
| 7| IAH| UA|83.40671436433682|
| 8| IAH| XE|81.35189010909355|
| 9| IAH| OH|80.61558611656844|
+----------+------+-------+-----------------+
However, this is only a portion of the results. Is there an equivalent of dataframe.show(numRows, truncate) via an option setting, along the lines of .option("maxRows", 1000)?
sdf.writeStream
  .outputMode(outputMode)
  .format("console")
  .option("maxRows", 1000) // this is what I want, but not sure how to do it
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()
The option is called numRows, e.g. .option("numRows", 1000).
Source: https://github.com/apache/spark/blob/2a80a4cd39c7bcee44b6f6432769ca9fdba137e4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWrite.scala#L33
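Putting it together, a minimal sketch of the console sink with an explicit row limit (plus the related truncate option, which the same source file also reads) could look like this; sdf and outputMode are the names from the question:
import org.apache.spark.sql.streaming.Trigger

val query = sdf.writeStream
  .outputMode(outputMode)
  .format("console")
  .option("numRows", 1000)   // show up to 1000 rows per micro-batch
  .option("truncate", false) // don't truncate long cell values
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

query.awaitTermination()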

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
import pyspark.sql.functions as f
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all three data frames. I have done it like below.
full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer') \
    .select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer') \
    .select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want, but I want to simplify my code:
1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code while creating these data frames and want to reduce this.
2) I want to simplify the join statements to one statement.
How can I do this? Could anyone explain?
Here is one way, using when/otherwise to map the device column to categories and then pivoting it to the desired output:
import pyspark.sql.functions as F

df.withColumn(
    'cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+

Spark SQL to perform simple arithmetic with a constant

I'm trying to do an arithmetic operation with two operands: a constant literal and a Column. Is there an approach other than withColumn?
Let df be a dataframe:
+---+
| i|
+---+
| 1|
| 2|
| 3|
+---+
Then you can use select to add the result:
import org.apache.spark.sql.functions.lit
df
  .select($"i", ($"i" + lit(1)).as("j"))
  .show
+---+---+
| i| j|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
+---+---+
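If you prefer SQL-style expression strings, the same result can be had with selectExpr (or expr); a small sketch on the same df:
import org.apache.spark.sql.functions.expr

df.selectExpr("i", "i + 1 AS j").show
// or, mixing Column objects with an expression string:
df.select($"i", expr("i + 1").as("j")).show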
