How to rename an existing Spark SQL function - apache-spark

I am using Spark to call functions on data submitted by the user.
How can I rename an already existing function to a different name, for example REGEXP_REPLACE to REPLACE?
I tried the following code:
ss.udf.register("REPLACE", REGEXP_REPLACE) // This doesn't work
ss.udf.register("sum_in_all", sumInAll)
ss.udf.register("mod", mod)
ss.udf.register("average_in_all", averageInAll)

Import it with an alias:
import org.apache.spark.sql.functions.{regexp_replace => replace }
df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
df.withColumn("replaced", replace($"id", "(\\d)" , "$1+1") ).show
+---+--------+
| id|replaced|
+---+--------+
| 0| 0+1|
| 1| 1+1|
| 2| 2+1|
| 3| 3+1|
| 4| 4+1|
| 5| 5+1|
| 6| 6+1|
| 7| 7+1|
| 8| 8+1|
| 9| 9+1|
+---+--------+
To do it with Spark SQL, you'll have to re-register the function in Hive under a different name:
sqlContext.sql("""
  create temporary function replace
  as 'org.apache.hadoop.hive.ql.udf.UDFRegExpReplace'
""")
sqlContext.sql(""" select replace("a,b,c", ",", ".") """).show
+-----+
| _c0|
+-----+
|a.b.c|
+-----+
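An alternative that avoids the Hive registration is to register a plain Scala UDF under the desired name. This is a minimal sketch, not from the original answer; it assumes the SparkSession is available as ss, as in the question, and that simple java.util.regex replaceAll semantics are enough for your patterns:
// Hedged sketch: a UDF named REPLACE that mirrors regexp_replace for simple cases.
// Note: recent Spark versions already ship a built-in replace() string function,
// so the name may clash there; pick another name if needed.
ss.udf.register("REPLACE", (text: String, pattern: String, replacement: String) =>
  if (text == null) null else text.replaceAll(pattern, replacement)
)
ss.sql(""" select REPLACE("a,b,c", ",", ".") """).show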

Related

How can I achieve the following Spark behaviour using the replaceWhere clause

I want to write data into Delta tables incrementally while replacing (overwriting) partitions already present in the sink. Example:
Consider this data already inside my Delta table, partitioned by the id column:
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id| x|
+---+---------+
| 2| NEW|
| 2|NEW AUSSI|
| 4| D|
| 5| E|
+---+---------+
The desired output is this
+---+---------+
| id| x|
+---+---------+
| 1| A|
| 2| NEW|
| 2|NEW AUSSI|
| 3| C|
| 4| D|
| 5| E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids = [x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
    df.filter(df.id == Id).write.format("delta").option("mergeSchema", "true").partitionBy("id") \
        .option("replaceWhere", "id == '$i'".format(i=Id)).mode("append") \
        .save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id| x|
+---+---------+
| 2| B|
| 1| A|
| 5| E|
| 2| NEW|
| 2|NEW AUSSI|
| 3| C|
| 4| D|
+---+---------+
As you can see, it appended all the values without replacing the partition id=2, which was already present in the table.
I think it is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want, please?
Thank you.
I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=Id))
with
.option("replaceWhere", "id == '{i}'".format(i=Id))
and it worked.
Thanks to @ggordon, who pointed out the error to me on another question.
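For reference, the per-id loop can usually be collapsed into a single write by building one replaceWhere predicate that covers every incoming partition. Below is a rough Scala sketch of that idea (the thread itself is PySpark); the paths are taken from the question, and the Delta table at the target path is assumed to already be partitioned by id:
// Rough sketch: overwrite only the id partitions present in the incoming data, in one write.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val targetPath = "/mnt/blob/datafinance/bronze/simba/test/res/"

val incoming = spark.read.format("csv")
  .option("sep", ";").option("header", "true")
  .load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")

// Build a predicate such as: id IN ('2', '4', '5')
val ids = incoming.select("id").distinct().collect().map(_.getString(0))
val predicate = ids.map(id => s"'$id'").mkString("id IN (", ", ", ")")

incoming.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("id")
  .option("replaceWhere", predicate) // only the matching id partitions are replaced
  .save(targetPath)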

Generate an ID in Spark Scala as per the below logic

I have a dataframe with parent_id, service_id, product_relation_id, and product_name fields as given below, and I want to assign an id field as shown in the table below.
Please note that
one parent_id has many service_ids
one service_id has many product_names
ID generation should follow the pattern below:
Parent -- 1.n
Child 1 -- 1.n.1
Child 2 -- 1.n.2
Child 3 -- 1.n.3
Child 4 -- 1.n.4
How do we implement this logic in a way that also takes performance on big data into account?
Scala Implementation
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val parentWindowSpec = Window.orderBy("parent_id")
val childWindowSpec = Window.partitionBy("parent_version", "service_id")
  .orderBy("product_relation_id")

val df = spark.read.options(
  Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true")
).csv("product.csv")

val df2 = df
  .withColumn("parent_version", dense_rank().over(parentWindowSpec))
  .withColumn("child_version", row_number().over(childWindowSpec) - 1)

val df3 = df2
  .withColumn("id",
    when(col("product_name") === lit("Parent"), concat(lit("1."), col("parent_version")))
      .otherwise(concat(lit("1."), col("parent_version"), lit("."), col("child_version"))))
  .drop("parent_version")
  .drop("child_version")
Output:
scala> df3.show
21/03/26 11:55:01 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---------+----------+-------------------+------------+-----+
|parent_id|service_id|product_relation_id|product_name| id|
+---------+----------+-------------------+------------+-----+
| 100| 1| 1-A| Parent| 1.1|
| 100| 1| 1-A| Child1|1.1.1|
| 100| 1| 1-A| Child2|1.1.2|
| 100| 1| 1-A| Child3|1.1.3|
| 100| 1| 1-A| Child4|1.1.4|
| 100| 2| 1-B| Parent| 1.1|
| 100| 2| 1-B| Child1|1.1.1|
| 100| 2| 1-B| Child2|1.1.2|
| 100| 2| 1-B| Child3|1.1.3|
| 100| 2| 1-B| Child4|1.1.4|
| 100| 3| 1-C| Parent| 1.1|
| 100| 3| 1-C| Child1|1.1.1|
| 100| 3| 1-C| Child2|1.1.2|
| 100| 3| 1-C| Child3|1.1.3|
| 100| 3| 1-C| Child4|1.1.4|
| 200| 5| 1-D| Parent| 1.2|
| 200| 5| 1-D| Child1|1.2.1|
| 200| 5| 1-D| Child2|1.2.2|
| 200| 5| 1-D| Child3|1.2.3|
| 200| 5| 1-D| Child4|1.2.4|
+---------+----------+-------------------+------------+-----+
only showing top 20 rows
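The WARN above comes from Window.orderBy("parent_id") having no partitionBy, which forces all rows onto a single partition. One possible workaround, sketched below and not part of the original answer, is to rank the (small) set of distinct parent_ids separately and join that lookup back, keeping the child window (which is partitioned) as is:
// Hedged sketch: derive parent_version without a global, unpartitioned window.
// Assumes `spark` and `df` from the answer above, with parent_id inferred as an integer column.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val parentVersions = df.select("parent_id").distinct()
  .orderBy("parent_id")
  .rdd.zipWithIndex()
  .map { case (row, idx) => (row.getInt(0), idx + 1) }
  .toDF("parent_id", "parent_version")

val withParentVersion = df.join(broadcast(parentVersions), Seq("parent_id"))
// childWindowSpec and the id expression from df2/df3 can then be applied to withParentVersion.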

Show full results for Spark streaming batch using console output format

For a Spark structured streaming read process:
sdf.writeStream
  .outputMode(outputMode)
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()
The format("console") sink is correctly writing its output, as shown:
Batch: 3
+----------+------+-------+-----------------+
|OnTimeRank|Origin|Carrier| OnTimePct|
+----------+------+-------+-----------------+
| 1| BWI| EV| 90.0|
| 2| BWI| US|88.54072251715655|
| 3| BWI| CO|88.52097130242826|
| 4| BWI| YV| 87.2168284789644|
| 5| BWI| DL|86.21888471700737|
| 6| BWI| NW|86.04866030181707|
| 7| BWI| 9E|85.83545377438507|
| 8| BWI| AA|85.71428571428571|
| 9| BWI| FL|83.25366684127816|
| 10| BWI| UA|81.32427843803056|
| 1| CMI| MQ|81.92159607980399|
| 1| IAH| NW| 91.6242895602752|
| 2| IAH| F9|88.62350722815839|
| 3| IAH| US|87.54764930114358|
| 4| IAH| 9E|84.33613445378151|
| 5| IAH| OO| 84.2836946277097|
| 6| IAH| DL|83.46420323325636|
| 7| IAH| UA|83.40671436433682|
| 8| IAH| XE|81.35189010909355|
| 9| IAH| OH|80.61558611656844|
+----------+------+-------+-----------------+
However, this is only a portion of the results. Is there an equivalent of dataframe.show(numRows, truncate) via an option setting, along the lines of .option("maxRows", 1000)?
sdf.writeStream
  .outputMode(outputMode)
  .format("console")
  .option("maxRows", 1000) // this is what I want, but not sure how to do it
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()
The option is called numRows, e.g. .option("numRows", 1000).
Source: https://github.com/apache/spark/blob/2a80a4cd39c7bcee44b6f6432769ca9fdba137e4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ConsoleWrite.scala#L33
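Putting it together, a minimal sketch of the console sink with an explicit row limit (plus the related truncate option, which the same source file also reads) could look like this; sdf and outputMode are the names from the question:
import org.apache.spark.sql.streaming.Trigger

val query = sdf.writeStream
  .outputMode(outputMode)
  .format("console")
  .option("numRows", 1000)   // show up to 1000 rows per micro-batch
  .option("truncate", false) // don't truncate long cell values
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

query.awaitTermination()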

Simplify code and reduce join statements in pyspark data frames

I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
import pyspark.sql.functions as f
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
Then I want to do a full outer join on all three data frames. I have done it like below.
full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer') \
    .select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer') \
    .select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
I am able to get what I want, but I want to simplify my code:
1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code while creating these data frames and want to reduce this.
2) I want to simplify the join statements to one statement.
How can I do this? Could anyone explain?
Here is one way, using when/otherwise to map the device column to categories and then pivoting it to the desired output:
import pyspark.sql.functions as F

df.withColumn(
    'cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+

Spark SQL to perform simple arithmetic with a constant

I'm trying to do an arithmetic operation with two operands: a constant literal and a Column. Is there an approach other than withColumn?
Let df be a dataframe:
+---+
| i|
+---+
| 1|
| 2|
| 3|
+---+
Then you can use select to add the result:
import org.apache.spark.sql.functions.lit
df
  .select($"i", ($"i" + lit(1)).as("j"))
  .show
+---+---+
| i| j|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
+---+---+
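If you prefer SQL-style expression strings, the same result can be had with selectExpr (or expr); a small sketch on the same df:
import org.apache.spark.sql.functions.expr

df.selectExpr("i", "i + 1 AS j").show
// or, mixing Column objects with an expression string:
df.select($"i", expr("i + 1").as("j")).show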
