Optimize Spark DataFrame operation - apache-spark

I have a Spark (version 2.4) DataFrame of the following pattern.
+----------+
| ColumnA |
+----------+
| 1000#Cat |
| 1001#Dog |
| 1000#Cat |
| 1001#Dog |
| 1001#Dog |
+----------+
I am conditionally removing the numeric prefix from the string with a regex, using the following code:
dataset.withColumn("ColumnA",when(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)
.equalTo(""), dataset.col("ColumnA"))
.otherwise(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)));
which results in a DataFrame of the following format:
+---------+
| ColumnA |
+---------+
| Cat |
| Dog |
| Cat |
| Dog |
| Dog |
+---------+
This runs correctly and produces the desired output.
However, the regexp_extract operation is applied twice: once to check whether the extracted string is empty, and if it is not, again to extract the value for the column.
Is there any optimization that can be done on this code to make it perform better?

Use the split function instead of regexp_extract.
Please check the code below, along with execution times:
scala> df.show(false)
+--------+
|columna |
+--------+
|1000#Cat|
|1001#Dog|
|1000#Cat|
|1001#Dog|
|1001#Dog|
+--------+
scala> spark.time(df.withColumn("parsed",split($"columna","#")(1)).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
scala> spark.time { df.withColumn("ColumnA",when(regexp_extract($"columna", "\\#(.*)", 1).equalTo(""), $"columna").otherwise(regexp_extract($"columna", "\\#(.*)", 1))).show(false) }
+-------+
|ColumnA|
+-------+
|Cat |
|Dog |
|Cat |
|Dog |
|Dog |
+-------+
Time taken: 22 ms
Use the contains function to check for a # value in the column:
scala> spark.time(df.withColumn("parsed",when($"columna".contains("#"), lit(split($"columna","#")(1))).otherwise("")).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
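If you want to avoid both the double regexp_extract and the when/otherwise branch entirely, another option (a sketch only, not benchmarked here) is to strip the numeric prefix with regexp_replace; rows without a # do not match the pattern and are left unchanged:
import org.apache.spark.sql.functions.regexp_replace

// Sketch: remove everything up to and including the first '#'.
// Rows that contain no '#' stay as they are, so no when/otherwise is needed.
val cleaned = df.withColumn("parsed", regexp_replace($"columna", "^[^#]*#", ""))
cleaned.show(false)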

Related

Error while querying a Hive table with map datatype in Spark SQL, but works when executed in HiveQL

I have a Hive table with the structure below:
+---------------+--------------+----------------------+
| column_value | metric_name | key |
+---------------+--------------+----------------------+
| A37B | Mean | {0:"202006",1:"1"} |
| ACCOUNT_ID | Mean | {0:"202006",1:"2"} |
| ANB_200 | Mean | {0:"202006",1:"3"} |
| ANB_201 | Mean | {0:"202006",1:"4"} |
| AS82_RE | Mean | {0:"202006",1:"5"} |
| ATTR001 | Mean | {0:"202007",1:"2"} |
| ATTR001_RE | Mean | {0:"202007",1:"3"} |
| ATTR002 | Mean | {0:"202007",1:"4"} |
| ATTR002_RE | Mean | {0:"202007",1:"5"} |
| ATTR003 | Mean | {0:"202008",1:"3"} |
| ATTR004 | Mean | {0:"202008",1:"4"} |
| ATTR005 | Mean | {0:"202008",1:"5"} |
| ATTR006 | Mean | {0:"202009",1:"4"} |
| ATTR006 | Mean | {0:"202009",1:"5"} |
I need to write a Spark SQL query that filters on the key column with a NOT IN condition on the combination of both keys.
The following query works fine in HiveQL through Beeline:
select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );
But when I try the same query in Spark SQL, I get the following error:
cannot resolve due to data type mismatch: map<int,string>
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
Please help!
I got the answer from a different question I had raised earlier. This query works fine:
select * from your_data where key[0] between 202006 and 202009 and NOT (key[0]="202009" and key[1]="5" );
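The underlying issue, as I read it, is that Spark SQL cannot resolve NOT IN against a map literal because map<int,string> values do not support that comparison, so filtering on the individual entries sidesteps the problem. A minimal sketch of running the working filter from Scala, assuming the table is registered as your_data with a map column named key:
// Sketch: compare individual map entries instead of whole map values.
val filtered = spark.sql("""
  SELECT *
  FROM your_data
  WHERE key[0] BETWEEN '202006' AND '202009'
    AND NOT (key[0] = '202009' AND key[1] = '5')
""")
filtered.show(false)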

Perform NGram on Spark DataFrame

I'm using Spark 2.3.1 and I have a Spark DataFrame like this:
+----------+
| values|
+----------+
|embodiment|
| present|
| invention|
| include|
| pairing|
| two|
| wireless|
| device|
| placing|
| least|
| one|
| two|
+----------+
I want to apply the Spark ML NGram feature like this:
bigram = NGram(n=2, inputCol="values", outputCol="bigrams")
bigramDataFrame = bigram.transform(tokenized_df)
The following error occurred on the line bigramDataFrame = bigram.transform(tokenized_df):
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
So I changed my code:
df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))
bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")
bigramDataFrame = bigram.transform(df_new)
bigramDataFrame.show()
The resulting DataFrame is as follows:
+----------+------------+-------+
| values| testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]| []|
| present| [present]| []|
| invention| [invention]| []|
| include| [include]| []|
| pairing| [pairing]| []|
| two| [two]| []|
| wireless| [wireless]| []|
| device| [device]| []|
| placing| [placing]| []|
| least| [least]| []|
| one| [one]| []|
| two| [two]| []|
+----------+------------+-------+
Why is my bigrams column empty?
I want the output for the bigrams column to be as follows:
+------------------+
|bigrams           |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless device   |
|device placing    |
|placing least     |
|least one         |
|one two           |
+------------------+
Your bi-gram column value is empty because there are no bi-grams in each row of your 'values' column.
If the values in your input DataFrame look like this:
+--------------------------------------------+
|values |
+--------------------------------------------+
|embodiment present invention include pairing|
|two wireless device placing |
|least one two |
+--------------------------------------------+
Then you can get the output in bi-grams as below:
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|values |testing |ngrams |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing |[two, wireless, device, placing] |[two wireless, wireless device, device placing] |
|least one two |[least, one, two] |[least one, one two] |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
The Scala Spark code to do this is:
val df_new = df.withColumn("testing", split(df("values")," "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
A bi-gram is a sequence of two adjacent elements from a string of
tokens, which are typically letters, syllables, or words.
But in your input DataFrame you have only one token in each row, hence you are not getting any bi-grams out of it.
So, for your question, you can do something like this:
Input: df1
+----------+
|values    |
+----------+
|embodiment|
|present   |
|invention |
|include   |
|pairing   |
|two       |
|wireless  |
|device    |
|placing   |
|least     |
|one       |
|two       |
+----------+
Output: ngramDataFrameInRows
+------------------+
|ngrams            |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless device   |
|device placing    |
|placing least     |
|least one         |
|one two           |
+------------------+
Spark scala code:
val df_new=df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
val ngramDataFrameInRows=ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))
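One caveat worth adding (my note, not part of the answer above): collect_list does not guarantee that values are gathered in their original row order once the data has been shuffled. A sketch that pins the order explicitly, assuming monotonically_increasing_id reflects the input order closely enough for this purpose:
import org.apache.spark.sql.functions._

// Sketch: tag each row with an id, collect (id, value) structs,
// sort by id, then keep only the values before applying NGram as above.
val orderedDf = df1
  .withColumn("id", monotonically_increasing_id())
  .agg(sort_array(collect_list(struct(col("id"), col("values")))).alias("pairs"))
  .select(col("pairs.values").alias("testing"))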

Sorting a dataset with concatenated string by splitting as well as a single string for column in spark

I have a column that looks as below. Its values are concatenated with :, some may be null, and some contain the string N/A.
Sample data:
+--------+
| column |
+--------+
| 11:88 |
+--------+
| 1:45 |
+--------+
| 17:456 |
+--------+
| 2:45 |
+--------+
| 11:9 |
+--------+
| N/A |
+--------+
I want to order it ascending based on the first element before :; nulls should come first, then N/A, followed by the numerical order of the first element, as below.
Required data after sorting in ascending order:
+--------+
| column |
+--------+
| N/A |
+--------+
| 1:45 |
+--------+
| 2:45 |
+--------+
| 11:88 |
+--------+
| 11:9 |
+--------+
| 17:456 |
+--------+
Similarly, in descending order:
+--------+
| column |
+--------+
| 17:456 |
+--------+
| 11:9 |
+--------+
| 11:88 |
+--------+
| 2:45 |
+--------+
| 1:45 |
+--------+
| N/A |
+--------+
scala> val df3 = Seq("11:88","1:45","17:456","2:45","11:9","N/A").toDF("column")
scala> df3.show
+------+
|column|
+------+
| 11:88|
| 1:45|
|17:456|
| 2:45|
| 11:9|
| N/A|
+------+
scala> df3.withColumn("NEW",regexp_replace(col("column"),":",".").cast("Double")).orderBy(col("NEW").desc).drop("NEW").show
+------+
|column|
+------+
|17:456|
| 11:9|
| 11:88|
| 2:45|
| 1:45|
| N/A|
+------+
scala> df3.withColumn("NEW",regexp_replace(col("column"),":",".").cast("Double")).orderBy(col("NEW").asc).drop("NEW").show
+------+
|column|
+------+
| N/A|
| 1:45|
| 2:45|
| 11:9|
| 11:88|
|17:456|
+------+
Please let me know if it helps you.
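An alternative sketch, assuming the intended order is numeric on both parts (which is my reading, not stated above), is to sort on the split elements themselves cast to int; values such as N/A yield null after the cast and therefore sort first in ascending order and last in descending order:
import org.apache.spark.sql.functions._

// Sketch: split on ':' and order numerically by each part.
// "N/A" casts to null and sorts first ascending, last descending.
df3.orderBy(
  split(col("column"), ":")(0).cast("int").asc,
  split(col("column"), ":")(1).cast("int").asc
).show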

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word |
+-------+
| chair |
| lamp |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
+--------+
Is there a good way to do this WITHOUT using UDFs or functional-programming methods such as flatMap in Spark SQL? (I'm talking about a solution using the codegen-optimized functions in org.apache.spark.sql.functions._.)
Technically it is possible but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))
.select($"word".substr(lit(1), $"len") as "prefix")
.show()
Output:
+------+
|prefix|
+------+
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|
+------+
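For comparison, here is a minimal flatMap sketch of the same expansion (the approach the answer suspects would perform about the same); it assumes a SparkSession named spark with its implicits in scope:
import spark.implicits._

// Sketch: typed flatMap that emits every prefix of each word.
val prefixes = df
  .select($"word").as[String]
  .flatMap(word => (1 to word.length).map(word.substring(0, _)))
  .toDF("prefix")

prefixes.show()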

'where' in apache spark

df:
+-----------+
|       word|
+-----------+
|       1609|
|           |
|        the|
|    sonnets|
|           |
|         by|
|    william|
|shakespeare|
|           |
|         fg|
This is my DataFrame. How do I remove the empty rows (the rows that contain '') using the where clause?
code:
df.where(trim(df.word) == "").show()
output:
+----+
|word|
+----+
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
|    |
Any help is appreciated.
You can trim and check whether the result is empty:
>>> from pyspark.sql.functions import trim
>>> df.where(trim(df.word) != "")
Apart from where, you can also use filter to achieve this.
from pyspark.sql.functions import trim
df.filter(trim(df.word) != "").show()
df.where(trim(df.word) != "").show()
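For completeness, a hedged Scala equivalent of the same filter; in the DataFrame API, where and filter are interchangeable:
import org.apache.spark.sql.functions.{col, trim}

// Sketch: keep only the rows whose trimmed value is non-empty;
// filter(...) with the same predicate behaves identically.
val nonEmpty = df.where(trim(col("word")) =!= "")
nonEmpty.show()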
