Replace multiple blanks with a single blank in Spark SQL - apache-spark

I have a DataFrame created with HiveContext where one of the columns holds records like:
text1    text2
We want the spaces between the two texts to be replaced with a single space, to get the final output:
text1 text2
How can we achieve that in Spark SQL? Note we are using HiveContext, registering a temp table and writing SQL queries over it.

Even better now that I have been enlightened by a real expert - it's simpler, in fact:
import org.apache.spark.sql.functions._
// val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
val myUDf = udf((s:String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)
// Then there is no need to play with columns excessively:
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df.withColumn("new_val", myUDf(col("val")))
new_df.show

import org.apache.spark.sql.functions._
val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
//error: object java.lang.String is not a value --> use Array
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df
.withColumn("udfResult",myUDf(col("val")))
.withColumn("new_val", col("udfResult")(0))
.drop("udfResult")
new_df.show
Output on Databricks
+--------------------+
| val|
+--------------------+
| i like cheese|
| the dog runs |
|text111111 text...|
+--------------------+
+--------------------+--------------------+
| val| new_val|
+--------------------+--------------------+
| i like cheese| i like cheese|
| the dog runs | the dog runs|
|text111111 text...|text111111 text22...|
+--------------------+--------------------+

Just do it in Spark SQL:
regexp_replace(COLUMN, ' +', ' ')
https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace
Check it:
spark.sql("""
  select regexp_replace(col1, ' +', ' ') as col2
  from (
    select 'text1   text2  text3' as col1
  )
""").show(20, False)
output
+-----------------+
|col2 |
+-----------------+
|text1 text2 text3|
+-----------------+
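Since the question specifically mentions registering a temp table and querying it, here is a minimal sketch of that approach. It is an assumption on my part, not from the answers above: the view name my_table and the column name val are made up, and createOrReplaceTempView is the modern replacement for registerTempTable from the HiveContext days.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Hypothetical data; "val" is the column holding the multi-space text
val df = Seq("text1    text2").toDF("val")

// Register a temp view (registerTempTable in the old HiveContext API)
df.createOrReplaceTempView("my_table")

// Collapse runs of spaces into a single space in plain SQL
spark.sql("SELECT regexp_replace(val, ' +', ' ') AS val FROM my_table").show(false)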

Related

Not able to split the column into multiple columns in Spark Dataframe

Not able to split the column into multiple columns in a Spark DataFrame or through an RDD.
I tried some other code but it only works with a fixed number of columns.
Ex:
Data types are name: string, city: list(string)
I have a text file and the input data is like below:
Name, city
A, (hyd,che,pune)
B, (che,bang,del)
C, (hyd)
Required output is:
A,hyd
A,che
A,pune
B,che
B,bang
B,del
C,hyd
After reading the text file and converting it to a DataFrame, it will look like below:
scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+
You can use the explode function on your DataFrame:
val explodeDF = inputDF.withColumn("city", explode($"city")).show()
http://sqlandhadoop.com/spark-dataframe-explode/
Now that I understand you're loading the full line as a string, here is how to achieve your output.
I have defined two user-defined functions:
import org.apache.spark.sql.functions._

val split_to_two_strings: String => Array[String] = _.split(",", 2) // split the line into two elements, which become the two columns (name, city list)
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",") // strip ( and ) then split into a list of cities

val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)
val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
.select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
.withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
.drop($"city_list")
.withColumn("city", explode($"city_array"))
.drop($"city_array")
outputDF.show()
Hope this helps
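As a side note, here is a minimal UDF-free sketch of the same idea using only built-in functions. It is my own variant rather than part of the original answer, and it assumes the same inputDF with a single string column named value, plus spark.implicits._ in scope:
import org.apache.spark.sql.functions._

val outputNoUdfDF = inputDF
  .filter(!$"value".startsWith("Name"))                               // drop the header line
  .withColumn("Name", trim(regexp_extract($"value", "^([^,]+),", 1))) // text before the first comma
  .withColumn("cities", regexp_extract($"value", "\\(([^)]*)\\)", 1)) // text inside the parentheses
  .withColumn("city", explode(split($"cities", ",")))                 // one row per city
  .select("Name", "city")

outputNoUdfDF.show()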

Spark aggregate rows with custom function

To make it simple, let's assume we have a dataframe containing the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|info1 |info2 |
|firstName1|lastName1|myInfo1 |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2 |
+----------+---------+----------+----------+
How can I merge all rows, grouping by (firstName, lastName), and keep in the Phone and Address columns only the data starting with "my", to get the following:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1 |myInfo2 |
+----------+---------+----------+----------+
Maybe I should use the agg function with a custom UDAF? But how can I implement it?
Note: I'm using Spark 2.2 along with Scala 2.11.
You can use groupBy with the collect_set aggregation function and a udf to pick the first string that starts with "my":
import org.apache.spark.sql.functions._
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)
df.groupBy("firstName ", "lastName")
.agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
.show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
I hope the answer is helpful
If only two columns are involved, filtering and a join can be used instead of a UDF:
val df = List(
("firstName1", "lastName1", "info1", "info2"),
("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
.select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)
Output:
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
For many columns, when only one row is expected per group, a construction like this can be used:
val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)
Output is the same.
Example of a UDF with two parameters:
val twoParamFunc = (firstName: String, Phone: String) => firstName + ": " + Phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)
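For completeness, here is a rough SQL-only sketch of the same aggregation. This is my own variant, not part of the answers above, and it assumes the DataFrame is registered under the hypothetical view name people:
df.createOrReplaceTempView("people")
spark.sql("""
  SELECT firstName, lastName,
         max(CASE WHEN Phone   LIKE 'my%' THEN Phone   END) AS Phone,
         max(CASE WHEN Address LIKE 'my%' THEN Address END) AS Address
  FROM people
  GROUP BY firstName, lastName
""").show(false)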

How to access an array using foreach in spark?

I have data like below :
tab1,c1|c2|c3
tab2,d1|d2|d3|d4|d5
tab3,e1|e2|e3|e4
I need to convert it to the below in Spark:
select c1,c2,c3 from tab1;
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
I am able to get this far:
d.foreach(f=>{println("select"+" "+f+" from"+";")})
select tab3,e1,e2,e3,e4 from;
select tab1,c1,c2,c3 from;
select tab2,d1,d2,d3,d4,d5 from;
Can anyone suggest a fix?
I'm not seeing where Spark fits in your question. What does the variable 'd' represent?
Here is my guess at something that may be helpful:
from pyspark.sql.types import *
from pyspark.sql.functions import *
mySchema = StructType([
    StructField("table_name", StringType()),
    StructField("column_name", ArrayType(StringType()))
])

df = spark.createDataFrame([
    ("tab1", ["c1", "c2", "c3"]),
    ("tab2", ["d1", "d2", "d3", "d4", "d5"]),
    ("tab3", ["e1", "e2", "e3", "e4"])
], schema=mySchema)
df.selectExpr('concat("select ", concat_ws(",", column_name), " from ", table_name, ";") as select_string').show(3, False)
Output:
+--------------------------------+
|select_string |
+--------------------------------+
|select c1,c2,c3 from tab1; |
|select d1,d2,d3,d4,d5 from tab2;|
|select e1,e2,e3,e4 from tab3; |
+--------------------------------+
You can also use a map operation on an RDD.
Assuming you have an RDD of Strings like:
val rdd = spark.sparkContext.parallelize(Seq("tab1,c1|c2|c3", "tab2,d1|d2|d3|d4|d5", "tab3,e1|e2|e3|e4"))
with this operation:
val select = rdd.map(str => {
  val separated = str.split(",", -1)
  val table = separated(0)
  val cols = separated(1).split("\\|", -1).mkString(",")
  "select " + cols + " from " + table + ";"
})
you will get the expected result:
select.foreach(println(_))
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
select c1,c2,c3 from tab1;
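If the data is coming straight from a text file, a DataFrame-only sketch could look like the following. This is my own variant, not from the answers above; the path tables.txt is hypothetical and spark is assumed to be a SparkSession:
import org.apache.spark.sql.functions._
import spark.implicits._

// Each line looks like "tab1,c1|c2|c3"; textFile gives a Dataset[String] with a single "value" column
val lines = spark.read.textFile("tables.txt")

val selects = lines.select(
  concat(
    lit("select "),
    concat_ws(",", split(split($"value", ",").getItem(1), "\\|")), // pipe-separated columns -> comma-separated
    lit(" from "),
    split($"value", ",").getItem(0),                               // table name before the comma
    lit(";")
  ).as("select_string")
)

selects.show(false)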

Filter is not working as expected when it is applied on a DF (which is a union of 2 DFs) in Spark

Data Frame a:
SN Hash_id Name
111 11ww11 Airtel
222 null Idea
Data Frame b:
SN Hash_id Name
333 null BSNL
444 22ee11 Vodafone
Performing a UnionAll on these dataframes by column name as below:
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).unionAll(b.select(columns: _*))
}
The result is: Data Frame c
SN Hash_id Name
111 11ww11 Airtel
222 null Idea
333 null BSNL
444 22ee11 Vodafone
Performing a filter on Data Frame c:
val withHashDF = c.where(c("Hash_id").isNotNull)
val withoutHashDF = c.where(c("Hash_id").isNull)
The result for withHashDF is: it returns only the row from Data Frame a
111 11ww11 Airtel
The record from Data Frame b where the hash id is present is missing:
444 22ee11 Vodafone
The result for withoutHashDF is:
222 null Idea
BSNL 333 null
null 222 Idea
In this DF the column values do not line up with the column names, and the count is 3 where it should be only 2; the row from Data Frame "a" is repeated.
Look at unionByName - there is a small change needed in how the columns are collected.
Change
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
to
val columns = a.columns.intersect(b.columns).map(row => new Column(row)).toSeq
and then it should work as expected. (toSet does not preserve the original column order, whereas intersect on the arrays keeps the columns in their original order, so the positional union lines up the way you expect.)
Look at the complete code snippet & outcome below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
val dataFrameA = Seq(("111", "11ww11", "Airtel"),("222", null, "Idea")).toDF("SN","Hash_id", "Name")
val dataFrameB = Seq(("333", null, "BSNL"),("444", "22ee11", "Vodafone")).toDF("SN","Hash_id", "Name")
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.intersect(b.columns).map(row => new Column(row)).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
val dataFrameC = unionByName(dataFrameA, dataFrameB)
val withHashDF = dataFrameC.where(dataFrameC("Hash_id").isNotNull)
val withoutHashDF = dataFrameC.where(dataFrameC("Hash_id").isNull)
println("dataFrameC")
dataFrameC.show()
println("withHashDF")
withHashDF.show
println("withoutHashDF")
withoutHashDF.show
output:
dataFrameC
+---+-------+--------+
| SN|Hash_id| Name|
+---+-------+--------+
|111| 11ww11| Airtel|
|222| null| Idea|
|333| null| BSNL|
|444| 22ee11|Vodafone|
+---+-------+--------+
withHashDF
+---+-------+--------+
| SN|Hash_id| Name|
+---+-------+--------+
|111| 11ww11| Airtel|
|444| 22ee11|Vodafone|
+---+-------+--------+
withoutHashDF
+---+-------+----+
| SN|Hash_id|Name|
+---+-------+----+
|222| null|Idea|
|333| null|BSNL|
+---+-------+----+
If there are duplicates in the DataFrame after unionAll, a filter or where clause can give unexpected results. Once you eliminate the duplicates by using the distinct method, the results are as expected.
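As an aside, on Spark 2.3 and later the Dataset API has its own built-in unionByName, which resolves columns by name rather than by position, so the hand-rolled helper is not needed there. A minimal sketch, assuming the same dataFrameA and dataFrameB as above:
// Built-in since Spark 2.3: columns are matched by name, not position
val dataFrameC = dataFrameA.unionByName(dataFrameB)

dataFrameC.where(dataFrameC("Hash_id").isNotNull).show()
dataFrameC.where(dataFrameC("Hash_id").isNull).show()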

SPARK DataFrame: Remove MAX value in a group

My data is like:
id | val
----------------
a1 | 10
a1 | 20
a2 | 5
a2 | 7
a2 | 2
I am trying to delete the row that has MAX(val) in each group when grouping on "id".
Result should be like:
id | val
----------------
a1 | 10
a2 | 5
a2 | 2
I am using SPARK DataFrame and SQLContext. I need some way like:
DataFrame df = sqlContext.sql("SELECT * FROM jsontable WHERE (id, val) NOT IN (SELECT id, MAX(val) FROM jsontable GROUP BY id)");
How can I do that?
You can do that using DataFrame operations and window functions. Assuming you have your data in the DataFrame df1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val maxOnWindow = max(col("val")).over(Window.partitionBy(col("id")))
val df2 = df1
.withColumn("max", maxOnWindow)
.where(col("val") < col("max"))
.select("id", "val")
In Java, the equivalent would be something like:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;
Column maxOnWindow = max(col("val")).over(Window.partitionBy("id"));
DataFrame df2 = df1
.withColumn("max", maxOnWindow)
.where(col("val").lt(col("max")))
.select("id", "val");
Here's a nice article about window functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Below is the Java implementation of Mario's Scala code:
DataFrame df = sqlContext.read().json(input);
DataFrame dfMaxRaw = df.groupBy("id").max("val");
DataFrame dfMax = dfMaxRaw.select(
dfMaxRaw.col("id").as("max_id"), dfMaxRaw.col("max(val)").as("max_val")
);
DataFrame combineMaxWithData = df.join(dfMax, df.col("id")
.equalTo(dfMax.col("max_id")));
DataFrame finalResult = combineMaxWithData.filter(
combineMaxWithData.col("id").equalTo(combineMaxWithData.col("max_id"))
.and(combineMaxWithData.col("val").notEqual(combineMaxWithData.col("max_val")))
);
Here is how to do this using an RDD and a more Scala-flavored approach:
// Let's first get the data in key-value pair format
val data = sc.makeRDD( Seq( ("a",20), ("a", 1), ("a",8), ("b",3), ("b",10), ("b",9) ) )
// Next let's find the max value from each group
val maxGroups = data.reduceByKey( Math.max(_,_) )
// We join the max in the group with the original data
val combineMaxWithData = maxGroups.join(data)
// Finally we filter out the values that agree with the max
val finalResults = combineMaxWithData.filter{ case (gid, (max,curVal)) => max != curVal }.map{ case (gid, (max,curVal)) => (gid,curVal) }
println( finalResults.collect.toList )
>List((a,1), (a,8), (b,3), (b,9))
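Since the question asked for a SQL-style solution, here is a rough sketch of the same idea expressed directly in Spark SQL. It is my own addition, assuming the data is registered as a temp view named jsontable as in the question (createOrReplaceTempView replaces the older registerTempTable):
df.createOrReplaceTempView("jsontable")
spark.sql("""
  SELECT id, val
  FROM (
    SELECT id, val, max(val) OVER (PARTITION BY id) AS max_val
    FROM jsontable
  ) t
  WHERE val < max_val
""").show()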
