Complex file parsing in Spark 2.4 - apache-spark

Spark 2.4 with Scala.
My source data looks like this:
Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615
Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605
Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265
Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178
Code used to flatten the file:
val SalespersontextDF = spark.read.text("D:/prints/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${df1.columns.length}, $stringCol) as (Salesperson, Customer)")
Unfortunately it is not populating Salesperson in the correct field: instead of the salesperson number it is populating the hardcoded value "value", and the salesperson number is shifted into the other field.
Appreciate your help very much.

The below approach might solve your problem:
import org.apache.spark.sql.functions._
val SalespersontextDF = spark.read.text("/home/sathya/Desktop/stackoverflo/data/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${SalespersontextDF.columns.length}, $stringCol) as (Salesperson, Customer)")
processedDF.show(false)
/*
+-----------+----------------------------------------------------------------------+
|Salesperson|Customer |
+-----------+----------------------------------------------------------------------+
|value |Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615|
|value |Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605 |
|value |Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265|
|value |Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178 |
+-----------+----------------------------------------------------------------------+
*/
processedDF
  .withColumn("Salesperson", split($"Customer", ":").getItem(0))
  .withColumn("Customer", split($"Customer", ":").getItem(1))
  .show(false)
/*
+--------------+-------------------------------------------------------+
|Salesperson |Customer |
+--------------+-------------------------------------------------------+
|Salesperson_21| Customer_575,Customer_2703,Customer_2682,Customer_2615|
|Salesperson_11| Customer_454,Customer_158,Customer_1859,Customer_2605 |
|Salesperson_10| Customer_1760,Customer_613,Customer_3008,Customer_1265|
|Salesperson_4 | Customer_1545,Customer_1312,Customer_861,Customer_2178|
+--------------+-------------------------------------------------------+
*/
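If you also need one row per customer, a minimal sketch building on the result above (column names assumed from that output) is to trim the comma-separated Customer string, split it, and explode it:
import org.apache.spark.sql.functions._
val flatDF = processedDF
  .withColumn("Salesperson", split($"Customer", ":").getItem(0))
  .withColumn("Customer", trim(split($"Customer", ":").getItem(1)))
// explode the comma list so each (Salesperson, Customer) pair becomes its own row
val perCustomerDF = flatDF.withColumn("Customer", explode(split($"Customer", ",")))
perCustomerDF.show(false)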

Try this:
spark.read
.schema("Salesperson STRING, Customer STRING")
.option("sep", ":")
.csv("D:/prints/sales.txt")

Related

How do you explode an array of JSON string into rows?

My UDF returns a JSON object array as a string; how can I expand the array into dataframe rows?
If it isn't possible, is there any other way (like using Struct) to achieve this?
Here is my JSON data:
sample json
{
"items":[ {"Name":"test", Id:"1"}, {"Name":"sample", Id:"2"}]
}
And here is how I want it to end up:
test, 1
sample, 2
The idea is that Spark can read any parallelized collection, hence we take the string, parallelize it, and read it as a dataset.
Code =>
import org.apache.spark.sql.functions._
val sampleJsonStr = """
| {
| "items":[ {"Name":"test", "Id":"1"}, {"Name":"sample", "Id":"2"}]
| }"""
val jsonDf = spark.read.option("multiLine","true").json(Seq(sampleJsonStr).toDS)
//jsonDf: org.apache.spark.sql.DataFrame = [items: array<struct<Id:string,Name:string>>]
// Finally we explode the json array
val explodedDf = jsonDf.
select("items").
withColumn("exploded_items",explode(col("items"))).
select(col("exploded_items.Id"),col("exploded_items.Name"))
Output =>
scala> explodedDf.show(false)
+---+------+
|Id |Name |
+---+------+
|1 |test |
|2 |sample|
+---+------+
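Since the question mentions the JSON string coming back from a UDF, a hedged alternative is to parse the string column in place with from_json and an explicit schema, then explode, instead of re-reading it as a dataset. The schema and the json_str column name below are assumptions for illustration:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// schema matching the sample JSON above
val itemSchema = new StructType()
  .add("items", ArrayType(new StructType()
    .add("Name", StringType)
    .add("Id", StringType)))

// json_str stands in for the column produced by the UDF
val withJson = Seq(sampleJsonStr).toDF("json_str")
val explodedDf2 = withJson
  .withColumn("parsed", from_json(col("json_str"), itemSchema))
  .withColumn("item", explode(col("parsed.items")))
  .select(col("item.Id"), col("item.Name"))
explodedDf2.show(false)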

How to access nested schema column?

I have a Kafka streaming source with JSONs, e.g. {"type":"abc","1":"23.2"}.
The query gives the following exception:
org.apache.spark.sql.catalyst.parser.ParseException: extraneous
input '.1' expecting {<EOF>, .......}
== SQL ==
person.1
What is the correct syntax to access "person.1"?
I have even changed DoubleType to StringType, but that didn't work either. The example works fine if I just keep person.type and remove person.1 in selectExpr:
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.DoubleType)
val personNestedDf = personJsonDf
.select(from_json($"value", struct).as("person"))
val personFlattenedDf = personNestedDf
.selectExpr("person.type", "person.1")
val consoleOutput = personNestedDf.writeStream
.outputMode("update")
.format("console")
.start()
Interesting, since select($"person.1") should work fine (but you used selectExpr which could've confused Spark SQL).
StructField(1,DoubleType,true) won't work however since the type should actually be StringType.
Let's see...
$ cat input.json
{"type":"abc","1":"23.2"}
val input = spark.read.text("input.json")
scala> input.show(false)
+-------------------------+
|value |
+-------------------------+
|{"type":"abc","1":"23.2"}|
+-------------------------+
import org.apache.spark.sql.types._
val struct = new StructType()
.add("type", DataTypes.StringType)
.add("1", DataTypes.StringType)
val q = input.select(from_json($"value", struct).as("person"))
scala> q.show
+-----------+
| person|
+-----------+
|[abc, 23.2]|
+-----------+
val q = input.select(from_json($"value", struct).as("person")).select($"person.1")
scala> q.show
+----+
| 1|
+----+
|23.2|
+----+
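If you prefer to stay with selectExpr, backtick-quoting the numeric field name should also work (a hedged variant of the query above, not tested against your exact streaming setup):
val q2 = input
  .select(from_json($"value", struct).as("person"))
  .selectExpr("person.type", "person.`1`")   // backticks quote the field named "1"
q2.show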
I have solved this problem by using person.*
+-----+--------+
|type | 1 |
+-----+--------+
|abc |23.2 |
+-----+--------+

Select all columns at runtime spark sql, without predefined schema

I have a dataframe with values which are in the format
|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234 |cM-type |883838 |8838832 |
|1235 |cM-type |883838 |8838832 |
|1236 |cM-type |883838 |8838832 |
|1237 |CNN-type |883838 |8838832 |
|1238 |cM-type |883838 |8838832 |
+----------+------------+--------+--------+
I want to convert the dataframe into this format
+----------+----------------------------------------------------------------------------------------+
|resourceId|value |
+----------+----------------------------------------------------------------------------------------+
|1234 |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235 |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236 |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237 |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238 |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+
I know I can get the desired output by giving the fields manually like this
val jsonformated=df.select($"resourceId",to_json(struct($"resourceId", $"resourceType", $"seasonId",$"seriesId")).alias("value"))
However, I am trying to pass the column values to struct programmatically, using
val cols = df.columns.toSeq
val jsonformatted=df.select($"resourceId",to_json(struct("colval",cols)).alias("value"))
For some reason the struct function doesn't take the sequence; from the API it looks like there is a method signature that accepts a sequence:
struct(String colName, scala.collection.Seq<String> colNames)
Is there a better solution to this problem?
Update:
As the answer pointed out, the exact syntax to get the output is:
val colsList = df.columns.toList
val column: List[Column] = colsList.map(col(_))
val jsonformatted=df.select($"resourceId",to_json(struct(column:_*)).alias("value"))
struct takes a sequence. You're just looking at the wrong variant. Use
def struct(cols: Column*): Column
such as
import org.apache.spark.sql.functions._
val cols: Seq[String] = ???
struct(cols map col: _*)
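Putting it together for the dataframe from the question, a minimal sketch (column names taken from the question's table) is:
import org.apache.spark.sql.functions._
val jsonformatted = df.select(
  $"resourceId",
  to_json(struct(df.columns.map(col): _*)).alias("value")
)
jsonformatted.show(false)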

Replace multiple blanks with a single blank in Spark SQL

I have a DataFrame created with HiveContext where one of the columns holds records like:
text1     text2
We want the in-between spaces between the two texts to be replaced with a single space and get the final output as:
text1 text2
How can we achieve that in Spark SQL? Note we are using HiveContext, registering a temp table and writing SQL queries over it.
Even better, I have now been enlightened by a real expert. It's simpler, in fact:
import org.apache.spark.sql.functions._
// val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
val myUDf = udf((s:String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)
// Then there is no need to play with columns excessively:
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df.withColumn("new_val", myUDf(col("val")))
new_df.show
import org.apache.spark.sql.functions._
val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
//error: object java.lang.String is not a value --> use Array
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df
.withColumn("udfResult",myUDf(col("val")))
.withColumn("new_val", col("udfResult")(0))
.drop("udfResult")
new_df.show
Output on Databricks
+--------------------+
| val|
+--------------------+
| i like cheese|
| the dog runs |
|text111111 text...|
+--------------------+
+--------------------+--------------------+
| val| new_val|
+--------------------+--------------------+
| i like cheese| i like cheese|
| the dog runs | the dog runs|
|text111111 text...|text111111 text22...|
+--------------------+--------------------+
Just do it in spark.sql:
regexp_replace( COLUMN, ' +', ' ')
https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace
check it:
spark.sql("""
select regexp_replace(col1, ' +', ' ') as col2
from (
select 'text1 text2     text3' as col1
)
""").show(20,False)
output
+-----------------+
|col2 |
+-----------------+
|text1 text2 text3|
+-----------------+
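The same replacement works through the DataFrame API if you are not writing raw SQL (a sketch against the df built earlier in this thread):
import org.apache.spark.sql.functions._
val collapsed = df.withColumn("new_val", regexp_replace(trim(col("val")), "\\s+", " "))
collapsed.show(false)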

What does Dataset's as method really mean

I have simple code:
test("Dataset as method") {
val spark = SparkSession.builder().master("local").appName("Dataset as method").getOrCreate()
import spark.implicits._
//xyz is an alias of ds1
val ds1 = Seq("1", "2").toDS().as("xyz")
//xyz can be used to refer to the value column
ds1.select($"xyz.value").show(truncate = false)
//ERROR here, no table or view named xyz
spark.sql("select * from xyz").show(truncate = false)
}
It looks to me like xyz is a table name, but the SQL select * from xyz raises an error complaining that xyz doesn't exist.
So I want to ask: what does the as method really mean, and how should I use the alias (like xyz in my case)?
.as(), when used with a Dataset (as in your case), is a function to create an alias for the Dataset, as you can see in the API doc:
/**
* Returns a new Dataset with an alias set.
*
* @group typedrel
* @since 1.6.0
*/
def as(alias: String): Dataset[T] = withTypedPlan {
SubqueryAlias(alias, logicalPlan)
}
which can be used in function APIs only, such as select, join, filter, etc. The alias cannot be used in SQL queries.
It is more evident if you create a two-column dataset and use the alias as you did:
val ds1 = Seq(("1", "2"),("3", "4")).toDS().as("xyz")
Now you can use select to select only one column using the alias as
ds1.select($"xyz._1").show(truncate = false)
which should give you
+---+
|_1 |
+---+
|1 |
|3 |
+---+
The use of the as alias is more evident when you join two datasets having the same column names, where you can write the join condition using the alias.
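For example, a minimal sketch of that join case (the dataset contents here are made up for illustration):
val left = Seq(("1", "a"), ("2", "b")).toDS().as("l")
val right = Seq(("1", "x"), ("3", "y")).toDS().as("r")
// the aliases disambiguate the identically named _1 and _2 columns
left.join(right, $"l._1" === $"r._1")
  .select($"l._1", $"l._2", $"r._2")
  .show(false)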
But to use the alias in SQL queries you will have to register the table:
ds1.registerTempTable("xyz")
spark.sql("select * from xyz").show(truncate = false)
which should give you the correct result
+---+---+
|_1 |_2 |
+---+---+
|1 |2 |
|3 |4 |
+---+---+
Or even better, do it the new way:
ds1.createOrReplaceTempView("xyz")
