How do you explode an array of JSON strings into rows? - apache-spark

My UDF returns a JSON object array as a string. How can I expand that array into DataFrame rows?
If that isn't possible, is there any other way (like using a struct) to achieve this?
Here is my JSON data:
sample json
{
"items":[ {"Name":"test", Id:"1"}, {"Name":"sample", Id:"2"}]
}
And here is how I want it to end up:
test, 1
sample, 2

The idea is that Spark can read any parallelized collection, so we take the string, parallelize it, and read it as a Dataset.
Code =>
import org.apache.spark.sql.functions._
import spark.implicits._ // for Seq(...).toDS outside the spark-shell

val sampleJsonStr = """
{
  "items":[ {"Name":"test", "Id":"1"}, {"Name":"sample", "Id":"2"}]
}"""
val jsonDf = spark.read.option("multiLine","true").json(Seq(sampleJsonStr).toDS)
//jsonDf: org.apache.spark.sql.DataFrame = [items: array<struct<Id:string,Name:string>>]
// Finally we explode the json array
val explodedDf = jsonDf.
  select("items").
  withColumn("exploded_items", explode(col("items"))).
  select(col("exploded_items.Id"), col("exploded_items.Name"))
Output =>
scala> explodedDf.show(false)
+---+------+
|Id |Name |
+---+------+
|1 |test |
|2 |sample|
+---+------+
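If the JSON string actually arrives per row from the UDF (rather than as a single literal you can wrap in a Seq), a sketch using from_json with an explicit schema avoids going through spark.read.json at all. Here dfWithJson and the column name json_str are assumptions for illustration:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// schema matching {"items":[{"Name":"...","Id":"..."}, ...]}
val itemsSchema = StructType(Seq(
  StructField("items", ArrayType(StructType(Seq(
    StructField("Name", StringType),
    StructField("Id", StringType)
  ))))
))

// dfWithJson / "json_str" are hypothetical: the DataFrame and column holding the UDF output
val parsedDf = dfWithJson
  .withColumn("parsed", from_json(col("json_str"), itemsSchema))
  .withColumn("item", explode(col("parsed.items")))
  .select(col("item.Id"), col("item.Name"))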

Related

Complex File parsing in spark 2.4

I am using Spark 2.4 with Scala. My source data looks like the sample below.
Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615
Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605
Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265
Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178
Code used to flatten the file:
val SalespersontextDF = spark.read.text("D:/prints/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${SalespersontextDF.columns.length}, $stringCol) as (Salesperson, Customer)")
Unfortunately it is not populating Salesperson correctly: instead of the salesperson number it populates the hardcoded value "value", and the salesperson number shifts into the other field.
Appreciate your help very much.
The approach below might solve your problem:
import org.apache.spark.sql.functions._
val SalespersontextDF = spark.read.text("/home/sathya/Desktop/stackoverflo/data/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${SalespersontextDF.columns.length}, $stringCol) as (Salesperson, Customer)")
processedDF.show(false)
/*
+-----------+----------------------------------------------------------------------+
|Salesperson|Customer |
+-----------+----------------------------------------------------------------------+
|value |Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615|
|value |Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605 |
|value |Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265|
|value |Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178 |
+-----------+----------------------------------------------------------------------+
*/
processedDF.withColumn("Salesperson", split($"Customer", ":").getItem(0)).withColumn("Customer", split($"Customer", ":").getItem(1)).show(false)
/*
+--------------+-------------------------------------------------------+
|Salesperson |Customer |
+--------------+-------------------------------------------------------+
|Salesperson_21| Customer_575,Customer_2703,Customer_2682,Customer_2615|
|Salesperson_11| Customer_454,Customer_158,Customer_1859,Customer_2605 |
|Salesperson_10| Customer_1760,Customer_613,Customer_3008,Customer_1265|
|Salesperson_4 | Customer_1545,Customer_1312,Customer_861,Customer_2178|
+--------------+-------------------------------------------------------+
*/
Try this:
spark.read
.schema("Salesperson STRING, Customer STRING")
.option("sep", ":")
.csv("D:/prints/sales.txt")

Convert string type to array type in spark sql

I have a table in Spark SQL in Databricks, and I have a column as a string. I converted it to a new column with an Array datatype, but it is still one string. The datatype is array type in the table schema.
Column as String
Data1
[2461][2639][2639][7700][7700][3953]
Converted to Array
Data_New
["[2461][2639][2639][7700][7700][3953]"]
String to array conversion
df_new = df.withColumn("Data_New", array(df["Data1"]))
Then I write it as Parquet and use it as a Spark SQL table in Databricks.
When I search for a string using the array_contains function I get false:
select *
from table_name
where array_contains(Data_New,"[2461]")
When I search for the full string, the query returns true.
Please suggest how I can separate these strings into an array so that I can find any element using the array_contains function.
Just remove leading and trailing brackets from the string then split by ][ to get an array of strings:
df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)
+------------------------------------+------------------------------------+
|Data1 |Data_New |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
Now use array_contains like this:
df.createOrReplaceTempView("table_name")
sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
Actually Data_New is not a real array here, it holds the full string as a single element, so you need a regex (or similar) on the original string column:
pattern = "\\[2461\\]"  # escape the brackets, otherwise [2461] is just a character class
df_new.filter(df_new["Data1"].rlike(pattern))
import
from pyspark.sql import functions as sf, types as st
create table
a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
| col1|
+--------------------+
|[2461][2639][2639...|
| null|
+--------------------+
convert type
def spliter(x):
    if x is not None:
        return x[1:-1].split("][")
    else:
        return None
udf = sf.udf(spliter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", udf("col1")).withColumn("check", sf.array_contains("array_col1", "2461")).show()
+--------------------+--------------------+-----+
| col1| array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
| null| null| null|
+--------------------+--------------------+-----+

Not able to split the column into multiple columns in Spark Dataframe

I am not able to split the column into multiple columns in a Spark DataFrame, nor through an RDD.
I tried some other code, but it only works with a fixed number of columns.
Ex:
Datatypes: name: string, city: list(string)
I have a text file and input data is like below
Name, city
A, (hyd,che,pune)
B, (che,bang,del)
C, (hyd)
Required Output is:
A,hyd
A,che
A,pune
B,che
B,bang
B,del
C,hyd
After reading the text file and converting it to a DataFrame, it will look like below:
scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+
You can use the explode function on your DataFrame:
val explodeDF = inputDF.withColumn("city", explode($"city"))
explodeDF.show()
http://sqlandhadoop.com/spark-dataframe-explode/
Now that I understand you're loading the full line as a string, here is how to achieve your output.
I have defined two user-defined functions:
val split_to_two_strings: String => Array[String] = _.split(",", 2) // split each input line into two parts (name, city list)
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",") // strip ( and ) then split into the list of cities
import org.apache.spark.sql.functions.{explode, trim, udf}
import spark.implicits._ // for the $ column syntax
val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)
val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
.select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
.withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
.drop($"city_list")
.withColumn("city", explode($"city_array"))
.drop($"city_array")
outputDF.show()
Hope this helps
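For completeness, a UDF-free sketch of the same idea using only built-in functions might look like the following; it assumes the same single value column and still leaves the header row to be filtered out:
import org.apache.spark.sql.functions.{col, explode, regexp_extract, split}

val noUdfDF = inputDF
  // name is everything before the first comma
  .withColumn("Name", regexp_extract(col("value"), "^([^,]+),", 1))
  // pull the text inside the parentheses, split on "," and explode into rows
  .withColumn("city", explode(split(regexp_extract(col("value"), "\\(([^)]*)\\)", 1), ",")))
  .select("Name", "city")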

Select all columns at runtime spark sql, without predefined schema

I have a dataframe with values in the following format:
|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234 |cM-type |883838 |8838832 |
|1235 |cM-type |883838 |8838832 |
|1236 |cM-type |883838 |8838832 |
|1237 |CNN-type |883838 |8838832 |
|1238 |cM-type |883838 |8838832 |
+----------+------------+--------+--------+
I want to convert the dataframe into this format
+----------+----------------------------------------------------------------------------------------+
|resourceId|value |
+----------+----------------------------------------------------------------------------------------+
|1234 |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235 |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236 |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237 |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238 |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+
I know I can get the desired output by giving the fields manually like this
val jsonformated=df.select($"resourceId",to_json(struct($"resourceId", $"resourceType", $"seasonId",$"seriesId")).alias("value"))
However, I am trying to pass the columns to struct programmatically, using
val cols = df.columns.toSeq
val jsonformatted=df.select($"resourceId",to_json(struct("colval",cols)).alias("value"))
For some reason the struct function doesn't take the sequence, even though from the API it looks like there is a method signature that accepts one:
struct(String colName, scala.collection.Seq<String> colNames)
Is there a better solution to this problem?
Update:
As the answer pointed out, this is the exact syntax to get the output:
val colsList = df.columns.toList
val column: List[Column] = colsList.map(col(_))
val jsonformatted=df.select($"resourceId",to_json(struct(column:_*)).alias("value"))
struct takes a sequence. You're just looking at the wrong variant. Use
def struct(cols: Column*): Column
such as
import org.apache.spark.sql.functions._
val cols: Seq[String] = ???
struct(cols map col: _*)
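Putting it together for the question's DataFrame, a minimal sketch (assuming df holds the four columns shown above) would be:
import org.apache.spark.sql.functions.{col, struct, to_json}

val jsonformatted = df.select(
  col("resourceId"),
  to_json(struct(df.columns.map(col): _*)).alias("value"))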

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

I have a DataFrame A that contains a column of arrays of strings.
...
|-- browse: array (nullable = true)
| |-- element: string (containsNull = true)
...
For example three sample rows would be
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
And another DataFrame B that contains a string column
|-- browsenodeid: string (nullable = true)
Some sample rows for it would be
+------------+
|browsenodeid|
+------------+
| A|
| Z|
| M|
How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above examples the result will be:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1| <- because Z is a value of B.browsenodeid
| foo3| [M]| bar3| <- because M is a value of B.browsenodeid
If I had a single value then I would use something like
A.filter(array_contains(A("browse"), single_value))
But what do I do with a list or DataFrame of values?
I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.
Assuming you have a DataFrame dataDF:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
and an array b containing the values you want to match in browse
val b: Array[String] = Array("M", "Z")
Implement the udf:
import org.apache.spark.sql.expressions.UserDefinedFunction
import scala.collection.mutable.WrappedArray
def array_contains_any(s: Seq[String]): UserDefinedFunction = {
  udf((c: WrappedArray[String]) =>
    c.toList.intersect(s).nonEmpty)
}
and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering like:
dataDF.where(array_contains_any(b)($"browse"))
In Spark >= 2.4.0 you can use arrays_overlap:
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}
val df = Seq(
  ("foo1", Seq("X", "Y", "Z"), "bar1"),
  ("foo2", Seq("K", "L"), "bar2"),
  ("foo3", Seq("M"), "bar3")
).toDF("col1", "browse", "coln")
val b = Seq("M" ,"Z")
val searchArray = array(b.map{lit}:_*) // wrap each value with lit, then build a Spark array column
df.where(arrays_overlap($"browse", searchArray)).show()
// +----+---------+----+
// |col1| browse|coln|
// +----+---------+----+
// |foo1|[X, Y, Z]|bar1|
// |foo3| [M]|bar3|
// +----+---------+----+
Assume the input data in Dataframe A is:
browse
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
and you have to match it with Dataframe B, whose column browsenodeid (flattened) contains: 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\Dataframe_A.txt")
rawrdd.map(_.split("\\|")) // escape the pipe: String.split takes a regex
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(","))
  .foreach(println)
Your output:
300,200
300
123
Updated
val matchSet = "A,Z,M".split(",").toSet
val rawrdd = sc.textFile("/FileStore/tables/mvv45x9f1494518792828/input_A.txt")
rawrdd.map(_.split("\\|"))
  .filter(r => r(1).split(",").toSet.intersect(matchSet).nonEmpty)
  .map(r => org.apache.spark.sql.Row(r(0), r(1), r(2)))
  .collect.foreach(println)
Output is
foo1,X,Y,Z,bar1
foo3,M,bar3
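When the match values actually live in DataFrame B (as in the question) rather than in a local Array/Seq, a join-based sketch avoids collecting them to the driver; dfA/dfB and the column names below are taken from the question:
import org.apache.spark.sql.functions.{col, explode}

// explode browse, keep rows whose item appears in B (left_semi keeps only A's columns),
// then drop the helper column and de-duplicate rows that matched several ids
val matched = dfA
  .withColumn("browse_item", explode(col("browse")))
  .join(dfB, col("browse_item") === col("browsenodeid"), "left_semi")
  .drop("browse_item")
  .distinct()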
