Not able to split the column into multiple columns in Spark Dataframe - apache-spark

I am not able to split a column into multiple columns in a Spark DataFrame, or through an RDD.
I tried some other approaches, but they only work with a fixed number of columns.
Ex:
Data types: name: String, city: List[String]
I have a text file, and the input data is as below:
Name, city
A, (hyd,che,pune)
B, (che,bang,del)
C, (hyd)
Required Output is:
A,hyd
A,che
A,pune
B,che
B,bang
B,del
C,hyd
After reading the text file and converting it to a DataFrame, it looks like below:
scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+

You can use the explode function on your DataFrame:
val explodeDF = inputDF.withColumn("city", explode($"city"))
explodeDF.show()
http://sqlandhadoop.com/spark-dataframe-explode/
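Note that this assumes city is already an array column. A minimal self-contained sketch of that case (the sample rows here are made up, and spark is assumed to be the active SparkSession):
import org.apache.spark.sql.functions.explode
import spark.implicits._
// hypothetical input where `city` is already array<string>
val arrayDF = Seq(
  ("A", Seq("hyd", "che", "pune")),
  ("B", Seq("che", "bang", "del"))
).toDF("Name", "city")
// explode produces one output row per element of the array
arrayDF.withColumn("city", explode($"city")).show()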
Now that I understand you're loading the full line as a single string, here is how to achieve your output.
I have defined two user-defined functions:
val split_to_two_strings: String => Array[String] = _.split(",", 2) // split on the first comma only, giving the two elements that become the (name, city) columns
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",") // strip ( and ) then convert to a list of cities
import org.apache.spark.sql.functions.{explode, trim, udf}
val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)
val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
.select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
.withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
.drop($"city_list")
.withColumn("city", explode($"city_array"))
.drop($"city_array")
outputDF.show()
Hope this helps
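As a variation, here is a minimal sketch that avoids UDFs entirely and uses only built-in functions; it assumes the same inputDF with a single value column holding the full line, and that spark is the active SparkSession:
import org.apache.spark.sql.functions.{explode, regexp_extract, split, trim}
import spark.implicits._
val noUdfDF = inputDF
  .filter($"value".contains("(")) // drop the header line, which has no parentheses
  .withColumn("Name", trim(regexp_extract($"value", "^([^,]*),", 1))) // text before the first comma
  .withColumn("cities", split(regexp_extract($"value", "\\((.*)\\)", 1), ",")) // text inside the parentheses, split on commas
  .withColumn("city", explode($"cities"))
  .select($"Name", $"city")
noUdfDF.show()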

Related

Processing a list of json strings in Spark Streaming

I'm trying to transform the input I get with Spark Streaming in order to create a dataframe out of it. Basically I receive a list of JSON strings from which I want to extract the data.
Note: I reduced the JSON strings to just the coord objects, which should be sufficient for the general concept.
The input I get:
["{\"coord\":{\"lon\":10.0217,\"lat\":53.5281}}", "{\"coord\":{"lon\":10.1169,\"lat\":53.6522}}", "{\"coord\":...."]
The dataframe I want to create in order to save it to a database:
+----------+----------+
|lon |lat |
+----------+----------+
| 10.0217| 53.5281|
| 10.1169| 53.6522|
| ... | ... |
+----------+----------+
So far I have managed to replace the escaped quotes, which leaves me with an array of strings.
I tried to flatten the array:
result = df \
    .selectExpr("Cast(value AS STRING) as json") \
    .withColumn("json", f.regexp_replace('json', '\\\\"', '"')) \
    .withColumn("json", f.flatten(f.col("json"))) \
    .select("json")
Error:
pyspark.sql.utils.AnalysisException: cannot resolve 'flatten(json)'
due to data type mismatch: The argument should be an array of arrays,
but 'json' is of string type.;;
Then I tried to load the array with json.loads, but I was not able to call this function from Spark streaming.
So how do I extract the data from this input?
With the array provided
arr = [
"{\"coord\":{\"lon\":10.0217,\"lat\":53.5281}}",
"{\"coord\":{\"lon\":10.1169,\"lat\":53.6522}}",
]
You can get the desired result with the following code
from pyspark.sql import functions, types
df = (df.withColumn("lon", functions.regexp_extract("value", "(?<=lon\"\:)[0-9]+.[0-9]+", 0))
        .withColumn("lat", functions.regexp_extract("value", "(?<=lat\"\:)[0-9]+.[0-9]+", 0)))
df = df.select(df["lon"], df["lat"])
df.show()
+-------+-------+
| lon| lat|
+-------+-------+
|10.0217|53.5281|
|10.1169|53.6522|
+-------+-------+

How do you explode an array of JSON string into rows?

My UDF returns a JSON object array as a string; how can I expand the array into DataFrame rows?
If that isn't possible, is there any other way (like using a struct) to achieve this?
Here is my JSON data:
sample json
{
"items":[ {"Name":"test", Id:"1"}, {"Name":"sample", Id:"2"}]
}
And here is how I want it to end up like:
test, 1
sample, 2
The idea is that Spark can read any parallelized collection, so we take the string, parallelize it, and read it as a Dataset.
Code =>
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDS below
val sampleJsonStr = """
  | {
  |   "items":[ {"Name":"test", "Id":"1"}, {"Name":"sample", "Id":"2"}]
  | }""".stripMargin
val jsonDf = spark.read.option("multiLine","true").json(Seq(sampleJsonStr).toDS)
//jsonDf: org.apache.spark.sql.DataFrame = [items: array<struct<Id:string,Name:string>>]
// Finally we explode the json array
val explodedDf = jsonDf
  .select("items")
  .withColumn("exploded_items", explode(col("items")))
  .select(col("exploded_items.Id"), col("exploded_items.Name"))
Output =>
scala> explodedDf.show(false)
+---+------+
|Id |Name |
+---+------+
|1 |test |
|2 |sample|
+---+------+
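If the JSON array string is already a column of an existing DataFrame (as in the UDF case described in the question), a from_json based sketch avoids the detour through spark.read; the DataFrame name udfOutputDF and the column name raw_json are assumptions, and the schema assumes the UDF returns a bare array of {Name, Id} objects:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
// schema for [{"Name": "...", "Id": "..."}, ...]
val itemSchema = ArrayType(StructType(Seq(
  StructField("Name", StringType),
  StructField("Id", StringType)
)))
// one output row per element of the parsed array
val explodedFromColumn = udfOutputDF
  .withColumn("item", explode(from_json(col("raw_json"), itemSchema)))
  .select(col("item.Id"), col("item.Name"))
explodedFromColumn.show(false)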

Select all columns at runtime spark sql, without predefined schema

I have a dataframe with values which are in the format
|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234 |cM-type |883838 |8838832 |
|1235 |cM-type |883838 |8838832 |
|1236 |cM-type |883838 |8838832 |
|1237 |CNN-type |883838 |8838832 |
|1238 |cM-type |883838 |8838832 |
+----------+------------+--------+--------+
I want to convert the dataframe into this format
+----------+----------------------------------------------------------------------------------------+
|resourceId|value |
+----------+----------------------------------------------------------------------------------------+
|1234 |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235 |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236 |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237 |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238 |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+
I know I can get the desired output by listing the fields manually, like this:
val jsonformated=df.select($"resourceId",to_json(struct($"resourceId", $"resourceType", $"seasonId",$"seriesId")).alias("value"))
However, I am trying to pass the columns to struct programmatically, using
val cols = df.columns.toSeq
val jsonformatted=df.select($"resourceId",to_json(struct("colval",cols)).alias("value"))
For some reason the struct function doesn't take the sequence; from the API it looks like there is a method signature that accepts a sequence:
struct(String colName, scala.collection.Seq<String> colNames)
Is there a better solution to this problem?
Update:
As the answer pointed out the exact syntax to get the output
val colsList = df.columns.toList
val column: List[Column] = colsList.map(col(_))
val jsonformatted=df.select($"resourceId",to_json(struct(column:_*)).alias("value"))
struct takes a sequence. You're just looking at the wrong variant. Use
def struct(cols: Column*): Column
such as
import org.apache.spark.sql.functions._
val cols: Seq[String] = ???
struct(cols map col: _*)
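Putting that together for the example above, a minimal sketch (df is the DataFrame from the question):
import org.apache.spark.sql.functions.{col, struct, to_json}
// build a Column for every column of df, then pass them to struct as varargs
val allCols = df.columns.map(col)
val jsonformatted = df.select(
  col("resourceId"),
  to_json(struct(allCols: _*)).alias("value")
)
jsonformatted.show(false)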

Spark: create DataFrame with specified schema fields

In Spark, I create a case class to specify the schema, then create an RDD from a file and convert it to a DF, e.g.
case class Example(name: String, age: Long)
val exampleDF = spark.sparkContext
.textFile("example.txt")
.map(_.split(","))
.map(attributes => Example(attributes(0), attributes(1).toInt))
.toDF()
The question is: if the content of the txt file is like "ABCDE12345FGHIGK67890", without any symbols or spaces, how do I extract a substring of the specified length for each schema field, e.g. 'BCD' for name and '23' for age? Is this possible with map and split?
Thanks!
You can use substring to pull the data from specific indexes, as below:
case class Example (name : String, age: Int)
val example = spark.sparkContext
.textFile("test.txt")
.map(line => Example(line.substring(1, 4), line.substring(6,8).toInt)).toDF()
example.show()
Output:
+----+---+
|name|age|
+----+---+
| BCD| 23|
| BCD| 23|
| BCD| 23|
+----+---+
I hope this helps!
In the map function where you are splitting by commas, just put a function that converts the input string to a list of values in the required order.
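For example, a sketch of that idea with a small parsing helper (the offsets are the ones from the question, and spark is assumed to be the active SparkSession):
import spark.implicits._
case class Example(name: String, age: Int)
// convert one fixed-width line into the fields, in schema order
def parseLine(line: String): Example =
  Example(line.substring(1, 4), line.substring(6, 8).toInt)
val exampleDF = spark.sparkContext
  .textFile("example.txt")
  .map(parseLine)
  .toDF()
exampleDF.show()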

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

I have a Dataframe A that contains a column of array string.
...
|-- browse: array (nullable = true)
| |-- element: string (containsNull = true)
...
For example three sample rows would be
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
And another Dataframe B that contains a column of string
|-- browsenodeid: string (nullable = true)
Some sample rows for it would be
+------------+
|browsenodeid|
+------------+
| A|
| Z|
| M|
How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above examples the result will be:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1| <- because Z is a value of B.browsenodeid
| foo3| [M]| bar3| <- because M is a value of B.browsenodeid
If I had a single value then I would use something like
A.filter(array_contains(A("browse"), single_value))
But what do I do with a list or DataFrame of values?
I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.
Assuming you have a DataFrame dataDF:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
and an array b containing the values you want to match in browse
val b: Array[String] = Array("M", "Z")
Implement the udf:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray
def array_contains_any(s: Seq[String]): UserDefinedFunction =
  udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)
and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering like:
dataDF.where(array_contains_any(b)($"browse"))
In Spark >= 2.4.0 you can use arrays_overlap:
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}
val df = Seq(
("foo1", Seq("X", "Y", "Z"), "bar1"),
("foo2", Seq("K", "L"), "bar2"),
("foo3", Seq("M"), "bar3")
).toDF("col1", "browse", "coln")
val b = Seq("M" ,"Z")
val searchArray = array(b.map{lit}:_*) // cast to lit(i) then create Spark array
df.where(arrays_overlap($"browse", searchArray)).show()
// +----+---------+----+
// |col1| browse|coln|
// +----+---------+----+
// |foo1|[X, Y, Z]|bar1|
// |foo3| [M]|bar3|
// +----+---------+----+
Assume input data for Dataframe A (column browse):
browse
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
and you have to match it with Dataframe B, whose column browsenodeid I flattened to 123,200,300:
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\Dataframe_A.txt")
rawrdd.map(_.split("\\|")) // escape the pipe, since split takes a regex
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(","))
  .foreach(println)
Your output:
300,200
300
123
Updated
val matchSet = "A,Z,M".split(",").toSet
val rawrdd = sc.textFile("/FileStore/tables/mvv45x9f1494518792828/input_A.txt")
rawrdd.map(_.split("\\|")) // escape the pipe, since split takes a regex
  .filter(r => r(1).split(",").toSet.intersect(matchSet).nonEmpty)
  .map(r => org.apache.spark.sql.Row(r(0), r(1), r(2)))
  .collect.foreach(println)
Output is
foo1,X,Y,Z,bar1
foo3,M,bar3
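If B really is a DataFrame rather than a local array, another option is to explode A's browse column and use a left semi join against B (a sketch, assuming the DataFrames are named dfA and dfB and use the column names above; dropDuplicates removes the duplicate rows that appear when a row of A matches more than one browsenodeid):
import org.apache.spark.sql.functions.{col, explode}
// one row per (row of A, single browse value)
val explodedA = dfA.withColumn("browse_value", explode(col("browse")))
// keep only A's rows whose exploded value appears in B, then collapse back
val filteredA = explodedA
  .join(dfB, explodedA("browse_value") === dfB("browsenodeid"), "left_semi")
  .drop("browse_value")
  .dropDuplicates()
filteredA.show()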
