Using a self-defined data transform function in Spark Structured Streaming - apache-spark

I read the following blog post and found the API very useful.
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
The blog shows many data selection examples. For example, given the input
{
  "a": {
    "b": 1
  }
}
applying the Scala expression events.select("a.b") produces the output
{
  "b": 1
}
But data type conversions are not mentioned in the blog. Say I have the following input:
{
  "timestampInSec": "1514917353",
  "ip": "123.39.76.112",
  "money": "USD256",
  "countInString": "6"
}
The expected output is:
{
  "timestamp": "2018-01-02 11:22:33",
  "ip_long": 2066173040,
  "currency": "USD",
  "money_amount": 256,
  "count": 6
}
Some of the required transformations are not included in org.apache.spark.sql.functions._:
The timestamp is in seconds and comes as a string
Convert the IP address to a long
Split USD256 into two columns and convert one of them to a number
Convert a string to a number
Another concern is error handling and default values. If there is invalid input like:
{
  "timestampInSec": "N/A",
  "money": "999",
  "countInString": "Number-Six"
}
The expected output would be:
{
  "timestamp": "1970-01-01 00:00:00",
  "ip_long": 0,
  "currency": "NA",
  "money_amount": 999,
  "count": -1
}
timestampInSec is not a number: it is expected to fall back to 0 and produce the corresponding timestamp string.
ip is missing from the input: it is expected to use a default value of 0.
The money field is incomplete: it has the amount but is missing the currency. It is expected to use NA as the default currency and still convert money_amount correctly.
countInString is not a number: it is expected to use -1 (not 0) as the default value.
These requirements are not common and need some self-defined business logic.
I did check some functions like to_timestamp. They involve codegen machinery, and it does not seem easy to add new functions that way. Is there a guide or document about writing self-defined transformation functions? Is there an easy way to meet these requirements?

For all of the cases below:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax and toDF
Timestamp is in seconds and is a string type:
val timestamp = coalesce(
  $"timestampInSec".cast("long").cast("timestamp"),
  lit(0).cast("timestamp")
).alias("timestamp")
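If the goal is the formatted string shown in the expected output rather than a TimestampType column, from_unixtime (also in functions._) is one possible alternative; a minimal sketch, with timestampStr being just an illustrative name:
// from_unixtime formats the epoch seconds as "yyyy-MM-dd HH:mm:ss" in the session time zone
val timestampStr = coalesce(
  from_unixtime($"timestampInSec".cast("long")),
  lit("1970-01-01 00:00:00")
).alias("timestamp")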
Split USD256 into two columns and convert one of them to a number:
val currencyPattern = "^([A-Z]+)?([0-9]+)$"

val rawCurrency = trim(regexp_extract($"money", currencyPattern, 1))
val currency = when(length(rawCurrency) === 0, "NA")
  .otherwise(rawCurrency)
  .alias("currency")

val amount = regexp_extract($"money", currencyPattern, 2)
  .cast("decimal(38, 0)")
  .alias("money_amount")
Convert a string to a number:
val count = coalesce($"countInString".cast("long"), lit(-1)).alias("count")
Convert the IP address to a long:
val ipPattern = "^([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})"

val ip_long = coalesce(
  Seq((1, 24), (2, 16), (3, 8), (4, 0)).map {
    case (group, numBits) => shiftLeft(
      regexp_extract($"ip", ipPattern, group).cast("long"),
      numBits
    )
  }.reduce(_ + _),
  lit(0)
).alias("ip_long")
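Each octet is shifted into its byte position and the results are summed, which is the usual dotted-quad-to-integer conversion. A quick plain-Scala check of the sample address (my addition, not part of the original answer):
// 123.39.76.112 interpreted as a 32-bit value
val check = (123L << 24) + (39L << 16) + (76L << 8) + 112L
// check: Long = 2066173040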
Result
val df = Seq(
  ("1514917353", "123.39.76.112", "USD256", "6"),
  ("N/A", null, "999", null)
).toDF("timestampInSec", "ip", "money", "countInString")

df.select(timestamp, currency, amount, count, ip_long).show
// +-------------------+--------+------------+-----+----------+
// | timestamp|currency|money_amount|count| ip_long|
// +-------------------+--------+------------+-----+----------+
// |2018-01-02 18:22:33| USD| 256| 6|2066173040|
// |1970-01-01 00:00:00| NA| 999| -1| 0|
// +-------------------+--------+------------+-----+----------+
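Because these are all plain Column expressions, the same select can be applied to a streaming DataFrame too. Below is a minimal sketch, assuming the JSON events arrive on a socket source and that eventSchema (a name I made up) describes the input fields; adapt the source and schema to your actual pipeline:
import org.apache.spark.sql.types._

// assumed schema of the incoming JSON events
val eventSchema = StructType(Seq(
  StructField("timestampInSec", StringType),
  StructField("ip", StringType),
  StructField("money", StringType),
  StructField("countInString", StringType)
))

val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .select(from_json($"value", eventSchema).alias("event"))
  .select($"event.*")

// reuse the column expressions defined above
events
  .select(timestamp, currency, amount, count, ip_long)
  .writeStream
  .format("console")
  .start()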

Related

Spark : call withColumn according to column type

I have a dataset that has several types of columns: String, Double, List, Map, etc.
I want to do a withColumn to set specific values for these columns when the value is null, depending on the column type.
I tried something like this:
ds.withColumn(colName, when(col(colName).expr().dataType().equals(Datatypes.STRING)), lit("StringDefaultValues"));
But it's not working at all. Besides, I cannot find Datatypes.MAP or Datatypes.LIST anywhere.
What is the correct way to do this?
Try:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val newData = data.schema.fieldNames.foldLeft(data) { (manipulatedData, colName) =>
  manipulatedData.schema(colName).dataType match {
    case IntegerType => manipulatedData.withColumn(colName, when(col(colName).isNull, lit(-1)).otherwise(col(colName)))
    case StringType => manipulatedData.withColumn(colName, when(col(colName).isNull, lit("Empty")).otherwise(col(colName)))
    case MapType(IntegerType, StringType, true) => manipulatedData.withColumn(colName, when(col(colName).isNull, typedLit(Map.empty[Int, String])).otherwise(col(colName)))
    case ArrayType(StringType, true) => manipulatedData.withColumn(colName, when(col(colName).isNull, typedLit(Array.empty[String])).otherwise(col(colName)))
    // TODO: rest of types...
    case _ => manipulatedData // leave any other column type untouched
  }
}
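For context, here is a minimal sketch of input that the snippet above could be run against; the sample values and column names are made up, and data has to be defined before running the foldLeft:
import spark.implicits._

// hypothetical sample input containing nulls to be replaced
val data = Seq(
  (Option(1), Option("a")),
  (Option.empty[Int], Option.empty[String])
).toDF("id", "label")

// after the foldLeft above, newData holds -1 for the null id and "Empty" for the null label
newData.show()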
Try the code below to get more control over column data types.
scala> df.show(false)
+--------+---+------+------------------+
|name |age|salary|options |
+--------+---+------+------------------+
|Srinivas|1 |1200.0|[name -> srinivas]|
+--------+---+------+------------------+
For example, I am changing the name and age fields to defaults.
scala> val columns = df.schema
  .map(c => (c.name, c.dataType.typeName))
  .map {
    case (column @ "name", datatype @ "string") => lit("StringDefaultValues").as(column)
    case (column @ "age", datatype @ "integer") => lit(30).as(column)
    case (column @ "options", datatype @ "map") => map(col("name"), col("age")).as(column) // for example
    case (column, datatype) => col(column).as(column)
  }
scala> df.select(columns:_*).show(false)
+-------------------+---+------+---------------+
|name |age|salary|options |
+-------------------+---+------+---------------+
|StringDefaultValues|30 |1200.0|[Srinivas -> 1]|
+-------------------+---+------+---------------+

finding a text within a struct type column in a pyspark dataframe

I want to find the number of occurrences of the text "matches_count" in a struct type column of a dataframe. How can I accomplish this in PySpark? I need to return a column with the count. Also, the structure varies for every row, so the same keys might or might not be present in a given row.
"abcviolation": {
"2020.06.01.xls": {
"twnin": {
"matches_count": 1
},
"phtaxid": {
"matches_count": 30
},
"driverslicense": {
"matches_count": 15
},
"DICard_Term": {
"matches_count": 1
},
"resident": {
"matches_count": 30
},
"win": {
"matches_count": 30
},
"port2": {
"matches_count": 1
},
"id_2": {
"matches_count": 30
},
"id_3": {
"matches_count": 6
},
"id_4": {
"matches_count": 30
}
}
},
The output dataframe would have a column "no_of_occurrence" with the value 10 for this row.
If the structure of the data is not fixed (for example, if the same code should be applied to many different datasets with slightly different structures), one approach could be to transform the whole struct into a string and count the number of occurrences of matches_count within that string.
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# create a new column containing a (json) string representation of the data
df2 = df.withColumn("json", F.to_json(F.struct("abcviolation")))

# define a udf that counts the number of occurrences of 'matches_count' within a string
def count_matches_count(s):
    return s.count('matches_count')

count_udf = F.udf(count_matches_count, LongType())

# create a new column with the help of the udf that contains the count
df3 = df2.withColumn("count", count_udf("json"))
The result is
+--------------------+--------------------+-----+
| abcviolation| json|count|
+--------------------+--------------------+-----+
|[[[1], [15], [30]...|{"abcviolation":{...| 10|
+--------------------+--------------------+-----+
This approach has some caveats: calling to_json in the first step will lose data if there are duplicate keys in the JSON, and the count might be too high if the string matches_count also occurs elsewhere in the data, for example as another field name.

Spark dataframe to nested JSON

I have a dataframe joinDf created from joining the following four dataframes on userId:
val detailsDf = Seq((123, "first123", "xyz"))
  .toDF("userId", "firstName", "address")

val emailDf = Seq((123, "abc#gmail.com"),
    (123, "def#gmail.com"))
  .toDF("userId", "email")

val foodDf = Seq((123, "food2", false, "Italian", 2),
    (123, "food3", true, "American", 3),
    (123, "food1", true, "Mediterranean", 1))
  .toDF("userId", "foodName", "isFavFood", "cuisine", "score")

val gameDf = Seq((123, "chess", false, 2),
    (123, "football", true, 1))
  .toDF("userId", "gameName", "isOutdoor", "score")

val joinDf = detailsDf
  .join(emailDf, Seq("userId"))
  .join(foodDf, Seq("userId"))
  .join(gameDf, Seq("userId"))
The user's food and game favorites should be ordered by score in ascending order.
I am trying to create a result from this joinDf where the JSON looks like the following:
[
  {
    "userId": "123",
    "firstName": "first123",
    "address": "xyz",
    "UserFoodFavourites": [
      {
        "foodName": "food1",
        "isFavFood": "true",
        "cuisine": "Mediterranean"
      },
      {
        "foodName": "food2",
        "isFavFood": "false",
        "cuisine": "Italian"
      },
      {
        "foodName": "food3",
        "isFavFood": "true",
        "cuisine": "American"
      }
    ],
    "UserEmail": [
      "abc#gmail.com",
      "def#gmail.com"
    ],
    "UserGameFavourites": [
      {
        "gameName": "football",
        "isOutdoor": "true"
      },
      {
        "gameName": "chess",
        "isOutdoor": "false"
      }
    ]
  }
]
Should I use joinDf.groupBy().agg(collect_set())?
Any help would be appreciated.
My solution is based on the answers found here and here
It uses a window function. It shows how to create a nested list of food preferences for a given userId based on the food score. Here we are creating a struct FoodDetails from the columns we have:
val foodModifiedDf = foodDf
  .withColumn("FoodDetails", struct("foodName", "isFavFood", "cuisine", "score"))
  .drop("foodName", "isFavFood", "cuisine", "score")

println("Just printing the food details here")
foodModifiedDf.show(10, truncate = false)
Here we are creating a window function which accumulates the list for a userId based on FoodDetails.score in descending order. As the window function is applied, it keeps accumulating the list as it encounters new rows with the same userId. After the accumulation is done, we have to do a groupBy over the userId to select the largest list.
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy("userId").orderBy(desc("FoodDetails.score"))

val userAndFood = detailsDf.join(foodModifiedDf, "userId")
val newUF = userAndFood.select($"*", collect_list("FoodDetails").over(window) as "FDNew")

println("UserAndFood dataframe after windowing function applied")
newUF.show(10, truncate = false)

val resultUF = newUF.groupBy("userId")
  .agg(max("FDNew"))

println("Final result after select the maximum length list")
resultUF.show(10, truncate = false)
This is how the result finally looks:
+------+-----------------------------------------------------------------------------------------+
|userId|max(FDNew) |
+------+-----------------------------------------------------------------------------------------+
|123 |[[food3, true, American, 3], [food2, false, Italian, 2], [food1, true, Mediterranean, 1]]|
+------+-----------------------------------------------------------------------------------------+
Given this dataframe, it should be easier to write out the nested json.
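For example, one possible way to get the JSON strings out of that result is a quick toJSON sketch (my addition, not part of the original answer):
// each row becomes one JSON string; the struct column keeps its nested shape
resultUF
  .withColumnRenamed("max(FDNew)", "UserFoodFavourites")
  .toJSON
  .show(truncate = false)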
The main problem with joining before grouping and collecting lists is that the join produces a lot of records for the groupBy to collapse; in your example it is 12 records after the join and before the groupBy. You would also need to worry about picking "firstName" and "address" out of detailsDf from those 12 duplicates. To avoid both problems, you could pre-process the food, email and game dataframes using struct and groupBy, and then join them to detailsDf with no risk of exploding your data due to multiple records with the same userId in the joined tables.
val detailsDf = Seq((123, "first123", "xyz"))
  .toDF("userId", "firstName", "address")

val emailDf = Seq((123, "abc#gmail.com"),
    (123, "def#gmail.com"))
  .toDF("userId", "email")

val foodDf = Seq((123, "food2", false, "Italian", 2),
    (123, "food3", true, "American", 3),
    (123, "food1", true, "Mediterranean", 1))
  .toDF("userId", "foodName", "isFavFood", "cuisine", "score")

val gameDf = Seq((123, "chess", false, 2),
    (123, "football", true, 1))
  .toDF("userId", "gameName", "isOutdoor", "score")

val emailGrp = emailDf.groupBy("userId").agg(collect_list("email").as("UserEmail"))

val foodGrp = foodDf
  .select($"userId", struct("score", "foodName", "isFavFood", "cuisine").as("UserFoodFavourites"))
  .groupBy("userId").agg(sort_array(collect_list("UserFoodFavourites")).as("UserFoodFavourites"))

val gameGrp = gameDf
  .select($"userId", struct("gameName", "isOutdoor", "score").as("UserGameFavourites"))
  .groupBy("userId").agg(collect_list("UserGameFavourites").as("UserGameFavourites"))

val result = detailsDf.join(emailGrp, Seq("userId"))
  .join(foodGrp, Seq("userId"))
  .join(gameGrp, Seq("userId"))

result.show(100, false)
Output:
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|userId|firstName|address|UserEmail |UserFoodFavourites |UserGameFavourites |
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
|123 |first123 |xyz |[abc#gmail.com, def#gmail.com]|[[1, food1, true, Mediterranean], [2, food2, false, Italian], [3, food3, true, American]]|[[chess, false, 2], [football, true, 1]]|
+------+---------+-------+------------------------------+-----------------------------------------------------------------------------------------+----------------------------------------+
As all the groupBy operations and the joins are done on userId, Spark will optimise this quite well.
UPDATE 1: After @user238607 pointed out that I had missed the original requirement of the food preferences being sorted by score, I did a quick fix: I placed the score column as the first element of the UserFoodFavourites struct and used the sort_array function to arrange the data in the desired order without forcing an extra shuffle. The code and its output have been updated.

spark assign column name for withColumn function from variable fields

I have some JSON data like below, and I need to create new columns based on some of the JSON values:
{ "start": "1234567679", "test": ["abc"], "value": 324, "end": "1234567689" }
{ "start": "1234567679", "test": ["xyz"], "value": "Near", "end": "1234567689"}
{ "start": "1234568679", "test": ["pqr"], "value": ["Attr"," "], "end":"1234568679"}
{ "start": "1234568997", "test": ["mno"], "value": ["{\"key\": \"1\", \"value\": [\"789\"]}" ], "end": "1234568999"}
Above is the JSON example.
I want to create columns like below:
start        abc   xyz   pqr   mno   end
1234567679   324   null  null  null  1234567689
1234567889   null  Near  null  null  1234567989
1234568679   null  null  attr  null  1234568679
1234568997   null  null  null  789   1234568999
def getValue1(s1: Seq[String], v: String) = {
  if (s1(0) == "abc") v else null
}

def getValue2(s1: Seq[String], v: String) = {
  if (s1(0) == "xyz") v else null
}

val df = spark.read.json("path to json")
val tdf = df.withColumn("abc", getValue1($"test", $"value")).withColumn("xyz", getValue2($"test", $"value"))
But I don't want to use this because I have many more test values. I want some function that does something like this:
def getColumnname(s1: Seq[String]) = {
  s1(0)
}

val tdf = df.withColumn(getColumnname($"test"), $"value")
Is it a good idea to change the values to columns? I want it like this because I need to feed the result into some machine learning code which needs plain columns.
You can use pivot operations to do such things. Assuming you always have one item in your array for the test column, here is the simpler solution:
import org.apache.spark.sql.functions._
val df = sqlContext.read.json("<yourPath>")
df.withColumn("test", $"test".getItem(0)).groupBy($"start", $"end").pivot("test").agg(first("value")).show
+----------+----------+----+----+
| start| end| abc| xyz|
+----------+----------+----+----+
|1234567679|1234567689| 324|null|
|1234567889|1234567689|null| 789|
+----------+----------+----+----+
If you have multiple values in the test column, you can also use the explode function:
df.withColumn("test", explode($"test")).groupBy($"start", $"end").pivot("test").agg(first("value")).show
For more information:
Spark doc
Databricks blog
Medium blog
I hope it helps!
Update I
Based on your comments and the updated question, here is the solution that you need to follow. I have intentionally separated all the operations, so you can easily understand what you need to do for further improvements:
df.withColumn("value", regexp_replace($"value", "\\[", "")). //1
withColumn("value", regexp_replace($"value", "\\]", "")). //2
withColumn("value", split($"value", "\\,")). //3
withColumn("test", explode($"test")). //4
withColumn("value", explode($"value")). //5
withColumn("value", regexp_replace($"value", " +", "")). //6
filter($"value" !== ""). //7
groupBy($"start", $"end").pivot("test"). //8
agg(first("value")).show //9
When you read such JSON files, Spark gives you a data frame whose value column has StringType. You cannot directly convert StringType to ArrayType, so you need some tricks like lines 1, 2 and 3 to convert it into ArrayType. You could do these operations in one line, with just one regular expression, or by defining a udf; it is all up to you, I'm just trying to show you the abilities of Apache Spark.
Now you have the value column with ArrayType. Explode this column in line 5 as we did in line 4 for the test column, then apply your pivot operations.
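As a side note, here is a minimal sketch of the udf alternative mentioned above (my own illustration, not part of the original answer): a single udf that roughly stands in for steps 1-3 plus the whitespace cleanup in 6 and the empty-value filter in 7, assuming the value strings are never null:
// strip brackets and whitespace, split on commas, drop empty pieces
val toArray = udf { raw: String =>
  raw.replaceAll("[\\[\\]\\s]", "").split(",").filter(_.nonEmpty)
}

df.withColumn("value", explode(toArray($"value"))).
  withColumn("test", explode($"test")).
  groupBy($"start", $"end").pivot("test").
  agg(first("value")).show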

Spark Dataframe groupBy and sort results into a list

I have a Spark Dataframe and I would like to group the elements by a key and have the results as a sorted list
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I make the items in the list sorted ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to Daniel de Paula's answer regarding the sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want; all you need to do afterwards is get rid of the sort-by column in each element of the resulting list.
The same approach can be applied with several sort by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

case class Employee(name: String, department: String, salary: Double)

val employees = Seq(
  Employee("JSMITH", "A", 20.0),
  Employee("AJOHNSON", "A", 650.0),
  Employee("CBAKER", "A", 650.2),
  Employee("TGREEN", "A", 13.0),
  Employee("CHORTON", "B", 111.0),
  Employee("AIVANOV", "B", 233.0),
  Employee("VSMIRNOV", "B", 11.0)
)

val employeesDF = spark.createDataFrame(employees)

// extract the names from the sorted (salary, name) structs
val getNames = udf { salaryNames: Seq[Row] =>
  salaryNames.map { case Row(_: Double, name: String) => name }
}

employeesDF
  .groupBy($"department")
  .agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
  .withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
  .show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
