spark conditional replacement of values - apache-spark

For pandas I have a code snippet like this:
def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet
which conditionally replaces values in a data frame.
Trying to port this functionality to Spark, I wrote
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
which did not work out for me:
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;
even though df.printSchema reports string for both A and B.
What is wrong here?
edit
A minimal example:
import java.sql.{ Date, Timestamp }
case class FooBar(foo:Date, bar:String)
val myDf = Seq(("2016-01-01", "first"), ("2016-01-02", "second"), ("2016-wrongFormat", "noValidFormat"), ("2016-01-04", "lastAssumingSameDate"))
  .toDF("foo", "bar")
  .withColumn("foo", 'foo.cast("Date"))
  .as[FooBar]
myDf.printSchema
root
|-- foo: date (nullable = true)
|-- bar: string (nullable = true)
scala> myDf.show
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| null| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show
And the expected output
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| "noValue"| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
edit2
In case chaining of conditions is required:
df
  .withColumn("A",
    when(
      (($"B" === "x") and ($"B" isNull)) or
      (($"B" === "y") and ($"B" isNull)), "replacement"))
should work.

Mind the operator precedence. It should be:
myDf.withColumn("foo",
when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))
This:
$"bar" === "noValidFormat" and $"foo" isNull
is evaluated as:
(($"bar" === "noValidFormat") and $"foo") isNull

Related

Spark 3.1 String Array to Date Array Conversion Error

I want to find out whether this array contains the incoming date or not; if yes, I need to flag it in one column.
Dataset<Row> dataset = dataset.withColumn("incoming_timestamp", col("incoming_timestamp").cast("timestamp"))
        .withColumn("incoming_date", to_date(col("incoming_timestamp")));
My incoming_timestamp is 2021-03-30 00:00:00; after converting to date it is 2021-03-30.
output dataset is like this
+-----+-------------------+-------------+
|col 1|incoming_timestamp |incoming_date|
+-----+-------------------+-------------+
|val1 |2021-03-30 00:00:00|2021-07-06   |
|val2 |2020-03-30 00:00:00|2020-03-30   |
|val3 |1889-03-30 00:00:00|1889-03-30   |
+-----+-------------------+-------------+
I have a String declared like this:
String Dates = "2021-07-06,1889-03-30";
I want to add one more column to the result dataset indicating whether the incoming date is present in the Dates string, like this:
+-----+-------------------+-------------+------+
|col 1|incoming_timestamp |incoming_date|result|
+-----+-------------------+-------------+------+
|val1 |2021-03-30 00:00:00|2021-07-06   |true  |
|val2 |2020-03-30 00:00:00|2020-03-30   |false |
|val3 |1889-03-30 00:00:00|1889-03-30   |true  |
+-----+-------------------+-------------+------+
For that I first need to convert this String into an array; then array_contains(array, value) returns true if the array contains the value.
I tried the following:
METHOD 1
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
Date[] dateArr = Arrays.stream(Dates.split(","))
        .map(d -> LocalDate.parse(d, formatter))
        .toArray(Date[]::new);
it throws error, java.lang.ArrayStoreException: java.time.LocalDate
METHOD 2
SimpleDateFormat formatter = new SimpleDateFormat("YYYY-MM-DD", Locale.ENGLISH);
formatter.setTimeZone(TimeZone.getTimeZone("America/New_York"));
Date[] dateArr = Arrays.stream(Dates.split(",")).map(d -> {
    try {
        return formatter.parse(d);
    } catch (ParseException e) {
        e.printStackTrace();
    }
    return null;
}).toArray(Date[]::new);
dataset = dataset.withColumn("result", array_contains(col("incoming_date"), dates));
it throws error
org.apache.spark.sql.AnalysisException: Unsupported component type class java.util.Date in arrays
Can anyone help on this?
This can be solved by typecasting String to java.sql.Date.
import java.sql.Date
val data: Seq[(String, String)] = Seq(
  ("val1", "2020-07-31 00:00:00"),
  ("val2", "2021-02-28 00:00:00"),
  ("val3", "2019-12-31 00:00:00"))
val compareDate = "2020-07-31, 2019-12-31"
val compareDateArray = compareDate.split(",").map(x => Date.valueOf(x.trim))
import spark.implicits._
val df = data.toDF("variable", "date")
  .withColumn("date_casted", to_date(col("date"), "y-M-d H:m:s"))
df.show()
val outputDf = df.withColumn("result", col("date_casted").isin(compareDateArray: _*))
outputDf.show()
Input:
+--------+-------------------+-----------+
|variable| date|date_casted|
+--------+-------------------+-----------+
| val1|2020-07-31 00:00:00| 2020-07-31|
| val2|2021-02-28 00:00:00| 2021-02-28|
| val3|2019-12-31 00:00:00| 2019-12-31|
+--------+-------------------+-----------+
root
|-- variable: string (nullable = true)
|-- date: string (nullable = true)
|-- date_casted: date (nullable = true)
output:
+--------+-------------------+-----------+------+
|variable| date|date_casted|result|
+--------+-------------------+-----------+------+
| val1|2020-07-31 00:00:00| 2020-07-31| true|
| val2|2021-02-28 00:00:00| 2021-02-28| false|
| val3|2019-12-31 00:00:00| 2019-12-31| true|
+--------+-------------------+-----------+------+
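If you prefer the array_contains formulation from the question, the same check can be sketched (building on the compareDateArray above, and assuming the df/date_casted names from this answer) by turning the Scala array into a literal array column:
import org.apache.spark.sql.functions.{array, array_contains, col, lit}
// build a literal array column from the java.sql.Date values, then test membership row by row
val compareDateCol = array(compareDateArray.map(d => lit(d)): _*)
val outputDf2 = df.withColumn("result", array_contains(compareDateCol, col("date_casted")))
outputDf2.show()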

Spark SQL CSV to JSON with different data types

Currently, I have a csv data like this:
id,key,value
id_1,int_key,1
id_1,string_key,asd
id_1,double_key,null
id_2,double_key,2.0
I'd like to transform these attributes grouped by their id with their corresponding correct data type to json.
I'm expecting to have a JSON structure like this:
[{
  "id": "id_1",
  "attributes": {
    "int_key": 1,
    "string_key": "asd",
    "double_key": null
  }
},
{
  "id": "id_2",
  "attributes": {
    "double_key": 2.0
  }
}]
My current solution is to use collect_list with to_json in Spark SQL, which looks roughly like this:
SELECT id, to_json(map_from_arrays(collect_list(key), collect_list(value))) AS attributes GROUP BY id
This works; however, I cannot find a way to cast the values to their correct data types, so everything ends up as a string:
[{
  "id": "id_1",
  "attributes": {
    "int_key": "1",
    "string_key": "asd",
    "double_key": "null"
  }
},
{
  "id": "id_2",
  "attributes": {
    "double_key": "2.0"
  }
}]
I also need to support null values, but I already found a solution for that: I use the ignoreNulls option in to_json. The issue with enumerating each attribute and casting it to its corresponding type is that I would be including all the attributes defined; I only want to include the attributes that are actually present for each id in the CSV file.
By the way, I'm using Spark 2.4.
Python: Here is my PySpark version of the Scala solution below. The results are the same.
from pyspark.sql.functions import col, max, struct
df = spark.read.option("header","true").csv("test.csv")
keys = [row.key for row in df.select(col("key")).distinct().collect()]
df2 = df.groupBy("id").pivot("key").agg(max("value"))
df2.show()
df2.printSchema()
for key in keys:
    df2 = df2.withColumn(key, col(key).cast(key.split('_')[0]))
df2.show()
df2.printSchema()
df3 = df2.select("id", struct("int_key", "double_key", "string_key").alias("attributes"))
jsonArray = df3.toJSON().collect()
for json in jsonArray: print(json)
Scala: I tried to split each type of value by using the pivot first.
val keys = df.select('key).distinct.rdd.map(r => r(0).toString).collect
val df2 = df.groupBy('id).pivot('key, keys).agg(max('value))
df2.show
df2.printSchema
Then, the DataFrame looks like below:
+----+-------+----------+----------+
| id|int_key|double_key|string_key|
+----+-------+----------+----------+
|id_2| null| 2.0| null|
|id_1| 1| null| asd|
+----+-------+----------+----------+
root
|-- id: string (nullable = true)
|-- int_key: string (nullable = true)
|-- double_key: string (nullable = true)
|-- string_key: string (nullable = true)
where the type of each column is still strings.
To cast it, I have used the foldLeft,
val df3 = keys.foldLeft(df2) { (df, key) => df.withColumn(key, col(key).cast(key.split("_").head)) }
df3.show
df3.printSchema
and the result now has the correct types.
+----+-------+----------+----------+
| id|int_key|double_key|string_key|
+----+-------+----------+----------+
|id_2| null| 2.0| null|
|id_1| 1| null| asd|
+----+-------+----------+----------+
root
|-- id: string (nullable = true)
|-- int_key: integer (nullable = true)
|-- double_key: double (nullable = true)
|-- string_key: string (nullable = true)
Then, you can build your json such as
val df4 = df3.select('id, struct('int_key, 'double_key, 'string_key) as "attributes")
val jsonArray = df4.toJSON.collect
jsonArray.foreach(println)
where the last line is for checking the result that is
{"id":"id_2","attributes":{"double_key":2.0}}
{"id":"id_1","attributes":{"int_key":1,"string_key":"asd"}}

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

I'm running PySpark SQL code on the Hortonworks sandbox:
18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3
# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map( lambda x : x.split("," ) )
df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"])
df1.printSchema()
root
|-- id: string (nullable = true)
|-- cat_id: string (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: string (nullable = true)
|-- url: string (nullable = true)
df1.show()
+---+------+--------------------+----+------+--------------------+
| id|cat_id| name|desc| price| url|
+---+------+--------------------+----+------+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 2| 2|Under Armour Men'...| |129.99|http://images.acm...|
| 3| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 4| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 5| 2|Riddell Youth Rev...| |199.99|http://images.acm...|
# When I try to get counts I get the following error.
df1.count()
**Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.**
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql ("select id, name, desc from products_tab order by name, id ").show();
I see that the column desc is empty; I'm not sure whether an empty column needs to be handled differently when creating the data frame and calling methods on it.
The same error occurs when running the SQL query. The SQL error seems to be due to the "order by" clause; if I remove order by, the query runs successfully.
Please let me know if you need more info; I'd appreciate an answer on how to handle this error.
I checked whether the name field contains any comma, as suggested by Chandan Ray.
There's no comma in the name field.
rdd1.count()
=> 1345
rdd2.count()
=> 1345
# clipping id and name column from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]) )
rdd_name.count()
=>1345
rdd_name_comma = rdd_name.filter (lambda x : True if x[1].find(",") != -1 else False )
rdd_name_comma.count()
==> 0
I found the issue: it was due to one bad record where a comma was embedded in a string. Even though the string was double quoted, Python's split breaks it into two columns.
I tried using databricks package
# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
# on pyspark
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])
df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
+---+------+--------------------+----+-----+--------------------+
| id|cat_id| name|desc|price| url|
+---+------+--------------------+----+-----+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 60|http://images.acm...|
| 2| 2|Under Armour Men'...| | 130|http://images.acm...|
| 3| 2|Under Armour Men'...| | 90|http://images.acm...|
| 4| 2|Under Armour Men'...| | 90|http://images.acm...|
| 5| 2|Riddell Youth Rev...| | 200|http://images.acm...|
df1.printSchema()
root
|-- id: integer (nullable = true)
|-- cat_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: decimal(10,0) (nullable = true)
|-- url: string (nullable = true)
df1.count()
1345
I suppose your name field has a comma in it, so it's splitting on that as well, which is why 7 values are found where 6 are expected.
There might be some malformed lines.
Please try the code below to exclude bad records into a separate path:
val df = spark.read.format("csv").option("badRecordsPath", "/tmp/badRecordsPath").load("csvpath")
// This reads the CSV into a DataFrame; any malformed record is moved into the path you provided.
// Please read:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
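Note that badRecordsPath is a Databricks-specific option. On plain Apache Spark 2.x+, a similar effect can be sketched with PERMISSIVE mode and a corrupt-record column (column and variable names here are just illustrative):
import org.apache.spark.sql.types._
val schema = new StructType()
  .add("id", IntegerType).add("cat_id", IntegerType)
  .add("name", StringType).add("desc", StringType)
  .add("price", DecimalType(10, 2)).add("url", StringType)
  .add("_corrupt_record", StringType)  // rows that fail to parse land here as raw text
val products = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/user/maria_dev/spark_data/products.csv")
products.show()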
Here is my take on cleaning such records; we normally encounter situations like this:
a. An anomaly in the data where, when the file was created, nobody checked whether "," is the best delimiter for the columns.
Here is my solution for that case:
Solution a: In such cases, we would like the process to identify, as part of data cleansing, whether a record is a qualified record. The remaining records, routed to a bad file/collection, give the opportunity to reconcile them later.
Below is the structure of my dataset (product_id,product_name,unit_price)
1,product-1,10
2,product-2,20
3,product,3,30
In the above case, product,3 was supposed to be read as product-3, which might have been a typo when the product was registered. In such a case, the sample below would work.
>>> tf = open("C:/users/ip2134/pyspark_practice/test_file.txt")
>>> trec = tf.read().splitlines()
>>> trec_clean, trec_bad = [], []
>>> for rec in trec:
...     if rec.count(",") == 2:
...         trec_clean.append(rec)
...     else:
...         trec_bad.append(rec)
...
>>> trec_clean
['1,product-1,10', '2,product-2,20']
>>> trec_bad
['3,product,3,30']
>>> trec
['1,product-1,10', '2,product-2,20', '3,product,3,30']
The other alternative for dealing with this problem would be to check whether skipinitialspace=True works to parse out the columns.
(Ref: Python parse CSV ignoring comma with double-quotes)
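Since the root cause was a comma inside a double-quoted field, one more option on Spark 2.3+ (a sketch, not from the answers above; the DDL schema string is just illustrative) is to let Spark's CSV parser handle the quoting instead of splitting lines by hand:
val productsDf = spark.read
  .option("quote", "\"")   // default quote character, shown for clarity
  .option("escape", "\"")  // treat a doubled quote inside a quoted field as a literal quote
  .schema("id INT, cat_id INT, name STRING, desc STRING, price DECIMAL(10,2), url STRING")
  .csv("/user/maria_dev/spark_data/products.csv")
productsDf.show()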

How to flatten columns of type array of structs (as returned by Spark ML API)?

Maybe it's just because I'm relatively new to the API, but I feel like Spark ML methods often return DFs that are unnecessarily difficult to work with.
This time, it's the ALS model that's tripping me up. Specifically, the recommendForAllUsers method. Let's reconstruct the type of DF it would return:
scala> val arrayType = ArrayType(new StructType().add("itemId", IntegerType).add("rating", FloatType))
scala> val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
  toDF("userId", "recommendations").
  select($"userId", $"recommendations".cast(arrayType))
scala> recs.show()
+------+------------------+
|userId| recommendations|
+------+------------------+
| 1|[[1,0.7], [2,0.5]]|
| 2|[[0,0.9], [4,0.1]]|
+------+------------------+
scala> recs.printSchema
root
|-- userId: integer (nullable = false)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemId: integer (nullable = true)
| | |-- rating: float (nullable = true)
Now, I only care about the itemId in the recommendations column. After all, the method is recommendForAllUsers not recommendAndScoreForAllUsers (ok ok I'll stop being sassy...)
How do I do this??
I thought I had it when I created a UDF:
scala> val itemIds = udf((arr: Array[(Int, Float)]) => arr.map(_._1))
but that produces an error:
scala> recs.withColumn("items", itemIds($"recommendations"))
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(recommendations)' due to data type mismatch: argument 1 requires array<struct<_1:int,_2:float>> type, however, '`recommendations`' is of array<struct<itemId:int,rating:float>> type.;;
'Project [userId#87, recommendations#92, UDF(recommendations#92) AS items#238]
+- Project [userId#87, cast(recommendations#88 as array<struct<itemId:int,rating:float>>) AS recommendations#92]
+- Project [_1#84 AS userId#87, _2#85 AS recommendations#88]
+- LocalRelation [_1#84, _2#85]
Any ideas? thanks!
wow, my coworker came up with an extremely elegant solution:
scala> recs.select($"userId", $"recommendations.itemId").show
+------+------+
|userId|itemId|
+------+------+
| 1|[1, 2]|
| 2|[0, 4]|
+------+------+
So maybe the Spark ML API isn't that difficult after all :)
With an array-typed column, e.g. recommendations, you'd be quite productive using the explode function (or the more advanced flatMap operator; a sketch of that appears at the end of this answer).
explode(e: Column): Column Creates a new row for each element in the given array or map column.
That gives you bare structs to work with.
import org.apache.spark.sql.types._
val structType = new StructType().
  add($"itemId".int).
  add($"rating".float)
val arrayType = ArrayType(structType)
val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
  toDF("userId", "recommendations").
  select($"userId", $"recommendations" cast arrayType)
val exploded = recs.withColumn("recs", explode($"recommendations"))
scala> exploded.show
+------+------------------+-------+
|userId| recommendations| recs|
+------+------------------+-------+
| 1|[[1,0.7], [2,0.5]]|[1,0.7]|
| 1|[[1,0.7], [2,0.5]]|[2,0.5]|
| 2|[[0,0.9], [4,0.1]]|[0,0.9]|
| 2|[[0,0.9], [4,0.1]]|[4,0.1]|
+------+------------------+-------+
Structs are nice with the select operator and * (star), which flattens them into one column per struct field. Here you could do select($"recs.*").
scala> exploded.select("userId", "recs.*").show
+------+------+------+
|userId|itemId|rating|
+------+------+------+
| 1| 1| 0.7|
| 1| 2| 0.5|
| 2| 0| 0.9|
| 2| 4| 0.1|
+------+------+------+
I think that could do what you're after.
p.s. Stay away from UDFs as long as possible since they "trigger" row conversion from the internal format (InternalRow) to JVM objects that can lead to excessive GCs.
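For completeness, here is a minimal sketch of the flatMap route mentioned above, assuming the same recs DataFrame and spark.implicits._ in scope (the case class names are only for this example):
case class Rec(itemId: Int, rating: Float)
case class UserRecs(userId: Int, recommendations: Seq[Rec])
// one output row per (userId, recommendation) pair
val flattened = recs.as[UserRecs]
  .flatMap(ur => ur.recommendations.map(r => (ur.userId, r.itemId, r.rating)))
  .toDF("userId", "itemId", "rating")
flattened.show()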

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

I have a Dataframe A that contains a column of array string.
...
|-- browse: array (nullable = true)
| |-- element: string (containsNull = true)
...
For example three sample rows would be
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
And another Dataframe B that contains a string column
|-- browsenodeid: string (nullable = true)
Some sample rows for it would be
+------------+
|browsenodeid|
+------------+
| A|
| Z|
| M|
How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above examples the result will be:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1| <- because Z is a value of B.browsenodeid
| foo3| [M]| bar3| <- because M is a value of B.browsenodeid
If I had a single value then I would use something like
A.filter(array_contains(A("browse"), single_value))
But what do I do with a list or DataFrame of values?
I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.
Assuming you have a DataFrame dataDF:
+---------+--------+---------+
| column 1| browse| column n|
+---------+--------+---------+
| foo1| [X,Y,Z]| bar1|
| foo2| [K,L]| bar2|
| foo3| [M]| bar3|
and an array b containing the values you want to match in browse
val b: Array[String] = Array("M", "Z")
Implement the udf:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray
def array_contains_any(s: Seq[String]): UserDefinedFunction = {
  udf((c: WrappedArray[String]) =>
    c.toList.intersect(s).nonEmpty)
}
and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering like:
dataDF.where(array_contains_any(b)($"browse"))
In Spark >= 2.4.0 you can use arrays_overlap:
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}
val df = Seq(
("foo1", Seq("X", "Y", "Z"), "bar1"),
("foo2", Seq("K", "L"), "bar2"),
("foo3", Seq("M"), "bar3")
).toDF("col1", "browse", "coln")
val b = Seq("M" ,"Z")
val searchArray = array(b.map{lit}:_*) // cast to lit(i) then create Spark array
df.where(arrays_overlap($"browse", searchArray)).show()
// +----+---------+----+
// |col1| browse|coln|
// +----+---------+----+
// |foo1|[X, Y, Z]|bar1|
// |foo3| [M]|bar3|
// +----+---------+----+
Assume input data: Dataframe A
browse
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
and you have to match it with Dataframe B, whose column browsenodeid I have flattened to: 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\Dataframe_A.txt")
rawrdd.map(_.split("\\|"))
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(","))
  .foreach(println)
Your output:
300,200
300
123
Updated
val matchSet = "A,Z,M".split(",").toSet
val rawrdd = sc.textFile("/FileStore/tables/mvv45x9f1494518792828/input_A.txt")
rawrdd.map(_.split("\\|"))
  .filter(r => r(1).split(",").toSet.intersect(matchSet).nonEmpty)
  .map(r => org.apache.spark.sql.Row(r(0), r(1), r(2)))
  .collect.foreach(println)
Output is
foo1,X,Y,Z,bar1
foo3,M,bar3
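If the values really live in another DataFrame B rather than a local array, one more option (a sketch assuming A and B are available as dfA and dfB) is to explode the array and use a left-semi join, which keeps everything distributed instead of collecting B to the driver:
import org.apache.spark.sql.functions.explode
// one row per browse element, keep only rows that match B, then restore A's original shape
val matched = dfA
  .withColumn("browse_item", explode($"browse"))
  .join(dfB, $"browse_item" === $"browsenodeid", "left_semi")
  .drop("browse_item")
  .dropDuplicates()
matched.show()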
