scala.collection.mutable.WrappedArray$ofRef cannot be cast to Integer - apache-spark

I'm fairly new to Spark and Scala. I'm trying to call a function as a Spark UDF, but I run into an error that I can't seem to resolve.
I understand that in Scala, Array and Seq aren't the same. WrappedArray is a subtype of Seq, and there are implicit conversions between WrappedArray and Array, but I'm not sure why that conversion doesn't happen in the case of the UDF.
Any pointers to help me understand and resolve this are much appreciated.
Here's a snippet of the code:
def filterMapKeysWithSet(m: Map[Int, Int], a: Array[Int]): Map[Int, Int] = {
  val seqToArray = a.toArray
  val s = seqToArray.toSet
  m filterKeys s
}
val myUDF = udf((m: Map[Int, Int], a: Array[Int]) => filterMapKeysWithSet(m, a))
case class myType(id: Int, m: Map[Int, Int])
val mapRDD = Seq(myType(1, Map(1 -> 100, 2 -> 200)), myType(2, Map(1 -> 100, 2 -> 200)), myType(3, Map(3 -> 300, 4 -> 400)))
val mapDF = mapRDD.toDF
mapDF: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>]
root
|-- id: integer (nullable = false)
|-- m: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = false)
case class myType2(id: Int, a: Array[Int])
val idRDD = Seq(myType2(1, Array(1,2,100,200)), myType2(2, Array(100,200)), myType2(3, Array(1,2)) )
val idDF = idRDD.toDF
idDF: org.apache.spark.sql.DataFrame = [id: int, a: array<int>]
root
|-- id: integer (nullable = false)
|-- a: array (nullable = true)
| |-- element: integer (containsNull = false)
import sqlContext.implicits._
/* Hive context is exposed as sqlContext */
val j = mapDF.join(idDF, idDF("id") === mapDF("id")).drop(idDF("id"))
val k = j.withColumn("filteredMap",myUDF(j("m"), j("a")))
k.show
Looking at DataFrames "j" and "k", the map and array columns have the right data types.
j: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>, a: array<int>]
root
|-- id: integer (nullable = false)
|-- m: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = false)
|-- a: array (nullable = true)
| |-- element: integer (containsNull = false)
k: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>, a: array<int>, filteredMap: map<int,int>]
root
|-- id: integer (nullable = false)
|-- m: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = false)
|-- a: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- filteredMap: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = false)
However, an action on DataFrame "k" that calls the UDF fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, ip-100-74-42-194.ec2.internal): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:60)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1865)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1865)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Changing the parameter type from Array[Int] to Seq[Int] in filterMapKeysWithSet (and in the UDF) resolves the above issue.
def filterMapKeysWithSet(m: Map[Int, Int], a: Seq[Int]): Map[Int, Int] = {
  val seqToArray = a.toArray
  val s = seqToArray.toSet
  m filterKeys s
}
val myUDF = udf((m: Map[Int, Int], a: Seq[Int]) => filterMapKeysWithSet(m, a))
k: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>, a: array<int>, filteredMap: map<int,int>]
root
|-- id: integer (nullable = false)
|-- m: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = false)
|-- a: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- filteredMap: map (nullable = true)
| |-- key: integer
| |-- value: integer (valueContainsNull = false)
+---+--------------------+----------------+--------------------+
| id| m| a| filteredMap|
+---+--------------------+----------------+--------------------+
| 1|Map(1 -> 100, 2 -...|[1, 2, 100, 200]|Map(1 -> 100, 2 -...|
| 2|Map(1 -> 100, 2 -...| [100, 200]| Map()|
| 3|Map(3 -> 300, 4 -...| [1, 2]| Map()|
+---+--------------------+----------------+--------------------+
So it looks like an ArrayType column on DataFrame "idDF" reaches the UDF as a WrappedArray (a Seq), not as an Array. The call to filterMapKeysWithSet failed because the UDF declared Array[Int], so the generated code tried to cast the incoming WrappedArray to a primitive Int array ([I) at runtime, and that cast fails. Scala's implicit conversions can't help here: since Scala 2.8 they only go from Array to WrappedArray, not the other way, and they don't apply to a runtime cast anyway. Declaring the parameter as Seq[Int] avoids the cast entirely.
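If you want to confirm what actually reaches the UDF at runtime, a quick sketch (using the idDF from above; the whatIsIt name is just for illustration) is to declare the parameter as Seq[Int] and report its concrete class:
// Report the runtime class of the value Spark passes for an array<int> column.
val whatIsIt = udf((a: Seq[Int]) => a.getClass.getName)
idDF.select(whatIsIt($"a")).show(false)
// This typically prints scala.collection.mutable.WrappedArray$ofRef, which is why
// declaring the parameter as Seq[Int] works while Array[Int] triggers the cast error.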

Related

Spark SQL select subset of fields of map column

We're processing data from different sources that differ in structure but share some core fields. For example:
root
|-- sesionId: long (nullable = false)
|-- region: string (nullable = true)
|-- auctions: map (nullable = true)
| |-- key: long
| |-- value: struct (valueContainsNull = true)
| | |-- requests: long (nullable = false)
| | |-- wins: long (nullable = false)
| | |-- malformed: long (nullable = false)
and
root
|-- sesionId: long (nullable = false)
|-- datacenter: string (nullable = true)
|-- trafficType: string (nullable = true)
|-- auctions: map (nullable = true)
| |-- key: long
| |-- value: struct (valueContainsNull = true)
| | |-- requests: long (nullable = false)
| | |-- wins: long (nullable = false)
| | |-- rejected: long (nullable = false)
| | |-- monetized: long (nullable = false)
Let's say I want to aggregate core statistics regardless of the input source.
case class CoreKpis(requests: Long, wins: Long)
class DummyMapAggregate[K, V](implicit E: Encoder[Map[K, V]])
    extends Aggregator[Map[K, V], Map[K, V], Map[K, V]] {
  def zero: Map[K, V] = Map.empty
  def reduce(buffer: Map[K, V], input: Map[K, V]): Map[K, V] = input
  def merge(buffer1: Map[K, V], buffer2: Map[K, V]): Map[K, V] = buffer2
  def finish(reduction: Map[K, V]): Map[K, V] = reduction
  def bufferEncoder: Encoder[Map[K, V]] = E
  def outputEncoder: Encoder[Map[K, V]] = E
}
val aggMap = udaf(new DummyMapAggregate[Long, CoreKpis]())
df.groupBy(col("sessionId")).agg(aggMap(col("auctions")))
But I end up getting
org.apache.spark.sql.AnalysisException: cannot resolve 'DummyMapAggregate(auctions)'
due to data type mismatch: argument 1 requires map<bigint,struct<requests:bigint,wins:bigint>> type,
however, '`auctions`' is of map<bigint,struct<requests:bigint,wins:bigint,malformed:bigint>> type.
The question is: how do I avoid this exception? Should I somehow cast the column to the correct type first?
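One way to sidestep the mismatch (not from the original post; assumes Spark 3.0+, which is implied by the use of udaf) is to project each map value down to the common fields before aggregating, e.g. with transform_values. A rough sketch, reusing the df and aggMap names from above:
import org.apache.spark.sql.functions.{col, struct, transform_values}
// Rebuild each map value so it only contains the two common fields,
// giving both sources the same map<bigint,struct<requests,wins>> value type.
val normalized = df.withColumn("auctions",
  transform_values(col("auctions"), (k, v) =>
    struct(v.getField("requests").as("requests"), v.getField("wins").as("wins"))))
normalized.groupBy(col("sessionId")).agg(aggMap(col("auctions")))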

Convert DataFrame Format

I have my DataFrame in the format below:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- data: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and I want to convert it so that each map entry becomes its own row:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Example:
From:
1,12345, [pq -> r, ab -> c]
To:
1,12345, pq ,r
1,12345, ab ,c
I am trying this code, but it doesn't work:
val array2Df = array1Df.flatMap(line =>
line.getMap[String, String](2).map(
(line.getString(0),line.getString(1),_)
))
Try the following:
val arrayData = Seq(
  Row("1", "epoch_1", Map("epoch_1_key1" -> "epoch_1_val1", "epoch_1_key2" -> "epoch_1_Val2")),
  Row("2", "epoch_2", Map("epoch_2_key1" -> "epoch_2_val1", "epoch_2_key2" -> "epoch_2_Val2"))
)
val arraySchema = new StructType()
  .add("Id", StringType)
  .add("epoch", StringType)
  .add("data", MapType(StringType, StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.printSchema()
df.show(false)
After that, you need to explode the data column. Don't forget to
import org.apache.spark.sql.functions.explode
df.select($"Id",explode($"data")).show(false)

How to convert a list with structure like (key1, list(key2, value)) into a dataframe in pyspark?

I have a list with the following structure:
[(key1, [(key11, value11), (key12, value12)]), (key2, [(key21, value21), (key22, value22)...])...]
A sample is shown below:
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])]
I want to convert this into a dataframe of the form
key1 key2 score
1052762305 1007819788 0.9206884810054885
1052762305 1005886801 0.913818268123084
1052762305 1003863766 0.9131746152849486
... ... ...
1064808607 1007804896 0.9984397647563017
1064808607 1003705572 0.9970498347406341
1064808607 1005879599 0.9951581013190172
... ... ...
How can we implement this in pyspark?
You can create a schema upfront for the input. Use explode and access the elements within the value struct.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
schema = StructType([
    StructField("key1", StringType()),
    StructField("value", ArrayType(
        StructType([
            StructField("key2", StringType()),
            StructField("score", DoubleType())
        ])
    ))
])
df = spark.createDataFrame(
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])
],schema
)
df.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[[1007819788, 0.9...|
|1064808607|[[1007804896, 0.9...|
+----------+--------------------+
df.printSchema()
root
|-- key1: string (nullable = true)
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key2: string (nullable = true)
| | |-- score: double (nullable = true)
df1=df.select('key1',F.explode('value').alias('value'))
df1.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[1007819788, 0.92...|
|1052762305|[1005886801, 0.91...|
|1052762305|[1003863766, 0.91...|
|1052762305|[1007811435, 0.91...|
|1052762305|[1005879599, 0.91...|
|1052762305|[1003705572, 0.91...|
|1052762305|[1007804896, 0.90...|
|1052762305|[1005890270, 0.89...|
|1052762305|[1007806781, 0.87...|
|1052762305|[1003670458, 0.84...|
|1064808607|[1007804896, 0.99...|
|1064808607|[1003705572, 0.99...|
|1064808607|[1005879599, 0.99...|
|1064808607|[1007811435, 0.99...|
|1064808607|[1005886801, 0.99...|
|1064808607|[1003863766, 0.99...|
|1064808607|[1007819788, 0.98...|
|1064808607|[1005890270, 0.96...|
|1064808607|[1007806781, 0.92...|
|1064808607|[1003670458, 0.85...|
+----------+--------------------+
df1.printSchema()
root
|-- key1: string (nullable = true)
|-- value: struct (nullable = true)
| |-- key2: string (nullable = true)
| |-- score: double (nullable = true)
df1.select("key1", "value.key2","value.score").show()
+----------+----------+------------------+
| key1| key2| score|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
|1052762305|1003863766|0.9131746152849486|
|1052762305|1007811435|0.9128666156173751|
|1052762305|1005879599|0.9126368405937075|
|1052762305|1003705572|0.9122051062936369|
|1052762305|1007804896|0.9083424459788203|
|1052762305|1005890270|0.8982097535650703|
|1052762305|1007806781|0.8708761186829758|
|1052762305|1003670458|0.8452789033694487|
|1064808607|1007804896|0.9984397647563017|
|1064808607|1003705572|0.9970498347406341|
|1064808607|1005879599|0.9951581013190172|
|1064808607|1007811435|0.9934813787902085|
|1064808607|1005886801|0.9930572794622374|
|1064808607|1003863766|0.9928815742735568|
|1064808607|1007819788|0.9869723713790797|
|1064808607|1005890270|0.9642640856016443|
|1064808607|1007806781|0.9211558765137313|
|1064808607|1003670458|0.8519872445941068|
+----------+----------+------------------+
You basically need to do the following:
create a dataframe from your list
promote the pairs from elements of array into a separate row by using explode
extract key & value from pair via select
This could be done with something like this (the source data is in a variable called a):
from pyspark.sql.functions import explode, col
df = spark.createDataFrame(a, ['key1', 'val'])
df2 = df.select(col('key1'), explode(col('val')).alias('val'))
df3 = df2.select('key1', col('val')._1.alias('key2'), col('val')._2.alias('value'))
We can check that the schema and data match:
>>> df3.printSchema()
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- value: double (nullable = true)
>>> df3.show(2)
+----------+----------+------------------+
| key1| key2| value|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
+----------+----------+------------------+
only showing top 2 rows
We can also check the schemas of the intermediate results:
>>> df.printSchema()
root
|-- key1: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)
>>> df2.printSchema()
root
|-- key1: string (nullable = true)
|-- val: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: double (nullable = true)

How to JSON-escape a String field in Spark dataFrame with new column

How do I write a new column in JSON format through a DataFrame? I tried several approaches, but it keeps writing the data as a JSON-escaped String field.
Currently it's written as
{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}
Instead, I want it to be
{"test":{"id":1,"name":"name","problem_field": {"x":100,"y":200}}}
problem_field is a new column that is being created based on the values read from other fields as:
val dataFrame = oldDF.withColumn("problem_field", s)
I have tried the following approaches
dataFrame.write.json(<<outputPath>>)
dataFrame.toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).write.json(<<outputPath>>)
I tried converting to a Dataset as well, but no luck. Any pointers are greatly appreciated.
I have already tried the logic mentioned here: How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?
For starters, your example data has an extraneous comma after "y\":200 which will prevent it from being parsed as it is not valid JSON.
From there, you can use from_json to parse the field, assuming you know the schema. In this example, I'm parsing the field separately to first get the schema:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> json.printSchema
root
|-- test: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: string (nullable = true)
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{
case org.apache.spark.sql.Row(x : String) => x
})
problem_field: org.apache.spark.sql.DataFrame = [x: bigint, y: bigint]
scala> problem_field.printSchema
root
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
If the schema of problem_field's contents is inconsistent between rows, this solution will still work, but it may not be an optimal way of handling things, as it will produce a sparse DataFrame where each row contains every field encountered across all of problem_field. For example:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""", """{"test":{"id":1,"name":"name","problem_field": "{\"a\":10,\"b\":20}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{case org.apache.spark.sql.Row(x : String) => x})
problem_field: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 2 more fields]
scala> problem_field.printSchema
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
scala> fixed.select($"test.problem_field.*").show
+----+----+----+----+
| a| b| x| y|
+----+----+----+----+
|null|null| 100| 200|
| 10| 20|null|null|
+----+----+----+----+
Over the course of hundreds, thousands, or millions of rows, you can see how this would present a problem.
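If the fields really do vary from row to row, one alternative worth considering (not from the original answer; assumes Spark 2.2+ and that all the values are numeric) is to parse the escaped string into a map instead of a struct, so you don't accumulate a union of every field ever seen:
import org.apache.spark.sql.functions.{from_json, struct}
import org.apache.spark.sql.types.{LongType, MapType, StringType}
// Parse problem_field into map<string,bigint> rather than a wide, sparse struct.
val asMap = json.withColumn("test",
  struct($"test.id", $"test.name",
    from_json($"test.problem_field", MapType(StringType, LongType)).as("problem_field")))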

How do I query DataFrame by the value of struct?

There is a DataFrame with the following schema:
a.printSchema()
root
|-- fieldA: long (nullable = false)
|-- fieldB: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: long (nullable = false)
I want to query by the value of _2 inside fieldB.
I know how to query by the value of fieldA: a.where("fieldA = 1234").show(). But I don't know how to query by a value inside a struct.
You can access a struct element with field.subfield; for your case:
a.where("fieldB._2 = 123")
Example:
val df = Seq((1L, ("a", 2L)), (2L, ("b", 3L))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: bigint, B: struct<_1: string, _2: bigint>]
df.printSchema
root
|-- A: long (nullable = false)
|-- B: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: long (nullable = false)
df.where("B._2 = 3").show
+---+-----+
| A| B|
+---+-----+
| 2|[b,3]|
+---+-----+
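The same filter can also be written with the Column API instead of a SQL expression string, which avoids embedding the predicate in a string:
// Equivalent to the string predicate above.
df.where($"B._2" === 3).show
// or, without the $ interpolator:
// df.where(col("B").getField("_2") === 3).show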
