We're processing data from different sources that all differ in structure yet share a set of common core fields. For example:
root
|-- sessionId: long (nullable = false)
|-- region: string (nullable = true)
|-- auctions: map (nullable = true)
| |-- key: long
| |-- value: struct (valueContainsNull = true)
| | |-- requests: long (nullable = false)
| | |-- wins: long (nullable = false)
| | |-- malformed: long (nullable = false)
and
root
|-- sessionId: long (nullable = false)
|-- datacenter: string (nullable = true)
|-- trafficType: string (nullable = true)
|-- auctions: map (nullable = true)
| |-- key: long
| |-- value: struct (valueContainsNull = true)
| | |-- requests: long (nullable = false)
| | |-- wins: long (nullable = false)
| | |-- rejected: long (nullable = false)
| | |-- monetized: long (nullable = false)
Let's say I want to aggregate the core statistics regardless of the input source.
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

case class CoreKpis(requests: Long, wins: Long)

class DummyMapAggregate[K, V](implicit E: Encoder[Map[K, V]])
    extends Aggregator[Map[K, V], Map[K, V], Map[K, V]] {
  def zero: Map[K, V] = Map.empty
  def reduce(buffer: Map[K, V], input: Map[K, V]): Map[K, V] = input
  def merge(buffer1: Map[K, V], buffer2: Map[K, V]): Map[K, V] = buffer2
  def finish(reduction: Map[K, V]): Map[K, V] = reduction
  def bufferEncoder: Encoder[Map[K, V]] = E
  def outputEncoder: Encoder[Map[K, V]] = E
}

val aggMap = udaf(new DummyMapAggregate[Long, CoreKpis]())

df.groupBy(col("sessionId")).agg(aggMap(col("auctions")))
But I end up getting
org.apache.spark.sql.AnalysisException: cannot resolve 'DummyMapAggregate(auctions)'
due to data type mismatch: argument 1 requires map<bigint,struct<requests:bigint,wins:bigint>> type,
however, '`auctions`' is of map<bigint,struct<requests:bigint,wins:bigint,malformed:bigint>> type.
The question is: how do I avoid this exception? Should I somehow cast the column to the correct type first, or is there another way?
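One way to sidestep the mismatch (a minimal sketch, assuming Spark 3.0+ for transform_values and that requests and wins are the only fields the aggregator needs) is to narrow the map values down to the shared struct before handing the column to the UDAF:

import org.apache.spark.sql.functions.{struct, transform_values}

// Project every map value down to the fields the aggregator expects, so the
// column type becomes map<long, struct<requests, wins>> for all sources.
val coreAuctions = transform_values(
  col("auctions"),
  (_, kpis) => struct(
    kpis.getField("requests").as("requests"),
    kpis.getField("wins").as("wins")
  )
)

df.groupBy(col("sessionId")).agg(aggMap(coreAuctions))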
I have a DataFrame with a single column which is a struct type and contains an array.
users_tp_df.printSchema()
root
|-- x: struct (nullable = true)
| |-- ActiveDirectoryName: string (nullable = true)
| |-- AvailableFrom: string (nullable = true)
| |-- AvailableFutureAllocation: long (nullable = true)
| |-- AvailableFutureHours: double (nullable = true)
| |-- CreateDate: string (nullable = true)
| |-- CurrentAllocation: long (nullable = true)
| |-- CurrentAvailableHours: double (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- Value: string (nullable = true)
I'm trying to convert the CustomFields array column into three columns:
Country;
isExternal;
Service.
So, for example, given sample values for those custom fields, the expected final dataframe for that row would carry Country, isExternal and Service as separate columns.
Can anyone please help me in achieving this?
Thank you!
This would work:
import pyspark.sql.functions as F

initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")
final_df = initial_expansion \
    .join(initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))
                           .select("*", "CustomFields.*")
                           .groupBy("id").pivot("Name").agg(F.first("Value")),
          "id") \
    .drop("CustomFields")
Sample Input:
Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}
Input Structure:
root
|-- x: struct (nullable = true)
| |-- CurrentAvailableHours: integer (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Value: string (nullable = true)
Output:
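With the sample input above, the result should look roughly like this (the exact id value comes from monotonically_increasing_id and is not guaranteed):
+---+---------------------+-------+-------+
| id|CurrentAvailableHours|Country|Service|
+---+---------------------+-------+-------+
|  0|                    2|  Italy|    Dev|
+---+---------------------+-------+-------+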
Output Structure (Id can be dropped):
root
|-- id: long (nullable = false)
|-- CurrentAvailableHours: integer (nullable = true)
|-- Country: string (nullable = true)
|-- Service: string (nullable = true)
Considering a mockup structure similar to the one from your example (built inline in the query below),
you can do it the SQL way by using the inline function:
with alpha as (
  select named_struct("alpha", "abc", "beta", 2.5, "gamma", 3, "delta"
         , array( named_struct("a", "x", "b", "y", "c", "z")
                , named_struct("a", "xx", "b", "yy", "c", "zz"))
  ) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c)
from alpha
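The inline function produces one output row per element of the delta array, repeating the other selected columns, so the query above should return something along these lines:
+-----+----+-----+---+---+---+
|alpha|beta|gamma|  a|  b|  c|
+-----+----+-----+---+---+---+
|  abc| 2.5|    3|  x|  y|  z|
|  abc| 2.5|    3| xx| yy| zz|
+-----+----+-----+---+---+---+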
Using PySpark, I need to load the "Properties" object (the map's value) from an Avro file into its own Spark dataframe, so that "Properties" becomes a dataframe with its elements and values as columns and rows. I'm struggling to find clear examples that accomplish this.
Schema of the file:
root
|-- SequenceNumber: long (nullable = true)
|-- Offset: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Properties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Body: binary (nullable = true)
The resulting "Properties" dataframe loaded from the above avro file needs to be like this:
+-------+-------+-------+-------+
|member0|member1|member2|member3|
+-------+-------+-------+-------+
|  value|  value|  value|  value|
+-------+-------+-------+-------+
map_values is your friend.
Collection function: Returns an unordered array containing the values of the map.
New in version 2.3.0.
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
Full example:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [('a', 20, 4.5, 'r', b'8')],
    ['key', 'member0', 'member1', 'member2', 'member3'])
df = df.select(F.create_map('key', F.struct('member0', 'member1', 'member2', 'member3')).alias('Properties'))
df.printSchema()
# root
# |-- Properties: map (nullable = false)
# | |-- key: string
# | |-- value: struct (valueContainsNull = false)
# | | |-- member0: long (nullable = true)
# | | |-- member1: double (nullable = true)
# | | |-- member2: string (nullable = true)
# | | |-- member3: binary (nullable = true)
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
df_properties.show()
# +-------+-------+-------+-------+
# |member0|member1|member2|member3|
# +-------+-------+-------+-------+
# | 20| 4.5| r| [38]|
# +-------+-------+-------+-------+
df_properties.printSchema()
# root
# |-- member0: long (nullable = true)
# |-- member1: double (nullable = true)
# |-- member2: string (nullable = true)
# |-- member3: binary (nullable = true)
My Spark dataframe's current schema is shown below. Is there a way I can remove the outer struct column (DTC_CAN_SIGNALS)?
**Current Schema**:
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = true)
|-- VIN: string (nullable = true)
|-- DTC_CAN_SIGNALS: struct (nullable = true)
| |-- SGNL: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- SN: string (nullable = true)
| | | |-- ST: long (nullable = true)
| | | |-- SV: double (nullable = true)
**Expected Schema**:
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = true)
|-- VIN: string (nullable = true)
|-- SGNL: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- SN: string (nullable = true)
| | |-- ST: long (nullable = true)
| | |-- SV: double (nullable = true)
Just select your column from the struct, like:
df.withColumn("SGNL", col("DTC_CAN_SIGNALS.SGNL"))
or
df.select("DTC_CAN_SIGNALS.SGNL")
Code:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val data = Seq(
("DTC", 42L, "VIN")
).toDF("DTC", "DTCTS", "VIN")
val df = data.withColumn("DTC_CAN_SIGNALS", struct(array(struct(lit("sn1").as("SN"), lit(42L).as("ST"), lit(42.0D).as("SV"))).as("SGNL")))
df.show()
df.printSchema()
// alternatively
// val resDf = df
// .withColumn("SGNL", col("DTC_CAN_SIGNALS.SGNL"))
// .drop("DTC_CAN_SIGNALS")
val resDf = df.select("DTC", "DTCTS", "VIN", "DTC_CAN_SIGNALS.SGNL")
resDf.show()
resDf.printSchema()
Output:
+---+-----+---+-------------------+
|DTC|DTCTS|VIN| DTC_CAN_SIGNALS|
+---+-----+---+-------------------+
|DTC| 42|VIN|[[[sn1, 42, 42.0]]]|
+---+-----+---+-------------------+
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = false)
|-- VIN: string (nullable = true)
|-- DTC_CAN_SIGNALS: struct (nullable = false)
| |-- SGNL: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- SN: string (nullable = false)
| | | |-- ST: long (nullable = false)
| | | |-- SV: double (nullable = false)
+---+-----+---+-----------------+
|DTC|DTCTS|VIN| SGNL|
+---+-----+---+-----------------+
|DTC| 42|VIN|[[sn1, 42, 42.0]]|
+---+-----+---+-----------------+
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = false)
|-- VIN: string (nullable = true)
|-- SGNL: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- SN: string (nullable = false)
| | |-- ST: long (nullable = false)
| | |-- SV: double (nullable = false)
How can I write a new column in JSON format through a DataFrame? I tried several approaches, but it writes the data as a JSON-escaped string field.
Currently it's written as
{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}
Instead, I want it to be written as
{"test":{"id":1,"name":"name","problem_field": {"x":100,"y":200}}}
problem_field is a new column that is being created based on the values read from other fields as:
val dataFrame = oldDF.withColumn("problem_field", s)
I have tried the following approaches
dataFrame.write.json(<<outputPath>>)
dataFrame.toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).write.json(<<outputPath>>)
I tried converting to a Dataset as well, but no luck. Any pointers are greatly appreciated.
I have already tried the logic mentioned here: How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?
For starters, your example data has an extraneous comma after "y\":200 which will prevent it from being parsed as it is not valid JSON.
From there, you can use from_json to parse the field, assuming you know the schema. In this example, I'm parsing the field separately to first get the schema:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> json.printSchema
root
|-- test: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: string (nullable = true)
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{
case org.apache.spark.sql.Row(x : String) => x
})
problem_field: org.apache.spark.sql.DataFrame = [x: bigint, y: bigint]
scala> problem_field.printSchema
root
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
If the schema of problem_fields contents is inconsistent between rows, this solution will still work but may not be an optimal way of handling things, as it will produce a sparse Dataframe where each row contains every field encountered in problem_field. For example:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""", """{"test":{"id":1,"name":"name","problem_field": "{\"a\":10,\"b\":20}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{case org.apache.spark.sql.Row(x : String) => x})
problem_field: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 2 more fields]
scala> problem_field.printSchema
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
scala> fixed.select($"test.problem_field.*").show
+----+----+----+----+
| a| b| x| y|
+----+----+----+----+
|null|null| 100| 200|
| 10| 20|null|null|
+----+----+----+----+
Over the course of hundreds, thousands, or millions of rows, you can see how this would present a problem.
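If the contents of problem_field really are that heterogeneous but the values share a single type (here everything is a long), one alternative sketch is to parse it into a MapType instead of a struct, so each row only carries the keys it actually contains; when written with df.write.json, a map value also comes out as a plain JSON object:

import org.apache.spark.sql.functions.{from_json, struct}
import org.apache.spark.sql.types.{LongType, MapType, StringType}

// Reuses the json DataFrame from above; parse the escaped string as map<string,long>.
val asMap = json.withColumn("test", struct(
  $"test.id",
  $"test.name",
  from_json($"test.problem_field", MapType(StringType, LongType)).as("problem_field")
))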
I want to use Spark slice function with start and length defined as Column(s).
def slice(x: Column, start: Int, length: Int): Column
x looks like this:
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: double (nullable = true)
| | |-- b: double (nullable = true)
| | |-- c: double (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: double (nullable = true)
| | |-- f: double (nullable = true)
| | |-- g: long (nullable = true)
| | |-- h: double (nullable = true)
| | |-- i: double (nullable = true)
...
Any idea on how to achieve this?
Thanks!
You cannot use the built-in DataFrame DSL function slice for this (it needs constant slice bounds), but you can use a UDF instead. If df is your dataframe and you have from and until columns, then you can do:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val mySlice = udf(
  (data: Seq[Row], from: Int, until: Int) => data.slice(from, until),
  df.schema.fields.find(_.name == "x").get.dataType
)

df
  .select(mySlice($"x", $"from", $"until"))
  .show()
Alternatively, you can use the SQL expression in Spark SQL:
df
  .select(expr("slice(x,from,until)"))
  .show()
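As a side note, newer Spark releases (3.1+, if I remember correctly) also ship an overload of slice that takes the start and length as Columns, so neither the UDF nor the SQL string is needed; as with the expression above, the arguments are a 1-based start and a length, not an exclusive end index:

import org.apache.spark.sql.functions.slice

df
  .select(slice($"x", $"from", $"until"))
  .show()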