Need to parse the json file - apache-spark

root
|-- eid: string (nullable = true)
|-- keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- type: string (nullable = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to parse a JSON file with the above schema into a structured format using a Spark DataFrame. The keys column holds the column names, and the corresponding values are in the values column.
sample data file:
{'type': 'logs', 'eid': '1', 'keys': ['crt_ts', 'id', 'upd_ts', 'km', 'pivl', 'distance', 'speed'], 'values': [['12343.0000.012', 'AAGA1567', '1333.333.333', '565656', '10.5', '121', '64']]}
expected output:
eid crt_ts id upd_ts km pivl distance speed type
1 12343.0000.012 AAGA1567 1333.333.333 565656 10.5 121 64 logs

Check the code below; it uses groupBy, pivot and agg:
scala> val js = Seq(""" {'type': 'logs', 'eid': '1', 'keys': ['crt_ts', 'id', 'upd_ts', 'km', 'pivl', 'distance', 'speed'], 'values': [['12343.0000.012', 'AAGA1567', '1333.333.333', '565656', '10.5', '121', '64']]}""").toDS
js: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val jdf = spark.read.json(js)
jdf: org.apache.spark.sql.DataFrame = [eid: string, keys: array<string> ... 2 more fields]
scala> jdf.printSchema
root
|-- eid: string (nullable = true)
|-- keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- type: string (nullable = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
scala> jdf.show(false)
+---+-----------------------------------------------+----+-----------------------------------------------------------------+
|eid|keys |type|values |
+---+-----------------------------------------------+----+-----------------------------------------------------------------+
|1 |[crt_ts, id, upd_ts, km, pivl, distance, speed]|logs|[[12343.0000.012, AAGA1567, 1333.333.333, 565656, 10.5, 121, 64]]|
+---+-----------------------------------------------+----+-----------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
jdf.select($"eid",$"keys",explode($"values").as("values"),$"type")
.select($"eid",$"type",explode(arrays_zip($"keys",$"values")).as("azip"))
.select($"eid",$"azip.*",$"type")
.groupBy($"type",$"eid")
.pivot($"keys")
.agg(first("values"))
.show(false)
// Exiting paste mode, now interpreting.
+----+---+--------------+--------+--------+------+----+-----+------------+
|type|eid|crt_ts |distance|id |km |pivl|speed|upd_ts |
+----+---+--------------+--------+--------+------+----+-----+------------+
|logs|1 |12343.0000.012|121 |AAGA1567|565656|10.5|64 |1333.333.333|
+----+---+--------------+--------+--------+------+----+-----+------------+
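The reshaping performed by the pipeline above can also be modelled in plain Python, which may help when reasoning about the expected result. This is only a sketch of the logic, not Spark code: the function name rows_from_record is made up for illustration, and it assumes every inner list in values has the same length as keys.

```python
import json

def rows_from_record(record):
    """Pair each inner list of `values` with `keys`, one output row per inner list."""
    rows = []
    for value_row in record["values"]:
        row = {"eid": record["eid"], "type": record["type"]}
        # zip pairs key i with value i, which is the role arrays_zip + pivot play above
        row.update(zip(record["keys"], value_row))
        rows.append(row)
    return rows

sample = json.loads(
    '{"type": "logs", "eid": "1", '
    '"keys": ["crt_ts", "id", "upd_ts", "km", "pivl", "distance", "speed"], '
    '"values": [["12343.0000.012", "AAGA1567", "1333.333.333", '
    '"565656", "10.5", "121", "64"]]}'
)
rows = rows_from_record(sample)
print(rows[0])
```

The dictionary keeps the column order given in keys, whereas the pivoted Spark output above lists the generated columns alphabetically. Also note that this sketch emits one row per inner list, while grouping by type and eid with first would keep only a single value per key if a record carried several inner lists.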

Related

Pyspark - Expand column with struct of arrays into new columns

I have a DataFrame with a single column which is a struct type and contains an array.
users_tp_df.printSchema()
root
|-- x: struct (nullable = true)
| |-- ActiveDirectoryName: string (nullable = true)
| |-- AvailableFrom: string (nullable = true)
| |-- AvailableFutureAllocation: long (nullable = true)
| |-- AvailableFutureHours: double (nullable = true)
| |-- CreateDate: string (nullable = true)
| |-- CurrentAllocation: long (nullable = true)
| |-- CurrentAvailableHours: double (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- Value: string (nullable = true)
I'm trying to convert the CustomFields array column into three columns:
Country;
isExternal;
Service.
So, for example, I have these values:
and the expected final DataFrame output for that row will be:
Can anyone please help me in achieving this?
Thank you!
This would work:
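from pyspark.sql import functions as F

# Tag each row with an id, then lift the fields of struct x to the top level
initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")
# One row per CustomFields entry, pivoted so that each Name becomes a column
pivoted = initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))\
.select("*", "CustomFields.*")\
.groupBy("id").pivot("Name").agg(F.first("Value"))
# Join the pivoted columns back and drop the original array column
final_df = initial_expansion.join(pivoted, "id").drop("CustomFields")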
Sample Input:
Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}
Input Structure:
root
|-- x: struct (nullable = true)
| |-- CurrentAvailableHours: integer (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Value: string (nullable = true)
Output:
Output Structure (Id can be dropped):
root
|-- id: long (nullable = false)
|-- CurrentAvailableHours: integer (nullable = true)
|-- Country: string (nullable = true)
|-- Service: string (nullable = true)
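For readers who want to check the intended result without a cluster, the explode-and-pivot step can be modelled on a single record in plain Python. This is a sketch under assumptions, not the answer's code: flatten_custom_fields is a hypothetical helper, and it assumes each record is a dict shaped like the sample input above.

```python
def flatten_custom_fields(record):
    """Lift the fields of x to the top level and turn each Name/Value pair into a column."""
    x = record["x"]
    # Keep every scalar field of the struct, leaving out the array itself
    flat = {k: v for k, v in x.items() if k != "CustomFields"}
    for field in x.get("CustomFields") or []:
        # setdefault keeps the first value seen for a name, loosely mirroring F.first
        flat.setdefault(field["Name"], field["Value"])
    return flat

sample = {
    "x": {
        "CurrentAvailableHours": 2,
        "CustomFields": [
            {"Name": "Country", "Value": "Italy"},
            {"Name": "Service", "Value": "Dev"},
        ],
    }
}
flat = flatten_custom_fields(sample)
print(flat)
```

A record whose CustomFields is missing or null simply yields no extra columns here, which is the case worth checking separately in the Spark version.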
Considering the mockup structure below, which is similar to the one in your example,
you can do it in SQL by using the inline function:
with alpha as (
select named_struct("alpha", "abc", "beta", 2.5, "gamma", 3, "delta"
, array( named_struct("a", "x", "b", "y", "c", "z")
, named_struct("a", "xx", "b", "yy", "c","zz"))
) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c)
from alpha
The result:
Mockup structure:

Load only struct from map's value from an avro file into a Spark Dataframe

Using PySpark, I need to load the "Properties" object (a map's value) from an Avro file into its own Spark DataFrame, so that the elements of "Properties" become the columns and their values become the rows. I'm struggling to find clear examples that accomplish this.
Schema of the file:
root
|-- SequenceNumber: long (nullable = true)
|-- Offset: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Properties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Body: binary (nullable = true)
The resulting "Properties" dataframe loaded from the above avro file needs to be like this:
member0 member1 member2 member3
value value value value
map_values is your friend.
Collection function: Returns an unordered array containing the values of the map.
New in version 2.3.0.
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('a', 20, 4.5, 'r', b'8')],
['key', 'member0', 'member1', 'member2', 'member3'])
df = df.select(F.create_map('key', F.struct('member0', 'member1', 'member2', 'member3')).alias('Properties'))
df.printSchema()
# root
# |-- Properties: map (nullable = false)
# | |-- key: string
# | |-- value: struct (valueContainsNull = false)
# | | |-- member0: long (nullable = true)
# | | |-- member1: double (nullable = true)
# | | |-- member2: string (nullable = true)
# | | |-- member3: binary (nullable = true)
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
df_properties.show()
# +-------+-------+-------+-------+
# |member0|member1|member2|member3|
# +-------+-------+-------+-------+
# | 20| 4.5| r| [38]|
# +-------+-------+-------+-------+
df_properties.printSchema()
# root
# |-- member0: long (nullable = true)
# |-- member1: double (nullable = true)
# |-- member2: string (nullable = true)
# |-- member3: binary (nullable = true)
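One point worth keeping in mind: indexing the result of map_values with [0] keeps a single entry per row, so a map with several keys would lose the rest. The plain-Python sketch below models the alternative of emitting one row per map entry, which in Spark would presumably correspond to exploding the array returned by map_values. The helper name struct_rows and the extra sample entries are assumptions made for illustration.

```python
def struct_rows(records, map_field="Properties"):
    """Return one flat row per entry of the map stored under map_field."""
    rows = []
    for record in records:
        # A missing or null map contributes no rows
        for struct in (record.get(map_field) or {}).values():
            rows.append(dict(struct))
    return rows

records = [
    {"Properties": {"a": {"member0": 20, "member1": 4.5, "member2": "r", "member3": b"8"}}},
    {"Properties": {
        "b": {"member0": 1, "member1": 0.5, "member2": "s", "member3": b"9"},
        "c": {"member0": 2, "member1": 1.5, "member2": "t", "member3": b"0"},
    }},
]
rows = struct_rows(records)
for row in rows:
    print(row)
```

The second record contributes two rows here, whereas taking only the first map value would have kept just one of them.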

Making one dataframe out of two dataframes as separate subcolumns in pyspark

I want to combine two DataFrames into one so that each becomes a sub-column (a struct); this is not a join of the DataFrames. I have two DataFrames, stat1_df and stat2_df, which look something like this:
root
|-- max_scenes: integer (nullable = true)
|-- median_scenes: double (nullable = false)
|-- avg_scenes: double (nullable = true)
+----------+-------------+------------------+
|max_scenes|median_scenes|avg_scenes |
+----------+-------------+------------------+
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
+----------+-------------+------------------+
root
|-- max: double (nullable = true)
|-- type: string (nullable = true)
+-----+-----------+
|max |type |
+-----+-----------+
|10.0 |small |
|25.0 |medium |
|50.0 |large |
|250.0|extra_large|
+-----+-----------+
and I want result_df to look like this:
root
|-- some_statistics1: struct (nullable = true)
| |-- max_scenes: integer (nullable = true)
| |-- median_scenes: double (nullable = false)
| |-- avg_scenes: double (nullable = true)
|-- some_statistics2: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- type: string (nullable = true)
Is there any way to combine the two as shown? stat1_df and stat2_df are simple DataFrames, without arrays or nested columns. The final DataFrame is written to MongoDB. If there are any additional questions, I am here to answer them.
Check the code below.
Add an id column to both DataFrames, move all columns into a struct, and then join the two DataFrames:
scala> val dfa = Seq(("10","8.9","7.9")).toDF("max_scenes","median_scenes","avg_scenes")
dfa: org.apache.spark.sql.DataFrame = [max_scenes: string, median_scenes: string ... 1 more field]
scala> dfa.show(false)
+----------+-------------+----------+
|max_scenes|median_scenes|avg_scenes|
+----------+-------------+----------+
|10 |8.9 |7.9 |
+----------+-------------+----------+
scala> dfa.printSchema
root
|-- max_scenes: string (nullable = true)
|-- median_scenes: string (nullable = true)
|-- avg_scenes: string (nullable = true)
scala> val mdfa = dfa.select(struct($"*").as("some_statistics1")).withColumn("id",monotonically_increasing_id)
mdfa: org.apache.spark.sql.DataFrame = [some_statistics1: struct<max_scenes: string, median_scenes: string ... 1 more field>, id: bigint]
scala> mdfa.printSchema
root
|-- some_statistics1: struct (nullable = false)
| |-- max_scenes: string (nullable = true)
| |-- median_scenes: string (nullable = true)
| |-- avg_scenes: string (nullable = true)
|-- id: long (nullable = false)
scala> mdfa.show(false)
+----------------+---+
|some_statistics1|id |
+----------------+---+
|[10,8.9,7.9] |0 |
+----------------+---+
scala> val dfb = Seq(("11.2","sample")).toDF("max","type")
dfb: org.apache.spark.sql.DataFrame = [max: string, type: string]
scala> dfb.printSchema
root
|-- max: string (nullable = true)
|-- type: string (nullable = true)
scala> dfb.show(false)
+----+------+
|max |type |
+----+------+
|11.2|sample|
+----+------+
scala> val mdfb = dfb.select(struct($"*").as("some_statistics2")).withColumn("id",monotonically_increasing_id)
mdfb: org.apache.spark.sql.DataFrame = [some_statistics2: struct<max: string, type: string>, id: bigint]
scala> mdfb.printSchema
root
|-- some_statistics2: struct (nullable = false)
| |-- max: string (nullable = true)
| |-- type: string (nullable = true)
|-- id: long (nullable = false)
scala> mdfb.show(false)
+----------------+---+
|some_statistics2|id |
+----------------+---+
|[11.2,sample] |0 |
+----------------+---+
scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").printSchema
root
|-- some_statistics1: struct (nullable = false)
| |-- max_scenes: string (nullable = true)
| |-- median_scenes: string (nullable = true)
| |-- avg_scenes: string (nullable = true)
|-- some_statistics2: struct (nullable = false)
| |-- max: string (nullable = true)
| |-- type: string (nullable = true)
scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").show(false)
+----------------+----------------+
|some_statistics1|some_statistics2|
+----------------+----------------+
|[10,8.9,7.9] |[11.2,sample] |
+----------------+----------------+
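The join above pairs rows through the generated id. Since monotonically_increasing_id produces unique, increasing values that are not necessarily consecutive and depend on partitioning, the ids of two separate DataFrames are only certain to line up in simple cases such as this single-row example. The intended pairing is positional, and the plain-Python sketch below models that intent. The name pair_by_position is hypothetical, and the rows are taken from the question's sample data.

```python
def pair_by_position(left_rows, right_rows, left_name, right_name):
    """Nest row i of each input under its own key, pairing rows by position."""
    if len(left_rows) != len(right_rows):
        raise ValueError("both inputs must have the same number of rows")
    return [
        {left_name: dict(left), right_name: dict(right)}
        for left, right in zip(left_rows, right_rows)
    ]

stat1 = [
    {"max_scenes": 97, "median_scenes": 7.0, "avg_scenes": 10.806451612903226},
    {"max_scenes": 97, "median_scenes": 7.0, "avg_scenes": 10.806451612903226},
]
stat2 = [
    {"max": 10.0, "type": "small"},
    {"max": 25.0, "type": "medium"},
]
result = pair_by_position(stat1, stat2, "some_statistics1", "some_statistics2")
for row in result:
    print(row)
```

Each output element has the two-struct shape requested in the question, which is also the document shape that would be written to MongoDB.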

How to convert a list with structure like (key1, list(key2, value)) into a dataframe in pyspark?

I have a list as shown below:
It is of the type as shown below:
[(key1, [(key11, value11), (key12, value12)]), (key2, [(key21, value21), (key22, value22)...])...]
A sample structure is shown below:
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])]
I want to convert this into a dataframe of the form
key1 key2 score
1052762305 1007819788 0.9206884810054885
1052762305 1005886801 0.913818268123084
1052762305 1003863766 0.9131746152849486
... ... ...
1064808607 1007804896 0.9984397647563017
1064808607 1003705572 0.9970498347406341
1064808607 1005879599 0.9951581013190172
... ... ...
How can we implement this in pyspark?
You can create a schema upfront for the input. Use explode and then access the elements within the value struct.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField,StringType,ArrayType, DoubleType
spark = SparkSession.builder \
.appName('SO')\
.getOrCreate()
schema = StructType([StructField("key1",StringType()), StructField("value",ArrayType(
StructType([ StructField("key2", StringType()),
StructField("score", DoubleType())])
)) ])
df = spark.createDataFrame(
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])
],schema
)
df.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[[1007819788, 0.9...|
|1064808607|[[1007804896, 0.9...|
+----------+--------------------+
df.printSchema()
root
|-- key1: string (nullable = true)
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key2: string (nullable = true)
| | |-- score: double (nullable = true)
df1=df.select('key1',F.explode('value').alias('value'))
df1.show()
+----------+--------------------+
| key1| value |
+----------+--------------------+
|1052762305|[1007819788, 0.92...|
|1052762305|[1005886801, 0.91...|
|1052762305|[1003863766, 0.91...|
|1052762305|[1007811435, 0.91...|
|1052762305|[1005879599, 0.91...|
|1052762305|[1003705572, 0.91...|
|1052762305|[1007804896, 0.90...|
|1052762305|[1005890270, 0.89...|
|1052762305|[1007806781, 0.87...|
|1052762305|[1003670458, 0.84...|
|1064808607|[1007804896, 0.99...|
|1064808607|[1003705572, 0.99...|
|1064808607|[1005879599, 0.99...|
|1064808607|[1007811435, 0.99...|
|1064808607|[1005886801, 0.99...|
|1064808607|[1003863766, 0.99...|
|1064808607|[1007819788, 0.98...|
|1064808607|[1005890270, 0.96...|
|1064808607|[1007806781, 0.92...|
|1064808607|[1003670458, 0.85...|
+----------+--------------------+
df1.printSchema()
root
|-- key1: string (nullable = true)
|-- value: struct (nullable = true)
| |-- key2: string (nullable = true)
| |-- score: double (nullable = true)
df1.select("key1", "value.key2","value.score").show()
+----------+----------+------------------+
| key1| key2| score|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
|1052762305|1003863766|0.9131746152849486|
|1052762305|1007811435|0.9128666156173751|
|1052762305|1005879599|0.9126368405937075|
|1052762305|1003705572|0.9122051062936369|
|1052762305|1007804896|0.9083424459788203|
|1052762305|1005890270|0.8982097535650703|
|1052762305|1007806781|0.8708761186829758|
|1052762305|1003670458|0.8452789033694487|
|1064808607|1007804896|0.9984397647563017|
|1064808607|1003705572|0.9970498347406341|
|1064808607|1005879599|0.9951581013190172|
|1064808607|1007811435|0.9934813787902085|
|1064808607|1005886801|0.9930572794622374|
|1064808607|1003863766|0.9928815742735568|
|1064808607|1007819788|0.9869723713790797|
|1064808607|1005890270|0.9642640856016443|
|1064808607|1007806781|0.9211558765137313|
|1064808607|1003670458|0.8519872445941068|
You basically need to do the following:
create a DataFrame from your list
promote the pairs from the array elements into separate rows by using explode
extract the key and value from each pair via select
This could be done with something like this (the source data is in a variable called a):
from pyspark.sql.functions import explode, col
df = spark.createDataFrame(a, ['key1', 'val'])
df2 = df.select(col('key1'), explode(col('val')).alias('val'))
df3 = df2.select('key1', col('val')._1.alias('key2'), col('val')._2.alias('value'))
We can check that the schema and data match:
>>> df3.printSchema()
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- value: double (nullable = true)
>>> df3.show(2)
+----------+----------+------------------+
| key1| key2| value|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
+----------+----------+------------------+
only showing top 2 rows
We can also check the schema of the intermediate results:
>>> df.printSchema()
root
|-- key1: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)
>>> df2.printSchema()
root
|-- key1: string (nullable = true)
|-- val: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: double (nullable = true)
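If the source list is small enough to handle on the driver, the same flattening can be done in plain Python before any DataFrame exists. The sketch below is an alternative to explode rather than part of either answer; flatten_pairs is a made-up name and the sample is a shortened version of the list in the question.

```python
def flatten_pairs(nested):
    """Turn [(key1, [(key2, score), ...]), ...] into a flat list of (key1, key2, score)."""
    return [
        (key1, key2, score)
        for key1, pairs in nested
        for key2, score in pairs
    ]

a = [
    ("1052762305", [("1007819788", 0.9206884810054885),
                    ("1005886801", 0.913818268123084)]),
    ("1064808607", [("1007804896", 0.9984397647563017)]),
]
flat = flatten_pairs(a)
for row in flat:
    print(row)
```

The resulting list of 3-tuples could then be passed to spark.createDataFrame with the column names ['key1', 'key2', 'score']. For data that does not fit comfortably on the driver, the explode-based approaches above are likely the better fit.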

How to JSON-escape a String field in Spark dataFrame with new column

How can I write a new column in JSON format through a DataFrame? I tried several approaches, but the data is written as a JSON-escaped String field.
Currently it is written as
{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}
Instead, I want it to be
{"test":{"id":1,"name":"name","problem_field": {"x":100,"y":200}}}
problem_field is a new column that is being created based on the values read from other fields as:
val dataFrame = oldDF.withColumn("problem_field", s)
I have tried the following approaches
dataFrame.write.json(<<outputPath>>)
dataFrame.toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).write.json(<<outputPath>>)
Tried converting to DataSet as well but no luck. Any pointers are greatly appreciated.
I have already tried the logic mentioned here: How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?
For starters, your example data has an extraneous comma after "y\":200 which will prevent it from being parsed as it is not valid JSON.
From there, you can use from_json to parse the field, assuming you know the schema. In this example, I'm parsing the field separately to first get the schema:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> json.printSchema
root
|-- test: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: string (nullable = true)
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{
case org.apache.spark.sql.Row(x : String) => x
})
problem_field: org.apache.spark.sql.DataFrame = [x: bigint, y: bigint]
scala> problem_field.printSchema
root
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
If the schema of problem_field's contents is inconsistent between rows, this solution will still work but may not be an optimal way of handling things, as it will produce a sparse DataFrame where each row contains every field encountered in problem_field. For example:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""", """{"test":{"id":1,"name":"name","problem_field": "{\"a\":10,\"b\":20}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{case org.apache.spark.sql.Row(x : String) => x})
problem_field: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 2 more fields]
scala> problem_field.printSchema
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
scala> fixed.select($"test.problem_field.*").show
+----+----+----+----+
| a| b| x| y|
+----+----+----+----+
|null|null| 100| 200|
| 10| 20|null|null|
+----+----+----+----+
Over the course of hundreds, thousands, or millions of rows, you can see how this would present a problem.
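Both the parsing step and the sparsity effect can be reproduced in plain Python with the standard json module. This is only a model of what from_json with a single shared schema does, not Spark code: parse_problem_field is a hypothetical helper, and the two input lines are the ones used in the example above.

```python
import json

def parse_problem_field(line, outer="test", field="problem_field"):
    """Parse a JSON line, then parse the JSON-escaped string stored in one nested field."""
    doc = json.loads(line)
    doc[outer][field] = json.loads(doc[outer][field])
    return doc

lines = [
    r'{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}',
    r'{"test":{"id":1,"name":"name","problem_field": "{\"a\":10,\"b\":20}"}}',
]
docs = [parse_problem_field(line) for line in lines]

# The field is now a nested object instead of an escaped string
first_fixed = json.dumps(docs[0], separators=(",", ":"))
print(first_fixed)

# A single shared schema has to contain the union of all keys seen in any row
all_keys = sorted({key for doc in docs for key in doc["test"]["problem_field"]})

# Every row then carries every key; keys a row never had come out as None
sparse = [[doc["test"]["problem_field"].get(key) for key in all_keys] for doc in docs]
print(all_keys)
print(sparse)
```

The sparse rows match the table above: each row has values only for its own keys and None elsewhere, and the number of columns grows with every new key that appears in any row.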
