Making one dataframe out of two dataframes as separate subcolumns in pyspark - apache-spark

I want to put two data frames into one, so each one is sub column, it's not join of dataframes. So I have two dataframes, stat1_df and stat2_df and they look something like this:
root
|-- max_scenes: integer (nullable = true)
|-- median_scenes: double (nullable = false)
|-- avg_scenes: double (nullable = true)
+----------+-------------+------------------+
|max_scenes|median_scenes|avg_scenes |
+----------+-------------+------------------+
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
|97 |7.0 |10.806451612903226|
+----------+-------------+------------------+
root
|-- max: double (nullable = true)
|-- type: string (nullable = true)
+-----+-----------+
|max |type |
+-----+-----------+
|10.0 |small |
|25.0 |medium |
|50.0 |large |
|250.0|extra_large|
+-----+-----------+
, and I want the result_df to be as:
root
|-- some_statistics1: struct (nullable = true)
| |-- max_scenes: integer (nullable = true)
|-- median_scenes: double (nullable = false)
|-- avg_scenes: double (nullable = true)
|-- some_statistics2: struct (nullable = true)
| |-- max: double (nullable = true)
|-- type: string (nullable = true)
Is there any way to put those two as shown? stat1_df and stat2_df are simple dataframes, without arrays and nested columns.Final dataframe is written to mongodb. If there any additional questions I am here to answer.

Check below code.
Add id column in both DataFrame, move all columns into struct & then use join both DataFrame's
scala> val dfa = Seq(("10","8.9","7.9")).toDF("max_scenes","median_scenes","avg_scenes")
dfa: org.apache.spark.sql.DataFrame = [max_scenes: string, median_scenes: string ... 1 more field]
scala> dfa.show(false)
+----------+-------------+----------+
|max_scenes|median_scenes|avg_scenes|
+----------+-------------+----------+
|10 |8.9 |7.9 |
+----------+-------------+----------+
scala> dfa.printSchema
root
|-- max_scenes: string (nullable = true)
|-- median_scenes: string (nullable = true)
|-- avg_scenes: string (nullable = true)
scala> val mdfa = dfa.select(struct($"*").as("some_statistics1")).withColumn("id",monotonically_increasing_id)
mdfa: org.apache.spark.sql.DataFrame = [some_statistics1: struct<max_scenes: string, median_scenes: string ... 1 more field>, id: bigint]
scala> mdfa.printSchema
root
|-- some_statistics1: struct (nullable = false)
| |-- max_scenes: string (nullable = true)
| |-- median_scenes: string (nullable = true)
| |-- avg_scenes: string (nullable = true)
|-- id: long (nullable = false)
scala> mdfa.show(false)
+----------------+---+
|some_statistics1|id |
+----------------+---+
|[10,8.9,7.9] |0 |
+----------------+---+
scala> val dfb = Seq(("11.2","sample")).toDF("max","type")
dfb: org.apache.spark.sql.DataFrame = [max: string, type: string]
scala> dfb.printSchema
root
|-- max: string (nullable = true)
|-- type: string (nullable = true)
scala> dfb.show(false)
+----+------+
|max |type |
+----+------+
|11.2|sample|
+----+------+
scala> val mdfb = dfb.select(struct($"*").as("some_statistics2")).withColumn("id",monotonically_increasing_id)
mdfb: org.apache.spark.sql.DataFrame = [some_statistics2: struct<max: string, type: string>, id: bigint]
scala> mdfb.printSchema
root
|-- some_statistics2: struct (nullable = false)
| |-- max: string (nullable = true)
| |-- type: string (nullable = true)
|-- id: long (nullable = false)
scala> mdfb.show(false)
+----------------+---+
|some_statistics2|id |
+----------------+---+
|[11.2,sample] |0 |
+----------------+---+
scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").printSchema
root
|-- some_statistics1: struct (nullable = false)
| |-- max_scenes: string (nullable = true)
| |-- median_scenes: string (nullable = true)
| |-- avg_scenes: string (nullable = true)
|-- some_statistics2: struct (nullable = false)
| |-- max: string (nullable = true)
| |-- type: string (nullable = true)
scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").show(false)
+----------------+----------------+
|some_statistics1|some_statistics2|
+----------------+----------------+
|[10,8.9,7.9] |[11.2,sample] |
+----------------+----------------+

Related

Load only struct from map's value from an avro file into a Spark Dataframe

Using PySpark, I need to load "Properties" object (map's value) from an avro file into its own Spark dataframe. Such that, "Properties" from my avro file will become a dataframe with its elements and values as columns and rows. Hence, struggling to find some clear examples accomplishing that.
Schema of the file:
root
|-- SequenceNumber: long (nullable = true)
|-- Offset: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Properties: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- member0: long (nullable = true)
| | |-- member1: double (nullable = true)
| | |-- member2: string (nullable = true)
| | |-- member3: binary (nullable = true)
|-- Body: binary (nullable = true)
The resulting "Properties" dataframe loaded from the above avro file needs to be like this:
member0
member1
member2
member3
value
value
value
value
map_values is your friend.
Collection function: Returns an unordered array containing the values of the map.
New in version 2.3.0.
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
Full example:
df = spark.createDataFrame(
[('a', 20, 4.5, 'r', b'8')],
['key', 'member0', 'member1', 'member2', 'member3'])
df = df.select(F.create_map('key', F.struct('member0', 'member1', 'member2', 'member3')).alias('Properties'))
df.printSchema()
# root
# |-- Properties: map (nullable = false)
# | |-- key: string
# | |-- value: struct (valueContainsNull = false)
# | | |-- member0: long (nullable = true)
# | | |-- member1: double (nullable = true)
# | | |-- member2: string (nullable = true)
# | | |-- member3: binary (nullable = true)
df_properties = df.select((F.map_values(F.col('Properties'))[0]).alias('vals')).select('vals.*')
df_properties.show()
# +-------+-------+-------+-------+
# |member0|member1|member2|member3|
# +-------+-------+-------+-------+
# | 20| 4.5| r| [38]|
# +-------+-------+-------+-------+
df_properties.printSchema()
# root
# |-- member0: long (nullable = true)
# |-- member1: double (nullable = true)
# |-- member2: string (nullable = true)
# |-- member3: binary (nullable = true)

Pyspark conver the column into list

In my pyspark dataframe, there is a column which is being inferred as string, but it is actually a List. I need to convert the same into list
Sample Value of the column :
[{"#ID":"07D40CB5-273E-4814-B82C-B5DCA7145D20","#ProductName":"Event Registration","#ProductCode":null,"#Type":"Admission Item","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#Status":"Active"},{"CustomFieldDetail":{"#FieldName":"Description","#FieldType":"Open Ended Text - Comment Box","#FieldValue":null,"#FieldId":"C6D46AD1-9B3F-45FF-9331-27EA47811E37"},"#ID":"8EA83E8B-7573-4550-905D-D4320496AD89","#ProductName":"Test","#ProductCode":null,"#Type":"Session","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#StartTime":"2018-01-29T18:00:00","#EndTime":"2018-01-29T19:00:00","#Status":"Active"}]
Please let me know how to achieve this
This is a dynamic solution that first converts the JSON strings column to an RDD, then read the RDD to a dataframe and finally makes use of the inferred schema to read the original dataframe while parsing the JSON strings.
import pyspark.sql.functions as F
import pyspark.sql.types as T
This is not part of the solution, only creation of the sample dataframe
str1 = '''[{"#ID":"str1_1","a":11},{"#ID":"str1_2","d":[41,42,43]}]'''
str2 = '''[{"#ID":"str2_1","a":101},{"#ID":"str2_2","c":301},{"#ID":"str2_3","b":201,"nested":{"a":1001,"b":2001}}]'''
df = spark.createDataFrame([(1, str1,),(2, str2,)],['id', 'json_str'])
This is the actual solution
df_json_str = spark.read.json(df.rdd.map(lambda x:x.json_str))
json_str_schema = T.ArrayType(df_json_str.schema)
df = df.withColumn('json_str', F.from_json('json_str',json_str_schema))
df_json_str.printSchema()
root
|-- #ID: string (nullable = true)
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: long (containsNull = true)
|-- nested: struct (nullable = true)
| |-- a: long (nullable = true)
| |-- b: long (nullable = true)
df.printSchema()
root
|-- id: long (nullable = true)
|-- json_str: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #ID: string (nullable = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
| | |-- d: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- nested: struct (nullable = true)
| | | |-- a: long (nullable = true)
| | | |-- b: long (nullable = true)
df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------------+
|id |json_str |
+---+-----------------------------------------------------------------------------------------------------------------------------+
|1 |[{str1_1, 11, null, null, null, null}, {str1_2, null, null, null, [41, 42, 43], null}] |
|2 |[{str2_1, 101, null, null, null, null}, {str2_2, null, null, 301, null, null}, {str2_3, null, 201, null, null, {1001, 2001}}]|
+---+-----------------------------------------------------------------------------------------------------------------------------+
A quick demonstration that everything is fine with the result and the dataframe is exploding well
df_explode = df.selectExpr('id', 'inline(json_str)')
df_explode.printSchema()
root
|-- id: long (nullable = true)
|-- #ID: string (nullable = true)
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: long (containsNull = true)
|-- nested: struct (nullable = true)
| |-- a: long (nullable = true)
| |-- b: long (nullable = true)
df_explode.show(truncate=False)
+---+------+----+----+----+------------+------------+
|id |#ID |a |b |c |d |nested |
+---+------+----+----+----+------------+------------+
|1 |str1_1|11 |null|null|null |null |
|1 |str1_2|null|null|null|[41, 42, 43]|null |
|2 |str2_1|101 |null|null|null |null |
|2 |str2_2|null|null|301 |null |null |
|2 |str2_3|null|201 |null|null |{1001, 2001}|
+---+------+----+----+----+------------+------------+
Demonstration on OP sample
import pyspark.sql.functions as F
import pyspark.sql.types as T
This is not part of the solution, only creation of the sample dataframe
str1 = '''[{"#ID":"07D40CB5-273E-4814-B82C-B5DCA7145D20","#ProductName":"Event Registration","#ProductCode":null,"#Type":"Admission Item","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#Status":"Active"},{"CustomFieldDetail":{"#FieldName":"Description","#FieldType":"Open Ended Text - Comment Box","#FieldValue":null,"#FieldId":"C6D46AD1-9B3F-45FF-9331-27EA47811E37"},"#ID":"8EA83E8B-7573-4550-905D-D4320496AD89","#ProductName":"Test","#ProductCode":null,"#Type":"Session","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#StartTime":"2018-01-29T18:00:00","#EndTime":"2018-01-29T19:00:00","#Status":"Active"}]'''
df = spark.createDataFrame([(1, str1,)],['id', 'json_str'])
df_json_str = spark.read.json(df.rdd.map(lambda x:x.json_str))
df_json_str.printSchema()
root
|-- #EndTime: string (nullable = true)
|-- #ID: string (nullable = true)
|-- #ProductCode: string (nullable = true)
|-- #ProductDescription: string (nullable = true)
|-- #ProductName: string (nullable = true)
|-- #SessionCategoryId: string (nullable = true)
|-- #SessionCategoryName: string (nullable = true)
|-- #StartTime: string (nullable = true)
|-- #Status: string (nullable = true)
|-- #Type: string (nullable = true)
|-- CustomFieldDetail: struct (nullable = true)
| |-- #FieldId: string (nullable = true)
| |-- #FieldName: string (nullable = true)
| |-- #FieldType: string (nullable = true)
| |-- #FieldValue: string (nullable = true)
json_str_schema = T.ArrayType(df_json_str.schema)
df = df.withColumn('json_str', F.from_json('json_str',json_str_schema))
df.printSchema()
root
|-- id: long (nullable = true)
|-- json_str: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #EndTime: string (nullable = true)
| | |-- #ID: string (nullable = true)
| | |-- #ProductCode: string (nullable = true)
| | |-- #ProductDescription: string (nullable = true)
| | |-- #ProductName: string (nullable = true)
| | |-- #SessionCategoryId: string (nullable = true)
| | |-- #SessionCategoryName: string (nullable = true)
| | |-- #StartTime: string (nullable = true)
| | |-- #Status: string (nullable = true)
| | |-- #Type: string (nullable = true)
| | |-- CustomFieldDetail: struct (nullable = true)
| | | |-- #FieldId: string (nullable = true)
| | | |-- #FieldName: string (nullable = true)
| | | |-- #FieldType: string (nullable = true)
| | | |-- #FieldValue: string (nullable = true)
df.show(truncate=False)
+---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |json_str |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[{null, 07D40CB5-273E-4814-B82C-B5DCA7145D20, null, null, Event Registration, 00000000-0000-0000-0000-000000000000, null, null, Active, Admission Item, null}, {2018-01-29T19:00:00, 8EA83E8B-7573-4550-905D-D4320496AD89, null, null, Test, 00000000-0000-0000-0000-000000000000, null, 2018-01-29T18:00:00, Active, Session, {C6D46AD1-9B3F-45FF-9331-27EA47811E37, Description, Open Ended Text - Comment Box, null}}]|
+---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df.selectExpr('id', 'inline(json_str)').show(truncate=False)
+---+-------------------+------------------------------------+------------+-------------------+------------------+------------------------------------+--------------------+-------------------+-------+--------------+----------------------------------------------------------------------------------------+
|id |#EndTime |#ID |#ProductCode|#ProductDescription|#ProductName |#SessionCategoryId |#SessionCategoryName|#StartTime |#Status|#Type |CustomFieldDetail |
+---+-------------------+------------------------------------+------------+-------------------+------------------+------------------------------------+--------------------+-------------------+-------+--------------+----------------------------------------------------------------------------------------+
|1 |null |07D40CB5-273E-4814-B82C-B5DCA7145D20|null |null |Event Registration|00000000-0000-0000-0000-000000000000|null |null |Active |Admission Item|null |
|1 |2018-01-29T19:00:00|8EA83E8B-7573-4550-905D-D4320496AD89|null |null |Test |00000000-0000-0000-0000-000000000000|null |2018-01-29T18:00:00|Active |Session |{C6D46AD1-9B3F-45FF-9331-27EA47811E37, Description, Open Ended Text - Comment Box, null}|
+---+-------------------+------------------------------------+------------+-------------------+------------------+------------------------------------+--------------------+-------------------+-------+--------------+----------------------------------------------------------------------------------------+
This solution is relevant if you know the schema in advance.
It is off course more effective performance vise than inferring the schema.
import pyspark.sql.functions as F
myval = '''[{"#ID":"07D40CB5-273E-4814-B82C-B5DCA7145D20","#ProductName":"Event Registration","#ProductCode":null,"#Type":"Admission Item","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#Status":"Active"},{"CustomFieldDetail":{"#FieldName":"Description","#FieldType":"Open Ended Text - Comment Box","#FieldValue":null,"#FieldId":"C6D46AD1-9B3F-45FF-9331-27EA47811E37"},"#ID":"8EA83E8B-7573-4550-905D-D4320496AD89","#ProductName":"Test","#ProductCode":null,"#Type":"Session","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#StartTime":"2018-01-29T18:00:00","#EndTime":"2018-01-29T19:00:00","#Status":"Active"}]'''
df = spark.createDataFrame([(1, myval,)], 'id int, myjson string')
myschema = F.schema_of_json(mylist)
# -------------- or -----------------
# myschema = spark.sql(f"select schema_of_json('{mylist}')").first()[0]
# print (myschema)
# ARRAY<STRUCT<`#EndTime`: STRING, `#ID`: STRING, `#ProductCode`: STRING, `#ProductDescription`: STRING, `#ProductName`: STRING, `#SessionCategoryId`: STRING, `#SessionCategoryName`: STRING, `#StartTime`: STRING, `#Status`: STRING, `#Type`: STRING, `CustomFieldDetail`: STRUCT<`#FieldId`: STRING, `#FieldName`: STRING, `#FieldType`: STRING, `#FieldValue`: STRING>>>
# -----------------------------------
df = df.withColumn('myjson', F.from_json('myjson',myschema))
df.show()
+---+--------------------+
| id| myjson|
+---+--------------------+
| 1|[{null, 07D40CB5-...|
+---+--------------------+
df.selectExpr('id', 'inline(myjson)').show()
+---+-------------------+--------------------+------------+-------------------+------------------+--------------------+--------------------+-------------------+-------+--------------+--------------------+
| id| #EndTime| #ID|#ProductCode|#ProductDescription| #ProductName| #SessionCategoryId|#SessionCategoryName| #StartTime|#Status| #Type| CustomFieldDetail|
+---+-------------------+--------------------+------------+-------------------+------------------+--------------------+--------------------+-------------------+-------+--------------+--------------------+
| 1| null|07D40CB5-273E-481...| null| null|Event Registration|00000000-0000-000...| null| null| Active|Admission Item| null|
| 1|2018-01-29T19:00:00|8EA83E8B-7573-455...| null| null| Test|00000000-0000-000...| null|2018-01-29T18:00:00| Active| Session|{C6D46AD1-9B3F-45...|
+---+-------------------+--------------------+------------+-------------------+------------------+--------------------+--------------------+-------------------+-------+--------------+--------------------+

How to Convert Map of Struct type to Json in Spark2

I have a map field in dataset with below schema
|-- party: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- partyName: string (nullable = true)
| | |-- cdrId: string (nullable = true)
| | |-- legalEntityId: string (nullable = true)
| | |-- customPartyId: string (nullable = true)
| | |-- partyIdScheme: string (nullable = true)
| | |-- customPartyIdScheme: string (nullable = true)
| | |-- bdrId: string (nullable = true)
Need to convert it to JSON type. Please suggest how to do it. Thanks in advance
Spark provides to_json function available for DataFrame operations:
import org.apache.spark.sql.functions._
import spark.implicits._
val df =
List(
("key1", "party01", "cdrId01"),
("key2", "party02", "cdrId02"),
)
.toDF("key", "partyName", "cdrId")
.select(struct($"key", struct($"partyName", $"cdrId")).as("col1"))
.agg(map_from_entries(collect_set($"col1")).as("map_col"))
.select($"map_col", to_json($"map_col").as("json_col"))

Remove the outer struct column in spark dataframe

My Spark Dataframe current schema is as shown below, is there a way i can remove the outer Struct column(DTC_CAN_SIGNALS).
**Current Schema**:
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = true)
|-- VIN: string (nullable = true)
|-- DTC_CAN_SIGNALS: struct (nullable = true)
| |-- SGNL: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- SN: string (nullable = true)
| | | |-- ST: long (nullable = true)
| | | |-- SV: double (nullable = true)
**Expected Schema**:
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = true)
|-- VIN: string (nullable = true)
|-- SGNL: array (nullable = true)
|-- element: struct (containsNull = true)
| |-- SN: string (nullable = true)
| |-- ST: long (nullable = true)
| |-- SV: double (nullable = true)
Just select your column from struct, like
df.withColumn("SGNL", col("DTC_CAN_SIGNALS.SGNL"))
or
df.select("DTC_CAN_SIGNALS.SGNL")
Code:
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val data = Seq(
("DTC", 42L, "VIN")
).toDF("DTC", "DTCTS", "VIN")
val df = data.withColumn("DTC_CAN_SIGNALS", struct(array(struct(lit("sn1").as("SN"), lit(42L).as("ST"), lit(42.0D).as("SV"))).as("SGNL")))
df.show()
df.printSchema()
// alternatively
// val resDf = df
// .withColumn("SGNL", col("DTC_CAN_SIGNALS.SGNL"))
// .drop("DTC_CAN_SIGNALS")
val resDf = df.select("DTC", "DTCTS", "VIN", "DTC_CAN_SIGNALS.SGNL")
resDf.show()
resDf.printSchema()
Output:
+---+-----+---+-------------------+
|DTC|DTCTS|VIN| DTC_CAN_SIGNALS|
+---+-----+---+-------------------+
|DTC| 42|VIN|[[[sn1, 42, 42.0]]]|
+---+-----+---+-------------------+
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = false)
|-- VIN: string (nullable = true)
|-- DTC_CAN_SIGNALS: struct (nullable = false)
| |-- SGNL: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- SN: string (nullable = false)
| | | |-- ST: long (nullable = false)
| | | |-- SV: double (nullable = false)
+---+-----+---+-----------------+
|DTC|DTCTS|VIN| SGNL|
+---+-----+---+-----------------+
|DTC| 42|VIN|[[sn1, 42, 42.0]]|
+---+-----+---+-----------------+
root
|-- DTC: string (nullable = true)
|-- DTCTS: long (nullable = false)
|-- VIN: string (nullable = true)
|-- SGNL: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- SN: string (nullable = false)
| | |-- ST: long (nullable = false)
| | |-- SV: double (nullable = false)

Parse JSON in Spark containing reserve character

I have a JSON input.txt file with data as follows:
2018-05-30.txt:{"Message":{"eUuid":"6e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1527539376,"id":"XYZ","location":{"dim":{"x":2,"y":-7},"towards":121.0},"source":"a","UniqueId":"test123","code":"del","signature":"xyz","":{},"vel":{"ground":15},"height":{},"next":{"dim":{}},"sub":"del1"}}
2018-05-30.txt:{"Message":{"eUuid":"5e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1627539376,"id":"ABC","location":{"dim":{"x":1,"y":-8},"towards":132.0},"source":"b","UniqueId":"hello123","code":"fra","signature":"abc","":{},"vel":{"ground":16},"height":{},"next":{"dim":{}},"sub":"fra1"}}
.
.
I tried to load the JSON into a DataFrame as follows:
>>val df = spark.read.json("<full path of input.txt file>")
I am receiving
_corrupt_record
dataframe
I am aware that json character contains "." (2018-05-30.txt) as reserve character which is causing the issue. How may I resolve this?
val rdd = sc.textFile("/Users/kishore/abc.json")
val jsonRdd= rdd.map(x=>x.split("txt:")(1))
scala> df.show
+--------------------+
| Message|
+--------------------+
|[test123,del,6e7d...|
|[hello123,fra,5e7...|
+--------------------+
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// val df = sqlContext.read.json(jsonRdd)
// df.show(false)
val df = sqlContext.read.json(jsonRdd).withColumn("eUuid", $"Message"("eUuid"))
.withColumn("schemaVersion", $"Message"("schemaVersion"))
.withColumn("timestamp", $"Message"("timestamp"))
.withColumn("id", $"Message"("id"))
.withColumn("source", $"Message"("source"))
.withColumn("UniqueId", $"Message"("UniqueId"))
.withColumn("location", $"Message"("location"))
.withColumn("dim", $"location"("dim"))
.withColumn("x", $"dim"("x"))
.withColumn("y", $"dim"("y"))
.drop("dim")
.withColumn("vel", $"Message"("vel"))
.withColumn("ground", $"vel"("ground"))
.withColumn("sub", $"Message"("sub"))
.drop("Message")
df.show()
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
| eUuid|schemaVersion| timestamp| id|source|UniqueId| location| x| y| vel|ground| sub|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
|6e7d4890-9279-491...| 1.0-AB1|1527539376|XYZ| a| test123|[[2,-7],121]| 2| -7|[15]| 15|del1|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
The problem is not a reserved character it is that the file does not contain valid JSON
so you can
val df=spark.read.textFile(...)
val json=spark.read.json(df.map(v=>v.drop(15)))
json.printSchema()
root
|-- Message: struct (nullable = true)
| |-- UniqueId: string (nullable = true)
| |-- code: string (nullable = true)
| |-- eUuid: string (nullable = true)
| |-- id: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- dim: struct (nullable = true)
| | | |-- x: long (nullable = true)
| | | |-- y: long (nullable = true)
| | |-- towards: double (nullable = true)
| |-- schemaVersion: string (nullable = true)
| |-- signature: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sub: string (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- vel: struct (nullable = true)
| | |-- ground: long (nullable = true)

Resources