Can I create a dataframe from another dataframe's rows - apache-spark

Can I create a dataframe whose columns are the rows of the dataframe below, using PySpark?
+------------+
|         col|
+------------+
|created_meta|
|  updated_at|
|updated_meta|
|        meta|
|        Year|
|  First Name|
|      County|
|         Sex|
|       Count|
+------------+

Two ways.
Using pivot (each distinct value of col becomes a column; limit(0) drops the single aggregated row):
import pyspark.sql.functions as F
df1 = df.groupBy().pivot('col').agg(F.lit(None)).limit(0)
df1.show()
+-----+------+---------+---+----+------------+----+----------+------------+
|Count|County|FirstName|Sex|Year|created_meta|meta|updated_at|updated_meta|
+-----+------+---------+---+----+------------+----+----------+------------+
+-----+------+---------+---+----+------------+----+----------+------------+
Creating it from scratch (F.lit(v) produces a column named after the value, and limit(0) keeps only the header):
df2 = df.select([F.lit(r[0]) for r in df.collect()]).limit(0)
df2.show()
+------------+----------+------------+----+----+---------+------+---+-----+
|created_meta|updated_at|updated_meta|meta|Year|FirstName|County|Sex|Count|
+------------+----------+------------+----+----+---------+------+---+-----+
+------------+----------+------------+----+----+---------+------+---+-----+
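A third way, if you prefer to build the empty dataframe explicitly, is to collect the values and construct a schema yourself. A minimal sketch, assuming all columns should be strings:
from pyspark.sql import types as T

# Use the collected values of 'col' as the column names of an empty dataframe.
# StringType is an assumption here; adjust the types as needed.
names = [r['col'] for r in df.select('col').collect()]
schema = T.StructType([T.StructField(n, T.StringType(), True) for n in names])
df3 = spark.createDataFrame([], schema)
df3.show()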

// Sorry, this one is in Scala + Spark
import spark.implicits._
import org.apache.spark.sql.functions._
val lst = List("created_meta",
"updated_at",
"updated_meta",
"meta",
"Year",
"First Name",
"County",
"Sex",
"Count")
val source = lst.toDF("col")
source.show(false)
// +------------+
// |col |
// +------------+
// |created_meta|
// |updated_at |
// |updated_meta|
// |meta |
// |Year |
// |First Name |
// |County |
// |Sex |
// |Count |
// +------------+
val l = source.select('col).as[String].collect.toList
val df1 = l.foldLeft(source)((acc, col) => {
acc.withColumn(col, lit(""))
})
val df2 = df1.drop("col")
df2.printSchema()
// root
// |-- created_meta: string (nullable = false)
// |-- updated_at: string (nullable = false)
// |-- updated_meta: string (nullable = false)
// |-- meta: string (nullable = false)
// |-- Year: string (nullable = false)
// |-- First Name: string (nullable = false)
// |-- County: string (nullable = false)
// |-- Sex: string (nullable = false)
// |-- Count: string (nullable = false)
df2.show(1, false)
// +------------+----------+------------+----+----+----------+------+---+-----+
// |created_meta|updated_at|updated_meta|meta|Year|First Name|County|Sex|Count|
// +------------+----------+------------+----+----+----------+------+---+-----+
// | | | | | | | | | |
// +------------+----------+------------+----+----+----------+------+---+-----+
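Since the question asks for PySpark, here is a rough Python translation of the same fold-over-the-collected-values idea (a sketch, assuming df is the single-column source dataframe):
from functools import reduce
import pyspark.sql.functions as F

# Add one empty string column per collected value, then drop the original 'col' column.
names = [r['col'] for r in df.select('col').collect()]
df_wide = reduce(lambda acc, name: acc.withColumn(name, F.lit('')), names, df)
df_wide = df_wide.drop('col')
df_wide.printSchema()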

Related

Pyspark convert the column into a list

In my PySpark dataframe there is a column that is being inferred as a string, but it actually holds a JSON list. I need to convert it into a proper list/array type.
Sample value of the column:
[{"#ID":"07D40CB5-273E-4814-B82C-B5DCA7145D20","#ProductName":"Event Registration","#ProductCode":null,"#Type":"Admission Item","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#Status":"Active"},{"CustomFieldDetail":{"#FieldName":"Description","#FieldType":"Open Ended Text - Comment Box","#FieldValue":null,"#FieldId":"C6D46AD1-9B3F-45FF-9331-27EA47811E37"},"#ID":"8EA83E8B-7573-4550-905D-D4320496AD89","#ProductName":"Test","#ProductCode":null,"#Type":"Session","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#StartTime":"2018-01-29T18:00:00","#EndTime":"2018-01-29T19:00:00","#Status":"Active"}]
Please let me know how to achieve this
This is a dynamic solution: it first converts the JSON string column to an RDD, reads that RDD into a dataframe to infer the schema, and finally uses the inferred schema to parse the JSON strings in the original dataframe.
import pyspark.sql.functions as F
import pyspark.sql.types as T
This is not part of the solution, only the creation of the sample dataframe:
str1 = '''[{"#ID":"str1_1","a":11},{"#ID":"str1_2","d":[41,42,43]}]'''
str2 = '''[{"#ID":"str2_1","a":101},{"#ID":"str2_2","c":301},{"#ID":"str2_3","b":201,"nested":{"a":1001,"b":2001}}]'''
df = spark.createDataFrame([(1, str1,),(2, str2,)],['id', 'json_str'])
This is the actual solution
df_json_str = spark.read.json(df.rdd.map(lambda x:x.json_str))
json_str_schema = T.ArrayType(df_json_str.schema)
df = df.withColumn('json_str', F.from_json('json_str',json_str_schema))
df_json_str.printSchema()
root
|-- #ID: string (nullable = true)
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: long (containsNull = true)
|-- nested: struct (nullable = true)
| |-- a: long (nullable = true)
| |-- b: long (nullable = true)
df.printSchema()
root
|-- id: long (nullable = true)
|-- json_str: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #ID: string (nullable = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
| | |-- d: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- nested: struct (nullable = true)
| | | |-- a: long (nullable = true)
| | | |-- b: long (nullable = true)
df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------------+
|id |json_str |
+---+-----------------------------------------------------------------------------------------------------------------------------+
|1 |[{str1_1, 11, null, null, null, null}, {str1_2, null, null, null, [41, 42, 43], null}] |
|2 |[{str2_1, 101, null, null, null, null}, {str2_2, null, null, 301, null, null}, {str2_3, null, 201, null, null, {1001, 2001}}]|
+---+-----------------------------------------------------------------------------------------------------------------------------+
A quick demonstration that the result is correct and the dataframe explodes properly:
df_explode = df.selectExpr('id', 'inline(json_str)')
df_explode.printSchema()
root
|-- id: long (nullable = true)
|-- #ID: string (nullable = true)
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: long (containsNull = true)
|-- nested: struct (nullable = true)
| |-- a: long (nullable = true)
| |-- b: long (nullable = true)
df_explode.show(truncate=False)
+---+------+----+----+----+------------+------------+
|id |#ID |a |b |c |d |nested |
+---+------+----+----+----+------------+------------+
|1 |str1_1|11 |null|null|null |null |
|1 |str1_2|null|null|null|[41, 42, 43]|null |
|2 |str2_1|101 |null|null|null |null |
|2 |str2_2|null|null|301 |null |null |
|2 |str2_3|null|201 |null|null |{1001, 2001}|
+---+------+----+----+----+------------+------------+
Demonstration on the OP's sample:
import pyspark.sql.functions as F
import pyspark.sql.types as T
This is not part of the solution, only the creation of the sample dataframe:
str1 = '''[{"#ID":"07D40CB5-273E-4814-B82C-B5DCA7145D20","#ProductName":"Event Registration","#ProductCode":null,"#Type":"Admission Item","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#Status":"Active"},{"CustomFieldDetail":{"#FieldName":"Description","#FieldType":"Open Ended Text - Comment Box","#FieldValue":null,"#FieldId":"C6D46AD1-9B3F-45FF-9331-27EA47811E37"},"#ID":"8EA83E8B-7573-4550-905D-D4320496AD89","#ProductName":"Test","#ProductCode":null,"#Type":"Session","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#StartTime":"2018-01-29T18:00:00","#EndTime":"2018-01-29T19:00:00","#Status":"Active"}]'''
df = spark.createDataFrame([(1, str1,)],['id', 'json_str'])
df_json_str = spark.read.json(df.rdd.map(lambda x:x.json_str))
df_json_str.printSchema()
root
|-- #EndTime: string (nullable = true)
|-- #ID: string (nullable = true)
|-- #ProductCode: string (nullable = true)
|-- #ProductDescription: string (nullable = true)
|-- #ProductName: string (nullable = true)
|-- #SessionCategoryId: string (nullable = true)
|-- #SessionCategoryName: string (nullable = true)
|-- #StartTime: string (nullable = true)
|-- #Status: string (nullable = true)
|-- #Type: string (nullable = true)
|-- CustomFieldDetail: struct (nullable = true)
| |-- #FieldId: string (nullable = true)
| |-- #FieldName: string (nullable = true)
| |-- #FieldType: string (nullable = true)
| |-- #FieldValue: string (nullable = true)
json_str_schema = T.ArrayType(df_json_str.schema)
df = df.withColumn('json_str', F.from_json('json_str',json_str_schema))
df.printSchema()
root
|-- id: long (nullable = true)
|-- json_str: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #EndTime: string (nullable = true)
| | |-- #ID: string (nullable = true)
| | |-- #ProductCode: string (nullable = true)
| | |-- #ProductDescription: string (nullable = true)
| | |-- #ProductName: string (nullable = true)
| | |-- #SessionCategoryId: string (nullable = true)
| | |-- #SessionCategoryName: string (nullable = true)
| | |-- #StartTime: string (nullable = true)
| | |-- #Status: string (nullable = true)
| | |-- #Type: string (nullable = true)
| | |-- CustomFieldDetail: struct (nullable = true)
| | | |-- #FieldId: string (nullable = true)
| | | |-- #FieldName: string (nullable = true)
| | | |-- #FieldType: string (nullable = true)
| | | |-- #FieldValue: string (nullable = true)
df.show(truncate=False)
+---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |json_str |
+---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[{null, 07D40CB5-273E-4814-B82C-B5DCA7145D20, null, null, Event Registration, 00000000-0000-0000-0000-000000000000, null, null, Active, Admission Item, null}, {2018-01-29T19:00:00, 8EA83E8B-7573-4550-905D-D4320496AD89, null, null, Test, 00000000-0000-0000-0000-000000000000, null, 2018-01-29T18:00:00, Active, Session, {C6D46AD1-9B3F-45FF-9331-27EA47811E37, Description, Open Ended Text - Comment Box, null}}]|
+---+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df.selectExpr('id', 'inline(json_str)').show(truncate=False)
+---+-------------------+------------------------------------+------------+-------------------+------------------+------------------------------------+--------------------+-------------------+-------+--------------+----------------------------------------------------------------------------------------+
|id |#EndTime |#ID |#ProductCode|#ProductDescription|#ProductName |#SessionCategoryId |#SessionCategoryName|#StartTime |#Status|#Type |CustomFieldDetail |
+---+-------------------+------------------------------------+------------+-------------------+------------------+------------------------------------+--------------------+-------------------+-------+--------------+----------------------------------------------------------------------------------------+
|1 |null |07D40CB5-273E-4814-B82C-B5DCA7145D20|null |null |Event Registration|00000000-0000-0000-0000-000000000000|null |null |Active |Admission Item|null |
|1 |2018-01-29T19:00:00|8EA83E8B-7573-4550-905D-D4320496AD89|null |null |Test |00000000-0000-0000-0000-000000000000|null |2018-01-29T18:00:00|Active |Session |{C6D46AD1-9B3F-45FF-9331-27EA47811E37, Description, Open Ended Text - Comment Box, null}|
+---+-------------------+------------------------------------+------------+-------------------+------------------+------------------------------------+--------------------+-------------------+-------+--------------+----------------------------------------------------------------------------------------+
This solution is relevant if you know the schema in advance.
It is of course more efficient, performance-wise, than inferring the schema.
import pyspark.sql.functions as F
myval = '''[{"#ID":"07D40CB5-273E-4814-B82C-B5DCA7145D20","#ProductName":"Event Registration","#ProductCode":null,"#Type":"Admission Item","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#Status":"Active"},{"CustomFieldDetail":{"#FieldName":"Description","#FieldType":"Open Ended Text - Comment Box","#FieldValue":null,"#FieldId":"C6D46AD1-9B3F-45FF-9331-27EA47811E37"},"#ID":"8EA83E8B-7573-4550-905D-D4320496AD89","#ProductName":"Test","#ProductCode":null,"#Type":"Session","#ProductDescription":null,"#SessionCategoryName":null,"#SessionCategoryId":"00000000-0000-0000-0000-000000000000","#StartTime":"2018-01-29T18:00:00","#EndTime":"2018-01-29T19:00:00","#Status":"Active"}]'''
df = spark.createDataFrame([(1, myval,)], 'id int, myjson string')
myschema = F.schema_of_json(myval)
# -------------- or -----------------
# myschema = spark.sql(f"select schema_of_json('{myval}')").first()[0]
# print (myschema)
# ARRAY<STRUCT<`#EndTime`: STRING, `#ID`: STRING, `#ProductCode`: STRING, `#ProductDescription`: STRING, `#ProductName`: STRING, `#SessionCategoryId`: STRING, `#SessionCategoryName`: STRING, `#StartTime`: STRING, `#Status`: STRING, `#Type`: STRING, `CustomFieldDetail`: STRUCT<`#FieldId`: STRING, `#FieldName`: STRING, `#FieldType`: STRING, `#FieldValue`: STRING>>>
# -----------------------------------
df = df.withColumn('myjson', F.from_json('myjson',myschema))
df.show()
+---+--------------------+
| id| myjson|
+---+--------------------+
| 1|[{null, 07D40CB5-...|
+---+--------------------+
df.selectExpr('id', 'inline(myjson)').show()
+---+-------------------+--------------------+------------+-------------------+------------------+--------------------+--------------------+-------------------+-------+--------------+--------------------+
| id| #EndTime| #ID|#ProductCode|#ProductDescription| #ProductName| #SessionCategoryId|#SessionCategoryName| #StartTime|#Status| #Type| CustomFieldDetail|
+---+-------------------+--------------------+------------+-------------------+------------------+--------------------+--------------------+-------------------+-------+--------------+--------------------+
| 1| null|07D40CB5-273E-481...| null| null|Event Registration|00000000-0000-000...| null| null| Active|Admission Item| null|
| 1|2018-01-29T19:00:00|8EA83E8B-7573-455...| null| null| Test|00000000-0000-000...| null|2018-01-29T18:00:00| Active| Session|{C6D46AD1-9B3F-45...|
+---+-------------------+--------------------+------------+-------------------+------------------+--------------------+--------------------+-------------------+-------+--------------+--------------------+
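If the schema really is known up front, you can also write it as a DDL string and pass it straight to from_json, skipping schema_of_json entirely. A sketch covering only a few of the fields (extend the DDL to the full set), applied to the raw dataframe from the createDataFrame step above:
import pyspark.sql.functions as F

# A hand-written DDL schema for the array of entries, truncated to a few fields.
# from_json accepts this DDL string directly as the schema argument.
ddl = "array<struct<`#ID`: string, `#ProductName`: string, `#Type`: string, `#Status`: string>>"
df = df.withColumn('myjson', F.from_json('myjson', ddl))
df.selectExpr('id', 'inline(myjson)').show()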

Modify a struct column in Spark dataframe

I have a PySpark dataframe which contains a column "student" as follows:
"student" : {
"name" : "kaleem",
"rollno" : "12"
}
Schema for this in dataframe is:
structType(List(
name: String,
rollno: String))
I need to modify this column as
"student" : {
"student_details" : {
"name" : "kaleem",
"rollno" : "12"
}
}
Schema for this in dataframe must be:
structType(List(
student_details:
structType(List(
name: String,
rollno: String))
))
How to do this in Spark?
Use the named_struct function to achieve this:
1. Read the JSON as a column
val data =
"""
| {
| "student": {
| "name": "kaleem",
| "rollno": "12"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(data).toDS())
df.show(false)
println(df.schema("student"))
Output
+------------+
|student |
+------------+
|[kaleem, 12]|
+------------+
StructField(student,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)
2. Change the schema using named_struct
val processedDf = df.withColumn("student",
expr("named_struct('student_details', student)")
)
processedDf.show(false)
println(processedDf.schema("student"))
Output
+--------------+
|student |
+--------------+
|[[kaleem, 12]]|
+--------------+
StructField(student,StructType(StructField(student_details,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)),false)
For Python, step 2 will work as is; just remove val.
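For illustration, a hedged PySpark version of the same named_struct step (assuming the dataframe already contains the student struct column as above):
import pyspark.sql.functions as F

# Wrap the existing 'student' struct in an outer struct with a single
# 'student_details' field, mirroring the Scala expr above.
processed_df = df.withColumn("student", F.expr("named_struct('student_details', student)"))
processed_df.printSchema()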
With a library called spark-hats (it extends the Spark DataFrame API with helpers for transforming fields inside nested structures and arrays of arbitrary levels of nesting), you can do a lot of these transformations.
scala> import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- ID: string (nullable = true)
scala> val df2 = df.nestedMapColumn("ID", "ID", c => struct(c as "alfa"))
scala> df2.printSchema
root
|-- ID: struct (nullable = false)
| |-- alfa: string (nullable = true)
scala> val df3 = df2.nestedMapColumn("ID.alfa", "ID.alfa", c => struct(c as "beta"))
scala> df3.printSchema
root
|-- ID: struct (nullable = false)
| |-- alfa: struct (nullable = false)
| | |-- beta: string (nullable = true)
Your query would be
df.nestedMapColumn("student", "student", c => struct(c as "student_details"))
Spark 3.1+
To modify struct type columns, we can use withField and dropFields
F.col("Student").withField("student_details", F.col("student"))
F.col("Student").dropFields("name", "rollno")
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([(("kaleem", "12"),)], "student struct<name:string,rollno:string>")
df.printSchema()
# root
# |-- student: struct (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rollno: string (nullable = true)
Script:
df = df.withColumn("student", F.col("Student")
.withField("student_details", F.col("student"))
.dropFields("name", "rollno")
)
Result:
df.printSchema()
# root
# |-- student: struct (nullable = true)
# | |-- student_details: struct (nullable = true)
# | | |-- name: string (nullable = true)
# | | |-- rollno: string (nullable = true)

Spark SQL not equals not working on column created using lag

I am trying to find missing sequences in a CSV file, and here is the code I have:
val customSchema = StructType(
Array(
StructField("MessageId", StringType, false),
StructField("msgSeqID", LongType, false),
StructField("site", StringType, false),
StructField("msgType", StringType, false)
)
)
val logFileDF = sparkSession.sqlContext.read.format("csv")
.option("delimiter",",")
.option("header", false)
.option("mode", "DROPMALFORMED")
.schema(customSchema)
.load(logFilePath)
.toDF()
logFileDF.printSchema()
logFileDF.repartition(5000)
logFileDF.createOrReplaceTempView("LogMessageData")
sparkSession.sqlContext.cacheTable("LogMessageData")
val selectQuery: String = "SELECT MessageId,site,msgSeqID,msgType,lag(msgSeqID,1) over (partition by site,msgType order by site,msgType,msgSeqID) as prev_val FROM LogMessageData order by site,msgType,msgSeqID"
val logFileLagRowNumDF = sparkSession.sqlContext.sql(selectQuery).toDF()
logFileLagRowNumDF.repartition(1000)
logFileLagRowNumDF.printSchema()
logFileLagRowNumDF.createOrReplaceTempView("LogMessageDataUpdated")
sparkSession.sqlContext.cacheTable("LogMessageDataUpdated")
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val!=null and msgSeqID != prev_val+1 order by site,msgType,msgSeqID";
val errorRecordQueryDF = sparkSession.sqlContext.sql(errorRecordQuery).toDF()
logger.info("Total. No.of Missing records =[" + errorRecordQueryDF.count() + "]")
val noOfMissingRecords = errorRecordQueryDF.count()
Here is the sample data I have:
Msg_1,S1,A,10000000
Msg_2,S1,A,10000002
Msg_3,S2,A,10000003
Msg_4,S3,B,10000000
Msg_5,S3,B,10000001
Msg_6,S3,A,10000003
Msg_7,S3,A,10000001
Msg_8,S3,A,10000002
Msg_9,S4,A,10000000
Msg_10,S4,A,10000001
Msg_11,S4,A,10000000
Msg_12,S4,A,10000005
And here is the output I am getting.
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgType: string (nullable = true)
|-- msgSeqID: long (nullable = true)
INFO EimComparisonProcessV2 - Total No.Of Records =[12]
+---------+----+-------+--------+
|MessageId|site|msgType|msgSeqID|
+---------+----+-------+--------+
| Msg_1| S1| A|10000000|
| Msg_2| S1| A|10000002|
| Msg_3| S2| A|10000003|
| Msg_4| S3| B|10000000|
| Msg_5| S3| B|10000001|
| Msg_6| S3| A|10000003|
| Msg_7| S3| A|10000001|
| Msg_8| S3| A|10000002|
| Msg_9| S4| A|10000000|
| Msg_10| S4| A|10000001|
| Msg_11| S4| A|10000000|
| Msg_12| S4| A|10000005|
+---------+----+-------+--------+
root
|-- MessageId: string (nullable = true)
|-- site: string (nullable = true)
|-- msgSeqID: long (nullable = true)
|-- msgType: string (nullable = true)
|-- prev_val: long (nullable = true)
+---------+----+--------+-------+--------+
|MessageId|site|msgSeqID|msgType|prev_val|
+---------+----+--------+-------+--------+
| Msg_1| S1|10000000| A| null|
| Msg_2| S1|10000002| A|10000000|
| Msg_3| S2|10000003| A| null|
| Msg_7| S3|10000001| A| null|
| Msg_8| S3|10000002| A|10000001|
+---------+----+--------+-------+--------+
only showing top 5 rows
INFO TestProcess - Total No.Of Records Updated DF=[12]
INFO TestProcess - Total. No.of Missing records =[0]
In SQL, any comparison with NULL evaluates to NULL rather than true, so prev_val != null never matches a row; you need IS NOT NULL. Just replace your query with this line of code and it will show you the results.
val errorRecordQuery: String = "select * from LogMessageDataUpdated where prev_val is not null and msgSeqID <> prev_val+1 order by site,msgType,msgSeqID";
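The same gap check can also be written with the DataFrame API instead of SQL. A rough sketch in PySpark (the question's code is Scala), assuming the log dataframe is loaded as logFileDF above:
from pyspark.sql import functions as F, Window

# Previous sequence number within each (site, msgType) group, ordered by msgSeqID.
w = Window.partitionBy("site", "msgType").orderBy("msgSeqID")
with_prev = logFileDF.withColumn("prev_val", F.lag("msgSeqID", 1).over(w))

# Keep rows whose predecessor exists but is not exactly one less than the current value.
missing = with_prev.where(F.col("prev_val").isNotNull() & (F.col("msgSeqID") != F.col("prev_val") + 1))
missing.orderBy("site", "msgType", "msgSeqID").show()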

How to create a JSON list using Pyspark?

I am trying to create a JSON file with the below structure using Pyspark.
Target Output:
[{
"Loaded_data": [{
"Loaded_numeric_columns": ["id", "val"],
"Loaded_category_columns": ["name", "branch"]
}],
"enriched_data": [{
"enriched_category_columns": ["country__4"],
"enriched_index_columns": ["id__1", "val__3"]
}]
}]
I was able to create a list for each section; please refer to the code below. I am kind of stuck here, could you please help?
Sample data:
input_data=spark.read.csv("/tmp/test234.csv",header=True, inferSchema=True)
def is_numeric(data_type):
return data_type not in ('date', 'string', 'boolean')
def is_nonnumeric(data_type):
return data_type in ('string')
sub="__"
Loaded_numeric_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub not in name)]
print(Loaded_numeric_columns)
Loaded_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub not in name)]
print(Loaded_category_columns)
enriched_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub in name)]
print(enriched_category_columns)
enriched_index_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub in name)]
print(enriched_index_columns)
You can just create the new columns with struct and array:
from pyspark.sql import functions as F
df.show()
+---+-----+-------+------+----------+-----+-------+
| id| val| name|branch|country__4|id__1| val__3|
+---+-----+-------+------+----------+-----+-------+
| 1|67.87|Shankar| a| 1|67.87|Shankar|
+---+-----+-------+------+----------+-----+-------+
df.select(
F.struct(
F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
).alias("Loaded_data"),
F.struct(
F.array(F.col("country__4")).alias("enriched_category_columns"),
F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
).alias("enriched_data"),
).printSchema()
root
|-- Loaded_data: struct (nullable = false)
| |-- Loaded_numeric_columns: array (nullable = false)
| | |-- element: double (containsNull = true)
| |-- Loaded_category_columns: array (nullable = false)
| | |-- element: string (containsNull = true)
|-- enriched_data: struct (nullable = false)
| |-- enriched_category_columns: array (nullable = false)
| | |-- element: long (containsNull = true)
| |-- enriched_index_columns: array (nullable = false)
| | |-- element: string (containsNull = true)
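To go from the four lists computed in the question to the target JSON layout, one option is to assemble the structure in plain Python and write it out with the json module. A minimal sketch reusing the variable names from the question (the output path is just an example):
import json

# Assemble the target structure from the already computed column-name lists.
output = [{
    "Loaded_data": [{
        "Loaded_numeric_columns": Loaded_numeric_columns,
        "Loaded_category_columns": Loaded_category_columns
    }],
    "enriched_data": [{
        "enriched_category_columns": enriched_category_columns,
        "enriched_index_columns": enriched_index_columns
    }]
}]
with open("/tmp/output.json", "w") as f:
    json.dump(output, f, indent=2)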

Parse JSON in Spark containing a reserved character

I have a JSON input.txt file with data as follows:
2018-05-30.txt:{"Message":{"eUuid":"6e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1527539376,"id":"XYZ","location":{"dim":{"x":2,"y":-7},"towards":121.0},"source":"a","UniqueId":"test123","code":"del","signature":"xyz","":{},"vel":{"ground":15},"height":{},"next":{"dim":{}},"sub":"del1"}}
2018-05-30.txt:{"Message":{"eUuid":"5e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1627539376,"id":"ABC","location":{"dim":{"x":1,"y":-8},"towards":132.0},"source":"b","UniqueId":"hello123","code":"fra","signature":"abc","":{},"vel":{"ground":16},"height":{},"next":{"dim":{}},"sub":"fra1"}}
.
.
I tried to load the JSON into a DataFrame as follows:
>>val df = spark.read.json("<full path of input.txt file>")
I am receiving a _corrupt_record dataframe.
I am aware that the JSON contains "." (2018-05-30.txt) as a reserved character, which is causing the issue. How may I resolve this?
val rdd = sc.textFile("/Users/kishore/abc.json")
val jsonRdd= rdd.map(x=>x.split("txt:")(1))
scala> df.show
+--------------------+
| Message|
+--------------------+
|[test123,del,6e7d...|
|[hello123,fra,5e7...|
+--------------------+
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// val df = sqlContext.read.json(jsonRdd)
// df.show(false)
val df = sqlContext.read.json(jsonRdd).withColumn("eUuid", $"Message"("eUuid"))
.withColumn("schemaVersion", $"Message"("schemaVersion"))
.withColumn("timestamp", $"Message"("timestamp"))
.withColumn("id", $"Message"("id"))
.withColumn("source", $"Message"("source"))
.withColumn("UniqueId", $"Message"("UniqueId"))
.withColumn("location", $"Message"("location"))
.withColumn("dim", $"location"("dim"))
.withColumn("x", $"dim"("x"))
.withColumn("y", $"dim"("y"))
.drop("dim")
.withColumn("vel", $"Message"("vel"))
.withColumn("ground", $"vel"("ground"))
.withColumn("sub", $"Message"("sub"))
.drop("Message")
df.show()
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
| eUuid|schemaVersion| timestamp| id|source|UniqueId| location| x| y| vel|ground| sub|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
|6e7d4890-9279-491...| 1.0-AB1|1527539376|XYZ| a| test123|[[2,-7],121]| 2| -7|[15]| 15|del1|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
The problem is not a reserved character; it is that the file does not contain valid JSON, since every line is prefixed with the file name (2018-05-30.txt:).
So you can strip that prefix and then parse:
val df=spark.read.textFile(...)
val json=spark.read.json(df.map(v=>v.drop(15)))
json.printSchema()
root
|-- Message: struct (nullable = true)
| |-- UniqueId: string (nullable = true)
| |-- code: string (nullable = true)
| |-- eUuid: string (nullable = true)
| |-- id: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- dim: struct (nullable = true)
| | | |-- x: long (nullable = true)
| | | |-- y: long (nullable = true)
| | |-- towards: double (nullable = true)
| |-- schemaVersion: string (nullable = true)
| |-- signature: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sub: string (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- vel: struct (nullable = true)
| | |-- ground: long (nullable = true)
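For completeness, a rough PySpark equivalent of the same approach (a sketch, assuming every line starts with the fixed 15-character 2018-05-30.txt: prefix):
# Read the file as plain text, strip the file-name prefix, then parse the rest as JSON.
lines = spark.read.text("<full path of input.txt file>")
json_df = spark.read.json(lines.rdd.map(lambda r: r.value[15:]))
json_df.printSchema()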
