Handling corrupted data in Pyspark dataframe - apache-spark

I have data that I need to read into a PySpark DataFrame even when it is corrupted. I tried PERMISSIVE mode but I still get an error. The same code works if account_id has a value.
The data I have, where account_id (an integer) has no value:
{
    "Name:"
    "account_id":,
    "phone_number":1234567890,
    "transactions":[
        {
            "Spent":1000,
        },
        {
            "spent":1100,
        }
    ]
}
The code I tried:
df=spark.read.option("mode","PERMISSIVE").json("path\complex.json",multiLine=True)
df.show()
The error and warning I get:
pyspark.sql.utils.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
How can I read corrupted data into a PySpark DataFrame?
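A minimal sketch of the workaround the error message suggests: supply an explicit schema that includes the _corrupt_record column, cache the parsed result, and only then query it. The top-level field names are taken from the sample above; the types (and the omission of the nested transactions array) are assumptions.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
# Explicit schema with a _corrupt_record column; malformed records land there
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("account_id", IntegerType(), True),
    StructField("phone_number", LongType(), True),
    StructField("_corrupt_record", StringType(), True),
])
df = spark.read.schema(schema).option("mode", "PERMISSIVE").json("path/complex.json", multiLine=True)
df = df.cache()  # cache before touching _corrupt_record, as the error message advises
df.show(truncate=False)
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)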

Related

Cannot write Dataframe result as a Hive table/LFS file

I have run into an issue while writing the filtered data to a file. Around 27 files are created in the local file system, but they contain no output.
Below is the code used:
I'm reading the file as a DataFrame:
val in_df=spark.read.csv("file:///home/Desktop/Project/inputdata.csv").selectExpr("_c0 as Id","_c1 as name","_c2 as dept")
Then I register this DataFrame as a temp table:
in_df.registerTempTable("employeeDetails")
Now the requirement is to count the number of employees in each department and store the result in a file.
val employeeDeptCount=spark.sql("select dept,count(*) from employeedetails group by dept")
// The following code writes to the Hive default warehouse as n parquet files.
employeeDeptCount.write.saveAsTable("aggregatedcount")
// The following code writes to the LFS: n files are created, but with no output.
employeeDeptCount.write.mode("append").csv("file:///home/Desktop/Project")
import org.apache.spark.sql.SaveMode
val in_df = spark.read.csv("file:///home/Desktop/Project/inputdata.csv").selectExpr("_c0 as Id", "_c1 as name", "_c2 as dept")
// First, inspect the data
in_df.show(false)
val employeeDeptCount = in_df.groupBy("dept").count().alias("count")
employeeDeptCount.persist()
employeeDeptCount.write.format("csv").mode(SaveMode.Overwrite).saveAsTable("aggregatedcount")
employeeDeptCount.repartition(1).write.mode("append").csv("file:///home/Desktop/Project")
employeeDeptCount.unpersist()
// in_df.createOrReplaceTempView("employeeDetails")
// in_df.createOrReplaceGlobalTempView("employeeDetails")
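Since the rest of this thread is PySpark, here is a rough PySpark equivalent of the same group-count-and-write flow, as a sketch (paths as in the question):
in_df = (spark.read.csv("file:///home/Desktop/Project/inputdata.csv")
         .selectExpr("_c0 as Id", "_c1 as name", "_c2 as dept"))
# Aggregate once and persist so the two writes below do not recompute the plan
dept_count = in_df.groupBy("dept").count().persist()
dept_count.write.mode("overwrite").saveAsTable("aggregatedcount")  # Hive table
dept_count.repartition(1).write.mode("append").csv("file:///home/Desktop/Project")  # single CSV part file in the LFS path
dept_count.unpersist()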

How to query data from PySpark SQL context if a key is not present in the JSON file, and how to catch the SQL AnalysisException

I am using PySpark to transform JSON into a DataFrame, and I am able to transform it successfully. The problem I am facing is that there is a key which is present in some JSON files and not in others. When I flatten the JSON with the PySpark SQL context and the key is not present in a file, creating my PySpark DataFrame fails with a SQL AnalysisException.
For example, my sample JSON:
{
    "_id" : ObjectId("5eba227a0bce34b401e7899a"),
    "origin" : "inbound",
    "converse" : "72412952",
    "Start" : "2020-04-20T06:12:20.89Z",
    "End" : "2020-04-20T06:12:53.919Z",
    "ConversationMos" : 4.88228940963745,
    "ConversationRFactor" : 92.4383773803711,
    "participantId" : "bbe4de4c-7b3e-49f1-8",
}
The participantId above will be present in some JSON files and not in others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is raised. How can I handle this so that, if the key is not present, the column contains null, or is there another way to handle it?
You can simply check whether the column is there and, if not, add it with empty values.
The code for that looks like:
from pyspark.sql import functions as f
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
# Add the column with empty values if it is missing from this file's schema
if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
I think you're calling Spark to read one file at a time and inferring the schema at the same time.
What Spark is telling you with the SQL AnalysisException is that your file and your inferred schema don't have the key you're looking for. What you have to do is build a good schema and apply it to all of the files you want to process, ideally processing all of your files at once.
There are three strategies:
1. Infer your schema from lots of files. You should get the aggregate of all of the keys. Spark will run two passes over the data:
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
2. Create a schema object by hand. I find this tedious, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
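For example, a minimal sketch of such a schema for the sample document above (field names come from the sample; the types are assumptions):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
# Hand-built schema covering every key we expect, including the optional participantId
schema = StructType([
    StructField("_id", StringType(), True),
    StructField("origin", StringType(), True),
    StructField("converse", StringType(), True),
    StructField("Start", StringType(), True),
    StructField("End", StringType(), True),
    StructField("ConversationMos", DoubleType(), True),
    StructField("ConversationRFactor", DoubleType(), True),
    StructField("participantId", StringType(), True),  # nullable, so files without the key yield null
])
df = spark.read.schema(schema).json(generated_FileLocation)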
3. Read the schema from a well-formed file, then use it to read your whole directory. Also, by printing the schema object, you can copy-paste it back into your code for option #2.
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')

Saving DataFrame as ORC loses StructField metadata

Spark version: 2.4.4
When writing a DataFrame, I want to put some metadata on certain fields. It's important for this metadata to be persisted when I write out the DataFrame and read it again later. If I save this DataFrame as Parquet and then read it back, I see the metadata is preserved. But saving as ORC, the metadata is lost when I read the files. Here is a bit of code to show how I'm doing this (in Java):
// set up schema and dataframe
Metadata myMeta = new MetadataBuilder().putString("myMetaData", "foo").build();
StructField field = DataTypes.createStructField("x", DataTypes.IntegerType, true, myMeta);
Dataset<Row> df = sparkSession.createDataFrame(rdd, /* a schema using this field */);
// write it
df.write().format("parquet").save("test");
// read it again
Dataset<Row> df2 = sparkSession.read().format("parquet").load("test");
// check the schema after reading files
df2.schema().prettyJson();
df2.schema().fields()[0].metadata();
Using the Parquet format, the metadata is deserialized as I expect. However, if I change to the ORC format, the metadata comes back as an empty map.
Is this a known bug in the Spark ORC implementation or am I missing something? Thanks.
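For reference, a minimal PySpark sketch of the same round-trip can be used to check whether the metadata survives a given format (the field name and metadata value mirror the Java snippet above):
from pyspark.sql.types import StructType, StructField, IntegerType
# Attach metadata to a field, write in both formats, read back, and inspect the schema
schema = StructType([StructField("x", IntegerType(), True, {"myMetaData": "foo"})])
df = spark.createDataFrame([(1,), (2,)], schema)
df.write.mode("overwrite").format("parquet").save("test_parquet")
df.write.mode("overwrite").format("orc").save("test_orc")
for fmt, path in [("parquet", "test_parquet"), ("orc", "test_orc")]:
    back = spark.read.format(fmt).load(path)
    print(fmt, back.schema.fields[0].metadata)  # Parquet preserves the metadata; ORC reportedly returns an empty map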

Spark dataframe is NULL (Invalid Tree)

I have a Spark (2.1) job that processes stream data using a Kafka direct stream. I enrich the stream data with data files stored in HDFS: I first read the data files (*.parquet) into a DataFrame, then enrich each record with this DataFrame.
The code ran without any error, but the enrichment did not occur. I ran the code in debug mode and found that the DataFrame (e.g. df) is shown as an invalid tree. Why is the DataFrame null inside rdd.foreachPartition, and how can I correct this problem? Thanks!
val kafkaSinkVar = ssc.sparkContext.broadcast(KafkaSink(kafkaServers, outputTopic))
Service.aggregate(kafkaInputStream).foreachRDD(rdd => {
  val df = ss.read.parquet(filePath + "/*.parquet")
  println("Record Count in DF: " + df.count()) // the console shows the files were loaded successfully, record count = 1300
  rdd.foreachPartition(partition => {
    val futures = partition.map(event => {
      sentMsgsNo.add(1L)
      val eventEnriched = someEnrichmen1(event, df) // df is shown as an invalid tree here
      kafkaSinkVar.value.sendCef(eventEnriched)
    })
  })
})
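A DataFrame only exists on the driver, so referencing df inside foreachPartition (which runs on executors) cannot work. A hedged PySpark sketch of one common workaround, assuming the parquet lookup data is small enough to collect and that records join on a hypothetical "key" field:
# Collect the lookup data on the driver and broadcast it as a plain dict
lookup_df = spark.read.parquet(file_path + "/*.parquet")  # file_path as in the question
lookup = {row["key"]: row.asDict() for row in lookup_df.collect()}  # "key" is a hypothetical join field
lookup_bc = ssc.sparkContext.broadcast(lookup)
def handle_partition(partition):
    table = lookup_bc.value
    for event in partition:
        enriched = enrich(event, table.get(event["key"]))  # enrich() is a hypothetical function
        send(enriched)  # e.g. a Kafka producer call
# inside foreachRDD:
# rdd.foreachPartition(handle_partition)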

Null type schema in spark-salesforce connector

I have a Dataset<Row> with 48 columns imported from Salesforce:
Dataset<Row> df = spark.read()
.format("com.springml.spark.salesforce")
.option("username", prop.getProperty("salesforce_user"))
.option("password", prop.getProperty("salesforce_auth"))
.option("login", prop.getProperty("salesforce_login_url"))
.option("soql", "SELECT "+srcCols+" from "+tableNm)
.option("version", prop.getProperty("salesforce_version"))
.load()
The columns contain nulls as well.
I need to store this Dataset in a .txt file delimited by ^.
I tried to store it as a text file using:
finalDS.coalesce(1).write().option("delimiter", "^").toString().text(hdfsExportLoaction);
But I got this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<Columns....>to Tuple1, but failed as the number of fields does not line up.;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveDeserializer$$fail(Analyzer.scala:2320)
I tried:
finalDS.map(row -> row.mkString(), Encoders.STRING()).write().option("delimiter", "^").text(hdfsExportLoaction);
but the delimiter vanishes and all the data is written concatenated.
I then tried to save it as CSV (just to make it work):
finalDS.coalesce(1).write().mode(SaveMode.Overwrite).option("header", "true").option("delimiter", "^").option("nullValue", "").csv(hdfsExportLoaction+"/"+tableNm);
and:
finalDS.na().fill("").coalesce(1).write().option("delimiter", "^").mode(SaveMode.Overwrite).csv(hdfsExportLoaction);
but then it complained:
Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Nothing is working.
When trying to write as a text file, either the delimiter is removed or I get the error that only a single column can be written to a text file.
When trying to write as a CSV, I get the exception that the null data type is not supported.
I think you have a problem in the Dataset or DataFrame itself. For me,
df.coalesce(1).write.option("delimiter", "^").mode(SaveMode.Overwrite).csv("<path>")
worked as expected: it is properly delimited with "^". I would suggest inspecting the data in your DataFrame or Dataset and the operations you are applying to it. Before writing the data, run df.count once and see whether it fails.
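If the root cause is the "CSV data source does not support null data type" error, one common workaround, sketched here in PySpark with hypothetical names (final_ds, hdfs_export_location), is to cast the all-null columns whose type could not be inferred to strings before writing:
from pyspark.sql.functions import col
from pyspark.sql.types import NullType, StringType
# Cast NullType columns (all-null columns with no inferable type) to StringType so the CSV writer can serialize them
fixed = final_ds.select([
    col(f.name).cast(StringType()).alias(f.name) if isinstance(f.dataType, NullType) else col(f.name)
    for f in final_ds.schema.fields
])
fixed.coalesce(1).write.option("delimiter", "^").option("nullValue", "").mode("overwrite").csv(hdfs_export_location)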
