How to query data from the PySpark SQL context if a key is not present in the JSON file, and how to catch the SQL AnalysisException - apache-spark

I am using PySpark to transform JSON into a DataFrame, and the transformation itself works. The problem is that one key is present in some JSON files and absent from others. When I flatten the JSON with the PySpark SQL context and the key is missing from a file, creating the DataFrame fails with a SQL AnalysisException.
For example, my sample JSON:
{
"_id" : ObjectId("5eba227a0bce34b401e7899a"),
"origin" : "inbound",
"converse" : "72412952",
"Start" : "2020-04-20T06:12:20.89Z",
"End" : "2020-04-20T06:12:53.919Z",
"ConversationMos" : 4.88228940963745,
"ConversationRFactor" : 92.4383773803711,
"participantId" : "bbe4de4c-7b3e-49f1-8",
}
In the JSON above, participantId is present in some JSON files and missing from others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is thrown. How can I handle this so that, if the key is missing, the column contains null? Or is there another way to handle it?

You can simply check whether the column is missing and, if so, add it with empty values.
The code for that looks like this:
from pyspark.sql import functions as f
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
# Add the column when the inferred schema lacks it, so the SELECT below can resolve it
if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
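The same check can be extended to several optional keys by building the SELECT list from the columns that are actually present. The following is only a sketch in plain Python: build_select_list is a hypothetical helper name, and it assumes that a missing key should become a NULL string column rather than an empty string.

```python
def build_select_list(wanted, available, null_type="string"):
    """Return a SQL select list; columns missing from `available` become typed NULLs."""
    # Spark SQL resolves column names case-insensitively by default, so compare lowercased
    present = {name.lower() for name in available}
    parts = []
    for col in wanted:
        if col.lower() in present:
            parts.append(col)
        else:
            # Assumption: a NULL of the given type is an acceptable placeholder
            parts.append(f"CAST(NULL AS {null_type}) AS {col}")
    return ", ".join(parts)


wanted = ["origin", "converse", "start", "end", "participantId"]
available = ["_id", "origin", "converse", "Start", "End"]  # e.g. taken from fetchFile.columns
query = f"select {build_select_list(wanted, available)} from CreateDataFrame"
print(query)
```

The resulting string could then be passed to sqlContext.sql(...) in place of the hard-coded query.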

I think you're calling Spark to read one file at a time and inferring the schema at the same time.
What Spark is telling you with the SQL AnalysisException is that your file and your inferred schema don't have the key you're looking for. What you have to do is obtain a good schema and apply it to all of the files you want to process, ideally processing all of your files at once.
There are three strategies:
1. Infer your schema from lots of files. You should get the aggregate of all of the keys. Spark will run two passes over the data.
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
2. Create a schema object.
I find this tedious to do, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
3. Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy and paste it back into your code for option #2.
# .schema extracts the StructType; spark.read.json(...) on its own returns a DataFrame
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')

Related

Create a parquet file with custom schema

I have a requirement like this:
In Databricks, we are reading a csv file. This file has multiple columns like emp_name, emp_salary, joining_date etc. When we read this file in a dataframe, we are getting all the columns as string.
We have an API which will give us the schema of the columns. emp_name is string(50), emp_salary is decimal(7,4), joining_date as timestamp etc.
I have to create a parquet file with the schema that is coming from the API.
How can we do this in Databricks using PySpark?
You can always pass in the schema when reading:
schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
df = spark.read.csv('input.csv', schema=schema)
df.printSchema()
df.show()
The only thing to be careful about is that some type strings from the API cannot be used directly; e.g., "string(50)" needs to be converted to "string".
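The conversion from API types to a schema string can be scripted. This is a minimal sketch, assuming the API returns (column name, type) pairs; the to_ddl name and the pair format are assumptions for illustration, and only the string(n) case mentioned above is rewritten.

```python
import re


def to_ddl(columns):
    """Build a Spark DDL schema string from (name, api_type) pairs."""
    fields = []
    for name, api_type in columns:
        # Drop the length from string(n); leave decimal(p,s) and other types untouched
        spark_type = re.sub(r"^string\(\d+\)$", "string", api_type.strip(), flags=re.IGNORECASE)
        fields.append(f"{name} {spark_type}")
    return ", ".join(fields)


api_columns = [
    ("emp_name", "string(50)"),
    ("emp_salary", "decimal(7,4)"),
    ("joining_date", "timestamp"),
]
schema = to_ddl(api_columns)
print(schema)  # emp_name string, emp_salary decimal(7,4), joining_date timestamp
```

The resulting string can be passed as schema= to spark.read.csv as shown above, and the DataFrame can then be saved with df.write.parquet(...) to produce the parquet file.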
input.csv:
"name","123.1234","2022-01-01 10:10:00"

Handling corrupted data in Pyspark dataframe

I have data that I need to handle with a PySpark DataFrame even when it is corrupted. I tried using PERMISSIVE mode, but I still get an error. The same code reads the file without problems if account_id has a value.
The data I have, where account_id (an integer) has no value:
{
"Name:"
"account_id":,
"phone_number":1234567890,
"transactions":[
{
"Spent":1000,
},
{
"spent":1100,
}
]
}
The code I tried:
df=spark.read.option("mode","PERMISSIVE").json("path\complex.json",multiLine=True)
df.show()
The error and warning I get:
pyspark.sql.utils.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
How can I read corrupted data in Pyspark Dataframe?

spark read schema from separate file

I have my data in HDFS and its schema in MySQL. I'm able to fetch the schema into a DataFrame, and it looks like this:
col1,string
col2,date
col3,int
col4,string
How can I read this schema and assign it to the data while reading from HDFS?
I will be reading the schema from MySQL. It will be different for different datasets. I need a dynamic approach where, for any dataset, I can fetch the schema details from MySQL -> convert them into a schema -> and then apply it to the dataset.
You can use PySpark's internal helper function _parse_datatype_string:
from pyspark.sql.types import _parse_datatype_string
df = spark.createDataFrame([
["col1,string"],
["col3,int"],
["col3,int"]
], ["schema"])
str_schema = ",".join(map(lambda c: c["schema"].replace(",", ":") , df.collect()))
# col1:string,col3:int,col3:int
final_schema = _parse_datatype_string(str_schema)
# StructType(List(StructField(col1,StringType,true),StructField(col3,IntegerType,true),StructField(col3,IntegerType,true)))
_parse_datatype_string expects a DDL-formatted string, i.e. col1:string, col2:int, hence we first need to replace , with : and then join everything together separated by commas. The function returns an instance of StructType, which will be your final schema.
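One caveat with replacing every comma: a type such as decimal(10,2) contains a comma of its own and would be mangled. A small sketch of the string-building step in plain Python that splits on the first comma only; rows_to_schema_string is a hypothetical helper name, and the rows are assumed to arrive as "name,type" strings as in the question.

```python
def rows_to_schema_string(rows):
    """Turn 'name,type' strings into 'name:type,name:type,...'."""
    fields = []
    for row in rows:
        # Split on the first comma only, so 'amount,decimal(10,2)' keeps its type intact
        name, data_type = row.split(",", 1)
        fields.append(f"{name.strip()}:{data_type.strip()}")
    return ",".join(fields)


rows = ["col1,string", "col2,date", "col3,int", "col4,string"]
str_schema = rows_to_schema_string(rows)
print(str_schema)  # col1:string,col2:date,col3:int,col4:string
```

The resulting string can then be handed to _parse_datatype_string in the same way as str_schema above.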

Spark infer schema with limit during a read.csv

I'd like to infer a Spark.DataFrame schema from a directory of CSV files using a small subset of the rows (say limit(100)).
However, setting inferSchema to True means that the Input Size / Records for the FileScanRDD seems to always be equal to the number of rows in all the CSV files.
Is there a way to make the FileScan more selective, such that Spark looks at fewer rows when inferring a schema?
Note: setting the samplingRatio option to be < 1.0 does not have the desired behaviour, though it is clear that inferSchema uses only the sampled subset of rows.
You could read a subset of your input data into a Dataset of String.
The CSV method allows you to pass this as a parameter.
Here is a simple example (I'll leave reading the sample of rows from the input file to you):
val data = List("1,2,hello", "2,3,what's up?")
val csvRDD = sc.parallelize(data)
val df = spark.read.option("inferSchema","true").csv(csvRDD.toDS)
df.schema
When run in spark-shell, the final line from the above prints (I reformatted it for readability):
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_c0,IntegerType,true),
StructField(_c1,IntegerType,true),
StructField(_c2,StringType,true)
)
Which is the correct Schema for my limited input data set.
Assuming you are only interested in the schema, here is a possible approach based on cipri.l's post in this link
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.execution.datasources.csv.{CSVOptions, TextInputCSVDataSource}
def inferSchemaFromSample(sparkSession: SparkSession, fileLocation: String, sampleSize: Int, isFirstRowHeader: Boolean): StructType = {
  // Build a Dataset composed of the first sampleSize lines from the input files as plain text strings
  val dataSample: Array[String] = sparkSession.read.textFile(fileLocation).head(sampleSize)
  import sparkSession.implicits._
  val sampleDS: Dataset[String] = sparkSession.createDataset(dataSample)
  // Provide information about the CSV files' structure
  val firstLine = dataSample.head
  val extraOptions = Map("inferSchema" -> "true", "header" -> isFirstRowHeader.toString)
  val csvOptions: CSVOptions = new CSVOptions(extraOptions, sparkSession.sessionState.conf.sessionLocalTimeZone)
  // Infer the CSV schema based on the sample data
  val schema = TextInputCSVDataSource.inferFromDataset(sparkSession, sampleDS, Some(firstLine), csvOptions)
  schema
}
Unlike GMc's answer from above, this approach tries to directly infer the schema the same way the DataFrameReader.csv() does in the background (but without going through the effort of building an additional Dataset with that schema, that we would then only use to retrieve the schema back from it)
The schema is inferred based on a Dataset[String] containing only the first sampleSize lines from the input files as plain text strings.
When trying to retrieve samples from data, Spark has only 2 types of methods:
Methods that retrieve a given percentage of the data. This operation takes random samples from all partitions. It benefits from higher parallelism, but it must read all the input files.
Methods that retrieve a specific number of rows. This operation must collect the data on the driver, but it could read a single partition (if the required row count is low enough)
Since you mentioned you want to use a specific small number of rows and since you want to avoid touching all the data, I provided a solution based on option 2
PS: The DataFrameReader.textFile method accepts paths to files, folders and it also has a varargs variant, so you could pass in one or more files or folders.
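For PySpark users, the "fixed number of rows" idea can be sketched without touching Spark internals by collecting the first few lines on the driver. This is only an illustration under loud assumptions: head_lines is a hypothetical helper, and it reads local files with plain Python, so it does not apply to HDFS paths as written.

```python
import os
import tempfile
from itertools import islice


def head_lines(paths, sample_size):
    """Return at most sample_size lines, read in order from the given local files."""
    lines = []
    for path in paths:
        remaining = sample_size - len(lines)
        if remaining <= 0:
            break
        with open(path, encoding="utf-8") as handle:
            # islice stops reading as soon as enough lines have been collected
            lines.extend(line.rstrip("\n") for line in islice(handle, remaining))
    return lines


# Demo with a throwaway local file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-0.csv")
    with open(path, "w", encoding="utf-8") as out:
        out.write("1,2,hello\n2,3,what's up?\n3,4,bye\n")
    sample = head_lines([path], 2)
print(sample)  # ['1,2,hello', "2,3,what's up?"]
```

In PySpark the sampled lines could then be turned into an RDD with spark.sparkContext.parallelize(sample) and passed to spark.read.csv with inferSchema enabled, since DataFrameReader.csv also accepts an RDD of strings.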

Null type schema in spark-salesforce connector

I have a Dataset<Row> with 48 columns imported from Salesforce:
Dataset<Row> df = spark.read()
    .format("com.springml.spark.salesforce")
    .option("username", prop.getProperty("salesforce_user"))
    .option("password", prop.getProperty("salesforce_auth"))
    .option("login", prop.getProperty("salesforce_login_url"))
    .option("soql", "SELECT "+srcCols+" from "+tableNm)
    .option("version", prop.getProperty("salesforce_version"))
    .load();
The columns contain nulls as well.
I need to store this Dataset in a .txt file delimited by ^.
I tried to store it as a text file using:
finalDS.coalesce(1).write().option("delimiter", "^").toString().text(hdfsExportLoaction);
But I got an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<Columns....>to Tuple1, but failed as the number of fields does not line up.;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveDeserializer$$fail(Analyzer.scala:2320)
I tried:
finalDS.map(row -> row.mkString(), Encoders.STRING()).write().option("delimiter", "^").text(hdfsExportLoaction);
but the delimiters are vanishing and all the data is getting written concatenated.
I then tried to save as csv (just to make it work):
finalDS.coalesce(1).write().mode(SaveMode.Overwrite).option("header", "true").option("delimiter", "^").option("nullValue", "").csv(hdfsExportLoaction+"/"+tableNm);
and:
finalDS.na().fill("").coalesce(1).write().option("delimiter", "^").mode(SaveMode.Overwrite).csv(hdfsExportLoaction);
but then it complained that
Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Nothing is working.
When I try to write a text file, either the delimiter is removed or I get the error that only a single column can be written to a text file.
When I try to write CSV, I get the "CSV data source does not support null data type" exception.
I think you have a problem in the Dataset or DataFrame itself. For me,
df.coalesce(1).write.option("delimiter", "^").mode(SaveMode.Overwrite).csv("<path>")
this worked as expected. It is properly delimited with "^". I would suggest inspecting the data in your DataFrame or Dataset and the operations you are performing on it. Before writing the data, call df.count once and see whether it fails.
