Create a parquet file with custom schema - apache-spark

I have a requirement like this:
In Databricks, we are reading a CSV file. This file has multiple columns such as emp_name, emp_salary, joining_date, etc. When we read this file into a dataframe, all the columns come back as strings.
We have an API which gives us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, etc.
I have to create a parquet file with the schema that comes from the API.
How can we do this in Databricks using PySpark?

You can always pass in the schema when reading:
schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
df = spark.read.csv('input.csv', schema=schema)
df.printSchema()
df.show()
The only thing to be careful about is that some type strings from the API cannot be used directly, e.g., "string(50)" needs to be converted to "string".
input.csv:
"name","123.1234","2022-01-01 10:10:00"

Related

How to iterate in Databricks to read hundreds of files stored in different subdirectories in a Data Lake?

I have to read hundreds of avro files in Databricks from an Azure Data Lake Gen2, extract data from the Body field inside every file, and concatenate all the extracted data in a unique dataframe. The point is that all avro files to read are stored in different subdirectories in the lake, following the pattern:
root/YYYY/MM/DD/HH/mm/ss.avro
This forces me to loop the ingestion and selection of data. I'm using this Python code, in which list_avro_files is the list of paths to all files:
from functools import reduce
from pyspark.sql import DataFrame

list_data = []
for file_avro in list_avro_files:
    df = spark.read.format('avro').load(file_avro)
    data1 = spark.read.json(df.select(df.Body.cast('string')).rdd.map(lambda x: x[0]))
    list_data.append(data1)
data = reduce(DataFrame.unionAll, list_data)
Is there any way to do this more efficiently? How can I parallelize/speed up this process?
As long as your list_avro_files can be expressed through standard wildcard syntax, you can probably use Spark's own ability to parallelize the read operation. All you'd need is to specify a basePath and a filename pattern for your avro files:
scala> var df = spark.read
  .option("basePath", "/user/hive/warehouse/root")
  .format("avro")
  .load("/user/hive/warehouse/root/*/*/*/*/*/*.avro")
And, in case you find that you need to know exactly which file any given row came from, use the input_file_name() built-in function to enrich your dataframe:
scala> df = df.withColumn("source",input_file_name())
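Since the question is in PySpark, here is a hedged Python equivalent of the same idea; the glob depth is derived from the root/YYYY/MM/DD/HH/mm/ss.avro layout in the question (five directory levels below the root), and the Spark calls themselves are sketched in comments:

```python
def avro_glob(base: str, depth: int) -> str:
    """Build a wildcard path matching .avro files `depth` directory
    levels below `base` (YYYY/MM/DD/HH/mm is five levels)."""
    return base.rstrip("/") + "/*" * depth + "/*.avro"

pattern = avro_glob("/user/hive/warehouse/root", 5)
print(pattern)  # /user/hive/warehouse/root/*/*/*/*/*/*.avro

# On a cluster, one load call replaces the per-file loop:
# from pyspark.sql.functions import input_file_name
# df = (spark.read
#       .option("basePath", "/user/hive/warehouse/root")
#       .format("avro")
#       .load(pattern)
#       .withColumn("source", input_file_name()))
```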

SPARK Parquet file loading with different schema

We have parquet files generated with two different schemas where we have ID and Amount fields.
File:
file1.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,6)
Content:
1,19500.00
2,198.34
file2.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,2)
Content:
1,19500.00
3,198.34
When I load both files together with df3 = spark.read.parquet("output/") and try to read the data, Spark applies the Decimal(15,6) schema to the file whose amounts are Decimal(15,2), so that file's data is decoded incorrectly. Is there a way to retrieve the data properly in this case?
Final output I could see after executing df3.show()
+---+------------+
| ID|      AMOUNT|
+---+------------+
|  1|    1.950000|
|  3|    0.019834|
|  1|19500.000000|
|  2|  198.340000|
+---+------------+
Here the amounts in the first and second rows were read incorrectly.
Looking for some suggestions on this. I know this issue goes away if we regenerate the files with the same schema, but that requires regenerating and replacing files that were already delivered. Is there a temporary workaround we can use in the meantime while we work on regenerating those files?
~R,
Krish
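The numbers above are consistent with how parquet encodes decimals: the bytes hold an unscaled integer and the scale comes only from the schema, so DECIMAL(15,2) data decoded with a DECIMAL(15,6) schema comes out shifted four decimal places. A stdlib sketch of that mix-up:

```python
from decimal import Decimal

def reinterpret(value: str, stored_scale: int, read_scale: int) -> Decimal:
    """Store `value` as an unscaled integer at `stored_scale`, then decode
    it assuming `read_scale`, as a reader with a mismatched schema would."""
    unscaled = int(Decimal(value).scaleb(stored_scale))
    return Decimal(unscaled).scaleb(-read_scale)

# file2 was written as DECIMAL(15,2) but read with file1's DECIMAL(15,6):
print(reinterpret("19500.00", 2, 6))  # 1.950000
print(reinterpret("198.34", 2, 6))    # 0.019834
```

These match the two corrupted rows in the df3.show() output exactly.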
You can try setting the mergeSchema option to true.
So instead of
df3 = spark.read.parquet("output/")
Try this:
df3 = spark.read.option("mergeSchema","true").parquet("output/")
But this can give inconsistent records if the two parquet files were written by different Spark versions; in that case the newer Spark version should set the property below to true.
spark.sql.parquet.writeLegacyFormat
Try reading this as a string and provide the schema manually while reading the file:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("flag_piece", StringType(), True)
])
spark.read.format("parquet").schema(schema).load(path)
The following worked for me:
df = spark.read.parquet("data_file/")
for col in df.columns:
    df = df.withColumn(col, df[col].cast("string"))
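Another interim workaround, assuming you can enumerate each file's decimal type: read the files individually (so each one is decoded with its own footer schema), cast AMOUNT to a common decimal type wide enough for all of them, then union. Picking that common type is plain arithmetic:

```python
def widest_decimal(types):
    """Smallest (precision, scale) that holds every (p, s) pair losslessly:
    keep the largest scale and the largest integer-digit count."""
    scale = max(s for _, s in types)
    int_digits = max(p - s for p, s in types)
    return (int_digits + scale, scale)

print(widest_decimal([(15, 6), (15, 2)]))  # (19, 6)

# Cluster-side sketch (file names taken from the question):
# from functools import reduce
# from pyspark.sql import DataFrame, functions as F
# p, s = widest_decimal([(15, 6), (15, 2)])
# dfs = [spark.read.parquet(f)
#            .withColumn("AMOUNT", F.col("AMOUNT").cast(f"decimal({p},{s})"))
#        for f in ["output/file1.snappy.parquet", "output/file2.snappy.parquet"]]
# df3 = reduce(DataFrame.unionByName, dfs)
```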

How to query data from a PySpark SQL context if a key is not present in a JSON file; how to catch the SQL analysis exception

I am using PySpark to transform JSON into a dataframe, and I can transform it successfully. The problem I am facing is that there is a key which is present in some JSON files and absent in others. When I flatten the JSON with the PySpark SQL context and the key is not present in a file, creating the dataframe fails with an SQL AnalysisException.
For example, my sample JSON:
{
    "_id" : ObjectId("5eba227a0bce34b401e7899a"),
    "origin" : "inbound",
    "converse" : "72412952",
    "Start" : "2020-04-20T06:12:20.89Z",
    "End" : "2020-04-20T06:12:53.919Z",
    "ConversationMos" : 4.88228940963745,
    "ConversationRFactor" : 92.4383773803711,
    "participantId" : "bbe4de4c-7b3e-49f1-8"
}
The participantId key above will be present in some JSON files and not in others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is thrown. How can I handle this so that the column contains null (or handle it some other way) when the key is missing?
You can simply check whether the column is present; if it is not, add it with empty values.
The code for that goes like this:
from pyspark.sql import functions as f

fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)

if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))

fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
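The same check generalizes to several optional keys. The column diff itself is plain Python and can be kept separate from the Spark calls (which are sketched in the comments):

```python
def missing_columns(expected, actual):
    """Return the expected columns absent from `actual`, preserving order."""
    present = set(actual)
    return [c for c in expected if c not in present]

print(missing_columns(["origin", "converse", "participantId"],
                      ["origin", "converse"]))  # ['participantId']

# Cluster-side, fill each gap with a null literal before querying:
# from pyspark.sql import functions as f
# expected = ["origin", "converse", "Start", "End", "participantId"]
# for c in missing_columns(expected, fetchFile.columns):
#     fetchFile = fetchFile.withColumn(c, f.lit(None).cast("string"))
```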
I think you're calling Spark to read one file at a time, inferring the schema at the same time.
The SQL Analysis exception is Spark telling you that the file, and therefore the inferred schema, doesn't have the key you're looking for. What you have to do is build your full schema and apply it to all of the files you want to process, ideally processing them all at once.
There are three strategies:
Infer your schema from lots of files. You should get the aggregate of all of the keys. Spark will run two passes over the data.
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
Create a schema object.
I find this tedious to do, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy-paste it back into your code for option #2.
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')
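For option #2, note that spark.read.schema() also accepts a DDL string (Spark 2.3+), which is less tedious than hand-building a StructType. A sketch using the field names from the sample JSON (the types are assumptions); missing keys like participantId then simply come back as null:

```python
# Column name -> Spark SQL type, covering every key the queries need.
fields = {
    "origin": "string",
    "converse": "string",
    "Start": "timestamp",
    "End": "timestamp",
    "participantId": "string",
}
ddl = ", ".join(f"{name} {typ}" for name, typ in fields.items())
print(ddl)
# origin string, converse string, Start timestamp, End timestamp, participantId string

# df = spark.read.schema(ddl).json('/path/to/your/directory/full/of/json/files')
```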

Spark reading parquet compressed data

I have nested JSON converted to Parquet (snappy) without any flattening. The structure, for example, has the following:
{"a":{"b":{"c":"abcd","d":[1,2,3]},"e":["asdf","pqrs"]}}
df = spark.read.parquet('<File on AWS S3>')
df.createOrReplaceTempView("test")
query = """select a.b.c from test"""
df = spark.sql(query)
df.show()
When the query is executed, does Spark read only the lowest-level attribute column referenced in the query (a.b.c), or does it read the entire top-level attribute (a) that contains it?
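One way to answer this empirically is df.explain(): the FileScan node's ReadSchema entry lists exactly what the parquet reader will load, and with nested schema pruning (spark.sql.optimizer.nestedSchemaPruning.enabled, on by default since Spark 3.0) only the referenced leaf should appear there. A sketch of pulling that line out of a plan dump (the sample plan text below is hand-written for illustration, not captured output):

```python
def read_schema_lines(plan: str):
    """Pull the ReadSchema entries out of a physical-plan dump."""
    return [ln.strip() for ln in plan.splitlines() if "ReadSchema" in ln]

sample_plan = """\
*(1) Project [a#0.b.c AS c#5]
+- FileScan parquet [a#0] ... ReadSchema: struct<a:struct<b:struct<c:string>>>"""

print(read_schema_lines(sample_plan))

# Against the real query, run df.explain() (or spark.sql(query).explain())
# and inspect the ReadSchema the same way.
```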

Call inferSchema directly after the load is done with spark-csv

Is there a way that I can directly call inferSchema after load is done?
Ex:
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "false")
  .load(location)
df.schema
I want to call something like this:
val newdf = df.inferSchema()
newdf.printSchema()
Regards
It's not possible unless you define a new schema and apply it to a new DataFrame on creation.
You can also read the schema using the csv source and store it for later use, but this will scan the data either way.
As written, you haven't inferred a schema; spark-csv treats every column as a string.
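To be clear, there is no df.inferSchema() method; the usual route is simply re-reading with option("inferSchema", "true"), which scans the data again. As a toy illustration of what per-column inference does (this is just the idea, not Spark's actual algorithm):

```python
def infer_type(samples):
    """Toy type inference over string samples: 'int' if every sample
    parses as an int, else 'double' if every sample parses as a float,
    else 'string'."""
    def all_parse(cast):
        try:
            for s in samples:
                cast(s)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    return "string"

print(infer_type(["1", "2", "3"]))  # int
print(infer_type(["1.5", "2"]))     # double
print(infer_type(["abc", "1"]))     # string

# The spark-csv equivalent is a second read:
# spark.read.option("header", "true").option("inferSchema", "true").csv(location)
```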
