Spark reading parquet compressed data - apache-spark

I have nested JSON converted to Parquet (snappy) without any flattening. The structure, for example, has the following:
{"a":{"b":{"c":"abcd","d":[1,2,3]},"e":["asdf","pqrs"]}}
df = spark.read.parquet('<File on AWS S3>')
df.createOrReplaceTempView("test")
query = """select a.b.c from test"""
df = spark.sql(query)
df.show()
When the query is executed, does Spark read only the lowest-level attribute column referenced in the query, or does it read the whole top-level attribute that contains the referenced attribute in its hierarchy?
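One way to check is to look at the ReadSchema of the FileScan node in the physical plan; a minimal sketch, assuming Spark 2.4+/3.x where nested schema pruning is available (the config name below is the one those versions use):

# Nested schema pruning is on by default in Spark 3.x; setting it explicitly is just to be sure.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

df = spark.read.parquet('<File on AWS S3>')
df.createOrReplaceTempView("test")
# The FileScan node's ReadSchema shows which columns/struct fields Parquet will actually read.
spark.sql("select a.b.c from test").explain()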

Create a parquet file with custom schema

I have a requirement like this:
In Databricks, we are reading a CSV file. This file has multiple columns like emp_name, emp_salary, joining_date, etc. When we read this file into a dataframe, we get all the columns as string.
We have an API which will give us the schema of the columns: emp_name is string(50), emp_salary is decimal(7,4), joining_date is timestamp, etc.
I have to create a parquet file with the schema that is coming from the API.
How can we do this in Databricks using PySpark?
You can always pass in the schema when reading:
schema = 'emp_name string, emp_salary decimal(7,4), joining_date timestamp'
df = spark.read.csv('input.csv', schema=schema)
df.printSchema()
df.show()
The only thing to be careful about is that some type strings from the API cannot be used directly, e.g., "string(50)" needs to be converted to "string".
input.csv:
"name","123.1234","2022-01-01 10:10:00"

Setting datatypes when writing parquet files with spark

I’m reading some data from an Oracle DB and storing it in parquet files with spark-3.0.1/hadoop3.2 through the Python API.
The process works, but the datatypes aren’t maintained, and the resulting types in the parquet file are all “object” except for date fields.
This is how the datatypes look after creating the dataframe:
conn_properties = {"driver": "oracle.jdbc.driver.OracleDriver",
                   "url": cfg.ORACLE_URL,
                   "query": "select ..... from ...",
                   }
df = spark.read.format("jdbc") \
    .options(**conn_properties) \
    .load()
df.dtypes:
[('idest', 'string'),
('date', 'timestamp'),
('temp', 'decimal(18,3)'),
('hum', 'decimal(18,3)'),
('prec', 'decimal(18,3)'),
('wspeed', 'decimal(18,3)'),
('wdir', 'decimal(18,3)'),
('radiation', 'decimal(18,3)')]
They all match the original database schema.
But opening the parquet file with pandas I get:
df.dtypes:
idest object
date datetime64[ns]
temp object
hum object
prec object
wspeed object
wdir object
radiation object
dtype: object
I tried to change the datatype for one column using the customSchema option; the doc says Spark SQL types can be used:
Users can specify the corresponding data types of Spark SQL instead of using the defaults.
And FloatType is included in Spark SQL datatypes.
So I expect something like this should work:
custom_schema = "radiation FloatType"
conn_properties = {"driver": "oracle.jdbc.driver.OracleDriver",
                   "url": cfg.ORACLE_URL,
                   "query": "select ... from ....",
                   "customSchema": custom_schema
                   }
But I get this error:
pyspark.sql.utils.ParseException:
DataType floattype is not supported.(line 1, pos 10)
== SQL ==
radiation FloatType
I can’t find an option in the corresponding “write” method to specify the schema/types for the parquet DataFrameWriter.
Any idea how I can set the datatype mapping?
Thank you
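As a side note, the error message suggests that customSchema expects Spark SQL DDL type names (FLOAT, DOUBLE, DECIMAL(18,3), ...) rather than class names like FloatType; a minimal sketch under that assumption:

# Assumption: the JDBC customSchema option takes DDL-style type names, not the FloatType class name.
custom_schema = "radiation FLOAT"

conn_properties = {"driver": "oracle.jdbc.driver.OracleDriver",
                   "url": cfg.ORACLE_URL,
                   "query": "select ... from ....",
                   "customSchema": custom_schema
                   }

df = spark.read.format("jdbc").options(**conn_properties).load()
# The Parquet writer keeps whatever types the DataFrame already has.
df.write.parquet("output.parquet")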

How to query data from PySpark SQL context if a key is not present in the JSON file, and how to catch the SQL analysis exception

I am using PySpark to transform JSON into a DataFrame, and I am able to transform it successfully. But the problem I am facing is that there is a key which will be present in some JSON files and not in others. When I flatten the JSON with the PySpark SQL context and the key is not present in a JSON file, it gives an error when creating my PySpark data frame, throwing an SQL Analysis Exception.
For example, my sample JSON:
{
    "_id" : ObjectId("5eba227a0bce34b401e7899a"),
    "origin" : "inbound",
    "converse" : "72412952",
    "Start" : "2020-04-20T06:12:20.89Z",
    "End" : "2020-04-20T06:12:53.919Z",
    "ConversationMos" : 4.88228940963745,
    "ConversationRFactor" : 92.4383773803711,
    "participantId" : "bbe4de4c-7b3e-49f1-8"
}
The participantId above will be present in some JSON files and not in others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is raised. How can I handle that kind of exception, so that the column contains null when the key is not present, or is there any other way to handle it?
You can simply check whether the column is there; if it is not, add it with empty values.
The code for the same goes like this:
from pyspark.sql import functions as f

fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)

if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))

fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
I think you're calling Spark to read one file at a time and inferring the schema at the same time.
What Spark is telling you with the SQL Analysis exception is that your file and your inferred schema don't have the key you're looking for. What you have to do is get a good schema and apply it to all of the files you want to process, ideally processing all of your files at once.
There are three strategies:
Infer your schema from lots of files. You should get the aggregate of all of the keys. Spark will run two passes over the data.
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
Create a schema object.
I find this tedious to do, but it will speed up your code (see the sketch after option #3 below). Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy-paste it back into your code for option #2.
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')
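For option #2, a minimal sketch of a hand-built schema using a few field names from the sample document above (treat the exact fields and types as illustrative):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative subset of the sample document's fields; all nullable, so missing keys become null.
schema = StructType([
    StructField("origin", StringType(), True),
    StructField("converse", StringType(), True),
    StructField("Start", StringType(), True),
    StructField("End", StringType(), True),
    StructField("ConversationMos", DoubleType(), True),
    StructField("ConversationRFactor", DoubleType(), True),
    StructField("participantId", StringType(), True),
])

df = spark.read.schema(schema).json('/path/to/your/directory/full/of/json/files')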

How to insert a dataframe having a map column into a Hive table

I have a dataframe with multiple columns, one of which is of type map(string,string). I'm able to print this dataframe with the map column, which shows data such as Map("PUN" -> "Pune"). I want to write this dataframe to a Hive table (stored as Avro) which has the same column with type map.
val dfWithMap = Df.withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
dfWithMap.show(false)
// table - created an external Hive table, stored as Avro, with an Avro schema
After removing this map type column I'm able to save the dataframe to the Hive Avro table.
The way I save to the Hive table:
spark.save - saving the Avro file
spark.sql - creating the partition on the Hive table with the Avro file location
See this test case from the Spark tests as an example:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
  .write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable
Example:
Df\
  .write\
  .saveAsTable(name='tableName',
               format='com.databricks.spark.avro',
               mode='append',
               path='avroFileLocation')
Change the mode option to whatever suits you.
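Combining the two suggestions (partitionBy plus saveAsTable), a minimal PySpark sketch; the partition column, table name, and path below are hypothetical:

# Hypothetical names; partitionBy controls the Hive-style partition layout on write.
Df\
  .write\
  .partitionBy('load_date')\
  .saveAsTable(name='tableName',
               format='com.databricks.spark.avro',
               mode='append',
               path='avroFileLocation')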

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temptable like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like the min_id and max_id of a column id.
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame but a str (string):
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI, I am using Spark 1.6 and don't have access to the Databricks spark-csv package.
Here you go, you'll just need to concat your data with concat_ws and write it as text:
query = """select concat_ws(',', date, nvl(min(id), 0), nvl(max(id), 0))
from mytempTable"""
sqlContext.sql(query).write("text").mode("append").save("/tmp/fooo")
Or even a better alternative:
from pyspark.sql import functions as f

(sqlContext
    .table("mytempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))
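If you really only need to append the already-formatted string, one hedged alternative (sticking to Spark 1.6 APIs) is to wrap it in a one-row DataFrame first; note that "append" adds new part files under the output directory rather than appending to a single HDFS file:

# Wrap the formatted string in a single-row, single-column DataFrame and append it as text.
row_df = sqlContext.createDataFrame([(test,)], ['value'])
row_df.coalesce(1).write.format('text').mode('append').save('/tmp/fooo')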
