Spark SQL queries on Partitioned Data - apache-spark

I have set up a Spark 1.3.1 application that collects event data. One of the attributes is a timestamp called 'occurredAt'. I intend to partition the event data into parquet files on a filestore, and since the documentation (https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#partition-discovery) indicates that only string and int partition values are supported, not time-based values, I've split the date into Year, Month and Day values and partitioned as follows:
events
|---occurredAtYear=2015
| |---occurredAtMonth=07
| | |---occurredAtDay=16
| | | |---<parquet-files>
...
I then load the parquet file from the root path /events
sqlContext.parquetFile('/var/tmp/events')
Documentation says:
'Spark SQL will automatically extract the partitioning information
from the paths'
However my query
SELECT * FROM events where occurredAtYear=2015
fails, saying Spark cannot resolve 'occurredAtYear'.
I can see the schema for all the other attributes of the event and can query on those, but printSchema does not list occurredAtYear/Month/Day in the schema at all. What am I missing to get partitioning working properly?
Cheers

So it turns out I was following the instructions too literally: I was actually writing the parquet files out to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/data.parquet
The 'data.parquet' path was creating a further directory with parquet part-files underneath it; I should have been saving the parquet files to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16
All works now with the schema being discovered correctly.
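For anyone hitting the same thing, here is a minimal sketch of the working flow (Spark 1.3-era API; the dataframe df and the paths are illustrative):
# Sketch only: write the parquet part-files directly into the partition
# directory (saveAsParquetFile creates the part-files under the given path),
# rather than adding an extra 'data.parquet' directory level.
df.saveAsParquetFile('/var/tmp/events/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16')
# Load from the root so partition discovery kicks in.
events = sqlContext.parquetFile('/var/tmp/events')
events.printSchema()  # now includes occurredAtYear/Month/Day
events.registerTempTable('events')
sqlContext.sql("SELECT * FROM events WHERE occurredAtYear = 2015").show()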

Related

spark streaming and delta tables: java.lang.UnsupportedOperationException: Detected a data update

The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by spark structured streaming.
Updates on target delta tables are done via foreachBatch using merge.
As a result I'm getting this error:
java.lang.UnsupportedOperationException: Detected a data update (for
example
partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet)
in the source table at version 2217. This is currently not supported.
If you'd like to ignore updates, set the option 'ignoreChanges' to
'true'. If you would like the data update to be reflected, please
restart this query with a fresh checkpoint directory.
Basically I'm not able to read the agg1 delta table via any kind of streaming. If I switch the last stream's sink from delta to memory I get the same error message. The first stream works without any problems.
Notes:
Between the aggregations I'm changing the granularity: the agg1 delta table truncates the date to minutes, the agg2 delta table truncates it to days.
If I turn off all the other streams, the last one still doesn't work.
The agg2 delta table is a brand-new table with no data.
How the streaming works on the source table:
It reads the files that belong to the source table. It cannot handle changes to those files (updates, deletes); if anything like that happens you will get the error above. In other words, update and delete operations rewrite the underlying files. The only exception is INSERTs: new data arrives in new files unless configured otherwise.
To fix that you would need to set the option ignoreChanges to true.
With this option you will get all the records from a modified file, i.e. you will receive the same records as before again, plus the modified one.
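For reference, the option goes on the streaming read of the source delta table, roughly like this (the path is illustrative):
# Hedged sketch: stream from a delta table while ignoring rewritten files.
# Records from rewritten files will still be delivered again.
df = (spark.readStream
      .format("delta")
      .option("ignoreChanges", "true")
      .load("/mnt/delta/agg1"))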
The problem: we have aggregations, and the aggregated values are stored in the checkpoint. If we receive the same (unmodified) record again, it is aggregated again and the value for its grouping key is inflated.
Solution: we can't stream from the agg table to build further aggregations; we need to read the raw table instead, as sketched below.
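A rough sketch of that layout (table paths, column names and windows are made up): both aggregation levels stream from the raw, append-only table instead of chaining agg1 into agg2.
from pyspark.sql import functions as F
# Both aggregations read the raw table, which only ever receives appends,
# so the streaming source never sees a data update.
raw = spark.readStream.format("delta").load("/mnt/delta/raw_events")
# minute-level aggregation -> agg1 (written out via foreachBatch/merge as in the question)
agg1 = raw.groupBy(F.window("eventTime", "1 minute"), "groupKey").count()
# day-level aggregation -> agg2, also computed from raw, not from agg1
agg2 = raw.groupBy(F.window("eventTime", "1 day"), "groupKey").count()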
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Note: I'm working on Databricks Runtime 10.4, so I'm using new shuffle merge by default.

Spark parquet schema evolution

I have a partitioned HDFS parquet location which has a different schema in different partitions.
Say 5 columns in the first partition and 4 columns in the 2nd partition. Now I try to read the base parquet path and then filter on the 2nd partition.
This gives me 5 columns in the DataFrame even though I have only 4 columns in the parquet files of the 2nd partition.
When I read the 2nd partition directly, it gives the correct 4 columns. How can I fix this?
You can specify the required schema (4 columns) while reading the parquet file!
Then Spark only reads the fields included in the schema; if a field does not exist in the data, null is returned.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val sch=new StructType().add("i",IntegerType).add("z",StringType)
spark.read.schema(sch).parquet("<parquet_file_path>").show()
//here the data has field i but not field z
//+---+----+
//| i| z|
//+---+----+
//| 1|null|
//+---+----+
I would really like to help you but I am not sure what you actually want to achieve. What is your intention here?
If you want to read the parquet file with all its partitions and you just want the columns both partitions have, maybe the read option mergeSchema fits your need.
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a
necessity in most cases, we turned it off by default starting from
1.5.0. You may enable it by setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
Refer to the Spark documentation.
So it would be interesting to know which version of Spark you are using and how the properties spark.sql.parquet.mergeSchema (Spark setting) and mergeSchema (read option) are set.
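For example (the path is a placeholder), the per-read option and the global setting look roughly like this:
# Hedged sketch: enable schema merging for a single read
df = spark.read.option("mergeSchema", "true").parquet("<parquet_base_path>")
# or enable it globally for the session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")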

Does Spark saveastable infer schema from dataframe

I am using Spark 2.0. The requirement is to create a new table from values selected into a dataframe. While writing out the df as
df.write.saveAsTable(hive_table_name, format='parquet', mode='overwrite')
Error is:
client cannot authenticate via:[TOKEN, KERBEROS]
Host Details : local host is: "some-ip" and destination host is:"some-other-ip"
Also, if the table does not exist in Hive, will df.write.saveAsTable create a new table in Hive and automatically infer the schema?
I am not sure about the error you are receiving, but yes, Spark will infer the schema automatically if you try to create a table that doesn't exist.
Hope that helps!
Subhash
OK, some learnings over the past weeks:
saveAsTable saves the table data to the HDFS file system. Without a schema explicitly created in Hive to consume the parquet files, the schema Spark inferred while creating the dataframe is not used by Hive to reflect the existing columns of the table.
Schema inference works only for JSON and CSV, not for .dat files or compressed text files. Those files have to be parsed with delimiters, the dataframe columns have to be renamed using the first row as column titles, and the result then saved to disk.
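For what it's worth, a minimal sketch of such a write (Spark 2.x; the table and column names are illustrative). Spark takes the schema from the dataframe itself and registers the table in the metastore if it doesn't exist:
# Hedged sketch: create a Hive table from a dataframe; the table's schema
# comes from the dataframe, no prior CREATE TABLE needed.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
df.write.mode('overwrite').format('parquet').saveAsTable('my_table')
spark.sql('DESCRIBE my_table').show()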

Apache Drill cannot read partitioned parquet files

I have created a parquet file structure on Azure Blob Storage with Apache Spark on HDInsight.
This is the structure:
/root
/sitename=www.site1.com
/datekey=20160101
log-01-file.parquet
/sitename=www.site2.com
/datekey=29160192
We want to use Apache Drill to run queries against this parquet structure, but we found a few issues.
When running this query
SELECT datekey FROM azure.root.`./root` WHERE sitename='www.mysite.com' GROUP BY datekey
We get this error
"org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: NumberFormatException: www.trovaprezzi.it Fragment 2:2"
What could be the cause of the error?
Also, when running queries without a WHERE clause it seems that the partition keys are seen as null values.
SELECT sitename, COUNT(*) as N FROM azure.root.`./root` GROUP BY sitename
|sitename|N|
|NULL|100000|
Has anyone experienced this issue?
Any help will be really appreciated.
Thanks
Rob
HDInsight does not support Drill today. Hive (on Tez) should also be able to leverage the Parquet format; maybe you can try that instead?
At the moment of writing this post, Drill 1.6 seems to work this way:
Whatever partition scheme you use, Drill will expose your partition directory structure as columns named dir0, dir1, etc.
For instance, if we partition our data by hostname and date we obtain:
|dir0|dir1|...
|host1|20160101|....
|host2|20160101|....

External Table not getting updated from parquet files written by spark streaming

I am using Spark Streaming to write the aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like:
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:////"
);
I was under the impression that, in the case of an external table, queries should also fetch the data from newly added parquet files. But it seems the newly written files are not being picked up.
Dropping and recreating the table every time works, but that is not a solution.
Please suggest how my table can pick up the data from the newer files as well.
Are you reading those tables with Spark?
If so, Spark caches parquet table metadata (since schema discovery can be expensive).
To overcome this, you have 2 options:
Set the config spark.sql.parquet.cacheMetadata to false
Refresh the table before the query: sqlContext.refreshTable("my_table")
See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
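A quick sketch of both options (the table name is taken from the question; on Spark 2.x the equivalent refresh call is spark.catalog.refreshTable):
# Option 1: disable the parquet metadata cache
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")
# Option 2: refresh the cached metadata before querying
sqlContext.refreshTable("rolluptable")
sqlContext.sql("SELECT COUNT(*) FROM rolluptable").show()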
