Apache Drill cannot read partitioned parquet files - apache-spark

I have created a parquet file structure on Azure Blob Storage with Apache Spark on HDInsight.
This is the structure:
/root
  /sitename=www.site1.com
    /datekey=20160101
      log-01-file.parquet
  /sitename=www.site2.com
    /datekey=29160192
We want to use Apache Drill to run queries against this parquet structure, but we have found a few issues.
When running this query
SELECT datekey FROM azure.root.`./root` WHERE sitename='www.mysite.com' GROUP BY datekey
We get this error
"org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: NumberFormatException: www.trovaprezzi.it Fragment 2:2"
What could be the cause of the error?
Also, when running queries without a WHERE clause, the partition keys seem to be read as null values.
SELECT sitename, COUNT(*) as N FROM azure.root.`./root` GROUP BY sitename
|sitename|N|
|NULL|100000|
Has anyone else experienced this issue?
Any help would be really appreciated.
Thanks
Rob

HDInsight does not support Drill today. Hive (on Tez) should also be able to leverage the Parquet format; maybe you can try that instead?
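For instance, a minimal Hive sketch over the layout in the question (the column name, storage account, and container are hypothetical; the partition columns mirror the directory names):

CREATE EXTERNAL TABLE sitelogs (
  logline STRING   -- hypothetical column; use the actual parquet schema here
)
PARTITIONED BY (sitename STRING, datekey INT)
STORED AS PARQUET
LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/root';

-- register the existing sitename=/datekey= directories as partitions
MSCK REPAIR TABLE sitelogs;

SELECT datekey, COUNT(*) FROM sitelogs WHERE sitename = 'www.site1.com' GROUP BY datekey;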

At the time of writing, Drill 1.6 seems to work in the following way.
Whatever partition scheme you use, Drill will expose the partition directory levels as columns named dir0, dir1, and so on.
For instance, if we partition our data by hostname and date, we get:
|dir0|dir1|...|
|host1|20160101|...|
|host2|20160101|...|
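As a hedged sketch against the layout in the question (assuming dir0 holds the literal directory name, including the sitename= prefix), the original query can be rewritten in terms of dir0 and dir1:

SELECT dir1 AS datekey, COUNT(*) AS n
FROM azure.root.`./root`
WHERE dir0 = 'sitename=www.site1.com'
GROUP BY dir1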

Related

Most optimal method to check length of a parquet table in dbfs with pyspark?

I have a table on dbfs I can read with pyspark, but I only need to know the length of it (nrows). I know I could just read the file and do a table.count() to get it, but that would take some time.
Is there a better way to solve this?
I am afraid not.
Since you are using dbfs, I suppose you are using Delta format with Databricks. So, theoretically, you could check the metastore, but:
The metastore is not the source of truth about the latest information
of a Delta table
https://docs.delta.io/latest/delta-batch.html#control-data-location
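If file-level metadata is enough, one hedged option on Databricks (assuming the Delta table is registered in the metastore under a hypothetical name my_table) is DESCRIBE DETAIL, which reports things like numFiles and sizeInBytes but, notably, no exact row count:

-- "my_table" is a hypothetical table name
DESCRIBE DETAIL my_table;
-- returns file-level metadata (numFiles, sizeInBytes, ...) but no row count,
-- so a full count() is still needed to get the number of rows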

Is there a way to read a parquet file with apache flink?

I'm new to Apache Flink and I cannot find a way to read a parquet file from the file system.
I came from Spark where a simple "spark.read.parquet("...")" did the job.
Is it possible?
Thank you in advance
Actually, it depends on how you are going to read the parquet files.
If you are simply trying to read parquet files and want to leverage a DataStream connector, this Stack Overflow question can be a good entry point with a working example.
If you prefer the Table API, Table & SQL Connectors - Parquet Format can be a helpful starting point.
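For the Table/SQL route, a minimal sketch (table name, schema, and path are hypothetical, and the flink-parquet format dependency is assumed to be on the classpath):

CREATE TABLE my_parquet_source (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/parquet-data',
  'format' = 'parquet'
);

SELECT * FROM my_parquet_source;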

HDFS memory not deleting when table dropped HIVE

Hi, I am relatively new to Hive and HDFS, so apologies in advance if I am not wording this correctly.
I have used Microsoft Azure to create a virtual machine. I am then logging into this using PuTTY and the Ambari Sandbox.
In Ambari I am using Hive; all is working fine, but I am having major issues with memory allocation.
When I drop a table in Hive, I then go into my 'Hive View' and delete the table from the trash folder. However, this is not freeing up any memory within HDFS.
The table is now gone from my Hive database and also from the trash folder, but no memory has been freed.
Is there somewhere else where I should be deleting the table from?
Thanks in advance.
Based on your description, as @DuduMarkovitz said, I am also not sure what "HDFS memory" means here, but I think what you are referring to is the table's data files on HDFS.
In my experience, the table you dropped in Hive is probably an external table, not an internal (managed) table. The official Hive documentation describes external tables as follows:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
For the difference between internal and external tables, you can refer to here.
So if you want to reclaim the external table's data on HDFS after dropping the table, you need to use the command below to remove it manually:
hadoop fs -rm -f -r <your-hdfs-path-url>/apps/hive/warehouse/<database name>/<table-name>
Hope it helps.
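For reference, a minimal sketch of the external-table behavior described above (database, table, column, and path names are hypothetical):

-- the LOCATION points at data that already exists on HDFS
CREATE EXTERNAL TABLE mydb.mytable (
  col1 STRING
)
LOCATION '/data/mytable';

-- dropping an EXTERNAL table removes only the metadata;
-- the files under /data/mytable stay on HDFS until removed manually
DROP TABLE mydb.mytable;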
Try the DESCRIBE FORMATTED <table_name> command. It should show you the table's location in HDFS. Check whether this location is empty.

External Table not getting updated from parquet files written by spark streaming

I am using Spark Streaming to write the aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like:
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:////"
);
I was under the impression that, for an external table, queries should also pick up data from newly added parquet files. However, it seems the newly written files are not being picked up.
Dropping and recreating the table every time works, but that is not a solution.
Please suggest how my table can also include the data from the newer files.
Are you reading those tables with spark?
If so, Spark caches parquet table metadata (since schema discovery can be expensive).
To overcome this, you have 2 options:
Set the config spark.sql.parquet.cacheMetadata to false
Refresh the table before the query: sqlContext.refreshTable("my_table")
See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
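In Spark SQL terms, the two options look roughly like this (a sketch; the table name follows the question, and the REFRESH TABLE statement is the SQL counterpart of sqlContext.refreshTable in newer Spark versions):

-- option 1: disable the parquet metadata cache
SET spark.sql.parquet.cacheMetadata=false;

-- option 2: refresh the cached metadata before querying
REFRESH TABLE rolluptable;
SELECT COUNT(*) FROM rolluptable;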

Spark SQL queries on Partitioned Data

I have set up a Spark 1.3.1 application that collects event data. One of the attributes is a timestamp called 'occurredAt'. I'm intending to partition the event data into parquet files on a filestore. The documentation (https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#partition-discovery) indicates that time-based values are not supported, only string and int, so I've split the date into year, month, and day values and partitioned as follows:
events
|---occurredAtYear=2015
| |---occurredAtMonth=07
| | |---occurredAtDay=16
| | | |---<parquet-files>
...
I then load the parquet file from the root path /events
sqlContext.parquetFile('/var/tmp/events')
Documentation says:
'Spark SQL will automatically extract the partitioning information
from the paths'
However my query
SELECT * FROM events where occurredAtYear=2015
fails miserably, saying Spark cannot resolve 'occurredAtYear'.
I can see the schema for all other aspects of the event and can run queries on those attributes, but printSchema does not list occurredAtYear/Month/Day in the schema at all. What am I missing to get partitioning working properly?
Cheers
So it turns out I was following the instructions too literally; I was actually writing the parquet files out to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/data.parquet
The 'data.parquet' suffix was creating a further directory with parquet files underneath it; I should have been saving the parquet files to
/var/tmp/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16
All works now with the schema being discovered correctly.
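As a sketch, assuming the loaded DataFrame has been registered as a temporary table named events, the discovered partition columns can then be queried like any other column:

SELECT occurredAtYear, occurredAtMonth, occurredAtDay, COUNT(*) AS n
FROM events
WHERE occurredAtYear = 2015
GROUP BY occurredAtYear, occurredAtMonth, occurredAtDay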
