Does Presto support Parquet format?

Running a CDH4 cluster with Impala, I created a Parquet table, and after adding the Parquet jar files to Hive I can query the table using Hive.
I added the same set of jars to /opt/presto/lib and restarted the coordinator and workers:
parquet-avro-1.2.4.jar
parquet-cascading-1.2.4.jar
parquet-column-1.2.4.jar
parquet-common-1.2.4.jar
parquet-encoding-1.2.4.jar
parquet-format-1.0.0.jar
parquet-generator-1.2.4.jar
parquet-hadoop-1.2.4.jar
parquet-hive-1.2.4.jar
parquet-pig-1.2.4.jar
parquet-scrooge-1.2.4.jar
parquet-test-hadoop2-1.2.4.jar
parquet-thrift-1.2.4.jar
Still getting this error when running a Parquet SELECT query from Presto:
> select * from test_pq limit 2;
Query 20131116_144258_00002_d3sbt failed : org/apache/hadoop/hive/serde2/SerDe

Presto now supports Parquet out of the box.

Try adding the jars to the Presto plugin directory instead of the Presto lib directory.
Presto automatically loads jars from its plugin directories.

Related

Cannot read Delta tables created by Spark in Hive or Dbeaver/JDBC

I've used Spark 3.3.1, configured with delta-core_2.12:2.2.0 and delta-storage-2.2.0, to create several tables within an external database.
spark.sql("create database if not exists {database}.{table} location {path_to_storage}")
Within that database I've got several delta tables, created and populated through Spark, e.g.:
{table_df}.write.format("delta").mode("overwrite").saveAsTable("{database}.{table}")
I can then query it right away:
df = spark.sql("select * from {database}.{table} limit 10")
df.show()
And everything works fine.
When I try to run the same command (select * from {database}.{table} limit 10;) through Hive or the DBeaver SQL editor, I get the following error:
hive> select * from {database}.{table} limit 10;
2023-01-04T12:45:29,474 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: 9ecfc0dd-0606-4060-98ef-b1395fc62456
2023-01-04T12:45:29,484 INFO [main] org.apache.hadoop.hive.ql.session.SessionState - Updating thread name to 9ecfc0dd-0606-4060-98ef-b1395fc62456 main
2023-01-04T12:45:31,138 INFO [9ecfc0dd-0606-4060-98ef-b1395fc62456 main] org.apache.hadoop.hive.common.FileUtils - Creating directory if it doesn't exist: hdfs://localhost:9000/tmp/hive/user/9ecfc0dd-0606-4060-98ef-b1395fc62456/hive_2023-01-04_12-45-29_879_2613740326746479876-1/-mr-10001/.hive-staging_hive_2023-01-04_12-45-29_879_2613740326746479876-1
OK
Failed with exception java.io.IOException:java.io.IOException: file:/{path_to_file_storage}/part-00000-7708a52c-0939-4288-b56a-ecdeea197574-c000.snappy.parquet not a SequenceFile
Time taken: 1.649 seconds
2023-01-04T12:45:31,298 INFO [9ecfc0dd-0606-4060-98ef-b1395fc62456 main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: 9ecfc0dd-0606-4060-98ef-b1395fc62456
2023-01-04T12:45:31,298 INFO [9ecfc0dd-0606-4060-98ef-b1395fc62456 main] org.apache.hadoop.hive.ql.session.SessionState - Resetting thread name to main
hive>
I have installed the Delta connector (delta-hive-assembly_2.12-0.6.0.jar) from here:
https://github.com/delta-io/connectors/blob/master/hive/README.md
I installed it in an auxjar folder in my main Hive directory and added the following properties to my hive-site.xml file:
<property>
  <name>hive.input.format</name>
  <value>io.delta.hive.HiveInputFormat</value>
</property>
<property>
  <name>hive.tez.input.format</name>
  <value>io.delta.hive.HiveInputFormat</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:/{path_to_file}/auxjar/delta-hive-assembly_2.12-0.6.0.jar</value>
</property>
When I start Hive I'm not seeing any exceptions about the file not being found. Have I missed a critical step?
Thanks
I tried running a simple query in Hive and got an IOException.
Hive Connector
See the FAQs at the bottom of that page:
If a table in the Hive Metastore is created by other systems such as
Apache Spark or Presto, can I use this connector to query it in Hive?
No. If a table in the Hive Metastore is created by other systems such
as Apache Spark or Presto, Hive cannot find the correct connector to
read it. You can follow our instructions to create a new table with a
different table name that points to the same path in Hive. Although it's
a different table name, the underlying data will be shared by all
systems. We recommend creating different tables in different systems
that point to the same path.
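In practice that means defining a second Hive table over the same Delta location through the connector's storage handler. A minimal sketch, assuming the delta-hive-assembly jar above is on Hive's classpath; the table name, column list, and path are placeholders and must match the Delta table's actual schema and location:
-- Hive-side table over the same Delta path that Spark writes to
CREATE EXTERNAL TABLE {database}.{table}_hive (
  col1 INT,
  col2 STRING
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '{path_to_storage}';
Querying {database}.{table}_hive from Hive should then read the same underlying files that Spark writes, without duplicating the data.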

Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'

The jar file for the Druid Hive handler is there. The clients table already exists in Hive with data. The filename in the Hive library folder is hive-druid-handler-3.1.2.jar.
I am getting this error when I try to create a table in Hive for Druid:
FAILED: SemanticException Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'
Here is the SQL:
CREATE TABLE ssb_druid_hive
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "DAY")
AS
SELECT
  cast(clients.first_name as int) first_name,
  cast(clients.last_name as int) last_name
from clients
What could be the reason?
I found some people having a similar problem, and here's the link to the external forum.
In conclusion, you may have to reinstall the latest version for it to work,
i.e. download the latest version of Hive. If you have downloaded Hive 1, download Hive 2 and it should work.
Here's a PDF of the webpage (just in case that one is dropped):
https://drive.google.com/file/d/1-LgtgJa6FPgULeG09qbFNIYA2EgUCJK9/view?usp=sharing
I faced the same issue while creating an external table in Hive.
You need to add the hive-druid-handler-3.1.2.jar to your Hive server.
To add it temporarily:
1. Download hive-druid-handler-3.1.2.jar from here
2. Copy the .jar to S3 or blob storage
3. Go to the Hive CLI and run ADD JAR s3://your-bucket/hive-druid-handler-3.1.2.jar
To add it permanently:
1. Copy hive-druid-handler-3.1.2.jar into the Hive lib folder:
hdfs dfs -copyToLocal s3://your-bucket/hive-druid-handler-3.1.2.jar /usr/hdp/4.1.4.8/hive/lib/
2. Restart the Hive server
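For the temporary route, you can confirm the jar is registered in the current session before retrying the statement; a small sketch from the Hive CLI, reusing the placeholder bucket path from step 3:
ADD JAR s3://your-bucket/hive-druid-handler-3.1.2.jar;
-- LIST JARS should now show the druid handler jar on the session classpath
LIST JARS;
-- then re-run the original CREATE TABLE ... STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' statement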

PrestoDB - Where are Parquet files stored?

I have Presto installed alongside AWS EMR. I've created a table in Presto from a Hive table.
CREATE TABLE temp_table
WITH (format = 'PARQUET')
AS
SELECT * FROM <hive_table>;
Where are the Parquet files stored?
Or, where are any of the files stored when a CREATE TABLE statement is executed?
The data is stored in the Hive warehouse, viewable on the master node.
hdfs://ip-###-###-###-###.ec2.internal:8020/user/hive/warehouse/<table_name>/
You can list it with the following command:
hadoop fs -ls hdfs://ip-###-###-###-###.ec2.internal:8020/user/hive/warehouse/<table_name>/
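If you'd rather not guess the path, both engines can report it directly; a quick sketch, assuming the table is registered in the shared Hive metastore:
-- from the Presto CLI: shows the table definition, including format and location
SHOW CREATE TABLE temp_table;
-- from the Hive CLI: the Location field points at the warehouse path above
DESCRIBE FORMATTED temp_table;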

How to specify the path where saveAsTable saves files to?

I am trying to save a DataFrame to S3 in PySpark (Spark 1.4) using DataFrameWriter:
df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)
df_writer.partitionBy('col1')\
.saveAsTable('test_table', format='parquet', mode='overwrite')
The Parquet files went to "/tmp/hive/warehouse/....", which is a local tmp directory on my driver.
I did set hive.metastore.warehouse.dir in hive-site.xml to an "s3a://...." location, but Spark doesn't seem to respect my Hive warehouse setting.
Use path:
df_writer.partitionBy('col1')\
.saveAsTable('test_table', format='parquet', mode='overwrite',
path='s3a://bucket/foo')
You can use insertInto(tablename) to overwrite an existing table since 1.4.

Spark Sql 1.3.0 + parquet

Using Spark SQL:
I've created a table without Parquet in HDFS and everything is OK.
I've created the same table structure but with "stored as parquet"; I also created the Parquet files, uploaded them to HDFS, and ran "load inpath 'hdfs://server/parquet_files'".
But when I try to execute "select * from table_name";
I get this exception:
Exception in thread "main" java.sql.SQLException: java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/user/hive/warehouse/table_name, expected: file:///
Any tips?
Fixed by including the Hadoop configuration files (core-site.xml and hdfs-site.xml) in the Spark configuration.
