I have Presto installed alongside AWS EMR. I've created a table in Presto from a Hive table:
CREATE TABLE temp_table
WITH (format = 'PARQUET')
AS
SELECT * FROM <hive_table>;
Where are the Parquet files stored?
Or, where are any of the files stored when a CREATE TABLE statement is executed?
The data is stored in the Hive Warehouse, viewable on the Master node.
hdfs://ip-###-###-###-###.ec2.internal:8020/user/hive/warehouse/<table_name>/
Viewable through the following command:
hadoop fs -ls hdfs://ip-###-###-###-###.ec2.internal:8020/user/hive/warehouse/<table_name>/
Version: DBR 8.4 | Spark 3.1.2
Spark allows me to create a bucketed hive table and save it to a location of my choosing.
df_data_bucketed = (df_data.write.mode('overwrite')
    .bucketBy(9600, 'id').sortBy('id')
    .saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
I have verified that this saves the table data to my specified path (in my case, blob storage).
In the future, the table 'data_bucketed' might be wiped from my Spark catalog, or mapped to something else, and I'll want to "recreate" it using the data that's been previously written to blob storage, but I can find no way to load a pre-existing, already-bucketed Spark table.
The only thing that appears to work is
df_data_bucketed = (spark.read.format("parquet").load(bucketed_path)
.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
Which seems nonsensical, because it's essentially loading the data from disk and unnecessarily overwriting it with the exact same data just to take advantage of the buckets. (It's also very slow due to the size of this data.)
You can use Spark SQL to create that table in your catalog:
spark.sql("""CREATE TABLE IF NOT EXISTS tbl...""")
Following this, you can tell Spark to rediscover the data by running spark.sql("MSCK REPAIR TABLE tbl").
I found the answer at https://www.programmerall.com/article/3196638561/
Read from the saved Parquet files: if you want to use previously saved data, you can't use the above method, nor can you read it with spark.read.parquet() like regular files; data read that way does not carry the bucket information. The correct way is to use a CREATE TABLE statement. For details, refer to https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
An example is as follows:
spark.sql(
"""
|CREATE TABLE bucketed
| (name string)
| USING PARQUET
| CLUSTERED BY (name) INTO 10 BUCKETS
| LOCATION '/path/to'
|""".stripMargin)
I have read other questions and I am still confused about which option to take. I want to read an Athena view in EMR Spark and, from searching on Google/Stack Overflow, I realized that these views are somehow stored in S3, so I first tried to find the external location of the view through
Describe mydb.Myview
It provides the schema but doesn't provide the external location, from which I assumed that I cannot read it as a DataFrame from S3.
What I have considered so far for reading the Athena view in Spark
I have considered the following options:
Make a new table out of this Athena view using a WITH statement, with the external format as PARQUET:
CREATE TABLE Temporary_tbl_from_view
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/views_to_parquet/'
) AS
SELECT * FROM "mydb"."myview";
Another option is based on this answer, which suggests:
When you start an EMR cluster (v5.8.0 and later) you can instruct it
to connect to your Glue Data Catalog. This is a checkbox in the
'create cluster' dialog. When you check this option your Spark
SqlContext will connect to the Glue Data Catalog, and you'll be able
to see the tables in Athena.
but I am not sure how I can query this view (not a table) in PySpark. If Athena tables/views are available through the Glue catalogue in the Spark context, will a simple statement like this work?
sqlContext.sql("SELECT * from mydb.myview")
Question: what is the more efficient way to read this view in Spark? Does recreating a table using the WITH statement (external location) mean that I am storing this thing in the Glue catalog or S3 twice? If yes, how can I read it directly through S3 or the Glue catalog?
Just to share the solution I followed with others, I created my cluster with the following option enabled
Use AWS Glue Data Catalog for table metadata
Afterwards, I saw the database name from AWS Glue and was able to see the desired view under tableName, as below:
spark.sql("use my_db_name")
spark.sql("show tables").show(truncate=False)
+------------+---------------------------+-----------+
|database |tableName |isTemporary|
+------------+---------------------------+-----------+
| my_db_name|tabel1 |false |
| my_db_name|desired_table |false |
| my_db_name|tabel3 |false |
+------------+---------------------------+-----------+
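With the Glue Data Catalog serving as the metastore, the view can then be queried like any other catalog table. A minimal sketch, using the placeholder database and table names from the listing above:

# Read the Athena view through the Glue Data Catalog into a DataFrame.
# 'my_db_name' and 'desired_table' are the placeholder names shown above.
df = spark.sql("SELECT * FROM my_db_name.desired_table")
df.printSchema()
df.show(5, truncate=False)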
The jar file for the Druid Hive handler is there; the filename in the Hive library folder is hive-druid-handler-3.1.2.jar. The clients table already exists in Hive with data.
I am getting the following error when I try to create a table in Hive for Druid:
FAILED: SemanticException Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'
Here is the SQL:
CREATE TABLE ssb_druid_hive
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(clients.first_name as int) first_name ,
cast(clients.last_name as int) last_name
from clients
What could be the reason?
I found some people having a similar problem; here's the link to the external forum.
In conclusion, you may have to install the latest version for it to work,
i.e. download the latest version of Hive. If you have Hive 1, download Hive 2 and it should work.
Here's a PDF copy of the webpage (just in case that one is taken down):
https://drive.google.com/file/d/1-LgtgJa6FPgULeG09qbFNIYA2EgUCJK9/view?usp=sharing
I faced the same issue while creating an external table in Hive.
You need to add the hive-druid-handler-3.1.2.jar to your Hive server.
To add it temporarily:
1. Download hive-druid-handler-3.1.2.jar from here
2. Copy the .jar to S3 or blob storage
3. Go to the Hive CLI and run: ADD JAR s3://your-bucket/hive-druid-handler-3.1.2.jar;
To add it permanently:
1. Copy hive-druid-handler-3.1.2.jar into the Hive lib folder:
hdfs dfs -copyToLocal s3://your-bucket/hive-druid-handler-3.1.2.jar /usr/hdp/4.1.4.8/hive/lib/
2. Restart the Hive server.
There is an external, non-partitioned table in Hive pointing to an S3 location. The table points to a folder in S3, but the data is in multiple subfolders inside that folder.
This table can be queried in Hive even though it is not partitioned, by setting a few properties like the ones below:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
However, when the same table is used in Spark to load the data into a DataFrame with a SQL statement like df = sqlContext.sql("select * from table_name"), the action fails with an error saying that a subfolder in the external S3 location "is not a file".
I tried setting the above Hive properties in Spark using sc.hadoopConfiguration.set("mapred.input.dir.recursive", "true"), but it did not help. It looks like this helps only for sc.textFile-style loading.
This can be achieved by setting the following property in Spark:
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true")
Note that the property is set using sqlContext instead of sparkContext.
I tested this in Spark 1.6.2.
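Putting that together as a minimal PySpark sketch (Spark 1.6-era API, matching what was tested above; table_name is the placeholder from the question):

# Enable recursive listing of the table's S3 subfolders before querying it.
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

# The external, non-partitioned table can now be read even though its data
# sits in nested subfolders under the table location.
df = sqlContext.sql("SELECT * FROM table_name")
df.show(5)

On SparkSession-based versions, spark.conf.set(...) is the analogous call, though only the sqlContext form was verified here (Spark 1.6.2).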
I'm running a CDH4 cluster with Impala. I created a Parquet table and, after adding the Parquet jar files to Hive, I can query the table using Hive.
I added the same set of jars to /opt/presto/lib and restarted the coordinator and workers:
parquet-avro-1.2.4.jar
parquet-cascading-1.2.4.jar
parquet-column-1.2.4.jar
parquet-common-1.2.4.jar
parquet-encoding-1.2.4.jar
parquet-format-1.0.0.jar
parquet-generator-1.2.4.jar
parquet-hadoop-1.2.4.jar
parquet-hive-1.2.4.jar
parquet-pig-1.2.4.jar
parquet-scrooge-1.2.4.jar
parquet-test-hadoop2-1.2.4.jar
parquet-thrift-1.2.4.jar
I still get this error when running a Parquet SELECT query from Presto:
> select * from test_pq limit 2;
Query 20131116_144258_00002_d3sbt failed : org/apache/hadoop/hive/serde2/SerDe
Presto now supports Parquet automatically.
Try adding the jars to the Presto plugin directory instead of the Presto lib directory.
Presto automatically loads jars from the plugin directories.