Failed to open HDFS file after loading data from Spark - apache-spark

I'm using Spark with Java.
I'm loading Parquet data into a Hive table as follows:
ds.write().mode("append").format("parquet").save(path);
Then I run:
spark.catalog().refreshTable("mytable"); // mytable is an external table
When I then try to query the data from Impala, I get the following exception:
Failed to open HDFS file
No such file or directory. root cause: RemoteException: File does not exist
After I run REFRESH mytable in Impala, I can see the data.
How can I issue that refresh command from Spark?
I also tried
spark.sql("msck repair table mytable");
but it still doesn't work for me.
Any suggestions?
Thanks.
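For context, spark.catalog().refreshTable() only refreshes Spark's own metadata cache; Impala maintains a separate catalog, so the REFRESH has to reach Impala itself. Below is a minimal sketch of one way to do that from a Java job over JDBC, assuming the Impala JDBC driver is on the classpath; the host, port and database are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical helper: ask Impala (not Spark) to refresh its metadata right after the write.
static void refreshImpalaTable(String tableName) throws Exception {
    String url = "jdbc:impala://impala-host:21050/default"; // placeholder connection details
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement()) {
        stmt.execute("REFRESH " + tableName);
    }
}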

Related

Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'

The jar file for the Druid Hive handler is present; the filename in the Hive lib folder is hive-druid-handler-3.1.2.jar. The clients table already exists in Hive with data.
I get the following error when I try to create a Druid-backed table in Hive:
FAILED: SemanticException Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'
Here is the SQL:
CREATE TABLE ssb_druid_hive
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity" = "DAY")
AS
SELECT
  cast(clients.first_name as int) first_name,
  cast(clients.last_name as int) last_name
FROM clients
What could be the reason?
I found some people having a similar problem; here's the link to the external forum.
In short, you may have to install a newer version of Hive for it to work, i.e. download the latest version of Hive: if you have Hive 1, download Hive 2 and it should work.
Here's a PDF copy of the webpage (just in case the original is taken down):
https://drive.google.com/file/d/1-LgtgJa6FPgULeG09qbFNIYA2EgUCJK9/view?usp=sharing
I faced the same issue while creating an external table in Hive.
You need to add the hive-druid-handler-3.1.2.jar to your Hive server.
To add it temporarily:
1. Download hive-druid-handler-3.1.2.jar from here
2. Copy the .jar to S3 or blob storage
3. Go to the Hive CLI and run add jars s3://your-bucket/hive-druid-handler-3.1.2.jar
To add it permanently:
1. Copy hive-druid-handler-3.1.2.jar into the Hive lib folder:
hdfs dfs -copyToLocal s3://your-bucket/hive-druid-handler-3.1.2.jar /usr/hdp/4.1.4.8/hive/lib/
2. Restart the Hive server

Spark SQL SaveMode.Overwrite gives FileNotFoundException

I want to read a dataset from an S3 directory, make some updates, and overwrite it at the same location. What I do is:
dataSetWriter.writeDf(
finalDataFrame,
destinationPath,
destinationFormat,
SaveMode.Overwrite,
destinationCompression)
However, my job fails with this error message:
java.io.FileNotFoundException: No such file or directory 's3://processed/fullTableUpdated.parquet/part-00503-2b642173-540d-4c7a-a29a-7d0ae598ea4a-c000.parquet'
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Why is this happening? Is there anything I am missing about the "overwrite" mode?
Thanks
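One hedged reading of this error: Spark evaluates the read lazily, so with SaveMode.Overwrite the destination files are deleted while the job is still trying to read them through the original lineage. A minimal sketch of a workaround, staging the result at a separate path first (the staging path and the applyUpdates helper are placeholders, since the writeDf wrapper above is not shown):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

Dataset<Row> source = spark.read().parquet("s3://processed/fullTableUpdated.parquet");
Dataset<Row> updated = applyUpdates(source); // hypothetical transformation

// 1. Materialize to a staging location so the lineage no longer points at the destination.
String stagingPath = "s3://processed/fullTableUpdated_staging.parquet"; // placeholder path
updated.write().mode(SaveMode.Overwrite).parquet(stagingPath);

// 2. Re-read the staged copy and only then overwrite the original directory.
spark.read().parquet(stagingPath)
     .write().mode(SaveMode.Overwrite)
     .parquet("s3://processed/fullTableUpdated.parquet");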

Error While Writing into a Hive table from Spark Sql

I am trying to insert data into a Hive external table from Spark SQL.
I created the Hive external table with the following command:
CREATE EXTERNAL TABLE tab1 (col1 type, col2 type, col3 type) CLUSTERED BY (col1, col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my Spark job, I have written the following code:
Dataset<Row> df = session.read().option("header", "true").csv(csvInput);
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .saveAsTable(hiveTableName);
Each time I run this code, I get the following exception:
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
You should specify a save mode when saving the data to Hive:
df.write().mode(SaveMode.Append)
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .insertInto(hiveTableName);
Spark provides the following save modes:
ErrorIfExists: throws an exception if the target already exists; if the target doesn't exist, writes the data out.
Append: if the target already exists, appends the data to it; otherwise writes the data out.
Overwrite: if the target already exists, deletes it first, then writes the data out.
Ignore: if the target already exists, silently skips the write; otherwise writes the data out.
You are using the saveAsTable API, which creates the table in Hive. Since you have already created the Hive table with a DDL command, the table tab1 already exists, so when the Spark API tries to create it, it throws org.apache.spark.sql.AnalysisException: Table `tab1` already exists.
Either drop the table and let the saveAsTable API create it itself, or use the insertInto API to insert into the existing Hive table:
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .insertInto(hiveTableName);

Spark returns Empty DataFrame but Populated in Hive

I have a table in Hive:
db.table_name
When I run the following in Hive, I get results back:
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly,
sql("SELECT * FROM db.table_name").show
also shows nothing. Selecting arbitrary columns before the show also displays nothing, and a count reports that the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table was created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
and if I read the Parquet files directly, it works:
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround: describing the table to get its location and then reading it with spark.read.parquet.
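A minimal sketch of that workaround in Java (the exact layout of the DESCRIBE FORMATTED output can vary by Spark version, so the column index below is an assumption):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Pull the storage location out of the metastore, then read the Parquet files directly,
// bypassing the metastore-backed read path.
Row locationRow = spark.sql("DESCRIBE FORMATTED db.table_name")
    .filter("col_name = 'Location'")
    .first();
String location = locationRow.getString(1); // the path sits in the second column
Dataset<Row> df = spark.read().parquet(location);
df.show();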
Have you refreshed the table metadata? Maybe you need to refresh the table to access the new data.
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as a text file.
There is probably some incompatibility with Hive Parquet.
I also found a Cloudera note about it (CDH Release Notes): they recommend creating the Hive table manually and then loading the data from a temporary table or by query.
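A hedged sketch of that recommendation (the schema, table name and the queryResult DataFrame below are placeholders standing in for query_result above): create the Hive table up front, then load it by query instead of letting saveAsTable create it.

// Register the result as a temporary view and load the pre-created table from it.
queryResult.createOrReplaceTempView("query_result_tmp");
spark.sql("CREATE TABLE IF NOT EXISTS db.table_name (id INT, name STRING) STORED AS PARQUET");
spark.sql("INSERT OVERWRITE TABLE db.table_name SELECT id, name FROM query_result_tmp");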

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table stored as Parquet. I am not the owner of the schema this Hive-Parquet table lives in, so I don't have much information about it.
The problem is that when I try to query that table from the spark-sql> shell prompt (not via Scala, like spark.read.parquet("path")), I get 0 records and the message "Unable to infer schema". But when I created a managed table with CTAS in my personal schema, just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I am able to see the data.
So this makes it clear that something is wrong between the external Hive table, Parquet, and Spark SQL (shell).
If locating the schema were the issue, it should behave the same way when accessing it through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0
Please suggest what the issue might be.
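One hedged diagnostic (an assumption, not a confirmed fix for this setup): spark.sql.hive.convertMetastoreParquet controls whether Spark SQL reads metastore Parquet tables with its built-in Parquet reader or through the Hive SerDe; disabling it and re-running the query shows whether the mismatch is in Spark's native Parquet path.

// Hypothetical check: force the Hive SerDe path for metastore Parquet tables, then compare results.
spark.conf().set("spark.sql.hive.convertMetastoreParquet", "false");
spark.sql("SELECT * FROM schema_name.table_name").show(10); // schema_name.table_name is a placeholder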
