SparkSQL attempts to read data from non-existing path - apache-spark

I am having an issue with the PySpark SQL module. I created a partitioned table and saved it as Parquet files into a Hive table by running a Spark job after multiple transformations.
The data loads into Hive successfully and I can also query it there, but when I try to query the same data from Spark it says the file path doesn't exist.
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the error above is an old partition value that no longer exists.
I have since run the Spark job that populates a new partition value.
I searched for solutions, but all I found were reports that this did not happen in Spark 1.4 and is an issue in 1.6.
Can someone please suggest a solution for this problem?

Related

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1

I am querying a Hudi table using Hive running on the Spark engine in an EMR 6.3.1 cluster.
The Hudi version is 0.7.
I inserted a few records and then updated them using Hudi Merge on Read, which internally creates new files under the same partition containing the updated data/records.
Now, when I query the same table using Spark SQL, it works fine and does not return any duplicates: it honours only the latest records/Parquet files. It also works fine when I use Tez as the underlying engine for Hive.
But when I run the same query at the Hive prompt with Spark as the underlying execution engine, it returns all the records and does not filter out the previous Parquet files.
I have tried setting the property spark.sql.hive.convertMetastoreParquet=false, but it did not work.
Please help.
This is a known issue in Hudi.
Still, using the property below, I am able to remove the duplicates in RO (read-optimised) Hudi tables. The issue persists for RT (real-time) tables.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

Parquet column issue while loading data to SQL Server using spark-submit

I am facing the following issue while migrating data from Hive to SQL Server using a Spark job, with the query supplied through a JSON file.
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file.
Column: [abc], Expected: string, Found: INT32
From what I understand, the Parquet files contain a different column type than the Hive view declares. I am able to retrieve the data using tools such as Teradata; the issue only occurs when loading it into a different server.
Can anyone help me understand the problem and suggest a workaround?
Edit:
spark version 2.4.4.2
Scala version 2.11.12
Hive 2.3.6
SQL Server 2016

Hive Table or view not found although the Table exists

I am trying to run a Spark job, written in Java, on the Spark cluster to load records as a DataFrame into a Hive table I created.
df.write().mode("overwrite").insertInto("dbname.tablename");
Although the table and database exist in Hive, it throws the error below:
org.apache.spark.sql.AnalysisException: Table or view not found: dbname.tablename, the database dbname doesn't exist.;
I also tried reading from an existing Hive table different from the one above, thinking there might have been an issue with my table creation.
I also checked that my user has permission to the HDFS folder where Hive stores the data.
It all looks fine; I am not sure what the issue could be.
Please suggest.
Thanks
I think it is searching for that table in Spark's built-in in-memory catalog instead of the Hive metastore.

External Table not getting updated from parquet files written by spark streaming

I am using Spark Streaming to write the aggregated output as Parquet files to HDFS using SaveMode.Append. I have an external table created like:
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:////"
);
I was under the impression that, for an external table, queries should also fetch data from newly added Parquet files. But it seems the newly written files are not being picked up.
Dropping and recreating the table every time works, but that is not a solution.
Please suggest how my table can also serve the data from newer files.
Are you reading those tables with spark?
if so, spark caches parquet tables metadata (since schema discovery can be expensive)
To overcome this, you have two options:
Set the config spark.sql.parquet.cacheMetadata to false
Refresh the table before the query: sqlContext.refreshTable("my_table")
See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

I have a Spark DataFrame that I want to save as a partitioned Hive table. I tried the following two statements, but they don't work: I don't see any ORC files in the HDFS directory; it's empty. I can see that baseTable exists in the Hive console, but obviously it's empty because there are no files in HDFS.
The two lines below, saveAsTable() and insertInto(), do not work. The registerDataFrameAsTable() method works, but it creates an in-memory table and is causing OOM in my use case, as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
//the following works but creates in memory table and seems to be reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
Hope you have already got your answer, but posting this for others' reference: partitionBy was only supported for Parquet up to Spark 1.4; support for ORC, JSON, text and Avro was added in 1.5+. Please refer to the doc below:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html
