java.io.FileNotFoundException: Item not found Concurrent read/write on ORC table - apache-spark

When I try concurrent read/write on a table using spark application, I get the following error:
19/10/28 15:26:49 WARN TaskSetManager: Lost task 213.0 in stage 6.0 (TID 407, prod.internal, executor 3): java.io.FileNotFoundException: Item not found: 'gs://bucket/db_name/table_name/p1=xxx/part-1009-54ad3fbb-5eed-43ba-a7da-fb875382897c.c000'. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.getFileNotFoundException(GoogleCloudStorageExceptions.java:38)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:631)
I am using Google Cloud Dataproc Version 1.4 and stock hadoop component versions.
I was previously writing to and reading from the same partition of a Parquet table, which used to throw a "refresh table" error. Now I'm using an ORC-format table, but the error is much the same. Are there any solutions for concurrent read/write on Hive tables from Spark applications?

You can try running
spark.sql("REFRESH TABLE your_table")
before your read/write operation; it can work "occasionally".
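As a minimal sketch of how that might look on the reader side (the table and partition names are taken from the error path above; everything else is an assumption, not the asker's actual code):

// Sketch only: refresh Spark's cached file listing for the table before reading,
// so files rewritten by the concurrent writer are picked up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("concurrent-reader")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("REFRESH TABLE db_name.table_name")   // drop stale cached metadata
val df = spark.table("db_name.table_name").where("p1 = 'xxx'")
df.show()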

The first error line indicates that the file was not found in your bucket, so you may want to look into this. Check that the folders exist and that the files and the requested generations are accessible.
As for the "STRICT generation consistency" message, this is most probably produced by the Cloud Storage connector and is related to Cloud Storage's strongly consistent operations.
https://cloud.google.com/storage/docs/consistency
Have you looked into your error logs to see why this error occurs? What type of environment are you running your application on?
This may be more of a Hive issue, relating to the concurrency and locking mechanism you want to implement.
https://cwiki.apache.org/confluence/display/Hive/Locking
Also, I would advise you to look more into the recommendations and functionalities of using Apache Hive on Cloud Dataproc. You can also consider using a multi-regional bucket if the Hive data needs to be accessed from Hive servers that are located in multiple locations.
https://cloud.google.com/solutions/using-apache-hive-on-cloud-dataproc

Related

Can Spark ignore a task failure due to an account data issue and continue the job for other accounts?

I want Spark to ignore some tasks that fail due to data issues. Also, I want Spark not to stop the whole job due to some insert failures.
If you are using Databricks, you can handle bad records and files as explained in this article:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
From the documentation:
Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs. You can obtain the exception records/files and reasons from the exception logs by setting the data source option badRecordsPath. badRecordsPath specifies a path to store exception files for recording the information about bad records for CSV and JSON sources and bad files for all the file-based built-in sources (for example, Parquet).
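As a minimal sketch of that option (Databricks-specific; the input and bad-records paths are placeholders, and spark is assumed to be the ambient session as in a Databricks notebook):

// Sketch only: divert malformed records/files to a side location instead of failing the job.
val df = spark.read
  .option("badRecordsPath", "/mnt/logs/badRecords")  // where exception files get written
  .json("/mnt/raw/accounts/")                        // hypothetical JSON source

df.write.mode("overwrite").parquet("/mnt/clean/accounts/")  // hypothetical target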
You can also use a data-cleansing library like Pandas, Optimus, sparkling.data, vanilla Spark, Dora, etc. That will give you insight into the bad data and let you fix it before running analysis on it.

CDH 6.2 Hive cannot execute queries neither on Spark nor MapReduce

I'm trying to run a simple select count(*) from table query on Hive, but it fails with the following error:
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 5414a8a4-5252-4ccf-b63e-2ee563f7d772_0: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
This is happening since I've moved to CDH 6.2 and enabled Spark (version 2.4.0-cdh6.2.0) as the execution engine of Hive (version 2.1.1-cdh6.2.0).
My guess is that Hive is not correctly configured to launch Spark. I've tried setting the spark.home property in hive-site.xml to /opt/cloudera/parcels/CDH/lib/spark/, and setting the SPARK_HOME environment variable to the same value, but it made no difference.
A similar issue was reported here, but the solution (i.e., to put the spark-assembly.jar file in Hive's lib directory) cannot be applied, as that file is no longer built in recent Spark versions.
A previous question addressed a similar but different issue, related to memory limits on YARN.
Also, switching to MapReduce as the execution engine still fails, but with a different error:
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org/apache/hadoop/hdfs/protocol/SystemErasureCodingPolicies
Looking for the latest error on Google shows no result at all.
UPDATE: I discovered that queries do work when connecting to Hive through other tools (e.g., Beeline, Hue, Spark) and independently of the underlying execution engine (i.e., MapReduce or Spark). Thus, the error may lie within the Hive CLI, which is currently deprecated.
UPDATE 2: The same problem actually happened on Beeline and Hue with a CREATE TABLE query; I was able to execute it only with the Hive interpreter of Zeppelin.

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MYSQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how?
Thanks!
Seems like 2 questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming scenario and referencing others based on my own reading / research, a limited response:
On logging and progress checking of stages, tasks and jobs:
Global logging via log4j, tailored by starting from the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining your own logging requirements at the Spark level.
Programmatically, by using a Logger, e.g. import org.apache.log4j.{Level, Logger}.
The REST API to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a SparkListener that can be used (see the sketch after this list).
The Web UI at http://<master-host>:8080 to see progress.
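As a minimal sketch of the listener approach (the class name, counter and log interval are my own illustration, not from the original answer):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Sketch only: count finished tasks and report completed stages from the driver.
class ProgressListener extends SparkListener {
  private val tasksDone = new AtomicLong(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val n = tasksDone.incrementAndGet()
    if (n % 100 == 0) println(s"$n tasks completed so far")
  }

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stage.stageInfo.stageId} finished (${stage.stageInfo.numTasks} tasks)")
}

val spark = SparkSession.builder().appName("etl-with-progress").getOrCreate()
spark.sparkContext.addSparkListener(new ProgressListener)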
Recovery depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects vs. memory-usage issues, and things like serious database duplicate-key errors, depending on the API used.
See "How does Apache Spark handle system failure when deployed in YARN?". Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean it's over, e.g. memory issues from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store that raises a duplicate-key error, or JDBC connection outages. These mean re-execution.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate key inserts on some Hadoop Storage Managers.
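On the caching/checkpointing part of the question, a minimal sketch (the checkpoint directory, JDBC connection details and column names are hypothetical): checkpointing materializes an intermediate result and truncates its lineage, so retried downstream tasks do not recompute everything from the MySQL source.

import org.apache.spark.sql.SparkSession

// Sketch only: checkpoint an expensive intermediate DataFrame to HDFS.
val spark = SparkSession.builder().appName("etl-checkpoint").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/etl-checkpoints")  // hypothetical path

val raw = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/source_db")   // hypothetical connection
  .option("dbtable", "events")
  .option("user", "etl")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

// checkpoint() writes the data to the checkpoint dir and cuts the lineage.
val aggregated = raw.groupBy("account_id").count().checkpoint()
aggregated.write.mode("overwrite").json("s3a://my-bucket/aggregates/")  // hypothetical bucket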

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools [NiFi, Gobblin, etc.], I have observed that Databricks is now promoting the use of Spark for data ingestion/onboarding.
We have a spark[scala] based application running on YARN. So far we are working on a hadoop and spark cluster where we manually place required data files in HDFS first and then run our spark jobs later.
Now that we are planning to make our application available to clients, we are expecting any type and number of files [mainly CSV, JSON, XML, etc.] from any data source [FTP, SFTP, any relational or NoSQL database], of huge size [ranging from GBs to PBs].
Keeping this in mind we are looking for options which could be used for data on-boarding and data sanity before pushing data into HDFS.
Options which we are looking for based on priority:
1) Spark for data ingestion and sanity: As our application is written and is running on spark cluster, we are planning to use the same for data ingestion and sanity task as well.
We are a bit worried about Spark's support for many data sources/file types/etc. Also, if we try to copy data from, say, an FTP/SFTP server, will all workers write data to HDFS in parallel? Are there any limitations when using it? Does Spark maintain any audit trail during this data copy?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used for any data source and for any size of file? Would it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GBs to PBs of data and perform certain sanity checks on that data before pushing it to HDFS?
3) Gobblin in clustered mode: We would like answers to the same questions as for NiFi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering that using NiFi or Gobblin would demand more infrastructure cost, I have started testing Spark for data onboarding.
So far I have tried using a Spark job for importing data [present at a remote staging area/node] into my HDFS, and I am able to do that by mounting that remote location on all my Spark cluster worker nodes. Doing this made that location local to those workers, hence the Spark job ran properly and the data was onboarded into my HDFS.
Since my whole project is going to be on Spark, keeping the data-onboarding part on Spark would not cost me anything extra. So far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and Hadoop cluster up and running, then instead of adding extra cost [where cost could be a major constraint], go for a Spark job for data onboarding.
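As a minimal sketch of that approach (the mount point, sanity rule and HDFS target path are all hypothetical):

import org.apache.spark.sql.SparkSession

// Sketch only: read CSVs from a staging directory mounted on every worker node,
// run a basic sanity check, and land the data in HDFS.
val spark = SparkSession.builder().appName("onboarding").getOrCreate()

val staged = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///mnt/staging/incoming/")   // mount must be visible on all workers

val sane = staged.filter("id IS NOT NULL")   // illustrative sanity rule only

sane.write.mode("append").parquet("hdfs:///data/onboarded/")   // hypothetical HDFS path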

WARN TaskSetManager: Lost task com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler

Does anyone have experience with this kind of error? I'm seeing it when running Spark 2.0.1 jobs using the s3a protocol.
I'm also seeing sporadic failures of saveAsTextFile to s3. I think it's recovering at least sometimes.
I'm trying to find a direction: whether it's the PySpark implementation, or s3a properties, limits, timeouts, or something else.
thank you!
The problem was that after running Spark jobs for almost a year we had accumulated a lot of files under the same S3 path; S3 performance was the issue. All I did was change the top-level "subdir" so that the paths of newly created files were different, and performance improved dramatically.
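As a minimal sketch of that workaround (the bucket name and prefix scheme are hypothetical): writing each run under a fresh top-level prefix keeps any single S3 key prefix from accumulating files indefinitely.

import java.time.LocalDate
import org.apache.spark.sql.SparkSession

// Sketch only: vary the top-level "subdir" so new output lands under a fresh prefix.
val spark = SparkSession.builder().appName("s3a-writer").getOrCreate()
val df = spark.read.parquet("s3a://my-bucket/input/")   // hypothetical input

val runPrefix = s"output-${LocalDate.now}"              // e.g. output-<run date>
df.write.mode("overwrite").parquet(s"s3a://my-bucket/$runPrefix/data/")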
Good to hear about this fix.
If you see it again, can you add the stack trace to a JIRA at issues.apache.org, project HADOOP, component fs/s3? This may show us where we can add a bit more retry logic on failing operations.
