I am reading a file from the Ignite File System (IGFS) using Spark.
The file system URI is igfs://myfs#hostname:4500/path/to/file.
Some of the Spark jobs are able to read the file, but others fail with FileNotFoundException.
The execution eventually ends with a FileNotFoundException.
I tested the same code with small files and it worked fine.
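A simplified sketch of the kind of read involved (the URI is the placeholder from above, the exact API in my job may differ, and the classpath/configuration notes in the comments are assumptions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IgfsRead {
    public static void main(String[] args) {
        // Sketch only: the URI below is the placeholder from the question, not a real endpoint.
        // Assumes the ignite-hadoop accelerator jars are on the executor classpath and
        // fs.igfs.impl points at Ignite's Hadoop file system implementation in core-site.xml.
        SparkConf conf = new SparkConf().setAppName("igfs-read");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("igfs://myfs#hostname:4500/path/to/file");
        System.out.println("Line count: " + lines.count());

        sc.stop();
    }
}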
Thanks in advance.
Running any Spark job remotely from Talend, I get this error. The same jobs run locally do not generate any errors. Does anyone have any suggestions?
The job should write a Parquet file, but it generates this error after a long run.
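In essence, the relevant step is just a Parquet write of this shape (the paths are placeholders; the actual job is generated by Talend):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetWrite {
    public static void main(String[] args) {
        // Sketch only: input and output paths are placeholders.
        SparkSession spark = SparkSession.builder().appName("parquet-write").getOrCreate();

        Dataset<Row> df = spark.read().json("hdfs:///data/input/*.json");
        df.write().parquet("hdfs:///data/output/result");

        spark.stop();
    }
}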
I have a Spark batch job which reads some JSON files, writes them to Hive, then queries some other Hive tables, does some computation, and writes the output back to Hive in ORC format.
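Roughly, the pipeline has this shape (the table names, paths, and query below are placeholders, not the real job):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToHiveOrc {
    public static void main(String[] args) {
        // Sketch only: illustrates the read-JSON -> write-Hive -> query -> write-ORC flow.
        SparkSession spark = SparkSession.builder()
                .appName("json-to-hive-orc")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> json = spark.read().json("hdfs:///data/input/*.json");
        json.write().mode(SaveMode.Overwrite).saveAsTable("staging.raw_events");

        Dataset<Row> joined = spark.sql(
                "SELECT e.*, d.attr FROM staging.raw_events e JOIN ref.dim d ON e.k = d.k");

        joined.write().mode(SaveMode.Overwrite).format("orc").saveAsTable("out.events_orc");

        spark.stop();
    }
}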
What I experience is that the job gets stuck with one stage in a pending state.
The DAG looks as follows:
I'm using Hadoop 2.7.3.2.6.5.0-292 and Spark is running on YARN.
I looked at the YARN logs and the Spark event logs, but I do not see an issue.
Rerunning the job results in the same behavior.
The question is: what does this pending (unknown) stage state mean, and how can I debug why the job is stuck in it?
What will happen if, for a running Spark job, another process deletes the .hiveStaging directory? Will it cause:
job failure,
data loss but job success, or
no data loss and job success?
Or are there any HDFS locks that will prevent the directory from being deleted?
Thanks
Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished.
Once the query execution completes, the data is moved to the output HDFS location.
If you delete the .hiveStaging directory while the job is still running, your Hive query/driver code will fail with a java.io.IOException.
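As an illustration (database and table names are placeholders), an insert like the following goes through a temporary staging directory under the target table's HDFS location and is only moved into place when the query succeeds:

import org.apache.spark.sql.SparkSession;

public class HiveStagingDemo {
    public static void main(String[] args) {
        // Sketch only: mydb.target_table and mydb.source_table are placeholders.
        SparkSession spark = SparkSession.builder()
                .appName("hive-staging-demo")
                .enableHiveSupport()
                .getOrCreate();

        // While this statement runs, intermediate files are written to a temporary
        // .hive-staging_* directory inside the table's location; only after the
        // query succeeds are they moved into the final output location.
        spark.sql("INSERT OVERWRITE TABLE mydb.target_table SELECT * FROM mydb.source_table");

        spark.stop();
    }
}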
When running in single-node, non-cluster mode, whenever I do rdd.saveAsTextFile("file://...") or df.write().csv("file://..."), it creates a folder at that path containing the part files and a file called _SUCCESS.
But when I use the same code in cluster mode, it doesn't work. It doesn't throw any errors, but no part files are created in that folder. The folder and the _SUCCESS file are created, but the actual part files with the data are not.
I am not sure what exactly the problem is here. Any suggestions on how to solve this are greatly appreciated.
Since in cluster mode the tasks are performed on the worker machines, the part files end up on whichever worker executed each task, not on the driver's local file system.
You should save the output to HDFS, S3, or some other shared file server (e.g. FTP) if you are running in cluster mode.
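For example (the paths here are placeholders), writing to a shared HDFS location instead of a local path:

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClusterModeWrite {
    public static void main(String[] args) {
        // Sketch only: the point is that in cluster mode the destination must be a
        // shared file system (HDFS, S3, ...) visible to every executor, not a
        // local file:// path on one machine.
        SparkSession spark = SparkSession.builder().appName("cluster-write").getOrCreate();
        JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));
        rdd.saveAsTextFile("hdfs:///user/me/output/text");   // instead of file:///...

        Dataset<Row> df = spark.read().json("hdfs:///user/me/input/*.json");
        df.write().csv("hdfs:///user/me/output/csv");        // instead of file:///...

        spark.stop();
    }
}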
I am new to Spark and need help figuring out why my Hive databases are not accessible when performing a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine: a Cloudera QuickStart VM (CDH 5.4.x) on VirtualBox.
I have downloaded pre-built Spark 1.3.1.
I am using the Hive bundled with the VM and can run Hive queries through spark-shell and the Hive command line without any issue. This includes running the command:
LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into Hive. I have saved the results of the Cassandra read in parquet format in a folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I:
copied hive-site.xml to the Spark conf folder;
made a change to this XML. I noticed that I had two hive-site.xml files, one auto-generated and another with the Hive execution parameters, and I combined both into a single hive-site.xml.
Code used (Java):
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

// sc is the existing JavaSparkContext
HiveContext hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc));
hiveContext.sql("show databases").show();
hiveContext.sql("LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' "
        + "INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
So this worked, and I could load data into Hive. However, after I restarted my VM, it stopped working.
When I run the show databases Hive query, I get a result saying
result
default
instead of the databases in Hive, which are
default
test_spark
I also notice a folder called metastore_db being created in my project folder. From googling around, I know this happens when Spark can't connect to the Hive metastore, so it creates a local (Derby-backed) one of its own. I thought I had fixed that, but clearly not.
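For what it is worth, the only workaround I can think of is forcing the metastore URI on the HiveContext. This is just a sketch: the thrift host and port below are my guess at the CDH defaults, not values I have confirmed, and I am not sure the setting takes effect before the metastore client is created.

// Sketch only: the hive.metastore.uris value is an assumption (9083 is the usual
// CDH metastore thrift port); the real value should come from hive-site.xml.
HiveContext hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc));
hiveContext.setConf("hive.metastore.uris", "thrift://quickstart.cloudera:9083");
hiveContext.sql("show databases").show();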
What am I missing?