How to handle reading a non-existing file in Spark - apache-spark

I am trying to read some files from HDFS using Spark's sc.wholeTextFiles. I pass a list of the required files, yet the job keeps throwing
py4j.protocol.Py4JJavaError: An error occurred while calling o98.showString.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:
if one of the files doesn't exist.
How can I skip the files that are not found and read only the ones that exist?

To check whether a file exists (and delete it, in my case) I do the following:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(new Path(fullPath))) {
  println("Output directory already exists. Deleting it...")
  fs.delete(new Path(fullPath), true)
}
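
The same check can be done from PySpark through the py4j gateway, which matches the original question's context. A minimal sketch, assuming an existing SparkSession named spark; the variable full_path is illustrative:

# PySpark equivalent of the exists-and-delete check via the py4j gateway
hadoop = spark.sparkContext._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
full_path = "output/dir"  # hypothetical path
if fs.exists(hadoop.Path(full_path)):
    print("Output directory already exists. Deleting it...")
    fs.delete(hadoop.Path(full_path), True)  # True = recursive delete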

Use the Hadoop FileSystem from the JVM gateway on the SparkContext to check whether a file exists:
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/file.csv"))
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("test.csv"))
True
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("fail_test.csv"))
False
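
Putting this together for the original question: filter the requested paths down to the ones that exist, then read only those. A sketch, assuming a running SparkContext named sc; the requested list is illustrative, not from the original question:

# Keep only the paths that exist, then read them in one call
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
requested = ["data/a.txt", "data/b.txt", "data/missing.txt"]  # hypothetical list
existing = [p for p in requested if fs.exists(hadoop.Path(p))]
# wholeTextFiles accepts a comma-separated list of paths
rdd = sc.wholeTextFiles(",".join(existing))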

Related

How to use wildcard in hdfs file path while list out files in nested folder

I'm using the code below to list files in a nested folder:
val hdfspath="/data/retail/apps/*/landing"
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(s"$hdfspath")).filter(_.isDirectory).map(_.getPath).foreach(println)
If I use the path as below
val hdfspath = "/data/retail/apps"
I get results, but if I use val hdfspath = "/data/retail/apps/*/landing" then I get a "path does not exist" error. Please help me out.
According to this answer, you need to use globStatus instead of listStatus:
fs.globStatus(new Path(s"$hdfspath")).filter(_.isDirectory).map(_.getPath).foreach(println)
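
For completeness, roughly the same glob can be issued from PySpark through the py4j gateway. A sketch, assuming a SparkSession named spark:

# PySpark sketch of the same globStatus call
hadoop = spark.sparkContext._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
for status in fs.globStatus(hadoop.Path("/data/retail/apps/*/landing")):
    if status.isDirectory():
        print(status.getPath())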

How to specify path to read a .txt file in assets directory

I want to get details about a selected harbour by taking vals from a list extracted using readLines from a .txt file, where each harbour has a .txt file in the assets directory. I generate the file name, but when the app runs in the emulator I get a file-not-found error.
In this case I am trying to read a file called Brehatharm.txt:
var portChosen = "Brehat"
//"tide2a/app/src/main/assets/"+//various paths to try
fileName = "assets/"+portChosen+"harm.txt"
val harmConsList:List<String> = File(fileName).readLines()
val portDisplayName = harmConsList[0]
val longTude = harmConsList[1]
val MTL =harmConsList[2]
etc,
The logcat reads:
2021-01-10 15:40:34.044 7108-7108/com.example.tide2a E/AndroidRuntime: FATAL EXCEPTION: main
Process: com.example.tide2a, PID: 7108
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.example.tide2a/com.example.tide2a.MainActivity}: java.io.FileNotFoundException: assets/Brehatharm.txt: open failed: ENOENT (No such file or directory)
The full Windows path to the file is:
C:\Users.......\OneDrive\Coding projects\tide2a\app\src\main\assets\Brehatharm.txt
I am sure the file is there, and spelled correctly, so I suspect I am specifying the path incorrectly. Please advise me.
Files in assets/ cannot be accessed using the File class. Use context.assets to get the AssetManager, which can open an InputStream for each file.

CentOS | Apache Spark error: file already exists (SparkContext)

I am unable to write to the file which I create. On Windows it works fine; on CentOS it says the file already exists and does not write anything.
File tempFile = new File("temp/tempfile.parquet");
tempFile.createNewFile();
parquetDataSet.write().parquet(tempFile.getAbsolutePath());
Following is the error: file already exists
2020-02-29 07:01:18.007 ERROR 1 --- [nio-8090-exec-1] c.gehc.odp.util.JsonToParquetConverter : Stack Trace: {}org.apache.spark.sql.AnalysisException: path file:/temp/myfile.parquet already exists.;
2020-02-29 07:01:18.007 ERROR 1 --- [nio-8090-exec-1] c.gehc.odp.util.JsonToParquetConverter : sparkcontext close
The default save mode in Spark is ErrorIfExists. This means that if a file with the name you intend to write already exists, Spark throws an exception like the one you got above. This is happening in your case because you create the file yourself rather than leaving that task to Spark. There are two ways to resolve the situation (a PySpark sketch follows the two options):
1) You can either mention savemode as "overwrite" or "append" in the write command:
parquetDataSet.write().mode("overwrite").parquet(tempFile.getAbsolutePath());
2) Or, you can simply remove the createNewFile call and pass the destination path directly to the Spark write command, as follows:
parquetDataSet.write().parquet("temp/tempfile.parquet");
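
The same save modes exist in PySpark. A minimal sketch with a toy DataFrame; the names and paths are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-modes").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # toy data
# "overwrite" replaces existing output, "append" adds to it;
# the default, "errorifexists", raises the AnalysisException shown above
df.write.mode("overwrite").parquet("temp/tempfile.parquet")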

pySpark local mode - loading text file with file:/// vs relative path

I am just getting started with Spark and I am trying out examples in local mode...
I noticed that in some examples the RDD is created from a relative path to the file, and in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting 'file:///' in front of it?
I am using Spark 2.2 on Mac running in local mode
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
#This will work providing the relative path
lines = sc.textFile("code/test.csv")
#This will not work
lines = sc.textFile("file:///code/test.csv")
sc.textFile("code/test.csv") means test.csv in /<hive.metastore.warehouse.dir>/code/test.csv on HDFS.
sc.textFile("hdfs:///<hive.metastore.warehouse.dir>/code/test.csv") is equal to above.
sc.textFile("file:///code/test.csv") means test.csv in /code/test.csv on local file system.

MLlib not saving the model data in Spark 2.1

We have a machine learning model that looks roughly like this:
sc = SparkContext(appName = "MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)
# where data_res_promo comes from a pandas DataFrame
indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer') #This saves the indexer architecture
On my machine, when I run it locally, it generates a folder ALSIndexer/ with the parquet files and all the information on the model.
When I run it on our Azure Spark cluster, it does not generate the folder on the main node (nor on the workers). However, if we try to rewrite it, it says:
cannot overwrite folder
Which means it is somewhere, but we can't find it.
Would you have any pointers?
Spark will by default save files to the distributed filesystem (probably HDFS). The files will therefore not be visible on the nodes themselves but, as they are present, you get the "cannot overwrite folder" error message.
You can easily access the files through HDFS and copy them to the main node. This can be done on the command line with one of these commands:
1. hadoop fs -get <HDFS file path> <Local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
It can also be done by importing org.apache.hadoop.fs.FileSystem and using the methods available there, as sketched below.
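
From PySpark, the same copy can be done through the py4j gateway. A sketch, assuming a running SparkContext named sc; the destination path is illustrative:

# Copy the saved ALSIndexer folder out of HDFS to the local filesystem
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.copyToLocalFile(hadoop.Path("ALSIndexer"), hadoop.Path("/tmp/ALSIndexer"))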
