Pyspark - Load file: Path does not exist - apache-spark

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
spark = SparkSession \
.builder \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()\
df = spark.read.csv('/home/hadoop/observations_temp.csv, header=True)
When I run the script raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// in the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv, header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think is because the file// extension just read the file locally and it does not distribute the file across the other nodes.
Do you know how can I read the csv file and make it available to all the other nodes?

You are right about the fact that your file is missing from your worker nodes thus that raises the error you got.
Here is the official documentation Ref. External Datasets.
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions :
You copy your file into each worker before starting the job;
Or you'll upload in HDFS with something like : (recommended solution)
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with :
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it. (with the proper credentials of course)
Some suggest that the --files tag provided with spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small but then you won't need Spark.
Alternatively, I would stick with HDFS (or any distributed file system).

I think what you are missing is explicitly setting the master node while initializing the SparkSession, try something like this
spark = SparkSession \
.builder \
.master("local") \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
and then read the file in the same way you have been doing
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...

Might be useful for someone running zeppelin on mac using Docker.
Copy files to custom folder : /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')

Related

Spark --archives file not found error, exception from executor

I am submitting a Spark job using Dataproc Serverless. My Spark code uses a few .yaml files as configuration and I pass them as --archives to the code.
Command to run the code:
gcloud dataproc batches submit pyspark src/mapper.py \
--project=$PROJECT_ID \
--region=$REGION \
--deps-bucket=$DEPS_BUCKET \
--container-image=$CONTAINER_IMAGE \
--service-account=$SERVICE_ACCOUNT \
--subnet=$SUBNETWORK_URI \
--py-files=dist/src.zip \
--archives=dist/config.zip \
-- --arg1="value1"
But I am getting an error with the below message:
An exception was thrown from the Python worker. Please see the stack trace below.
FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/spark/work/config/config.yaml'
Code used to access config:
with open("config/config.yaml", "r") as f:
COUNTRY_CODES = yaml.load(f, yaml.SafeLoader)
How can I submit dependent files to Dataproc so that they will be available inside /var/tmp/spark/work/ folder inside the executor?
When you access files in the archive that are passed via --archives parameter to Spark job, you do not need to specify full path to these files, instead you need to use current working directory (.). In your specific case it probably will be ./config/config.yaml (depends on folder structure inside your archive).
You can read more about Python package management in Spark docs: https://spark.apache.org/docs/3.3.1/api/python/user_guide/python_packaging.html

Spark external jars and files on hdfs

I have a spark job that I run using the spark-submit command.
The jar that I use is hosted on hdfs and I call it from there directly in the spark-submit query using its hdfs file path.
With this same logic, I'm trying to do the same when for the --jars options, the files options and also the extraClassPath option (in the spark.conf) but it seems that there is an issue with the fact that it point to a hdfs file path.
My command looks like this:
spark-submit \
--class Main \
--jars 'hdfs://path/externalLib.jar' \
--files 'hdfs://path/log4j.xml' \
--properties-file './spark.conf' \
'hdfs://path/job_name.jar
So not only when I call a method that refers the externalLib.jar, spark raises an exception telling me that it doesn't find the method but also from the starts I have the warning logs:
Source and destination file systems are the same. Not copying externalLib.jar
Source and destination file systems are the same. Not copying log4j.xml
It must come from the fact that I precise a hdfs path because it works flawlessly when I refers to those jar in the local file system.
Maybe it isn't possible ? What can I do ?

Spark read csv file submitted from --files

I'm submitting a Spark job to a remote spark cluster on yarn and including a file in the spark-submit --file I want to read the submitted file as a dataframe. But I'm confused about how to go about this without having to put the file in HDFS:
spark-submit \
--class com.Employee \
--master yarn \
--files /User/employee.csv \
--jars SomeJar.jar
spark: SparkSession = // create the Spark Session
val df = spark.read.csv("/User/employee.csv")
spark.sparkContext.addFile("file:///your local file path ")
Add file using addFile so that it can be available at your worker nodes. Since you want to read local file in cluster mode.
You may need to do a slight change according to scala and the spark version your are using.
employee.csv is in the work directory of executor, just reading it as follows:
val df = spark.read.csv("employee.csv")

Read files sent with spark-submit by the driver

I am sending a Spark job to run on a remote cluster by running
spark-submit ... --deploy-mode cluster --files some.properties ...
I want to read the content of the some.properties file by the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.
The ways around this problem that I know of are:
Upload the file to HDFS
Store the file in the app jar
Both are inconvenient since this file is frequently changed on the submitting dev machine.
Is there a way to read the file that was uploaded using the --files flag during the driver code main method?
Yes, you can access files uploaded via the --files argument.
This is how I'm able to access files passed in via --files:
./bin/spark-submit \
--class com.MyClass \
--master yarn-cluster \
--files /path/to/some/file.ext \
--jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
/path/to/app.jar file.ext
and in my Spark code:
val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size
I do believe these files are downloaded onto the workers in the same directory as the jar is placed, which is why simply passing the filename and not the absolute path to Source.fromFile works.
After the investigation, I found one solution for above issue. Send the any.properties configuration during spark-submit and use it by spark driver before and after SparkSession initialization. Hope it will help you.
any.properties
spark.key=value
spark.app.name=MyApp
SparkTest.java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
public class SparkTest{
public Static void main(String[] args){
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
Config conf = loadConf();
System.out.println(conf.getString("spark.key"));
// Initialize SparkContext and use configuration from properties
SparkConf sparkConf = new SparkConf(true).setAppName(conf.getString("spark.app.name"));
SparkSession sparkSession =
SparkSession.builder().config(sparkConf).config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().getOrCreate();
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());
}
public static Config loadConf() {
String configFileName = "any.properties";
System.out.println(configFileName);
Config configs = ConfigFactory.load(ConfigFactory.parseFile(new java.io.File(configFileName)));
System.out.println(configs.getString("spark.key")); // get value from properties file
return configs;
}
}
Spark Submit:
spark-submit --class SparkTest --master yarn --deploy-mode client --files any.properties,yy-site.xml --jars ...........
use spark-submit --help, will find that this option is only for working directory of executor not driver.
--files FILES: Comma-separated list of files to be placed in the working directory of each executor.
The --files and --archives options support specifying file names with the # , just like Hadoop.
For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into Spark worker directory, but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
this works for my spark streaming application in both yarn/client and yarn/cluster mode.
In pyspark, I find it really interesting to achieve this easily, first arrange your working directory like this:
/path/to/your/workdir/
|--code.py
|--file.txt
and then in your code.py main function, just read the file as usual:
if __name__ == "__main__":
content = open("./file.txt").read()
then submit it without any specific configurations as follows:
spark-submit code.py
it runs correctly which amazes me. I suppose the submit process archives any files and sub-dir files altogether and sends them to the driver in pyspark, while you should archive them yourself in scala version. By the way, both --files and --archives options are working in worker not the driver, which means you can only access these files in RDD transformations or actions.
Here's a nice solution I developed in Python Spark in order to integrate any data as a file from outside to your Big Data platform.
Have fun.
# Load from the Spark driver any local text file and return a RDD (really useful in YARN mode to integrate new data at the fly)
# (See https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html)
def parallelizeTextFileToRDD(sparkContext, localTextFilePath, splitChar):
localTextFilePath = localTextFilePath.strip(' ')
if (localTextFilePath.startswith("file://")):
localTextFilePath = localTextFilePath[7:]
import subprocess
dataBytes = subprocess.check_output("cat " + localTextFilePath, shell=True)
textRDD = sparkContext.parallelize(dataBytes.split(splitChar))
return textRDD
# Usage example
myRDD = parallelizeTextFileToRDD(sc, '~/myTextFile.txt', '\n') # Load my local file as a RDD
myRDD.saveAsTextFile('/user/foo/myTextFile') # Store my data to HDFS
A way around the problem is that you can create a temporary SparkContext simply by calling SparkContext.getOrCreate() and then read the file you passed in the --files with the help of SparkFiles.get('FILE').
Once you read the file retrieve all necessary configuration you required in a SparkConf() variable.
After that call this function:
SparkContext.stop(SparkContext.getOrCreate())
This will distroy the existing SparkContext and than in the next line simply initalize a new SparkContext with the necessary configurations like this.
sc = SparkContext(conf=conf).getOrCreate()
You got yourself a SparkContext with the desired settings

How to avoid "Not a file" exceptions when reading from HDFS with spark

I copy a tree of files from S3 to HDFS with S3DistCP in an initial EMR step. hdfs dfs -ls -R hdfs:///data_dir shows the expected files, which look something like:
/data_dir/year=2015/
/data_dir/year=2015/month=01/
/data_dir/year=2015/month=01/day=01/
/data_dir/year=2015/month=01/day=01/data01.12345678
/data_dir/year=2015/month=01/day=01/data02.12345678
/data_dir/year=2015/month=01/day=01/data03.12345678
The 'directories' are listed as zero-byte files.
I then run a spark step which needs to read these files. The loading code is thus:
sqlctx.read.json('hdfs:///data_dir, schema=schema)
The job fails with a java exception
java.io.IOException: Not a file: hdfs://10.159.123.38:9000/data_dir/year=2015
I had (perhaps naively) assumed that spark would recursively descend the 'dir tree' and load the data files. If I point to S3 it loads the data successfully.
Am I misunderstanding HDFS? Can I tell spark to ignore zero-byte files? Can i use S3DistCp to flatten the tree?
In Hadoop configuration for current spark context, configure "recursive" read for Hadoop InputFormat before to get the sql ctx
val hadoopConf = sparkCtx.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This will give the solution for "not a file".
Next, to read multiple files:
Hadoop job taking input files from multiple directories
or union the list of files into single dataframe :
Read multiple files from a directory using Spark
Problem solved with :
spark-submit ...
--conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \
--conf spark.hive.mapred.supports.subdirectories=true \
...
The parameters must be set in this way in spark version 2.1.0 :
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")

Resources