Load props file in EMR Spark Application - apache-spark

I am trying to load custom properties in my Spark application using:
command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,s3://spark-config-test/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar
However, I am getting the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid properties file 's3://spark-config-test/myprops.conf'.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
    at org.apache.spark.launcher.AbstractCommandBuilder.loadPropertiesFile(AbstractCommandBuilder.java:284)
    at org.apache.spark.launcher.AbstractCommandBuilder.getEffectiveConfig(AbstractCommandBuilder.java:264)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:233)
    at org
Is this the correct way to load a file from S3?

You can't load a properties file directly from S3. Instead, download the properties file to somewhere on your master node first, then submit the Spark job referencing the local path on that node. The download can be done with a command-runner.jar step that runs the AWS CLI.
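For example, a sketch of the two steps (the local path under /home/hadoop is only an illustration):

command-runner.jar,aws,s3,cp,s3://spark-config-test/myprops.conf,/home/hadoop/myprops.conf
command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,/home/hadoop/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar

The first step pulls the file down with the AWS CLI; the second is the original spark-submit step with --properties-file pointing at the local copy.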

Related

Get correct path for sparkdid.cassandra.connection.config.cloud.path in AWS EMR

I have searched the entire web but could not find a solution.
I am trying to connect to Astra Cassandra using the secure connect bundle on AWS EMR; it is able to download the bundle file but does not load it.
spark.conf.set("sparkdid.cassandra.connection.config.cloud.path", "secure-connect-app.zip")
This is how I am giving the path. I know this path is wrong, since it is not loading the correct config and the connection is refused with localhost.
I don't know what the correct path is in EMR.
If you look into the documentation, you will see that the file can either be specified as the URL of a file on S3, or you can use the --files option when submitting with spark-submit or spark-shell; the bundle is then available under just its file name, as you are doing.
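A sketch of the --files route (the bucket and application jar names here are placeholders; the property name follows the Spark Cassandra Connector documentation):

spark-submit --deploy-mode cluster \
  --files s3://my-bucket/secure-connect-app.zip \
  --class com.example.Main my-app.jar

and then, inside the application:

spark.conf.set("spark.cassandra.connection.config.cloud.path", "secure-connect-app.zip")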

Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error:
No such file or directory
My code:
import scala.io.Source

val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
At the same time I can access this file without any problems using dbutils or load it to RDD via spark context.
I've tried specifying the path starting with dbfs:/ or /dbfs/, or just with the test folder name, both in Scala and Python, and I get the same error each time. I'm running the code from a notebook. Is it a problem with the cluster configuration?
Check whether your cluster has credential passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis
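If passthrough is indeed enabled, a minimal Scala sketch of the alternative already mentioned in the question is to go through the DBFS path instead of the local file API (the path mirrors the one above):

// read the file through Spark instead of the local file API
val lines = spark.read.textFile("dbfs:/test/test.txt").collect()

// or preview it with dbutils, which is available in Databricks notebooks
println(dbutils.fs.head("dbfs:/test/test.txt"))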

Read CSV file from AWS S3

I have an EC2 instance running PySpark, and I'm able to connect to it via SSH and run interactive code in a Jupyter notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
This throws a long Python error message, ending with something like:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.

spark cassandra connector can't read ssl trust store file present in hdfs

I am trying to configure SSL between Spark and Cassandra. Passing a local file path for the trust store works, whereas passing an HDFS file path does not; it throws a file-not-found error in both YARN client and cluster mode.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");
Any idea why this happens? The same file works through sc.textFile().
Exception:
About to save to Cassandra.
16/07/22 08:56:55 ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1469177810000 ms.0
java.io.FileNotFoundException: hdfs:/abc/ssl.jks (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
This happens because the SSL parameters are used by the Cassandra Java driver, which doesn't know anything about HDFS. You need to put the truststore and keystore on every node at the same local location, and specify that path in the config parameters.
I'll flag this issue to the developers.
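A minimal sketch of that workaround (the local path is only an example; the JKS file is assumed to have been copied to the same location on every node beforehand, e.g. by a bootstrap script):

// the Cassandra Java driver reads this path from each node's local filesystem
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true")
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "/etc/cassandra-ssl/truststore.jks")
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password")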

Apache Spark FileNotFoundException

I am trying to experiment a little with Apache Spark cluster mode.
My cluster consists of a driver on my machine, and a worker and manager on a separate host machine.
I ship a text file using sparkContext.addFile(filepath), where filepath is the path of my text file on the local machine, and I get the following output:
INFO Utils: Copying /home/files/data.txt to /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
INFO SparkContext: Added file /home/files/data.txt at http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
But when I try to access the same file using SparkFiles.get("data.txt"), I get the path to the file on my driver instead of on the worker.
I am adding the file like this:
SparkConf conf = new SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting FileNotFoundException here.
I recently faced the same issue, and hopefully my solution can help other people solve it.
We know that when you use SparkContext.addFile(<file_path>), it sends the file to the automatically created working directories in the driver node (in this case, your machine) as well as the worker nodes of the Spark cluster.
The block of code that you shared, where you are using SparkFiles.get("data.txt"), is executed on the driver, so it returns the path to the file on the driver instead of on the worker. But the task runs on the worker, and the path to the file on the driver does not match the path on the worker, because the driver and worker nodes have different working directories. Hence you get the FileNotFoundException.
There is a workaround to this problem that does not need any distributed file system or FTP server. Put the file in the working directory on your host machine (where the worker runs). Then, instead of using SparkFiles.get("data.txt"), pass "./data.txt":
List<String> file = sparkContext.textFile("./data.txt").collect();
Now, even though there is a mismatch between the working directory paths of the Spark driver and the worker nodes, you will NOT face a FileNotFoundException, since you are using a relative path to access the file.
I think the main issue is that you are trying to read the file via the textFile method. What is inside the parentheses of textFile is evaluated in the driver program; on the worker node, only the code run against an RDD is executed. When you call textFile, an RDD object with a trivial associated DAG is created in your driver program, but nothing happens on the worker node.
Thus, when you try to collect the data, the worker is asked to read the file at the path you passed to textFile, as instructed by the driver. Since your file is in the local filesystem of the driver and the worker node doesn't have access to it, you get the FileNotFoundException.
The solution is to make the file available to the worker node, either by putting it into a distributed filesystem such as HDFS (or serving it via (s)ftp), or by transferring the file to the worker node before running the Spark job and then passing textFile the path of the file in the worker's filesystem.
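For example, a minimal Scala sketch of the HDFS route (the namenode address and path are placeholders, and the file is assumed to have been uploaded to HDFS first):

// every worker can resolve an HDFS path, unlike a path on the driver's local filesystem
val lines = sparkContext.textFile("hdfs://namenode:8020/data/data.txt").collect()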
