Loading a file in spark in standalone cluster - apache-spark

I have a four-node Spark cluster. One node is both master and slave, the other three are slaves. I have written a sample application which loads a file, creates a data frame and runs some Spark SQL. When I submit the application like below from the master node, it produces output:
./spark-submit /root/sample.py
But when I submit it with --master like below, it says "File does not exist":
./spark-submit --master spark://<IP>:PORTNO /root/sample.py
I am creating an RDD from a sample text file:
lines = sc.textFile("/root/testsql.txt");
Do I need to copy the file to all the nodes? How will this work for production systems, e.g. if I have to process some CDRs, where should I receive these CDRs?

You are right, it is not able to read that file, because the file doesn't exist on all the servers in the cluster.
You need to make sure the file is accessible via the same URL/path from all the nodes of the Spark cluster.
That is where a distributed file system like HDFS makes things a little easier, but you can do it even without one.
When you submit a Spark job to the master, the master allocates the required executors on the workers. Each of them will try to parallelize the task, which is what sc.textFile is telling it to do.
So the file path needs to be accessible from all nodes.
You can either mount the file on all nodes at the same location, or use a URL-based location to read the file. The basic requirement is that the file is available and readable from all nodes.
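For example, a minimal sketch of the two options mentioned above (the paths are illustrative, not from the original question):
# Option 1: the same mount point exists on every node of the cluster
lines = sc.textFile("file:///shared/mount/testsql.txt")
# Option 2: a URL on a distributed file system such as HDFS
lines = sc.textFile("hdfs:///data/testsql.txt")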

Related

How to make sure that spark.write.parquet() writes the data frame on to the file specified using relative path and not in HDFS, on EMR?

My problem is as below:
A pyspark script that runs perfectly on a local machine and on an EC2 instance is ported onto an EMR cluster for scaling up. There's a config file with relative locations for the outputs.
An example:
Config
feature_outputs= /outputs/features/
File structure:
classifier_setup
    feature_generator.py
    model_execution.py
    config.py
    utils.py
    logs/
    models/
    resources/
    outputs/
The code reads the config, generates features and writes them into the path mentioned above. On EMR, this is getting saved into HDFS (spark.write.parquet writes into HDFS; df.toPandas().to_csv(), on the other hand, writes to the relative output path mentioned). The next part of the script reads the same path mentioned in the config, tries to read the parquet from that location, and fails.
How do I make sure that the outputs are created in the relative path that is specified in the code?
If that's not possible, how can I make sure that I read them from HDFS in the subsequent steps?
I referred to related discussions on HDFS paths, however it's not very clear to me. Can someone help me with this?
Thanks.
Short answer to your question:
Writing with Pandas and writing with Spark are two different things. Pandas doesn't use Hadoop to process, read and write; it writes into the node's local file system on EMR, which is not HDFS. Spark, on the other hand, uses distributed computing to spread the work across multiple machines and is built on top of Hadoop, so by default when you write using Spark it writes into HDFS.
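As an illustration (a sketch, not the asker's actual code; the paths are made up), the two writes below end up in different filesystems on EMR:
# Spark resolves a relative path against the default filesystem, which is HDFS on EMR
df.write.parquet("outputs/features/")
# Pandas writes through ordinary Python file I/O, i.e. to the local filesystem of the driver node
df.toPandas().to_csv("outputs/features/features.csv")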
While writing from EMR, you can choose to write either into
EMR local filesystem,
HDFS, or
EMRFS (which is backed by S3 buckets).
Refer to the AWS documentation.
If at the end of your job you are writing a Pandas dataframe and you want to write it into an HDFS location (maybe because your next Spark step reads from HDFS, or for some other reason), you might have to use PyArrow for that; see the PyArrow filesystem documentation.
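A minimal sketch of that Pandas-to-HDFS hop via PyArrow (assuming pyarrow is installed and the Hadoop client libraries are available; the path, dataframe name and helper name are all illustrative):
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def write_pandas_to_hdfs(pdf, hdfs_path):
    # connect to the cluster's default namenode (reads fs.defaultFS from the Hadoop config)
    hdfs = fs.HadoopFileSystem("default")
    table = pa.Table.from_pandas(pdf)
    # stream the Parquet bytes straight into HDFS
    with hdfs.open_output_stream(hdfs_path) as out:
        pq.write_table(table, out)

write_pandas_to_hdfs(pandas_df, "/outputs/features/features.parquet")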
If at the end of your job you are writing into HDFS using a Spark dataframe, then in the next step you can read it back with an hdfs://<feature_outputs> style path.
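For instance, a one-line sketch of that read in the next step (the path is illustrative):
features_df = spark.read.parquet("hdfs:///outputs/features/")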
Also, while you are saving data into EMR HDFS, keep in mind that if you are using the default EMR storage, it is volatile: all the data will be lost once the EMR cluster is terminated. If you want to keep data beyond the cluster's lifetime, you might have to attach an external EBS volume that can be reused in another EMR cluster, or use some other storage solution that AWS provides.
If you are writing data that needs to be persisted, the best option is to write it into S3 instead of EMR.

Spark concurrent writes on same HDFS location

I have Spark code which saves a dataframe to an HDFS location (a date-partitioned location) in JSON format using append mode.
df.write.mode("append").format('json').save(hdfsPath)
sample hdfs location : /tmp/table1/datepart=20190903
I am consuming data from upstream in a NiFi cluster. Each node in the NiFi cluster creates a flow file for the consumed data, and my Spark code processes that flow file. As NiFi is distributed, my Spark code gets executed from different NiFi nodes in parallel, all trying to save data into the same HDFS location.
I cannot store the output of the Spark job in different directories, as my data is partitioned on date.
This process has been running once daily for the last 14 days, and my Spark job has failed 4 times with different errors.
First Error:
java.io.IOException: Failed to rename FileStatus{path=hdfs://tmp/table1/datepart=20190824/_temporary/0/task_20190824020604_0000_m_000000/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json; isDirectory=false; length=0; replication=3; blocksize=268435456; modification_time=1566630365451; access_time=1566630365034; owner=hive; group=hive; permission=rwxrwx--x; isSymlink=false} to hdfs://tmp/table1/datepart=20190824/part-00000-101aa2e2-85da-4067-9769-b4f6f6b8f276-c000.json
Second Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190825/_temporary/0 does not exist.
Third Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190901/_temporary/0/task_20190901020450_0000_m_000000 does not exist.
Fourth Error:
java.io.FileNotFoundException: File hdfs://tmp/table1/datepart=20190903/_temporary/0 does not exist.
Following are the problems/issues:
I am not able to recreate this scenario again. How to do that?
On all 4 occasions the errors are related to the _temporary directory. Is it because 2 or more jobs are trying to save data into the same HDFS location in parallel, and while doing that Job A might have deleted the _temporary directory of Job B? (Because of the same location, all the folders have the common name /_temporary/0/.)
If it is a concurrency problem then I can run all NiFi processors from the primary node, but then I will lose performance.
Need your expert advice.
Thanks in advance.
It seems the problem is that two spark nodes are independently trying to write to the same place, causing conflicts as the fastest one will clear up the working directory before the second one expects it.
The most straightforward solution may be to avoid this.
As I understand your NiFi and Spark setup, the node where NiFi runs also determines the node where Spark runs (there is a 1-to-1 relationship?).
If that is the case you should be able to solve this by routing the work in Nifi to nodes that do not interfere with each other. Check out the load balancing strategy (property of the queue) that depends on attributes. Of course you would need to define the right attribute, but something like directory or table name should go a long way.
Try enabling the v2 output committer algorithm:
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
It doesn't use a shared temp directory for files, but creates independent .sparkStaging-<...> temp directories for each write.
It also speeds up writes, but allows some rare hypothetical cases of partially written data.
Try to check this doc for more info:
https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#recommended-settings-for-writing-to-object-stores
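The same option can also be set when the SparkSession is built instead of at runtime; a short sketch (the app name is illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("nifi-to-hdfs-writer")
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())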

Spark Master filling temporary directory

I have a simple Spark app that reads some data, computes some metrics, and then saves the result (input and output are Cassandra tables). This piece of code runs at regular intervals (i.e., every minute).
I have a Cassandra/Spark cluster (Spark 1.6.1), and after a few minutes the temporary directory on the master node of the Spark cluster fills up and the master refuses to run any more jobs. I am submitting the job with spark-submit.
What is it that I am missing? How do I make sure that the master node removes the temporary folder?
Spark uses this directory as scratch space and writes temporary map output files there. This can be changed; take a look at spark.local.dir.
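A sketch of pointing that scratch space at a bigger disk (the path is illustrative; in standalone mode the SPARK_LOCAL_DIRS environment variable on the workers takes precedence):
from pyspark import SparkConf, SparkContext

# route shuffle/spill scratch files to a volume with more space
conf = SparkConf().set("spark.local.dir", "/mnt/big-disk/spark-tmp")
sc = SparkContext(conf=conf)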
Every time you submit your app, the jar is copied to all the workers in a new app directory. How big is your jar? Are you building a fat jar including the datastax driver jar? In that case I am guessing your app would be a few MB. Running it every minute will fill up your disk very quickly.
Spark has two parameters to control the cleaning of the app directories:
spark.worker.cleanup.interval, which controls how often Spark cleans up,
spark.worker.cleanup.appDataTtl, which controls how long an app directory should stay before being cleaned.
Both parameters are in seconds.
Hope this helps!
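These are worker daemon settings, so a common place to put them is SPARK_WORKER_OPTS in conf/spark-env.sh on the worker nodes; note that cleanup also has to be switched on via spark.worker.cleanup.enabled. A sketch with illustrative values:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=86400"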

Does Spark support accessing data from master node or worker node?

Is it possible to create an RDD using data from the master or a worker? I know that there is an option sc.textFile() which sources the data from the local system (driver); similarly, can we use something like "master:file://input.txt"? I am accessing a remote cluster, my input data size is large, and I cannot log in to the remote cluster.
I am not looking for S3 or HDFS. Please suggest if there is any other option.
Data in an RDD is always controlled by the Workers, whether it is in memory or located in a data-source. To retrieve the data from the Workers into the Driver you can call collect() on your RDD.
You should put your file on HDFS or a filesystem that is available to all nodes.
The best way to do this is, as you stated, to use sc.textFile. To do that you need to make the file available on all nodes in the cluster. Spark provides an easy way to do this via the --files option of spark-submit: simply pass the option followed by the path to the file that you need copied.
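A sketch of that --files route for a small file (the file and application names are illustrative); the shipped copy is located through SparkFiles rather than its original path:
# Submitted with (illustrative):
#   ./spark-submit --master spark://<IP>:PORTNO --files /root/input.txt sample.py
from pyspark import SparkFiles

local_copy = SparkFiles.get("input.txt")         # node-local download location
with open(local_copy) as f:                      # plain read on the driver
    rdd = sc.parallelize(f.read().splitlines())  # then parallelize into an RDD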
You can also access the file through Hadoop by creating a Hadoop configuration:
import org.apache.spark.deploy.SparkHadoopUtil
import java.io.{File, FileInputStream, FileOutputStream, InputStream}
// Reuse the Hadoop configuration that Spark was started with
val hadoopConfig = SparkHadoopUtil.get.conf
// Resolve the filesystem (local, HDFS, ...) that matches the file's URI
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(fileName), hadoopConfig)
val fsPath = new org.apache.hadoop.fs.Path(fileName)
Once you get the path you can copy, delete, move or perform any operations.

Where does spark look for text files?

I thought that loading text files is done only from the workers / within the cluster (you just need to make sure all workers have access to the same path, either by having that text file available on all nodes, or by using some shared folder mapped to the same path).
e.g. spark-submit / spark-shell can be launched from anywhere and connect to a Spark master, and the machine where you launched spark-submit / spark-shell (which is also where our driver runs, unless you are in "cluster" deploy mode) has nothing to do with the cluster. Therefore any data loading should be done only from the workers, not on the driver machine, right? e.g. there should be no way that sc.textFile("file:///somePath") will cause Spark to look for a file on the driver machine (again, the driver is external to the cluster, e.g. in "client" deploy mode / standalone mode), right?
Well, this is what I thought too...
Our cast
machine A: where the driver runs
machine B: where both spark master and one of the workers run
Act I - The Hope
When I start a spark-shell from machine B to spark master on B I get this:
scala> sc.master
res3: String = spark://machineB:7077
scala> sc.textFile("/tmp/data/myfile.csv").count()
res4: Long = 976
Act II - The Conflict
But when I start a spark-shell from machine A, pointing to spark master on B I get this:
scala> sc.master
res2: String = spark://machineB:7077
scala> sc.textFile("/tmp/data/myfile.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/data/myfile.csv
And indeed /tmp/data/myfile.csv does not exist on machine A, but machine A is not part of the cluster; it's just where the driver runs.
Act III - The Amazement
What’s even weirder is that if I make this file available on machine A, it doesn’t throw this error anymore. (Instead it creates a job, but no tasks, and just fails due to a timeout, which is another issue that deserves a separate question)
Is there something in the way Spark behaves that I'm missing? I thought that a spark-shell connected to a remote master has nothing to do with the machine it is running on. So why does the error stop when I make that file available on machine A? Does it mean that the location passed to sc.textFile is also resolved on the machine where spark-shell or spark-submit was launched (in my case also where the driver runs)? This makes zero sense to me, but again, I'm open to learning new things.
Epilogue
tl;dr - sc.textFile("file:/somePath") running from a driver on machine A against a cluster on machines B, C, D... (the driver is not part of the cluster)
It seems like it's looking for the path file:/somePath on the driver as well. Is that true (or is it just me)? Is that known? Is that by design?
I have a feeling that this is some weird network / VPN topology issue unique to my workplace network, but still this is what happens to me, and I'm utterly confused whether it is just me or a known behavior. (or I'm simply not getting how Spark works, which is always an option)
So the really short version of the answer is: if you reference "file://..." it should be accessible on all nodes in your cluster, including the driver program. Sometimes some bits of work happen on the worker. Generally the way around this is just not using local files, and instead using something like S3, HDFS, or another network filesystem. There is also the sc.addFile method, which can be used to distribute a file from the driver to all of the other nodes (you then use SparkFiles.get to resolve the download location).
Spark can look for files both locally and on HDFS.
If you'd like to read in a file using sc.textFile() and take advantage of its RDD format, then the file should sit on HDFS. If you just want to read a file the normal way, it works the same as any ordinary file read, depending on the API (Scala, Java, Python).
If you submit a local file with your driver, then addFile() distributes the file to each node and SparkFiles.get() downloads the file to a local temporary file.
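A minimal sketch of that addFile()/SparkFiles.get() mechanism in PySpark (the path is illustrative); the node-local copy is resolved inside the task so each executor opens its own download:
from pyspark import SparkFiles

sc.addFile("/tmp/data/myfile.csv")   # shipped from the driver to every node

def count_lines(_):
    # SparkFiles.get resolves against this executor's own download directory
    with open(SparkFiles.get("myfile.csv")) as f:
        return sum(1 for _ in f)

# run the read inside a single task to demonstrate executor-side access
print(sc.parallelize([0], 1).map(count_lines).first())
For anything bigger than a small side file, the answers above still point at HDFS, S3 or another shared filesystem as the better fit.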
