Issues running a python script with Pyspark - apache-spark

I am new to Spark and just wanted to check on a problem I am facing. My objective was to read a nested XML file, flatten it out and save it as a CSV file. I wrote the code and it works fine in the pyspark shell on my cluster: when I enter the code line by line in pyspark, I can see executors from different nodes being assigned the worker processes. The problem is that when I run the same code as a Python script, executors from the other nodes are not allotted; the worker process starts on the node I am running the script from and does not get parallelized. Consequently the processing takes much longer. I am attaching a screenshot of the warning with this post.
Has anyone faced this as well? Thank you in anticipation.
Also, I don't own this cluster; I am working on it for someone else, so I have no idea how many nodes there are.

I did get it working. I was not initializing the configuration properly: I had called setMaster("local") in the Spark configuration, which keeps everything in a single local JVM instead of requesting executors from the cluster.
I just removed that property and the app started distributing the work across the available executors even when running the script.
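For illustration, a minimal pyspark sketch of that fix (the app name is made up, and leaving the master to spark-submit is my assumption, not the poster's exact code):

from pyspark import SparkConf, SparkContext

# Do not hard-code .setMaster("local") here; let spark-submit or the cluster
# manager decide where the executors run.
conf = SparkConf().setAppName("flatten-xml-to-csv")
sc = SparkContext(conf=conf)

The script can then be launched with, for example, spark-submit --master yarn flatten_xml.py (the exact master URL depends on how the cluster is managed).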

Related

Unable to infer schema for Parquet only on Scheduled Job Run

I am running a notebook that executes other notebooks through the dbutils.notebook.run() command. Whenever I run this job manually it executes without any issues, but whenever the job runs nightly the ephemeral notebook runs return the error
org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.;
I was able to resolve the error in some other notebooks by increasing the number of workers on the cluster. I've tried doing that on this workflow as well without any luck, and I can't find any documentation that indicates it should be necessary anyway.
Any insight would be helpful.
Increasing the number of workers on the cluster pool fixed the problem. I am not certain of the correct number of workers needed per ephemeral run; it would seem a minimum of 2 per run is needed, and they are not necessarily returned immediately when the run is completed.
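As the error message itself hints, another mitigation (separate from the worker-pool fix above) is to supply the Parquet schema explicitly when reading; a minimal pyspark sketch with hypothetical field names and path:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema; replace with the actual columns of the Parquet data.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("event_name", StringType(), True),
])
df = spark.read.schema(schema).parquet("/mnt/data/events")  # placeholder path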

Spark - Is there a way to cleanup orphaned RDD files and block manager folders (using pyspark)?

I am currently running/experimenting with Spark in a Windows environment and have noticed a large number of orphaned blockmgr folders and rdd files. These are being created when I have insufficient memory to cache a full data set.
I suspect that they are being left behind when processes fail.
At the moment, I am manually deleting them from time to time (when I run out of disk space). I have also toyed around with a simple file operation script.
I was wondering: are there any pyspark functions or scripts available that would clean these up, or any way to check for them when a process is initiated?
Thanks
As per #cronoik, this was solved by setting the following property:
spark.worker.cleanup.enabled true
In my instance, using both 'local' and 'standalone' modes in a single-node Windows environment, I set this in the spark-defaults.conf file.
Refer to the documentation for more information: Spark Standalone Mode
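For the manual route mentioned in the question, a rough sketch of a cleanup script (my own assumption, not part of the answer; the temp path and age threshold are hypothetical):

import os, shutil, time

SPARK_TMP = r"C:\Users\me\AppData\Local\Temp"   # hypothetical; wherever spark.local.dir points
MAX_AGE_SECONDS = 24 * 3600                      # only touch folders older than a day

for name in os.listdir(SPARK_TMP):
    path = os.path.join(SPARK_TMP, name)
    # Orphaned cache folders typically look like "blockmgr-<uuid>" or "spark-<uuid>",
    # with the rdd_* files living inside them.
    if os.path.isdir(path) and (name.startswith("blockmgr-") or name.startswith("spark-")):
        if time.time() - os.path.getmtime(path) > MAX_AGE_SECONDS:
            shutil.rmtree(path, ignore_errors=True)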

Why am I getting out of memory errors only after several runs of my Spark Application?

I ran my spark application successfully twice after spinning up a fresh EMR cluster. After running a different Spark Application several times that DOES have out of memory issues, I ran the first spark application again and got out of memory errors.
I repeated this sequence of events three times and it happens every time. What could be happening? Shouldn't Spark free all memory between runs?
After a Spark program completes, it generates temporary directories that remain in the temp directory, so after several Spark applications have run it can give out-of-memory errors. There are some cleanup options which can solve this issue:
spark.worker.cleanup.enabled (default value is false),
spark.worker.cleanup.interval, spark.worker.cleanup.appDataTtl. For more details about these, please go through this document:
http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
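For example, the relevant lines in the worker's spark-defaults.conf might look like this (the interval and TTL values are only illustrative):
spark.worker.cleanup.enabled    true
spark.worker.cleanup.interval   1800
spark.worker.cleanup.appDataTtl 604800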

Output file is getting generated on slave machine in apache spark

I am facing an issue while running a Spark Java program that reads a file, does some manipulation and then generates an output file at a given path.
Everything works fine when the master and slaves are on the same machine, i.e. in standalone-cluster mode.
But the problem started when I deployed the same program on a multi-machine, multi-node cluster setup, i.e. the master is running at x.x.x.102 and a slave is running on x.x.x.104.
Both master and slave have shared their SSH keys and are reachable from each other.
Initially the slave was not able to read the input file; for that I came to know I needed to call sc.addFile() before sc.textFile(), which solved the issue. But now I see the output is being generated on the slave machine in a _temporary folder under the output path, i.e. /tmp/emi/_temporary/0/task-xxxx/part-00000.
In local cluster mode it works fine and generates the output file in /tmp/emi/part-00000.
I came to know that I need to use SparkFiles.get(), but I am not able to understand how and where to use this method.
Till now I am using
DataFrame dataObj = ...
dataObj.javaRDD().coalesce(1).saveAsTextFile("file:/tmp/emi");
Can anyone please let me know how to call SparkFiles.get()?
In short, how can I tell the slave to create the output file on the machine where the driver is running?
Please help.
Thanks a lot in advance.
There is nothing unexpected here. Each worker writes its own part of the data separately. Using the file scheme only means that the data is written to a file in a file system that is local from the worker's perspective.
Regarding SparkFiles, it is not applicable in this particular case. SparkFiles can be used to distribute common files to the worker machines, not to deal with the results.
If for some reason you want to perform the writes on the machine used to run the driver code, you'll have to fetch the data to the driver machine first (either with collect, which requires enough memory to fit all the data, or with toLocalIterator, which collects one partition at a time and requires multiple jobs) and then use standard tools to write the results to the local file system. In general, though, writing to the driver is not good practice and most of the time it is simply useless.
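For what it's worth, a minimal pyspark sketch of that driver-side approach (the question uses the Java API, but the idea is the same; the DataFrame and output path are placeholders):

# df is the DataFrame computed earlier; write its rows from the driver process
with open("/tmp/emi/part-00000", "w") as out:      # local path on the driver machine
    for row in df.toLocalIterator():               # fetches one partition at a time
        out.write(",".join(str(v) for v in row) + "\n")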

Spark: hdfs cluster mode

I'm just getting started with Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file which is kept in Hadoop (HDFS). I am using the textFile method from SparkContext; while the file is being processed I monitor the nodes and I can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none.
If instead of HDFS I use a local file, then both slaves work simultaneously.
I don't understand this behaviour. Please, can anybody give me a clue?
The main reason for that behavior is the concept of data locality. When Spark's application master asks for new executors, it tries to allocate them on the same nodes where the data resides.
I.e. in your case, HDFS has likely written all the blocks of the file to the same node, so Spark instantiates the executors on that node. If instead you use a local file, it will be present on all nodes, so data locality is no longer an issue.
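If that turns out to be the case, one way to spread the work anyway (my own suggestion, not part of the answer; it trades data locality for a shuffle) is to repartition the input:

rdd = sc.textFile("hdfs:///user/data/bigfile.txt")   # hypothetical path
print(rdd.getNumPartitions())                        # how many tasks Spark will schedule
rdd = rdd.repartition(sc.defaultParallelism)         # redistribute across all executors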

Resources