Spark master/worker not writing logs within History Server configuration - apache-spark

I'm quite new to Spark setup. I want to persist event logs for each separate Spark cluster I run. In my setup, the /history-logs dir is mounted from a different location depending on the cluster name. The directory's permissions allow read-write for the spark user (751).
The config file at $SPARK_HOME/conf/spark-defaults.conf on both master and workers is as follows:
spark.eventLog.enabled true
spark.eventLog.dir file:///history-logs
spark.history.fs.logDirectory file:///history-logs
I'm connecting from Zeppelin and running a simple piece of code:
val rdd = sc.parallelize(1 to 5)
println(rdd.sum())
No files are being written to the folder though.
If I configure similar parameters in the Zeppelin interpreter itself, I at least see that an application log file is created in the directory.
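A quick way to check which settings the driver actually picked up, runnable from a Zeppelin paragraph or spark-shell (a sketch; "<not set>" is just a fallback default):
// Sketch: print the event-log settings as seen by the running driver.
println(sc.getConf.get("spark.eventLog.enabled", "<not set>"))
println(sc.getConf.get("spark.eventLog.dir", "<not set>"))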
Is it possible to save the logs from the master/workers on a per-cluster basis? I might be missing something obvious.
Thanks!

Related

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster became operational. The logs are stored in external object storage (MinIO, accessed via the s3a protocol).
Now, whenever I submit Spark jobs, it seems like nothing is being written to the location I'm specifying.
I'm wondering about the following: how can the workers know that I have started the Spark History Server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to MinIO: this shouldn't be the case, as I'm running spark-submit jobs that read/write files to the same MinIO using the same settings (a quick way to double-check this is sketched below)
Logs folder does not exist: I was getting these errors before, but then I created the location for the files to be written, and I haven't seen that error since
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are
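For what it's worth, one way to rule out the permissions angle from inside a job itself is to write a small test file to the configured location using the job's own Hadoop/s3a settings. A minimal sketch, assuming a SparkSession named spark and a placeholder bucket name:
// Sketch: check that the driver can write to the event-log location using
// the same Hadoop configuration (s3a keys, endpoint) as the job itself.
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val eventLogDir = "s3a://spark-logs/"   // placeholder, not the real bucket
val fs = FileSystem.get(new URI(eventLogDir), spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path(eventLogDir + "write-test.txt"))
out.writeBytes("ok")
out.close()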
I just found out the answer: the way your workers know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in the spark-defaults.conf used by the driver program, which is why I couldn't find much info on this: I simply hadn't added them to my spark-defaults.conf.
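For illustration, the same settings can also be applied programmatically in the driver before the SparkContext is created, which is equivalent to passing them with spark-submit --conf. A minimal sketch; the s3a bucket below is a placeholder, not a value from the question:
// Sketch: event-log settings applied on the SparkSession builder in the driver.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("event-logged-job")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3a://spark-logs/")            // placeholder bucket
  .config("spark.history.fs.logDirectory", "s3a://spark-logs/") // same location
  .getOrCreate()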

Which directory do Spark applications on YARN write their logs to: spark.eventLog.dir or /var/log/ on each node?

I am building a log analysis platform to monitor Spark jobs on a YARN cluster, and I want to get a clear idea of how Spark/YARN logging works.
I have searched a lot about this, and these are the confusions I have:
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can we customize those logs through log4j.properties in the Spark conf?
By default, all data nodes write their executor logs to a folder under /var/log/. With log aggregation enabled, can you also get those executor logs into the spark.eventLog.dir location?
I've managed to set up a 3-node virtual Hadoop YARN cluster, with Spark installed on the master node. When I run Spark in client mode, I'm thinking this node becomes the application master node.
I'm a beginner to Big data and appreciate any effort to help me out with these confusions.
Spark log4j logging is written to the Yarn container stderr logs. The directory for these is controlled by yarn.nodemanager.log-dirs configuration parameter (default value on EMR is /var/log/hadoop-yarn/containers).
(spark.eventLog.dir is only used for the Spark History Server to display the web UI after a job has finished: Spark writes the events that encode the information shown in the UI to that persisted storage.)
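To make the split concrete, the relevant settings look roughly like this, in key=value shorthand (yarn.nodemanager.log-dirs actually lives in yarn-site.xml, and the HDFS path is just a placeholder):
yarn.nodemanager.log-dirs=/var/log/hadoop-yarn/containers
spark.eventLog.dir=hdfs:///spark-history
spark.history.fs.logDirectory=hdfs:///spark-history
With YARN log aggregation enabled, the container logs of a finished application can also be retrieved with yarn logs -applicationId <application_id>.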

Why do Apache Spark nodes need access to datafile path?

I am new to Apache Spark.
I have a cluster with a master and one worker, and I am connected to the master with pyspark (all of them run on Ubuntu VMs).
I am reading this documentation: RDD external-datasets
in particular I have executed:
distFile = sc.textFile("data.txt")
I understand that this creates an RDD from the file, which should be managed by the driver, i.e. by the pyspark app.
But the doc states:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
The question is: why do the workers need access to the file path if the RDD is created by the driver only (and afterwards distributed to the nodes)?
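One way to see what is going on (a sketch in Scala rather than pyspark; the path is a placeholder): textFile is lazy, so the driver mostly just records the path and how to split the file, while the actual reading happens inside the tasks that run on the executors once an action is called, which is why each worker needs to be able to open that same path.
// Sketch: the file is not read when textFile is called on the driver;
// it is read by tasks on the executors when an action (count) runs.
val distFile = sc.textFile("file:///data.txt")  // placeholder path; lazy, no data read yet
val lineCount = distFile.count()                // reads happen here, on the workers
println(lineCount)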

Loading Spark Config for testing Spark Applications

I've been trying to test a Spark application on my local laptop before deploying it to a cluster (to avoid having to package and deploy my entire application every time), but I'm struggling to load the Spark config file.
When I run my application on a cluster, I am usually providing a spark config file to the application (using spark-submit's --conf). This file has a lot of config options because this application interacts with Cassandra and HDFS. However, when I try to do the same on my local laptop, I'm not sure exactly how to load this config file. I know I can probably write a piece of code that takes the file path of the config file and just goes through and parses all the values and sets them in the config, but I'm just wondering if there are easier ways.
Current status:
I placed the desired config file in my SPARK_HOME/conf directory and called it spark-defaults.conf ---> This didn't get applied; however, the exact same file works fine when using spark-submit.
For local mode, I'm setting the Spark master to "local[2]" when I create the Spark session, so I'm wondering if it's possible to create this session with a specified config file.
Did you add the --properties-file flag pointing at your spark-defaults.conf as an argument in your IDE's run configuration?
In the official documentation (https://spark.apache.org/docs/latest/configuration.html) there are repeated references to 'your default properties file'. Some options cannot be set inside your application because the JVM has already started. And since the conf directory is only read by spark-submit, I suppose you have to explicitly load the configuration file when running locally.
This problem has been discussed here:
How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?
Not sure if this will help anyone, but I ended up reading the conf file from a test resources directory and then setting all the values as system properties (an approach copied from the Spark source code):
// _sparkConfs is just a Map[String, String] populated from reading the conf file
for {
  (k, v) <- _sparkConfs
} {
  System.setProperty(k, v)
}
This is essentially emulating the --properties-file option of spark-submit to a certain degree (as far as I understand, it works because SparkConf picks up any JVM system properties that start with spark.). By doing this, I was able to keep this logic in my test setup and didn't need to modify the existing application code.
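For completeness, here is roughly what that can look like end to end, as a sketch that builds the config directly on a local SparkSession instead of going through system properties; the conf path, app name, and master are placeholders:
// Sketch: load a spark-defaults-style properties file and apply its entries
// when building a local SparkSession for tests.
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

val props = new Properties()
props.load(new FileInputStream("src/test/resources/spark-test.conf"))  // placeholder path

val builder = SparkSession.builder().master("local[2]").appName("local-test")
props.asScala.foreach { case (k, v) => builder.config(k, v) }
val spark = builder.getOrCreate()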

Spark: hdfs cluster mode

I'm just getting started with Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file which is kept in Hadoop (HDFS). I am using the textFile method from SparkContext; while the file is being processed I monitor the nodes, and I can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none.
If instead of HDFS I use a local file, then both slaves work simultaneously.
I don't understand this behaviour. Can anybody give me a clue?
The main reason for that behavior is data locality. When Spark's application master asks for new executors, it tries to allocate them on the nodes where the data resides.
In your case, HDFS has likely written all the blocks of the file to the same node, so Spark instantiates the executors on that node. If you use a local file instead, it is present on every node, so data locality is no longer a factor.
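If you want to verify this, one option (a sketch; the HDFS path is a placeholder) is to look at the preferred locations Spark computes for each partition of the HDFS-backed RDD; if every partition reports the same host, only that worker gets data-local tasks:
// Sketch: print the preferred (data-local) hosts for each partition.
val rdd = sc.textFile("hdfs:///data/bigfile.txt")  // placeholder path
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}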
