Write spark event log to local filesystem instead of hdfs - apache-spark

I want to redirect the event log of my Spark applications to a local directory like "/tmp/spark-events" instead of "hdfs://user/spark/applicationHistory".
I set the "spark.eventLog.dir" variable to "file:///tmp/spark-events" in Cloudera Manager (Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf).
But when I restart Spark, the generated spark-defaults.conf contains (spark.eventLog.dir=hdfs://nameservice1file:///tmp/spark-eventstmp/spark) and this does not work.
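For reference, a minimal sketch of what the raw entries in spark-defaults.conf should end up as; the garbled value above suggests the snippet was concatenated with the existing hdfs://nameservice1 default rather than replacing it:

spark.eventLog.enabled=true
spark.eventLog.dir=file:///tmp/spark-events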

Related

spark.worker.cleanup does not work, logs are not deleted

I want to periodically clean up the log files stored in ${SPARK_HOME}/logs for our Spark cluster (1 master + 4 workers).
Since I didn't configure SPARK_LOG_DIR in spark-env, the default log directory is ${SPARK_HOME}/logs, so all logs are stored there.
To test this, I added the configuration below (spark.worker.cleanup.enabled) on one of the worker nodes:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.interval=300 -Dspark.worker.cleanup.appDataTtl=300"
I then executed stop-slave.sh to stop the worker node, and started the worker again with start-slave.sh.
But the log files in ${SPARK_HOME}/logs are not deleted after the configured interval.
Am I doing the right steps, or is something more required? I also put the spark.worker.cleanup configuration in the master node's spark-env.sh, and I don't see any effect there either.
It turns out I was confused about which folder gets cleaned up. Per the Spark documentation, spark.worker.cleanup.enabled only cleans up the worker's APPLICATION directories.
Our application directory is located at "spark-2.3.3-bin-hadoop2.7/work", and that directory did get cleaned up.
So after changing spark-env.sh, running stop-slave.sh, and then start-slave.sh again, everything works as expected.
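A quick way to sanity-check the cleanup on a worker (the application-ID directory name below is illustrative):

ls -lt spark-2.3.3-bin-hadoop2.7/work/
# e.g. app-20190501120000-0000/ disappears once it is older than spark.worker.cleanup.appDataTtl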

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone Spark cluster? How can I log from executors? Logging from the driver is easy using the log4j logger that we can derive from the SparkContext.
But how can I log from within an RDD's foreach or foreachPartition? Is there any way I can collect these logs and print them?
The answer is to import Python's logging module and write the messages with it; the logged messages will appear in the work directory created under the Spark installation location.
Nothing else is needed.
I went crazy modifying the log4j.properties file and adding --driver-java-options and spark.executor.extraJavaOptions, but none of that was necessary.
In your Spark program, import logging and add log messages straight away, for example:
logging.warning("my message, value to check: %s", some_value)
Then navigate to the work directory: if Spark is installed at /home/vagrant/spark, we are talking about the /home/vagrant/spark/work directory.
There will be a directory for each application, and the executors used for the application have numbered subdirectories (0, 1, 2, 3, etc.). You have to check each one; the stderr of whichever executor ran your task will contain the logging messages.
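For illustration, a minimal PySpark sketch of the approach (all names are illustrative): logging from inside foreachPartition, with the messages landing in each executor's stderr under the work directory described above.

import logging
from pyspark.sql import SparkSession

def process_partition(rows):
    # Runs on the executor; messages go to the executor's stderr file,
    # i.e. <spark install>/work/<app-id>/<executor-id>/stderr
    for row in rows:
        logging.warning("processing row: %s", row)

spark = SparkSession.builder.appName("executor-logging-demo").getOrCreate()
spark.sparkContext.parallelize(range(10), 2).foreachPartition(process_partition)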
Hope this helps you see user-logged messages on the executors when using Spark standalone cluster mode.

Apache Spark: resulting file being created at worker node instead of master node

I configured one master on my local PC and a worker node inside VirtualBox, and the result file is created on the worker node instead of being sent back to the master node. I wonder why that is.
Is it because my worker node cannot send the result back to the master node? How can I verify that?
I use Spark 2.2.
I use the same username on the master and worker nodes.
I also configured passwordless SSH.
I tried both --deploy-mode client and --deploy-mode cluster.
I tried once, then swapped the master and worker nodes, and got the same result.
val result = joined.distinct()
result.write.mode("overwrite").format("csv")
.option("header", "true").option("delimiter", ";")
.save("file:///home/data/KPI/KpiDensite.csv")
Also, I load the input file like this:
val commerce = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true")
.option("delimiter", "|").load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()
But why must I place the input file on both the master and worker nodes at the same path? I'm not using YARN or Mesos right now.
You are exporting to a local file system, which tells Spark to write to the file system of the machine running the code. On a worker, that will be the worker machine's file system.
If you want the data to be stored on the file system of the driver (note: not the master; you'll need to know where the driver is running on your cluster), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file.
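A hedged PySpark sketch of that collect-and-write approach (result and the output path are taken from the question; this is only viable when the result fits in driver memory):

import csv

rows = result.collect()  # pull all rows back to the driver
with open("/home/data/KPI/KpiDensite.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(result.columns)           # header row
    writer.writerows(tuple(r) for r in rows)  # data rows, now on the driver's disk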
If you're running your application in cluster mode, though, the easiest option is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (via JDBC or a NoSQL store).

Apache Spark FileNotFoundException

I am trying to play around a little with apache-spark in cluster mode.
My cluster consists of a driver on my machine, and a worker and manager on a host machine (a separate machine).
I send a text file using sparkContext.addFile(filepath), where filepath is the path of my text file on the local machine, for which I get the following output:
INFO Utils: Copying /home/files/data.txt to /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
INFO SparkContext: Added file /home/files/data.txt at http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
But when I try to access the same file using SparkFiles.get("data.txt"), I get the path to the file on my driver instead of the worker.
I am setting up my file like this:
SparkConf conf = new SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting FileNotFoundException here.
I recently faced the same issue, and hopefully my solution can help other people solve it.
We know that when you use SparkContext.addFile(<file_path>), it sends the file to automatically created working directories on the driver node (in this case, your machine) as well as on the worker nodes of the Spark cluster.
The block of code you shared, where you use SparkFiles.get("data.txt"), is executed on the driver, so it returns the path to the file on the driver, not on the worker. But the task runs on the worker, and the file's path on the driver does not match its path on the worker, because the driver and worker nodes have different working directory paths. Hence, you get the FileNotFoundException.
There is a workaround to this problem without using any distributed file system or FTP server: put the file in your working directory on the host machine. Then, instead of using SparkFiles.get("data.txt"), use "./data.txt":
List<String> file = sparkContext.textFile("./data.txt").collect();
Now, even though there is a mismatch of working directory paths between the Spark driver and worker nodes, you will NOT face a FileNotFoundException, since you are using a relative path to access the file.
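To illustrate the same point, here is a PySpark sketch (names are illustrative) where SparkFiles.get() is called inside the task, so it resolves against the executor's own working directory rather than the driver's:

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="addfile-demo")
sc.addFile("/home/files/data.txt")  # path on the driver machine

def read_shipped_copy(_):
    # Resolves on the executor, where addFile placed a local copy
    path = SparkFiles.get("data.txt")
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

lines = sc.parallelize([0], 1).flatMap(read_shipped_copy).collect()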
I think the main issue is that you are trying to read the file via the textFile method. What is inside the brackets of textFile is executed in the driver program; on the worker node, only the code to be run against an RDD is executed. When you call textFile, an RDD object with a trivial associated DAG is created in your driver program, but nothing happens on the worker node.
Thus, when you try to collect the data, the worker is asked to read the file at the URL you passed to textFile, as instructed by the driver. Since your file is on the local filesystem of the driver and the worker node doesn't have access to it, you get the FileNotFoundException.
The solution is to make the file available to the worker node, either by putting it into a distributed filesystem such as HDFS or via (S)FTP, or by transferring the file to the worker node before running the Spark job and then passing textFile the path of the file on the worker's filesystem.

real time log processing using apache spark streaming

I want to create a system where I can read logs in real time and use Apache Spark to process them. I am confused about whether I should use something like Kafka or Flume to pass the logs to the Spark stream, or whether I should pass the logs using sockets. I have gone through a sample program in the Spark Streaming documentation (Spark stream example), but I would be grateful if someone could show me a better way to pass logs to a Spark stream. It's kind of new turf for me.
Apache Flume may help to read the logs in real time.
Flume provides log collection and transport to the application, where Spark Streaming is used to analyze the required information.
1. Download Apache Flume from the official site or follow the instructions from here
2. Set up and run Flume
Modify flume-conf.properties.template in the directory where Flume is installed (FLUME_INSTALLATION_PATH\conf); here you need to define the log source, the channel, and the sink (output). More details about the setup are here
Here is an example of launching Flume so that it collects log information from a ping command running on a Windows host and writes it to a file:
flume-conf.properties
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = for() { ping google.com }
agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.type = file_roll
agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
agent.sinks.loggerSink.serializer = text
agent.sinks.loggerSink.appendNewline = false
agent.sinks.loggerSink.rollInterval = 0
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
To run the example, go to FLUME_INSTALLATION_PATH and execute:
java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent
Alternatively, you can create your own Java application with the Flume libraries on the classpath and invoke org.apache.flume.node.Application from it, passing the corresponding arguments.
How to set up Flume to collect and transport logs?
You can use a script to gather logs from a specified location:
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = your script here
Instead of a Windows script, you can also launch a Java application (put 'java path_to_main_class arguments' in the command field) that provides smarter log collection; for example, if the file is modified in real time, you can use Tailer from Apache Commons IO.
To configure Flume to transport the log information, read this article
3. Get the Flume stream in your source code and analyze it with Spark.
Take a look at the code sample on GitHub: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java
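For comparison, a minimal PySpark counterpart of that example (assumes Spark 2.x, where the pyspark.streaming.flume module and the matching spark-streaming-flume package are still available; host and port are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="flume-log-count")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Flume's Avro sink should be configured to push to this host:port
stream = FlumeUtils.createStream(ssc, "localhost", 9999)
stream.count().pprint()  # print the number of events per batch

ssc.start()
ssc.awaitTermination()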
You can use Apache Kafka as a queuing system for your logs. The system that generates your logs, e.g. a webserver, sends them to Apache Kafka. Then you can use Apache Storm or the Spark Streaming library to read from the Kafka topic and process the logs in real time.
You need to create a stream of logs, which you can do using Apache Kafka. Integrations are available for Kafka with both Storm and Apache Spark; each has its pros and cons.
For Storm Kafka Integration look here
For Apache Spark Kafka Integration take a look here
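As a rough sketch of the Spark side (assumes Spark 2.x with the spark-streaming-kafka-0-8 package; the broker address and topic name are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-log-processor")
ssc = StreamingContext(sc, 5)  # 5-second batches

# Direct stream: each record is a (key, value) pair; the value holds the log line
stream = KafkaUtils.createDirectStream(
    ssc, ["weblogs"], {"metadata.broker.list": "localhost:9092"})
errors = stream.map(lambda kv: kv[1]).filter(lambda line: "ERROR" in line)
errors.pprint()

ssc.start()
ssc.awaitTermination()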
Although this is an old question, here is a link from Databricks with a great step-by-step article on log analysis with Spark that covers many areas:
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/index.html
Hope this helps.
