How can I prune executors' logs in spark streaming - apache-spark

I'm working on a Spark Streaming job that runs in standalone mode. By default the executors append their logs to the stderr and stdout files under $SPARK_HOME/work/app_idxxxx/. The problem appears when the app runs for a long time, say a month or more, and generates a lot of output in the stderr file. I would like to roll the stderr file daily, keep a week of files, and archive (delete) anything older. I changed log4j.properties to use org.apache.log4j.RollingFileAppender and directed the logs to a file instead of stderr, but the file doesn't respect the rolling settings and keeps growing.
A cron job doesn't work either, since Spark holds a handle to that specific file and renaming it probably has no effect.
I couldn't find any documentation for these specific logs. I would really appreciate any help.

After digging more, I finally found how to resolve the issue. I'm posting it here so that the next person doesn't have to go through the same trial and error.
The settings for those logs live in two different places. First, in $SPARK_HOME/conf/spark-defaults.conf on each worker node, add these three lines:
spark.executor.logs.rolling.time.interval daily
spark.executor.logs.rolling.strategy time
spark.executor.logs.rolling.maxRetainedFiles 7
The other file to change on each worker node is $SPARK_HOME/conf/spark-env.sh. Add the following lines:
SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800
-Dspark.worker.cleanup.appDataTtl=864000
-Dspark.executor.logs.rolling.strategy=time
-Dspark.executor.logs.rolling.time.interval=daily
-Dspark.executor.logs.rolling.maxRetainedFiles=7 "
export SPARK_WORKER_OPTS
After these changes it started working properly. Hope this helps some people :)
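The numeric values in the worker options above are in seconds, which is easy to misread. As a rough sketch (the helper name build_worker_opts and its parameters are hypothetical, not part of Spark), this Python shows how the same option string can be assembled from more readable units. The property names are the ones used in this answer.

```python
# Hypothetical helper: assembles the value of SPARK_WORKER_OPTS shown above.
SECONDS_PER_DAY = 24 * 60 * 60


def build_worker_opts(retained_files=7, cleanup_interval_minutes=30, app_data_ttl_days=10):
    """Return the -D flags as one space-separated string."""
    settings = {
        "spark.worker.cleanup.enabled": "true",
        # Both cleanup values are expressed in seconds.
        "spark.worker.cleanup.interval": cleanup_interval_minutes * 60,
        "spark.worker.cleanup.appDataTtl": app_data_ttl_days * SECONDS_PER_DAY,
        "spark.executor.logs.rolling.strategy": "time",
        "spark.executor.logs.rolling.time.interval": "daily",
        "spark.executor.logs.rolling.maxRetainedFiles": retained_files,
    }
    return " ".join(f"-D{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(build_worker_opts())
```

With the defaults, this gives 1800 seconds (30 minutes) between cleanup runs and 864000 seconds (10 days) of retention for application directories, matching the values above.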

If you are in standalone mode, exporting an environment variable is enough:
export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily -Dspark.executor.logs.rolling.maxRetainedFiles=7"
You can also refer to: http://apache-spark-user-list.1001560.n3.nabble.com/Executor-Log-Rotation-Is-Not-Working-td18024.html

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help, the same error remains.
According to the Spark Configuration docs, there are other settings I could try, e.g. spark.submit.pyFiles, but I don't understand what the py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. To fix it, I followed this article. I also took the source code from flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
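On the question of what the py-files have to look like: per the Spark configuration docs, spark.submit.pyFiles takes .zip, .egg, or .py files that are placed on the PYTHONPATH, so a script that merely imports flashtext would not ship the library itself. A minimal sketch, using only the standard library, of zipping a pure-Python package so that the package folder sits at the archive root, then checking that it imports from the zip. The package name my_udfs and its contents are made up for illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a pure-Python package so the package folder sits at the archive root."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w") as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to the parent, e.g. my_udfs/helpers.py
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def demo():
    """Build a throwaway package, zip it, and import it from the zip."""
    workdir = tempfile.mkdtemp()
    package_dir = os.path.join(workdir, "my_udfs")  # hypothetical package name
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
        handle.write("")
    with open(os.path.join(package_dir, "helpers.py"), "w") as handle:
        handle.write("def shout(text):\n    return text.upper()\n")

    zip_path = zip_package(package_dir, os.path.join(workdir, "my_udfs.zip"))
    # Putting the zip on sys.path roughly imitates having it on PYTHONPATH.
    sys.path.insert(0, zip_path)
    helpers = importlib.import_module("my_udfs.helpers")
    return helpers.shout("ok")


if __name__ == "__main__":
    print(demo())
```

A zip built this way could then be passed through spark.submit.pyFiles or SparkContext.addPyFile. This only covers pure-Python code; packages with compiled extensions need a different approach.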

Junk Spark output file on S3 with dollar signs

I have a simple Spark job that reads a file from S3, takes five lines, and writes them back to S3.
What I see is that there is always an additional object in S3, next to my output "directory", called output_$folder$.
What is it? How can I prevent Spark from creating it?
Here is some code to show what I am doing...
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have s3 "directory" called output which contains results and another s3 object called output_$folder$ which I don't know what it is.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
OK, it seems I found out what it is.
It is some kind of marker file, probably used to determine whether the S3 directory object exists.
How did I reach this conclusion?
First, I found this link that shows the source of
org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I searched other source repositories to see whether I could find a different version of the method. I didn't.
Finally, I ran an experiment: I removed the S3 output directory object but left the output_$folder$ file, then reran the same Spark job. The job failed, saying that the output directory already exists.
My conclusion is that this is Hadoop's way of knowing whether a directory with a given name exists in S3, and I will have to live with that.
All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
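If the marker objects get in the way of tooling that lists the bucket, they can be told apart by their suffix. A minimal sketch in plain Python, assuming you already have a list of object keys from whatever S3 client you use. The function name and the sample keys are made up for illustration.

```python
MARKER_SUFFIX = "_$folder$"


def split_marker_keys(keys):
    """Separate object keys into (data_keys, marker_keys) by the marker suffix."""
    data_keys, marker_keys = [], []
    for key in keys:
        if key.endswith(MARKER_SUFFIX):
            marker_keys.append(key)
        else:
            data_keys.append(key)
    return data_keys, marker_keys


if __name__ == "__main__":
    # Hypothetical listing, shaped like the output described above.
    listing = [
        "dimensions/output/part-00000",
        "dimensions/output/_SUCCESS",
        "dimensions/output_$folder$",
    ]
    data, markers = split_marker_keys(listing)
    print(data)
    print(markers)
```

Note the experiment above before deleting anything: the marker is what the file system checks to decide whether the directory exists.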
s3n:// and s3a:// don't generate a marker object like <output>_$folder$.
If you are using Hadoop on AWS EMR, I found that moving from s3 to s3n is straightforward since they both use the same file system implementation, whereas s3a involves AWS credential related code changes.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

Generate auto increment sequence in logstash

I am pushing logs to Elasticsearch from Logstash, and then I need to get the logs back in the order they were written. Sorting by timestamp does not help because there could be multiple log statements with the same timestamp. I followed the solution in Include monotonically increasing value in logstash field? and it worked perfectly on my Windows system.
But when the code was moved to the Linux production environment, Logstash would not start. It fails with the error below:
reason=>"Couldn't find any filter plugin named 'seq'. Are you sure
this is correct? Trying to load the seq filter plugin resulted in this
error: no such file to load -- logstash/filters/seq", :level=>:error}
Check if the seq.rb file is in the filter folder.
Also check that the line endings of your seq.rb are Linux-style (LF). If you transferred the file from a Windows machine to a Linux one, the problem might come from there.
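A quick way to check for, and if needed fix, Windows line endings is sketched below in Python. The function names are made up, and the path you pass in is whatever location your seq.rb actually has.

```python
def to_lf(data):
    """Convert Windows (CRLF) line endings in a bytes object to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file_to_lf(path):
    """Rewrite the file with LF endings; return True if anything changed."""
    with open(path, "rb") as handle:
        original = handle.read()
    converted = to_lf(original)
    if converted == original:
        return False
    with open(path, "wb") as handle:
        handle.write(converted)
    return True
```

Calling convert_file_to_lf on the plugin file returns False when the file was already fine, which rules out line endings as the cause.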

Best way to manually periodically import log files into Graylog using logstash

I'm currently using Logstash to import dozens of log files from different webapps into Graylog. It works great, and the files are tagged so I know which webapp they originate from.
I can't change the webapp thus I can't add a GELF appender to the log4j conf of the webapp. The idea is to periodically retrieve the log files, parse them and import them with logstash into Graylog.
My problem is how do I make sure I don't import a log event I've already imported.
For example, I have a log file that has a log pattern that increments: log.1, log.2, etc. So I'll have log events that could be in log.1 the first time and 2 weeks later when I reimport them they'll maybe be in log.3.
I'm afraid I can't handle that with logstash's file input "sincedb_path" and "start_position".
So here are a few options I've gathered and I'd like your input about them, if anyone encountered the same issue:
Use a Logstash filter that drops all events before a certain date. This requires keeping an index of the last log date of every imported file (potentially 50+) and writing a lot of configuration.
Use a Drools rule in Graylog to refuse logs with timestamps earlier than the last log received for a given type.
Ask to change the log pattern to something like log.date instead of a pattern that renames files (but I'd rather avoid this one).
Any other idea?
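To make the first option concrete, here is a rough Python sketch of the bookkeeping it implies: remember the newest timestamp already imported for a given webapp and skip anything at or before it. This is not Logstash configuration. It assumes every line starts with a timestamp in the format shown, which will not hold for multi-line entries such as stack traces, and the function name is made up.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp prefix of each line


def filter_new_events(lines, last_seen):
    """Return (new_lines, updated_last_seen) for one webapp's log lines.

    last_seen is the newest timestamp already imported, or None on first run.
    """
    new_lines = []
    newest = last_seen
    for line in lines:
        stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        if last_seen is not None and stamp <= last_seen:
            continue  # already imported in an earlier run
        new_lines.append(line)
        if newest is None or stamp > newest:
            newest = stamp
    return new_lines, newest
```

Because the comparison is by timestamp only, it does not matter whether an event now sits in log.1 or log.3. The trade-off is that events sharing the exact timestamp of the last imported one are dropped on the next run.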

Old logs are not imported into ES by logstash

When I start logstash, the old logs are not imported into ES.
Only the new request logs are recorded in ES.
I've now seen this in the docs.
Even if I set start_position=>"beginning", old logs are not inserted.
This only happens when I run Logstash on Linux.
If I run it with the same config on Windows, old logs are imported.
I don't even need to set start_position=>"beginning" on Windows.
Any idea about this?
When Logstash reads an input log file, it keeps a record of the position it has reached in that file. This record is called the sincedb. From the documentation:
Where to write the sincedb database (keeps track of the current position of monitored log files).
The default will write sincedb files to some path matching "$HOME/.sincedb*"
So, if you want to import old log files, you must delete all the .sincedb* files in your $HOME.
Then, you need to set
start_position=>"beginning"
in your configuration file.
Hope this can help you.
Please also note this line from the documentation:
This option only modifies "first contact" situations where a file is new and not seen before. If a file has already been seen before, this option has no effect.
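Before deleting anything, it can help to see which sincedb files exist. A minimal Python sketch, assuming the default location quoted above ("$HOME/.sincedb*"). The function name is made up, and a sincedb_path set in your config would put the files elsewhere.

```python
import glob
import os


def find_sincedb_files(home=None):
    """List sincedb files under the given home directory (default: the user's home)."""
    if home is None:
        home = os.path.expanduser("~")
    return sorted(glob.glob(os.path.join(home, ".sincedb*")))


if __name__ == "__main__":
    for path in find_sincedb_files():
        print(path)
```

Run it as the same user that runs Logstash, since the home directory depends on the account.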
