How to change Flink's log directory - log4j

I understand that Flink uses log4j to manage its logs, so I changed the log settings in log4j.properties, where I set the output location. However, when I start the job manager, the logs still go to the default location rather than the one I configured. How can I change Flink's log location gracefully?

The default log directory is set via bin/config.sh. Look for FLINK_LOG_DIR. You can just update the script to change the default log directory.
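For illustration, a hedged sketch of that edit (the exact surrounding lines in bin/config.sh differ between Flink versions, and /var/log/flink is just an example path):
# in bin/config.sh
FLINK_LOG_DIR=/var/log/flink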

Add the following line to flink-conf.yaml, which can be found in the conf directory of the Flink installation:
env.log.dir: /var/log/flink
where /var/log/flink is the directory you want to use for logs.
Note that Flink does not seem to support full YAML syntax, so
env:
  log:
    dir: /var/log/flink
will not work!

Since 1.0.3 you can set env.log.dir to change the directory where the logs are saved.

Related

How do we copy a file from Hadoop to ABFS remotely

How do we copy files from Hadoop to ABFS (Azure Blob File System)?
I want to copy from the Hadoop file system to an ABFS file system, but it throws an error.
This is the command I ran:
hdfs dfs -ls abfs://....
ls: No FileSystem for scheme "abfs"
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
Any idea how this can be done?
In core-site.xml you need to add a config property fs.abfs.impl with the value org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem, and then add any other related authentication configuration it may need.
More details on installation/configuration here - https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
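A hedged sketch of what that core-site.xml entry can look like (the storage account name and the shared-key auth entry are placeholders; use whichever auth mechanism applies to you):
<property>
  <name>fs.abfs.impl</name>
  <value>org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</value>
</property>
<!-- example shared-key auth entry; account name and key are placeholders -->
<property>
  <name>fs.azure.account.key.YOURACCOUNT.dfs.core.windows.net</name>
  <value>YOUR_ACCOUNT_KEY</value>
</property>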
The abfs binding is already in core-default.xml for any release with the ABFS client present. However, the hadoop-azure JAR and its dependencies are not in the Hadoop common/lib dir where they are needed (they are in HDI and CDH, but not in the Apache release).
You can tell the hadoop script to pick them up by setting the HADOOP_OPTIONAL_TOOLS env var; you can do this in ~/.hadoop-env, but just try it on your command line first:
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
After doing that, download the latest cloudstore JAR and use its storediag command to attempt to connect to an ABFS URL; it's the place to start debugging classpath and config issues.
https://github.com/steveloughran/cloudstore
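For illustration, a hedged example of running storediag once the JAR is downloaded (the JAR name/version and the container/account in the URL are placeholders):
hadoop jar cloudstore-1.0.jar storediag abfs://CONTAINER@ACCOUNT.dfs.core.windows.net/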

How to properly set spark.driver.log.dfsDir parameter?

Using Spark 3.1.1
How do I properly set spark.driver.log.dfsDir?
My spark-defaults.conf:
spark.eventLog.dir hdfs://namenode:9000/shared/spark-logs
spark.history.fs.logDirectory hdfs://namenode:9000/shared/spark-logs
spark.history.fs.update.interval 30s
spark.history.ui.port 8099
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.maxAge 30d
spark.driver.log.persistToDfs.enabled true
spark.driver.log.dfsDir hdfs://namenode:9000/shared/driver-logs
I get the following error when using spark-submit on my spark driver.
21/05/19 15:05:34 ERROR DriverLogger: Could not persist driver logs to dfs
java.lang.IllegalArgumentException: Pathname /home/app/odm-spark/hdfs:/namenode:9000/shared/driver-logs from /home/app/odm-spark/hdfs:/namenode:9000/shared/driver-logs is not a valid DFS filename.
Why does it prefix the app location to the URL?
The proper way to set it is:
spark.driver.log.dfsDir /shared/driver-logs
There could be an error in earlier "implementations" of how spark.driver.log.dfsDir is handled (though I cannot confirm it), since the official documentation says:
spark.driver.log.dfsDir Base directory in which Spark driver logs are synced, if spark.driver.log.persistToDfs.enabled is true. Within this base directory, each application logs the driver logs to an application specific file.
There is also this section:
If your applications persist driver logs in client mode by enabling spark.driver.log.persistToDfs.enabled, the directory where the driver logs go (spark.driver.log.dfsDir) should be manually created with proper permissions.
This gives the "feeling" that the directory is the root directory under which any driver logs are copied.
This line in the source code (the DriverLogger, which is responsible for copying driver logs) leaves no doubt:
val rootDir = conf.get(DRIVER_LOG_DFS_DIR).get
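Putting it together, a hedged sketch of the working setup (this assumes fs.defaultFS already points at hdfs://namenode:9000, so the plain path resolves to the same HDFS location, and that the directory is created up front as the docs require):
spark.driver.log.persistToDfs.enabled true
spark.driver.log.dfsDir /shared/driver-logs
Create the directory beforehand with whatever permissions fit your setup, for example:
hdfs dfs -mkdir -p /shared/driver-logs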

JAVA_HOME for Logstash

I am trying to set up the ELK stack for monitoring my web services' logs, and I have set up all the parts of the stack.
I am facing one issue with Logstash: when I run it, I get the error "could not load Java binary".
The simple fix is to set JAVA_HOME as an environment variable.
But I don't want to set a global environment variable; I want to set JAVA_HOME just for Logstash. I tried adding it in startup.options, but to apply that I must run system-install, and when I run system-install I hit the same error again.
I have added
export JAVA_HOME=/opt/jre8
and then system-install runs, but on starting Logstash I still get the same error. What should I do to resolve this?
You can configure this in startup.options (Logstash 5.4), for example:
JAVA_HOME=/.../jdk1.8.0_121
JAVACMD=/.../jdk1.8.0_121/bin/java
Then run system-install as root.
(You can use update-java-alternatives --list to list the installed Java versions with their paths.)
You can add this configuration to the file /etc/sysconfig/logstash, which is read by Logstash during startup.
This is what you should add:
export JAVA_HOME=/opt/jre8
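As a hedged sketch of the full change (RPM-based packages read /etc/sysconfig/logstash, while Debian/Ubuntu packages use /etc/default/logstash; /opt/jre8 is the path from the question):
# /etc/sysconfig/logstash (or /etc/default/logstash on Debian/Ubuntu)
export JAVA_HOME=/opt/jre8
Then restart the service so the new environment is picked up:
sudo systemctl restart logstash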

Spark cluster with Jupyter Notebook tmp directory settings

I have a problem (or rather a requirement) that all temporary files must be written to a specific directory.
I currently set:
spark.local.dir /path/to/my/other/tmp/directory
spark.eventLog.dir /path/to/foo/bar
This mostly works, but I still get some files in the default /tmp folder: some <some hash>_resources files, a folder called hive, and a lot of liblz4-java<some hash>.so and snappy-<version>-<hash>-libsnappyjava.so files.
I would like to set the path for these temporary files as well. Any ideas, and what would be the best practice for something like this?
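For context, a hedged sketch combining the settings above with the JVM temp-dir override that is often used to redirect natively extracted .so files; the extraJavaOptions lines are an assumption, not something stated in the question:
spark.local.dir /path/to/my/other/tmp/directory
spark.eventLog.dir /path/to/foo/bar
# assumption: point java.io.tmpdir at the same directory so native libs (lz4, snappy) land there too
spark.driver.extraJavaOptions -Djava.io.tmpdir=/path/to/my/other/tmp/directory
spark.executor.extraJavaOptions -Djava.io.tmpdir=/path/to/my/other/tmp/directory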

Changing data file directories Cassandra

I'm trying to change the Cassandra data, commit log, and saved caches directories by defining a custom shell script for CASSANDRA_INCLUDE. I'm modifying the properties in the script as follows:
***
data_file_directories = "/usr/pic1/kearanky/cassandra/data"
commitlog_directory = "/usr/pic1/kearanky/cassandra/commitlog"
saved_caches_directory: "/usr/pic1/kearanky/cassandra/saved_caches"
***
When I run cassandra I get the error "data_file_directories: command not found". How can I modify the directories correctly?
PS: I don't have write access to cassandra.yaml and can't create the default directories it uses.
Refer to this answer: make your own cassandra.yaml with your custom directories, then run cassandra with the -D flag, pointing cassandra.config at that file,
or set the $CASSANDRA_HOME variable in your .bashrc and then run cassandra.
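As a hedged sketch of that suggestion (using the paths from the question; the directory settings are YAML entries in cassandra.yaml, not shell assignments, which is why the original script failed with "command not found"):
data_file_directories:
    - /usr/pic1/kearanky/cassandra/data
commitlog_directory: /usr/pic1/kearanky/cassandra/commitlog
saved_caches_directory: /usr/pic1/kearanky/cassandra/saved_caches
Then start Cassandra pointing at that file (assuming it lives at the path shown; cassandra.config accepts a file URL):
cassandra -Dcassandra.config=file:///usr/pic1/kearanky/cassandra/conf/cassandra.yaml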
