Turn off pyspark logging through python script - apache-spark

How can I turn off pyspark logging from a python script?
Please note: I do not want to make any changes to the Spark logger properties file.

To remove (or modify) logging from a Python script:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.logConf', 'true')  # optional: logs the effective configuration at startup
...  # other stuff and configuration
# create the session
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName(app_name) \
    .getOrCreate()
# set the log level to one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("OFF")
Docs: Spark configuration, SparkContext.setLogLevel
Hope this helps, good luck!
Edit: For earlier versions, e.g. 1.6, you can try something like the following, taken from here
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.OFF)
# or
logger.LogManager.getRootLogger().setLevel(logger.Level.OFF)
I haven't tested it, unfortunately; please let me know if it works.

Related

PySpark config through airflow

I'm trying to pass the package org.apache.spark:spark-avro_2.12:2.4.3 to the config through SparkSubmitOperator, as described here: https://spark.apache.org/docs/2.4.3/sql-data-sources-avro.html , since I'm trying to use Spark to read Avro files.
This is what I did in the Airflow DAG, but it didn't work. Could someone please point out what I did wrong? Many thanks.
conf = Variable.get("spark_conf", deserialize_json = True)
conf_sp = conf.update({"spark.jars.packages":"org.apache.spark:spark-avro_2.12:2.4.3"})
op = SparkSubmitOperator(
    application = "my_app",
    conf = conf_sp
    ....
)
The SparkSubmitOperator relies on SparkSubmitHook, which ultimately composes a spark-submit CLI command to be executed.
In that CLI form, you need to declare dependencies with the packages option, so they can be fetched from Maven, rather than in the configuration option.
op = SparkSubmitOperator(
    application = "my_app",
    packages = "org.apache.spark:spark-avro_2.12:2.4.3"
)
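For reference, the command the hook ends up composing is roughly equivalent to the following spark-submit call (a sketch; "my_app" is just the placeholder application from the snippet above):
spark-submit \
  --packages org.apache.spark:spark-avro_2.12:2.4.3 \
  my_app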

spark-submit overriding default application.conf not working

I am building a jar which has application.conf under the src/main/resources folder. I am trying to override it while doing spark-submit, but it's not working.
The following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.
-Dconfig.file=path/to/config-file may not work due to the internal cache in ConfigFactory. The documentation suggests running ConfigFactory.invalidateCaches().
Another way is the following, which merges the supplied properties with the existing ones:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

ConfigFactory.invalidateCaches()
val c = ConfigFactory.parseFile(new File(pathToFile + "/" + "application.conf"))
val config: Config = c.withFallback(ConfigFactory.load()).resolve()
I think the best way to override the properties is to supply them using -D. Typesafe Config gives the highest priority to system properties, so -D will override both reference.conf and application.conf.
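For example, a spark-submit invocation along these lines should work (a sketch based on the command from the question; note it uses spark.driver.extraJavaOptions and spark.executor.extraJavaOptions rather than extraClassPath, and --files so the executors receive the file in their working directory):
$spark_submit $spark_params $hbase_params \
  --class com.abc.xyz.MYClass \
  --files application.conf \
  --conf "spark.driver.extraJavaOptions=-Dconfig.file=application.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
  $sandbox_jar flagFile/test.FLG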
Assuming application.conf is a plain properties file, there is another option that serves the same purpose.
Packaging the properties file inside the jar limits flexibility; keeping it separate means that whenever a property changes you just replace the properties file instead of rebuilding and redeploying the whole jar.
To do this, keep your properties in a properties file and prefix each property key with "spark.", for example:
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
Read the properties in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This won't necessarily solve your problem, but it is another approach worth trying.

How to get the SparkSession to find added python files

After running pip install BigDL==0.8.0, running from bigdl.util.common import * in plain Python completed without issue.
However, with either of the following SparkSessions:
spark = (SparkSession.builder.master('yarn')
         .appName('test')
         .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
         .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
         .getOrCreate()
         )
or
spark = (SparkSession.builder.master('local')
         .appName('test')
         .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
         .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
         .getOrCreate()
         )
I get the following error.
ImportError: ('No module named bigdl.util.common', <function subimport at 0x7fd442a36aa0>, ('bigdl.util.common',))
In addition to the 'spark.submit.pyFiles' config above, I have tried spark.sparkContext.addPyFile("util.zip") after the SparkSession successfully starts, where "util.zip" contains all of the Python files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl/util .
I have also zipped all of the contents of the folder https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8) and pointed to that file in .config('spark.submit.pyFiles', '/path/to/bigdl.zip'), but this does not work either.
How do I get the SparkSession to see these files?
Figured it out. The only thing that worked was spark.sparkContext.addPyFile("bigdl.zip") after the SparkSession had started, where "bigdl.zip" contained all of the files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8).
I'm not sure why .config('spark.submit.pyFiles', 'bigdl.zip') would not work.
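Putting it together, the working pattern looks roughly like this (a sketch; the jar and zip paths are the ones from the question and purely illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('yarn')
         .appName('test')
         .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
         .getOrCreate())

# ship the zipped python package to the executors after the session is up
spark.sparkContext.addPyFile("/path/to/bigdl.zip")

# the import now resolves on the driver and the executors
from bigdl.util.common import *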

Pyspark - FileInputDStream: Error finding new files

Hi, I'm new to Python Spark and I'm trying out this example from the Spark GitHub repository in order to count words in new text files created in a given directory:
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: hdfs_wordcount.py <directory>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingHDFSWordCount")
    ssc = StreamingContext(sc, 1)

    lines = ssc.textFileStream("hdfs:///home/my-logs/")
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda x: (x, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
And this is what I get:
a warning saying: WARN FileInputDStream: Error finding new files
and empty results, even though I'm adding files to this directory :/
Any suggested solution for this?
Thanks.
The issue is that Spark Streaming will not read old files from the directory, since all the log files existed before your streaming job started.
So once you have started your streaming job, put/copy the input files into the HDFS directory, either manually or with a script.
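For instance, with the directory hard-coded in the code above, copying a fresh file into HDFS while the job is running will make it show up in the next batch (a sketch; the file name is just an example):
hdfs dfs -put fresh-events.log /home/my-logs/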
I think you are referring to this example. Are you able to run it without modification? I see you are setting the directory to "hdfs:///" in the program. You can run the example as below.
For example, if Spark is at /opt/spark-2.0.2-bin-hadoop2.7, you can run hdfs_wordcount.py from the examples directory as shown below, using /tmp as the directory to pass as an argument to the program.
user1@user1:/opt/spark-2.0.2-bin-hadoop2.7$ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py /tmp
Now, while this program is running, open another terminal and copy some file to the /tmp folder:
user1@user1:~$ cp test.txt /tmp
You will see the word count in the first terminal.
Solved!
The issue was the build. I had built Spark with Maven like this, following the README file from GitHub:
build/mvn -DskipTests clean package
I rebuilt it this way instead, following their documentation:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Does anyone know what those parameters are?

WARN No appenders could be found for logger (org.apache.accumulo.start.classloader.AccumuloClassLoader)

Does anyone know how to get rid of the following warnings when starting accumulo:
log4j:WARN No appenders could be found for logger (org.apache.accumulo.start.classloader.AccumuloClassLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I am running accumulo 1.4.0, hadoop 0.20.2 and zookeeper 3.3.3. I understand this warning happens because the class cannot find the log4j.properties file, and yes, I have read http://logging.apache.org/log4j/1.2/manual.html. My log4j.properties file has the following lines, copied from an accumulo 1.4.3 log4j file (I don't have the option to upgrade my system to 1.4.3):
# default logging properties:
# by default, log everything at INFO or higher to the console
log4j.rootLogger=INFO,A1
# hide Jetty junk
log4j.logger.org.mortbay.log=WARN,A1
# hide "Got brand-new compresssor" messages
log4j.logger.org.apache.hadoop.io.compress=WARN,A1
# hide junk from TestRandomDeletes
log4j.logger.org.apache.accumulo.server.test.TestRandomDeletes=WARN,A1
# hide almost everything from zookeeper
log4j.logger.org.apache.zookeeper=ERROR,A1
# hide AUDIT messages in the shell, alternatively you could send them to a different logger
log4j.logger.org.apache.accumulo.core.util.shell.Shell.audit=WARN,A1
# Send most things to the console
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout.ConversionPattern=%d{ISO8601} [%-8c{2}] %-5p: %m%n
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
I have put this log4j file everywhere: in the accumulo/bin folder, in the accumulo/conf folder, in the accumulo/lib folder, but I cannot get rid of this warning (I know it has to go on the Accumulo classpath, but I don't know where that is). I also can't pass a -Dlog4j.configuration option to the JVM because the accumulo executable comes pre-built (I just run it).
Thanks in advance for the help.
EDIT: Below is the result of an "accumulo classpath" command on my system:
[admin-cloud@NODE1 bin]$ echo $ACCUMULO_HOME
/accumulo/accumulo-1.4.0
[admin-cloud@NODE1 bin]$ accumulo classpath
Accumulo List of classpath items are:
file:/accumulo/accumulo-1.4.0/lib/commons-collections-3.2.jar
file:/accumulo/accumulo-1.4.0/lib/commons-configuration-1.5.jar
file:/accumulo/accumulo-1.4.0/lib/log4j-1.2.16.jar
file:/accumulo/accumulo-1.4.0/lib/libthrift-0.6.1.jar
file:/accumulo/accumulo-1.4.0/lib/commons-jci-core-1.0.jar
file:/accumulo/accumulo-1.4.0/lib/commons-lang-2.4.jar
file:/accumulo/accumulo-1.4.0/lib/commons-logging-api-1.0.4.jar
file:/accumulo/accumulo-1.4.0/lib/accumulo-server-1.4.0.jar
file:/accumulo/accumulo-1.4.0/lib/accumulo-start-1.4.0.jar
file:/accumulo/accumulo-1.4.0/lib/commons-jci-fam-1.0.jar
file:/accumulo/accumulo-1.4.0/lib/jline-0.9.94.jar
file:/accumulo/accumulo-1.4.0/lib/examples-simple-1.4.0.jar
file:/accumulo/accumulo-1.4.0/lib/cloudtrace-1.4.0.jar
file:/accumulo/accumulo-1.4.0/lib/commons-logging-1.0.4.jar
file:/accumulo/accumulo-1.4.0/lib/accumulo-core-1.4.0.jar
file:/accumulo/accumulo-1.4.0/lib/commons-io-1.4.jar
file:/zookeeper/zookeeper-3.3.6/zookeeper-3.3.6.jar
file:/hadoop/hadoop-0.20.2/conf/
file:/hadoop/hadoop-0.20.2/hadoop-0.20.2-examples.jar
file:/hadoop/hadoop-0.20.2/hadoop-0.20.2-test.jar
file:/hadoop/hadoop-0.20.2/hadoop-0.20.2-tools.jar
file:/hadoop/hadoop-0.20.2/hadoop-0.20.2-ant.jar
file:/hadoop/hadoop-0.20.2/hadoop-0.20.2-core.jar
file:/hadoop/hadoop-0.20.2/lib/log4j-1.2.15.jar
file:/hadoop/hadoop-0.20.2/lib/jasper-runtime-5.5.12.jar
file:/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.4.3.jar
file:/hadoop/hadoop-0.20.2/lib/commons-httpclient-3.0.1.jar
file:/hadoop/hadoop-0.20.2/lib/mockito-all-1.8.0.jar
file:/hadoop/hadoop-0.20.2/lib/jetty-6.1.14.jar
file:/hadoop/hadoop-0.20.2/lib/oro-2.0.8.jar
file:/hadoop/hadoop-0.20.2/lib/servlet-api-2.5-6.1.14.jar
file:/hadoop/hadoop-0.20.2/lib/junit-3.8.1.jar
file:/hadoop/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar
file:/hadoop/hadoop-0.20.2/lib/commons-codec-1.3.jar
file:/hadoop/hadoop-0.20.2/lib/core-3.1.1.jar
file:/hadoop/hadoop-0.20.2/lib/jets3t-0.6.1.jar
file:/hadoop/hadoop-0.20.2/lib/hsqldb-1.8.0.10.jar
file:/hadoop/hadoop-0.20.2/lib/slf4j-api-1.4.3.jar
file:/hadoop/hadoop-0.20.2/lib/jasper-compiler-5.5.12.jar
file:/hadoop/hadoop-0.20.2/lib/jetty-util-6.1.14.jar
file:/hadoop/hadoop-0.20.2/lib/commons-net-1.4.1.jar
file:/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar
file:/hadoop/hadoop-0.20.2/lib/commons-cli-1.2.jar
file:/hadoop/hadoop-0.20.2/lib/xmlenc-0.52.jar
file:/hadoop/hadoop-0.20.2/lib/kfs-0.2.2.jar
file:/hadoop/hadoop-0.20.2/lib/commons-el-1.0.jar
Line 84 of bin/accumulo in Apache Accumulo 1.4.0 sets the variable XML_FILES to $ACCUMULO_HOME/conf and then adds XML_FILES to the CLASSPATH variable which is later passed to the java command.
https://svn.apache.org/repos/asf/accumulo/tags/1.4.0/bin/accumulo
It sounds like you have a misconfiguration of ACCUMULO_HOME, either through your shell environment or in $ACCUMULO_HOME/conf/accumulo-env.sh.
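A quick sanity check (a sketch, assuming a standard install layout) is to confirm that ACCUMULO_HOME points where you think it does, that the conf directory actually contains your log4j.properties, and that the conf directory shows up in the classpath:
echo $ACCUMULO_HOME
ls $ACCUMULO_HOME/conf/log4j.properties
bin/accumulo classpath | grep conf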
I was troubleshooting an installation someone else had set up that was having the same problem. My solution was simply that there actually was no log4j.properties in the conf directory! So I just copied one of the log4j.properties files from the conf/examples directory, restarted, and everything worked as it should.
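For example, something along these lines (a sketch; the exact examples subdirectory depends on which memory profile your install ships, e.g. 512MB/standalone):
cp $ACCUMULO_HOME/conf/examples/512MB/standalone/log4j.properties $ACCUMULO_HOME/conf/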

Resources