I have looked into:
log4j2.properties in /etc/spark/conf
yarn-site.xml
yarn-env.sh (YARN_LOG_DIR is not being set there; in fact, while running a job there is no YARN_LOG_DIR environment variable in my executors)
log4j.properties in /etc/hadoop/conf
Where can I find and modify the yarn.nodemanager.log-dirs property?
To find this, we need to traverse some of Hadoop's source code:
yarn.nodemanager.log-dirs defaults to ${yarn.log.dir}/userlogs.
yarn.log.dir defaults to $HADOOP_LOG_DIR
$HADOOP_LOG_DIR defaults to ${HADOOP_HOME}/logs
So, have a look at $HADOOP_HOME/logs/userlogs to see whether you find something in there!
If you want to edit it, you can do any of the following:
edit $HADOOP_HOME
edit $HADOOP_LOG_DIR
add -Dyarn.log.dir=<your_chosen_value> to your Spark application
add -Dyarn.nodemanager.log-dirs=<your_chosen_value> to your Spark application (see the sketch after this list for one way to pass these)
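If you go with one of the -D options, one way to attach them to a Spark application is through the extraJavaOptions settings. A minimal PySpark sketch, where the log path is just a placeholder:

from pyspark.sql import SparkSession

# Sketch: pass -Dyarn.log.dir to the driver and executor JVMs (the path is a placeholder)
spark = SparkSession.builder \
    .config("spark.driver.extraJavaOptions", "-Dyarn.log.dir=/var/log/hadoop-yarn") \
    .config("spark.executor.extraJavaOptions", "-Dyarn.log.dir=/var/log/hadoop-yarn") \
    .getOrCreate()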
I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration docs, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file, without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]...Add the environment
variable specified by EnvironmentVariableName to the Executor process.
The user can specify multiple of these to set multiple environment
variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use the slightly different setting spark.yarn.dist.archives.
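For example, a sketch of the YARN variant (not verified on Cloudera Workbench; the archive name and the myUDFs alias are carried over from the snippet above):

from pyspark.sql import SparkSession

# Sketch for Spark on YARN: ship the archive to executors and put the
# extracted directory on their PYTHONPATH
spark = SparkSession.builder \
    .config("spark.yarn.dist.archives", "flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs") \
    .config("spark.executorEnv.PYTHONPATH", "./myUDFs") \
    .getOrCreate()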
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
I'm trying to save a dataframe as a table and I'm wondering if there is a default path configuration I can set to make my life easier.
I understand that this works:
df.write.saveAsTable("mytable", path='s3a://mybucket/mybucketlocation')
but is it possible to have this command
df.write.saveAsTable("mytable")
achieve the same role with spark configurations?
Currently I have this configuration set, but it's not doing the trick.
('spark.sql.warehouse.dir', 's3a://mybucket/mybucketlocation')
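For reference, this is roughly how I'm applying that setting when building the session (a sketch; the bucket path is a placeholder):

from pyspark.sql import SparkSession

# Sketch: set the warehouse location at session creation (bucket path is a placeholder)
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "s3a://mybucket/mybucketlocation") \
    .getOrCreate()

df = spark.createDataFrame([(1, "a")], ["id", "value"])
df.write.saveAsTable("mytable")  # I'd like this to land under the warehouse dir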
This was already the subject of discussion in a previous post; however, I'm not convinced by the answers, as the Google docs specify that it is possible to create a cluster setting the fs.defaultFS property. Moreover, even if it is possible to set this property programmatically, sometimes it's more convenient to set it from the command line.
So I wanted to know why the following option, when passed to my cluster creation command, does not work: --properties core:fs.defaultFS=gs://my-bucket. Please note I haven't included all the parameters, as I ran the command without this flag and it succeeded in creating the cluster. However, when passing it, I get: "failed: Cannot start master: Insufficient number of DataNodes reporting."
If anyone has managed to create a Dataproc cluster by setting fs.defaultFS, that would be great to know. Thanks.
It's true there are still known issues due to certain dependencies on actual HDFS; the docs were not intended to imply that setting fs.defaultFS to a GCS path at cluster-creation time would work, but to simply provide a convenient example of a property that appears in core-site.xml; in theory it would work to set fs.defaultFS to a different preexisting HDFS cluster, for example. I've filed a ticket to change the example in the documentation to avoid confusion.
Two options:
Just override fs.defaultFS at job-submission time using per-job properties
Workaround some of the known issues by setting fs.defaultFS explicitly using an initialization action instead of cluster properties.
Option 1 is better understood to work because cluster-level HDFS dependencies won't change. Option 2 works because most of the incompatibilities occur during initial startup only, and initialization actions run after the relevant daemons have already started. To override the setting in an init action, you'd use bdconfig (a per-job sketch for option 1 follows the bdconfig snippet):
bdconfig set_property \
--name 'fs.defaultFS' \
--value 'gs://my-bucket' \
--configuration_file /etc/hadoop/conf/core-site.xml \
--clobber
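For option 1, a minimal sketch of what the per-job override could look like from inside a PySpark job (the bucket name and input path are placeholders); on Dataproc you could also pass the same property at job-submission time instead:

from pyspark.sql import SparkSession

# Sketch for option 1: spark.hadoop.* settings are copied into the job's Hadoop
# Configuration, overriding fs.defaultFS for this job only (bucket is a placeholder)
spark = SparkSession.builder \
    .config("spark.hadoop.fs.defaultFS", "gs://my-bucket") \
    .getOrCreate()

# Unqualified paths now resolve against the GCS bucket rather than cluster HDFS
df = spark.read.text("/some/input/path")  # placeholder path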
Basically, I want to check a property of Spark's configuration, such as "spark.local.dir", through the command line, that is, without writing a program. Is there a method to do this?
There is no option for viewing the Spark configuration properties from the command line.
Instead, you can check them in the spark-defaults.conf file. Another option is to view them from the web UI.
The application web UI at http://driverIP:4040 lists Spark properties in the “Environment” tab. Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
For more details, you can refer to Spark Configuration.
The following command prints your conf properties on the console:
sc.getConf.toDebugString
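If you are in the PySpark shell rather than Scala, an equivalent would be something like this sketch:

# PySpark sketch: dump the resolved configuration of the active context
print(spark.sparkContext.getConf().toDebugString())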
We can check in the Spark shell using the command below:
scala> spark.conf.get("spark.sql.shuffle.partitions")
res33: String = 200
Based on http://spark.apache.org/docs/latest/configuration.html, Spark provides three locations to configure the system:
Spark properties control most application parameters and can be set
by using a SparkConf object, or through Java system properties.
Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
Logging can be configured through log4j.properties.
I haven't heard of a method through the command line.
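For the first of those locations, a minimal PySpark sketch of setting a property via a SparkConf object (the value is just a placeholder):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: set a Spark property programmatically, then read it back
conf = SparkConf().set("spark.local.dir", "/tmp/spark-scratch")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get("spark.local.dir"))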
Master command to check the Spark config from the CLI:
sc._conf.getAll()
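For example, from the PySpark shell, where sc is the active SparkContext (a sketch):

# Print every property the context was configured with
for key, value in sc._conf.getAll():
    print(key, "=", value)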
I am new to Spark and working on the JavaSqlNetworkWordCount example to append the word counts to a persistent table. I understand that I can only do it via HiveContext. HiveContext, however, keeps trying to save the table in /user/hive/warehouse/. I have tried changing the path by adding
hiveContext.setConf("hive.metastore.warehouse.dir", "/home/user_name");
and by adding the property
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/home/user_name</value>
</property>
to $SPARK_HOME/conf/hive-site.xml, but nothing seems to work. If anyone else has faced this problem, please let me know if/how you resolved it. I am using Spark 1.4 on my local RHEL5 machine.
I think I solved the problem. It looks like spark-submit was creating a metastore_db directory in the root directory of the jar file. If metastore_db exists, then hive-site.xml values are ignored. As soon as I removed that directory, the code picked up the values from hive-site.xml. I still cannot set the value of the hive.metastore.warehouse.dir property from the code, though.
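For what it's worth, a minimal sketch of that cleanup step, assuming metastore_db was created in the directory spark-submit is launched from (the relative path is an assumption):

import shutil

# Remove the stale local Derby metastore so hive-site.xml values are honored again
shutil.rmtree("metastore_db", ignore_errors=True)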