Here is a piece of code I run before connecting to Spark in Jupyter:
%%configure -f
{
    "conf": {
        "spark.driver.memory": "8G",
        ... some other config ...
    },
    "name": "my_spark_application_name",
    "queue": "yarn_source_queue_to_use"
}
I have some questions about this usage:
What does the magic command %%configure -f do? Could anyone please explain this command? I cannot find any related documentation.
Are these configure contents a template designed by other engineers to start a PySpark session, or something provided by JupyterHub? What runs behind the whole command?
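For what it's worth, in the common sparkmagic + Livy setup (which the conf/name/queue fields above suggest), the cell body is just a JSON fragment that sparkmagic merges into the session-creation request it sends to the Livy server, and -f forces the current session to be dropped and recreated with the new settings. A rough, hypothetical equivalent of what gets posted (the Livy host and values are placeholders):

# Sketch only, assuming a sparkmagic + Livy setup; adjust host, port and queue to your cluster.
curl -X POST -H "Content-Type: application/json" \
     -d '{"kind": "pyspark", "name": "my_spark_application_name", "queue": "yarn_source_queue_to_use", "conf": {"spark.driver.memory": "8G"}}' \
     http://<livy-host>:8998/sessions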
I have a configuration and cluster set up in GCP and I can submit a Spark job, but I am trying to run gcloud dataproc jobs submit spark from my CLI for the same configuration.
I've set the service account locally; I am just unable to build the equivalent command for the console configuration.
console config:
"sparkJob": {
"mainClass": "main.class",
"properties": {
"spark.executor.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties",
"spark.driver.extraJavaOptions": "-DARGO_ENV_FILE=gs://file.properties"
},
"jarFileUris": [
"gs://my_jar.jar"
],
"args": [
"arg1",
"arg2",
"arg3"
]
}
And the equivalent command that I built is:
gcloud dataproc jobs submit spark \
    -t spark \
    -p spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties \
    -m main.class \
    -c my_cluster \
    -f gs://my_jar.jar \
    -a 'arg1','arg2','arg3'
It's not reading the file.properties files and gives this error:
error while opening file spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties: error: open spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties: no such file or directory
And when I run the command without the -p (properties) flag and those files, it runs but eventually fails because of the missing properties files.
I can't figure out where I am going wrong.
PS: I'm trying to run the Dataproc command from the CLI, similar to this spark-submit command:
spark-submit \
    --conf "spark.driver.extraJavaOptions=-Dkafka.security.config.filename=file.properties" \
    --conf "spark.executor.extraJavaOptions=-Dkafka.security.config.filename=file.properties" \
    --class main.class my_jar.jar \
    --arg1 --arg2 --arg3
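For comparison, a rough sketch of how the same job is usually expressed with the documented long-form flags (the region value is a placeholder to adapt):

# Sketch only; --region (or a configured default region) is required on newer gcloud versions.
gcloud dataproc jobs submit spark \
    --cluster=my_cluster \
    --region=<region> \
    --class=main.class \
    --jars=gs://my_jar.jar \
    --properties=spark.executor.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties,spark.driver.extraJavaOptions=-DARGO_ENV_FILE=gs://file.properties \
    -- arg1 arg2 arg3

If a property value ever contains a comma, gcloud's alternate delimiter syntax (see gcloud topic escaping), e.g. --properties=^#^key1=val1#key2=val2, can be used instead.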
I have followed this guide:
https://support.nagios.com/kb/article/nagios-core-performance-graphs-using-influxdb-nagflux-grafana-histou-802.html#Nagflux_Config
I already have PNP4Nagios running on the server (Debian 9), but I can't get any further; I've been busy for weeks trying to get this fixed.
I am stuck at this point:
Verify Nagflux Is Working
Execute the following query to verify that InfluxDB is being populated with Nagios performance data:
curl -G "http://localhost:8086/query?db=nagflux&pretty=true" --data-urlencode "q=show series"
When I execute that command I get this:
{
    "results": [
        {}
    ]
}
I have already done this on another distro (CentOS 8), still no results.
But when I execute this command (from earlier in the documentation):
curl -G "http://localhost:8086/query?pretty=true" --data-urlencode "q=show databases"
This works:
{
    "results": [
        {
            "series": [
                {
                    "name": "databases",
                    "columns": [
                        "name"
                    ],
                    "values": [
                        [
                            "_internal"
                        ],
                        [
                            "nagflux"
                        ]
                    ]
                }
            ]
        }
    ]
}
I can add the InfluxDB data source successfully in Grafana, but I cannot select any data when I try to select it from the "FROM" field.
It's only showing:
Default
Autogen
So I am very curious what I am doing wrong; normally the documentation from Nagios support works very well.
Thank you big time for reading my issue :).
As you already have PNP4Nagios installed, https://support.nagios.com/kb/article/nagios-core-using-grafana-with-pnp4nagios-803.html would be a more appropriate solution for you.
/usr/local/nagios/etc/nagios.cfg needs a different host_perfdata_file_processing_command when you want to fill InfluxDB (with Nagflux) instead of using Grafana with PNP4Nagios.
You don't need another server. I have Nagios Core, InfluxDB, Nagflux, Histou and Grafana working on the same machine.
And you don't have to uninstall PNP4Nagios, just stop the service and disable it on boot: systemctl stop npcd.service && systemctl disable npcd.service.
After that you have to edit nagios.cfg according to https://support.nagios.com/kb/article/nagios-core-performance-graphs-using-influxdb-nagflux-grafana-histou-802.html#Nagios_Command_Config to change the host_perfdata_file_processing_command value and the format of *_perfdata_file_template.
Then define process-host-perfdata-file-nagflux & process-service-perfdata-file-nagflux commands in commands.cfg.
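For illustration only, those command definitions typically just move the perfdata spool files Nagios writes into the directory Nagflux watches; a rough sketch, where the source file names and the /usr/local/nagflux/spool target are assumptions that must match your nagios.cfg *_perfdata_file settings and Nagflux's spool folder setting:

# Sketch only: adjust both paths to your nagios.cfg and nagflux configuration.
define command {
    command_name    process-host-perfdata-file-nagflux
    command_line    /bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagflux/spool/host-perfdata.$TIMET$
}

define command {
    command_name    process-service-perfdata-file-nagflux
    command_line    /bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagflux/spool/service-perfdata.$TIMET$
}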
If you did as described above, after a minute you should see changes in your nagflux database.
Install influxdb-client, then:
influx
use nagflux
SELECT * FROM METRICS
You should see your database loading :)
I am using Databricks and writing my code in a Python notebook. Recently we deployed it to production; however, sometimes the notebook fails.
I am looking for the notebook command execution log file, but there is no option to generate such a log file in Databricks.
I want to store log files in DBFS with a timestamp so I can refer to them if the notebook fails.
Is there any way we can achieve this? Thanks in advance for your help.
Yes, there is a way to do this. You would utilize the Databricks API. This is taken from their website.
Create a cluster with logs delivered to a DBFS location
The following cURL command creates a cluster named “cluster_log_dbfs” and requests Databricks to send its logs to dbfs:/logs with the cluster ID as the path prefix.
curl -n -H "Content-Type: application/json" -X POST -d @- https://<databricks-instance>/api/2.0/clusters/create <<JSON
{
  "cluster_name": "cluster_log_dbfs",
  "spark_version": "5.2.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "num_workers": 1,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/logs"
    }
  }
}
JSON
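If you also want per-notebook log files with a timestamp, as asked above, a minimal sketch along these lines can complement the cluster log delivery; the dbfs:/notebook_logs path and logger name are placeholders, and it assumes the standard /dbfs FUSE mount is writable in your workspace:

# Minimal sketch: write a timestamped log file for this notebook run to DBFS.
import logging
import os
from datetime import datetime

log_dir = "/dbfs/notebook_logs"  # placeholder DBFS folder, visible as dbfs:/notebook_logs
os.makedirs(log_dir, exist_ok=True)
log_path = f"{log_dir}/my_notebook_{datetime.now():%Y%m%d_%H%M%S}.log"

logger = logging.getLogger("my_notebook")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

try:
    logger.info("Notebook started")
    # ... notebook logic goes here ...
    logger.info("Notebook finished")
except Exception:
    logger.exception("Notebook failed")
    raise

If direct appends to /dbfs are restricted in your workspace, write to a local path (e.g. /tmp) instead and copy the finished file to DBFS with dbutils.fs.cp at the end of the run.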
I am trying to add an external package in a Jupyter notebook on Azure Spark.
%%configure -f
{ "packages" : [ "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4" ] }
Its output:
Current session configs: {u'kind': 'spark', u'packages': [u'com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4']}
But when I tried to import:
import org.apache.spark.streaming.eventhubs.EventHubsUtils
I got an error:
The code failed because of a fatal error: Invalid status code '400'
from
http://an0-o365au.zdziktedd3sexguo45qd4z4qhg.xx.internal.cloudapp.net:8998/sessions
with error payload: "Unrecognized field \"packages\" (class
com.cloudera.livy.server.interactive.CreateInteractiveRequest), not
marked as ignorable (15 known properties: \"executorCores\", \"conf\",
\"driverMemory\", \"name\", \"driverCores\", \"pyFiles\",
\"archives\", \"queue\", \"kind\", \"executorMemory\", \"files\",
\"jars\", \"proxyUser\", \"numExecutors\",
\"heartbeatTimeoutInSecond\" [truncated]])\n at [Source:
HttpInputOverHTTP#5bea54d; line: 1, column: 32] (through reference
chain:
com.cloudera.livy.server.interactive.CreateInteractiveRequest[\"packages\"])".
Some things to try: a) Make sure Spark has enough available resources
for Jupyter to create a Spark context. For instructions on how to
assign resources see http://go.microsoft.com/fwlink/?LinkId=717038 b)
Contact your cluster administrator to make sure the Spark magics
library is configured correctly.
I also tried:
%%configure
{ "conf": {"spark.jars.packages": "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4" }}
Got the same error.
Could someone point me to the correct way to use an external package in Jupyter on Azure Spark?
If you're using HDInsight 3.6, then use the following. Also, be sure to restart your kernel before executing this:
%%configure -f
{"conf":{"spark.jars.packages":"com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4"}}
Also, ensure that your package name, version and scala version are correct. Specifically, the JAR that you're trying to use has changed names since the posting of this question. More information on what it is called now can be found here: https://github.com/Azure/azure-event-hubs-spark.
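For example, the cell might look roughly like the sketch below; the com.microsoft.azure:azure-eventhubs-spark_2.11 coordinate is an assumption based on that repository's current naming, so check its README for the exact artifact and a version matching your Spark/Scala version:

%%configure -f
{ "conf": {"spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.11:<version>" }}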
As far as I can tell, when setting / using spark.driver.extraClassPath and spark.executor.extraClassPath on AWS EMR within the spark-defaults.conf or elsewhere as a flag, I would have to first get the existing value that [...].extraClassPath is set to, then append :/my/additional/classpath to it in order for it to work.
Is there a function in Spark that allows me to just append an additional classpath entry while retaining/respecting the existing paths set by EMR in /etc/spark/conf/spark-defaults.conf?
There is no such "function" in Spark, but:
On EMR AMIs you can write a bootstrap action that appends/sets whatever you want in spark-defaults; this will of course affect all Spark jobs.
When EMR moved to the newer "release-label" versions this stopped working, as bootstrap steps were replaced with configuration JSONs and manual bootstrap actions run before applications are installed (at least when I tried it).
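For reference, with release-label clusters such a configuration JSON might look roughly like the sketch below; note that a classification sets the property outright rather than appending, so you would have to include EMR's own classpath entries yourself (the paths shown are placeholders):

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "<EMR default paths>:/my/additional/classpath",
      "spark.executor.extraClassPath": "<EMR default paths>:/my/additional/classpath"
    }
  }
]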
We use SageMaker Studio connected to EMR clusters via Apache Livy, and none of the approaches suggested so far worked for me; we need to add a custom JAR to the classpath for a custom filesystem.
Following this guide:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
I made it work by adding an EMR step to the cluster setup with the following code (this is TypeScript CDK, passed into the steps parameter of the CfnCluster CDK construct):
//We are using this approach to add CkpFS JAR (in Docker image additional-jars) onto the class path because
// the spark.jars configuration in spark-defaults is not respected by Apache Livy on EMR when it creates a Spark session
// AWS guide followed: https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
private createStepToAddCkpFSJAROntoEMRSparkClassPath() {
    return {
        name: "AddCkpFSJAROntoEMRSparkClassPath",
        hadoopJarStep: {
            jar: "command-runner.jar",
            args: [
                "bash",
                "-c",
                "while [ ! -f /etc/spark/conf/spark-defaults.conf ]; do sleep 1; done;" +
                    "sudo sed -i '/spark.*.extraClassPath/s/$/:\\/additional-jars\\/\\*/' /etc/spark/conf/spark-defaults.conf",
            ],
        },
        actionOnFailure: "CONTINUE",
    };
}
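If you take this route, it is worth confirming on the master node that the step actually patched the file, for example:

# Should show the :/additional-jars/* suffix appended to both extraClassPath lines.
grep extraClassPath /etc/spark/conf/spark-defaults.conf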