How to use an external package in Jupyter on Azure Spark

I am trying to add an external package in Jupyter on Azure Spark.
%%configure -f
{ "packages" : [ "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4" ] }
Its output:
Current session configs: {u'kind': 'spark', u'packages': [u'com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4']}
But when I tried to import:
import org.apache.spark.streaming.eventhubs.EventHubsUtils
I got an error:
The code failed because of a fatal error: Invalid status code '400'
from
http://an0-o365au.zdziktedd3sexguo45qd4z4qhg.xx.internal.cloudapp.net:8998/sessions
with error payload: "Unrecognized field \"packages\" (class
com.cloudera.livy.server.interactive.CreateInteractiveRequest), not
marked as ignorable (15 known properties: \"executorCores\", \"conf\",
\"driverMemory\", \"name\", \"driverCores\", \"pyFiles\",
\"archives\", \"queue\", \"kind\", \"executorMemory\", \"files\",
\"jars\", \"proxyUser\", \"numExecutors\",
\"heartbeatTimeoutInSecond\" [truncated]])\n at [Source:
HttpInputOverHTTP#5bea54d; line: 1, column: 32] (through reference
chain:
com.cloudera.livy.server.interactive.CreateInteractiveRequest[\"packages\"])".
Some things to try: a) Make sure Spark has enough available resources
for Jupyter to create a Spark context. For instructions on how to
assign resources see http://go.microsoft.com/fwlink/?LinkId=717038 b)
Contact your cluster administrator to make sure the Spark magics
library is configured correctly.
I also tried:
%%configure
{ "conf": {"spark.jars.packages": "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4" }}
Got the same error.
Could someone point me to the correct way to use an external package in Jupyter on Azure Spark?

If you're using HDInsight 3.6, then use the following. Also, be sure to restart your kernel before executing this:
%%configure -f
{"conf":{"spark.jars.packages":"com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4"}}
Also, ensure that your package name, version, and Scala version are correct. Specifically, the JAR that you're trying to use has changed names since this question was posted. More information on what it is called now can be found here: https://github.com/Azure/azure-event-hubs-spark.
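For reference, a cell using the renamed connector would take the same shape; the artifact id and version below are placeholders, so look up the coordinates matching your cluster's Spark and Scala versions in the repository linked above:
%%configure -f
{ "conf": { "spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.11:<version>" } }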

Related

How to read a CSV file from an FTP server using PySpark in Databricks Community Edition

I am trying to fetch a file over FTP (hosted on Hostinger) using PySpark in Databricks Community Edition.
Everything works fine until I try to read that file using spark.read.csv('MyFile.csv').
Following are the code and the error.
PySpark Code:
res=spark.sparkContext.addFile('ftp://USERNAME:PASSWORD#URL/<folder_name>/MyFile.csv')
myfile=SparkFiles.get("MyFile.csv")
spark.read.csv(path=myfile) # Errors out here
print(myfile, sc.getRootDirectory()) # Outputs almost same path (except dbfs://)
Error:
AnalysisException: Path does not exist: dbfs:/local_disk0/spark-ce4b313e-00cf-4f52-80d8-dff98fc3eea5/userFiles-90cd99b4-00df-4d59-8cc2-2e18050d395/MyFile.csv
You are getting this error because SparkContext.addFile downloads the file onto the driver's local disk, while Databricks uses DBFS as the default storage scheme. Please try the code below to see if it fixes your issue:
from pyspark import SparkFiles

# addFile downloads the file onto the driver's local disk
res = spark.sparkContext.addFile('ftp://USERNAME:PASSWORD#URL/<folder_name>/MyFile.csv')
myfile = SparkFiles.get("MyFile.csv")
# 'file:' forces Spark to read from the driver's local disk rather than dbfs:/
spark.read.csv(path='file:' + myfile)
print(myfile, sc.getRootDirectory())
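If you prefer to read through the default dbfs: scheme rather than the file: prefix, a variation is to copy the driver-local file into DBFS first. This is only a sketch, assuming you are in a Databricks notebook where dbutils is available; the dbfs:/tmp target path is arbitrary:
from pyspark import SparkFiles

# Driver-local path produced by the addFile call above
local_path = SparkFiles.get("MyFile.csv")

# Copy the driver-local file into DBFS, then read it through the default dbfs: scheme
dbutils.fs.cp("file:" + local_path, "dbfs:/tmp/MyFile.csv")
df = spark.read.csv("dbfs:/tmp/MyFile.csv")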

MLflow remote execution on databricks from windows creates an invalid dbfs path

I'm researching the use of MLflow as part of our data science initiatives, and I wish to set up a minimum working example of remote execution on Databricks from Windows.
However, when I perform the remote execution, a path is created locally on Windows in the MLflow package and then sent to Databricks. This path specifies the upload location of the '.tar.gz' file corresponding to the GitHub repo containing the MLflow Project. In cmd this path has a combination of '\' and '/', but on Databricks there are no separators at all in this path, which raises the 'rsync: No such file or directory (2)' error.
To be more general, I reproduced the error using a standard MLflow example and following this guide from Databricks. The MLflow example is sklearn_elasticnet_wine, but I had to add a default value to a parameter, so I forked it; the MLproject that can be executed remotely can be found at (forked repo).
The project can be executed remotely with the following command (assuming a Databricks instance has been set up):
mlflow run https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine -b databricks -c db-clusterconfig.json --experiment-id <insert-id-here>
where "db-clusterconfig.json" correspond to the cluster to set up in databricks and is in this example set to
{
    "autoscale": {
        "min_workers": 1,
        "max_workers": 2
    },
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "driver_node_type_id": "Standard_DS3_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
}
When running the project remotely, this is the output in cmd:
2019/10/04 10:09:50 INFO mlflow.projects: === Fetching project from https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine into C:\Users\ARNTS\AppData\Local\Temp\tmp2qzdyq9_ ===
2019/10/04 10:10:04 INFO mlflow.projects.databricks: === Uploading project to DBFS path /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Finished uploading project to /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Running entry point main of project https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
2019/10/04 10:10:06 INFO mlflow.projects.databricks: === Launched MLflow run as Databricks job run with ID 8. Getting run status page URL... ===
2019/10/04 10:10:18 INFO mlflow.projects.databricks: === Check the run's status at https://<region>.azuredatabricks.net/?o=<databricks-id>#job/8/run/1 ===
Note that the DBFS path has a leading '/' while the remaining separators are '\'.
The command spins up a cluster in Databricks that is ready to execute the job, but it ends with the following error message on the Databricks side:
rsync: link_stat "/dbfsmlflow-experiments3947403843428882projects-codeaa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.1]
Here we can see the same path, but with the '\' separators missing entirely. I narrowed the creation of this path down to this file in the MLflow GitHub repo, where the following code creates the path (line 133):
dbfs_path = os.path.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                         "projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = os.path.join("/dbfs", dbfs_path)
My current hypothesis is that os.path.join() in the first line joins the strings together in a "Windows fashion", so they are separated by backslashes. The following call to os.path.join() then adds a '/'. The Databricks file system is unable to handle this path, so the '.tar.gz' file is either not uploaded properly or is accessed at the wrong path.
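A quick way to reproduce this hypothesis on any OS is to join the same components with ntpath (which is what os.path resolves to on Windows) and with posixpath; the experiment id and file name below are just placeholders:
import ntpath      # what os.path is on Windows
import posixpath   # always joins with forward slashes

parts = ("mlflow-experiments", "3947403843428882", "projects-code", "project.tar.gz")

# What currently happens on Windows: backslashes from the inner join, then a single '/' from the outer one
print(ntpath.join("/dbfs", ntpath.join(*parts)))
# -> /dbfs\mlflow-experiments\3947403843428882\projects-code\project.tar.gz

# What DBFS actually expects: a plain POSIX path
print(posixpath.join("/dbfs", posixpath.join(*parts)))
# -> /dbfs/mlflow-experiments/3947403843428882/projects-code/project.tar.gz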
It should also be mentioned that the project runs fine locally.
I'm running the following versions:
Windows 10
Python 3.6.8
MLflow 1.3.0 (also replicated the fault with 1.2.0)
Any feedback or suggestions are greatly appreciated!
Thanks for the catch, you're right that using os.path.join when working with DBFS paths is incorrect, resulting in a malformed path that breaks project execution. I've filed https://github.com/mlflow/mlflow/issues/1926 to track this. If you're interested in making a bugfix PR (see the MLflow contributor guide for info on how to do this) to replace os.path.join here with posixpath.join, I'd be happy to review :)
Thanks for filing this issue.
I also encountered the same problem on Windows 10.
I resolved it by replacing every 'os.path' call with 'posixpath' in the 'databricks.py' file.
It worked perfectly fine for me.
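For reference, a minimal sketch of that substitution applied to the snippet quoted in the question; DBFS_EXPERIMENT_DIR_BASE, experiment_id and tarfile_hash are placeholders standing in for what MLflow computes in databricks.py:
import posixpath

# Placeholders for values computed earlier in MLflow's databricks.py
DBFS_EXPERIMENT_DIR_BASE = "mlflow-experiments"
experiment_id = 3947403843428882
tarfile_hash = "abc123"

# posixpath.join always uses '/', regardless of the OS the client runs on
dbfs_path = posixpath.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                           "projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = posixpath.join("/dbfs", dbfs_path)
print(dbfs_fuse_uri)  # /dbfs/mlflow-experiments/3947403843428882/projects-code/abc123.tar.gz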

How to troubleshoot a package loading error in Spark

I'm using Spark in HDInsight with a Jupyter notebook. I'm using the %%configure "magic" to import packages. Every time there is a problem with the package, Spark crashes with the error:
The code failed because of a fatal error: Status 'shutting_down' not
supported by session..
or
The code failed because of a fatal error: Session 28 unexpectedly
reached final status 'dead'. See logs:
Usually the problem was that I had mistyped the name of the package, so after a few attempts I could solve it. Now I'm trying to import spark-streaming-eventhubs_2.11 and I think I got the name right, but I still receive the error. I looked at all kinds of logs but still couldn't find one that shows any relevant info. Any idea how to troubleshoot errors like this?
%%configure -f
{ "conf": {"spark.jars.packages": "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.5" }}
Additional info: when I run
spark-shell --conf spark.jars.packages=com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.5
The shell starts fine and downloads the package.
I was finally able to find the log files which contain the error. There are two log files which could be interesting:
Livy log: livy-livy-server.out
Yarn log
On my HDInsight cluster, I found the Livy log by connecting to one of the head nodes with SSH and downloading a file at this path (this log didn't contain useful info):
/var/log/livy/livy-livy-server.out
The actual error was in the Yarn log file, accessible from the Yarn UI. In the HDInsight Azure Portal, go to "Cluster dashboard" -> "Yarn", find your session (KILLED status), click on "Logs" in the table, find "Log Type: stderr", and click "click here for full log".
The problem in my case was a Scala version incompatibility between one of the dependencies of spark-streaming_2.11 and Livy. This is supposed to be fixed in Livy 0.4. More info here.

getting & setting spark.driver/executor.extraClassPath on EMR

As far as I can tell, when setting/using spark.driver.extraClassPath and spark.executor.extraClassPath on AWS EMR, within spark-defaults.conf or elsewhere as a flag, I would have to first get the existing value that [...].extraClassPath is set to, then append :/my/additional/classpath to it in order for it to work.
Is there a function in Spark that allows me to just append an additional classpath entry while retaining/respecting the existing paths set by EMR in /etc/spark/conf/spark-defaults.conf?
No such "function" in Spark but:
On EMR AMIs you can write a bootstrap action that will append/set whatever you want in spark-defaults; this will of course affect all Spark jobs.
When EMR moved to the newer "release label" versions this stopped working, as bootstrap steps were replaced by configuration JSONs and manual bootstrap actions run before the applications are installed (at least when I tried it).
We use SageMaker Studio connected to EMR clusters via Apache Livy, and none of the approaches suggested so far worked for me - we need to add a custom JAR to the classpath for a custom filesystem.
Following this guide:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
I made it work by adding an EMR step to the cluster setup with the following code (this is TypeScript CDK, passed into the steps parameter of the CfnCluster CDK construct):
//We are using this approach to add CkpFS JAR (in Docker image additional-jars) onto the class path because
// the spark.jars configuration in spark-defaults is not respected by Apache Livy on EMR when it creates a Spark session
// AWS guide followed: https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
private createStepToAddCkpFSJAROntoEMRSparkClassPath() {
    return {
        name: "AddCkpFSJAROntoEMRSparkClassPath",
        hadoopJarStep: {
            jar: "command-runner.jar",
            args: [
                "bash",
                "-c",
                "while [ ! -f /etc/spark/conf/spark-defaults.conf ]; do sleep 1; done;" +
                    "sudo sed -i '/spark.*.extraClassPath/s/$/:\\/additional-jars\\/\\*/' /etc/spark/conf/spark-defaults.conf",
            ],
        },
        actionOnFailure: "CONTINUE",
    };
}
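To make the effect of that step concrete: the sed expression appends ':/additional-jars/*' to every extraClassPath line that EMR generated in spark-defaults.conf, without replacing the existing value, so the lines end up looking like this (illustrative; the existing value is whatever EMR put there):
spark.driver.extraClassPath   <existing EMR-provided classpath>:/additional-jars/*
spark.executor.extraClassPath <existing EMR-provided classpath>:/additional-jars/*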

Error while running Zeppelin paragraphs in Spark on a Linux cluster in Azure HDInsight

I have been following this tutorial in order to set up Zeppelin on a Spark cluster (version 1.5.2) in HDInsight, on Linux. Everything worked fine, and I managed to successfully connect to the Zeppelin notebook through the SSH tunnel. However, when I try to run any kind of paragraph, the first time I get the following error:
java.io.IOException: No FileSystem for scheme: wasb
After getting this error, if I try to rerun the paragraph, I get another error:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
These errors occur regardless of the code I enter, even if there is no reference to HDFS. What I'm saying is that I get the "No FileSystem" error even for a trivial Scala expression, such as a call to parallelize.
Is there a missing configuration step?
I am downloading the tarball from the script that you pointed to as I type. But what I am guessing is that your Zeppelin install and Spark install are not set up completely to work with WASB. In order to get Spark to work with WASB you need to add some JARs to the classpath. To do this you need to add something like this to your spark-defaults.conf (the paths might be different in HDInsight; this is from HDP on IaaS):
spark.driver.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
spark.executor.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
Once you have Spark working with WASB, the next step is to put those same JARs on the Zeppelin classpath. A good way to test your setup is to make a notebook that prints your environment variables and classpath:
// Print all environment variables visible to the interpreter
sys.env.foreach(println(_))
// Print every URL on the system classpath
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
Also, looking at the install script, it is trying to pull the Zeppelin JAR from WASB; you might want to change that config to point somewhere else while you try some of these changes out (zeppelin.sh):
export SPARK_YARN_JAR=wasb:///apps/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar
I hope this helps. If you still have problems I have some other ideas, but I would start with these first.
