When trying to register a UDF using Python I get an error about Spark BUILD with HIVE - apache-spark

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o54))
This happens whenever I create a UDF in a second notebook in Jupyter on IBM Bluemix Spark as a Service.

If you are using IBM Bluemix Spark as a Service, execute the following command in a cell of the Python notebook:
!rm -rf /gpfs/global_fs01/sym_shared/YPProdSpark/user/spark_tenant_id/notebook/notebooks/metastore_db/*.lck
Replace spark_tenant_id with your actual tenant ID. You can find the tenant ID by running the following command in a cell of the notebook:
!whoami

I've run into these errors as well. Only the first notebook you launch will have access to the Hive context. From here:
By default, Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user.
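For context, here is a minimal sketch (my own illustration, not from the question) of the call sequence that trips the error in a PySpark notebook. Constructing the HiveContext, which the first UDF registration forces, is what opens the embedded Derby metastore and takes the *.lck lock that a second notebook then fails on; the to_upper name is just a placeholder:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.types import StringType

sc = SparkContext.getOrCreate()
# Constructing the HiveContext opens the Derby metastore and acquires the
# *.lck lock; this is the call that fails in the second notebook.
sqlContext = HiveContext(sc)

# Registering a Python UDF (placeholder name) is typically the first call
# that forces the Hive context to be created.
sqlContext.registerFunction("to_upper", lambda s: s.upper() if s else None, StringType())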

Related

Azure Synapse Spark LIVY_JOB_STATE_ERROR

I'm experiencing the following error when executing any cell in my notebook:
LIVY_JOB_STATE_ERROR: Livy session has failed. Session state: Killed. Error code: LIVY_JOB_STATE_ERROR. [(my.synapse.spark.pool.name) WorkspaceType: CCID:<(hexcode)>] [Monitoring] Livy Endpoint=[https://hubservice1.westeurope.azuresynapse.net:8001/api/v1.0/publish/8dda5837-2f37-4a5d-97b9-0994b59e17f0]. Livy Id=[3] Job failed during run time with state=[error]. Source: Dependency.
My notebook was working fine until yesterday; the only thing I changed is the Spark pool, which was upgraded from Spark 2.4 to Spark 3.2 (preview). The change was made by a Terraform template deployment. Could this be the source of the issue, and if so, how can I prevent it?
The issue was fixed by deleting and recreating my Spark pool via the Azure portal. I'm still not sure which configuration inside my Terraform template caused the issue, but at least this fixes the problem for now.
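As a quick follow-up check (a hedged sketch, not from the original answer), you can confirm from a notebook cell which Spark runtime the attached pool actually provides after the change:

from pyspark.sql import SparkSession

# Print the Spark runtime version of the attached Synapse pool.
spark = SparkSession.builder.getOrCreate()
print(spark.version)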

Using databricks-connect debugging a notebook that runs another notebook

I am able to connect to the Azure Databricks cluster from my Linux CentOS VM using Visual Studio Code.
The code below even works without any issue:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("Cluster access test - ", spark.range(100).count())

setting = spark.conf.get("spark.master")  # returns local[*]
if "local" in setting:
    from pyspark.dbutils import DBUtils
    dbutils = DBUtils().get_dbutils(spark)
else:
    print("Do nothing - dbutils should be available already")

out = dbutils.fs.ls('/FileStore/')
print(out)
I have a local notebook which runs another notebook using %run path/anothernotebook.
Since the %run line is commented out with #, Python does not execute it.
So I tried dbutils.notebook.run('pathofnotebook'), but it errors out with:
Exception has occurred: AttributeError
'SparkServiceClientDBUtils' object has no attribute 'notebook'
Is it possible to locally debug a notebook that invokes another notebook?
It's not possible: the dbutils implementation included in Databricks Connect supports only the fs and secrets subcommands (see the docs).
Databricks Connect is designed to work with code developed locally, not with notebooks. If you can package the content of that notebook as a Python package, then you'll be able to debug it (a sketch follows below).
P.S. Please take into account that dbutils.notebook.run executes the notebook as a separate job, in contrast with %run.
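For illustration, a rough sketch of that packaging approach (file and function names are hypothetical, not from the question): move the child notebook's logic into an ordinary module and import it, so databricks-connect can step through it locally.

# anothernotebook.py -- hypothetical module holding the logic that used to
# live in the child notebook invoked with %run
def transform(df):
    return df.filter("id % 2 == 0")

# main.py -- runnable and debuggable locally through databricks-connect
from pyspark.sql import SparkSession
from anothernotebook import transform

spark = SparkSession.builder.getOrCreate()
print(transform(spark.range(100)).count())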

Execute databricks magic command from PyCharm IDE

With databricks-connect we can successfully run code written for Databricks notebooks from many IDEs. Databricks has also created many magic commands to support multi-language use within a cell, such as %sql or %md. One issue I am currently facing when I try to execute Databricks notebooks in PyCharm is as follows:
How do I execute a Databricks-specific magic command from PyCharm?
E.g.
Importing a script or notebook is done in Databricks using this command:
%run './FILE_TO_IMPORT'
Whereas in an IDE, from FILE_TO_IMPORT import XYZ works.
Also, every time I download a Databricks notebook it comments out the magic commands, which makes them impossible to use anywhere outside the Databricks environment.
It's really inefficient to convert all the Databricks magic commands every time I want to do any development.
Is there any configuration I could set which automatically detects Databricks-specific magic commands?
Any solution to this will be helpful. Thanks in advance!
Unfortunately, as of databricks-connect version 6.2.0:
"We cannot use magic commands outside the Databricks environment directly. This would require creating custom functions, but again that would only work for Jupyter, not PyCharm."
Again, since importing .py files requires the %run magic command, this also becomes a major issue. A workaround is to convert the set of files to be imported into a Python package, add it to the cluster via the Databricks UI, and then import and use it in PyCharm (a rough sketch follows below). But this is a very tedious process.
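As a rough sketch of that workaround (package and module names are illustrative, not from the question): package the shared .py files, build a wheel, attach it to the cluster through the Libraries UI, and then import it from PyCharm like any other dependency.

# setup.py -- minimal packaging for code that notebooks previously pulled in via %run
from setuptools import setup, find_packages

setup(
    name="shared_utils",   # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
)

# Build the wheel:  python setup.py bdist_wheel
# Attach the resulting .whl to the cluster via the Databricks Libraries UI,
# then in PyCharm import it like any other package, e.g.:
#   from shared_utils.helpers import some_function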

ADF V2 Spark Activity with Scala throwing error with error code 2312

Using Azure Data Factory Version 2, we have created a Spark Activity (a simple Hello World example), but it throws an error with error code 2312.
Our configuration is an HDInsight cluster with Azure Data Lake as primary storage.
We also tried spinning up an HDInsight cluster with Azure Blob Storage as primary storage, and we face the same issue there as well.
We further tried replacing the Scala code with a Python script (a simple hello world example), but we face the same issue.
Has anyone encountered this issue? Are we missing any basic setting?
Thanks in advance.
Maybe it's too late and you have already solved your issue. However, you can try the following:
Use Azure Databricks. Create a new Databricks instance and run your sample hello world in a notebook. If it works in the notebook, then call the same notebook from ADF.
Hope it helps.
@Yogesh, have you tried debugging the issue through ADF using the Debug option? That might help you get the exact root cause. I would also suggest trying spark-submit with the jar on a Linux box to find out the exact cause.
Also, you can find more info on https://learn.microsoft.com/en-us/azure/data-factory/data-factory-troubleshoot-guide#error-code-2312

Issues while using setConf in SparkLauncher from Windows

I am trying to trigger Pyspark code using SparkLauncher from Windows.
When I use
.setConf(SparkLauncher.DRIVER_MEMORY, "1G")
or any other configuration, the following error message is thrown:
--conf "spark.driver.memory' is not recognized as an internal or external command
Also, I need to add multiple dependency jars. For example, when I use
addJar("D:\\jars\\elasticsearch-spark-20_2.11-6.0.0-rc2.jar")
it works. But when it is used multiple times,
.addJar("D:\\jars\\elasticsearch-spark-20_2.11-6.0.0-rc2.jar")
.addJar("D:\\jars\\mongo-spark-connector_2.11-2.2.0.jar")
the following error is thrown
The filename, directory name, or volume label syntax is incorrect.
The same code works in a Linux environment.
Could someone please help me with this?
