Databricks, AzureCredentialNotFoundException - databricks

I have a High Concurrency cluster with Active Directory integration turned on. Runtime: Latest stable (Scala 2.11), Python: 3.
I've mounted Azure Data Lake, and the first time I read the data after each cluster start I always get:
com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token
When I rerun it works fine. I read data in the following way:
df = spark.read.option("inferSchema","true").option("header","true").json(path)
Any idea what is wrong?
Thanks!
Tomek

I believe you can only run the command using a high concurrency cluster. If you've attached your notebook to a standard cluster, the command won't work.
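Since the question says the read succeeds on a rerun, one possible stopgap is to retry the first read automatically. This is only a sketch of a workaround, not a fix for the missing token; the helper name read_with_retry and its parameters are made up for illustration, and the commented usage assumes the spark and path variables from the question.

```python
import time


def read_with_retry(read_fn, retries=1, delay_seconds=0.0, retry_on=(Exception,)):
    """Call read_fn(); on failure, wait and try again up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return read_fn()
        except retry_on:
            if attempt >= retries:
                raise  # out of retries: surface the original error
            attempt += 1
            time.sleep(delay_seconds)


# Hypothetical notebook usage, with `spark` and `path` as in the question:
# df = read_with_retry(
#     lambda: spark.read.option("inferSchema", "true").option("header", "true").json(path),
#     retries=1,
#     delay_seconds=5,
# )
```

Catching every Exception is deliberately broad here; narrowing retry_on to the error type you actually see would be safer.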

Related

Best Pyspark Testing: issue with databricks-connect

I'm currently using Databricks, and to test my Databricks code I'm using databricks-connect in VS Code. Since yesterday it has suddenly started behaving strangely: when I submit code from VS Code, databricks-connect uses the access rights of the person who created the Databricks cluster. He doesn't have adequate access to all the resources, so my test cases and code fail with access errors.
Some more input: I have a function that performs an update operation on a Delta table, which means I can't use a normal Hive or temp table, as they don't support update operations.
I have tried Delta Lake locally, but that doesn't seem to work and isn't very convenient, as I wouldn't be able to access my ADLS location (unless I specifically change the configuration).
So, my question is: how are you doing Spark-specific testing? (I'm using pytest.)
I found there isn't much material on testing Databricks code on the internet. Any help?

Read a Databricks table via Databricks api in Python?

Using Python-3, I am trying to compare an Excel (xlsx) sheet to an identical spark table in Databricks. I want to avoid doing the compare in Databricks. So I am looking for a way to read the spark table via the Databricks api. Is this possible? How can I go on to read a table: DB.TableName?
There is no way to read the table from the DB API as far as I am aware unless you run it as a job as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up you can use pyodbc to connect to databricks and run a query. At this time the ODBC driver will only allow you to run Spark-SQL commands.
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
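A rough sketch of the pyodbc route described above. The fetch_table helper works with any DB-API 2.0 connection, so it can be tried locally against SQLite before pointing it at the cluster. The helper names and the DSN name passed to connect_databricks are assumptions for illustration, and pyodbc plus a configured Databricks ODBC DSN are needed for the real connection.

```python
import sqlite3  # stdlib stand-in so the sketch runs without a cluster


def fetch_table(conn, table_name):
    """Run SELECT * on any DB-API 2.0 connection; return (columns, rows)."""
    cur = conn.cursor()
    try:
        # table_name is interpolated directly, so only pass trusted names
        cur.execute(f"SELECT * FROM {table_name}")
        columns = [d[0] for d in cur.description]
        rows = [tuple(r) for r in cur.fetchall()]
    finally:
        cur.close()
    return columns, rows


def connect_databricks(dsn_name):
    """Open a pyodbc connection through an ODBC DSN (dsn_name is hypothetical)."""
    import pyodbc  # third-party; imported lazily so the local demo needs no driver

    return pyodbc.connect(f"DSN={dsn_name}", autocommit=True)


# Local demonstration against an in-memory SQLite table
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE TableName (id INTEGER, name TEXT)")
demo.executemany("INSERT INTO TableName VALUES (?, ?)", [(1, "a"), (2, "b")])
columns, rows = fetch_table(demo, "TableName")
print(columns, rows)
```

Against Databricks the call would look like fetch_table(connect_databricks("YourDsn"), "DB.TableName"), and the resulting rows can then be compared with the Excel sheet on the local machine.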
I can recommend that you write the PySpark code in a notebook, call the notebook from a previously defined job, and establish a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark or convert the data frames to pandas if you wish. When the notebook finishes the comparison, it can return the result from that particular job. I think that sending whole Databricks tables could be impossible because of the API limitation; you have the Spark cluster to perform complex operations, and the API should be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task
returns a value through the dbutils.notebook.exit() call, you can use
this endpoint to retrieve that value. Azure Databricks restricts this
API to return the first 5 MB of the output. For returning a larger
result, you can store job results in a cloud storage service.
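A minimal sketch of calling that endpoint from a local machine, assuming the Jobs API 2.0 path from the linked page and a personal access token for authentication. The function names, host, and sample payload are illustrative only; the notebook_output.result field is where the value passed to dbutils.notebook.exit() is expected to appear.

```python
import json
import urllib.parse
import urllib.request


def build_get_output_request(host, token, run_id):
    """Build a GET request for the jobs/runs/get-output endpoint."""
    query = urllib.parse.urlencode({"run_id": run_id})
    url = f"{host.rstrip('/')}/api/2.0/jobs/runs/get-output?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def extract_notebook_result(payload):
    """Return the string passed to dbutils.notebook.exit(), or None if absent."""
    return payload.get("notebook_output", {}).get("result")


def get_run_result(host, token, run_id):
    """Call the endpoint and return the notebook's exit value (needs network access)."""
    request = build_get_output_request(host, token, run_id)
    with urllib.request.urlopen(request) as response:
        return extract_notebook_result(json.load(response))


# Offline check with a made-up response body
sample = {
    "metadata": {"run_id": 42},
    "notebook_output": {"result": "3 rows differ", "truncated": False},
}
print(extract_notebook_result(sample))
```

Because of the 5 MB limit quoted above, this suits returning a small comparison summary rather than the table itself.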

Do nodes in Spark Cluster share the same storage?

I am a newbie to Spark. I am using Azure Databricks and I am writing Python code with PySpark. There is one particular topic that is confusing me:
Do nodes have separate storage memory (I don't mean RAM/cache)? Or do they all share the same storage? If they share the same storage, then can two different applications running in different Spark Contexts exchange data accordingly?
I don't understand why sometimes we refer to the storage by dbfs:/tmp/..., and other times we refer to it by /dbfs/tmp/... Example: If I am using the dbutils package from Databricks, we use something like dbfs:/tmp/... to refer to a directory in the file system. However, if I'm using regular Python code, I say /dbfs/tmp/.
Your help is much appreciated!!
Each node has its own RAM and cache. For example, suppose you have a cluster with 4 GB and 3 nodes: when you deploy your Spark application, it runs worker processes according to the cluster configuration and query requirements, and it creates virtual machines on separate nodes or on the same node. The memory of these nodes is not shared between them during the life of the application.
This is more of a Hadoop resource-sharing question, and you can find more information under YARN resource management. This is a very brief overview:
https://databricks.com/session/resource-management-and-spark-as-a-first-class-data-processing-framework-on-hadoop
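On the dbfs:/ versus /dbfs/ part of the question, which the answer above does not cover: as far as I know, both forms point at the same DBFS location. dbfs:/... is the URI form used by Spark and dbutils, while /dbfs/... is the path under which DBFS is mounted on the node's local file system, which is what ordinary Python file APIs see. A small sketch of converting between the two (the helper names are made up for illustration):

```python
def to_local_path(path):
    """dbfs:/tmp/x -> /dbfs/tmp/x (form used by ordinary Python file APIs)."""
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path  # already a local-style path; leave unchanged


def to_dbfs_uri(path):
    """/dbfs/tmp/x -> dbfs:/tmp/x (form used by Spark and dbutils)."""
    prefix = "/dbfs/"
    if path.startswith(prefix):
        return "dbfs:/" + path[len(prefix):]
    return path  # already a URI or some other path; leave unchanged


print(to_local_path("dbfs:/tmp/data.csv"))
print(to_dbfs_uri("/dbfs/tmp/data.csv"))
```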

HDInsight SparkHistory on Azure shows no applications

I have created a Spark HDInsight Cluster on Azure. The cluster was used to run different jobs (either Spark or Hive).
Until a month ago, the history of the jobs could be seen in the Spark History Server dashboard. It seems that following the update that introduced Spark 1.6.0, this dashboard is no longer showing any applications.
I have also tried to bypass this issue by executing the PowerShell cmdlet for get-azurehdinsightjob as suggested here. The output is again an empty list of applications.
I would appreciate any help as this dashboard used to work and now all my experiments are stalled.
I managed to solve the issue by deleting everything inside wasb:///hdp/spark-events. Maybe the issue was related to the size of the folder, as no other log files could be appended.
All the following jobs are now appearing successfully in the Spark History Server dashboard.

Enable Spark on Same Node As Cassandra

I am trying to test out Spark so I can summarize some data I have in Cassandra. I've been through all the DataStax tutorials and they are very vague as to how you actually enable spark. The only indication I can find is that it comes enabled automatically when you select "Analytics" node during install. However, I have an existing Cassandra node and I don't want to have to use a different machine for testing as I am just evaluating everything on my laptop.
Is it possible to just enable Spark on the same node and deal with any performance implications? If so how can I enable it so that it can be tested?
I see the folders there for Spark (although I'm not positive all the files are present) but when I check to see if it's set to Spark master, it says that no spark nodes are enabled.
dsetool sparkmaster
I am using Linux Ubuntu Mint.
I'm just looking for a quick and dirty way to get my data averaged and so forth and Spark seems like the way to go since it's a massive amount of data, but I want to avoid having to pay to host multiple machines (at least for now while testing).
Yes, Spark is also able to interact with a cluster even if it is not on all the nodes.
Package install
Edit the /etc/default/dse file, and then edit the appropriate line
to this file, depending on the type of node you want:
...
Spark nodes:
SPARK_ENABLED=1
HADOOP_ENABLED=0
SOLR_ENABLED=0
Then restart the DSE service
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseServ.html
Tar Install
Stop DSE on the node and then restart it using the following command
From the install directory:
...
Spark only node: $ bin/dse cassandra -k - Starts Spark trackers on a cluster of Analytics nodes.
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseStandalone.html
Enable Spark by setting SPARK_ENABLED=1 in the file opened
using the command: sudo nano /usr/share/dse/resources/dse/conf/dse.default
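If you would rather script the change than edit the file by hand, here is a rough sketch that rewrites NAME=value lines in the text of such a defaults file. It only works on a string, so reading and writing the real file (with the right permissions, and a backup) is left to you; the function name and the sample content are illustrative.

```python
import re


def set_flags(text, flags):
    """Return text with each NAME=value line updated, or appended if missing."""
    lines = text.splitlines()
    for name, value in flags.items():
        pattern = re.compile(rf"^\s*{re.escape(name)}\s*=")
        new_line = f"{name}={value}"
        for i, line in enumerate(lines):
            if pattern.match(line):
                lines[i] = new_line  # replace the first existing assignment
                break
        else:
            lines.append(new_line)  # flag not present yet: add it at the end
    return "\n".join(lines) + "\n"


# Made-up file content standing in for /etc/default/dse
sample = "HADOOP_ENABLED=0\nSPARK_ENABLED=0\n"
updated = set_flags(sample, {"SPARK_ENABLED": 1, "HADOOP_ENABLED": 0, "SOLR_ENABLED": 0})
print(updated)
```

After writing the result back, the DSE service still has to be restarted as described above.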
