How to get the whole cluster information in Azure Databricks at runtime? - databricks

The code below was working on an older runtime version, but since the runtime changed it no longer works in Databricks.
Latest version: 12.0 (includes Apache Spark 3.3.1, Scala 2.12)
dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags()
What is the alternative for this code?
Error: py4j.security.Py4JSecurityException: Method public scala.collection.immutable.Map com.databricks.backend.common.rpc.CommandContext.tags() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext

I have created a cluster with Databricks Runtime 12.0 and ran the command from the question; it gave the required output without any error.
But if it is the cluster information that you need, then you can use the Clusters 2.0 API. The following code would work:
import requests
import json
my_json = {"cluster_id": spark.conf.get("spark.databricks.clusterUsageTags.clusterId")}
auth = {"Authorization": "Bearer <your_access_token>"}
response = requests.get('https://<workspace_url>/api/2.0/clusters/get', json = my_json, headers=auth).json()
print(response)
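If you only need a couple of values from that call, you can pick them out of the returned JSON. A minimal sketch, assuming the usual Clusters 2.0 response fields (cluster_name, node_type_id, state); verify the field names against your workspace's API version:
# Sketch only: reuses the `response` dict from the snippet above.
cluster_name = response.get("cluster_name")
node_type = response.get("node_type_id")
state = response.get("state")
print(f"{cluster_name} ({node_type}) is currently {state}")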

You can get most of the cluster info directly from the Spark config:
%scala
val p = "spark.databricks.clusterUsageTags."
spark.conf.getAll
.collect{ case (k, v) if k.startsWith(p) => s"${k.replace(p, "")}: $v" }
.toList.sorted.foreach(println)
%python
p = "spark.databricks.clusterUsageTags."
conf = [f"{k.replace(p, '')}: {v}" for k, v in spark.sparkContext.getConf().getAll() if k.startswith(p)]
for l in sorted(conf): print(l)
[...]
clusterId: 0123-456789-0abcde1
clusterLastActivityTime: 1676449848620
clusterName: test
clusterNodeType: Standard_F4s_v2
[...]
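If you only need a single value, you can also read one tag directly, the same way the API snippet above reads clusterId (a minimal sketch; which tags are present can vary by runtime and cluster type):
prefix = "spark.databricks.clusterUsageTags."
# Keys not set on the cluster raise an error, so pass a default where you are unsure.
print(spark.conf.get(prefix + "clusterId"))
print(spark.conf.get(prefix + "clusterName", "unknown"))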

Related

Run PySpark application inside Spark application

Due to a specific need, I am solving the problem of running a Python application inside a Scala application.
Here is sample code for my applications.
Parent application (Scala):
import scala.sys.process._
import java.io.File
import java.nio.file.{Path => JavaPath}
import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

class SparkApplication(spark: SparkSession) {

  def run(): Unit = {
    val command =
      "spark-submit" ::
        s"--keytab $keytab" ::
        "--principal my_username@DOMAIN.RU" ::
        "--master yarn" ::
        "--deploy-mode cluster" ::
        s"$pyFile" :: Nil mkString " "
    command.!!
  }

  // Working directory that Spark distributes with the application.
  val containerPath: JavaPath = new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath

  val pyFile: File = moveToContainer("spark.py")
  val keytab: File = moveToContainer("my_username.keytab")

  // Copies a resource from the application jar into the container's working directory.
  def moveToContainer(fileName: String): File = {
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}
Child application (spark.py):
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('PysparkSubProcess').enableHiveSupport().getOrCreate()

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
    print("RDD count")
    print(rdd.count())

    df = spark.sql("select 1 as col1")
    df.write.format("parquet").saveAsTable("default.temp_pyspark_table")
So, as you can see above, I am successfully running a Scala application in cluster mode.
This Scala application runs a Python application inside itself (spark.py).
Also, Kerberos is configured on our cluster; that's why I use my keytab file for authorization.
But that's not enough: while the Scala application has access to Hive, Hive remains unavailable to the Python application.
That is, I can't save my test table from spark.py.
And the question probably is: is there any way to reuse the authorization from the Scala application for the Python application, so that I don't have to worry about the keytab file and the Hive configuration for the Python application?
I have heard that there are authorization tokens that are created and stored on drivers. Is it possible to reuse such tokens?
(--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)
Or maybe there are workarounds?
I still haven't been able to solve this problem.

exception: java.lang.IllegalArgumentException: Wrong FS: abfs://container@storageaccount.dfs.core.windows.net/folder, expected: hdfs://master-node:port

I am trying to execute a Scala application using spark-submit on CDP CDH 7.2.9 hosted on the Azure platform, but I am getting the error below:
User class threw exception: java.lang.IllegalArgumentException: Wrong FS: abfs://container@storageaccount.dfs.core.windows.net/folder, expected: hdfs://master-node:port at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:773)
I have tried the options below, as suggested on Stack Overflow, but with no luck:
a. Setting spark.hadoop.fs.defaultFS = "abfs://container@storageaccount.dfs.core.windows.net" in the Spark conf.
b. val hdfs = FileSystem.get(new java.net.URI("abfs://container@storageaccount.dfs.core.windows.net"), spark.sparkContext.hadoopConfiguration) in the Scala program.
c. The following in the Scala program:
val adlsURI = "abfs://container@storageaccount.dfs.core.windows.net"
FileSystem.setDefaultUri(spark.sparkContext.hadoopConfiguration, new java.net.URI(adlsURI))
val hdfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
d. Setting spark.yarn.access.hadoopFileSystems=abfs://container@storageaccount.dfs.core.windows.net in the Spark conf.
I am using Spark version 2.4.
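For what it's worth, this error usually means an abfs:// path is being handed to the default (HDFS) FileSystem. A minimal PySpark sketch of the common workaround, resolving the FileSystem from the path itself rather than from fs.defaultFS (container and account names are placeholders; the same Path.getFileSystem idea applies from Scala):
# Sketch only: resolve the FileSystem from the path's own scheme so fs.defaultFS is never consulted.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
abfs_path = jvm.org.apache.hadoop.fs.Path("abfs://container@storageaccount.dfs.core.windows.net/folder")
fs = abfs_path.getFileSystem(hadoop_conf)   # FileSystem for the abfs scheme, not the default HDFS one
for status in fs.listStatus(abfs_path):
    print(status.getPath().toString())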

How to list the files in Azure Data Lake using Spark from PyCharm (local IDE) which is connected using databricks-connect

I am working on some code on my local machine in PyCharm.
The execution is done on a Databricks cluster, while the data is stored on Azure Data Lake.
Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic on the files; for this I am using the code below:
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()

path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
The above code runs fine in Databricks notebooks, but when I try to run the same code through PyCharm using databricks-connect I get the following error:
"Wrong FS expected: file:///....."
On some digging it turns out that the code is looking in my local drive to find the "path".
I had a similar issue with Python libraries (os, pathlib).
I have no issue running other code on the cluster.
I need help figuring out how to run this so that it searches the data lake and not my local machine.
Also, the azure-datalake-store client is not an option due to certain restrictions.
You may use this.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"

val files = listFiles(root, globp)
files.toDF("path").show()
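Since the question itself runs Python over databricks-connect, another option (a sketch, assuming your databricks-connect version exposes the pyspark.dbutils shim) is to list the directory through dbutils instead of the raw Hadoop FileSystem API:
# Sketch for databricks-connect: DBUtils here is the databricks-connect shim, not the notebook-global dbutils.
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# Each FileInfo carries path, name and size.
for info in dbutils.fs.ls("adl://<Account>.azuredatalakestore.net/<path>"):
    print(info.path, info.size)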

Spark dataframe returning only structure when connected to Phoenix query server

I am connecting to HBase (ver 1.2) via the Phoenix (4.11) query server from Spark 2.2.0, but the DataFrame returns only the table structure with empty rows, though data is present in the table.
Here is the code I am using to connect to the query server.
// jar: phoenix-4.11.0-HBase-1.2-thin-client.jar
val prop = new java.util.Properties
prop.setProperty("driver", "org.apache.phoenix.queryserver.client.Driver")
val url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF"
val d1 = spark.sqlContext.read.jdbc(url,"TABLE1",prop)
d1.show()
Can anyone please help me solve this issue? Thanks in advance.
If you are using Spark 2.2, the better approach would be to load directly via Phoenix as a DataFrame. This way you only need to provide the ZooKeeper URL, and you can provide a predicate so that you load only the data required and not the entire table.
import org.apache.phoenix.spark._
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181")

val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()

val df = spark.sqlContext.phoenixTableAsDataFrame(
  "TABLE1",
  Seq("COL1", "COL2"),
  predicate = Some("\"COL1\" = 1"),
  conf = configuration)
Read this for more info on getting a table as an RDD and on saving DataFrames and RDDs.
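For reference, the Phoenix Spark connector also exposes a plain DataFrame reader that works from PySpark; a minimal sketch along the same lines as the Scala read above (it assumes the phoenix-spark and client jars are on the Spark classpath):
# Sketch only: DataFrame read through the Phoenix Spark data source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phoenix-read").getOrCreate()

df = (spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "TABLE1")
      .option("zkUrl", "localhost:2181")
      .load())

# Simple filters like this are pushed down to Phoenix where possible.
df.filter("COL1 = 1").show()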

I don't get any result from notebook in Bluemix Spark

I tried to execute my Scala code in the Bluemix Spark service. I can run it and get the right result on my local virtual machine, but when I run it in Bluemix Spark I cannot get any response in the notebook.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.Matrix
val input = sc.textFile("swift://notebooks.spark/pca.csv")
val header = input.first()
val inputData = input.filter(x => x != header).map(line=>line.split(','))
val inputVector = inputData.map{d=>
Vectors.dense(
d(1).toDouble, d(2).toDouble, d(3).toDouble, d(4).toDouble, d(5).toDouble, d(6).toDouble,
d(7).toDouble, d(8).toDouble, d(9).toDouble, d(10).toDouble, d(11).toDouble)}
val rowMatrix = new RowMatrix(inputVector)
val pca: Matrix = rowMatrix.computePrincipalComponents(5)
When I execute input.take(2), I get the result fine, but there is no result when executing input.foreach(println). It's strange. How can I get the result?
I have tested it on Bluemix in a Scala notebook.
val input = sc.textFile("swift://notebooks.spark/test.csv")
input.take(1) /** shows the first line */
input.foreach(println) /** nothing is displayed */
If you want to display the content of a RDD, then you can use the following code.
input.take(5).foreach(println) /** shows the first 5 lines */
input.collect().foreach(println) /** shows all lines */
I do not know how your local VM is set up, but I think you have to distinguish between running your code locally and running it on a cluster: on a cluster, the println inside foreach runs on the executors, so its output goes to the executor logs rather than the notebook.
Have a look at this answer for more information: How to print the contents of RDD?
