Run Pyspark application inside Spark application - apache-spark

Due to a specific need, I am solving the problem of running a python application inside a scala application.
Here is the sample code of my applications.
Parent application (Scala):
import scala.sys.process._
import java.io.File
import java.nio.file.{Path => JavaPath}
import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
class SparkApplication(spark: SparkSession) {
  // Build the spark-submit command line and run it as a child process.
  def run(): Unit = {
    val command = Seq(
      "spark-submit",
      s"--keytab $keytab",
      "--principal my_username@DOMAIN.RU",
      "--master yarn",
      "--deploy-mode cluster",
      s"$pyFile"
    ).mkString(" ")
    command.!!
  }
  // Root directory of the files shipped with this Spark application.
  val containerPath: JavaPath = new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath
  val pyFile: File = moveToContainer("spark.py")
  val keytab: File = moveToContainer("my_username.keytab")
  // Copy a classpath resource into the container directory and return the copy.
  def moveToContainer(fileName: String): File = {
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}
Child application (spark.py):
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName('PysparkSubProcess').enableHiveSupport().getOrCreate()
    rdd = spark.sparkContext.parallelize([1,2,3,4,5,6])
    print("RDD count")
    print(rdd.count())
    df = spark.sql("select 1 as col1")
    df.write.format("parquet").saveAsTable("default.temp_pyspark_table")
As you can see above, I successfully run the Scala application in cluster mode.
This Scala application launches a Python application (spark.py) from within itself.
Kerberos is configured on our cluster, which is why I pass my keytab file for authentication.
But that is not enough. While the Scala application has access to Hive, Hive remains unavailable to the Python application.
That is, I can't save my test table from spark.py.
So the question is: is there any way to reuse the Scala application's authentication for the Python application, so that I don't have to worry about the keytab file and the Hive configuration for the Python application?
I have heard that there are authorization tokens that are created and stored on the driver. Is it possible to reuse such tokens?
(--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)
Or maybe there are workarounds?
I still can't finish this problem.
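One way to start investigating the token idea is to check what the child process actually sees. The sketch below is only a diagnostic, not a fix. It assumes that the YARN container exposes the location of its token file through the HADOOP_TOKEN_FILE_LOCATION environment variable (the variable mentioned above), and the helper name describe_token_file is made up for illustration.

```python
import os


def describe_token_file(env=None):
    """Report whether a Hadoop token file is visible to this process.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    location = env.get("HADOOP_TOKEN_FILE_LOCATION")
    if not location:
        return "HADOOP_TOKEN_FILE_LOCATION is not set"
    if not os.path.isfile(location):
        return f"token file is missing: {location}"
    return f"token file found: {location} ({os.path.getsize(location)} bytes)"


# Report the state for the current process.
print(describe_token_file())
```

Calling this at the top of spark.py would show whether the variable survives the spark-submit hop from the Scala driver to the Python application.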

Related

How to get the whole cluster information in azure databricks at the runtime?

The code below worked on an older runtime version, but since the version changed it no longer works in Databricks.
Latest version: 12.0 (includes Apache Spark 3.3.1, Scala 2.12)
dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags()
What is the alternative to this code?
Error: py4j.security.Py4JSecurityException: Method public scala.collection.immutable.Map com.databricks.backend.common.rpc.CommandContext.tags() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext
I created a cluster with Databricks Runtime 12.0, and the above command gave the required output without any error.
But if it is the cluster information that you need, then you can use the Clusters API 2.0. The following code would work:
import requests
import json
my_json = {"cluster_id": spark.conf.get("spark.databricks.clusterUsageTags.clusterId")}
auth = {"Authorization": "Bearer <your_access_token>"}
response = requests.get('https://<workspace_url>/api/2.0/clusters/get', json = my_json, headers=auth).json()
print(response)
You can get most of cluster info directly from Spark config:
%scala
val p = "spark.databricks.clusterUsageTags."
spark.conf.getAll
.collect{ case (k, v) if k.startsWith(p) => s"${k.replace(p, "")}: $v" }
.toList.sorted.foreach(println)
%python
p = "spark.databricks.clusterUsageTags."
conf = [f"{k.replace(p, '')}: {v}" for k, v in spark.sparkContext.getConf().getAll() if k.startswith(p)]
for l in sorted(conf): print(l)
[...]
clusterId: 0123-456789-0abcde1
clusterLastActivityTime: 1676449848620
clusterName: test
clusterNodeType: Standard_F4s_v2
[...]
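The filtering done in the snippets above is plain string handling, so it can be tried without a cluster. The following sketch wraps it in a helper that returns a dict. The function name cluster_usage_tags and the sample pairs are made up for illustration; on Databricks the input would be spark.sparkContext.getConf().getAll().

```python
PREFIX = "spark.databricks.clusterUsageTags."


def cluster_usage_tags(conf_pairs, prefix=PREFIX):
    """Return {short_name: value} for every config key that starts with `prefix`."""
    return {k[len(prefix):]: v for k, v in conf_pairs if k.startswith(prefix)}


# Hand-written sample standing in for spark.sparkContext.getConf().getAll().
sample = [
    ("spark.app.name", "demo"),
    ("spark.databricks.clusterUsageTags.clusterName", "test"),
    ("spark.databricks.clusterUsageTags.clusterId", "0123-456789-0abcde1"),
]

tags = cluster_usage_tags(sample)
for name in sorted(tags):
    print(f"{name}: {tags[name]}")
```

Returning a dict instead of printing directly makes single values such as tags["clusterId"] easy to pick out later in the notebook.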

How to list the files in azure data lake using spark from pycharm(local IDE) which is connected using databricks-connect

I am working on some code on my local machine on pycharm.
The execution is done on a databricks cluster, while the data is stored on azure datalake.
Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic to the files. For this I am using the code below:
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
The above code runs fine in Databricks notebooks, but when I try to run the same code from PyCharm using databricks-connect, I get the following error:
"Wrong FS expected: file:///....."
After some digging, it turns out that the code is looking for the "path" on my local drive.
I had a similar issue with Python libraries (os, pathlib).
I have no issue running other code on the cluster.
I need help figuring out how to run this so that it searches the data lake and not my local machine.
Also, azure-datalake-store client is not an option due to certain restrictions.
You may use this.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)
  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }
  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)
  fileCatalog.flatMap(_._2.map(_.path))
}
val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
files.toDF("path").show()
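The key step in the Scala code above is FileSystem.get(new URI(basep), conf): the file system is resolved from the path's own URI instead of from the default configuration, which is what the "Wrong FS" message complains about. Below is a sketch of the same idea applied to the PySpark snippet from the question, using Path.getFileSystem(conf). It relies on the private sc._jvm and sc._jsc handles of PySpark, the helper name list_files is made up, and it has not been verified against databricks-connect, where the credentials for the data lake may not be available to the local JVM.

```python
def list_files(jvm, hadoop_conf, uri):
    """List (path, size) pairs under `uri` using the file system that owns it.

    `jvm` is the py4j view of the JVM (sc._jvm) and `hadoop_conf` is a Hadoop
    Configuration object (for example sc._jsc.hadoopConfiguration()).
    """
    path = jvm.org.apache.hadoop.fs.Path(uri)
    # Resolve the FileSystem from the path's scheme (adl://, dbfs:/, ...)
    # instead of FileSystem.get(conf), which returns the default file system.
    fs = path.getFileSystem(hadoop_conf)
    return [(str(s.getPath().toString()), int(s.getLen())) for s in fs.listStatus(path)]


# Intended use inside a Spark session (not executed here):
#   sc = spark.sparkContext
#   for p, size in list_files(sc._jvm, sc._jsc.hadoopConfiguration(),
#                             "adl://<Account>.azuredatalakestore.net/<path>"):
#       print(p, size)
```

Taking the JVM handle and the configuration as parameters keeps the helper independent of how the session was created.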

I don't get any result from notebook in Bluemix Spark

I tried to execute my Scala code in the Bluemix Spark service. I can run it and get the right result on my local virtual machine, but when I run it in Bluemix Spark, I do not get any response in the notebook.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.Matrix
val input = sc.textFile("swift://notebooks.spark/pca.csv")
val header = input.first()
val inputData = input.filter(x => x != header).map(line=>line.split(','))
val inputVector = inputData.map { d =>
  Vectors.dense(
    d(1).toDouble, d(2).toDouble, d(3).toDouble, d(4).toDouble, d(5).toDouble, d(6).toDouble,
    d(7).toDouble, d(8).toDouble, d(9).toDouble, d(10).toDouble, d(11).toDouble)
}
val rowMatrix = new RowMatrix(inputVector)
val pca: Matrix = rowMatrix.computePrincipalComponents(5)
When I execute input.take(2), I get the result as expected, but executing input.foreach(println) shows nothing. It's strange. How can I get the result?
I have tested it on Bluemix in a Scala notebook.
val input = sc.textFile("swift://notebooks.spark/test.csv")
input.take(1) /** shows the first line */
input.foreach(println) /** nothing is displayed */
If you want to display the content of an RDD, you can use the following code.
input.take(5).foreach(println) /** shows the first 5 lines */
input.collect().foreach(println) /** shows all lines */
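I do not know how your local VM is set up, but you have to distinguish between running your code locally and running it on a cluster. On a cluster, foreach(println) runs on the executors, so the lines are written to the executors' stdout rather than to the notebook, whereas take and collect first bring the data back to the driver.
Have a look at this answer for more information: How to print the contents of RDD?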

Error calling `JValue.extract` from distributed operations in spark-shell

I am trying to use the case class extraction feature of json4s in Spark,
i.e. calling jvalue.extract[MyCaseClass]. It works fine if I bring the JValue objects into the master and do the extraction there, but the same calls fail in the workers:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.{Try, Success, Failure}
val sqx = sqlContext
val data = sc.textFile(inpath).coalesce(2000)
case class PageView(
  client: Option[String]
)
def extract(json: JValue) = {
  implicit def formats = org.json4s.DefaultFormats
  Try(json.extract[PageView]).toOption
}
val json = data.map(parse(_)).sample(false, 1e-6).cache()
// count initial inputs
val raw = json.count
// count successful extractions locally -- same value as above
val loc = json.toLocalIterator.flatMap(extract).size
// distributed count -- always zero
val dist = json.flatMap(extract).count // always returns zero
// this throws "org.json4s.package$MappingException: Parsed JSON values do not match with class constructor"
json.map(x => {implicit def formats = org.json4s.DefaultFormats; x.extract[PageView]}).count
The implicit for Formats is defined locally in the extract function because DefaultFormats is not serializable, and defining it at the top level caused it to be serialized for transmission to the workers rather than constructed there. I think the problem still has something to do with the remote initialization of DefaultFormats, but I am not sure what it is.
When I call the extract method directly instead of my extract function, as in the last example, it no longer complains about serialization but just throws an error that the JSON does not match the expected structure.
How can I get the extraction to work when distributed to the workers?
Edit
@WesleyMiao has reproduced the problem and found that it is specific to spark-shell. He reports that this code works as a standalone application.
I got the same exception as yours when running your code in spark-shell. However when I turn your code into a real spark app and submit it to a standalone spark cluster, I got expected results with no exception.
Below is the code I put in a simple spark app.
val data = sc.parallelize(Seq("""{"client":"Michael"}""", """{"client":"Wesley"}"""))
val json = data.map(parse(_))
val dist = json.mapPartitions { jsons =>
  implicit val formats = org.json4s.DefaultFormats
  jsons.map(_.extract[PageView])
}
dist.collect() foreach println
And when I run it using spark-submit, I got the following result.
PageView(Some(Michael))
PageView(Some(Wesley))
I am also sure that it is not running in "local[*]" mode.
Now I suspect that the exceptions we got in spark-shell have something to do with how the case class PageView is defined in spark-shell and how spark-shell serializes and distributes it to the executors.
As suggested here, I would move the object creation into the map, i.e. I would have a function createPageViews that has extract as an internal function, and pass createPageViews to the workers.
More precisely, I would use mapPartitions instead of map, so that createPageViews (and its internal function definition) is called only once per partition and not once per record.
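The per-partition versus per-record cost mentioned above can be illustrated without Spark. In this sketch plain Python lists stand in for partitions, make_formats is a hypothetical stand-in for constructing a non-serializable helper such as DefaultFormats, and a counter records how often it is built.

```python
calls = {"n": 0}


def make_formats():
    """Stand-in for building an expensive, non-serializable helper object."""
    calls["n"] += 1
    return str.upper  # the "helper" simply upper-cases a record


def per_record(partitions):
    # map-style: the helper is rebuilt for every single record.
    return [[make_formats()(rec) for rec in part] for part in partitions]


def per_partition(partitions):
    # mapPartitions-style: the helper is built once and reused within the partition.
    out = []
    for part in partitions:
        fmt = make_formats()
        out.append([fmt(rec) for rec in part])
    return out


partitions = [["a", "b", "c"], ["d", "e"]]

calls["n"] = 0
result_map = per_record(partitions)
calls_map = calls["n"]

calls["n"] = 0
result_parts = per_partition(partitions)
calls_parts = calls["n"]

print(calls_map, calls_parts)
```

Both variants produce the same records; only the number of helper constructions differs, one per record in the first case and one per partition in the second.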

Parquet file in Spark SQL

I am trying to use Spark SQL with the Parquet file format. When I try the basic example:
object parquet {
  case class Person(name: String, age: Int)
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
I get a null pointer exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.parquet$.main(parquet.scala:16)
which is the saveAsParquetFile line. What's the issue here?
This error occurred when I was using Spark in Eclipse on Windows. I tried the same in spark-shell and it works fine. I guess Spark might not be 100% compatible with Windows.
Spark is compatible with Windows. You can run your program in a spark-shell session on Windows, or you can run it using spark-submit with the necessary arguments such as "--master" (again, on Windows or another OS).
You cannot just run your Spark program as an ordinary Java program in Eclipse without properly setting up the Spark environment. Your problem has nothing to do with Windows.
