How to extract application ID from the PySpark context - apache-spark

A previous question recommends sc.applicationId, but it is not present in PySpark, only in Scala.
So, how do I figure out the application ID (for YARN) of my PySpark process?

You could use the Java SparkContext object through the Py4J RPC gateway:
>>> sc._jsc.sc().applicationId()
u'application_1433865536131_34483'
Please note that sc._jsc is an internal variable and not part of the public API, so there is a (rather small) chance that it may change in the future.
I'll submit a pull request to add a public API call for this.

In Spark 1.6 (probably 1.5, according to @wladymyrov in a comment on the other answer):
In [1]: sc.applicationId
Out[1]: u'local-1455827907865'

For PySpark 2.0.0+
from pyspark.sql import SparkSession

spark_session = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()

app_id = spark_session._sc.applicationId

Looks like it's available in 3.0.1 at least:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('Overriding defaults app name') \
    .getOrCreate()
print(f'--- {spark.sparkContext.applicationId} ---')
Results in:
--- application_1610550667906_166057 ---

Related

Read/Write Delta Lake tables on S3 using AWS Glue jobs

I am trying to access Delta Lake tables stored on S3 using AWS Glue jobs, but I am getting the error "Module Delta not defined".
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0").getOrCreate()
from delta.tables import *
data = spark.range(0, 5)
data.write.format("delta").save("S3://databricksblaze/data")
I also added the necessary JAR (delta-core_2.11-0.6.0.jar) to the dependent JARs of the Glue job.
Can anyone help me with this?
Thanks
I have had success using Glue + Delta Lake. I added the Delta Lake dependencies to the "Dependent jars path" section of the Glue job.
Here is the list (I am using Delta Lake 0.6.1):
com.ibm.icu_icu4j-58.2.jar
io.delta_delta-core_2.11-0.6.1.jar
org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar
org.antlr_antlr4-4.7.jar
org.antlr_antlr4-runtime-4.7.jar
org.antlr_antlr-runtime-3.5.2.jar
org.antlr_ST4-4.0.8.jar
org.glassfish_javax.json-1.0.4.jar
Then in your Glue job you can use the following code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")
from delta.tables import *
glueContext = GlueContext(sc)
spark = glueContext.spark_session
delta_path = "s3a://your_bucket/folder"
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)
deltaTable = DeltaTable.forPath(spark, delta_path)
You need to pass the additional configuration properties
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Setting spark.jars.packages in SparkSession.builder.config doesn't work. spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit, so it must be passed as an argument of the spark-submit or pyspark script. By the time SparkSession.builder.config is called, SparkSubmit has already done its job, so spark.jars.packages is a no-op at that point. See https://issues.apache.org/jira/browse/SPARK-21752 for more details.
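If you control how the PySpark process is launched (outside Glue), one workaround consistent with the above is to put the --packages argument into PYSPARK_SUBMIT_ARGS before the first SparkSession is created, so that spark-submit still sees it; the Delta SQL extension, by contrast, is an ordinary session config and can go through the builder. A minimal sketch, using the delta-core coordinate from the question (adjust it to your Spark/Delta versions):
import os

# spark.jars.packages is consumed by spark-submit, so it must be in place before
# the JVM is launched; setting PYSPARK_SUBMIT_ARGS from plain Python is one way
# to do that. (Package coordinate taken from the question above.)
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.delta:delta-core_2.11:0.6.0 pyspark-shell"
)

from pyspark.sql import SparkSession

# The extension setting is a regular config and can be passed through the
# builder, provided no SparkSession exists yet.
spark = (
    SparkSession.builder
    .appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)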

How to export a Datastax graph based on a specific traversal using DseGraphFrame

I would like to export a DSE graph via a Spark job, as per
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameExport.html
All this works fine within the spark-shell, but I want to do this in Java using DseGraphFrame.
Unfortunately there is not much in the documentation.
I am able to package a jar with the following code and do a spark-submit:
SparkSession spark = SparkSession
    .builder()
    .appName("Datastax Java example")
    .getOrCreate();
DseGraphFrame dseGraphFrame = DseGraphFrameBuilder.dseGraph(args[0], spark);
DataFrameWriter dataFrameWriter = dseGraphFrame.V().df().write();
dataFrameWriter.csv("vertices");
The above works fine; what I want to do is use a specific traversal to filter what I export, i.e. something like:
dseGraphFrame.V().hasLabel("label").df().write();
This does not work, as dseGraphFrame.V().hasLabel("label") does not have a .df() method.
Is this the correct way of doing things? Any help would be appreciated.
A late answer to this question, perhaps still of use:
In Java, you need to cast the traversal to a DseGraphTraversal first. It can then be converted to a DataFrame with the .df() method:
((DseGraphTraversal)dseGraphFrame.V().hasLabel("label")).df().write();

How to start a SparkSession in PySpark

I want to change the default memory, executor and core settings of a spark session.
The first code cell in my PySpark notebook on an HDInsight cluster in Jupyter looks like this:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Juanita_Smith") \
    .config("spark.executor.instances", "2") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()
On completion, I read the parameters back, which makes it look like the statement worked.
However, if I look in YARN, the settings have indeed not taken effect.
Which settings or commands do I need to use to make the session configuration take effect?
Thank you for your help in advance.
By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
Update
If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:
"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"

Can't access Spark 2.0 Temporary Table from beeline

With Spark 1.5.1, I've already been able to access spark-shell temporary tables from Beeline using the Thrift Server. I've been able to do so by reading answers to related questions on Stack Overflow.
However, after upgrading to Spark 2.0, I can't see temporary tables from Beeline anymore, here are the steps I'm following.
I'm launching spark-shell using the following command:
./bin/spark-shell --master=myHost.local:7077 --conf spark.sql.hive.thriftServer.singleSession=true
Once the Spark shell is ready, I enter the following lines to launch the Thrift server and create a temporary view from a DataFrame whose source is a JSON file:
import org.apache.spark.sql.hive.thriftserver._
spark.sqlContext.setConf("hive.server2.thrift.port","10002")
HiveThriftServer2.startWithContext(spark.sqlContext)
val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("people")
spark.sql("select * from people").show()
The last statement displays the table; it runs fine.
However, when I start Beeline and connect to my Thrift server instance, I can't see any temporary tables:
show tables;
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0,658 seconds)
Did I miss something regarding my Spark upgrade from 1.5.1 to 2.0? How can I gain access to my temporary tables?
This worked for me after upgrading to Spark 2.0.1:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sparkConf =
  new SparkConf()
    .setAppName("Spark Thrift Server Demo")
    .setMaster(sparkMaster)
    .set("hive.metastore.warehouse.dir", hdfsDataUri + "/hive")

val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .config(sparkConf)
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
HiveThriftServer2.startWithContext(sqlContext)

How to add Hive properties at runtime in spark-shell

How do you set a Hive property like hive.metastore.warehouse.dir at runtime? Or at least, is there a more dynamic way of setting such a property than putting it in a file like spark_home/conf/hive-site.xml?
I faced the same issue, and for me it worked to set the Hive properties from Spark (2.4.0). Please find below all the options, through spark-shell, spark-submit and SparkConf.
Option 1 (spark-shell)
spark-shell --conf spark.hadoop.hive.metastore.warehouse.dir=some_path\metastore_db_2
Initially I tried spark-shell with hive.metastore.warehouse.dir set to some_path\metastore_db_2. Then I got the following warning:
Warning: Ignoring non-spark config property:
hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
Despite the warning, when I create a Hive table with:
bigDf.write.mode("overwrite").saveAsTable("big_table")
the Hive metadata is stored correctly under the metastore_db_2 folder.
When I use spark.hadoop.hive.metastore.warehouse.dir, the warning disappears and the results are still saved in the metastore_db_2 directory.
Option 2 (spark-submit)
In order to use hive.metastore.warehouse.dir when submitting a job with spark-submit, I followed the steps below.
First I wrote some code to save some random data with Hive:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("metastore_test").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
var dfA = spark.createDataset(Seq(
  (1, "val1", "p1"),
  (2, "val1", "p2"),
  (3, "val2", "p3"),
  (3, "val3", "p4"))).toDF("id", "value", "p")
dfA.write.mode("overwrite").saveAsTable("metastore_test")
spark.sql("select * from metastore_test").show(false)
Next I submitted the job with:
spark-submit --class org.tests.Main \
  --conf spark.hadoop.hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2 \
  spark-scala-test_2.11-0.1.jar
The metastore_test table was properly created under the C:\winutils\hadoop-2.7.1\bin\metastore_db_2 folder.
Option 3 (SparkConf)
Via SparkSession in the Spark code.
val sparkConf = new SparkConf()
  .setAppName("metastore_test")
  .set("spark.hadoop.hive.metastore.warehouse.dir", "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
  .setMaster("local")
This attempt was successful as well.
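For completeness, the same Option 3 idea expressed from PySpark would look roughly like the sketch below (my assumption being that the spark.hadoop. prefix behaves the same way there; the table name is just illustrative):
from pyspark.sql import SparkSession

# PySpark sketch of Option 3: prefix the Hive property with spark.hadoop. and
# set it on the builder before the session is created.
spark = (
    SparkSession.builder
    .appName("metastore_test")
    .master("local")
    .config("spark.hadoop.hive.metastore.warehouse.dir",
            "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
    .getOrCreate()
)

spark.range(5).write.mode("overwrite").saveAsTable("metastore_test_pyspark")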
The question that still remains is why I have to prefix the property with spark.hadoop for it to work as expected.
