Spark.sql not working on EMR (Serverless) - apache-spark

The following script does not create the table at the S3 location indicated by the query.
I tested it locally and the Delta JSON log file was created and contained the table metadata.
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .enableHiveSupport()
    .appName('omop_ddl')
    .getOrCreate()
)
spark.sql(f"""
CREATE
OR REPLACE TABLE CONCEPT (
CONCEPT_ID LONG,
CONCEPT_NAME STRING,
DOMAIN_ID STRING,
VOCABULARY_ID STRING,
CONCEPT_CLASS_ID STRING,
STANDARD_CONCEPT STRING,
CONCEPT_CODE STRING,
VALID_START_DATE DATE,
VALID_END_DATE DATE,
INVALID_REASON STRING
) USING DELTA
LOCATION 's3a://ls-dl-mvp-s3deltalake/health_lakehouse/silver/concept';
""")
The configuration parameters are the following:
--conf spark.jars=s3a://ls-dl-mvp-s3development/spark_jars/delta-core_2.12-2.1.0.jar,s3a://ls-dl-mvp-s3development/spark_jars/delta-storage-2.1.0.jar
--conf spark.executor.cores=1
--conf spark.executor.memory=4g
--conf spark.driver.cores=1
--conf spark.driver.memory=4g
--conf spark.executor.instances=1
I tried to modify the location in the query by pointing it at a non-existent bucket, and the script still did not raise an error. Am I forgetting something? I know I could try the DataFrame API, but I'd like to stick with SQL syntax.
Thank you very much for your help.
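One thing worth checking, not stated in the question: the Delta Lake documentation lists two session settings that SQL commands such as CREATE TABLE ... USING DELTA need in addition to the jars, namely the Delta SQL extension and the Delta catalog. A minimal sketch of the session setup with those settings, assuming delta-core 2.1.0 as in the job configuration above (whether this resolves the EMR Serverless behaviour is an assumption, not something confirmed here):

from pyspark.sql import SparkSession

# Hedged sketch: register the Delta SQL extension and catalog so that
# "USING DELTA" DDL is handled by Delta rather than the default catalog.
spark = (SparkSession
    .builder
    .appName('omop_ddl')
    .enableHiveSupport()
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

The same two settings can alternatively be passed at submit time as --conf spark.sql.extensions=... and --conf spark.sql.catalog.spark_catalog=... alongside the spark.jars entry shown above.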

Related

JDBC not truncating Postgres table on pyspark

I'm using the following code to truncate a table before inserting data into it.
df.write \
    .option("driver", "org.postgresql:postgresql:42.2.16") \
    .option("truncate", True) \
    .jdbc(url=pgsql_connection, table="service", mode='append', properties=properties_postgres)
However, it is not working: the table still contains the old data. I'm using append because I don't want the database to drop and recreate the table every time.
I've also tried .option("truncate", "true"), but that didn't work either.
I get no error messages. How can I solve this problem and truncate my table using .option?
You need to use overwrite mode:
df.write \
    .option("driver", "org.postgresql:postgresql:42.2.16") \
    .option("truncate", True) \
    .jdbc(url=pgsql_connection, table="service", mode='overwrite', properties=properties_postgres)
As given in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html):
truncate: when SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it.
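For context, a minimal end-to-end sketch of the overwrite-plus-truncate write; the JDBC URL, credentials, and the contents of properties_postgres are illustrative placeholders, not taken from the question. Note also that the driver property expects the JDBC driver class name (org.postgresql.Driver), not Maven coordinates:

# Illustrative connection details; replace with real values.
pgsql_connection = "jdbc:postgresql://localhost:5432/mydb"
properties_postgres = {
    "user": "postgres",
    "password": "secret",
    "driver": "org.postgresql.Driver",  # class name, not Maven coordinates
}

# Overwrite mode with truncate=true empties and reuses the existing table
# instead of dropping and recreating it, so its schema and grants survive.
df.write \
    .option("truncate", "true") \
    .jdbc(url=pgsql_connection, table="service", mode="overwrite",
          properties=properties_postgres)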

Issue with Apache Hudi Update and Delete Operation on Parquet S3 File

Here I am trying to simulate updates and deletes over a Hudi dataset and want to see the state reflected in the Athena table. We use the EMR, S3, and Athena services of AWS.
Attempting Record Update with a withdrawal object
withdrawalID_mutate = 10382495

updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
    .withColumn("accountHolderName", lit("Hudi_Updated"))

updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)

hudiDF = spark.read \
    .format("hudi") \
    .load(tablePath) \
    .filter(col("withdrawalID") == withdrawalID_mutate)
hudiDF.show()
The read shows the updated record, but in the Athena table it appears as an appended (duplicate) row. Probably something to do with the Glue Catalog?
Attempting Record Delete
deleteDF = updateDF  # deleting the updated record above

deleteDF.write.format("hudi") \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)
The Athena table still shows the deleted record.
I also tried mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest.
Has anyone faced the same issue and can guide me in the right direction?
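For reference, a hedged sketch of an alternative delete path using Hudi's dedicated 'delete' write operation rather than the EmptyHoodieRecordPayload route; it assumes hudi_options carries the same record key, precombine field, and table settings used for the original writes, and it is not confirmed to fix the Athena visibility problem described above:

# Hedged sketch: hard-delete the selected records via the 'delete' operation.
deleteDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate)

deleteDF.write.format("hudi") \
    .options(**hudi_options) \
    .option("hoodie.datasource.write.operation", "delete") \
    .mode("append") \
    .save(tablePath)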

How to start sparksession in pyspark

I want to change the default memory, executor, and core settings of a Spark session.
The first code cell in my PySpark notebook on an HDInsight cluster in Jupyter looks like this:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Juanita_Smith") \
    .config("spark.executor.instances", "2") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()
On completion, I read the parameters back, and it looks like the statement worked.
However, if I look in YARN, the settings have not taken effect.
Which settings or commands do I need to use for the session configuration to take effect?
Thank you for your help in advance.
By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
Update
If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:
"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"

How to add Hive properties at runtime in spark-shell

How do you set a Hive property such as hive.metastore.warehouse.dir at runtime? Or, at least, is there a more dynamic way of setting such a property than putting it in a file like spark_home/conf/hive-site.xml?
I faced the same issue, and for me it worked to set the Hive properties from Spark (2.4.0). Please find below all the options, via spark-shell, spark-submit, and SparkConf.
Option 1 (spark-shell)
spark-shell --conf spark.hadoop.hive.metastore.warehouse.dir=some_path\metastore_db_2
Initially I tried spark-shell with hive.metastore.warehouse.dir set to some_path\metastore_db_2. Then I got the following warning:
Warning: Ignoring non-spark config property:
hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
However, when I create a Hive table with:
bigDf.write.mode("overwrite").saveAsTable("big_table")
the Hive metadata is stored correctly under the metastore_db_2 folder.
When I use spark.hadoop.hive.metastore.warehouse.dir instead, the warning disappears and the results are still saved in the metastore_db_2 directory.
Option 2 (spark-submit)
To use hive.metastore.warehouse.dir when submitting a job with spark-submit, I followed these steps.
First, I wrote some code to save some random data with Hive:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().setAppName("metastore_test").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()

import spark.implicits._

var dfA = spark.createDataset(Seq(
  (1, "val1", "p1"),
  (2, "val1", "p2"),
  (3, "val2", "p3"),
  (3, "val3", "p4"))).toDF("id", "value", "p")

dfA.write.mode("overwrite").saveAsTable("metastore_test")
spark.sql("select * from metastore_test").show(false)
Next, I submitted the job with:
spark-submit --class org.tests.Main \
  --conf spark.hadoop.hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2 \
  spark-scala-test_2.11-0.1.jar
The metastore_test table was properly created under the C:\winutils\hadoop-2.7.1\bin\metastore_db_2 folder.
Option 3 (SparkConf)
Via a SparkConf set in the Spark code and passed to the SparkSession.
val sparkConf = new SparkConf()
  .setAppName("metastore_test")
  .set("spark.hadoop.hive.metastore.warehouse.dir", "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
  .setMaster("local")
This attempt was successful as well.
The question that still remains is why I have to prefix the property with spark.hadoop for it to work as expected.

How to extract application ID from the PySpark context

A previous question recommends sc.applicationId, but it is not present in PySpark, only in Scala.
So, how do I figure out the application ID (for YARN) of my PySpark process?
You could use the Java SparkContext object through the Py4J RPC gateway:
>>> sc._jsc.sc().applicationId()
u'application_1433865536131_34483'
Please note that sc._jsc is an internal variable and not part of the public API, so there is a (rather small) chance that it may be changed in the future.
I'll submit a pull request to add a public API call for this.
In Spark 1.6 (probably 1.5, according to #wladymyrov in a comment on the other answer):
In [1]: sc.applicationId
Out[1]: u'local-1455827907865'
For PySpark 2.0.0+
spark_session = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()

app_id = spark_session._sc.applicationId
Looks like it's available in 3.0.1 at least:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Overriding defaults app name') \
    .getOrCreate()

print(f'--- {spark.sparkContext.applicationId} ---')
Results in:
--- application_1610550667906_166057 ---
