Not able to create Parquet tables by default using spark-submit jobs on EMR

I was able to run an EMR step like this:
spark-sql -f "script_location" --jars EMR_SPARK_JARFILE_FULL_PATH --hiveconf hive.default.fileformat=parquet --hiveconf hive.default.fileformat.managed=parquet --conf spark.sql.crossJoin.enabled=true --deploy-mode cluster
With this I set Spark to create tables in Parquet by default from Spark SQL scripts (this works as expected).
Now, following the documentation https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
I want to do the same thing, but this time from PySpark, so I tried to run:
spark-submit "script_location" \
--jars "jar_location" \
--conf spark.hive.default.fileformat=Parquet \
--conf spark.hive.default.fileformat.managed=Parquet \
--conf spark.sql.crossJoin.enabled=true \
--deploy-mode cluster
It seems spark-submit is not setting Parquet as the default format for table creation.
Is there something I'm missing?
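For reference, a minimal sketch of applying the same settings inside the PySpark script itself via the SparkSession builder, following the spark.hive.* convention from the linked documentation; the app name and table are illustrative, and the configs must be set before the session is first created (if an existing session is reused on the cluster they may not take effect):

from pyspark.sql import SparkSession

# spark.hive.* settings are forwarded to Hive as hive.* (per the linked docs)
spark = (
    SparkSession.builder
    .appName("parquet-default-sketch")  # illustrative name
    .enableHiveSupport()
    .config("spark.hive.default.fileformat", "parquet")
    .config("spark.hive.default.fileformat.managed", "parquet")
    .config("spark.sql.crossJoin.enabled", "true")
    .getOrCreate()
)

# A managed table created without an explicit STORED AS clause should now default to Parquet
spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING)")
spark.sql("DESCRIBE FORMATTED demo_table").show(truncate=False)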

Related

Spark Kafka Streaming not displaying data on spark-submit on EMR

I am trying to stream data from a Kafka topic. It works in the Spark shell, but if I put the same code in a .py file and run it with spark-submit, it fails:
Code:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("TestApp").enableHiveSupport().getOrCreate()
kafka_bootstrap_server = BOOTSTRAP_SERVERS  # defined elsewhere in the script
topic = 'ota-impactreportsync'
starting_offsets = 'earliest'
df = (spark_session.readStream.format("kafka")
      .option("kafka.bootstrap.servers", kafka_bootstrap_server)
      .option("subscribe", topic)
      .option("startingOffsets", starting_offsets)
      .option("failOnDataLoss", "false")
      .load())
df.writeStream.format("console").outputMode("append").start()
Commands used:
pyspark --master local \
--packages io.delta:delta-core_2.12:2.1.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
spark-submit --master local \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.1,org.apache.spark:spark-avro_2.12:3.3.1 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
test.py
If I run the same logic as a batch read with spark-submit, it works.
The streaming job fails after about 10 seconds every time, and the logs are not helpful either. No errors.
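One thing worth checking, though it is not stated in the original post: writeStream.start() returns immediately, so if the script's main body ends right after it, the spark-submit driver can exit and tear the query down, while an interactive shell keeps the session alive. A minimal sketch of blocking on the query (names as in the code above):

# start() only launches the query; block until it terminates so the driver stays alive
query = df.writeStream.format("console").outputMode("append").start()
query.awaitTermination()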

Apache Spark application is not seen in Spark Web UI (Java)

I am trying to run an ETL job using Apache Spark (Java) on a Kubernetes cluster. The application runs and data is inserted into the database (MySQL), but the application does not appear in the Spark Web UI.
The command I used for submitting the application is:
./spark-submit --class com.xxxx.etl.EtlApplication \
--name MyETL \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:32" \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf "spark.kubernetes.authenticate.driver.serviceAccountName=my-spark" \
--conf "spark.kubernetes.driver.request.cores=256m" \
--conf "spark.kubernetes.driver.limit.cores=512m" \
--conf "spark.kubernetes.executor.request.cores=256m" \
--conf "spark.kubernetes.executor.limit.cores=512m" \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/EtlApplication-with-dependencies.jar 1000
I use a Jenkins job to build my code and move the jar to the /opt/bitnami/spark/examples/jars folder in the container inside the cluster.
The job shows as running when I check with kubectl get pods, and it is visible at localhost:4040 after mapping the port to localhost using kubectl port-forward pod/myetl-df26f5843cb88da7-driver 4040:4040.
I tried the same spark-submit command with the Spark example jar that ships with the Spark installation in the container:
./spark-submit --class org.apache.spark.examples.SparkPi \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:5" \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=my-spark \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
This time the application is listed in the Spark Web UI. I tried several options, and when I remove the line --conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077", the SparkPi example application is also no longer displayed in the Spark Web UI.
Am I missing something? Do I need to change my Java code to accept spark.kubernetes.driverEnv.SPARK_MASTER_URL? I tried several options but nothing works.
Thanks in advance.
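As an aside, and not something stated in the original post: spark.kubernetes.driverEnv.SPARK_MASTER_URL only injects an environment variable named SPARK_MASTER_URL into the driver pod; whether anything reads it depends on the application code or the image entrypoint. A tiny Python rendering of how to inspect it from inside the driver (the original application is Java; this is purely illustrative):

import os

# Populated by --conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=..., if it was passed at submit time
print(os.environ.get("SPARK_MASTER_URL"))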

Spark-submit with a properties file for Python

I am trying to load Spark configuration through a properties file using the following command in Spark 2.4.0:
spark-submit --properties-file props.conf sample.py
It gives the following error:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
The props.conf file contains:
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048
Now, when I instead pass all of the settings as --conf arguments on the command line, like this:
spark2-submit \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.authenticate=true \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.memory=4g \
--conf spark.driver.memoryOverhead=2048 \
--conf spark.executor.memory=2g \
--conf spark.executor.memoryOverhead=2048 \
sample.py
This works as expected.
spark-submit does support --properties-file, but when it is supplied it is used instead of $SPARK_HOME/conf/spark-defaults.conf rather than merged with it, so cluster defaults such as spark.shuffle.service.enabled are lost. One workaround is to make the changes in $SPARK_HOME/conf/spark-defaults.conf, which Spark loads automatically.
See https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file
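Alternatively, if the goal is to keep a separate properties file, a sketch of props.conf that also carries the cluster default named in the error message (this assumes an external shuffle service is actually running on the cluster):

spark.master yarn
spark.submit.deployMode client
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
# ...remaining settings from the original props.conf unchanged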

Apache Spark: Completed application history deleted after restarting

When I restart the Spark cluster, all of the completed application history in the Web UI is deleted. How can I keep this history from being deleted when restarting?
Spark does not store event logs by default. If you want to keep them, you need to enable event logging via the spark.eventLog settings:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://10.129.6.11:7077 \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir="hdfs://your path" \
/home/spark/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar 8
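For completeness, the event logs written this way are normally browsed through the Spark History Server after a restart; a sketch, reusing the same log directory placeholder as above:

# e.g. in conf/spark-defaults.conf, point the history server at the event log directory
spark.history.fs.logDirectory hdfs://your path
# then start the history server and open it on port 18080
./sbin/start-history-server.sh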
Alternatively, don't restart the Spark master at all; keep it running and send it queries from a tool like Zeppelin.

How to choose the queue for Spark job using spark-submit?

Is there a way to provide parameters or settings to choose the queue in which I'd like my spark-submit job to run?
By using --queue.
So an example of a spark-submit job would be:
spark-submit --master yarn --conf spark.executor.memory=48G --conf spark.driver.memory=6G --packages [packages separated by ,] --queue [queue_name] --class [class_name] [jar_file] [arguments]
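If it helps, --queue on YARN is, as far as I understand, equivalent to setting the corresponding configuration property, so the same job could also be written as:
spark-submit --master yarn --conf spark.executor.memory=48G --conf spark.driver.memory=6G --conf spark.yarn.queue=[queue_name] --packages [packages separated by ,] --class [class_name] [jar_file] [arguments]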
