Spark DataFrame not fully written to Elasticsearch - apache-spark

I am writing data from my Spark DataFrame into Elasticsearch (ES). I printed the schema and the total record count, and everything looks fine until the write starts. The job finishes successfully and raises no errors, but the index ends up with fewer documents than it should have.
I have 1800k records to write, and sometimes only 500k get indexed, sometimes 800k, and so on.
Here is the main section of the code.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config('spark.yarn.executor.memoryOverhead', '4096') \
    .enableHiveSupport() \
    .getOrCreate()
final_df = spark.read.load("/trans/MergedFinal_stage_p1", multiline="false", format="json")
print(final_df.count())  # The count is correct
final_df.printSchema()  # The schema is also correct
## The issue appears when the data is written to ES ##
final_df.write.mode("ignore").format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE,
).save()
My resources are also fine.
Command used to run the Spark job:
time spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 6g --executor-memory 3g --num-executors 16 --executor-cores 2 main_es.py
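Since the count is only checked before the write, one way to narrow the problem down is to read the index back through the same connector after the job finishes and compare the two numbers. The following is a minimal sketch, not a diagnosis: `indexed_count` and `write_report` are hypothetical helper names, and it assumes the same `ES_Nodes`, `ES_PORT` and `ES_RESOURCE` values used for the write.

```python
def indexed_count(spark, nodes, port, resource):
    """Read the index back through the ES connector and count its documents.

    `spark` is an existing SparkSession; the connector jar must be on the classpath.
    """
    return (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", nodes)
        .option("es.port", port)
        .load(resource)
        .count()
    )


def write_report(expected, indexed):
    """Summarise how many records made it into the index (hypothetical helper)."""
    pct = 100.0 * indexed / expected if expected else 0.0
    return {
        "expected": expected,
        "indexed": indexed,
        "missing": expected - indexed,
        "indexed_pct": round(pct, 1),
    }
```

With the variables from the question, this could be called as `write_report(final_df.count(), indexed_count(spark, ES_Nodes, ES_PORT, ES_RESOURCE))` right after `save()` returns, so the shortfall is logged by the job itself.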

Related

Multiple spark session in one job submit on Kubernetes

Can we start and stop multiple Spark sessions within a single job submitted on Kubernetes?
For example, if I submit one job like this:
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
In my Python code, can I start and stop Spark sessions?
Example:
# start spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()
## some python code.
# start spark session
session = SparkSession \
    .builder \
    .appName(appname) \
    .getOrCreate()
## Doing some operations using spark
session.stop()
Is this possible or not?

Pyspark version 3.x, repartition not working as expected for large JSON data

We have a Hadoop cluster of two nodes with around 40 cores and 80 GB RAM. We simply need to ingest a large multiline JSON file into an Elasticsearch (ES) cluster. The JSON is 120 GB in size, and after bz2 compression it is reduced to only 2 GB. We have set up the following code to index the data in ES:
....
def start_job():
    warehouse_location = abspath('spark-warehouse')
    # Create a spark session
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()
    # Configurations
    spark.conf.set("spark.sql.caseSensitive", "true")
    df = spark.read.option("multiline", "true").json(data_path)
    df = df.repartition(20)
    # Transformations
    df = df.drop("_id")
    df.write.format(
        'org.elasticsearch.spark.sql'
    ).option(
        'es.nodes', ES_Nodes
    ).option(
        'es.port', ES_PORT
    ).option(
        'es.resource', ES_RESOURCE,
    ).save()
if __name__ == '__main__':
    # ES Setting
    ES_Nodes = "hadoop-master"
    ES_PORT = 9200
    ES_RESOURCE = "myIndex/type"
    # Data absolute path
    data_path = "/dss_data/mydata.bz2"
    start_job()
    print("Job has been finished")
The problem is that only one executor is running, because there is only one task in total. I expected 20 tasks, since I repartitioned the data into 20 partitions. The Spark UI image is given below. Where is the problem? I am using the following command to run the job on the cluster:
spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 4g --num-executors 20 --executor-cores 2 myscript.py
We are using Hadoop and Spark version 3.x.
In addition, we are getting the following trace in the Hadoop logs:
df.write.format(
File "/usr/local/leads/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1107, in save
File "/usr/local/leads/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/leads/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

Spark job stuck on one task only

I have set up Spark 3.x with Hadoop 3.x on YARN. I simply need to index some data using a distributed data pipeline, i.e., via Spark. The following is the code snippet I used for the Spark app (PySpark):
import sys
from os.path import abspath
import pysolr
from pyspark.sql import SparkSession
def index_module(row):
    pass
def start_job(DATABASE_PATH):
    global SOLR_URI
    warehouse_location = abspath('spark-warehouse')
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()
    solr_client = pysolr.Solr(SOLR_URI)
    df = spark.read.format("csv").option("quote", "\"").option("escape", "\\").option("header", "true").option(
        "inferSchema", "true").load(DATABASE_PATH)
    df.createOrReplaceTempView("abc")
    df2 = spark.sql("select * from abc")
    # Every row is converted to JSON, mapped, and collected back to the driver
    df2.toJSON().map(index_module).collect()
    solr_client.commit()
if __name__ == '__main__':
    try:
        DATABASE_PATH = sys.argv[1].strip()
    except IndexError:
        print("Input file missing !!!", file=sys.stderr)
        sys.exit()
    start_job(DATABASE_PATH)
There are about 120 CSV files and 200 million records. Ideally, every record should be indexed. To run the job on YARN, I ran the following command (according to my Hadoop resources):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --num-executors 5 --executor-cores 1 /PATH/myscript.py
About 3 days have now passed and my job is still running. The status of the executors, as shown in the YARN dashboard, is as follows:
As shown in the figures, all tasks on each executor have completed except one. Why is that? The last one should have completed as well. What is the problem here, and how can it be fixed?

Spark Structured Stream Executors weird behavior

I am using Spark Structured Streaming on a Cloudera distribution.
I have configured 3 executors, but when I launch the application only one executor is used.
How can I use multiple executors?
Here is some more information.
These are my parameters:
Launch command:
spark2-submit --master yarn \
--deploy-mode cluster \
--conf spark.ui.port=4042 \
--conf spark.eventLog.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.kafka.consumer.poll.ms=512 \
--num-executors 3 \
--executor-cores 3 \
--executor-memory 2g \
--jars /data/test/spark-avro_2.11-3.2.0.jar,/data/test/spark-streaming-kafka-0-10_2.11-2.1.0.cloudera1.jar,/data/test/spark-sql-kafka-0-10_2.11-2.1.0.cloudera1.jar \
--class com.test.Hello /data/test/Hello.jar
The Code:
val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", <topic_list:9092>)
  .option("subscribe", <topic_name>)
  .option("group.id", <consumer_group_id>)
  .load()
  .select($"value".as[Array[Byte]], $"timestamp")
  .map((c) => { .... })
val query = lines
  .writeStream
  .format("csv")
  .option("path", <outputPath>)
  .option("checkpointLocation", <checkpointLocationPath>)
  .start()
query.awaitTermination()
Result in the Spark UI:
[Spark UI image]
I expected all executors to be working.
Any suggestions?
Thank you
Paolo
It looks like there is nothing wrong with your configuration; the Kafka topic you are reading from probably has just one partition. You need to increase the number of partitions of the topic your producer writes to. Usually, the number of partitions is around 3-4 times the number of executors.
If you don't want to touch the producer side, you can work around this by calling repartition(3) before you apply the map method, so every executor can work on its own logical partition.
If you still want to explicitly control the work done on each partition, you can use the mapPartitions method.
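The reasoning above can be made concrete with a simplified model: a stage runs at most one task per partition, and each task occupies one executor core. The sketch below is only an illustration under those assumptions (it assumes the default of one core per task and ignores scheduling details); `max_parallel_tasks` is a hypothetical helper, not a Spark API.

```python
def max_parallel_tasks(num_partitions, num_executors, cores_per_executor):
    """Upper bound on tasks that can run at once in one stage.

    Simplified model: one task per partition, one core per task.
    """
    total_slots = num_executors * cores_per_executor
    return min(num_partitions, total_slots)


# Setup from the question: --num-executors 3, --executor-cores 3
single_partition_topic = max_parallel_tasks(1, 3, 3)   # only one task can run
after_repartition_3 = max_parallel_tasks(3, 3, 3)      # three tasks can run
many_partitions = max_parallel_tasks(12, 3, 3)         # capped by the 9 cores
```

In this model the partition count, not the executor count, is the limit when the topic has a single partition. Where the scheduler actually places the tasks is up to Spark, so this gives an upper bound on parallelism rather than a guarantee of how work is spread.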

How to execute Spark programs with Dynamic Resource Allocation?

I am using the spark-submit command to execute Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now I want to execute the same program using Spark's dynamic resource allocation. Could you please explain how to use dynamic resource allocation when executing Spark programs?
To use dynamic allocation, spark.dynamicAllocation.enabled needs to be set to true, because it is false by default.
This also requires spark.shuffle.service.enabled to be set to true, as the Spark application is running on YARN. Check this link to start the shuffle service on each NodeManager in YARN.
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
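As a rough sanity check of the settings listed above, the following sketch collects them in a plain dictionary and reports obvious inconsistencies before the job is submitted. It is an illustration only: `check_dynamic_allocation` is a hypothetical helper, the shuffle-service requirement is taken from the description above for YARN, and the fallback values (0 for minExecutors, no upper limit for maxExecutors) reflect Spark's documented defaults.

```python
def check_dynamic_allocation(conf):
    """Return a list of human-readable problems found in the given settings."""
    problems = []

    def is_true(key):
        return str(conf.get(key, "false")).lower() == "true"

    if not is_true("spark.dynamicAllocation.enabled"):
        problems.append("spark.dynamicAllocation.enabled is not true")
    if not is_true("spark.shuffle.service.enabled"):
        problems.append("spark.shuffle.service.enabled is not true")

    min_exec = int(conf.get("spark.dynamicAllocation.minExecutors", 0))
    max_exec = conf.get("spark.dynamicAllocation.maxExecutors")
    # maxExecutors is unbounded when it is not set, so only compare if present
    if max_exec is not None and min_exec > int(max_exec):
        problems.append("minExecutors is greater than maxExecutors")
    return problems


settings = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "30",
    "spark.dynamicAllocation.initialExecutors": "10",
}
problems = check_dynamic_allocation(settings)  # empty list for these values
```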
These options can be passed to a Spark application in 3 ways:
1. From spark-submit with --conf <prop_name>=<prop_value> (initialExecutors=10 below is the same as --num-executors 10)
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2. Inside Spark program with SparkConf
Set the properties in SparkConf then create SparkSession or SparkContext with it
val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5");
conf.set("spark.dynamicAllocation.maxExecutors", "30");
conf.set("spark.dynamicAllocation.initialExecutors", "10");
.....
3. spark-defaults.conf, usually located in $SPARK_HOME/conf/
Place the same configurations in spark-defaults.conf to apply them to all Spark applications whenever no configuration is passed from the command line or from code.
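To show that the first and third options carry the same information, here is a small sketch that renders one dictionary of settings in both forms: as `--conf key=value` arguments for spark-submit and as `key value` lines for spark-defaults.conf. The helper names `as_submit_flags` and `as_defaults_file` are made up for illustration, and the values are the ones used in the example above.

```python
def as_submit_flags(conf):
    """Render settings as a flat list of spark-submit arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


def as_defaults_file(conf):
    """Render settings as spark-defaults.conf text (one 'key value' per line)."""
    return "".join(f"{key} {value}\n" for key, value in conf.items())


dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "30",
    "spark.dynamicAllocation.initialExecutors": "10",
}

submit_flags = as_submit_flags(dynamic_allocation)
defaults_text = as_defaults_file(dynamic_allocation)
```

Keeping the settings in one place like this makes it easier to see which values a job gets, since values passed with --conf override those from spark-defaults.conf.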
Spark - Dynamic Allocation Confs
I just did a small demo with Spark's dynamic resource allocation. The code is on my Github. Specifically, the demo is in this release.
