Pyspark version 3.x, repartition not working as expected for large JSON data - apache-spark

We have a Hadoop cluster of two nodes with around 40 cores and 80 GB of RAM. We simply need to ingest a large multiline JSON file into an Elasticsearch (ES) cluster. The JSON is 120 GB uncompressed; after bz2 compression it shrinks to only 2 GB. We have set up the following code for indexing the data into ES:
....
def start_job():
    warehouse_location = abspath('spark-warehouse')

    # Create a Spark session
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()

    # Configurations
    spark.conf.set("spark.sql.caseSensitive", "true")

    df = spark.read.option("multiline", "true").json(data_path)
    df = df.repartition(20)

    # Transformations
    df = df.drop("_id")

    df.write.format(
        'org.elasticsearch.spark.sql'
    ).option(
        'es.nodes', ES_Nodes
    ).option(
        'es.port', ES_PORT
    ).option(
        'es.resource', ES_RESOURCE
    ).save()


if __name__ == '__main__':
    # ES settings
    ES_Nodes = "hadoop-master"
    ES_PORT = 9200
    ES_RESOURCE = "myIndex/type"

    # Absolute path to the data
    data_path = "/dss_data/mydata.bz2"

    start_job()
    print("Job has been finished")
The problem is that only one executor is running because there is only one task in total. I was expecting 20 tasks, since I repartitioned the data into 20 partitions. The Spark UI image is given below. Where is the problem? I am running the following command to run the job on the cluster:
spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 4g --num-executors 20 --executor-cores 2 myscript.py
We are using Hadoop and Spark version 3.x.
Further, we are also getting the following trace in the Hadoop logs:
df.write.format(
File "/usr/local/leads/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1107, in save
File "/usr/local/leads/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/leads/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
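The AnalysisException above typically means Spark could not parse the multiline JSON, so the inferred schema contains only the internal _corrupt_record column and the write has nothing else to reference. A quick way to confirm this, and to check how many partitions the read actually produced, is to cache the parsed DataFrame first, as the error message itself suggests. This is only a diagnostic sketch, assuming the same spark session and data_path as above, not the original job:

# Hypothetical diagnostic run, assuming the same `spark` session and `data_path` as above
df = spark.read.option("multiline", "true").json(data_path)
df.cache()  # per the error message, caching the parsed result lifts the corrupt-record-only restriction

df.printSchema()                  # if only _corrupt_record appears, the JSON was not parsed
print(df.rdd.getNumPartitions())  # with multiline=true, each input file is parsed whole, so one file -> one partition

df = df.repartition(20)
print(df.rdd.getNumPartitions())  # should report 20 once the read itself succeeds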

Related

spark dataframe not successfully written in elasticsearch

I am writing data from my Spark DataFrame into ES. I printed the schema and the total count of records, and everything looks fine until the dump starts. The job runs successfully and no issue/error is raised in the Spark job, but the index doesn't end up with the amount of data it should have.
I have 1,800k records to dump, and sometimes it dumps only 500k, sometimes 800k, etc.
Here is the main section of the code:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config('spark.yarn.executor.memoryOverhead', '4096') \
    .enableHiveSupport() \
    .getOrCreate()

final_df = spark.read.load("/trans/MergedFinal_stage_p1", multiline="false", format="json")
print(final_df.count())  # It is perfectly ok
final_df.printSchema()   # Schema is also ok

## Issue when data gets written to the DB ##
final_df.write.mode("ignore").format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE
).save()
My resources are also fine. Here is the command to run the Spark job:
time spark-submit --class org.apache.spark.examples.SparkPi --jars elasticsearch-spark-30_2.12-7.14.1.jar --master yarn --deploy-mode cluster --driver-memory 6g --executor-memory 3g --num-executors 16 --executor-cores 2 main_es.py
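When the final count comes up short, one thing worth checking is whether Elasticsearch is rejecting bulk requests under load and exhausting the connector's retries. The sketch below only illustrates standard elasticsearch-hadoop settings; the values are illustrative assumptions, record_id is a hypothetical unique column in the DataFrame, and mode("append") is used because SaveMode "ignore" can skip the write entirely when the target already exists:

final_df.write.mode("append").format(
    'org.elasticsearch.spark.sql'
).option(
    'es.nodes', ES_Nodes
).option(
    'es.port', ES_PORT
).option(
    'es.resource', ES_RESOURCE
).option(
    'es.mapping.id', 'record_id'        # hypothetical unique column; makes re-runs idempotent
).option(
    'es.batch.size.entries', '1000'     # smaller bulk requests are less likely to be rejected
).option(
    'es.batch.write.retry.count', '10'  # retry rejected bulk requests more times before failing
).save()

Comparing final_df.count() with the index document count after the job (for example via the _count API) also narrows down whether records are lost during the write or earlier.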

Spark job stuck on one task only

I have set up Spark 3.x with Hadoop 3.x and YARN. I simply have to index some data using a distributed data pipeline, i.e., via Spark. The following is the code snippet that I have used for the Spark app (PySpark):
def index_module(row):
    pass


def start_job(DATABASE_PATH):
    global SOLR_URI
    warehouse_location = abspath('spark-warehouse')
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()
    solr_client = pysolr.Solr(SOLR_URI)
    df = spark.read.format("csv").option("quote", "\"").option("escape", "\\").option("header", "true").option(
        "inferSchema", "true").load(DATABASE_PATH)
    df.createOrReplaceTempView("abc")
    df2 = spark.sql("select * from abc")
    df2.toJSON().map(index_module).collect()
    solr_client.commit()


if __name__ == '__main__':
    try:
        DATABASE_PATH = sys.argv[1].strip()
    except:
        print("Input file missing !!!", file=sys.stderr)
        sys.exit()
    start_job(DATABASE_PATH)
There are about 120 CSV files and 200 million records. Ideally, each of them should be indexed. To run the job on YARN, I have run the following command (according to my Hadoop resources):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --num-executors 5 --executor-cores 1 /PATH/myscript.py
Now, about 3 days have passed and my job is still running. The following is the status of the executors as shown in the YARN dashboard.
As shown in the figures, for each executor all tasks are completed except one. Why is that? It should also be completed. What is the problem with all of the above, and what would be a possible way to fix the issue?
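For reference, a common pattern for this kind of workload is to index from the executors with foreachPartition, creating one Solr client per partition instead of collecting everything back to the driver. This is only a sketch under assumptions (pysolr installed on the worker nodes, SOLR_URI reachable from them, and each row convertible to a flat document), not the asker's code:

import json
import pysolr

def index_partition(rows):
    # Hypothetical per-partition indexer: each executor opens its own connection.
    solr = pysolr.Solr(SOLR_URI)
    docs = [json.loads(r) for r in rows]
    if docs:
        solr.add(docs)   # bulk-add this partition's documents
        solr.commit()

# Runs on the executors; nothing is collected back to the driver.
df2.toJSON().foreachPartition(index_partition)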

SparkSession Application Source Code Config Properties not Overriding JupyterHub & Zeppelin on AWS EMR defaults

I have the Spark driver set up to use Zeppelin and/or JupyterHub as a client for interactive Spark programming on AWS EMR. However, when I create the SparkSession with custom config properties (application name, number of cores, executor RAM, number of executors, serializer, etc.), it does not override the default values for those configs (confirmed under the Environment tab in the Spark UI and via spark.conf.get(...)).
Like any Spark app, these clients on EMR should be using my custom config properties, because SparkSession code takes the highest precedence, ahead of spark-submit flags, the Spark config file, and then spark-defaults. JupyterHub also immediately launches a Spark application without coding one, or when just running an empty cell.
Is there a setting specific to Zeppelin or JupyterHub, or a separate XML conf, that needs to be adjusted to get custom configs recognized and working? Any help is much appreciated.
Here is an example of creating a basic application where these cluster resource configs should be applied instead of the standard default configs, which is what is happening with Zeppelin/JupyterHub on EMR.
# via zep or jup [configs NOT being recognized]
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("app_name")\
    .master("yarn")\
    .config("spark.submit.deployMode","client")\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .config("spark.executor.instances", 11)\
    .config("spark.executor.cores", 5)\
    .config("spark.executor.memory", "19g")\
    .getOrCreate()
# via ssh terminal [configs ARE recognized at run-time]
pyspark \
--name "app_name" \
--master yarn \
--deploy-mode client \
--num-executors 11 \
--executor-cores 5 \
--executor-memory 19g \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Found a solution. The config.json file under /etc/jupyter/conf had some default Spark config values, so I removed them to leave an empty JSON key/value, i.e. "_configs": {}. Creating a custom SparkSession via JupyterHub now picks up the specified cluster configs.
The %%configure magic commands also always work:
https://github.com/jupyter-incubator/sparkmagic
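For completeness, with the sparkmagic %%configure magic linked above, the session resources can be set in the first notebook cell before the Spark session starts; the values below simply mirror the cluster configs from the example above and are not tuned recommendations:

%%configure -f
{
  "executorMemory": "19g",
  "executorCores": 5,
  "numExecutors": 11,
  "conf": { "spark.serializer": "org.apache.spark.serializer.KryoSerializer" }
}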

How to submit a spark job in a 4 node CDH cluster

I have a cluster with following configurations.
Distribution: CDH5
Number of nodes: 4
RAM: 126 GB
Number of cores: 24 per node
Hard disk: 5 TB
My input file size is 10 GB. It takes a lot of time (around 20 minutes) when I submit with the following command:
spark-submit --jars xxxx --files xxx,yyy --master yarn /home/me/python/ParseMain.py
In my Python code I am setting the following:
sparkConf = SparkConf().setAppName("myapp")
sc = SparkContext(conf = sparkConf)
hContext = HiveContext(sc)
How can I change the spark-submit arguments so that I can achieve better performance?
Some spark-submit options that you could try:
--driver-cores 4
--num-executors 4
--executor-cores 20
--executor-memory 5G
CDH has to be configured to have enough vCores and vMemory. Otherwise the submitted job will remain in the ACCEPTED state and never RUN.
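Put together, a submission using those options might look like the following; the jar/file placeholders are carried over from the question, and the flag values are just the suggestions above, not tuned numbers:

spark-submit \
  --master yarn \
  --jars xxxx \
  --files xxx,yyy \
  --driver-cores 4 \
  --num-executors 4 \
  --executor-cores 20 \
  --executor-memory 5G \
  /home/me/python/ParseMain.py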

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run successfully, however they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simple Python script processing data from HDFS like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe @MrChristine is correct -- the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default it will run on a single core and use two executors.
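As a usage example, a submission that makes the parallelism explicit might look like this; the executor counts are illustrative, not derived from the asker's cluster:

spark-submit \
  --deploy-mode client \
  --master yarn-client \
  --num-executors 8 \
  --executor-cores 4 \
  atest.py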
It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this runs on Spark 2.2 on a Hortonworks cluster.
Command : spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python Code :
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output: It's big, so I am not pasting it here, but it runs perfectly.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.
