I am trying to load 60 GB of table data into a Spark (Python) DataFrame and then write it into a Hive table.
I have set the driver memory, executor memory, and max result size high enough to handle the data, but I get an error when I run the job through spark-submit with all of those configs specified on the command line.
Note: Through the Spark Python shell (specifying driver and executor memory while launching the shell), I am able to populate the target Hive table.
Any thoughts??
Try using the syntax:
./spark-submit --conf ...
for the memory-related configuration. What I suspect you're doing is setting these values while initializing the SparkSession, which has no effect because the JVM has already started by then. The same parameters you pass when launching the shell will work here.
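For example, a minimal submit command might look like this (the memory values and the script name are only placeholders to adapt to your cluster):
./spark-submit \
  --driver-memory 20g \
  --executor-memory 20g \
  --conf spark.driver.maxResultSize=8g \
  your_job.py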
https://spark.apache.org/docs/latest/submitting-applications.html
I am using Spark 3.1.2 with Hadoop 3.2.0 to run a Spark Structured Streaming (SSS) aggregation job on Spark on Kubernetes.
These jobs read files from S3 using the SSS File Source input and also use S3 for checkpointing (with the directory output committer).
What I noticed is that, after a few days of running, the driver runs into memory issues and crashes.
As the driver is not doing much (just calling Spark SQL functions and writing the output to S3), I am wondering how to detect the source of these memory issues (a memory leak in the Hadoop/S3A library?) and how I can fix them.
As shown in the screenshot, the driver takes some time before using all of its memory, and once it reaches that point, it seems to be able to run GC often enough. But after about a week of running, it crashes, as if the GC no longer runs often enough or no longer finds anything to clean.
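One generic way to narrow this down (a sketch, not verified on this exact setup; the dump path is a placeholder) is to make the driver JVM write a heap dump when it runs out of memory and log GC activity, then open the dump in a heap analyzer to see which objects accumulate:
--conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver.hprof"
On Kubernetes the dump is written inside the driver pod, so /tmp needs to be a mounted volume (or the file copied out) before the pod is torn down.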
I am using Spark in an integration test suite. It has to run locally and read/write files on the local file system. I also want to read/write this data as tables.
In the first step of the suite I write some Hive tables in the database feature_store, specifying
spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse. The step completes correctly and I see the files in the folder I expect.
Afterwards I run a spark-submit step with (among others) these confs:
--conf spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse --conf spark.sql.catalogImplementation=hive
and when trying to read a table previously written I get
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'feature_store' not found
However, if I try to do exactly the same thing with exactly the same configs in a spark-shell, I am able to read the data.
In the spark-submit job I use the following code to get the SparkSession:
SparkSession spark = SparkSession.active();
I have also tried to use instead
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
but I keep getting the same problem as above.
I have understood that the problem is related to spark-submit not picking up Hive as the
catalog implementation. In fact, I see that spark.catalog is not an instance of HiveCatalogImpl during spark-submit (while it is when using spark-shell).
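For reference, a quick way to confirm which catalog implementation and warehouse directory a session actually picked up is to read them back at runtime. A minimal PySpark sketch (the equivalent calls exist on the Java/Scala SparkSession):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/opt/spark/work-dir/warehouse")
         .enableHiveSupport()
         .getOrCreate())

# Static confs can be read back even though they cannot be changed after start-up
print(spark.conf.get("spark.sql.catalogImplementation"))   # expect 'hive'
print(spark.conf.get("spark.sql.warehouse.dir"))

# List the databases the catalog can actually see
print([db.name for db in spark.catalog.listDatabases()])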
I'm using EMR Notebooks with PySpark and Livy.
I'm reading data from S3, which is in Parquet format with string columns, into a PySpark DataFrame. There are approximately 2 million rows. When I do a join operation, I get a 400 "Session isn't active" error, even though I have already set the Livy timeout to 5 hours.
An error was encountered:
Invalid status code '400' from
https://172.31.12.103:18888/sessions/5/statements/20 with error
payload:
"requirement failed: Session isn't active."
I had the same issue, and the reason for the timeout was the driver running out of memory. By default the driver memory is 1000M when creating a Spark application through EMR Notebooks, even if you set a higher value through config.json. You can check this by executing the following code from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory, just run:
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
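If driver memory alone is not enough, %%configure accepts the other Livy session fields as well; for example (the values here are placeholders to adjust for your data):
%%configure -f
{"driverMemory": "6000M", "executorMemory": "4G", "executorCores": 2,
 "conf": {"spark.driver.maxResultSize": "4G"}}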
You can also try running your operation on a small amount of data first. Once it works end to end as expected, you can move to the full data set.
I'm very new to PySpark.
I am running a script (mainly creating a TF-IDF and predicting 9 categorical columns with it) in a Jupyter Notebook. It takes about 5 minutes when manually executing all cells. When running the same script from spark-submit, it takes about 45 minutes. What is happening?
The same thing happens (the excess time) if I run the code with python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, and you have mentioned a few: a notebook, the PySpark shell, and spark-submit.
Regarding the Jupyter Notebook or PySpark shell:
While you are running your code in a Jupyter notebook or the PySpark shell, the environment may already have set default values for executor memory, driver memory, executor cores, etc., as you can verify with the check below.
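A quick PySpark sketch for that check (assuming the usual spark session variable provided by the notebook or shell):
# Print the driver/executor properties the active SparkContext was started with
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.driver") or key.startswith("spark.executor"):
        print(key, "=", value)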
Regarding spark-submit:
However, when you use spark-submit, the defaults can be different. So the best approach is to pass these values as flags while submitting the PySpark application with the spark-submit utility, for example:
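Mirroring the values from the question (the script name is a placeholder):
spark-submit \
  --driver-memory 80G \
  --executor-memory 45G \
  --conf spark.driver.maxResultSize=20G \
  your_script.py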
The configuration object you have created can also be passed while creating the SparkContext (sc):
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
I had the same problem; to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]", is equivalent to say that spark will do the operations on the local machine, on X cores. So you have to optimize X with the number of cores available on your machine.
To use it with a yarn cluster, you have to put "yarn".
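A minimal sketch (the app name is just a placeholder):
from pyspark.sql import SparkSession

# Local run: use all available cores on this machine
spark = SparkSession.builder.master("local[*]").appName("Test").getOrCreate()

# On a YARN cluster you would build it with master("yarn") instead:
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()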
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
I am using Spark SQL (1.5.1) to run a JOIN query in the Spark shell. The data contains an extremely large number of rows, and the JOIN query never succeeds. However, if I process the same data set with Hive SQL on Hive, everything works fine, so there is probably something wrong with my configuration.
From the console output, I found:
"[Stage 2:=========================> (92 + 54) / 200]15/10/29 14:26:23 ERROR YarnScheduler: Lost executor 1 on cn233.local: remote Rpc client disassociated"
Based on this, Spark used 200 shuffle partitions by default (the spark.sql.shuffle.partitions setting), and this definitely consumed all the memory, as I have a small cluster.
So how can I solve this problem?
The "remote Rpc client disassociated" error mostly occurs when a Spark executor runs out of memory. You can try the following options.
Increase the executor memory:
--executor-memory 20g
You may also try to tune the memory overhead, if your application uses a lot of memory outside the JVM heap:
--conf spark.yarn.executor.memoryOverhead=5000
Try adjusting the Akka frame size (default 100 MB):
--conf spark.akka.frameSize=1000
You may also want to try a smaller block size for the input data. This increases the number of tasks, and each task has less data to work with, which may prevent the executors from running into OutOfMemory errors. Repartitioning the inputs into more, smaller partitions before the join has a similar effect, as sketched below.
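A rough PySpark sketch of that idea (table and column names are placeholders; sqlContext is the context provided by the 1.5 shell, and the analogous calls exist in the Scala shell):
# Spread the shuffle over more, smaller partitions so each task holds less data
sqlContext.setConf("spark.sql.shuffle.partitions", "800")

left = sqlContext.table("left_table").repartition(800)
right = sqlContext.table("right_table").repartition(800)

joined = left.join(right, left["key"] == right["key"])
joined.write.saveAsTable("joined_result")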