Hi all.
When I'm working in a notebook with PySpark as my kernel, a new Spark session gets created each time I run a different line of code, so eventually I run into an out-of-resources issue. Is it possible to reuse the same Spark session instead of creating multiple ones, or to use a single session for all the tasks? Thanks!
I tried all the different options; other than extending the Spark session timeout from 60 s to something longer, I can't seem to solve my problem. I'm expecting a simple answer on whether this will work, or whether a Spark session is simply not able to multi-task the way multiple threads can.
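For reference, a minimal sketch of sharing one session, assuming the notebook lets you build the session yourself (the app name below is just an illustration): getOrCreate returns the already-running session instead of starting a new one.

from pyspark.sql import SparkSession

# getOrCreate reuses the active session if one exists, otherwise it creates one
spark = SparkSession.builder.appName("shared-session").getOrCreate()

# any later cell that calls getOrCreate gets the same object back
same_spark = SparkSession.builder.getOrCreate()
assert spark is same_spark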
I have a practical use case: three PySpark notebooks that all share one common parameter.
I need to schedule all three notebooks in a sequence.
Is there any way to run them by setting the parameter value once, since it is the same in all of them?
Please suggest the best way to do it.
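One common pattern on Databricks is a small driver notebook that passes the same value to each child notebook in order; the notebook paths and parameter name below are hypothetical.

# driver notebook: run the three notebooks sequentially with one shared parameter
shared_param = {"run_date": "2024-01-01"}   # hypothetical parameter name and value

for path in ["/Repos/project/notebook_1",
             "/Repos/project/notebook_2",
             "/Repos/project/notebook_3"]:
    # dbutils.notebook.run waits for each child notebook to finish before moving on;
    # 0 means no timeout, and each child reads the value with dbutils.widgets.get("run_date")
    dbutils.notebook.run(path, 0, shared_param)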
I am using the %run feature in Azure databricks to execute many notebooks in sequence from a command notebook. One notebook has a single line of code which is a long computation on a dataset (~ 5 hrs) and I want to save the output of this. I tried including the save step at the end of the long-running notebook, but the save times out (see error below). I'm only seeing this error when the long-running notebook takes 2+ hrs to run. Is there any way I can automate this?
I'm able to pass the data I want back through the %run feature in the command notebook and save the data there, but I have to run the save manually after the long-running notebook, otherwise I get the same authentication timeout error. I'd like to be able to have one notebook where I only need to click "run all".
I find it is better to break up long notebooks into smaller ones and use the multi-task job scheduler to help run them in order.
If a single line of code cannot be broken up (per your comment), then I wonder whether that single line is being parallelized. If so, maybe you need more workers. If not, then a large single-node cluster would be my suggestion, since parallelizing is not an option.
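If everything has to stay behind a single "run all" button, a rough sketch of the same idea is to chain the pieces with dbutils.notebook.run instead of %run, so the save step starts on its own as soon as the long step returns; the paths are made up, and the handoff between the two notebooks (e.g. a staging table) is left out here.

# command notebook: each step runs as its own notebook job, in order
# the long notebook is assumed to persist its result somewhere the save step can read it
dbutils.notebook.run("/Repos/project/long_computation", 0)   # 0 = no timeout

# starts automatically as soon as the previous call returns, no manual step needed
dbutils.notebook.run("/Repos/project/save_output", 0)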
I am trying to run a Hive query with pyspark. I am using Hortonworks so I need to use the Hive WarehouseConnector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set up some memory configuration (set hive.tez.container.size=8192). For these statements to take effect, they need to run in the same session as the main query, and that's my issue.
I tried 2 ways:
The first one was to generate a new Hive session for each query, with a properly set up URL, e.g.:
url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so it does not work.
The second way is to set spark.sql.hive.thriftServer.singleSession=true globally on the Spark Thrift Server. This does seem to work, but I find it a shame to limit the global Thrift Server for the benefit of one application only.
Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?
This has been a big pet peeve of mine... still is, actually.
The solution that resolved this issue for me was putting all the queries in a query file, with each query separated by a semicolon. Then I run the file using beeline from within a Python script.
Unfortunately, it does not work with queries that return results; it is only suitable for set, overwrite, and insert kinds of queries.
In case you might have discovered a more efficient way to do this, please do share.
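For what it's worth, a minimal sketch of that beeline-from-Python approach could look like this; the JDBC URL and file name are placeholders, and the set statements plus the main query all run in the single session that beeline opens.

import subprocess

# queries.hql contains the set statements followed by the main query,
# separated by semicolons, so they share one Hive session
subprocess.run([
    "beeline",
    "-u", "jdbc:hive2://hiveserver:10000/default",
    "-f", "queries.hql",
], check=True)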
I am trying to limit my CPU usage, but I got this strange result: when I try to limit it to 3 CPUs, I still get a SparkContext with a [*] master:
Without more information, my guess would be that you are doing this from inside the Spark shell. That means the master has already been set and will be used. Note that the call is getOrCreate, which means it will only create a session if it cannot get one that already exists.
That's because you already have one SparkSession object.
If there is an active session in the thread context, then that session will be used. Your notebook has one attached session, and that's why getOrCreate is returning the existing SparkSession.
Check in your logs, probably you have:
Using an existing SparkSession; some configuration may not take effect.
Then you can clean active sessions:
SparkSession.clearActiveSession()
But in notebooks this is not recommended, as it can cause errors in other notebooks on your server.
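As a minimal sketch of that behaviour (the master values are only for illustration), the second builder below does not give you a new session, no matter what it asks for:

from pyspark.sql import SparkSession

# the first getOrCreate fixes the master for the lifetime of the session
spark = SparkSession.builder.master("local[3]").getOrCreate()

# a later builder with a different master returns the SAME session and only logs
# "Using an existing SparkSession; some configuration may not take effect."
spark2 = SparkSession.builder.master("local[1]").getOrCreate()

print(spark is spark2)             # True
print(spark.sparkContext.master)   # local[3]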
So I'm asking if anyone knows a way to change the Spark properties (e.g. spark.executor.memory, spark.shuffle.spill.compress, etc.) during runtime, so that a change may take effect between the tasks/stages of a job...
So I know that...
1) The documentation for Spark 2.0+ (and previous versions too) states that once the Spark Context has been created, it can't be changed at runtime.
2) SparkSession.conf.set may change a few things for SQL, but I was looking at more general, all-encompassing configurations.
3) I could start a new context in the program with new properties, but the case here is to actually tune the properties once a job is already executing.
Ideas...
1) Would killing an Executor force it to read a configuration file again, or does it just get what was already configured at the beginning of the job?
2) Is there any command to force a "refresh" of the properties in spark context?
So I'm hoping there might be a way, or other ideas out there (thanks in advance)...
After submitting a Spark application, we can change some parameter values at runtime and some we cannot.
By using the spark.conf.isModifiable() method, we can check whether a parameter value can be modified at runtime. If it returns true, we can modify the parameter value at runtime; otherwise, we can't.
Examples:
>>> spark.conf.isModifiable("spark.executor.memory")
False
>>> spark.conf.isModifiable("spark.sql.shuffle.partitions")
True
So based on the above testing, we can't modify the spark.executor.memory parameter value at runtime.
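As a follow-up sketch of the same idea: a modifiable setting can be changed directly, while trying to set a static one raises an error in recent Spark versions (the exact behaviour and message vary by version).

# runtime-modifiable: takes effect for subsequent SQL shuffles
spark.conf.set("spark.sql.shuffle.partitions", "64")

# not modifiable: recent Spark versions raise an AnalysisException along the lines of
# "Cannot modify the value of a Spark config: spark.executor.memory"
try:
    spark.conf.set("spark.executor.memory", "8g")
except Exception as e:
    print(e)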
No, it is not possible to change settings like spark.executor.memory at runtime.
In addition, there are probably not too many great tricks in the direction of "quickly switching to a new context", as the strength of Spark is that it can pick up data and keep going. What you are essentially asking for is a map-reduce framework. Of course you could rewrite your job into this structure and divide the work across multiple Spark jobs, but then you would lose some of the ease and performance that Spark brings (though possibly not all).
If you really think the request makes sense on a conceptual level, you could consider making a feature request. This can be through your Spark supplier, or directly by logging a Jira on the Apache Spark project.