Unable to infer schema for Parquet only on Scheduled Job Run - Databricks

I am running a notebook that executes other notebooks through the dbutils.notebooks.run() command (roughly the pattern sketched below). Whenever I run this job manually, it executes without any issues. But whenever the job runs nightly, the ephemeral notebook runs return the error
org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.;
I was able to resolve the same error in some other notebooks by increasing the number of workers on the cluster. I've tried doing that on this workflow as well without any luck, and I can't find any documentation indicating that should be necessary anyway.
Any insight would be helpful.
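For context, the parent notebook does roughly the following; the paths and timeout here are only illustrative, and the Databricks Utilities method itself is dbutils.notebook.run:

# Parent notebook: each child runs as an ephemeral notebook job
child_notebooks = ["/Shared/etl/ingest", "/Shared/etl/transform"]  # hypothetical paths

for path in child_notebooks:
    # 3600-second timeout; returns whatever the child passes to dbutils.notebook.exit()
    result = dbutils.notebook.run(path, 3600)
    print(f"{path} returned: {result}")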

Increasing the number of workers on the cluster pool fixed the problem. I'm not certain of the correct number of workers needed per ephemeral run; it would seem a minimum of 2 per run is needed, and they are not necessarily returned to the pool immediately when the run is completed.
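Another workaround worth trying, since the error says the schema must be specified manually, is to pass an explicit schema when reading the Parquet data so the job no longer depends on schema inference (which fails when no Parquet footer can be read, for example against an empty or not-yet-written path). A rough sketch with a hypothetical path and columns:

from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

# Hypothetical schema matching the expected Parquet files
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("event_time", TimestampType(), True),
])

# With an explicit schema, Spark skips the inference step that raises
# "Unable to infer schema for Parquet. It must be specified manually."
df = spark.read.schema(schema).parquet("/mnt/raw/events/")  # hypothetical mount path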

Related

Results in Databricks on AWS are not displayed when run as a job

Instead of the expected output from a display(my_dataframe), I get "Failed to fetch the result. Retry" when looking at the completed run (which is also marked as success).
The notebook runs fine, including the expected outputs, when run as an on-demand notebook (same cluster config etc.). It seems to be a UI issue? I honestly don't even know where to look for possible causes.
I had the same problem while running a job on Azure Databricks and restarting my computer (or maybe the explorer..) helped.

Using Databricks for Twitter sentiment analysis - issue running the official tutorial

I am starting to use Databricks and tried to implement one of the official tutorials (https://learn.microsoft.com/en-gb/azure/azure-databricks/databricks-sentiment-analysis-cognitive-services) from the website. However, I ran into an issue - not even sure if I can call it an issue - when I run the second notebook (analysetweetsfromeventhub), all subsequent commands (2nd, 3rd, 4th ...) are stuck waiting to run, but never actually run. See the picture. Any idea what the cause might be? Thanks.
After you cancel a running streaming cell in a notebook attached to a Databricks Runtime cluster, you cannot run any subsequent commands in the notebook. The commands are left in the “waiting to run” state, and you must clear the notebook’s state or detach and reattach the cluster before you can successfully run commands on the notebook.
Note that this issue occurs only when you cancel a single cell; it does not apply when you run all and cancel all cells.
In the meantime, you can do either of the following:
To remediate an affected notebook without restarting the cluster, go to the notebook's Clear menu and select Clear State.
If restarting the cluster is acceptable, you can solve the issue by turning off idle context tracking. Set the following Spark configuration value on the cluster:
spark.databricks.chauffeur.enableIdleContextTracking false
Then restart the cluster.
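As a side note, if the notebook uses Structured Streaming, a gentler alternative to cancelling the streaming cell is to stop the query from code, so the cell finishes on its own rather than being cancelled. A small sketch using the standard StreamingQueryManager API (assuming this fits the tutorial's setup):

# Stop every active streaming query on this SparkSession instead of cancelling the cell
for query in spark.streams.active:
    print(f"Stopping streaming query: {query.name}")
    query.stop()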

Why am I getting out of memory errors only after several runs of my Spark Application?

I ran my Spark application successfully twice after spinning up a fresh EMR cluster. After running a different Spark application (one that DOES have out of memory issues) several times, I ran the first application again and got out of memory errors.
I repeated this sequence of events three times and it happens every time. What could be happening? Shouldn't Spark free all memory between runs?
After a Spark program completes, it leaves temporary directories behind, and after several Spark applications have run, these leftovers can accumulate until you get out of memory errors. There are some cleanup options that can solve this issue:
spark.worker.cleanup.enabled (default value is false),
spark.worker.cleanup.interval, and spark.worker.cleanup.appDataTtl. For more details about these, see the standalone mode documentation:
http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
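Per that documentation, these are worker-side settings in standalone mode, supplied through SPARK_WORKER_OPTS in conf/spark-env.sh on each worker. As an illustration (the interval and TTL shown are simply the documented defaults, in seconds):
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 1800
spark.worker.cleanup.appDataTtl 604800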

Sudden surge in number of YARN apps on HDInsight cluster

For some reason the cluster sometimes seems to misbehave, and I suddenly see a surge in the number of YARN jobs. We are using a Linux-based HDInsight Hadoop cluster. We run Azure Data Factory jobs that basically execute Hive scripts pointing to this cluster. Generally the average number of YARN apps at any given time is around 50 running and 40-50 pending. No one uses this cluster for ad-hoc query execution.

But once in a few days we notice something weird. Suddenly the number of YARN apps starts increasing, both running as well as pending, but especially the pending ones: the running count goes above 100, and the pending count exceeds 400 or sometimes even 500+. We have a script that kills all YARN apps one by one, but it takes a long time, and that is not really a solution either. From our experience the only fix, when it happens, is to delete and recreate the cluster.

It may be that the cluster's response time is delayed for a while (the Hive component especially), but in that case, if ADF keeps retrying a failing slice several times, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run them when it can? That's the only explanation I can think of for why this could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job that is the actual Hive script you are trying to run. The YARN queue can get deadlocked if there are so many parent jobs at one time, filling up the cluster capacity, that no child job (the actual work) is able to spin up an Application Master, so no work actually gets done. Note that if you kill the Templeton job, Data Factory will mark the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try raising the Maximum AM Resource setting from the default 33% to something higher and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run and slowly drain the queue.
As a correct long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
To create the queue, you can use Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to the Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
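For illustration, if the new queue were named joblauncher (a hypothetical name), the WebHCat-site entry would read:
templeton.hadoop.queue.name joblauncher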
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure

spark-ec2 sporadic errors - Initial job has not accepted any resources

Sporadically, I get the following error when running Apache Spark on EC2:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
It happens randomly, and the only way to fix it is to shut down the cluster and restart everything. Clearly, this is inefficient. Why does this error occur at random? I can see in the console that all my instances are perfectly fine. I am trying to work with a 6MB file, so it can't possibly be out of memory. Every time I attempt to do something on Apache Spark on EC2 there seems to be some random error popping up. This latest error occurred while running a program I have run 5,000+ times before, on the same cluster type. Why are there so many sporadic errors? And what does this one even mean, considering my instances and master are working perfectly?
