I am working in the Azure Databricks GUI. I'm trying to upload a 41 MB JAR file to DBFS FileStore/jar, but midway through the upload I get the error Downstream duration timeout.
I'm not using any notebook, so there is no code; all the work is being done in the user interface. Is there any solution to this?
You can use the cluster UI for this instead, but you need CAN MANAGE access on your cluster.
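If the web upload keeps timing out, another way to get a large JAR into DBFS is the Databricks CLI; a minimal sketch, assuming the CLI is installed and configured with a personal access token, and using my-library.jar as a placeholder file name:

# One-time setup: install the CLI and point it at your workspace
pip install databricks-cli
databricks configure --token

# Copy the local JAR into the DBFS FileStore/jar folder; --overwrite replaces any partial upload
databricks fs cp --overwrite ./my-library.jar dbfs:/FileStore/jar/my-library.jar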
I have to move around 80TB data from Hadoop to Azure. I got 10 data disks from Azure which I can connect to the gateway node one at a time.
After connecting the Azure disk to the gateway node machine, I run the following command:
hdfs dfs -get /path/to/hdfs/dir /azuredisk1/
This is giving me a download speed of ~70 MB/s.
Is there a better way to do this?
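One commonly used alternative to copying disk by disk is to push the data over the network with DistCp directly into Azure Blob Storage; a rough sketch, where the storage account, container, access key and mapper count are all placeholders, and which assumes the hadoop-azure (WASB) connector is available on the cluster:

# Placeholder account/container/key; tune -m (number of mappers) to what the cluster and network can sustain
hadoop distcp \
  -D fs.azure.account.key.mystorageacct.blob.core.windows.net=<ACCESS_KEY> \
  -m 50 \
  hdfs:///path/to/hdfs/dir \
  wasbs://mycontainer@mystorageacct.blob.core.windows.net/dir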
I have a High Concurrency cluster with Active Directory integration turned on. Runtime: Latest stable (Scala 2.11), Python: 3.
I've mounted Azure Data Lake, and when I try to read the data, the first time after cluster start I always get:
com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token
When I rerun it, it works fine. I read the data in the following way:
df = spark.read.option("inferSchema","true").option("header","true").json(path)
Any idea what is wrong?
Thanks!
Tomek
I believe you can only run the command using a high concurrency cluster. If you've attached your notebook to a standard cluster, the command won't work.
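For reference, on the cluster side this usually corresponds to Spark config entries like the ones below; the exact keys are an assumption about how Azure AD credential passthrough is normally enabled, not something confirmed in this thread:

# Assumed cluster Spark config for a High Concurrency cluster with ADLS credential passthrough
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true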
I'm trying to change the Spark staging directory to prevent the loss of data on worker decommissioning (on Google Dataproc with Spark 2.4).
I want to switch from HDFS staging to Google Cloud Storage staging.
When I run this command:
spark-submit --conf "spark.yarn.stagingDir=gs://my-bucket/my-staging/" gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
I get this error:
org.apache.spark.SparkException: Application application_1560413919313_0056 failed 2 times due to AM Container for appattempt_1560413919313_0056_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2019-06-20 07:58:04.462]File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip
java.io.FileNotFoundException: File not found : gs:/my-staging/.sparkStaging/application_1560413919313_0056/pyspark.zip
The Spark job fails but the .sparkStaging/ directory is created on GCS.
Any idea on this issue?
Thanks.
First, it's important to realize that the staging directory is primarily used for staging artifacts for executors (primarily jars and other archives) rather than for storing intermediate data as a job executes. If you want to preserve intermediate job data (primarily shuffle data) following worker decommissioning (e.g., after machine preemption or scale down), then Dataproc Enhanced Flexibility Mode (currently in alpha) may help you.
Your command works for me on both Dataproc image versions 1.3 and 1.4. Make sure that your target staging bucket exists and that the Dataproc cluster (i.e., the service account that the cluster runs as) has read and write access to the bucket. Note that the GCS connector will not create buckets for you.
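A quick way to check both of those conditions; a sketch where the region and the service account e-mail are placeholders:

# Create the staging bucket if it does not exist yet (the GCS connector will not create it for you)
gsutil mb -l us-central1 gs://my-bucket

# Grant the cluster's service account read/write access to the bucket (placeholder service account)
gsutil iam ch serviceAccount:my-cluster-sa@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://my-bucket

# Resubmit with the same staging setting
spark-submit --conf "spark.yarn.stagingDir=gs://my-bucket/my-staging/" gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py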
I'm trying to fetch some data from Cloudera's QuickStart Hadoop distribution (a Linux VM in our case) into our SAP HANA database using SAP Spark Controller. Every time I trigger the job in HANA, it gets stuck, and I see the following warning logged continuously every 10-15 seconds in the Spark Controller's log file unless I kill the job.
WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Although it's logged as a warning, it looks like a problem that prevents the job from executing on Cloudera. From what I've read, it's either an issue with resource management on Cloudera or an issue with blocked ports. In our case we don't have any blocked ports, so it must be the former.
Our Cloudera instance runs on a single node with 16 GB of RAM and 4 CPU cores.
Looking at the overall configuration, I have a bunch of warnings, but I can't determine whether they are relevant to the issue.
Here's also how the RAM is distributed on Cloudera
It would be great if you can help me pinpoint the cause for this issue because I've been trying various combinations of things over the past few days without any success.
Thanks,
Dimitar
You're trying to use the Cloudera QuickStart VM for a purpose beyond its capacity. It's really meant for someone to play around with Hadoop and CDH, and it should not be used for any production-level work.
Your NodeManager has only 5 GB of memory to use for compute resources. To do any work, you need to create an Application Master (AM) and a Spark executor, and then still have memory in reserve for your executors, which you won't have on a QuickStart VM.
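To make that concrete: the AM plus every executor (including their memory overheads) has to fit inside that 5 GB. A sketch of a deliberately small resource request, with illustrative numbers only; with SAP Spark Controller these properties would go into its Spark configuration rather than a spark-submit command, and your-job.py is a placeholder:

# Illustrative sizes only: keep AM + executor memory (plus overhead) well under the 5 GB NodeManager limit
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.am.memory=512m \
  --num-executors 1 \
  --executor-memory 1g \
  --executor-cores 1 \
  your-job.py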
How do I create a batch job for the Spark service on Bluemix?
I need to process a billion records in a text file; each record needs to be processed through a Java API. I would like to do this by distributing the work over the cluster.
Stay tuned; you should be able to run batch jobs on the Bluemix Apache Spark service during the beta. It's not there yet.