Does Databricks support submitting a SparkSQL job similar to Google Cloud Dataproc?
The Databricks Jobs API doesn't seem to have an option for submitting a Spark SQL job.
Reference:
https://docs.databricks.com/dev-tools/api/latest/jobs.html
https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs
No, you submit a notebook.
That notebook can contain many things: Python, a Spark script, or Spark SQL via %sql cells.
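For illustration, a cell in a Databricks Python notebook could run the SQL through spark.sql, which is equivalent to a %sql cell; the table name below is just a placeholder, and spark / display() are supplied by the Databricks notebook environment:

# Sketch of a notebook cell running Spark SQL from Python;
# `spark` and `display()` are provided by the Databricks runtime,
# and `my_table` is a placeholder table name.
result_df = spark.sql("SELECT count(*) AS n FROM my_table")
display(result_df)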
You can submit a Spark job to a Databricks cluster just like on Dataproc.
Run your Spark job in a Scala context and build a jar for it. Submitting Spark SQL directly is not supported.
To create a job, follow the official guide: https://docs.databricks.com/jobs.html
Also, you can trigger the job through the REST API using the run-now / runs-submit endpoints described at https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit
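As a rough sketch, triggering an existing job over REST from Python could look like the following; the workspace URL, token, and job_id are placeholders, and the payload follows the Jobs API 2.0 run-now shape:

# Hypothetical example: trigger a previously created Databricks job by id.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123},  # id of the job created via the UI or the Jobs API
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the triggered run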
Related
I created a spark-submit job in Databricks to run a .py script. I created a Spark object in my Python script and tried to access existing Hive tables, but my script fails with a "Table or view not found" error. Should I add some configuration settings to my spark-submit job to connect to the existing Hive metastore?
Please try enabling Hive support when creating the SparkSession in Spark 2.0+:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
This usually resolves this kind of error.
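For reference, a minimal sketch of what the whole submitted script could look like; the database and table names are placeholders:

# Minimal sketch of a .py script for a spark-submit job that reads an
# existing Hive table (placeholder name my_db.my_table).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-hive-table")
         .enableHiveSupport()   # needed to see the existing Hive metastore
         .getOrCreate())

df = spark.sql("SELECT * FROM my_db.my_table LIMIT 10")
df.show()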
I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata.
I create Hive external tables, and they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select * from hive_table ...")
Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs do not use the Glue catalog for Spark SQL the same way Spark SQL does when running on EMR.
I can work around this by using Glue APIs and registering dataframes as temp views:
create_dynamic_frame_from_catalog(...).toDF().createOrReplaceTempView(...)
but is there a way to do this automatically?
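For reference, a fleshed-out version of that workaround looks roughly like this; the database and table names are placeholders:

# Hypothetical expansion of the temp-view workaround in a Glue job.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

(glue_context
 .create_dynamic_frame_from_catalog(database="my_db", table_name="hive_table")
 .toDF()
 .createOrReplaceTempView("hive_table"))

df = glue_context.spark_session.sql("select * from hive_table")
df.show()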
This was a much-awaited feature request (using the Glue Data Catalog with Glue ETL jobs), and it has been released recently.
When you create a new job, you'll find the following option
Use Glue data catalog as the Hive metastore
You may also enable it for an existing job by editing the job and adding --enable-glue-datacatalog to the job parameters, with no value.
Instead of using SparkContext.getOrCreate(), you should use SparkSession.builder().enableHiveSupport().getOrCreate(), with enableHiveSupport() being the important part that's missing. I think what's probably happening is that your Spark job is not actually creating your tables in Glue but rather is creating them in Spark's embedded Hive metastore, since you have not enabled Hive support.
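As a sketch (in PySpark) of that change, assuming the job also has the Glue Data Catalog option enabled; the table name is a placeholder:

# Minimal PySpark sketch: enable Hive support so spark.sql sees the catalog.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()   # the part that was missing
         .getOrCreate())

spark.sql("select * from my_db.hive_table").show()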
Had the same problem. It was working on my dev endpoint but not in the actual ETL job. It was fixed by changing the job from Spark 2.2 to Spark 2.4.
In my case, the following builder configuration (pointing the Hive metastore client at the Glue Data Catalog) worked:
SparkSession.builder
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate()
I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions at https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting "Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable" errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try to bring your own Spark in an initialization action compiled as described in the wiki.
If you just want to run Hive on something other than MapReduce on Dataproc, running Tez via this initialization action would probably be easier.
I am relatively new to Spark and DSE, and I am trying to submit a Spark job to the DSE Spark cluster programmatically.
I am using the org.apache.spark.launcher.SparkLauncher API and tried following its documentation.
Process launcher = new SparkLauncher()
        .setAppName("appName")
        .setAppResource("spark-job.jar")
        .setSparkHome("spark-home")
        .setMainClass("main-class")
        .setVerbose(true)
        .launch();
launcher.waitFor();
But it doesn't seem to launch the job on the DSE cluster. I can trigger the job manually using the dse spark-submit command.
I will appreciate any help here. Thanks!
I believe this has something to do with not setting your Spark home. Identify your Spark home in DSE and then add
.setSparkHome("sparkHomeDir")
Also, you should use a SparkAppHandle rather than a blocking wait:
SparkAppHandle handle = launcher.startApplication();
I am able to submit a Spark job on a Linux server using the console. But is there any API or framework that can be used to submit a Spark job on a Linux server?
You can use port 7077 to submit Spark jobs to your Spark cluster instead of using spark-submit.
val spark = SparkSession
  .builder()
  .master("spark://master-machine:7077")
  .getOrCreate()
You can look into the Livy server. It is in GA mode in the Hortonworks and Cloudera distros of Apache Hadoop, and we have had good success with it; its documentation is good enough to get started with. Spark jobs start instantaneously when submitted via Livy, since it has multiple SparkContexts running inside it.
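For illustration, a rough sketch of submitting a batch job through Livy's REST API; the host, jar path, and class name are placeholders, and 8998 is Livy's default port:

# Hypothetical example: POST a batch job to Livy's /batches endpoint.
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder host, default Livy port

resp = requests.post(
    f"{LIVY_URL}/batches",
    headers={"Content-Type": "application/json"},
    json={
        "file": "hdfs:///jobs/spark-job.jar",   # placeholder application jar
        "className": "com.example.MainClass",   # placeholder main class
        "args": ["arg1", "arg2"],
    },
)
resp.raise_for_status()
print(resp.json())  # returns the batch id and its state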