No FileSystem for scheme "gs" in apache airflow - apache-spark

I am trying to submit my Scala jar using SparkSubmitOperator in Airflow. I have set configurations for gs, but I am not sure whether they are in the right place. Below is what my DAG looks like:
SparkSubmitOperator(task_id='spark_task',
application = 'gs://xxx/xxx/xxx.jar',
conf = {"spark.driver.allowMultipleContexts":True, "spark.blacklist.enabled":False, "fs.gs.impl":True, "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem":False "fs.AbstractFileSystem.gs.impl":False, "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS":True},
conn_id='spark_local',
java_class = 'com.xxx.xxx.xxx.xxx',
jars="gs://xxx/xxx/*",
application_args=["xxx", "xxx", "xxx", "xxx"])
I am getting this exception: No FileSystem for scheme "gs".
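For what it's worth, a minimal sketch of the direction a fix usually takes, not a confirmed answer: Hadoop FileSystem keys supplied through spark-submit conf generally need the spark.hadoop. prefix, and the GCS connector jar has to be on the driver/executor classpath. The connector jar path below is a placeholder, not something from the original post.
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id='spark_task',
    application='gs://xxx/xxx/xxx.jar',
    conf={
        # Hadoop FileSystem keys passed via spark-submit need the spark.hadoop. prefix
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    },
    # The GCS connector jar itself must be on the classpath (placeholder path)
    jars="/path/to/gcs-connector-hadoop2-latest.jar",
    conn_id='spark_local',
    java_class='com.xxx.xxx.xxx.xxx',
    application_args=["xxx", "xxx", "xxx", "xxx"],
)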

Related

How to set the YARN queue when submitting a Spark application from the Airflow SparkSubmitOperator

I am new to Airflow and the SparkSubmitOperator. I can see that Spark applications are submitted to the 'root.default' queue out of the box when targeting YARN.
Simple question: how does one set a custom queue name?
wordcount = SparkSubmitOperator(
application='/path/to/wordcount.py',
task_id="wordcount",
conn_id="spark_default",
dag=dag
)
p.s. I have read the docs:
https://airflow.apache.org/docs/stable/_modules/airflow/contrib/operators/spark_submit_operator.html
Thanks
I can see now that --queue value is coming from the Airflow spark-default connection:
Conn Id = spark_default
Host = yarn
Extra = {"queue": "root.default"}
Go to the Admin menu > Connections, select spark_default and edit it:
Change Extra {"queue": "root.default"} to {"queue": "default"} in the Airflow webserver UI.
This of course means an Airflow connection is required for each queue.
To be clear, there are at least two ways to do this:
Via the Spark connection, as Phillip answered.
Via a --conf parameter, which Dustan mentions in a comment.
From my testing, if there's a queue set in the Connection's Extra field, that is used regardless of what you pass into the SparkSubmit conf.
However, if you remove queue from Extra in the Connection, and send it in the SparkSubmitOperator conf arg like below, YARN will show it properly.
conf={
"spark.yarn.queue": "team_the_best_queue",
"spark.submit.deployMode": "cluster",
"spark.whatever.configs.you.have" = "more_config",
}
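For context, a rough sketch of how that conf might be attached to the operator, reusing the wordcount example from the question (the queue name is just the example value above):
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

wordcount = SparkSubmitOperator(
    application='/path/to/wordcount.py',
    task_id="wordcount",
    conn_id="spark_default",  # this connection's Extra no longer contains a "queue" entry
    conf={
        # YARN picks this up because the Connection no longer forces a queue
        "spark.yarn.queue": "team_the_best_queue",
        "spark.submit.deployMode": "cluster",
    },
    dag=dag,
)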

Running Custom Java Class in PySpark on EMR

I am attempting to utilize the Cerner Bunsen package for FHIR processing in PySpark on AWS EMR, specifically the Bundles class and its methods. I am creating the Spark session using the Apache Livy API:
import json
import logging

import requests

def create_spark_session(master_dns, kind, jars):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind, 'jars': jars}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    logging.info(response.json())
    return response.headers
Where kind = pyspark3 and jars is an S3 location that houses the jar (bunsen-shaded-1.4.7.jar)
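As a side note, Livy's POST /sessions body expects jars to be a list of paths rather than a single string; a hypothetical call (the master DNS and bucket below are placeholders) might look like:
session_headers = create_spark_session(
    master_dns="ip-xx-xx-xx-xx.ec2.internal",  # placeholder EMR master DNS
    kind="pyspark3",
    jars=["s3://<bucket>/jars/bunsen-shaded-1.4.7.jar"],  # a list, not a bare string
)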
The data transformation is attempting to import the jar and call the methods via:
from pyspark import SparkContext

# Setting the Spark Session and Pulling the Existing SparkContext
sc = SparkContext.getOrCreate()
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm,"com.cerner.bunsen.Bundles")
func = sc._gateway.jvm.Bundles()
The error I am receiving is
"py4j.protocol.Py4JError: An error occurred while calling
None.com.cerner.bunsen.Bundles. Trace:\npy4j.Py4JException:
Constructor com.cerner.bunsen.Bundles([]) does not exist"
This is the first time I have attempted to use java_import so any help would be appreciated.
EDIT: I changed the transformation script slightly and am now seeing a different error. I can see the jar being added in the logs, so I am certain it is there and that the jars: jars functionality is working as intended. The new transformation is:
from pyspark import SparkContext

# Setting the Spark Session and Pulling the Existing SparkContext
sc = SparkContext.getOrCreate()
# Manage logging
#sc.setLogLevel("INFO")
# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm,"com.cerner.bunsen")
func_main = sc._gateway.jvm.Bundles
func_deep = sc._gateway.jvm.Bundles.BundleContainer
fhir_data_frame = func_deep.loadFromDirectory(spark,"s3://<bucket>/source_database/Patient",1)
fhir_data_frame_fromJson = func_deep.fromJson(fhir_data_frame)
fhir_data_frame_clean = func_main.extract_entry(spark,fhir_data_frame_fromJson,'patient')
fhir_data_frame_clean.show(20, False)
and the new error is:
'JavaPackage' object is not callable
Searching for this error has been a bit futile, but again, if anyone has ideas I will gladly take them.
If you want to use a Scala/Java function in PySpark, you also have to add the jar package to the classpath. You can do it in two different ways:
Option 1:
In spark-submit with the --jars flag (the option has to come before the application file, otherwise it is treated as an application argument):
spark-submit --jars /path/to/bunsen-shaded-1.4.7.jar example.py
Option 2: Add it to the spark.jars property in the spark-defaults.conf file.
Add the following in path/to/spark/conf/spark-defaults.conf:
# Comma-separated list of jars to include on the driver and executor classpaths.
spark.jars /path/to/bunsen-shaded-1.4.7.jar
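Once the jar really is on the driver classpath, the class resolves to a JavaClass rather than a JavaPackage; a quick sanity-check sketch (assuming the shaded jar above has been added in one of the two ways):
from pyspark import SparkContext
from py4j.java_gateway import java_import

sc = SparkContext.getOrCreate()
# Import the fully qualified class into the gateway's JVM view
java_import(sc._gateway.jvm, "com.cerner.bunsen.Bundles")
# Prints a JavaClass when the jar is on the classpath; a JavaPackage here
# means the jar was not picked up and calls will fail as in the question
print(sc._gateway.jvm.com.cerner.bunsen.Bundles)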

Dataproc - setting driverLogLevels results in log4j error

I'm attempting to set driver log levels when launching jobs in Dataproc (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#LoggingConfig). Launching is done via a Java program using the dataproc SDK.
LoggingConfig loggingConfig = new LoggingConfig();
loggingConfig.put("driverLogLevels", Collections.singletonMap("root", "ERROR"));
com.google.api.services.dataproc.model.SparkJob sparkJob = new com.google.api.services.dataproc.model.SparkJob().setMainClass(mainClass).setJarFileUris(jarFileUris).setArgs(args).setProperties(properties).setLoggingConfig(loggingConfig);
Job job = new Job().setPlacement(new JobPlacement().setClusterName(clusterName)).setSparkJob(sparkJob);
// omitted irrelevant code
Dataproc dp = new Dataproc.Builder(httpTransport, jsonFactory, credential).setApplicationName(jobName).build();
SubmitJobRequest request = new SubmitJobRequest().setJob(job);
return dp.projects().regions().jobs().submit(googleProject, "global", request).execute();
This launches successfully, but does not successfully set log4j configuration:
log4j:ERROR Could not read configuration file from URL [file:/tmp/[guid]/driver_log4j.properties].
java.io.FileNotFoundException: /tmp/[guid]/driver_log4j.properties (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:117)
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:102)
at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogIfNecessary(ApplicationMaster.scala:736)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:736)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:751)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
log4j:ERROR Ignoring configuration file [file:/tmp/[guid]/driver_log4j.properties].
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
where [guid] is a GUID that differs for every job. Logging falls back to the (verbose) default config.
How can I successfully set the config? What is the most elegant and robust way on Dataproc to adjust log levels for Spark? A manual workaround would be a fallback, but I'd rather use a method that's not liable to change out from under me.
The official way to set the log level is the method described in your link; see the Dataproc docs.
I believe that to invoke this from the Java SDK you pass it within the setArgs(...) term of your builder. So in your case you would want to add:
args.add("--driver-log-levels");
args.add("root=ERROR");
like so:
args.add("--driver-log-levels");
args.add("root=ERROR");
com.google.api.services.dataproc.model.SparkJob sparkJob = new com.google.api.services.dataproc.model.SparkJob().setMainClass(mainClass).setJarFileUris(jarFileUris).setArgs(args).setProperties(properties).setLoggingConfig(loggingConfig);
Job job = new Job().setPlacement(new JobPlacement().setClusterName(clusterName)).setSparkJob(sparkJob);
// omitted irrelevant code
Dataproc dp = new Dataproc.Builder(httpTransport, jsonFactory, credential).setApplicationName(jobName).build();
SubmitJobRequest request = new SubmitJobRequest().setJob(job);
return dp.projects().regions().jobs().submit(googleProject, "global", request).execute();
I'm not sure what you mean when you call this a feature that's liable to change out from under you. This should be a stable feature.

How do I get independent service Zeppelin to see Hive?

I am using HDP-2.6.0.3 but I need Zeppelin 0.8, so I have installed it as an independent service. When I run:
%sql
show tables
I get nothing back and I get 'table not found' when I run Spark2 SQL commands. Tables can be seen in the 0.7 Zeppelin that is part of HDP.
Can anyone tell me what I am missing, for Zeppelin/Spark to see Hive?
The steps I performed to create the Zeppelin 0.8 build are as follows:
mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Copied zeppelin-site.xml and shiro.ini from /usr/hdp/2.6.0.3-8/zeppelin/conf to /home/ed/zeppelin/conf.
created /home/ed/zeppelin/conf/zeppelin-env.sh, in which I put the following:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0.3-8"
Copied /etc/hive/conf/hive-site.xml to /home/ed/zeppelin/conf
EDIT:
I have also tried:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("interfacing spark sql to hive metastore without configuration file")
.config("hive.metastore.uris", "thrift://s2.royble.co.uk:9083") // replace with your hivemetastore service's thrift url
.config("url", "jdbc:hive2://s2.royble.co.uk:10000/default")
.config("UID", "admin")
.config("PWD", "admin")
.config("driver", "org.apache.hive.jdbc.HiveDriver")
.enableHiveSupport() // don't forget to enable hive support
.getOrCreate()
same result, and:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
which gives:
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
ERROR XSDB6: Another instance of Derby may have already booted the database /home/ed/metastore_db
Fixed error with:
val url = "jdbc:hive2://s2.royble.co.uk:10000"
but still no tables :(
This works:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://s2.royble.co.uk:10000"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
val r: ResultSet = conn.createStatement.executeQuery("SELECT * FROM tweetsorc0")
but then I have the pain of converting the ResultSet to a DataFrame. I'd rather SparkSession worked and gave me a DataFrame, so I will add a bounty later today.
I had a similar problem in Cloudera Hadoop. In my case the problem was that Spark SQL did not see my Hive metastore, so when I used my SparkSession object for Spark SQL I could not see my previously created tables. I managed to solve it by adding the following to zeppelin-env.sh:
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export HADOOP_HOME=/opt/cloudera/parcels/CDH
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
(I assume these paths are something else on Hortonworks.) I also changed spark.master from local[*] to yarn-client in the Interpreter UI. Most importantly, I manually copied hive-site.xml into /etc/spark/conf/ because I thought it was strange that it was not in that directory, and that solved my problem.
So my advice is to check whether hive-site.xml exists in your SPARK_CONF_DIR and, if not, add it manually. I also found a guide for Hortonworks and Zeppelin in case this does not work.

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll() where spark is your SparkSession (this gives you a list of all configured settings as key/value pairs)
Yes: sc.getConf().getAll()
Which uses the method:
SparkConf.getAll()
as accessed by
SparkContext.sc.getConf()
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.app.name', u'PySparkShell')]
update configuration in Spark 2.3.1
To change the default spark configurations you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
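Putting those steps together, a minimal end-to-end sketch (the memory and core values are just the example numbers from above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the current configuration
print(spark.sparkContext._conf.getAll())

# Build an updated configuration based on the current one
conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
])

# Stop the current context and start a new session with the updated conf
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()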
Spark 1.6+
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
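For the two programmatic options in that list, a quick sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The hidden SparkConf reference on the SparkContext
for key, value in spark.sparkContext._conf.getAll():
    print(key, value)

# The Spark SQL SET command; the answer above uses .toPandas(), which also
# works if pandas is installed, and "SET -v" adds a description column
spark.sql("SET").show(truncate=False)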
For Spark 2+ you can also use the following when using Scala:
spark.conf.getAll // spark is the SparkSession
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String> sc[] = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
System.out.println(sc[i]);
}
Suppose I want to increase the driver memory at runtime using SparkSession:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current spark context settings.
SparkConf.getAll()
as accessed by
SparkContext.sc._conf
Get the default configurations specifically for Spark 2.1+
spark.sparkContext.getConf().getAll()
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
