"No FileSystem for scheme: gs" when running a Spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).
When running the job locally on my Mac machine, I am getting the following error:
5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs
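I know that two things need to be done for gs paths to be supported. One is installing the GCS connector, and the other is having the following setup in core-site.xml of the Hadoop installation:
<property><name>fs.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The FileSystem for gs: (GCS) uris.</description></property>
<property><name>fs.AbstractFileSystem.gs.impl</name><value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.</description></property>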
I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project I am using Maven, so I imported the Spark library as follows:
<dependency> <!-- Spark dependency -->
<exclusion> <!-- declare the exclusion here -->
, and Hadoop 1.2.1 as follows:
The thing is, I am not sure where the Hadoop location is configured for Spark, or where the Hadoop conf is configured. Therefore, I may be adding the settings to the wrong Hadoop installation. In addition, does anything need to be restarted after modifying the files? As far as I can see, there is no Hadoop service running on my machine.

In Scala, add the following config when setting your hadoopConfiguration:
val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

There are a couple of ways to help Spark pick up the relevant Hadoop configuration, both of which involve modifying ${SPARK_INSTALL_DIR}/conf:
1) Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:
ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
Older Spark docs explain that this automatically puts the XML files on Spark's classpath: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html
2) Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:
export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
Newer Spark docs seem to indicate this as the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html

I can't say what's wrong, but here's what I would try.
Try setting fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
Print sc.hadoopConfiguration.get("fs.gs.impl") to make sure your core-site.xml is getting loaded. Print it in the driver and also in the executors: val x = sc.hadoopConfiguration.get("fs.gs.impl"); println(x); rdd.foreachPartition { _ => println(x) }
Make sure the GCS jar is sent to the executors (sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?) but you never know.
Nothing but your program needs to be restarted. There is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO I think.
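A quick way to rule out a missing connector jar in local mode is to ask the JVM directly whether the class named in fs.gs.impl can be loaded. The following is a minimal, JDK-only sketch, not part of Spark or the GCS connector; ClasspathCheck, isOnClasspath and locationOf are hypothetical names chosen for illustration.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports whether a class is visible to the current JVM. */
final class ClasspathCheck {

    /** Returns true if the named class can be found (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the jar or directory a class was loaded from, or "unknown". */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "unknown" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        // Class name taken from the fs.gs.impl setting discussed in this thread.
        String gcs = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem";
        System.out.println(gcs + " on classpath: " + isOnClasspath(gcs));
        System.out.println("This check was loaded from: " + locationOf(ClasspathCheck.class));
    }
}
```

If this prints false when run with the same classpath as the job, the configuration keys alone are unlikely to help until the connector jar is added as a dependency.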

You can apply these settings directly on the spark reader/writer as follows:
.option("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.option("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
.option("google.cloud.auth.service.account.enable", "true")
.option("google.cloud.auth.service.account.json.keyfile", "<path-to-json-keyfile.json>")
.option("header", true)
.show(10, false)
And add the relevant jar dependency to your build.sbt (or whichever build tool you use) and check https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector for latest:
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.6" classifier "shaded"
See GCS Connector and Google Cloud Storage connector for non-dataproc clusters


How to rebuild Apache Livy with Scala 2.12

I'm using Spark 3.1.1, which uses Scala 2.12, while the pre-built Livy downloaded from here uses Scala 2.11 (there is a folder named repl_2.11-jars/ after unzipping).
According to the comment made by Aliaksandr Sasnouskikh, Livy needs to be rebuilt, or it will throw the error {'msg': 'requirement failed: Cannot find Livy REPL jars.'} even on POST Session.
The README.md mentions:
By default Livy is built against Apache Spark 2.4.5
If I'd like to rebuild Livy, how can I change the Spark version it is built with?
Thanks in advance.
You can rebuild Livy by passing the spark-3.0 profile to Maven to create a custom build for Spark 3, for example:
git clone https://github.com/apache/incubator-livy.git && \
cd incubator-livy && \
mvn clean package -B -V -e \
-Pspark-3.0 \
-Pthriftserver \
-DskipTests \
-DskipITs
This profile is defined in pom.xml; by default it installs Spark 3.0.0. You can change it to use a different Spark version.
As far as I know, Livy supports Spark 3.0.x, but it is worth testing with 3.1.1. Let us know how it goes :)
I tried to build Livy for Spark 3.1.1 based on rmakoto's answer and it worked! I tinkered a lot and can't remember exactly what I edited in the pom.xml, so I am just going to attach my gist link here.
I also had to edit the python-api/pom.xml file to build with Python 3, since there were some syntax errors when building with the default pom.xml file. Here's the pom.xml gist for python-api.
After that, just build with:
mvn clean package -B -V -e \
-Pspark-3.0 \
-Pthriftserver \
-DskipTests \
-DskipITs
Based on @gamberooni's changes (but using 3.1.2 instead of 3.1.1 for the Spark version and Hadoop 3.2.0 instead of 3.2.1), this is the diff:
diff --git a/pom.xml b/pom.xml
index d2e535a..5c28ee6 100644
--- a/pom.xml
+++ b/pom.xml
@@ -79,12 +79,12 @@
- <hadoop.version>2.7.3</hadoop.version>
+ <hadoop.version>3.2.0</hadoop.version>
- <spark.scala-2.12.version>2.4.5</spark.scala-2.12.version>
- <spark.version>${spark.scala-2.11.version}</spark.version>
- <hive.version>3.0.0</hive.version>
+ <spark.scala-2.12.version>3.1.2</spark.scala-2.12.version>
+ <spark.version>${spark.scala-2.12.version}</spark.version>
+ <hive.version>3.1.2</hive.version>
@@ -1060,7 +1060,7 @@
- <spark.scala-2.12.version>3.0.0</spark.scala-2.12.version>
+ <spark.scala-2.12.version>3.1.2</spark.scala-2.12.version>
@@ -1072,9 +1072,9 @@
- https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
+ https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
- <spark.bin.name>spark-3.0.0-bin-hadoop2.7</spark.bin.name>
+ <spark.bin.name>spark-3.1.2-bin-hadoop3.2</spark.bin.name>
diff --git a/python-api/pom.xml b/python-api/pom.xml
index 8e5cdab..a8fb042 100644
--- a/python-api/pom.xml
+++ b/python-api/pom.xml
@@ -46,7 +46,7 @@
- <executable>python</executable>
+ <executable>python3</executable>
@@ -60,7 +60,7 @@
- <executable>python</executable>
+ <executable>python3</executable>
My Spark version is 3.2.1 and my Scala version is 2.12.15. I have built Livy successfully and put it into use. Here is my build process: pull the Livy master branch and modify the pom file as follows:
Finally, execute the package command:
mvn clean package -B -V -e -Pspark-3.0 -Pthriftserver -DskipTests -DskipITs -Dmaven.javadoc.skip=true
After Livy's deployment:

PIT-Cucumber plugin not finding scenarios in feature files

I am trying to institute PIT mutation testing in an enterprise project. I got it to run the existing JUnit tests, but we also have a lot of Cucumber tests that need to be part of the metric. I added the pit-cucumber plugin to the Maven project, but the output says no scenarios were found. I am not sure whether there is some secret in the plugin config that I can't see.
I get this output:
INFO : Sending 0 test classes to minion
Make sure you're using Cucumber version 4.20 jars with pitest-cucumber-plugin 0.8
Everything else looks good. You may not need to specify targetClasses and targetTests.

Using log4j2 in Spark java application

I'm trying to use the log4j2 logger in my Spark job. Essential requirement: the log4j2 config is located outside the classpath, so I need to specify its location explicitly. When I run my code directly within the IDE without using spark-submit, log4j2 works well. However, when I submit the same code to a Spark cluster using spark-submit, it fails to find the log4j2 configuration and falls back to the default old log4j.
Launcher command
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
Log4j2 dependencies in maven
. . .
<!-- Bridge log4j to log4j2 -->
<!-- Bridge slf4j to log4j2 -->
Any ideas what I could be missing?
Apparently there is no official support for log4j2 in Spark at the moment. Here is a detailed discussion on the subject: https://issues.apache.org/jira/browse/SPARK-6305
On the practical side, that means:
If you have access to the Spark configs and jars and can modify them, you can still use log4j2 after manually adding the log4j2 jars to SPARK_CLASSPATH and providing a log4j2 configuration file to Spark.
If you run on a managed Spark cluster and have no access to the Spark jars/configs, you can still use log4j2, but its use will be limited to code executed on the driver side. Any code run by executors will use the Spark executors' logger (which is the old log4j).
Spark probably falls back to log4j because it cannot initialize the logging system during startup (your application code is not added to the classpath).
If you are permitted to place new files on your cluster nodes, create a directory on all of them (for example /opt/spark_extras), place all the log4j2 jars there, and add two configuration options to spark-submit:
--conf spark.executor.extraClassPath=/opt/spark_extras/*
--conf spark.driver.extraClassPath=/opt/spark_extras/*
Then the libraries will be added to the classpath.
If you cannot modify files on the cluster, you can try another approach: add all the log4j2 jars to the spark-submit parameters using --jars. According to the documentation, all these libraries will be added to the driver's and executors' classpaths, so it should work in the same way.
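Before changing the classpath, it may help to confirm what each JVM actually sees for the config location. The sketch below only uses the JDK and assumes that the value of log4j.configurationFile is a single plain file path (log4j2 also accepts other forms, which this does not handle); Log4j2ConfigCheck and describe are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical diagnostic: does the log4j2 config path resolve in this JVM? */
final class Log4j2ConfigCheck {

    /** System property that log4j2 reads for an explicit config location. */
    static final String PROPERTY = "log4j.configurationFile";

    /**
     * Describes how a plain file path resolves from the current working directory.
     * Assumption: the value is a single file path, not a URI or a comma-separated list.
     */
    static String describe(String configured) {
        if (configured == null || configured.isEmpty()) {
            return PROPERTY + " is not set";
        }
        Path resolved = Paths.get(configured).toAbsolutePath().normalize();
        String state = Files.isRegularFile(resolved) ? "exists" : "is MISSING";
        return PROPERTY + "=" + configured + " -> " + resolved + " " + state;
    }

    public static void main(String[] args) {
        // Run this on the driver, and again inside a task to get the executor's view.
        System.out.println(describe(System.getProperty(PROPERTY)));
    }
}
```

Running it on the driver and again inside a task shows whether the relative path log4j2.xml resolves in both working directories.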
Try using --driver-java-options:
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--driver-java-options "-Dlog4j.configurationFile=log4j2.xml" \
--jars log4j-api-2.8.jar,log4j-core-2.8.jar,log4j-1.2-api-2.8.jar \
If log4j2 is being used in one of your own dependencies, it's quite easy to bypass all configuration files and use programmatic configuration for one or two high-level loggers if, and only if, the configuration file is not found.
The code below does the trick. Just change the logger name to your top-level logger.
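private static boolean configured = false;
private static void buildLog() {
    try {
        final LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
        System.out.println("Configuration found at " + ctx.getConfiguration().toString());
        // Reconstructed check (elided in the original): the built-in default means no config file was found.
        if (ctx.getConfiguration() instanceof DefaultConfiguration) {
            System.out.println("\n\n\nNo log4j2 config available. Configuring programmatically\n\n");
            ConfigurationBuilder<BuiltConfiguration> builder = ConfigurationBuilderFactory
                    .newConfigurationBuilder();
            // Console appender with its own pattern layout.
            AppenderComponentBuilder appenderBuilder = builder.newAppender("Stdout", "CONSOLE")
                    .addAttribute("target", ConsoleAppender.Target.SYSTEM_OUT);
            appenderBuilder.add(builder.newLayout("PatternLayout").addAttribute("pattern",
                    "%d [%t] %msg%n%throwable"));
            builder.add(appenderBuilder);
            // File appender writing to ./logs/ikoda.log.
            LayoutComponentBuilder layoutBuilder = builder.newLayout("PatternLayout").addAttribute("pattern",
                    "%d [%t] %-5level: %msg%n");
            appenderBuilder = builder.newAppender("file", "File").addAttribute("fileName", "./logs/ikoda.log")
                    .add(layoutBuilder);
            builder.add(appenderBuilder);
            // Top-level logger; the appender references are reconstructed from the appender names above.
            builder.add(builder.newLogger("ikoda", Level.DEBUG)
                    .add(builder.newAppenderRef("Stdout"))
                    .add(builder.newAppenderRef("file"))
                    .addAttribute("additivity", false));
            ((org.apache.logging.log4j.core.LoggerContext) LogManager.getContext(false)).start(builder.build());
        } else {
            System.out.println("Configuration file found.");
        }
        configured = true;
    } catch (Exception e) {
        System.out.println("\n\n\n\nFAILED TO CONFIGURE LOG4J2" + e.getMessage());
    }
}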

JHipster executable jar with Undertow does not work - 404

I built a jhipster application. Added an entity. Built it with
mvn -Pprod package
Application runs fine with tomcat when I use
java -jar xyz.war
But since we need Undertow for high-load scenarios and fast startup time, I simply changed the Maven dependency from Tomcat to Undertow in all the places (all the profiles) where the Tomcat starter dependency was mentioned, as per the Spring documentation: Using Undertow in Place of Tomcat
This runs fine from Eclipse, and I can see the requests are now served by Undertow and not Tomcat, as the logs print XNIO as the thread ID.
But when I build it again and try to run it with java -jar xyz.war,
the application boots up fine, but when I hit the URLs
it says not found.
What else do I need to do to get a JHipster application working with embedded Undertow?
Any quick help is appreciated, as a critical POC to push JHipster in our organization hinges on this step.
We used to support Undertow, but removed it recently, so you shouldn't have much trouble setting it back up (what you did looks good, but you didn't post your whole configuration, so it's hard to tell).
Anyway, concerning your specific use-case, you need to know why we removed Undertow:
Start-up time is indeed lower, by something like 300-500ms. We were mostly using this in our "dev" profile, as start-up time is important. But now that we have the Spring Dev Tools hot restart, this isn't useful at all anymore.
For "prod" usage, I haven't seen any performance difference between Tomcat and Undertow. Compared to just one database access (costing several ms), I guess you can't see this kind of improvement.
Besides, we have removed Undertow because it lacks a number of important features for us. Most importantly:
GZip compression support -> as you will lose this, your performance will in fact be much worse with Undertow than with Tomcat
Websocket support
Last but not least, it's easy to scale up your JHipster application by adding new nodes (and it will be even easier in JHipster 3.0), so handling a large number of users shouldn't be an issue.
I couldn't reproduce your error. Undertow seems to work fine for me.
1) I generated a new JHipster project (from master), all default options
2) I replaced Tomcat with Undertow only in this part of pom.xml:
<!-- log configuration -->
3) Build :
mvn package -Pprod
4) Start database :
docker-compose -f src/main/docker/prod.yml up -d
5) Start app
java -jar target/*.war --spring.profiles.active=prod
:: JHipster 🤓 :: Running Spring Boot 1.3.2.RELEASE ::
:: http://jhipster.github.io ::
2016-02-22 00:18:40.051 INFO 6118 --- [ main] com.mycompany.myapp.JhundertowApp : Starting JhundertowApp on pgrXps with PID 6118 (started by pgrimaud in /home/pgrimaud/workspace/tests2/32-undertow)
2016-02-22 00:18:40.054 INFO 6118 --- [ main] com.mycompany.myapp.JhundertowApp : The following profiles are active: prod
2016-02-22 00:18:44.024 WARN 6118 --- [ main] io.undertow.websockets.jsr : UT026009: XNIO worker was not set on WebSocketDeploymentInfo, the default worker will be used
2016-02-22 00:18:44.126 WARN 6118 --- [ main] io.undertow.websockets.jsr : UT026010: Buffer pool was not set on WebSocketDeploymentInfo, the default pool will be used
2016-02-22 00:18:44.742 INFO 6118 --- [ main] c.mycompany.myapp.config.WebConfigurer : Web application configuration, using profiles: [prod]
6) I changed the logging in the app to confirm it runs with Undertow
Application 'jhundertow' is running! Access URLs:
2016-02-22 00:20:20.585 TRACE 6118 --- [ XNIO-2 task-31] c.m.m.c.l.AngularCookieLocaleResolver : Parsed cookie value [%22en%22] into locale 'en'
2016-02-22 00:20:25.741 TRACE 6118 --- [ XNIO-2 task-32] c.m.m.c.l.AngularCookieLocaleResolver : Parsed cookie value [%22en%22] into locale 'en'
Fortunately for me, when I move the same war file to a RHEL system, it works just fine. :-) I am accepting @pgrimaud's answer. Thanks @deepu and @Julien. You guys are awesome.
I will investigate what's going wrong on my Win7 machine and will post back here if I am able to figure it out. (Clearing the npm cache didn't help. I will reinstall node.js and npm, as I had updated them for my other node.js work, and see if that helps.) I will also try to debug spring-boot-starter-undertow.
Finally, a solution for the nemesis is in place. Here is an update from me: today I started to debug the Spring Boot and Undertow code and realized that Spring Boot is looking for resources in the locations below:
private static final String[] CLASSPATH_RESOURCE_LOCATIONS = {
"classpath:/META-INF/resources/", "classpath:/resources/",
"classpath:/static/", "classpath:/public/" };
After that, I created a folder called resources inside the META-INF directory and copied all the resources into it using 7zip. And lo and behold, it works! :-)
Although Spring Boot is supposed to also load resources from
private static final String[] SERVLET_RESOURCE_LOCATIONS = { "/" };
for some reason, it is not doing so (which is where JHipster puts all of the resource files).
I thought this was a bug in the Spring Boot version that JHipster uses, so I upgraded my app's Spring Boot version to 1.3.3.RELEASE, but that did not help either.
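To see where the static files ended up inside the packaged war without unpacking it by hand, something like the following JDK-only sketch can list the entries under a given prefix. WarResourceCheck and its method names are hypothetical, and the default prefix is just the META-INF/resources/ folder mentioned above.

```java
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical diagnostic: where do the static files sit inside a packaged war or jar? */
final class WarResourceCheck {

    /** Keeps the entry names that start with the given prefix, sorted. */
    static List<String> underPrefix(Collection<String> entryNames, String prefix) {
        return entryNames.stream()
                .filter(name -> name.startsWith(prefix))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Reads the names of all file (non-directory) entries in a war or jar. */
    static List<String> fileEntries(String archivePath) throws IOException {
        try (ZipFile archive = new ZipFile(archivePath)) {
            return archive.stream()
                    .filter(entry -> !entry.isDirectory())
                    .map(ZipEntry::getName)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: WarResourceCheck <archive> [prefix]");
            return;
        }
        List<String> names = fileEntries(args[0]);
        String prefix = args.length > 1 ? args[1] : "META-INF/resources/";
        List<String> hits = underPrefix(names, prefix);
        System.out.println(hits.size() + " of " + names.size() + " files are under " + prefix);
        hits.stream().limit(10).forEach(System.out::println);
    }
}
```

Comparing the count for META-INF/resources/ with the count for an empty prefix (all files) on the war that fails and on one that works should show whether the packaging differs between the two.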

Using log4j with JBoss 7.1

How can I use log4j with JBoss 7.1?
I have a log4j-1.2.16.jar in my WebContent/WEB-INF/lib folder. When I output the result of Logger.getRootLogger().getClass().toString() I get class org.jboss.logmanager.log4j.BridgeLogger which is wrong.
If I add Dependencies: org.apache.commons.logging to my MANIFEST.MF file I get the same result.
This results in the problem that my log4j.properties file (which I created under WEB-INF/classes) is ignored.
There will soon be a way that will just work for you, but currently you have to exclude the log4j dependency from your deployment. You will also have to manually invoke the PropertyConfigurator.configure() to load the properties file.
The jboss-deployment-structure.xml file needs to contain the following:
<jboss-deployment-structure><deployment>
<!-- Exclusions allow you to prevent the server from automatically adding some dependencies -->
<exclusions><module name="org.apache.log4j" /></exclusions>
</deployment></jboss-deployment-structure>
Then including your own version of log4j in the WEB-INF/lib directory should work as you expect it to.
