Hikari NoSuchMethodError on AWS EMR/Spark - apache-spark

I am trying to upgrade EMR from 5.13 to 5.35, using spark-2.4.8. The jar I'm trying to use has a dependency on HikariCP:4.0.3, which is called to set the DB pool config via setKeepaliveTime. While I can run my job fine on my local machine, it bombs out on EMR-5.35 with the following error:
java.lang.NoSuchMethodError: com.zaxxer.hikari.HikariConfig.setKeepaliveTime(J)
The problem is that at runtime HikariConfig is being loaded from file:/usr/lib/spark/jars/HikariCP-java7-2.4.12.jar instead of the version provided as a dependency in my custom/fat jar. The workaround right now is to remove that jar, but is there an elegant way to find out, on EMR itself, where that jar comes from, and how could we remove it on start-up?

Just in case anyone else faces this: the fix was shading (a process that allows renaming packages in the uber jar). I basically had to make sure the dependency I use doesn't get overridden by the stale one shipped with EMR-5.35.0. It looked something like the below:
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("com.zaxxer.hikari.**" -> "x_hikari_conf.@1")
    .inLibrary("x" % "y" % z)
    .inProject
)
And that was pretty much it: after the above lines were put in and the new jar was created, it worked like a charm.
More on shading can be read here.
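As for telling, on EMR itself, which jar a class is actually loaded from, one option is to ask the JVM from inside the job. A minimal sketch in Scala (where you put the println is up to you; the path in the comment is just what you would expect to see before the fix):

import com.zaxxer.hikari.HikariConfig

// Prints the code source the class was resolved from, e.g.
// file:/usr/lib/spark/jars/HikariCP-java7-2.4.12.jar when the bundled jar wins.
println(classOf[HikariConfig].getProtectionDomain.getCodeSource.getLocation)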

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what the py-files to be added would have to look like.
Would it be enough to just create a python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. To fix it, I followed this article. I also took the source code of flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to the Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
    .config('spark.executorEnv.PYTHONPATH', './myUDFs') \
    .getOrCreate()
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]: Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that an archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:

Unable to start geomesa-accumulo

hduser@Neha-PC:/usr/local/geomesa-tutorials$ java -cp geomesa-tutorials-accumulo/geomesa-tutorials-accumulo-quickstart/target/geomesa-tutorials-accumulo-quickstart-2.3.0-SNAPSHOT.jar org.geomesa.example.accumulo.AccumuloQuickStart --accumulo.instance.id accumulo --accumulo.zookeepers localhost:2184 --accumulo.user root --accumulo.password PASS1234 --accumulo.catalog table1
Picked up JAVA_TOOL_OPTIONS: -Dgeomesa.hbase.coprocessor.path=hdfs://localhost:8020/hbase/lib/geomesa-hbase-distributed-runtime_2.11-2.2.0.jar
Loading datastore
java.lang.IncompatibleClassChangeError: Method org.locationtech.geomesa.security.AuthorizationsProvider.apply(Ljava/util/Map;Ljava/util/List;)Lorg/locationtech/geomesa/security/AuthorizationsProvider; must be InterfaceMethodref constant
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory$.buildAuthsProvider(AccumuloDataStoreFactory.scala:234)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory$.buildConfig(AccumuloDataStoreFactory.scala:162)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:48)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:36)
at org.geotools.data.DataAccessFinder.getDataStore(DataAccessFinder.java:121)
at org.geotools.data.DataStoreFinder.getDataStore(DataStoreFinder.java:71)
at org.geomesa.example.quickstart.GeoMesaQuickStart.createDataStore(GeoMesaQuickStart.java:103)
at org.geomesa.example.quickstart.GeoMesaQuickStart.run(GeoMesaQuickStart.java:77)
at org.geomesa.example.accumulo.AccumuloQuickStart.main(AccumuloQuickStart.java:25)
You need to ensure that all versions of GeoMesa on the classpath are the same. Just from your command, it seems you are at least mixing 2.3.0-SNAPSHOT with 2.2.0. Try checking out the git tag of the tutorials project that corresponds to the GeoMesa version you want, as described here. If you want to use a SNAPSHOT version, you need to make sure that you have pulled the latest changes for each project.

When to use SPARK_CLASSPATH or SparkContext.addJar

I'm using a standalone spark cluster, one master and 2 workers.
I really don't understand how to use SPARK_CLASSPATH or SparkContext.addJar wisely. I tried both, and it looks like addJar doesn't work as I believed it would.
In my case I tried to use some joda-time functions, in closures and outside. If I set SPARK_CLASSPATH with a path to the joda-time jar, everything works fine. But if I remove SPARK_CLASSPATH and add this in my program:
JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "name", "path-to-spark-home", "path-to-the-job-jar");
sc.addJar("path-to-joda-jar");
it doesn't work anymore, although in the logs I can see:
14/03/17 15:32:57 INFO SparkContext: Added JAR /home/hduser/projects/joda-time-2.1.jar at http://127.0.0.1:46388/jars/joda-time-2.1.jar with timestamp 1395066777041
and immediately after:
Caused by: java.lang.NoClassDefFoundError: org/joda/time/DateTime
at com.xxx.sparkjava1.SimpleApp.main(SimpleApp.java:57)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.joda.time.DateTime
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
I used to suppose that SPARK_CLASSPATH was setting the classpath for the driver part of the job, and SparkContext.addJar was setting the classpath for the executors, but it does not seem right anymore.
Does anyone know better than me?
SparkContext.addJar is broken in 0.9, as is the ADD_JARS environment variable. It used to work as documented in 0.8.x, and the fix is already committed to master, so it's expected in the next release. For now you can either use the workaround described in Jira or make a patched Spark build.
See the relevant mailing list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3C5234E529519F4320A322B80FBCF5BDA6@gmail.com%3E
Jira issue: https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1089
SPARK_CLASSPATH has been deprecated since Spark 1.0. You can add jars to the classpath programmatically, in the spark-defaults.conf file, or with spark-submit flags.
Add jars to a Spark Job - spark-submit
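For reference, a minimal sketch of the programmatic route in Scala, reusing the jar path from the question (adjust to your own paths):

import org.apache.spark.{SparkConf, SparkContext}

// setJars ships the listed jars to the executors when the job is submitted;
// the spark-submit equivalent is the --jars flag.
val conf = new SparkConf()
  .setMaster("spark://localhost:7077")
  .setAppName("name")
  .setJars(Seq("/home/hduser/projects/joda-time-2.1.jar"))
val sc = new SparkContext(conf)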

Perl: libapt-pkg-perl AptPkg::Cache->new strange behaviour under precise

I have a very strange problem with the constructor of the AptPkg::Cache object in the precise package of libapt-pkg-perl (v. 0.1.25).
The Perl script is designed to download a Debian package for three different architectures (i386, armel, armhf). For each architecture I do the following:
1. Configure AptPkg::Config '$_config' with the right parameters and package lists for the desired architecture.
2. Create the cache object with AptPkg::Cache->new.
3. Call the method AptPkg::Cache->policy to create the AptPkg::Policy object.
4. Call the method AptPkg::Policy->candidate("program-name").
5. Download the package for the selected architecture.
This works very well with Ubuntu Lucid, but with Ubuntu Precise I can only download the package for the first architecture defined. For the other two architectures there will be no installation candidate (method AptPkg::Policy->candidate("Package-Name") doesn't return an object).
I tried to build a workaround and found one solution that makes the script work for all three architectures in Precise without problems:
If I create the cache object (with AptPkg::Cache->new) twice in a row, it works and the script downloads the Debian package for all three architectures:
my $cache = AptPkg::Cache->new;
$cache = AptPkg::Cache->new;
I'm sure that the problem has something to do with the method AptPkg::Cache->new, because I checked everything else that could cause the problem twice. All config variables are set correctly, and I even get a different hash from AptPkg::Cache->new for each architecture, but it seems that I am overlooking something important.
I'm not very familiar with Perl, so I am asking if someone can explain why the script works with the workaround but not without it. Besides, it looks quite strange to have the same line of code twice in your script.
Maybe you hit this bug - https://bugs.launchpad.net/ubuntu/+source/libapt-pkg-perl/+bug/994509
There is a script there to test if you're affected. If it's something else consider submitting a bug report.
edit: Just saw this is 11 months old :/

Error running cassandra Word count example

I am trying to run the Cassandra word count example in Eclipse. I have loaded all the requisite jar files, but I am still getting some errors in the file CassandraDemonThread.java:
TNonblockingServer.Args serverArgs = new TNonblockingServer.Args(serverTransport).inputTransportFactory(inTransportFactory)
.outputTransportFactory(outTransportFactory)
.inputProtocolFactory(tProtocolFactory)
.outputProtocolFactory(tProtocolFactory)
.processor(processor);
It throws the compilation error: TNonblockingServer.Args cannot be resolved to a type
Can somebody tell me if I am missing any file to be linked?
Thanks for the help.
Sounds like you don't have lib/*.jar on your runtime classpath, or, less likely, you have an old Thrift jar somewhere else that's getting used instead of the right one.
