Exception when trying to write a file to HDFS from Zeppelin - apache-spark

When trying to write to HDFS from Spark within Zeppelin, I am receiving this ClassNotFoundException for org.apache.hadoop.mapred.DirectFileOutputCommitter:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectFileOutputCommitter not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2106)
at org.apache.hadoop.mapred.JobConf.getOutputCommitter(JobConf.java:725)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:983)
Code that is trying to run:
val model = LinearRegressionWithSGD.train(someRDD, numIterations)
val modelPath = "hdfs:///some_path/LinearRegressionWithSGD"
model.save(sc, modelPath)
When searching for this class, I cannot even find it. The closest I can find is org.apache.hadoop.mapred.FileOutputCommitter in Hadoop.
I am using commit 18c8c9ea512a0d87699a73e2ca26192d03748661 (Oct 9) of Zeppelin, Spark 1.5.0 on YARN, and Hadoop 2.6.

I had the same problem. Looked for that file in "hadoop-mapreduce-client-core.X.X.X.jar", but couldn't find that in the jar.
I fixed the problem by adding org.apache.hadoop.mapred.DirectFileOutputCommitter to my repository. Source of that file is found here : https://gist.github.com/apivovarov
Not sure yet what's the root cause of this issue. Digging into it. Will update here once I have the answer.

Related

How to run CreateIndex function in Hyperspace (spark)

I am trying to create an index using hyperspace in pyspark.
But I am getting this error
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id','name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)
hs.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running on Azure databricks environment-
Spark 3.0.0 scala 2.12
When I try to do the same on spark 2.4.2 scala 2.12 or scala 2.11
I get the error in the same function (CreateIndex)
Here I get the following error-
.Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest some solutions.
Per the last comment of https://github.com/microsoft/hyperspace/discussions/285, it is a known issue with Databricks runtime.
If you use open source spark, it should work.
Seeking a solution with Databricks team.

Nifi java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations

I am following this link to set up Nifi putHDFS to write to Azure Data Lake.Connecting to Azure Data Lake from a NiFi dataflow
The Nifi is within HDF 3.1 VM and the Nifi version is 1.5.
We got the jar files mentioned in the above link, from a HD Insight(v 3.6, which supports hadoop 2.7) head node, these jars are:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.1.4.jar
hadoop-azure-datalake.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
And they are copied to the folder /usr/lib/hdinsight-datalake of the HDF cluster Nifi host(we only have 1 host in the cluster). And the putHDFS config(picture) is as attached(exactly as the link above)putHDFS attributes.
But in the nifi log we are getting this:
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V at org.apache.hadoop.fs.adl.AdlConfKeys.addDeprecatedKeys(AdlConfKeys.java:112) at org.apache.hadoop.fs.adl.AdlFileSystem.(AdlFileSystem.java:92) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$ExtendedConfiguration.getClassByNameOrNull(AbstractHadoopProcessor.java:490) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172) at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$1.run(AbstractHadoopProcessor.java:322) at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$1.run(AbstractHadoopProcessor.java:319) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.getFileSystemAsUser(AbstractHadoopProcessor.java:319) at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.resetHDFSResources(AbstractHadoopProcessor.java:281) at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.abstractOnScheduled(AbstractHadoopProcessor.java:205) ... 16 common frames omitted
The AdlConfKeys class is from the hadoop-azure-datalake.jar file above. From the above exception, it seems to me this AdlConfKeys is loading an older version of the org.apache.hadoop.conf.Configuration class, which does not have the reloadExistingConfigurations method. However we cannot find out from where this older class gets loaded. This HDF 3.1 has the hadoop-common-XXXX.jar in multiple locations, all those on version 2.7 something has the org.apache.hadoop.conf.Configuration containing the method reloadExistingConfigurations, only those on version 2.3 don't have this method.(I decompiled both 2.7 and 2.3 jars to find out)
[root#NifiHost /]# find . -name *hadoop-common*
(the output is a lot more than below, however I removed some for display purpose, most of them are on 2.7, only 2 of them are on version 2.3):
./var/lib/nifi/work/nar/extensions/nifi-hadoop-libraries-nar-1.5.0.3.1.0.0-564.nar-unpacked/META-INF/bundled-dependencies/hadoop-common-2.7.3.jar
./var/lib/ambari-agent/cred/lib/hadoop-common-2.7.3.jar
./var/lib/ambari-server/resources.backup/views/work/WORKFLOW_MANAGER{1.0.0}/WEB-INF/lib/hadoop-common-2.7.3.2.6.2.0-205.jar
./var/lib/ambari-server/resources.backup/views/work/HUETOAMBARI_MIGRATION{1.0.0}/WEB-INF/lib/hadoop-common-2.3.0.jar
./var/lib/ambari-server/resources/views/work/HUETOAMBARI_MIGRATION{1.0.0}/WEB-INF/lib/hadoop-common-2.3.0.jar
./var/lib/ambari-server/resources/views/work/HIVE{1.5.0}/WEB-INF/lib/hadoop-common-2.7.3.2.6.4.0-91.jar
./var/lib/ambari-server/resources/views/work/CAPACITY-SCHEDULER{1.0.0}/WEB-INF/lib/hadoop-common-2.7.3.2.6.4.0-91.jar
./var/lib/ambari-server/resources/views/work/TEZ{0.7.0.2.6.2.0-205}/WEB-INF/lib/hadoop-common-2.7.3.2.6.2.0-205.jar
./usr/lib/ambari-server/hadoop-common-2.7.2.jar
./usr/hdf/3.1.0.0-564/nifi/ext/ranger/install/lib/hadoop-common-2.7.3.jar
./usr/hdf/3.0.2.0-76/nifi/ext/ranger/install/lib/hadoop-common-2.7.3.jar
So I really don't know how Nifi managed to find a hadoop-common jar file or something else containing the Configuration class does not have the method reloadExistingConfigurations(). We do not have any customized Nar files deployed to Nifi either, everything is pretty much default from whatever HDF 3.1 has on Nifi.
Please advise. I've been spending a whole day on this but can't fix the issue. Appreciate your help.
I think the Azure JARs you are using require a newer version of hadoop-common than the 2.7.3 one that NiFi is using.
If you look at the Configuration class from 2.7.3 there is no "reloadExistingConfiguration" method:
https://github.com/apache/hadoop/blob/release-2.7.3-RC2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java
It appears to be introduced sometime during 2.8.x:
https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub however when I try the very first example Load as a DataFrame using the Data Source API I get the below exception.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are couple of things that are driving me crazy from those examples:
1)The import statement import org.apache.phoenix.spark._ gives me below exception in my code:
cannot resolve symbol phoenix
I have included below jars in my sbt
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get the deprecated warning for symbol load.
I googled about that warnign but didn't got any reference and I was not able to find any example of the suggested method. I am not able to find any other good resource which guides on how to connect to Phoenix. Thanks for your time.
please use .read instead of load as shown below
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
Its late to answer but here's what i did to solve a similar problem(Different method not found and deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from hbase installation lib folder and add it to your project .Also add pheonix spark jars .Make sure to use compatible versions of spark and pheonix spark.Spark 2.0+ is compatible with pheonix-spark-4.10+
maven-central-link.This resolved the NoSuchMethodError
2.) About the load - The load method has long since been deprecated .Use sqlContext.phoenixTableAsDataFrame.For reference see this Load as a DataFrame directly using a Configuration object

Read multiple files with SparkSession in Spark 2.0

In Spark 1.6 to read multiple files, I have used:
JavaSparkContext ctx;
ctx.textFile(filePaths);
With filePaths is the directory to files. For example we have:
/home/user/folderA/0.log,/home/user/folderB/0.log. Each path separates by comma character.
But, when I upgrade to Spark 2.0. Method
SparkSession sparkSession;
sparkSession.read().textFile(filePaths);
doesn't work. The code throws exception: Path does not exist:
Question: Is there any solution to read multiple files, from multiple paths in Spark 2.0 just like Spark 1.6?
Edit: I try to call the method like Spark 1.6 using:
sparkSession.sparkContext().textFile(filePaths, 1).toJavaRDD();
The problem will solved. But, Is there have another solution?

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but data is not read yet. Spark only starts reading and processing the data, when you ask for some kind of output (like a df.count(), or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one or other reason.
If anyone else runs into this problem, I finally solved it. I removed the CDH spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issues was with the CDH version, but I'm not going to waste anymore time trying to figure it out.

Resources