Spark connecting to Phoenix: NoSuchMethodError

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things in those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the following error in my code:
cannot resolve symbol phoenix
I have included the following jars in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I could not find any example of the suggested method. I also cannot find any other good resource that explains how to connect to Phoenix. Thanks for your time.

Please use sqlContext.read ... .load() instead of the deprecated sqlContext.load, as shown below:
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()

It's late to answer, but here is what I did to solve a similar problem (a different missing method, plus the deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to the project. Also add the phoenix-spark jars, and make sure to use compatible versions of Spark and phoenix-spark. Spark 2.0+ is compatible with phoenix-spark 4.10+ (maven-central-link). This resolved the NoSuchMethodError.
2.) About load: the load method has long been deprecated. Use sqlContext.phoenixTableAsDataFrame instead. For reference, see the example "Load as a DataFrame directly using a Configuration object".

Related

Why DeltaTable.forPath throws "[path] is not a Delta table"?

I'm trying to read a Delta Lake table that I loaded previously using Spark, and I'm using the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
Now, when I try to read the table again, I get the error below. It was working fine, but suddenly it throws this error. What might be the reason?
Note:
I checked the files in the Delta Lake path; they look good.
A colleague was able to read the same Delta Lake table.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)
AnalysisException: /some/path/ is not a Delta table.
AnalysisException is thrown when the given path has no transaction log under the _delta_log directory.
There could be other issues, but that's the first thing to check.
BTW, judging by the stack trace, you may not be using the latest Delta Lake release (2.0.0 at the time of writing). Please upgrade as soon as possible, as it brings many improvements.
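As a rough illustration of that first check, here is a minimal Python sketch that tests whether a path contains a non-empty _delta_log directory. The helper name has_delta_log is hypothetical, and the sketch assumes the table lives on a local filesystem; a table on cloud or HDFS storage would need the corresponding filesystem client instead.

```python
import os


def has_delta_log(table_path):
    """Return True if table_path holds a non-empty _delta_log directory.

    Hypothetical helper for illustration; it only inspects the local
    filesystem and does not validate the log contents.
    """
    log_dir = os.path.join(table_path, "_delta_log")
    if not os.path.isdir(log_dir):
        return False
    # An empty _delta_log directory means there is no transaction log to read.
    return len(os.listdir(log_dir)) > 0
```

If this returns False for the exact path string passed to DeltaTable.forPath, the path seen by the application is worth double-checking before looking at anything else.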

Writing avro files using Spark 2.3

I'm somewhat new to Spark. I understand that reading and writing Avro files is built into Spark 2.4, but unfortunately I'm limited to version 2.3 right now. I'm having trouble writing Avro and keep getting errors. Am I not installing this properly?
I used this in the Spark session setup:
avro_loc = "com.databricks:spark-avro_2.11:4.0.0"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' + avro_loc + ' pyspark-shell'
And I've tried these two versions of the write code:
df.write.mode('overwrite')\
.option('batchsize',10000) \
.avro('{}/df.avro' \
.format(HDFS_LOC))
df.write.format('avro').save('/user/Data/df.avro')
I get these errors for the first and second snippets above, respectively:
AttributeError: 'DataFrameWriter' object has no attribute 'avro'
AnalysisException: 'Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;
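One possible explanation, offered as an assumption and not a confirmed diagnosis: the second error suggests the Avro package was not on the classpath when the JVM started. PYSPARK_SUBMIT_ARGS is read when PySpark launches the JVM, so it has to be set before the first session is created. The first error is expected in PySpark, since DataFrameWriter has no avro attribute there. The sketch below builds the submit arguments and uses the full data source name com.databricks.spark.avro that the Databricks package provides; the helper build_submit_args is hypothetical, and the Spark calls are left as comments because they need a running PySpark installation.

```python
import os

# Package coordinates taken from the question.
AVRO_PACKAGE = "com.databricks:spark-avro_2.11:4.0.0"
# Full data source name provided by the Databricks spark-avro package.
AVRO_FORMAT = "com.databricks.spark.avro"


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value for a list of Maven coordinates.

    Hypothetical helper for illustration.
    """
    return "--packages {} pyspark-shell".format(",".join(packages))


# Must run before any SparkSession or SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([AVRO_PACKAGE])

# Only after the variable is set (requires pyspark; not executed here):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# df.write.format(AVRO_FORMAT).mode("overwrite").save("/user/Data/df.avro")
```

In a notebook where a session already exists, the kernel usually needs a restart for the new value to take effect.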

How to run CreateIndex function in Hyperspace (spark)

I am trying to create an index using Hyperspace in PySpark, but the code below fails with the error shown after it.
sample_data = [(1, "name1"), (2, "name2")]
spark.createDataFrame(sample_data, ['id','name']).write.mode("overwrite").parquet("table")
df = spark.read.parquet("table")
from hyperspace import *
# Create an instance of Hyperspace
hs = Hyperspace(spark)
hs.createIndex(df, IndexConfig("index", ["id"], ["name"]))
java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.SerializableFileStatus cannot be cast to org.apache.hadoop.fs.FileStatus
I am running in an Azure Databricks environment with Spark 3.0.0 and Scala 2.12.
When I try the same on Spark 2.4.2 with Scala 2.12 or Scala 2.11, I get an error in the same function (createIndex), this time the following:
.Py4JJavaError: An error occurred while calling None.com.microsoft.hyperspace.index.IndexConfig.
: java.lang.NoClassDefFoundError:
Can anyone suggest a solution?
Per the last comment on https://github.com/microsoft/hyperspace/discussions/285, this is a known issue with the Databricks runtime.
If you use open-source Spark, it should work.
We are seeking a solution with the Databricks team.

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (such as df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but for one reason or another it is not available on your system (or classpath).
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. I'm not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.

Exception when trying to write a file to HDFS from Zeppelin

When trying to write to HDFS from Spark within Zeppelin, I am receiving this ClassNotFoundException for org.apache.hadoop.mapred.DirectFileOutputCommitter:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectFileOutputCommitter not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2106)
at org.apache.hadoop.mapred.JobConf.getOutputCommitter(JobConf.java:725)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:983)
Code that is trying to run:
val model = LinearRegressionWithSGD.train(someRDD, numIterations)
val modelPath = "hdfs:///some_path/LinearRegressionWithSGD"
model.save(sc, modelPath)
When searching for this class, I cannot even find it. The closest I can find is org.apache.hadoop.mapred.FileOutputCommitter in Hadoop.
I am using commit 18c8c9ea512a0d87699a73e2ca26192d03748661 (Oct 9) of Zeppelin, Spark 1.5.0 on YARN, and Hadoop 2.6.
I had the same problem. I looked for that class in "hadoop-mapreduce-client-core.X.X.X.jar" but couldn't find it there.
I fixed the problem by adding org.apache.hadoop.mapred.DirectFileOutputCommitter to my repository. The source of that file can be found here: https://gist.github.com/apivovarov
I'm not sure yet what the root cause of this issue is. I'm digging into it and will update here once I have the answer.
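The "look for the class in the jar" step can be scripted. Since a jar is a zip archive, the following minimal Python sketch checks whether a given fully qualified class name has a matching .class entry. The helper name jar_contains_class is hypothetical, and the jar path is whatever file you want to inspect.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for class_name.

    Hypothetical helper for illustration; class_name is a fully qualified
    name such as org.apache.hadoop.mapred.DirectFileOutputCommitter.
    """
    # Class files are stored under a path that mirrors the package name.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Running this over every jar on the interpreter's classpath is a quick way to confirm whether the missing class ships with any of them.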

Resources