NoClassDefFoundError when using avro in spark-shell - apache-spark

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). The shell recognizes the schema when it creates the DataFrame object, but any action on the data throws the NoClassDefFoundError when Spark tries to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster, and I've also passed it with the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError

The DataFrame object itself is created at the val df = ... line, but no data is read at that point. Spark only starts reading and processing the data when you ask for some kind of output (such as df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but for one reason or another it is not available on your system (or classpath).
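One way to confirm where the class is missing is to try loading it by name, first in the driver and then inside a task. The following is a minimal sketch under that assumption; ClasspathCheck and isLoadable are hypothetical names, not part of Spark or Avro.

```scala
// Hypothetical helper (not part of Spark or Avro): reports whether a class
// can be loaded by name in the JVM where this code runs.
object ClasspathCheck {
  def isLoadable(className: String): Boolean =
    try {
      // initialize = false: load the class without running its static initializers
      Class.forName(className, false, Thread.currentThread().getContextClassLoader)
      true
    } catch {
      // Missing class, or a class whose own dependencies cannot be linked
      case _: ClassNotFoundException | _: LinkageError => false
    }
}

// In spark-shell, check the driver JVM first:
//   ClasspathCheck.isLoadable("org.apache.avro.mapred.AvroWrapper")
// then run the same check inside a task to see what the executors have:
//   sc.parallelize(1 to 4).map(_ => ClasspathCheck.isLoadable("org.apache.avro.mapred.AvroWrapper")).collect()
```

If the driver reports true but the tasks report false, the jar is reaching the shell but not the executors, which would match an error that only appears once an action runs.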

If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html instead. After that everything worked fine. I'm not sure what the issue was with the CDH version, but I'm not going to spend any more time trying to figure it out.

Related

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in Databricks to get a Spark session and use it to open a CSV file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (it will make more sense if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it working by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the pyspark code, as I found that reading CSV was working in the interactive shell.
Please note that the example code you are using is for Spark 2.x.
Neither "spark" nor "SparkSession" is available in Spark 1.x. The error messages you are getting point to a possible version issue (you may be running Spark 1.x).
Check the Spark version you are using.

How to save a dataframe into HBase?

I have a DataFrame with a schema, and I have also created a table in HBase with Phoenix. I want to save this DataFrame to HBase using Spark. I have tried the instructions in the link below and ran spark-shell with the Phoenix plugin dependencies:
spark-shell --jars ./phoenix-spark-4.8.0-HBase-1.2.jar,./phoenix-4.8.0-HBase-1.2-client.jar,./spark-sql_2.11-2.0.1.jar
However, I get an error even when I run the read function:
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE",
| "zkUrl" -> hbaseConnectionString))
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I have a feeling that I am on the wrong track. If there is another way of putting data generated in Spark into HBase, I would appreciate it if you shared it with me.
https://phoenix.apache.org/phoenix_spark.html

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!
You haven't mentioned your Spark version or the details of the exception, but
please see PHOENIX-2287, which has been fixed. It says:
Environment: HBase 1.1.1 running in standalone mode on OS X *
Spark 1.5.0 Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56 Updated patch adds
support for Spark 1.5.0, and is backwards compatible back down to
1.3.0 (manually tested, Spark version profiles may be worth looking at in the future) In 1.5.0, they've gone and explicitly hidden the
GenericMutableRow data structure. Fortunately, we are able to the
external-facing 'Row' data type, which is backwards compatible, and
should remain compatible in future releases as well. As part of the
update, Spark SQL deprecated a constructor on their 'DecimalType'.
In updating this, I exposed a new issue, which is that we don't
carry-forward the precision and scale of the underlying Decimal type
through to Spark. For now I've set it to use the Spark defaults, but
I'll create another issue for that specifically. I've included an
ignored integration test in this patch as well.

How to add Hive properties at runtime in spark-shell

How do you set a Hive property like hive.metastore.warehouse.dir at runtime? Or at least, is there a more dynamic way of setting such a property than putting it in a file like spark_home/conf/hive-site.xml?
I faced the same issue, and setting the Hive properties from Spark (2.4.0) worked for me. Below are the options I tried: spark-shell, spark-submit and SparkConf.
Option 1 (spark-shell)
spark-shell --conf spark.hadoop.hive.metastore.warehouse.dir=some_path\metastore_db_2
Initially I ran spark-shell with the unprefixed hive.metastore.warehouse.dir set to some_path\metastore_db_2, which produced the following warning:
Warning: Ignoring non-spark config property:
hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
Even so, when I create a Hive table with:
bigDf.write.mode("overwrite").saveAsTable("big_table")
the Hive metadata is stored correctly under the metastore_db_2 folder.
When I use spark.hadoop.hive.metastore.warehouse.dir the warning disappears and the results are still saved in the metastore_db_2 directory.
Option 2 (spark-submit)
To use hive.metastore.warehouse.dir when submitting a job with spark-submit, I followed these steps.
First I wrote some code to save some random data with Hive:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("metastore_test").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
var dfA = spark.createDataset(Seq(
(1, "val1", "p1"),
(2, "val1", "p2"),
(3, "val2", "p3"),
(3, "val3", "p4"))).toDF("id", "value", "p")
dfA.write.mode("overwrite").saveAsTable("metastore_test")
spark.sql("select * from metastore_test").show(false)
Next I submitted the job with:
spark-submit --class org.tests.Main \
--conf spark.hadoop.hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2 \
spark-scala-test_2.11-0.1.jar
The metastore_test table was properly created under the C:\winutils\hadoop-2.7.1\bin\metastore_db_2 folder.
Option 3 (SparkConf)
Set the property on the SparkConf in the Spark code, before the SparkSession is built:
val sparkConf = new SparkConf()
.setAppName("metastore_test")
.set("spark.hadoop.hive.metastore.warehouse.dir", "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
.setMaster("local")
This attempt was successful as well.
The question that remains is: why do I have to prefix the property with spark.hadoop for it to work as expected?
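A likely explanation, as far as I understand Spark's behavior: Spark only accepts keys that start with "spark." as its own configuration (hence the "Ignoring non-spark config property" warning), and it copies every spark.hadoop.* entry into the Hadoop Configuration with the prefix removed. That Configuration is what the Hive client then reads. The sketch below mimics that copy on a plain Map to show the effect. HadoopPrefixSketch and toHadoopProps are hypothetical names, and this is an illustration, not Spark's own code.

```scala
// Illustration only: mimics how spark.hadoop.* entries end up as plain
// Hadoop properties. Not Spark's actual implementation.
object HadoopPrefixSketch {
  private val Prefix = "spark.hadoop."

  // Keep only the spark.hadoop.* entries and strip the prefix from their keys
  def toHadoopProps(sparkConf: Map[String, String]): Map[String, String] =
    sparkConf.collect {
      case (key, value) if key.startsWith(Prefix) => key.stripPrefix(Prefix) -> value
    }
}

// Example input: one ordinary Spark setting and one prefixed Hive setting
// (the path is a placeholder)
val hadoopProps = HadoopPrefixSketch.toHadoopProps(Map(
  "spark.app.name" -> "metastore_test",
  "spark.hadoop.hive.metastore.warehouse.dir" -> "/tmp/metastore_db_2"))
// hadoopProps holds a single entry, keyed by hive.metastore.warehouse.dir
```

Note also that the Spark documentation describes hive.metastore.warehouse.dir as deprecated since Spark 2.0.0 in favor of spark.sql.warehouse.dir, so on 2.x that property may be the simpler one to set.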

Spark workflow with jar

I'm trying to understand the extent to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are giving me the right indication, then I have to compile my script into a jar and send it to all the Spark nodes. That is, my workflow would be:
1. Write the analysis script, which will upload a jar of itself (created below).
2. Go make the jar.
3. Run the script.
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and did sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD
object Runner {
def main(args: Array[String]) {
val logFile = "myData.txt"
val conf = new SparkConf()
.setAppName("MyFirstSpark")
.setMaster("spark://Spark-Master:7077")
val sc = new SparkContext(conf)
sc.addJar("Analysis.jar")
sc.addFile(logFile)
val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()
Analysis.run(logData)
}
}
object Analysis{
def run(logData: RDD[String]) {
val numA = logData.filter(line => line.contains("a")).count()
val numB = logData.filter(line => line.contains("b")).count()
println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
}
}
You are creating an anonymous function when you call 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
The class generated for that function is not available on the workers unless the jar is distributed to them. Did the stack trace on the worker highlight a missing symbol?
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
You can use SparkContext.jarOfClass(Analysis.getClass) to look up the jar that you want to distribute, instead of hard-coding its path. (jarOfObject takes the object itself, so SparkContext.jarOfObject(Analysis) is the equivalent call; passing Analysis.getClass to jarOfObject would look up the jar of java.lang.Class instead.) From the API docs:
Find the JAR from which a given class was loaded, to make it easy for
users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
You want to do something like:
sc.addJar(SparkContext.jarOfClass(Analysis.getClass).get)
HTH!
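Both lookups return an Option, so calling .get on the result fails when the class was not loaded from a jar, for example when running from an IDE's class directory. The sketch below shows a similar lookup in plain JVM terms, without Spark, to make the None case visible. JarLookupSketch and jarOf are hypothetical names, and this is not Spark's implementation.

```scala
// Illustration only (not Spark's implementation): find the jar a class was
// loaded from using the JVM's own code-source information.
object JarLookupSketch {
  def jarOf(cls: Class[_]): Option[String] =
    Option(cls.getProtectionDomain.getCodeSource) // null for JVM core classes
      .flatMap(source => Option(source.getLocation))
      .map(_.getPath)
      .filter(_.endsWith(".jar")) // a plain class directory is not a jar
}

// A JVM core class has no code source, so there is no jar to report
val coreJar: Option[String] = JarLookupSketch.jarOf(classOf[String])
```

With Spark itself, a more defensive form of the call above would be along the lines of SparkContext.jarOfClass(Analysis.getClass).foreach(sc.addJar), which skips the upload when no jar is found instead of throwing.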
