Reading Avro into spark using spark-avro - apache-spark

I'm not being able to read spark files using the spark-avro library. Here are the steps I took:
Got the jar from: http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10/0.1
Invoked spark-shell using spark-shell --jars avro/spark-avro_2.10-0.1.jar
Executed commands as given in the git readme:
import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
The action sqlContext.avroFile("episodes.avro") fails with the following error:
scala> val episodes = sqlContext.avroFile("episodes.avro")
java.lang.IncompatibleClassChangeError: class com.databricks.spark.avro.AvroRelation has interface org.apache.spark.sql.sources.TableScan as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)

My bad. The readme clearly says:
Versions
Spark changed how it reads / writes data in 1.4, so please use the correct version of this dedicated for your spark version
1.3 -> 1.0.0
1.4+ -> 1.1.0-SNAPSHOT
I used spark:1.3.1 and spark-avro: 1.1.0. When I used spark-avro: 1.0.0, it worked.

Since spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter.
To load/save data in Avro format, you need to specify the data source option format as avro.
Example:
val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName(appName).master(master).getOrCreate()
val sqlContext = spark.sqlContext
val episodes = sqlContext.read.format("com.databricks.spark.avro")
.option("header","true")
.option("inferSchema","true")
.load("episodes.avro")
episodes.show(10)

Related

Read /Write delta lake tables on S3 using AWS Glue jobs

I am trying to access Delta lake tables underlying on S3 using AWS glue jobs however getting error as "Module Delta not defined"
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0").getOrCreate()
from delta.tables import *
data = spark.range(0, 5)
data.write.format("delta").save("S3://databricksblaze/data")
Added the necessary Jar ( delta-core_2.11-0.6.0.jar ) too in the dependency jars of the glue job.
Can anyone help me on this
Thanks
I have had success in using Glue + Deltalake. I added the Deltalake dependencies to the section "Dependent jars path" of the Glue job.
Here you have the list of them (I am using Deltalake 0.6.1):
com.ibm.icu_icu4j-58.2.jar
io.delta_delta-core_2.11-0.6.1.jar
org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar
org.antlr_antlr4-4.7.jar
org.antlr_antlr4-runtime-4.7.jar
org.antlr_antlr-runtime-3.5.2.jar
org.antlr_ST4-4.0.8.jar
org.glassfish_javax.json-1.0.4.jar
Then in your Glue job you can use the following code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext()
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")
from delta.tables import *
glueContext = GlueContext(sc)
spark = glueContext.spark_session
delta_path = "s3a://your_bucket/folder"
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)
deltaTable = DeltaTable.forPath(spark, delta_path)
You need to pass the additional configuration properties
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Setting spark.jars.packages in SparkSession.builder.config doesn't work. spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit. So it must be passed as an argument of the spark-submit or pyspark script. When SparkSession.builder.config is called, SparkSubmit has done its job. So spark.jars.packages is no-op at this moment. See https://issues.apache.org/jira/browse/SPARK-21752 for more details.

Do we need any external jar for xml parsing in Spark?

I'm trying to parse XML in Spark. i am getting below error. Could you please help me?
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object TestSpark{
def main(args:Array[String})
{
val conf = new SparkConf().setAppName("Test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rootTag", "book")
load("c:\\sample.xml")
}
}`
Error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.xml.
No other external jar are required except the databricks spark xml. You need to add dependency for 2.0+. If you are using older Spark then you need t use this.
You need to use
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
Match the Scala version to that of Spark. Starting version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should need the Spark source package and build with Scala 2.10 support.
This may help
Compatibility issue with Scala and Spark for compiled jars
spark-xml

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to spark. If anyone can help it would be very much appreciated!
Although you haven't mentioned your spark version and details of the exception...
Please see PHOENIX-2287 which is fixed, which says
Environment: HBase 1.1.1 running in standalone mode on OS X *
Spark 1.5.0 Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56 Updated patch adds
support for Spark 1.5.0, and is backwards compatible back down to
1.3.0 (manually tested, Spark version profiles may be worth looking at in the future) In 1.5.0, they've gone and explicitly hidden the
GenericMutableRow data structure. Fortunately, we are able to the
external-facing 'Row' data type, which is backwards compatible, and
should remain compatible in future releases as well. As part of the
update, Spark SQL deprecated a constructor on their 'DecimalType'.
In updating this, I exposed a new issue, which is that we don't
carry-forward the precision and scale of the underlying Decimal type
through to Spark. For now I've set it to use the Spark defaults, but
I'll create another issue for that specifically. I've included an
ignored integration test in this patch as well.

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but data is not read yet. Spark only starts reading and processing the data, when you ask for some kind of output (like a df.count(), or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one or other reason.
If anyone else runs into this problem, I finally solved it. I removed the CDH spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issues was with the CDH version, but I'm not going to waste anymore time trying to figure it out.

howto add hive properties at runtime in spark-shell

How do you set a hive property like: hive.metastore.warehouse.dir at runtime? Or at least a more dynamic way of setting a property like the above, than putting it in a file like spark_home/conf/hive-site.xml
I faced the same issue and for me it worked by setting Hive properties from Spark (2.4.0). Please find below all the options through spark-shell, spark-submit and SparkConf.
Option 1 (spark-shell)
spark-shell --conf spark.hadoop.hive.metastore.warehouse.dir=some_path\metastore_db_2
Initially I tried with spark-shell with hive.metastore.warehouse.dir set to some_path\metastore_db_2. Then I get the next warning:
Warning: Ignoring non-spark config property:
hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
Although when I create a Hive table with:
bigDf.write.mode("overwrite").saveAsTable("big_table")
The Hive metadata are stored correctly under metastore_db_2 folder.
When I use spark.hadoop.hive.metastore.warehouse.dir the warning disappears and the results are still saved in the metastore_db_2 directory.
Option 2 (spark-submit)
In order to use hive.metastore.warehouse.dir when submitting a job with spark-submit I followed the next steps.
First I wrote some code to save some random data with Hive:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("metastore_test").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
var dfA = spark.createDataset(Seq(
(1, "val1", "p1"),
(2, "val1", "p2"),
(3, "val2", "p3"),
(3, "val3", "p4"))).toDF("id", "value", "p")
dfA.write.mode("overwrite").saveAsTable("metastore_test")
spark.sql("select * from metastore_test").show(false)
Next I submitted the job with:
spark-submit --class org.tests.Main \
--conf spark.hadoop.hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
spark-scala-test_2.11-0.1.jar
The metastore_test table was properly created under the C:\winutils\hadoop-2.7.1\bin\metastore_db_2 folder.
Option 3 (SparkConf)
Via SparkSession in the Spark code.
val sparkConf = new SparkConf()
.setAppName("metastore_test")
.set("spark.hadoop.hive.metastore.warehouse.dir", "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
.setMaster("local")
This attempt was successful as well.
The question which still remains is why I have to extend the property with spark.hadoop in order to work as expected?

Resources