Do we need any external jar for xml parsing in Spark? - apache-spark

I'm trying to parse XML in Spark, but I am getting the error below. Could you please help me?
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object TestSpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "book")
      .load("c:\\sample.xml")
  }
}
Error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.xml.

No external jars are required other than the Databricks spark-xml package, but you do need to add that dependency. The coordinates below are for Spark 2.0+; if you are running an older Spark version, use the spark-xml release that matches it.
You need to use:
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1

Match the Scala version to that of Spark. From version 2.0 onwards, Spark is built with Scala 2.11 by default; Scala 2.10 users need to download the Spark source package and build it with Scala 2.10 support.
These may help:
Compatibility issue with Scala and Spark for compiled jars
spark-xml
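If you would rather not manage the jar manually, the same coordinates can be pulled in when launching the shell. A minimal sketch, assuming the Scala 2.11 build of Spark:
spark-shell --packages com.databricks:spark-xml_2.11:0.4.1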

Related

Connecting Pyspark to Oracle SQL

I am fairly new to Spark. I want to connect PySpark to Oracle SQL, and I am using the following PySpark code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
import os
spark_config = SparkConf().setMaster("local").setAppName("Project_SQL")
sc = SparkContext(conf = spark_config)
sqlctx = SQLContext(sc)
os.environ['SPARK_CLASSPATH'] = "C:\Program Files (x86)\Oracle\SQL Developer 4.0.1\jdbc\lib.jdbc6.jar"
df = sqlctx.read.format("jdbc").options(url="jdbc:oracle:thin:#<>:<>:<>"
, driver = "oracle.ojdbc6.jar.OracleDriver"
, dbtable = "account"
, user="...."
, password="...").load()
But I get the following error:
An error occurred while calling o29.load.:
java.lang.ClassNotFoundException: oracle.ojdbc6.jar.OracleDriver
I have searched a lot and tried several ways I found to change/correct the path to the driver, but I still get the same error.
Could anyone help me with this, please?
oracle.ojdbc6.jar.OracleDriver is not a valid driver class name for the Oracle JDBC driver. The name of the driver is oracle.jdbc.driver.OracleDriver. Just make sure that the jar-file of the Oracle driver is on the classpath.
Try placing the Oracle JDBC driver jar in the jars folder under your Spark installation.
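For illustration, a corrected sketch of the read call: the substantive change is the driver class name, the URL uses the standard jdbc:oracle:thin:@host:port:SID form (placeholders kept from the question), and the driver jar is supplied at launch instead of via SPARK_CLASSPATH:
# Launch with the driver jar on the classpath, e.g.:
#   spark-submit --jars C:\path\to\ojdbc6.jar your_script.py
df = sqlctx.read.format("jdbc").options(url="jdbc:oracle:thin:@<>:<>:<>",
                                        driver="oracle.jdbc.driver.OracleDriver",
                                        dbtable="account",
                                        user="....",
                                        password="....").load()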

Setting up ignite context in pyspark

How do I set up an Ignite context in Spark using Python?
igniteContext = IgniteContext(sparkContext,"c:/desktop/config.xml")
ImportError: cannot import IgniteContext
There are only Java and Scala APIs available for IgniteRDD, so IgniteContext cannot be imported from PySpark.
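For reference, a minimal Scala sketch of the equivalent setup. It assumes the ignite-spark module is on the classpath; the cache name "sharedRDD" is a hypothetical example, and the config path is taken from the question:
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ignite-example"))
// Point the IgniteContext at the same Spring XML config used in the question.
val igniteContext = new IgniteContext(sc, "c:/desktop/config.xml")
// "sharedRDD" is a hypothetical cache name; fromCache returns an IgniteRDD.
val cacheRdd = igniteContext.fromCache[Int, String]("sharedRDD")
cacheRdd.savePairs(sc.parallelize(1 to 100).map(i => (i, i.toString)))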

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
table = "FOO",
columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help, it would be very much appreciated!
Although you haven't mentioned your Spark version or the full details of the exception, this looks like PHOENIX-2287, which has since been fixed. The ticket says:
Environment: HBase 1.1.1 running in standalone mode on OS X; Spark 1.5.0; Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56: Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry-forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
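In practice, the fix is to move to a phoenix-spark release that contains this patch. A minimal sbt sketch, assuming the 4.5.2 artifact built against the HBase 1.1 line (match the HBase suffix to your cluster):
// build.sbt fragment (sketch): any release containing the PHOENIX-2287 fix will do.
libraryDependencies += "org.apache.phoenix" % "phoenix-spark" % "4.5.2-HBase-1.1"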

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it always throws the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
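Once the shell is up, a quick way to confirm the missing class is now on the classpath, using the class name straight from the error:
scala> Class.forName("org.apache.avro.mapred.AvroWrapper") // throws ClassNotFoundException if the jar is still absent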
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.

Reading Avro into spark using spark-avro

I'm not able to read Avro files using the spark-avro library. Here are the steps I took:
Got the jar from: http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10/0.1
Invoked spark-shell using spark-shell --jars avro/spark-avro_2.10-0.1.jar
Executed commands as given in the git readme:
import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
The action sqlContext.avroFile("episodes.avro") fails with the following error:
scala> val episodes = sqlContext.avroFile("episodes.avro")
java.lang.IncompatibleClassChangeError: class com.databricks.spark.avro.AvroRelation has interface org.apache.spark.sql.sources.TableScan as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
My bad. The readme clearly says:
Versions
Spark changed how it reads / writes data in 1.4, so please use the version of this library that matches your Spark version:
1.3 -> 1.0.0
1.4+ -> 1.1.0-SNAPSHOT
I used Spark 1.3.1 and spark-avro 1.1.0. When I switched to spark-avro 1.0.0, it worked.
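For the record, the matching launch would look something like this (the jar file name is assumed from the artifact coordinates):
spark-shell --jars avro/spark-avro_2.10-1.0.0.jar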
Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter.
To load/save data in Avro format, you need to specify the data source format as avro.
Example:
val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
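Note that even with the short avro format name, the spark-avro module still ships separately from Spark and has to be supplied at launch. A sketch; the version numbers are assumptions and should be matched to your Spark and Scala build:
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0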
Alternatively, with a SparkSession and the Databricks package name:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName(appName).master(master).getOrCreate()
val sqlContext = spark.sqlContext
val episodes = sqlContext.read.format("com.databricks.spark.avro")
.option("header","true")
.option("inferSchema","true")
.load("episodes.avro")
episodes.show(10)
