Spark XML file loading - apache-spark

How can I load XML files in Spark 2.0?
val rd = spark.read.format("com.databricks.spark.xml").load("C:/Users/kumar/Desktop/d.xml")
I'm getting an error saying com.databricks.spark.xml is not available.
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
... 48 elided

The ClassNotFoundException means the spark-xml data source is not on Spark's classpath. One option is to build a fat jar: add the package as a dependency in your build.sbt and build the jar with sbt assembly.
If that doesn't work, copy the jar into $SPARK_HOME/jars and try again.
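A minimal build.sbt sketch, assuming Scala 2.11 and a spark-xml release compatible with Spark 2.0 (the versions here are assumptions; match them to your cluster):
// build.sbt -- versions below are placeholders, adjust to your environment
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"
With the sbt-assembly plugin enabled, sbt assembly then produces a single jar that already contains the spark-xml classes.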

Alternatively, you can add the jar file to your Spark shell session. Download spark-xml_2.10-0.2.0.jar, copy it onto Spark's classpath, and add it in your Spark shell with the :cp command:
:cp spark-xml_2.10-0.2.0.jar
/*
the jar is now on the shell's classpath,
so it can be used anywhere in your code inside the spark shell.
*/
val rd = spark.read.format("com.databricks.spark.xml").load("C:/Users/kumar/Desktop/d.xml")
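Once the data source is found, spark-xml usually also needs to know which XML element represents a row. A minimal sketch, assuming each record is wrapped in a <book> element (the tag name is just an example):
val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "book").load("C:/Users/kumar/Desktop/d.xml")
df.printSchema()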

Related

Find specific path inside a jar

I have a jar file, and I know for sure the name of a class inside it.
I need a Unix command that shows me the full path of that class file within the jar, in case the class sits in a folder inside the jar.
Thanks.
Try doing
jar tf TicTacToe.jar
For additional information you can do
jar tvf TicTacToe.jar
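Since you already know the class name, you can filter the listing to get its path inside the jar. A small sketch (the jar and class names are placeholders):
jar tf TicTacToe.jar | grep GameBoard.class
The matching line is the full path of the class within the jar, e.g. com/example/GameBoard.class.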

How to manually deploy 3rd party utility jar for Apache Spark cluster?

I have a multi-node Apache Spark cluster and I would like to manually deploy some utility jars to each Spark node. Where should I put these jars?
For example: spark-streaming-twitter_2.10-1.6.0.jar
I know we can use Maven to build a fat jar that includes these jars, but I would like to deploy these utilities manually so that programmers don't have to ship the utility jars themselves.
Any suggestions?
1. Copy your 3rd party jars to a reserved HDFS directory,
for example hdfs://xxx-ns/user/xxx/3rd-jars/
2. In spark-submit, specify these jars using their HDFS path (see the full command sketch after this list);
with the hdfs: scheme, executors will pull the files and jars down from the HDFS directory:
--jars hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
3. spark-submit will not repeatedly upload these jars:
Client: Source and destination file systems are the same. Not copying hdfs://xxx-ns/user/xxx/3rd-jars/xxx.jar
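Putting it together, a hedged spark-submit sketch (the class name, master URL, and application jar are placeholders):
spark-submit --class com.example.TwitterJob --master yarn --jars hdfs://xxx-ns/user/xxx/3rd-jars/spark-streaming-twitter_2.10-1.6.0.jar my-app.jar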
spark-submit and spark-shell have a --jars option. This will distribute the jars to all the executors. The spark-submit --help for --jars is as follows
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
This is taken from the programming guide:
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar

Apache Spark custom log4j configuration for application

I would like to customize the Log4J configuration for my application in a standalone Spark cluster. I have a log4j.xml file which is inside my application JAR. What is the correct way to get Spark to use that configuration instead of its own Log4J configuration?
I tried using the --conf options to set the following, but no luck.
spark.executor.extraJavaOptions -> -Dlog4j.configuration=log4j.xml
spark.driver.extraJavaOptions -> -Dlog4j.configuration=log4j.xml
I am using Spark 1.4.1 and there's no log4j.properties file in my conf/ directory.
If you are using SBT as your package manager/builder:
There is a log4j.properties.template in $SPARK_HOME/conf.
Copy it into your SBT project's src/main/resources,
remove the .template suffix,
and edit it to fit your needs (a small sketch follows below).
SBT run/package will then include it in the JAR, and Spark picks it up from there.
This works for me, and similar steps should apply to other build tools, e.g. Maven.
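A minimal log4j.properties sketch of the kind you might end up with (the logger names and levels are only examples):
# Log everything to the console at INFO, quiet down Spark's own classes
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark=WARN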
Try using --driver-java-options. For example:
spark-submit --class my.class --master spark://myhost:7077 --driver-java-options "-Dlog4j.configuration=file:///opt/apps/conf/my.log4j.properties" my.jar
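That covers the driver. For the executors, a commonly used pattern (a sketch; the file path is an example) is to ship the properties file with --files and point the executor JVMs at it by plain file name, since --files places the file in each executor's working directory:
spark-submit --class my.class --master spark://myhost:7077 --files /opt/apps/conf/my.log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my.log4j.properties" --driver-java-options "-Dlog4j.configuration=file:///opt/apps/conf/my.log4j.properties" my.jar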

cassandra 1.2 fails to init snappy in freebsd

ERROR [WRITE-/10.10.35.30] 2013-06-19 23:15:56,907 CassandraDaemon.java (line 175) Exception in thread Thread[WRITE-/10.10.35.30,5,main]
java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:341)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:143)
When going through known issues I found this:
The native library snappy-1.0.4.1-libsnappyjava.so for Snappy compression is included in the snappy-java-1.0.4.1.jar file. When the JVM initializes the JAR, the library is added to the default temp directory. If the default temp directory is mounted with a noexec option, it results in the above exception.
I added JVM_OPTS=-Dorg.xerial.snappy.tempdir=/tmp in cassandra.in.sh and it still did not work.
I also tried specifying the temp directory directly:
./bin/cassandra -Dorg.xerial.snappy.tempdir=/tmp
On the same machine, Cassandra version 1.0.12 works fine.
Any help will be appreciated.
The problem is that there is no FreeBSD library included in the snappy JAR file that comes with Cassandra. Install the archivers/snappy-java port, delete the snappy-java JAR file that came with Cassandra, and copy /usr/local/share/java/classes/snappy-java.jar into Cassandra's lib directory.
The same problem happened when trying to enable snappy compression for Apache Kafka 0.8 on FreeBSD, and the solution was the same. Just copy /usr/local/share/java/classes/snappy-java.jar to the kafka/src/core/target/scala-2.8.0 directory, restart Kafka, and enjoy!
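For the Cassandra case, a hedged sketch of the FreeBSD steps (the pkg name and the Cassandra lib path are assumptions; the port name and jar file come from the answer above):
pkg install snappy-java    # or build the archivers/snappy-java port
rm /usr/local/share/cassandra/lib/snappy-java-1.0.4.1.jar
cp /usr/local/share/java/classes/snappy-java.jar /usr/local/share/cassandra/lib/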

Not able to access BouncyCastle jar from application jar

I have made a jar file for my application. One of my application's classes uses the BouncyCastleProvider class from the BC jar.
I have created a folder "lib" in the same parent folder where my application jar resides.
I have changed my machine's CLASSPATH to point to this new lib folder, but when I run my application it gives me a ClassNotFoundException.
However, if I copy the BC jar file to jre/lib/ext, everything works fine.
Can anybody tell me what I need to do to access the BC jar from my lib directory?
Thanks in advance,
Jenish
Your JAR file must have its MANIFEST.MF set to declare the classpath for the JAR.
An extract from the Sun tutorial is below; in your case you just need to make the Class-Path directive point to your lib directory, presumably
Class-Path: lib/BouncyCastle.jar
We want to load classes in MyUtils.jar into the class path for use in MyJar.jar. These two JAR files are in the same directory.
We first create a text file named Manifest.txt with the following contents:
Class-Path: MyUtils.jar
Warning: The text file must end with a new line or carriage return. The last line will not be parsed properly if it does not end with a new line or carriage return.
We then create a JAR file named MyJar.jar by entering the following command:
jar cfm MyJar.jar Manifest.txt MyPackage/*.class
This creates the JAR file with a manifest with the following contents:
Manifest-Version: 1.0
Class-Path: MyUtils.jar
Created-By: 1.6.0 (Sun Microsystems Inc.)
The classes in MyUtils.jar are now loaded into the class path when you run MyJar.jar.
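If the jars still aren't found, it can help to confirm the Class-Path entry actually made it into the manifest. A quick check, assuming the jar name from the tutorial above:
unzip -p MyJar.jar META-INF/MANIFEST.MF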
