Convert Excel file to csv in Spark 1.X - excel

Is there a tool to convert Excel files to CSV using Spark 1.x?
I got this error when running this tutorial:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Read-Excel-document-using-Spark-1.x
Exception in thread "main" java.lang.NoClassDefFoundError: org/zuinnote/hadoop/office/format/mapreduce/ExcelFileInputFormat
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn$.convertToCSV(SparkScalaExcelIn.scala:63)
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn$.main(SparkScalaExcelIn.scala:56)
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn.main(SparkScalaExcelIn.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

Spark cannot find the file format class org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat on the classpath.
Pass the JAR for the dependency below to spark-submit with the --jars parameter:
<!-- https://mvnrepository.com/artifact/com.github.zuinnote/hadoopoffice-fileformat -->
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
<version>1.0.4</version>
</dependency>
Command:
spark-submit --jars hadoopoffice-fileformat-1.0.4.jar \
#rest of the command arguments
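Before passing a JAR to --jars, it can help to confirm that the archive really contains the missing class. The snippet below is only a minimal sketch: the function name jar_contains_class is made up for illustration, and it assumes the JAR is a local file.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the JAR at jar_path holds the fully qualified class."""
    # Inside a JAR, a class is stored as path/to/ClassName.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For the trace above, you would call it with the path to hadoopoffice-fileformat-1.0.4.jar and org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat as the class name.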

You have to build a fat jar that contains all the necessary dependencies. The example project on the HadoopOffice page shows how to build one. Once you have built the fat/uber jar, you simply pass it to spark-submit.

Related

Couldn't resolve the dependency for elasticsearch library for spark-submit py files

I am trying to stream data from flat files into Elasticsearch using Structured Streaming (PySpark).
Spark - 2.4.6
Scala - 2.11.0
Hadoop - 2.7
Submitting the job with the dependency specified as below works:
spark-submit --packages org.elasticsearch:elasticsearch-hadoop:7.7.1 FileStructuredStreaming_ES.py
The problem is:
In my production environment I cannot use --packages (there is no internet access). I am trying to find a jar that can be copied to the cluster instead of using --packages, but I couldn't get it to work. I tried every option I could think of, such as
--py-files / --archives / --jars
Submitting the Spark job the following way fails with the following error:
spark-submit --py-files elasticsearch-hadoop-7.7.1.jar /workspace/scripts/pyspark/FileStructuredStreaming_ES.py
Error Trace
java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:307)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.sql.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 12 more
Am I missing anything here? Is there a way to find out which library / jar I need to use? Is the jar I am using the official one?
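One thing worth checking in the failing command: --py-files is meant for Python dependencies (.py, .zip, .egg) and does not put a JAR on the JVM classpath, whereas --jars does. The sketch below only assembles the command line; build_spark_submit is a hypothetical helper, and it assumes elasticsearch-hadoop-7.7.1.jar has already been copied to the directory you submit from.

```python
import shlex


def build_spark_submit(script, jars, extra_args=()):
    """Assemble a spark-submit argument list that ships local JARs via --jars."""
    cmd = ["spark-submit"]
    if jars:
        # --jars takes a single comma-separated list with no spaces
        cmd += ["--jars", ",".join(jars)]
    # Any other spark-submit options must come before the application script
    cmd += list(extra_args)
    cmd.append(script)
    return cmd


cmd = build_spark_submit(
    "/workspace/scripts/pyspark/FileStructuredStreaming_ES.py",
    ["elasticsearch-hadoop-7.7.1.jar"],  # assumed to be in the working directory
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to subprocess.run as is, or the printed line can be pasted into a shell.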

Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated

I am trying to save a machine learning model to S3 from my Spark Standalone cluster, but I get this error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:68)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:529)
at ALS$.main(ALS.scala:32)
at ALS.main(ALS.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/event/ProgressListener
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 23 more
Caused by: java.lang.ClassNotFoundException:com.amazonaws.event.ProgressListener
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 28 more
I have added hadoop-aws and aws-sdk to extraClassPath in spark-defaults.conf.
What I have tried so far: I run spark-submit with a fat jar built by sbt assembly (I have also added those dependencies in sbt). My AWS credentials are exported in the master environment.
Any idea where I should look to fix this?
Thanks !
That's an AWS class, so you need to make sure your classpath has the exact set of aws-java JARs your hadoop-aws JAR was built against.
mvnrepository lists those dependencies.
I have a project whose whole aim in life is to work out WTF is wrong with blobstore connector bindings, cloudstore. You can use that in spark-shell or real spark queries to help diagnose things.
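To see at a glance which hadoop-aws and aws-java-sdk versions sit in a lib directory, something like the following can help. It is a rough sketch that reads versions from file names only, so it assumes the JARs follow the usual artifact-version.jar naming; aws_jar_versions is a made-up name, not part of cloudstore.

```python
import re
from pathlib import Path

# Matches names such as hadoop-aws-2.7.3.jar or aws-java-sdk-s3-1.10.6.jar
JAR_NAME = re.compile(
    r"^(?P<artifact>hadoop-aws|aws-java-sdk[\w-]*?)-(?P<version>\d[\w.]*)\.jar$"
)


def aws_jar_versions(directory):
    """Map each hadoop-aws / aws-java-sdk artifact in a directory to its versions."""
    found = {}
    for path in sorted(Path(directory).glob("*.jar")):
        match = JAR_NAME.match(path.name)
        if match:
            found.setdefault(match.group("artifact"), []).append(match.group("version"))
    return found
```

Comparing the result against the dependency versions mvnrepository lists for your hadoop-aws release shows quickly whether the two are out of step, or whether the same artifact appears twice with different versions.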

Spark in Oozie Workflow throws Class not found Exception

Hue 3.10
Spark 1.6.0
CDH 5.8.0
When I run the jar using the spark-submit command it works fine, but from a Hue workflow it gives me an error.
`java.lang.ClassNotFoundException: RowCountFilter
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:175)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:256)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:207)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:49)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:236)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Intercepting System.exit(101)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [101]
`
Can anyone help me find what is missing?
Please share your job.properties and coordinator.properties files. Check the oozie.libpath lib path in these files and see whether the required jar is present there.
When Oozie triggers a job, it checks the jars in the lib path and distributes them to all the nodes in the cluster for execution.
You may also want to verify the configs in oozie-site.xml.
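As a quick way to pull the lib path settings out of a job.properties file, here is a small sketch. It is a simplified parser (plain key=value lines only, no line continuations), the helper names are invented for this example, and the sample values are placeholders rather than anything from the question.

```python
def read_properties(text):
    """Parse simple key=value lines from a Java-style .properties file."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def libpath_settings(text):
    """Return the Oozie lib path related settings, or None where unset."""
    props = read_properties(text)
    keys = ("oozie.libpath", "oozie.use.system.libpath")
    return {key: props.get(key) for key in keys}


# Placeholder content standing in for a real job.properties
sample = """
# example only
oozie.use.system.libpath=true
oozie.libpath=/user/example/lib
"""
print(libpath_settings(sample))
```

If oozie.libpath comes back as None, or points to a directory that does not hold the application jar, that would be consistent with the ClassNotFoundException above.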

Hive On Spark: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job

When I ran a query on hive console in debug mode, I got an error as listed below. I'm using hive-1.2.1 and spark 1.5.1; I checked the hive-exec jar, which has the class definition org/apache/hive/spark/client/Job .
Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.spark.client.rpc.KryoMessageCodec.decode(KryoMessageCodec.java:96)
at io.netty.handler.codec.ByteToMessageCodec$1.decode(ByteToMessageCodec.java:42)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:327)
... 15 more
And finally the query fails with:
"ERROR spark.SparkTask: Failed to execute spark task, with exception 'java.lang.IllegalStateException(RPC channel is closed.)'"
How can I resolve this issue?
In the hive-1.2.1 pom.xml, spark.version is 1.3.1.
So the easy way is to download a spark-1.3.1-bin-hadoop build from spark.apache.org,
then add its path to hive-site.xml like this:
<property>
<name>spark.home</name>
<value>/path/spark-1.3.1-bin-hadoop2.4</value>
</property>
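To double-check which spark.home Hive will actually pick up, you can read the property back out of hive-site.xml. This is a minimal sketch assuming the usual layout of property elements under a configuration root; get_hive_property is a hypothetical helper, and the sample XML just mirrors the snippet above.

```python
import xml.etree.ElementTree as ET


def get_hive_property(xml_text, name):
    """Return the value of a named property in hive-site.xml content, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Sample content mirroring the property shown above
sample = """
<configuration>
  <property>
    <name>spark.home</name>
    <value>/path/spark-1.3.1-bin-hadoop2.4</value>
  </property>
</configuration>
"""
print(get_hive_property(sample, "spark.home"))
```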

PhoenixOutputFormat not found when running a Spark Job on CDH 5.4 with Phoenix 4.5

I managed to configure Phoenix 4.5 on Cloudera CDH 5.4 by recompiling the source code. sqlline.py works well, but there are problems with spark.
spark-submit --class my.JobRunner \
--master yarn --deploy-mode client \
--jars `ls -dm /myapp/lib/* | tr -d ' \r\n'` \
/myapp/mainjar.jar
The /myapp/lib folder contains the phoenix core lib, which contains the class org.apache.phoenix.mapreduce.PhoenixOutputFormat. But it seems that the driver/executor cannot see it.
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2112)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:232)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:971)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:903)
at org.apache.phoenix.spark.ProductRDDFunctions.saveToPhoenix(ProductRDDFunctions.scala:51)
at com.mypackage.save(DAOImpl.scala:41)
at com.mypackage.ProtoStreamingJob.execute(ProtoStreamingJob.scala:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.mypackage.SparkApplication.sparkRun(SparkApplication.scala:95)
at com.mypackage.SparkApplication$delayedInit$body.apply(SparkApplication.scala:112)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.mypackage.SparkApplication.main(SparkApplication.scala:15)
at com.mypackage.ProtoStreamingJobRunner.main(ProtoStreamingJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2110)
... 30 more
What can I do to overcome this exception?
Adding phoenix-core to classpath.txt solves the problem. This file is usually located in the /etc/spark/conf folder.
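To verify whether classpath.txt already lists the Phoenix jar, a check along these lines may be enough. It is a sketch that assumes the file holds one jar path per line; the function name and the sample paths are made up for illustration.

```python
def classpath_entries(text, fragment):
    """Return the classpath.txt lines whose path mentions the given fragment."""
    return [line.strip() for line in text.splitlines() if fragment in line]


# Hypothetical file content; real entries depend on the cluster layout
sample = "\n".join([
    "/opt/example/jars/spark-core.jar",
    "/opt/example/jars/phoenix-core-4.5.0.jar",
])
print(classpath_entries(sample, "phoenix-core"))
```

An empty list means the jar is not on the list yet and its full path still needs to be appended to the file.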

Resources