Spark 2.0 - Dataset<Row> Write to Parquet in Java

I want to write a Dataset<Row> to a Parquet file in Java. I use:
Dataset<Row> ds = getDataFrame();
ds.write().parquet("data.parquet");
This code is run with the spark-submit command given below.
sudo spark-submit --class getdata --master yarn --num-executors 4 --executor-cores 1 --jars guava-14.0.1.jar,hadoop-common-2.7.3.jar,hbase-client-1.3.0.jar,hbase-common-1.3.0.jar,hbase-protocol-1.3.0.jar,log4j-1.2.17.jar,metrics-core-2.2.0.jar,ojdbc6.jar,spark-core_2.11-2.0.2.jar,spark-assembly.jar,spark-sql_2.11-2.0.2.jar,hive-beeline-1.2.1.spark2.jar,hive-cli-1.2.1.spark2.jar,hive-exec-1.2.1.spark2.jar,hive-jdbc-1.2.1.spark2.jar,hive-metastore-1.2.1.spark2.jar,parquet-column-1.7.0.jar,parquet-common-1.7.0.jar,parquet-encoding-1.7.0.jar,parquet-format-2.3.0-incubating.jar,parquet-generator-1.7.0.jar,parquet-hadoop-1.7.0.jar,parquet-hadoop-bundle-1.6.0.jar,parquet-hive-1.0.1.jar,parquet-jackson-1.7.0.jar,spark-hive_2.11-2.0.2.jar getdata.jar
I get the following exception.
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.orc.DefaultSource could not be instantiated
What am I missing? Please help.

A ServiceLoader finds DefaultSource implementations on the classpath and invokes a constructor that returns a type which doesn't correspond to the expected return type.
An OrcRelation is returned where a HadoopFsRelation is expected, but OrcRelation doesn't implement HadoopFsRelation. It might be a version conflict, since I can't find HadoopFsRelation in 2.1.0, while it is there in older versions (e.g. 1.6.0).
Do you have multiple Spark versions on your classpath, or mixed Spark/Hive implementations? (The --jars list above does mix spark-assembly.jar, a Spark 1.x-era artifact, with Spark 2.0.2 jars, which would point to exactly that kind of conflict.)

Related

Spark job fails with Guava error - java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V

We have set up an open source Apache Hadoop cluster with the following components:
hadoop - 3.1.4
spark - 3.3.1
hive - 3.1.3
When we try to run the Spark example job with the command below, it fails with the following exception.
/opt/spark-3.3.1/bin/spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn --num-executors 1 --driver-memory 1G --executor-memory 1G --executor-cores 1 /opt/spark-3.3.1/examples/jars/spark-examples_2.12-3.3.1.jar
Error :
[2022-12-09 00:05:02.747]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/hdfsdata2/yarn/local/usercache/spark/filecache/70/__spark_libs__3692263374412677830.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.1.4/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2022-12-09 00:05:02,137 INFO util.SignalUtils: Registering signal handler for TERM
2022-12-09 00:05:02,139 INFO util.SignalUtils: Registering signal handler for HUP
2022-12-09 00:05:02,139 INFO util.SignalUtils: Registering signal handler for INT
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.spark.deploy.SparkHadoopUtil$.$anonfun$appendHiveConfigs$1(SparkHadoopUtil.scala:477)
at org.apache.spark.deploy.SparkHadoopUtil$.$anonfun$appendHiveConfigs$1$adapted(SparkHadoopUtil.scala:476)
at scala.collection.immutable.Stream.foreach(Stream.scala:533)
at org.apache.spark.deploy.SparkHadoopUtil$.appendHiveConfigs(SparkHadoopUtil.scala:476)
at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456)
at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:430)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:894)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
After debugging, this error seems to be related to Guava and its dependent jars.
Hadoop has guava-27.0-jre.jar and Spark has guava-14.0-jre.jar.
I removed the Spark Guava jars and copied Guava and its dependent jars from the Hadoop lib location to the Spark jars folder. Below is the full list of Guava and its dependent jars:
/opt/spark-3.3.1/jars/animal-sniffer-annotations-1.17.jar
/opt/spark-3.3.1/jars/failureaccess-1.0.jar
/opt/spark-3.3.1/jars/error_prone_annotations-2.2.0.jar
/opt/spark-3.3.1/jars/checker-qual-2.5.2.jar
/opt/spark-3.3.1/jars/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
/opt/spark-3.3.1/jars/jsr305-3.0.2.jar
/opt/spark-3.3.1/jars/j2objc-annotations-1.1.jar
/opt/spark-3.3.1/jars/guava-27.0-jre.jar
But the error still persists.
Interestingly, when I run the same example Spark job as below, it succeeds:
/opt/spark-3.3.1/bin/spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn --num-executors 1 --driver-memory 1G --executor-memory 1G --executor-cores 1 /opt/spark-3.3.1/examples/jars/spark-examples_2.12-3.3.1.jar 50
So the observation is that any value less than 50 passed at the end of the command makes the job fail, whereas a higher value makes it succeed. I am not sure of the reason behind this.
@Saurav Suman, to avoid this confusion of Spark finding its Hadoop/YARN-specific jars, the Apache Spark documentation provides a clean resolution.
If you have downloaded Spark without Hadoop binaries and have set up Hadoop yourself, then you should follow the link below and set SPARK_DIST_CLASSPATH=$(hadoop classpath):
https://spark.apache.org/docs/latest/hadoop-provided.html
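Concretely, that page boils down to a single entry in conf/spark-env.sh (quoted from the linked documentation):
export SPARK_DIST_CLASSPATH=$(hadoop classpath)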
This will take care of Spark picking up its Hadoop-related binaries. And I reckon Spark does not come with its own Guava.
The above error particularly appears either when you are using an outdated Guava or when there are multiple conflicting Guava jars on the classpath, leaving Spark unable to decide which one to choose. Hope this helps.

List all additional jars loaded in pyspark

I want to see the jars my spark context is using.
I found the code in Scala:
$ spark-shell --jars --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala> spark.sparkContext.listJars.foreach(println)
spark://datasci:42661/jars/net.sf.saxon_Saxon-HE-9.6.0-7.jar
spark://datasci:42661/jars/elsevierlabs-os_spark-xml-utils-1.6.0.jar
spark://datasci:42661/jars/org.apache.commons_commons-lang3-3.4.jar
spark://datasci:42661/jars/commons-logging_commons-logging-1.2.jar
spark://datasci:42661/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar
spark://datasci:42661/jars/commons-io_commons-io-2.4.jar
Source: List All Additional Jars Loaded in Spark
But I could not find how to do it in PySpark.
Any suggestions?
Thanks
sparkContext._jsc.sc().listJars()
_jsc is the underlying Java SparkContext.
I really got the extra jars with this command:
print(spark.sparkContext._jsc.sc().listJars())
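If you only need the jars that were added with --jars, a plain-Python alternative is to read them back from the SparkConf instead of going through the Java gateway. This is just a sketch, assuming the jars were supplied via --jars and therefore show up under the spark.jars config key (coordinates passed with --packages live under spark.jars.packages instead):
# Read the comma-separated jar list out of the Spark configuration
jars_conf = spark.sparkContext.getConf().get("spark.jars", "")
for jar in filter(None, jars_conf.split(",")):
    print(jar)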

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook:
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'],value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr',bootstrap_servers=['127.0.0.1:9092'],group_id='abc' )
I've tried to connect to the stream with both spark context and spark session.
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error
Spark Streaming's Kafka libraries not found in class path. Try one
of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-
kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR to my spark-submit command, so I ran:
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar as the application itself, and to find pyspark-shell as a Java class inside of it.
As the first error says, you missed a --packages option after spark-submit, which means you would do something like
spark-submit --packages ... --class com.example.YourClass someApp.jar
If you are just locally in Jupyter, you may want to try Kafka-Python, for example, rather than PySpark... Less overhead, and no Java dependencies.
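That said, if you want to stay with PySpark inside the notebook, the usual trick is to set PYSPARK_SUBMIT_ARGS before the first SparkContext is created, so the Kafka package is pulled in for you. A minimal sketch using the same DStream-based KafkaUtils API as the question (the package coordinates are an assumption matching the Spark 2.3.x / Scala 2.11 build shown above; adjust them to your installation):
import os
# Must be set before the SparkContext starts the JVM, otherwise it has no effect
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)

# Read the same 'hr' topic from the local broker used by the producer/consumer above
stream = KafkaUtils.createDirectStream(
    ssc, ["hr"], {"metadata.broker.list": "127.0.0.1:9092"})
stream.pprint()

ssc.start()
ssc.awaitTermination()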

Registering Hive Custom UDF with Spark (Spark SQL) 2.0.0

I am working on a Spark 2.0.0 piece where my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in my SQL context in one of the queries. In my cluster, with a Hive query, I use it as a temporary function simply by defining: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite simple.
I tried registering this with sparkSession as below but got an error:
sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
Error :
CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.delayedEndpoint$com$mediamath$spark$attribution$sparkjob$SparkVideoCidJoin$1(SparkVideoCidJoin.scala:75)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$delayedInit$body.apply(SparkVideoCidJoin.scala:22)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.main(SparkVideoCidJoin.scala:22)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin.main(SparkVideoCidJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Does anybody have an idea how to register it the way Spark is asking, i.e. with the register API on sparkSession and SQLContext:
sqlContext.udf.register(...)
In Spark 2.0,
sparkSession.udf.register(...)
allows you to register Java or Scala UDFs (functions of type Long => Long), but not Hive GenericUDFs that handle LongWritable instead of Long, and that can have a variable number of arguments.
To register Hive UDFs, your first approach was correct:
sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
However you must enable Hive support first:
SparkSession.builder().enableHiveSupport()
and make sure that the "spark-hive" dependencies are present in your classpath.
Explanation:
Your error message
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead
comes from the class SessionCatalog.
By calling SparkSession.builder().enableHiveSupport(), Spark will replace the SessionCatalog with a HiveSessionCatalog, in which the method makeFunctionBuilder is implemented.
Lastly:
the UDF you want to use, 'com.facebook.hive.udf.UDFNumberRows', was written at a time when windowing functions were not available in Hive.
I suggest you use them instead. You can check the Hive reference,
this Spark-SQL intro, or this one if you want to stick to the Scala syntax.
The problem you are facing is that Spark is not loading the jar library into its classpath.
In our team we load external libraries with the --jars option:
/usr/bin/spark-submit --jars external_library.jar our_program.py --our_params
You can check whether your external libraries are being loaded in the Spark History Server, Environment tab (spark.yarn.secondary.jars).
Then you will be able to register your UDF as you said, once you enable Hive support as FurryMachine says:
sparkSession.sql("""
CREATE TEMPORARY FUNCTION myFunc AS
'com.facebook.hive.udf.UDFNumberRows'
""")
You can find more info in spark-submit --help:
hadoop:~/projects/neocortex/src$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
You can register a UDF directly using SparkSession as in sparkSession.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1). Look at the detailed documentation here
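If you end up doing that same direct registration from PySpark instead, the equivalent is a one-line register call. This is only an illustrative sketch (the function name and lambda mirror the Scala example above, and it assumes an existing SparkSession named spark):
from pyspark.sql.types import StringType

# Same behaviour as the Scala lambda above: concatenate the string and the int
spark.udf.register("myUDF", lambda arg1, arg2: arg2 + str(arg1), StringType())
spark.sql("SELECT myUDF(1, 'abc')").show()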

Error starting pyspark with options (without Spark packages)

Can anyone tell me why I'm getting the errors below? According to the
README for the pyspark-cassandra connector, what I am trying below should work (without Spark packages): https://github.com/TargetHolding/pyspark-cassandra
$ pyspark_jar="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/scala-2.10/pyspark-cassandra-assembly-0.2.2.jar"
$ pyspark_egg="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/pyspark_cassandra-0.2.2-py2.7.egg"
$ pyspark --jars $pyspark_jar --py_files $pyspark_egg --conf spark.cassandra.connection.host=localhost
This results in:
Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:222)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:239)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:113)
at org.apache.spark.launcher.Main.main(Main.java:74)
Figured out the problem. I needed to use
--py-files
instead of
--py_files
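i.e. the working invocation from above becomes:
pyspark --jars $pyspark_jar --py-files $pyspark_egg --conf spark.cassandra.connection.host=localhost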

Resources