Error starting pyspark with options (without Spark packages) - apache-spark

Can anyone tell me why I'm getting the errors below? According to the
README for the pyspark-cassandra connector, what I am trying below should work (without Spark packages): https://github.com/TargetHolding/pyspark-cassandra
$ pyspark_jar="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/scala-2.10/pyspark-cassandra-assembly-0.2.2.jar"
$ pyspark_egg="$HOME/devel/sandbox/Learning/Spark/pyspark-cassandra/target/pyspark_cassandra-0.2.2-py2.7.egg"
$ pyspark --jars $pyspark_jar --py_files $pyspark_egg --conf spark.cassandra.connection.host=localhost
This results in:
Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:222)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:239)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:113)
at org.apache.spark.launcher.Main.main(Main.java:74)

Figured out the problem. I needed to use
--py-files
instead of
--py_files
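For reference, the original command with the corrected flag (paths unchanged from the question) should then start cleanly:
$ pyspark --jars $pyspark_jar --py-files $pyspark_egg --conf spark.cassandra.connection.host=localhost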

Related

List all additional jars loaded in pyspark

I want to see the jars my spark context is using.
I found how to do it in Scala:
$ spark-shell --master=spark://datasci:7077 --jars /opt/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar --packages elsevierlabs-os:spark-xml-utils:1.6.0
scala> spark.sparkContext.listJars.foreach(println)
spark://datasci:42661/jars/net.sf.saxon_Saxon-HE-9.6.0-7.jar
spark://datasci:42661/jars/elsevierlabs-os_spark-xml-utils-1.6.0.jar
spark://datasci:42661/jars/org.apache.commons_commons-lang3-3.4.jar
spark://datasci:42661/jars/commons-logging_commons-logging-1.2.jar
spark://datasci:42661/jars/xgboost4j-spark-0.7-jar-with-dependencies.jar
spark://datasci:42661/jars/commons-io_commons-io-2.4.jar
Source: List All Additional Jars Loaded in Spark
But I could not find how to do it in PySpark.
Any suggestions?
Thanks
sparkContext._jsc.sc().listJars()
(_jsc is the Java Spark context.)
I actually got the extra jars with this command:
print(spark.sparkContext._jsc.sc().listJars())
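If you only need the jars that were passed in via --jars or --packages (rather than everything registered with the context), reading the spark.jars entry from the session configuration is another option. This is a sketch; note the output is a comma-separated string rather than the spark:// URLs shown above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Jars registered with the underlying Scala SparkContext (same as the answer above)
print(spark.sparkContext._jsc.sc().listJars())
# Comma-separated list of jars supplied on the command line, if any
print(spark.sparkContext.getConf().get("spark.jars", ""))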

PySpark Structured Streaming locally with Kafka in Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
from kafka import KafkaProducer, KafkaConsumer
import json
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'], value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr', bootstrap_servers=['127.0.0.1:9092'], group_id='abc')
I've tried to connect to the stream with both the Spark context and the Spark session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR, so I tried
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar as an application, and to find pyspark-shell as a Java class inside of it.
As the first error says, you are missing --packages after spark-submit, which means you would do something like
spark-submit --class com.example.YourClass --packages ... someApp.jar
If you are just running locally in Jupyter, you may want to try kafka-python rather than PySpark: less overhead, and no Java dependencies.
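If you do want to stay with PySpark inside the notebook, a common workaround is to set PYSPARK_SUBMIT_ARGS before the SparkContext is created so the Kafka package is pulled in. The sketch below follows the question's DStream-based code and assumes Spark 2.3.x built for Scala 2.11; the ZooKeeper address is an assumption (the 0-8 receiver connects through ZooKeeper, not the broker port from the question):
import os
# Must be set before the JVM / SparkContext starts; adjust the coordinate to your Spark version.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 pyspark-shell'
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
# Topic 'hr' and group 'abc' are taken from the question; 2181 is the assumed ZooKeeper port.
stream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "abc", {"hr": 1})
stream.pprint()
ssc.start()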

Spark 2.0 - Dataset<Row> Write to Parquet in Java

I want to write a Dataset<Row> to a Parquet file in Java. I use:
Dataset<Row> ds = getDataFrame();
ds.write().parquet("data.parquet");
This code is run with the spark-submit command given below:
sudo spark-submit --class getdata --master yarn --num-executors 4 --executor-cores 1 --jars guava-14.0.1.jar,hadoop-common-2.7.3.jar,hbase-client-1.3.0.jar,hbase-common-1.3.0.jar,hbase-protocol-1.3.0.jar,log4j-1.2.17.jar,metrics-core-2.2.0.jar,ojdbc6.jar,spark-core_2.11-2.0.2.jar,spark-assembly.jar,spark-sql_2.11-2.0.2.jar,hive-beeline-1.2.1.spark2.jar,hive-cli-1.2.1.spark2.jar,hive-exec-1.2.1.spark2.jar,hive-jdbc-1.2.1.spark2.jar,hive-metastore-1.2.1.spark2.jar,parquet-column-1.7.0.jar,parquet-common-1.7.0.jar,parquet-encoding-1.7.0.jar,parquet-format-2.3.0-incubating.jar,parquet-generator-1.7.0.jar,parquet-hadoop-1.7.0.jar,parquet-hadoop-bundle-1.6.0.jar,parquet-hive-1.0.1.jar,parquet-jackson-1.7.0.jar,spark-hive_2.11-2.0.2.jar getdata.jar
I get the following exception.
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.orc.DefaultSource could not be instantiated
What am I missing? Please help.
A ServiceLoader finds DefaultSource implementations on the classpath and invokes a constructor that returns a type which doesn't match the expected return type.
An OrcRelation is returned where a HadoopFsRelation is expected, but OrcRelation doesn't implement HadoopFsRelation. This points to a version conflict, since I can't find HadoopFsRelation in 2.1.0, while it is there in older versions (e.g. 1.6.0).
Do you have multiple Spark versions on your classpath, or mixed Spark/Hive implementations?
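If a mixed-version classpath is the cause, one thing to try is to stop passing Spark's own jars (spark-core, spark-sql, spark-hive, spark-assembly, the hive-* and parquet-* jars) through --jars, since spark-submit already puts the installed Spark distribution on the classpath. A trimmed sketch of the original command, keeping only the application-level dependencies from the question:
sudo spark-submit --class getdata --master yarn --num-executors 4 --executor-cores 1 \
  --jars guava-14.0.1.jar,hbase-client-1.3.0.jar,hbase-common-1.3.0.jar,hbase-protocol-1.3.0.jar,metrics-core-2.2.0.jar,ojdbc6.jar \
  getdata.jar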

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration while running Spark HBase using the Java API

I am running a simple application that gets data from HBase in Spark using the Java API.
I run the spark-submit command like this:
bin/spark-submit --master spark://192.168.43.75:7077 --class com.scry.NLPAnnotationController --driver-class-path /usr/lib/hbase/hbase-0.98.22-hadoop2/conf:$SPARK_HOME/lib_managed/jars/*.jar:$HBASE_CLASSPATH/*.jar --jars $SPARK_HOME/lib_managed/jars/*.jar:$HBASE_CLASSPATH/*.jar /home/deepak/hbase.jar
It gives an error like:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
Please help to resolve this issue.
Thanks in advance,
Deepak
In the console where you are running spark-submit, first execute the command below and then run spark-submit:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_CLASSPATH
If this works, add the same entry to your hadoop-env.sh.
Hope it helps...
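For completeness, the hadoop-env.sh entry referred to above would be the same export, appended to the end of the file (assuming $HBASE_CLASSPATH is defined in that environment):
# hadoop-env.sh
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_CLASSPATH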

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
--packages \
org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
org.apache.hbase:hbase-common:1.0.0 \
org.apache.hbase:hbase-client:1.0.0 \
org.apache.hbase:hbase-server:1.0.0 \
org.json4s:json4s-jackson:3.2.11 \
./test-spark_2.10-1.0.8.jar \
>spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, adding _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
A list of packages should be separated by commas without whitespace (breaking lines should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
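Applied to the command from the question, a corrected invocation would look roughly like this (versions unchanged from the question):
spark-submit --class com.biz.test \
    --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0,org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar \
    >spark_log 2>&1 &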
I found it worthwhile to use SparkSession in Spark 3.0.0 for MySQL and Postgres:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysql-postgres').config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16').getOrCreate()
@Mohammad thanks for this input. This worked for me too. I had to load the Kafka and MySQL packages in a single SparkSession. I did something like this:
spark = (SparkSession
    .builder
    ...
    .appName('myapp')
    # Add the Kafka and MySQL packages
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26")
    .getOrCreate())
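Once the drivers are listed in spark.jars.packages as above, a JDBC read with the MySQL driver would then look roughly like this; the URL, table, and credentials are placeholders, not values from the thread:
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb")   # placeholder URL
      .option("dbtable", "my_table")                       # placeholder table
      .option("user", "user")
      .option("password", "password")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load())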
