Creating a Spark session on the master node - apache-spark

I am trying to run my Spark code from a Jupyter notebook against my company's server, so I added the following code in my notebook:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://host:port") \
    .appName("users information analysis") \
    .config("spark.executor.memory", "5g") \
    .getOrCreate()
But I am getting the following error:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
Is there any problem with my code?
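For comparison, here is a minimal sketch of a remote-master session that also pins the driver host, which is sometimes needed when the notebook runs outside the cluster (the master URL and driver host below are placeholders, not values from the question):
from pyspark.sql import SparkSession

# Minimal sketch; "spark://master-host:7077" and "notebook-host" are placeholders.
spark = SparkSession.builder \
    .master("spark://master-host:7077") \
    .appName("users information analysis") \
    .config("spark.executor.memory", "5g") \
    .config("spark.driver.host", "notebook-host") \
    .getOrCreate()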

Related

How do I add the bigquery package to a dataproc job (jobs submit spark) through the gcloud shell?

I am trying to load Google patent data using BigQuery with the following code (Python):
df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
    .load()
Then I got the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause of this error is that the bigquery package is missing from my cluster (I checked the configuration). I tried to add it using the following gcloud shell command:
$ gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got another error:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the bigquery package (via gcloud dataproc jobs submit spark)? Ideally on Dataproc.
I don't know what to pass for --class.
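For reference, the connector can also be attached when the Spark session is built instead of through spark-submit flags. A minimal sketch, reusing the gs:// jar path from the question (the app name is a placeholder):
from pyspark.sql import SparkSession

# Minimal sketch: attach the BigQuery connector jar at session creation so
# spark.read.format("bigquery") can find the data source. The app name is a placeholder.
spark = SparkSession.builder \
    .appName("patent-analysis") \
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar") \
    .getOrCreate()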

Writing from Kafka to console through Python (PySpark) failing

I have written a simple program for reading a CSV file through a Kafka topic and writing it to the console.
When I run the job I am able to print the schema, but the messages are not displayed.
Below is the Spark config:
.config("spark.jars",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.executor.extraClassPath",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.executor.extraLibrary",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.driver.extraClassPath",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/**commons-pool2-2.11.1.jar**") \
.config("spark.cassandra.connection.host", c_host_nm) \
.config("spark.cassandra.connection.port", c_port_number) \
.getOrCreate()
I am getting this error:
pyspark.sql.utils.StreamingQueryException: Query [id = dfa3b327-ac6b-4426-956b-0587501592d2, runId = 49ee4949-47a9-4c8c-b46e-d121ae9ac759] terminated with exception: Writing job aborted
The start of the log says:
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate
Any thoughts on what needs to be done here?
(Python 3.9.2, Kafka 3.2.1, PySpark 3.3.0)
Thanks
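A minimal sketch, assuming the NoSuchMethodError comes from mixing PySpark 3.3.0 with the 3.2.1 spark-sql-kafka jar: let Spark resolve a Kafka connector that matches the running Spark version instead of pinning the older local jars.
from pyspark.sql import SparkSession

# Minimal sketch; the Maven coordinate assumes Spark 3.3.0 built for Scala 2.12.
spark = SparkSession.builder \
    .appName("csv-kafka-console") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
    .getOrCreate()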

Not able to configure a custom hive-metastore client in Spark

We are facing some challenges working with Spark and Hive.
We need to connect to the Hive metastore from Spark, and we have to use a custom hive-metastore client inside Spark.
Code snippet:
spark = SparkSession \
    .builder \
    .config('spark.sql.hive.metastore.version', '3.1.2') \
    .config('spark.sql.hive.metastore.jars', 'hive-standalone-metastore-3.1.2-sqlquery.jar') \
    .config("spark.yarn.dist.jars", "hive-exec-3.1.2.jar") \
    .config('hive.metastore.uris', "thrift://localhost:9090") \
    .getOrCreate()
The above code works with the built-in hive-metastore client, but fails with the custom one with this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.databaseExists.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
To resolve this issue, we configured the custom hive-exec in code and also tried passing it on the command line:
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --jars hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --conf spark.executor.extraClassPath hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --driver-class-path hive-exec-3.1.2.jar
But the issue is not resolved yet; any suggestion would help.
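One detail worth noting: spark-submit treats everything after the application file as application arguments, so in the commands above the --jars and class-path options may never reach spark-submit itself. Also, spark.sql.hive.metastore.jars can point at a classpath containing the full Hive jar set of that version rather than only the standalone metastore jar. A minimal sketch under those assumptions (the Hive lib path is a placeholder):
from pyspark.sql import SparkSession

# Minimal sketch; /path/to/hive-3.1.2/lib/* is a placeholder for a directory
# holding the complete Hive 3.1.2 jars, including hive-exec-3.1.2.jar.
spark = SparkSession \
    .builder \
    .config('spark.sql.hive.metastore.version', '3.1.2') \
    .config('spark.sql.hive.metastore.jars', '/path/to/hive-3.1.2/lib/*') \
    .config('hive.metastore.uris', 'thrift://localhost:9090') \
    .enableHiveSupport() \
    .getOrCreate()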

Set up Jupyter on EMR to read from Cassandra using CQL?

When I try to set the Spark context in Jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('spark.cassandra.connection.host', 'x.x.x.x') \
.config('spark.cassandra.connection.port', 'xxxx') \
.config('spark.cassandra.output.consistency.level','ONE') \
.master('local[2]') \
.getOrCreate()
I still cannot make a connection to the Cassandra cluster with this code:
dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
It comes up with this error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark, but I do not see a valid answer.
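A minimal sketch, assuming the connector jar never reached the driver: have Spark resolve it from Maven when the session is created (the coordinate mirrors the 2.4.0 / Scala 2.11 connector already used above, and x.x.x.x remains a placeholder):
from pyspark.sql import SparkSession

# Minimal sketch; the connector coordinate matches the Scala 2.11 / 2.4.0 line from the question.
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.4.0') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .master('local[2]') \
    .getOrCreate()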

Unable to load data into Hive using PySpark

I am unable to write data into Hive using PySpark through a Jupyter notebook.
It gives me the error below:
Py4JJavaError: An error occurred while calling o99.saveAsTable.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Note, I have already tried these steps:
copied hdfs-site.xml and core-site.xml to Hive's conf/ directory
removed metastore_db and created it again using the command below:
$HIVE_HOME/bin/schematool -initSchema -dbType derby
Did you use spark-submit to run your script?
Also, you should add .enableHiveSupport(), like this:
spark = SparkSession.builder \
.appName("yourapp") \
.enableHiveSupport() \
.getOrCreate()
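Once Hive support is enabled, a quick usage sketch (df stands for the DataFrame being written, and default.my_table is a placeholder table name):
# Minimal sketch; "default.my_table" is a placeholder.
df.write.mode("overwrite").saveAsTable("default.my_table")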
