Writing to console through Python(PySpark) KAFKA failing - apache-spark

I have written a simple program that reads a CSV file through a Kafka topic and writes it to the console.
When I run the job I am able to print the schema, but the message is not getting displayed.
Below is the Spark config:
.config("spark.jars",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.executor.extraClassPath",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.executor.extraLibrary",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/commons-pool2-2.11.1.jar") \
.config("spark.driver.extraClassPath",
"file:///home/hdoop/jars/jsr166e-1.1.0.jar,file:///home/hdoop/jars/spark-cassandra-connector_2.12-3.2.0.jar,file:///home/hdoop/jars/mysql-connector-java-8.0.30.jar,file:///home/hdoop/new_jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,file:///home/hdoop/new_jars/kafka-clients-2.8.0.jar,file:///home/hdoop/new_jars/**commons-pool2-2.11.1.jar**") \
.config("spark.cassandra.connection.host", c_host_nm) \
.config("spark.cassandra.connection.port", c_port_number) \
.getOrCreate()
I am getting the error:
pyspark.sql.utils.StreamingQueryException: Query [id = dfa3b327-ac6b-4426-956b-0587501592d2, runId = 49ee4949-47a9-4c8c-b46e-d121ae9ac759] terminated with exception: Writing job aborted
The start of the log says:
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate
Any thoughts on what needs to be done here?
(Python 3.9.2 , Kafka 3.2.1 , pySpark 3.3.0)
Thanks
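A java.lang.NoSuchMethodError on KafkaTokenUtil usually points to the Kafka connector jars not matching the Spark runtime: here spark-sql-kafka-0-10_2.12-3.2.1.jar and kafka-clients-2.8.0.jar are loaded into PySpark 3.3.0. A minimal sketch that lets Spark resolve a matching connector (and its transitive dependencies such as kafka-clients and commons-pool2) through spark.jars.packages; the coordinate, broker address and topic name below are assumptions to adapt:

from pyspark.sql import SparkSession

# Assumption: connector version pinned to the installed PySpark (3.3.0 here);
# Cassandra/MySQL connectors can be appended to the same packages list.
spark = SparkSession.builder \
    .appName("kafka-csv-to-console") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
    .getOrCreate()

# Read the topic and print the message value to the console.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "csv_topic") \
    .option("startingOffsets", "earliest") \
    .load()

query = df.selectExpr("CAST(value AS STRING) AS value") \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()
query.awaitTermination()

Hand-listing kafka-clients and commons-pool2 jars is what tends to drift out of sync; spark.jars.packages pulls the versions that belong to the named connector release.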

Related

How do I add bigquery property to a dataproc (jobs submit spark) through gcp shell?

I am trying to load Google patent data using BigQuery with the following code (Python):
df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .option("filter", "country_code = '{}' AND application_kind = '{}' AND publication_date >= {} AND publication_date < {}".format(country, kind, date_from, date_to)) \
    .load()
Then I got the following error:
Py4JJavaError: An error occurred while calling o144.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
I suspect the cause of this error is the lack of the BigQuery package in my cluster (I checked it in the configuration). I tried to add BigQuery using the following gcloud shell command:
$ gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=europe-west2 \
    --jar=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
But I got an error again:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
How do I install the BigQuery connector through gcloud dataproc jobs submit spark? Ideally on Dataproc.
I don't know what to add in terms of --class.
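For what it is worth, the "Failed to get main class in JAR" message comes from --jar being treated as the main application jar, which is why Spark asks for --class; the connector is normally attached with the --jars flag of gcloud dataproc jobs submit pyspark (with the script path as the main argument), or declared inside the job itself. A rough sketch of the in-job variant, reusing the connector path from the question (the table name is only an illustrative placeholder):

from pyspark.sql import SparkSession

# Assumption: the public connector jar from the question is reachable and
# compatible with the cluster's Spark/Scala version.
spark = SparkSession.builder \
    .appName("patents-bigquery") \
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar") \
    .getOrCreate()

# Placeholder table name; the question filters a patents table by country,
# kind and publication date.
table = "bigquery-public-data.patents.publications"

df = spark.read \
    .format("bigquery") \
    .option("table", table) \
    .load()
df.printSchema()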

Read CSV file on Spark

I started working with Spark and ran into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)
Error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working on Ubuntu Linux in VirtualBox (~/Serverspace).
You can try increasing the driver memory by creating the Spark session like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.config("spark.driver.memory", "4g") \
.appName('read-csv') \
.getOrCreate()
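With the larger driver heap, the read from the question can be retried against this session; header and inferSchema below are assumptions about the CSV layout:

# Same path as in the question; the options are assumptions about the file.
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv",
                    header=True, inferSchema=True)
df.show(5)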

Set up jupyter on EMR to read from cassandra using cql?

When I try to set the spark context in jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('spark.cassandra.connection.host', 'x.x.x.x') \
.config('spark.cassandra.connection.port', 'xxxx') \
.config('spark.cassandra.output.consistency.level','ONE') \
.master('local[2]') \
.getOrCreate()
I still cannot make a connection to the cassandra cluster with the code
dataFrame = spark.read.format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "keyspace") \
    .option("table", "table") \
    .load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
Comes up with error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark, but I do not see a valid answer.
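The ClassNotFoundException means the Cassandra connector never made it onto the session's classpath. A minimal sketch that has Spark pull the connector itself through spark.jars.packages; the Maven coordinate, host and port are assumptions that need to match the cluster's Spark, Scala and Cassandra setup:

from pyspark.sql import SparkSession

# Assumption: Spark 2.4.x built for Scala 2.11, hence the _2.11 / 2.4.x connector.
spark = SparkSession.builder \
    .appName("SparkCassandraApp") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.11:2.4.3") \
    .config("spark.cassandra.connection.host", "x.x.x.x") \
    .config("spark.cassandra.connection.port", "9042") \
    .master("local[2]") \
    .getOrCreate()

dataFrame = spark.read.format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "keyspace") \
    .option("table", "table") \
    .load()
dataFrame.limit(100).show()

On an EMR notebook backed by Livy/Sparkmagic, these settings have to be in place before the session starts (for example through a %%configure -f cell), since config() calls against an already running session are typically ignored.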

Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number

I have a working Pyspark installation running through Jupyter on a Ubuntu VM.
There is only one Java version installed (openjdk version "1.8.0_265"), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession
memory_gb = 24
conf = (
pyspark.SparkConf()
.setMaster('local[*]')
.set('spark.driver.memory', '{}g'.format(memory_gb))
)
spark = SparkSession \
.builder \
.appName("My Name") \
.config(conf=conf) \
.getOrCreate()
Now I want to use spark-nlp. I've installed spark-nlp using pip install spark-nlp in the same virtual environment as my PySpark.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
.builder \
.appName("RevDNS Stats") \
.config(conf=conf) \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5")\
.getOrCreate()
only results in the error mentioned above.
How do I fix this?
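The "Java gateway process exited before sending its port number" error is raised before Spark itself starts, so it is usually an environment or launcher problem rather than a spark-nlp one; it is worth confirming JAVA_HOME is visible to the Jupyter kernel. One thing to try, assuming the spark-nlp Python package is installed in the same environment, is to let it create the session itself, which configures the matching spark-nlp Maven package for you:

import sparknlp

# sparknlp.start() builds a SparkSession with the spark-nlp jar pre-configured.
spark = sparknlp.start()
print(spark.version, sparknlp.version())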

Unable to load data into Hive using PySpark

I am unable to write data into Hive using PySpark through a Jupyter notebook.
It gives me the error below:
Py4JJavaError: An error occurred while calling o99.saveAsTable.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Note, these steps were already tried:
copied hdfs-site.xml and core-site.xml to Hive's conf directory
removed metastore_db and created it again using the command below:
$HIVE_HOME/bin/schematool -initSchema -dbType derby
Did you use spark-submit to run your script?
Also, you should add .enableHiveSupport(), like this:
spark = SparkSession.builder \
.appName("yourapp") \
.enableHiveSupport() \
.getOrCreate()
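With Hive support enabled (and hive-site.xml visible to Spark), a write along these lines should then go through the metastore; the DataFrame and table name are placeholders for illustration:

# Placeholder data and table name, just to exercise the Hive write path.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").saveAsTable("default.example_table")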
