Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number - apache-spark

I have a working PySpark installation running through Jupyter on an Ubuntu VM.
There is only one Java version installed (openjdk version "1.8.0_265"), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession
memory_gb = 24
conf = (
    pyspark.SparkConf()
    .set('spark.driver.memory', '{}g'.format(memory_gb))
)
spark = SparkSession \
    .builder \
    .appName("My Name") \
    .config(conf=conf) \
    .getOrCreate()
Now I want to use spark-nlp. I've installed spark-nlp using pip install spark-nlp in the same virtual environment my Pyspark is in.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
    .builder \
    .appName("RevDNS Stats") \
    .config(conf=conf) \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5") \
    .getOrCreate()
only results in the error mentioned above.
How do I fix this?
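Not a fix, but a starting point for narrowing this down: the "Java gateway process exited" exception is raised when the JVM that PySpark launches dies during startup, so it helps to see exactly what the notebook kernel's process sees. The sketch below only collects that information; the function name `java_environment_report` is made up for illustration, and it assumes the kernel runs on the same machine as Spark.

```python
import os
import shutil
import subprocess


def java_environment_report():
    """Collect the Java-related settings visible to this Python process."""
    java_path = shutil.which("java")
    version_output = None
    if java_path is not None:
        try:
            # `java -version` writes its banner to stderr, not stdout.
            result = subprocess.run(
                [java_path, "-version"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            version_output = (result.stderr or result.stdout).strip()
        except (OSError, subprocess.SubprocessError):
            version_output = None
    return {
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "java_on_path": java_path,
        "java_version_output": version_output,
        "PYSPARK_SUBMIT_ARGS": os.environ.get("PYSPARK_SUBMIT_ARGS"),
    }


if __name__ == "__main__":
    for key, value in java_environment_report().items():
        print("{}: {}".format(key, value))
```

Run it in the same notebook kernel that fails. If `java_on_path` is empty, or the reported version differs from the one you see in a terminal, the kernel's environment is the first thing to look at.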


Failed to run pyspark's .withcolumn function

I am trying to run PySpark in PyCharm on Windows 10, but I keep getting a strange JVM-related error message on Node 81 when trying to execute the simple functions .withColumn() and .withColumnRenamed(). I have a tmp folder on my desktop (see the attached image), and I have set all the environment variables for HADOOP_PATH, JAVA_HOME, PATH, PYTHON_PATH and SPARK_HOME. I was also able to create the spark object with the following lines of code:
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Data Est") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", 400) \
    .config("spark.sql.broadcastTimeout", -1) \
    .config("spark.sql.session.timezone", "UTC") \
    .config("spark.local.dir", "[some directory path on desktop]") \
    .getOrCreate()
(Attached image: System Environment Variables - Windows 10 64-bit)

Read CSV file on Spark

I have just started working with Spark and ran into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working in Linux Ubuntu, VirtualBox:~/Serverspace.
You can try changing the driver memory by creating a spark session variable like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "4g") \
    .appName('read-csv') \
    .getOrCreate()

Set up jupyter on EMR to read from cassandra using cql?

When I try to set the spark context in jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf pyspark-shell'
spark = SparkSession.builder \
.appName('SparkCassandraApp') \
.config('', 'x.x.x.x') \
.config('spark.cassandra.connection.port', 'xxxx') \
.config('spark.cassandra.output.consistency.level','ONE') \
.master('local[2]') \
.getOrCreate()
I still cannot make a connection to the cassandra cluster with the code
dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
Comes up with error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at
A similar question was asked here ("modify jupyter kernel to add cassandra connection in spark"),
but I do not see a valid answer.
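One thing worth ruling out is the shape of the PYSPARK_SUBMIT_ARGS string itself: every --conf needs a key=value pair after it, and the string has to end with pyspark-shell. Below is a minimal sketch of building it programmatically. The helper name `build_submit_args` is hypothetical, the port value is the placeholder from the question, and all of this assumes the kernel starts its own local JVM through PySpark. If the EMR notebook goes through Livy instead, this variable may have no effect.

```python
import os
import shlex


def build_submit_args(packages, conf=None):
    """Build a PYSPARK_SUBMIT_ARGS value from package coordinates and conf pairs."""
    # All coordinates go into one comma-separated --packages value.
    parts = ["--packages", ",".join(packages)]
    for key, value in (conf or {}).items():
        # Each setting is passed as its own "--conf key=value" pair.
        parts += ["--conf", "{}={}".format(key, value)]
    # The value must end with the pyspark-shell resource.
    parts.append("pyspark-shell")
    return " ".join(shlex.quote(part) for part in parts)


submit_args = build_submit_args(
    ["datastax:spark-cassandra-connector:2.4.0-s_2.11"],
    conf={"spark.cassandra.connection.port": "xxxx"},  # placeholder value
)

# This only takes effect if it is set before the first SparkSession or
# SparkContext is created in the process.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

If a session already exists in the kernel, restart the kernel before trying again, since getOrCreate() would otherwise hand back the existing session without the package.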

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine, for testing purposes. I would like to sample some of the data in the cloud.
I have downloaded the GCS Hadoop Connector JAR
and set up the SparkConf as follows:
conf = SparkConf() \
.setMaster("local[8]") \
.setAppName("Test") \
.set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
.set("", "true") \
.set("", "path/to/keyfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder \
.config(conf=sc.getConf()) \
.getOrCreate()
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("", "")
sc._jsc.hadoopConfiguration().set("", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("", "true")
I am using PySpark installed via pip and running the code using the unittest module from IntelliJ. The error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: No FileSystem for scheme: gs
What should I do?
To solve this issue, you need to set one more Hadoop property in addition to the properties that you have already configured:
sc._jsc.hadoopConfiguration().set("", "")

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job.
spark-submit --class \
--packages \
org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
org.apache.hbase:hbase-common:1.0.0 \
org.apache.hbase:hbase-client:1.0.0 \
org.apache.hbase:hbase-server:1.0.0 \
org.json4s:json4s-jackson:3.2.11 \
./test-spark_2.10-1.0.8.jar \
>spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, added _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
A list of packages should be separated using commas without whitespace (line breaks should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
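If the command line is generated from a script, the comma rule can be enforced in code. Here is a minimal sketch that joins coordinates into a single --packages value and rejects anything that is not in the groupId:artifactId:version form quoted from spark-submit --help. The helper name `join_packages` is made up for illustration, and the check is deliberately strict (exactly three colon-separated parts, no whitespace).

```python
import re

# groupId:artifactId:version, with no whitespace, colons or commas inside a part.
_COORDINATE = re.compile(r"^[^\s:,]+:[^\s:,]+:[^\s:,]+$")


def join_packages(coordinates):
    """Join Maven coordinates into one comma-separated --packages value."""
    cleaned = [coordinate.strip() for coordinate in coordinates]
    for coordinate in cleaned:
        if not _COORDINATE.match(coordinate):
            raise ValueError(
                "expected groupId:artifactId:version, got {!r}".format(coordinate)
            )
    # Commas only: a space would make spark-submit treat the next
    # coordinate as a separate argument.
    return ",".join(cleaned)


packages = join_packages([
    "org.apache.spark:spark-streaming-kafka_2.10:1.3.0",
    "org.apache.hbase:hbase-common:1.0.0",
    "org.apache.hbase:hbase-client:1.0.0",
])
print("--packages " + packages)
```

In the question, the coordinates were separated by whitespace, so spark-submit took the second one (org.apache.hbase:hbase-common:1.0.0) as the application path, which matches the "Given path is malformed" error.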
I found it worthwhile to set the packages through SparkSession in Spark 3.0.0, here for MySQL and PostgreSQL:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysql-postgres').config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16').getOrCreate()
@Mohammad, thanks for this input. This worked for me too. I had to load the Kafka and MySQL packages in a single SparkSession. I did something like this:
spark = (SparkSession
    .builder
    ...
    .appName('myapp')
    # Add kafka and mysql packages
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26")
    .getOrCreate())
