Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number - apache-spark

I have a working Pyspark installation running through Jupyter on an Ubuntu VM.
I have only one Java version installed (openjdk version "1.8.0_265"), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession

memory_gb = 24
conf = (
    pyspark.SparkConf()
        .setMaster('local[*]')
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)

spark = SparkSession \
    .builder \
    .appName("My Name") \
    .config(conf=conf) \
    .getOrCreate()
Now I want to use spark-nlp. I've installed spark-nlp using pip install spark-nlp in the same virtual environment my Pyspark is in.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
    .builder \
    .appName("RevDNS Stats") \
    .config(conf=conf) \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5") \
    .getOrCreate()
only results in the error mentioned above.
How do I fix this?
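One thing worth trying, as a minimal sketch rather than a confirmed fix: the spark-nlp Python package ships a sparknlp.start() helper that builds a SparkSession with the matching spark-nlp JAR pulled in via spark.jars.packages, which sidesteps hand-picking the Maven coordinate yourself:
# Sketch only: let spark-nlp construct the session so the JAR version matches
# the installed Python package; this assumes the gateway error comes from a
# classpath/coordinate mismatch rather than a broken JAVA_HOME setup.
import sparknlp

spark = sparknlp.start()  # returns a SparkSession with spark-nlp on the classpath
print(sparknlp.version(), spark.version)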

Related

Failed to run pyspark's .withColumn function

I am trying to run Pyspark in PyCharm on Windows 10, but I keep getting a weird JVM-related error message on Node 81 when trying to execute the simple functions .withColumn() and .withColumnRenamed(). I have a tmp folder on my desktop (see the attached image), and I have set all the environment variables for HADOOP_PATH, JAVA_HOME, PATH, PYTHON_PATH and SPARK_HOME. I was also able to create the Spark object with the following lines of code:
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Data Est") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", 400) \
    .config("spark.sql.broadcastTimeout", -1) \
    .config("spark.sql.session.timezone", "UTC") \
    .config("spark.local.dir", "[some directory path on desktop]") \
    .getOrCreate()
System Environment Variables - Windows 10 64-bit
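For context, here is a minimal, self-contained example of the two calls the question is about; the DataFrame and column names below are made up purely for illustration:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("withColumn demo").getOrCreate()

# Toy DataFrame; the column names are illustrative only.
df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "price"])

# withColumn adds (or replaces) a column computed from existing ones.
df = df.withColumn("price_with_tax", F.col("price") * 1.2)

# withColumnRenamed renames an existing column.
df = df.withColumnRenamed("price", "base_price")
df.show()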

Read CSV file on Spark

I have started working with Spark and have run into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)
Error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working in Linux Ubuntu, VirtualBox:~/Serverspace.
You can try changing the driver memory by creating a spark session variable like below:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "4g") \
    .appName('read-csv') \
    .getOrCreate()
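With the larger driver memory in place, the original read should have more headroom; a short usage sketch reusing the path from the question (the header/inferSchema options are optional extras):
# Re-run the read with the session configured above.
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv",
                    header=True, inferSchema=True)
df.show(5)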

Set up jupyter on EMR to read from cassandra using cql?

When I try to set the spark context in jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .config('spark.cassandra.connection.port', 'xxxx') \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .master('local[2]') \
    .getOrCreate()
I still cannot make a connection to the cassandra cluster with the code
dataFrame = spark.read.format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "keyspace") \
    .option("table", "table") \
    .load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
Comes up with error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark,
but I do not see a valid answer.
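The ClassNotFoundException suggests the Cassandra connector never made it onto the classpath. As a hedged sketch (not an answer verified on EMR), the connector can be requested as a Maven package when the session is built; the coordinate below mirrors the 2.4.0 / Scala 2.11 line used above and may need adjusting to your Spark version:
from pyspark.sql import SparkSession

# Sketch only: ask Spark to resolve the connector at session start.
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.4.0') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .config('spark.cassandra.connection.port', 'xxxx') \
    .master('local[2]') \
    .getOrCreate()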

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine, for testing purposes; I would like to sample some of the data that is in the cloud.
I have downloaded the GCS Hadoop Connector JAR and set up the SparkConf as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf() \
    .setMaster("local[8]") \
    .setAppName("Test") \
    .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")

sc = SparkContext(conf=conf)
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()
spark.read.json("gs://gcs-bucket")
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
I am using PySpark installed via pip and I am running the code using the unit test module from IntelliJ. It fails with:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: java.io.IOException: No FileSystem for scheme: gs
What should I do?
Thanks!
To solve this issue, you need to add configuration for the fs.gs.impl property in addition to the properties that you have already configured:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
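Putting that together with the properties already set, a minimal sketch of the full Hadoop configuration block (the class names are the ones shipped in the GCS connector JAR):
# Register both FileSystem implementations for the gs:// scheme,
# plus the service-account key configured earlier.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hconf.set("fs.gs.auth.service.account.enable", "true")
hconf.set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")

df = spark.read.json("gs://gcs-bucket")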

How to specify multiple dependencies using --packages for spark-submit?

I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
--packages \
org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
org.apache.hbase:hbase-common:1.0.0 \
org.apache.hbase:hbase-client:1.0.0 \
org.apache.hbase:hbase-server:1.0.0 \
org.json4s:json4s-jackson:3.2.11 \
./test-spark_2.10-1.0.8.jar \
>spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and returning to a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, added _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
A list of packages should be separated using commas without whitespace (breaking lines should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
I found it worthwhile to use SparkSession in Spark version 3.0.0 for MySQL and Postgres:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('mysql-postgres') \
    .config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16') \
    .getOrCreate()
@Mohammad thanks for this input. This worked for me too. I had to load the Kafka and MySQL packages in a single SparkSession. I did something like this:
spark = (SparkSession
    .builder
    ...
    .appName('myapp')
    # Add Kafka and MySQL packages
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,mysql:mysql-connector-java:8.0.26")
    .getOrCreate())
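As a usage follow-up: once the connector packages are on the classpath, a JDBC read is a plain spark.read.format("jdbc") call; the URL, table and credentials below are placeholders, not values from the thread:
# Hypothetical connection details; substitute your own host, database and credentials.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "my_table") \
    .option("user", "user") \
    .option("password", "password") \
    .load()
df.show(5)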
