how to run spark from jupyter on yarn client - apache-spark

I have a single cluster deployed using Cloudera Manager with the Spark parcel installed.
Typing pyspark in a shell works, yet running the code below in Jupyter throws an exception.
code
import sys
import py4j
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('SPARK APP')
sc = SparkContext(conf=conf)
# sc= SparkContext.getOrCreate()
# sc.stop()
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

rdd = sc.parallelize(range(1000)).map(mod).take(10)
print(rdd)
Exception
/usr/lib/python3.6/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
187 self._accumulatorServer = accumulators._start_update_server(auth_token)
188 (host, port) = self._accumulatorServer.server_address
--> 189 self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port, auth_token)
190 self._jsc.sc().register(self._javaAccumulator)
191
TypeError: 'JavaPackage' object is not callable

After searching a bit: the Spark version in use (1.6) is not compatible with Python 3.7, so I had to run it using Python 2.7.
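If falling back to Python 2.7 is the workaround, Spark can be pointed at that interpreter before the context is created. A minimal sketch using Spark's PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON environment variables; the interpreter path is an assumption, adjust it to your install:

```python
import os

# Tell Spark which Python to use for the driver and executors; this must
# be set before any SparkContext is created. The path is an assumption.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python2.7"
print(os.environ["PYSPARK_PYTHON"])
```

The same variables can be exported in the shell profile instead, so Jupyter inherits them at startup.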

Related

Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM

I am trying to create a SparkContext in a Jupyter notebook but I am getting the following error:
Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM
Here is my code
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local").setAppName("Groceries")
sc = SparkContext(conf = conf)
Py4JError Traceback (most recent call last)
<ipython-input-20-5058f350f58a> in <module>
1 conf = SparkConf().setMaster("local").setAppName("My App")
----> 2 sc = SparkContext(conf = conf)
~/Documents/python38env/lib/python3.8/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
144 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
145 try:
--> 146 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
147 conf, jsc, profiler_cls)
148 except:
~/Documents/python38env/lib/python3.8/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
224 self._encryption_enabled = self._jvm.PythonUtils.isEncryptionEnabled(self._jsc)
225 os.environ["SPARK_AUTH_SOCKET_TIMEOUT"] = \
--> 226 str(self._jvm.PythonUtils.getPythonAuthSocketTimeout(self._jsc))
227 os.environ["SPARK_BUFFER_SIZE"] = \
228 str(self._jvm.PythonUtils.getSparkBufferSize(self._jsc))
~/Documents/python38env/lib/python3.8/site-packages/py4j/java_gateway.py in __getattr__(self, name)
1528 answer, self._gateway_client, self._fqn, name)
1529 else:
-> 1530 raise Py4JError(
1531 "{0}.{1} does not exist in the JVM".format(self._fqn, name))
1532
Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM
This error is reported when the pyspark version installed in Python is inconsistent with the Spark cluster version.
Uninstall the current pyspark, then install the same version as the Spark cluster. My Spark version is 3.0.2, so I ran the following:
pip3 uninstall pyspark
pip3 install pyspark==3.0.2
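The compatibility rule behind this fix can be sketched as a small check: the pip-installed pyspark and the cluster's Spark should agree at least on major.minor. The helper below is illustrative, not a pyspark API; get the cluster version from spark-submit --version:

```python
def versions_match(pyspark_version: str, cluster_version: str) -> bool:
    # Compare major.minor only; patch-level differences are usually tolerable.
    return pyspark_version.split(".")[:2] == cluster_version.split(".")[:2]

print(versions_match("3.0.2", "3.0.2"))  # True  -- safe combination
print(versions_match("3.1.1", "3.0.2"))  # False -- reinstall pyspark==3.0.2
```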
We need to uninstall the default/existing/latest version of PySpark from PyCharm/Jupyter Notebook or whatever tool we use.
Then check the version of Spark that is installed, using the command spark-submit --version (in CMD/Terminal).
Then install the PySpark that matches that Spark version.
For example, I have Spark 3.0.3, so I installed PySpark 3.0.3.
In the CMD/PyCharm terminal:
pip install pyspark==3.0.3
Or check this if you are a PyCharm user.
I had the same error today and resolved it with the code below.
Execute this in a separate cell before you build your Spark session:
from pyspark import SparkContext,SQLContext,SparkConf,StorageLevel
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
SparkSession.builder.config(conf=SparkConf())

Can connect to local master but not remote, pyspark

I'm trying out this example from anaconda docs:
from pyspark import SparkConf
from pyspark import SparkContext
import findspark
findspark.init('/home/Snow/anaconda3/lib/python3.8/site-packages/pyspark')
conf = SparkConf()
conf.setMaster('local[*]')
conf.setAppName('spark')
sc = SparkContext(conf=conf)
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))
rdd = sc.parallelize(range(1000)).map(mod).take(10)
Locally the script runs fine, without errors. When I change the line conf.setMaster('local[*]') to conf.setMaster('spark://remote_ip:7077') I get the error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:281)
Why is this happening? I also added SPARK_MASTER_HOST=remote_ip and
SPARK_MASTER_PORT=7077 to ~/anaconda3/lib/python3.8/site-packages/pyspark/bin/load_spark_env.sh.
My Spark version is 3.0.1 and the server's is 3.0.0.
I can ping the remote_ip.

Pyspark LDA: TypeError: 'JavaPackage' object is not callable

I installed pyspark and am getting a TypeError when trying to initialize the Spark context.
Pyspark installation:
The code is as below:
import pyspark
from collections import defaultdict
from pyspark import SparkContext
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.sql import SQLContext
# Initialize
sc = SparkContext("local", "Simple App")
sql_context = SQLContext(sc)
Error message is as below:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-bedceba80d96> in <module>()
7
8 # Initialize
----> 9 sc = SparkContext("local", "Simple App")
10 sql_context = SQLContext(sc)
c:\users\vishn\appdata\local\enthought\canopy\user\lib\site-packages\pyspark\context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 118 conf, jsc, profiler_cls)
119 except:
120 # If an error occurs, clean up in order to allow future SparkContext creation:
c:\users\vishn\appdata\local\enthought\canopy\user\lib\site-packages\pyspark\context.pyc in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
186 self._accumulatorServer = accumulators._start_update_server()
187 (host, port) = self._accumulatorServer.server_address
--> 188 self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
189 self._jsc.sc().register(self._javaAccumulator)
190
TypeError: 'JavaPackage' object is not callable

Read csv using pyspark

I am new to Spark and am trying to read a CSV file using pyspark. I referred to PySpark How to read CSV into Dataframe, and manipulate it, Get CSV to Spark dataframe, and many more. I tried to read it two ways:
1
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.conf import SparkConf
sc = SparkContext.getOrCreate()
df = spark.read.csv('D:/Users/path/csv/test.csv')
df.show()
2
import pyspark
sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("D:/Users/path/csv/test.csv"))
df.show()
Neither approach works; I am getting the following error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-28-c6263cc7dab9> in <module>()
4
5 sc = SparkContext.getOrCreate()
----> 6 df = spark.read.csv('D:/Users/path/csv/test.csv')
7 df.show()
8
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode)
378 if isinstance(path, basestring):
379 path = [path]
--> 380 return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
381
382 #since(1.5)
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~\opt\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o663.csv.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat not found
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
I don't know why it is throwing a Hive exception: Py4JJavaError: An error occurred while calling o663.csv.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat not found.
How do I resolve this HiveFileFormat not found error? Can anyone guide me?
Have you tried using sqlContext.read.csv? This is how I read CSVs in Spark 2.1:
from pyspark import sql, SparkConf, SparkContext
conf = SparkConf().setAppName("Read_CSV")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
df = sqlContext.read.csv("path/to/data")
df.show()
First of all, the system needs a SparkContext, created with the following commands:
from pyspark import SparkConf, SparkContext
sc = SparkContext()
After that, the SQL library has to be introduced to the system like this:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
and finally you can read your CSV with the following command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('path/to/your/file.csv')
Since SQLContext is deprecated as of PySpark 3.0.1, use a SparkSession to import a CSV file into PySpark:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python") \
.getOrCreate()
df = spark.read.csv("/path/to/file/csv")
df.show()
Try specifying the local master explicitly in a configuration object. This removes any doubt about Spark trying to reach Hadoop or anywhere else, as somebody mentioned in a comment.
sc.stop()
conf = (conf.setMaster('local[*]'))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
If this is not working, then don't use SQLContext to read the file. Try spark.read.csv("path/filename.csv") after creating a SparkSession.
Also, it is best to use Spark/Hadoop on a Linux operating system, as things are a lot simpler there.
The error most likely occurs because you are trying to access a local file.
See below for how you should access it:
#Local File
spark.read.option("header","true").option("inferSchema","true").csv("file:///path")
#HDFS file
spark.read.option("header","true").option("inferSchema","true").csv("/path")
.csv(<path>) comes last.
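The local-versus-HDFS distinction above comes down to the path scheme. A tiny illustrative helper (the function name is mine, not a Spark API) that builds the path .csv() expects:

```python
def spark_path(path: str, local: bool) -> str:
    # A file:// scheme forces the local filesystem; a bare path resolves
    # against the cluster's default filesystem (HDFS on most clusters).
    return "file://" + path if local else path

print(spark_path("/data/test.csv", local=True))   # file:///data/test.csv
print(spark_path("/data/test.csv", local=False))  # /data/test.csv
```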

I am trying to run my first word count application in spark through jupyter. But I am getting error in the initialization of SparkContext

I am trying to run my first word count application in Spark through Jupyter, but I am getting an error during the initialization of the SparkContext.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Spark Count")
sc = SparkContext(conf=conf)
Below is the error:
ValueError Traceback (most recent call last)
<ipython-input-13-6b825dbb354c> in <module>()
----> 1 sc = SparkContext(conf=conf)
/home/master/Desktop/Apps/spark-2.1.0-bin-hadoop2.7/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
113 """
114 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/home/master/Desktop/Apps/spark-2.1.0-bin-hadoop2.7/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
270 " created by %s at %s:%s "
271 % (currentAppName, currentMaster,
--> 272 callsite.function, callsite.file, callsite.linenum))
273 else:
274 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by <module> at /usr/local/lib/python3.3/site-packages/IPython/utils/py3compat.py:186
I think you already have a SparkContext object that is created by Jupyter automatically. You shouldn't have to create a new one.
Just type sc in a cell and execute it. It should display a reference to an existing context
Hope that helps!
In fact, the error already indicates this:
ValueError: Cannot run multiple SparkContexts at once
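What SparkContext enforces here is a process-wide singleton, and SparkContext.getOrCreate() is the API that respects it. A stripped-down pure-Python sketch of the pattern (an illustration of the idea, not pyspark's actual implementation):

```python
class Context:
    """Toy stand-in for SparkContext's one-active-context rule."""
    _active = None

    def __init__(self, app_name):
        if Context._active is not None:
            raise ValueError("Cannot run multiple contexts at once")
        self.app_name = app_name
        Context._active = self

    @classmethod
    def get_or_create(cls, app_name="default"):
        # Reuse the active context if one exists, as Jupyter's shell does.
        return cls._active if cls._active is not None else cls(app_name)

first = Context("PySparkShell")   # the context Jupyter creates for you
same = Context.get_or_create()    # no error: returns the existing one
print(same is first)              # True
```

In a PySpark-enabled notebook, calling SparkContext.getOrCreate() (or just using the provided sc) avoids the ValueError in the same way.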
