I installed PySpark and IPython notebook on Ubuntu 12.04.
After installing, when I run ipython --profile=pyspark, it throws the following exception:
ubuntu_user@ubuntu_user-VirtualBox:~$ ipython --profile=pyspark
Python 2.7.3 (default, Jun 22 2015, 19:33:41)
Type "copyright", "credits" or "license" for more information.
IPython 0.12.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
IPython profile: pyspark
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/home/ubuntu_user/.config/ipython/profile_pyspark/startup/00-pyspark-setup.py in <module>()
6 sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
7
----> 8 execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
9
/home/ubuntu_user/spark/python/pyspark/shell.py in <module>()
41 SparkContext.setSystemProperty("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])
42
---> 43 sc = SparkContext(pyFiles=add_files)
44 atexit.register(lambda: sc.stop())
45
/home/ubuntu_user/spark/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
108 """
109 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110 SparkContext._ensure_initialized(self, gateway=gateway)
111 try:
112 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/home/ubuntu_user/spark/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
232 with SparkContext._lock:
233 if not SparkContext._gateway:
--> 234 SparkContext._gateway = gateway or launch_gateway()
235 SparkContext._jvm = SparkContext._gateway.jvm
236
/home/ubuntu_user/spark/python/pyspark/java_gateway.pyc in launch_gateway()
92 callback_socket.close()
93 if gateway_port is None:
---> 94 raise Exception("Java gateway process exited before sending the driver its port number")
95
96 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
Below are the settings and configuration files.
ubuntu_user@ubuntu_user-VirtualBox:~$ ls /home/ubuntu_user/spark
bin ec2 licenses README.md
CHANGES.txt examples NOTICE RELEASE
conf lib python sbin
data LICENSE R spark-1.5.2-bin-hadoop2.6.tgz
Below are the IPython settings:
ubuntu_user@ubuntu_user-VirtualBox:~$ ls .config/ipython/profile_pyspark/
db ipython_config.py log security
history.sqlite ipython_notebook_config.py pid startup
IPython and Spark (PySpark) configuration:
ubuntu_user@ubuntu_user-VirtualBox:~$ vi .config/ipython/profile_pyspark/ipython_notebook_config.py
# Configuration file for ipython-notebook.
c = get_config()
# IPython PySpark
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 7770
ubuntu_user@ubuntu_user-VirtualBox:~$ vi .config/ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
I set the following environment variables in .bashrc (or .bash_profile):
ubuntu_user@ubuntu_user-VirtualBox:~$ vi .bashrc
export SPARK_HOME="/home/ubuntu_user/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
I am new to Apache Spark and IPython. How can I solve this issue?
I had the same exception when my virtual machine didn't have enough memory for Java. I allocated more memory to the virtual machine and the exception went away.
Steps: Shut down the VM -> VirtualBox Settings -> "System" tab -> Set the memory
(However, this may only be a workaround. I suspect the correct way to fix this exception is to properly configure Spark's Java memory.)
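If you would rather cap Spark's Java memory than grow the VM, a minimal sketch is to pass a driver-heap limit through PYSPARK_SUBMIT_ARGS in .bashrc (the 512m figure is an assumption; pick what your guest can afford). As far as I know, on Spark 1.4+ this variable must also end with pyspark-shell, and the "Must specify a primary resource" error above is exactly what appears when that token is missing:
export PYSPARK_SUBMIT_ARGS="--master local[2] --driver-memory 512m pyspark-shell"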
Maybe Spark is having trouble locating the PySpark shell.
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
This will work for Spark 1.6.1. If you have a different version, try locating the py4j .zip file and adding that path instead.
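If you don't want to hardcode the py4j version at all, a small sketch like this (assuming SPARK_HOME is set, as in the question) can discover it at startup:
import glob
import os
import sys

spark_home = os.environ['SPARK_HOME']
# Pick up whichever py4j-*-src.zip this Spark build ships with,
# instead of pinning 0.8.2.1 or 0.9 by hand.
py4j_zip = glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))[0]
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, py4j_zip)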
Two thoughts:
First, where is your JDK? I don't see a JAVA_HOME variable configured in your file. That alone could explain it, given:
Error: Must specify a primary resource (JAR or Python or R file)
Second, make sure port 7770 is open and available to your JVM.
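A quick way to check both from Python (illustrative only; 7770 is the notebook port from the config above):
import os
import socket

print('JAVA_HOME =', os.environ.get('JAVA_HOME'))  # None means no JDK is configured

sock = socket.socket()
port_in_use = sock.connect_ex(('localhost', 7770)) == 0  # 0 means something is already listening
sock.close()
print('port 7770 already in use:', port_in_use)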
With the current scikit-learn version, 0.22.2, there is an interactive tool for entering data and seeing the results. It's called Libsvm GUI.
I never managed to get it running in a Jupyter notebook.
I then saw that there is a Binder option. But when I try that (which should not depend on my local environment), errors come up.
https://scikit-learn.org/stable/auto_examples/applications/svm_gui.html
Automatically created module for IPython interactive environment
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-e5e1b6a6b155> in <module>
6
7 import matplotlib
----> 8 matplotlib.use('TkAgg')
9
10 from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
/srv/conda/envs/notebook/lib/python3.7/site-packages/matplotlib/cbook/deprecation.py in wrapper(*args, **kwargs)
305 f"for the old name will be dropped %(removal)s.")
306 kwargs[new] = kwargs.pop(old)
--> 307 return func(*args, **kwargs)
308
309 # wrapper() must keep the same documented signature as func(): if we
/srv/conda/envs/notebook/lib/python3.7/site-packages/matplotlib/__init__.py in use(backend, warn, force)
1305 if force:
1306 from matplotlib.pyplot import switch_backend
-> 1307 switch_backend(name)
1308 else:
1309 # Finally if pyplot is not imported update both rcParams and
/srv/conda/envs/notebook/lib/python3.7/site-packages/matplotlib/pyplot.py in switch_backend(newbackend)
234 "Cannot load backend {!r} which requires the {!r} interactive "
235 "framework, as {!r} is currently running".format(
--> 236 newbackend, required_framework, current_framework))
237
238 rcParams['backend'] = rcParamsDefault['backend'] = newbackend
ImportError: Cannot load backend 'TkAgg' which requires the 'tk' interactive framework, as 'headless' is currently running
Seeing the first error, it seems something is wrong even in the untouched Binder environment. But I am not sure whether the problem lies with Binder or with the code itself.
What can I try to make it work?
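For reference, my understanding of the error (an assumption, not confirmed): TkAgg is an interactive backend that needs a real display, which a headless Binder pod does not have. A guard like this makes the failure mode visible, though the GUI itself still cannot run without a display:
import os
import matplotlib

if os.environ.get('DISPLAY'):
    matplotlib.use('TkAgg')  # interactive Tk backend; needs an X display
else:
    matplotlib.use('Agg')    # non-interactive fallback for headless hosts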
I have to apply this code for a computer vision project: https://www.quora.com/How-do-I-load-train-and-test-data-from-the-local-drive-for-a-deep-learning-Keras-model, which loads train and test data from the local drive for a Keras model.
I tried, but some errors appear, such as:
PermissionError Traceback (most recent call last)
<ipython-input-10-3806351fb2b0> in <module>
14 for sample in train_batch:
15 img_path = train_path+sample
---> 16 x = image.load_img(img_path)
17 # preprocessing if required
18 x_train.append(x)
~\Anaconda3\lib\site-packages\keras_preprocessing\image\utils.py in load_img(path, grayscale, color_mode, target_size, interpolation)
108 raise ImportError('Could not import PIL.Image. '
109 'The use of `load_img` requires PIL.')
--> 110 img = pil_image.open(path)
111 if color_mode == 'grayscale':
112 if img.mode != 'L':
~\Anaconda3\lib\site-packages\PIL\Image.py in open(fp, mode)
2768
2769 if filename:
-> 2770 fp = builtins.open(filename, "rb")
2771 exclusive_fp = True
2772
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\ASUS\\Desktop\\step2_dir/datasets/dataset/Alfalfa'
Note: I am already sure that PIL is installed successfully.
So I need some help: could anyone try the code and tell me how to fix the errors?
Thanks.
You don't have permission to access the file path:
PermissionError: [Errno 13] Permission denied:
C:\\Users\\ASUS\\Desktop\\step2_dir/datasets/dataset/Alfalfa
Easily solvable! Are you on Windows?
You most likely opened Jupyter Notebook from PowerShell; just open a shell with administrator rights (right-click on cmd and/or PowerShell -> Run as administrator).
If you are on a bash shell, run Jupyter Notebook as the superuser and you will have the permission you need:
sudo jupyter notebook
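Separately, one thing worth ruling out (my assumption from the traceback, not something confirmed above): on Windows, calling open() on a directory also raises PermissionError [Errno 13], and 'Alfalfa' looks like a folder name. Skipping non-files sidesteps that case:
import os

train_path = r'C:\Users\ASUS\Desktop\step2_dir\datasets\dataset'  # path from the traceback
for sample in os.listdir(train_path):
    img_path = os.path.join(train_path, sample)
    if os.path.isfile(img_path):  # directories such as 'Alfalfa' are skipped
        print('would load', img_path)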
Python 3.6.8 used to create the venv
uwsgi installed in the venv, properly symlinked, and using the correct Python version
nginx installed system-wide
I'm just trying to get the most basic uwsgi --ini myinifile.ini to function properly.
Despite days of trying to fix this problem, I always end up back here, with my Django application not being recognized by uWSGI when I launch it.
uwsgi works properly with uwsgi --uwsgi-file test.py --http-socket :8001
I can run python manage.py runserver 0.0.0.0:8001 just fine, and Django launches the application with zero permission issues while navigating the site.
I've seen this error so many times that I think I'm taking crazy pills and missing something silly and obvious.
[uWSGI] getting INI configuration from /home/saleor/saleor/saleor/wsgi/uwsgi.ini
[uwsgi-static] added mapping for /static => /app/static
*** Starting uWSGI 2.0.18 (64bit) on [Mon Nov 18 07:36:36 2019] ***
compiled with version: 7.4.0 on 18 November 2019 04:47:07
os: Linux-4.15.0 #1 SMP Thu Aug 23 19:33:51 MSK 2018
nodename: myurl.com
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 2
current working directory: /home/saleor/saleor/venv/bin
detected binary path: /home/saleor/saleor/venv/bin/uwsgi
chdir() to /home/saleor/saleor/saleor
your processes number limit is 62987
your memory page size is 4096 bytes
detected max file descriptor number: 1024
building mime-types dictionary from file /etc/mime.types...554 entry found
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :8001 fd 3
Python version: 3.6.8 (default, Oct 7 2019, 12:59:55) [GCC 8.3.0]
PEP 405 virtualenv detected: /home/saleor/saleor/venv
Set PythonHome to /home/saleor/saleor/venv
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x558314cc3dd0
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 145808 bytes (142 KB) for 1 cores
*** Operational MODE: single process ***
ModuleNotFoundError: No module named 'saleor'
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. going in full dynamic mode ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 7969)
spawned uWSGI worker 1 (pid: 7970, cores: 1)
^CSIGINT/SIGQUIT received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
This is my WSGI module (the file is named __init__.py):
"""WSGI config for saleor project.
This module contains the WSGI application used by Django's development server
and any production WSGI deployments. It should expose a module-level variable
named ``application``. Django's ``runserver`` and ``runfcgi`` commands discover
this application via the ``WSGI_APPLICATION`` setting.
Usually you will have the standard Django WSGI application here, but it also
might make sense to replace the whole Django WSGI application with a custom one
that later delegates to the Django one. For example, you could introduce WSGI
middleware here, or combine a Django application with an application of another
framework.
"""
import os

from django.core.wsgi import get_wsgi_application
from saleor.wsgi.health_check import health_check

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "saleor.settings")
# This application object is used by any WSGI server configured to use this
# file. This includes Django's development server, if the WSGI_APPLICATION
# setting points here.
application = get_wsgi_application()
# Apply WSGI middleware here.
# from helloworld.wsgi import HelloWorldApplication
# application = HelloWorldApplication(application)
application = health_check(application, "/health/")
And this is my uwsgi.ini:
[uwsgi]
die-on-term = true
http-socket = :$(PORT)
log-format = UWSGI uwsgi "%(method) %(uri) %(proto)" %(status) %(size) %(msecs)ms [PID:%(pid):Worker-%(wid)] [RSS:%(rssM)MB]
master = true
max-requests = 100
memory-report = true
module = saleor.wsgi:application
processes = 1
static-map = /static=/app/static
mimefile = /etc/mime.types
env = /home/saelor/saleor/venv
virtualenv = /home/saleor/saleor/venv
chdir = /home/saleor/saleor/saleor
master = true
vacuum = true
chmod-socket = 666
uid = saleor
gid = saleor
I'm just trying to get uWSGI to run my saleor app so I can start getting it and nginx to communicate properly. :( I'm lost and my brain seriously hurts.
1. Update to Python 3.8.
2. Install uwsgi inside your venv.
3. uwsgi --http :8000 --module saleor.wsgi
4. Do not touch any other settings.
I got this error and solved it this way:
1. Install uwsgi inside your venv.
2. If it is already installed, deactivate your virtual environment, then activate it again.
3. uwsgi --http :8000 --module saleor.wsgi
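If the module still cannot be found after that, one guess based on the log line "chdir() to /home/saleor/saleor/saleor": saleor.wsgi only imports when the directory that contains the saleor package is on the Python path, so chdir may need to point one level up. A sketch to test, with paths taken from the question:
uwsgi --http-socket :8001 --virtualenv /home/saleor/saleor/venv --chdir /home/saleor/saleor --pythonpath /home/saleor/saleor --module saleor.wsgi:application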
I run an automated Python job on an EMR cluster that updates Amazon Athena tables.
It was running well until a few days ago (on Python 2.7 and 3.7). Here is the script:
from pyathenajdbc import connect
import yaml

config = yaml.load(open('athena-config.yaml', 'r'))
statements = config['statements']
staging_dir = config['staging_dir']

conn = connect(s3_staging_dir=staging_dir, region_name='eu-west-1')
try:
    with conn.cursor() as cursor:
        for statement in statements:
            cursor.execute(statement)
finally:
    conn.close()
The athena-config.yaml file holds a staging directory and a few Athena statements.
Here is the error:
You are using pip version 9.0.3, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Unrecognized option: -server
create_tables.py:5: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(open('athena-config.yaml', 'r'))
/mnt/conda/lib/python3.7/site-packages/jpype/_core.py:210: UserWarning:
-------------------------------------------------------------------------------
Deprecated: convertStrings was not specified when starting the JVM. The default
behavior in JPype will be False starting in JPype 0.8. The recommended setting
for new code is convertStrings=False. The legacy value of True was assumed for
this session. If you are a user of an application that reported this warning,
please file a ticket with the developer.
-------------------------------------------------------------------------------
""")
Traceback (most recent call last):
File "create_tables.py", line 10, in <module>
region_name='eu-west-1')
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/__init__.py", line 69, in connect
driver_path, log4j_conf, **kwargs)
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/connection.py", line 68, in __init__
self._start_jvm(jvm_path, jvm_options, driver_path, log4j_conf)
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/util.py", line 25, in _wrapper
return wrapped(*args, **kwargs)
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/connection.py", line 97, in _start_jvm
jpype.startJVM(jvm_path, *args)
File "/mnt/conda/lib/python3.7/site-packages/jpype/_core.py", line 219, in startJVM
_jpype.startup(jvmpath, tuple(args), ignoreUnrecognized, convertStrings)
RuntimeError: Unable to start JVM
at loadJVM(native/common/jp_env.cpp:169)
at loadJVM(native/common/jp_env.cpp:179)
at startup(native/python/pyjp_module.cpp:159)
As far as I understand, the issue is with convertStrings being deprecated. Can anyone help me resolve it? I don't understand why this """) appears before the traceback, or what changed in the past few days to break the code. Thanks!
I got the same issue today. Try downgrading JPype1 to 0.6.3. JPype1 released 0.7.0 today, which is not compatible with the old interfaces.
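For example, the pin with pip would be:
pip install 'JPype1==0.6.3'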
The issue appears to be that the package is calling the JVM with an unrecognized argument, -server. The previous version ignored that sort of error, allowing things to proceed. To get the same behavior with 0.7.0, the ignoreUnrecognized flag would need to be set to True. This likely needs to be reported to pyathenajdbc so it can correct the defect that placed the bogus argument into the startJVM call in the first place.
Looking at the source, the -server flag is hardcoded into the module:
if not jpype.isJVMStarted():
    _logger.debug('JVM path: %s', jvm_path)
    args = [
        '-server',
        '-Djava.class.path={0}'.format(driver_path),
        '-Dlog4j.configuration=file:{0}'.format(log4j_conf)
    ]
    if jvm_options:
        args.extend(jvm_options)
    _logger.debug('JVM args: %s', args)
    jpype.startJVM(jvm_path, *args)
    cls.class_loader = jpype.java.lang.Thread.currentThread().getContextClassLoader()
It assumes a particular JVM that accepts -server as an argument.
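To illustrate, under JPype 0.7.0 an equivalent call would have to opt in to ignoring the flag; a sketch, with placeholder jar and log4j paths:
import jpype

args = [
    '-server',  # the argument some JVMs reject; 0.7.0 no longer swallows the error
    '-Djava.class.path=/path/to/AthenaJDBC.jar',             # placeholder
    '-Dlog4j.configuration=file:/path/to/log4j.properties',  # placeholder
]
jpype.startJVM(jpype.getDefaultJVMPath(), *args,
               ignoreUnrecognized=True, convertStrings=False)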
I have project code in a Python notebook, and it all ran fine when Spark was hosted in Bluemix.
We run the following code to connect to Netezza (on premises), which worked fine in Bluemix:
VT = sqlContext.read.format('jdbc').options(url='jdbc:netezza://169.54.xxx.x:xxxx/BACC_PRD_ISCNZ_GAPNZ', user='XXXXXX', password='XXXXXXX', dbtable='GRACE.CDVT_LIVE_SPARK', driver='org.netezza.Driver').load()
However, after migrating to Data Science Experience, we get the following error. I have established the secure gateway and it's all working fine, but this code does not run. I think the issue is with the Netezza driver. If that is the case, is there a way to explicitly import the class/driver so the above code can be executed? Please help us address this issue.
Error Message:
/usr/local/src/spark20master/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/src/spark20master/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1} {2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o212.load.
: java.lang.ClassNotFoundException: org.netezza.driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:607)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:844)
at java.lang.ClassLoader.loadClass(ClassLoader.java:823)
at java.lang.ClassLoader.loadClass(ClassLoader.java:803)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at scala.Option.foreach(Option.scala:257)
You can install a jar file by adding a cell with an exclamation mark that runs a unix tool to download the file, in this example wget:
!wget https://some.public.host/yourfile.jar -P ${HOME}/data/libs
After downloading the file you will need to restart your kernel.
Note this approach assumes your jar file is publicly available on the Internet.
Notebooks in Bluemix and notebooks in DSX (Data Science Experience) currently use the same backend, so they have access to the same pre-installed drivers. Netezza isn't among them. As Chris Snow pointed out, users can install additional JARs and Python packages into their service instances.
You probably created a new service instance for DSX and have not yet installed the user JARs and packages that the old one had. It's a one-time setup, and therefore easy to forget when you've been using the same instance for a while. Execute these commands in a Python notebook on the old Bluemix instance to check for user-installed things:
!ls -lF ~/data/libs
!pip freeze
Then install the missing things into your new instance on DSX.
There is another way to connect to Netezza: the ingest connector, which is enabled by default in DSX:
http://datascience.ibm.com/docs/content/analyze-data/python_load.html
from ingest import Connectors
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
NetezzaloadOptions = {
    Connectors.Netezza.HOST: 'hostorip',
    Connectors.Netezza.PORT: 'port',
    Connectors.Netezza.DATABASE: 'databasename',
    Connectors.Netezza.USERNAME: 'xxxxx',
    Connectors.Netezza.PASSWORD: 'xxxx',
    Connectors.Netezza.SOURCE_TABLE_NAME: 'tablename'}
NetezzaDF = sqlContext.read.format("com.ibm.spark.discover").options(**NetezzaloadOptions).load()
NetezzaDF.printSchema()
NetezzaDF.show()
Thanks,
Charles.