Netezza Drivers not available in Spark (Python Notebook) in DataScienceExperience - apache-spark

I have project code in a Python notebook, and it ran fine when Spark was hosted in Bluemix.
We are running the following code to connect to Netezza (on premises), which worked fine in Bluemix:
VT = sqlContext.read.format('jdbc').options(
    url='jdbc:netezza://169.54.xxx.x:xxxx/BACC_PRD_ISCNZ_GAPNZ',
    user='XXXXXX',
    password='XXXXXXX',
    dbtable='GRACE.CDVT_LIVE_SPARK',
    driver='org.netezza.Driver').load()
However, after migrating to Data Science Experience, we are getting the following error. I have established the Secure Gateway and it is working fine, but this code is not running. I think the issue is with the Netezza driver. If that is the case, is there a way we can explicitly import the class/driver so the above code can be executed? Please help us address the issue.
Error Message:
/usr/local/src/spark20master/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/src/spark20master/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1} {2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o212.load.
: java.lang.ClassNotFoundException: org.netezza.driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:607)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:844)
at java.lang.ClassLoader.loadClass(ClassLoader.java:823)
at java.lang.ClassLoader.loadClass(ClassLoader.java:803)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at scala.Option.foreach(Option.scala:257)

You can install a JAR file by adding a cell that uses an exclamation mark to run a Unix download tool, in this example wget:
!wget https://some.public.host/yourfile.jar -P ${HOME}/data/libs
After downloading the file you will need to restart your kernel.
Note this approach assumes your jar file is publicly available on the Internet.

Notebooks in Bluemix and notebooks in DSX (Data Science Experience) currently use the same backend, so they have access to the same pre-installed drivers. Netezza isn't among them. As Chris Snow pointed out, users can install additional JARs and Python packages into their service instances.
You probably created a new service instance for DSX and did not yet install the user JARs and packages that the old one had. It's a one-time setup, and therefore easy to forget when you've been using the same instance for a while. Execute these commands in a Python notebook on the old Bluemix instance to check for user-installed JARs and packages:
!ls -lF ~/data/libs
!pip freeze
Then install the missing items into your new instance on DSX.
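For example, if the Netezza JDBC driver JAR is reachable at a URL you control (the URL and file name below are placeholders), the one-time setup in the new instance could look roughly like this; after restarting the kernel, the original read should find the driver:
!wget https://your.host/nzjdbc.jar -P ${HOME}/data/libs
# restart the kernel, then:
VT = sqlContext.read.format('jdbc').options(
    url='jdbc:netezza://169.54.xxx.x:xxxx/BACC_PRD_ISCNZ_GAPNZ',
    user='XXXXXX',
    password='XXXXXXX',
    dbtable='GRACE.CDVT_LIVE_SPARK',
    driver='org.netezza.Driver').load()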

There is another way to connect to Netezza: the ingest connector, which is enabled by default in DSX.
http://datascience.ibm.com/docs/content/analyze-data/python_load.html
from ingest import Connectors
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
NetezzaloadOptions = {
    Connectors.Netezza.HOST: 'hostorip',
    Connectors.Netezza.PORT: 'port',
    Connectors.Netezza.DATABASE: 'databasename',
    Connectors.Netezza.USERNAME: 'xxxxx',
    Connectors.Netezza.PASSWORD: 'xxxx',
    Connectors.Netezza.SOURCE_TABLE_NAME: 'tablename'}
NetezzaDF = sqlContext.read.format("com.ibm.spark.discover").options(**NetezzaloadOptions).load()
NetezzaDF.printSchema()
NetezzaDF.show()
Thanks,
Charles.

Related

Amazon EMR pyspark unable to read a JSON file

I am new to EMR and big data. We have an EMR step that was working fine until last month; currently I am getting the error below.
--- Logging error ---
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/src.zip/src/source/Data_Extraction.py", line 59, in process_job_description
df_job_desc = spark.read.schema(schema_jd).option('multiline',"true").json(self.filepath)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/mnt/yarn/usercache/hadoop/appcache/application_1660495066893_0006/container_1660495066893_0006_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o115.json.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:421)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:654)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:625)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:473)
at
These JSON files are present in S3. I downloaded some of the files to reproduce the issue locally; with a smaller set of data it works fine, but in EMR I am unable to reproduce it.
I also checked the application details in EMR for this step.
It shows an undefined status, with the details below.
Details:org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3285)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:750)
Spark session creation:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark_builder = (
    SparkSession
    .builder
    .config(conf=SparkConf())
    .appName("test"))

spark = spark_builder.getOrCreate()
I am not sure what went wrong suddenly with this step. Please help.
Your error indicates a failed security handshake; various results from searching for it all point to the remote host throttling or rejecting incoming TLS connections, so it should be handled with a backoff strategy.
You can further try the suggestions for retrying with an exponential backoff strategy (here) and limit your requests by utilising the AMID.
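A minimal sketch of such a retry with exponential backoff, wrapped around the S3 read from your snippet (the attempt count and delays are arbitrary assumptions, not AWS recommendations):
import time

max_attempts = 5
for attempt in range(max_attempts):
    try:
        df_job_desc = spark.read.schema(schema_jd).option('multiline', "true").json(self.filepath)
        break  # read succeeded
    except Exception:
        if attempt == max_attempts - 1:
            raise  # give up after the last attempt
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s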
Additionally, you can check your DNS quotas to make sure nothing is being limited and that you are not exhausting your quota.
Further, add your application environment details to check whether an outdated version might be causing this:
EMR Release version
Spark Versions
AWS SDK Version
AMI [ Amazon Linux Machine Images ] versions
Java & JVM Details
Hadoop Details
The recommended environment would be AMI 2.x, EMR 5.3x, and the compatible SDKs for the same (preferably AWSS3JavaClient 1.11x).
More info about EMR releases can be found here.
Additionally, provide a clear snippet of how exactly you are reading your JSON files from S3: iteratively (one file after the other), or in bulk or batches?
References used -
https://github.com/aws/aws-sdk-java/issues/2269
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake during web service communicaiton
https://github.com/aws/aws-sdk-java/issues/1405
From your error message (...SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake), it seems like you've got a security protocol that is not accepted by the host, or the connection was closed on the service side before the SDK was able to perform the handshake. You should add a try/except block with some delay between retries to handle those errors:
import time

errors = 0
while errors < 5:
    try:
        df_job_desc = spark.read.schema(schema_jd).option('multiline', "true").json(self.filepath)
        break  # success, stop retrying
    except Exception:
        time.sleep(1)  # small delay before the next attempt
        errors += 1

ConfigError "dnspython must be installed" when requirement already satisfied

Trying to connect a Colab Notebook to a MongoDB on Atlas.
from pymongo import MongoClient
uri = "mongodb+srv://MYUSERNAME:mypassword#mydatabase.mongodb.net/test"
client = MongoClient(uri)
I am getting a ConfigurationError:
"dnspython" module must be installed to use mongodb+srv:// URIs.
I installed the module.
pip install dnspython
Got the message back
Requirement already satisfied: dnspython in /usr/local/lib/python3.6/dist-packages (1.16.0)
I do not know what is wrong.
This worked a few days ago with another colab notebook (and another database).
Here is the entire error message:
ConfigurationError Traceback (most recent call last)
<ipython-input-30-a6c89e14e64f> in <module>()
----> 1 client = MongoClient(uri)
1 frames
/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in __init__(self, host, port, document_class, tz_aware, connect, type_registry, **kwargs)
522 for entity in host:
523 if "://" in entity:
--> 524 res = uri_parser.parse_uri(entity, port, warn=True)
525 seeds.update(res["nodelist"])
526 username = res["username"] or username
/usr/local/lib/python3.6/dist-packages/pymongo/uri_parser.py in parse_uri(uri, default_port, validate, warn)
316 elif uri.startswith(SRV_SCHEME):
317 if not _HAVE_DNSPYTHON:
--> 318 raise ConfigurationError('The "dnspython" module must be '
319 'installed to use mongodb+srv:// URIs')
320 is_srv = True
ConfigurationError: The "dnspython" module must be installed to use mongodb+srv:// URIs
You have to restart the runtime for the changes to take effect:
!pip install dnspython
Restart the runtime: Runtime -> Restart runtime...
Run your code
Try installing pymongo[srv] and pymongo[tls]:
!pip3 install pymongo[srv]
!pip3 install pymongo[tls]
Change mongodb+srv:// to mongodb:// and it will work.
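If you go that route, the standard (non-SRV) connection string has to spell out the individual hosts, replica set and TLS options itself, since it no longer resolves them via DNS. A rough sketch, where every host name, the replica set name and the database are hypothetical placeholders to be replaced with your Atlas cluster's actual connection details:
from pymongo import MongoClient

# all host/replica-set values below are placeholders
uri = ("mongodb://MYUSERNAME:mypassword@"
       "cluster0-shard-00-00.mongodb.net:27017,"
       "cluster0-shard-00-01.mongodb.net:27017,"
       "cluster0-shard-00-02.mongodb.net:27017/test"
       "?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin")
client = MongoClient(uri)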

Downloading AzureML model raised SSL error

I have a very strange error: when I try to download one model, I get
python3.6/site-packages/urllib3/contrib/pyopenssl.py in recv_into(self, *args, **kwargs)
303 try:
--> 304 return self.connection.recv_into(*args, **kwargs)
305 except OpenSSL.SSL.SysCallError as e:
SSLError: ("read error: Error([('SSL routines', 'ssl3_get_record', 'decryption failed or bad record mac')],)",)
But if I download another model in the same workspace, it downloads normally.
model = Model(ws, 'model1')
model.download(target_dir=os.getcwd() + '/outputs/1/', exist_ok=True)
# this downloads normally
model = Model(ws, 'model2')
model.download(target_dir=os.getcwd() + '/outputs/2/', exist_ok=True)
# this gives me an SSL error
Some points:
This model downloaded fine before, but suddenly it won't download.
My network is probably not the problem, because otherwise the first model wouldn't download either...
This is indeed odd. I assume this reproduces consistently between model1 and model2. Which version of OpenSSL are you using?
python -c "import sys; print(sys.OPENSSL_VERSION)"

RuntimeError: Unable to start JVM because of Deprecated: convertStrings

I run an automated Python job on an EMR cluster that updates Amazon Athena tables.
It was running well until a few days ago (on Python 2.7 and 3.7). Here is the script:
from pyathenajdbc import connect
import yaml

config = yaml.load(open('athena-config.yaml', 'r'))
statements = config['statements']
staging_dir = config['staging_dir']

conn = connect(s3_staging_dir=staging_dir, region_name='eu-west-1')
try:
    with conn.cursor() as cursor:
        for statement in statements:
            cursor.execute(statement)
finally:
    conn.close()
The athena-config.yaml has a staging directory and a few Athena statements.
Here is the Error:
You are using pip version 9.0.3, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Unrecognized option: -server
create_tables.py:5: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(open('athena-config.yaml', 'r'))
/mnt/conda/lib/python3.7/site-packages/jpype/_core.py:210: UserWarning:
-------------------------------------------------------------------------------
Deprecated: convertStrings was not specified when starting the JVM. The default
behavior in JPype will be False starting in JPype 0.8. The recommended setting
for new code is convertStrings=False. The legacy value of True was assumed for
this session. If you are a user of an application that reported this warning,
please file a ticket with the developer.
-------------------------------------------------------------------------------
""")
Traceback (most recent call last):
File "create_tables.py", line 10, in <module>
region_name='eu-west-1')
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/__init__.py", line 69, in connect
driver_path, log4j_conf, **kwargs)
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/connection.py", line 68, in __init__
self._start_jvm(jvm_path, jvm_options, driver_path, log4j_conf)
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/util.py", line 25, in _wrapper
return wrapped(*args, **kwargs)
File "/mnt/conda/lib/python3.7/site-packages/pyathenajdbc/connection.py", line 97, in _start_jvm
jpype.startJVM(jvm_path, *args)
File "/mnt/conda/lib/python3.7/site-packages/jpype/_core.py", line 219, in startJVM
_jpype.startup(jvmpath, tuple(args), ignoreUnrecognized, convertStrings)
RuntimeError: Unable to start JVM
at loadJVM(native/common/jp_env.cpp:169)
at loadJVM(native/common/jp_env.cpp:179)
at startup(native/python/pyjp_module.cpp:159)
As far as I understand, the issue is with convertStrings being deprecated. Can anyone help me resolve that? I cannot understand why this """) comes before the traceback, or what changed in the past few days to break the code. Thanks!
Got the same issue today. Try to downgrade JPype1 to 0.6.3. JPype1 released 0.7.0 today, which is not compatible with old interfaces.
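For example, pinning the version (assuming pip manages your JPype install):
pip install JPype1==0.6.3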
The issue appears to be that the package is calling the JVM with an unrecognized argument, -server. The previous version ignored those sorts of errors, allowing things to proceed. To get the same behavior with 0.7.0, the flag ignoreUnrecognized would need to be set to True. Likely this needs to be sent to pyathenajdbc to correct the defect which placed the bogus argument into the startJVM call in the first place.
Looking at the source, -server is hardcoded into the module:
if not jpype.isJVMStarted():
    _logger.debug('JVM path: %s', jvm_path)
    args = [
        '-server',
        '-Djava.class.path={0}'.format(driver_path),
        '-Dlog4j.configuration=file:{0}'.format(log4j_conf)
    ]
    if jvm_options:
        args.extend(jvm_options)
    _logger.debug('JVM args: %s', args)
    jpype.startJVM(jvm_path, *args)
    cls.class_loader = jpype.java.lang.Thread.currentThread().getContextClassLoader()
It assumes a particular JVM that accepts -server as an argument.
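As a sketch only (not the library's actual fix): the traceback above shows that startJVM in JPype 0.7.0 accepts ignoreUnrecognized and convertStrings keyword arguments, so a patched pyathenajdbc could call it roughly like this, with jvm_path and args built as in the snippet above:
# hypothetical patch inside pyathenajdbc's _start_jvm
jpype.startJVM(jvm_path, *args,
               ignoreUnrecognized=True,  # tolerate the bogus -server flag
               convertStrings=True)      # keep the legacy string behaviour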

PySpark exception while using IPython

I installed PySpark and IPython notebook on Ubuntu 12.04.
After installing, when I run "ipython --profile=pyspark", it throws the following exception:
ubuntu_user@ubuntu_user-VirtualBox:~$ ipython --profile=pyspark
Python 2.7.3 (default, Jun 22 2015, 19:33:41)
Type "copyright", "credits" or "license" for more information.
IPython 0.12.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
IPython profile: pyspark
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/home/ubuntu_user/.config/ipython/profile_pyspark/startup/00-pyspark-setup.py in <module>()
6 sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
7
----> 8 execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
9
/home/ubuntu_user/spark/python/pyspark/shell.py in <module>()
41 SparkContext.setSystemProperty("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])
42
---> 43 sc = SparkContext(pyFiles=add_files)
44 atexit.register(lambda: sc.stop())
45
/home/ubuntu_user/spark/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
108 """
109 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110 SparkContext._ensure_initialized(self, gateway=gateway)
111 try:
112 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/home/ubuntu_user/spark/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
232 with SparkContext._lock:
233 if not SparkContext._gateway:
--> 234 SparkContext._gateway = gateway or launch_gateway()
235 SparkContext._jvm = SparkContext._gateway.jvm
236
/home/ubuntu_user/spark/python/pyspark/java_gateway.pyc in launch_gateway()
92 callback_socket.close()
93 if gateway_port is None:
---> 94 raise Exception("Java gateway process exited before sending the driver its port number")
95
96 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
Below are the settings and configuration files.
ubuntu_user@ubuntu_user-VirtualBox:~$ ls /home/ubuntu_user/spark
bin ec2 licenses README.md
CHANGES.txt examples NOTICE RELEASE
conf lib python sbin
data LICENSE R spark-1.5.2-bin-hadoop2.6.tgz
Below are the IPython settings:
ubuntu_user@ubuntu_user-VirtualBox:~$ ls .config/ipython/profile_pyspark/
db ipython_config.py log security
history.sqlite ipython_notebook_config.py pid startup
IPython and Spark (PySpark) configuration:
ubuntu_user@ubuntu_user-VirtualBox:~$ vi .config/ipython/profile_pyspark/ipython_notebook_config.py
# Configuration file for ipython-notebook.
c = get_config()
# IPython PySpark
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 7770
ubuntu_user@ubuntu_user-VirtualBox:~$ vi .config/ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Setting the following environment variables in .bashrc or .bash_profile:
ubuntu_user@ubuntu_user-VirtualBox:~$ vi .bashrc
export SPARK_HOME="/home/ubuntu_user/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
I am new to Apache Spark and IPython. How can I solve this issue?
I had the same exception when my virtual machine didn't have enough memory for Java. I allocated more memory to my virtual machine and the exception went away.
Steps: shut down the VM -> VirtualBox Settings -> "System" tab -> set the memory.
(However, this may only be a workaround. I guess the correct way to fix this exception might be to properly configure Spark's Java memory.)
Maybe Spark is having trouble locating the PySpark shell:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
This works for Spark 1.6.1. If you have a different version, locate the py4j .zip file under $SPARK_HOME/python/lib and add that path instead.
Two thoughts:
First, where is your JDK? I don't see a JAVA_HOME variable configured in your file. That alone might be enough to cause:
Error: Must specify a primary resource (JAR or Python or R file)
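For that first point, adding something like this to .bashrc might help (the JDK path is a placeholder; use your actual install location):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH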
Second, make sure your port 7770 is open and available to your JVM.