How to connect to Cloudant/CouchDB using Spark SQL in Data Science Experience? - couchdb

Formerly, CouchDB was supported via the Cloudant connector:
https://github.com/cloudant-labs/spark-cloudant
But this project states that it is no longer active and that it moved to Apache Bahir:
http://bahir.apache.org/docs/spark/2.1.1/spark-sql-cloudant/
So I've installed the JAR in a Scala notebook using the following command:
%AddJar
http://central.maven.org/maven2/org/apache/bahir/spark-sql-cloudant_2.11/2.1.1/spark-sql-cloudant_2.11-2.1.1.jar
Then, from a python notebook, after restarting the kernel, I use the following code to test:
spark = SparkSession\
    .builder\
    .appName("Cloudant Spark SQL Example in Python using dataframes")\
    .config("cloudant.host","0495289b-1beb-4e6d-888e-315f36925447-bluemix.cloudant.com")\
    .config("cloudant.username", "0495289b-1beb-4e6d-888e-315f36925447-bluemix")\
    .config("cloudant.password","xxx")\
    .config("jsonstore.rdd.partitions", 8)\
    .getOrCreate()
# 1. Loading dataframe from Cloudant db
df = spark.read.load("openspace", "org.apache.bahir.cloudant")
df.cache()
df.printSchema()
df.show()
But I get:
java.lang.ClassNotFoundException: org.apache.bahir.cloudant.DefaultSource
(gist of full log)

There is one workaround; it should run in all sorts of Jupyter notebook environments and is not exclusive to IBM Data Science Experience:
!pip install --upgrade pixiedust
import pixiedust
pixiedust.installPackage("cloudant-labs:spark-cloudant:2.0.0-s_2.11")
This is of course a workaround; I will post the official answer once available.
EDIT:
Don't forget to restart the Jupyter kernel afterwards.
EDIT 24.12.18:
Created a YouTube video on this without the workaround, see comments. Will update this post as well at a later stage.

Another workaround below. It has been tested and works in DSX Python notebooks:
import pixiedust
# Use play-json version 2.5.9. Latest version is not supported at this time.
pixiedust.installPackage("com.typesafe.play:play-json_2.11:2.5.9")
# Get the latest sql-cloudant library
pixiedust.installPackage("org.apache.bahir:spark-sql-cloudant_2.11:0")
spark = SparkSession\
    .builder\
    .appName("Cloudant Spark SQL Example in Python using dataframes")\
    .config("cloudant.host", host)\
    .config("cloudant.username", username)\
    .config("cloudant.password", password)\
    .getOrCreate()
df = spark.read.load(format="org.apache.bahir.cloudant", database="MY-DB")

Related

pyspark.sql can query local Delta Lake, but fails in case of remote Delta Lake

Setup
Apache Spark v3.2.2 was installed on Ubuntu 22.04.1 LTS (x64). Standalone mode. Plus, openjdk 11.0.16 (x64), Python v3.10.4 (x64) and PySpark (Python package) v3.2.2 were available. I started PySpark with arguments:
pyspark --packages io.delta:delta-core_2.12:2.0.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Note the Delta Lake feature.
I created a Delta Lake table named loans_delta_method_2, whose data are stored at /home/ubuntu/spark-warehouse/loans_delta_method_2/. Verified:
ls /home/ubuntu/spark-warehouse/loans_delta_method_2/
# _delta_log part-00001-1fb10e5a-c354-438a-bed9-21ebb9942adb-c000.snappy.parquet
# part-00000-081b47cd-b692-4fd8-8ec1-bc359121e360-c000.snappy.parquet part-00003-be1da90f-6b1b-457f-9a21-6101da27618f-c000.snappy.parquet
# part-00000-67fadb6e-b823-4503-9b0f-ac771ac5df7f-c000.snappy.parquet part-00007-a90cde32-d3ec-48b6-97e7-c63ec2ebc1e9-c000.snappy.parquet
With the Python code snippet below, executed in Python Shell (not PySpark Shell), I could query the table:
from pyspark.sql import SparkSession
warehouse_location = "/home/ubuntu/spark-warehouse"
spark = SparkSession.builder.\
    master("local[*]").\
    appName("TestApp").\
    enableHiveSupport().\
    config("spark.sql.warehouse.dir", warehouse_location).\
    config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension").\
    config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog").\
    config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0").\
    getOrCreate()
spark.catalog.listTables("default")
# [Table(name='loans_delta_method_2', database='default', description=None, tableType='MANAGED', isTemporary=False)]
Problem
In another Windows 10 machine (x64), openjdk 11.0.16.1 (Temurin) (x64), Python 3.10.6 (x64) and PySpark (Python package) v3.2.2 were installed. Same Python code snippet like above, except the master URL being different:
from pyspark.sql import SparkSession
warehouse_location = "/home/ubuntu/spark-warehouse"
spark = SparkSession.builder.\
    master("spark://10.5.129.21:7077").\
    appName("TestApp").\
    enableHiveSupport().\
    config("spark.sql.warehouse.dir", warehouse_location).\
    config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension").\
    config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog").\
    config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0").\
    getOrCreate()
spark.catalog.listTables("default")
# []
But no tables were returned! In addition, when changing warehouse_location to some incorrect path, still no tables were returned (no crash, no error reported!).
The firewall of both machines was disabled. Both machines were in the same subnet. The binding address of Spark was already changed to "0.0.0.0".
Guess
There may be some magical way to configure the warehouse_location, e.g. file:///home/ubuntu/spark-warehouse, file://home/ubuntu/spark-warehouse, or //home//ubuntu//spark-warehouse (unfortunately, nothing worked).
Spark may have got confused between absolute path and relative path.
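On the file-URI guesses: Python's pathlib shows the canonical form, which may help rule out the two-slash vs. three-slash confusion (a purely illustrative check, independent of Spark):

```python
from pathlib import PurePosixPath

# An absolute POSIX path maps to a file URI with an empty authority,
# i.e. three slashes after "file:". "file://home/..." would treat
# "home" as a hostname, not as part of the path.
uri = PurePosixPath("/home/ubuntu/spark-warehouse").as_uri()
print(uri)  # file:///home/ubuntu/spark-warehouse
```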
Any advice is appreciated.
As discussed in another Stack Overflow post, when the master is accessed with local[*], one can load local files with the file:/// prefix. But as soon as the master/executors are remote relative to the driver, the files to be loaded must be accessible to both the driver and the executors. Typically, such files are stored in HDFS.
In my case, the files can be seen by the executors (the Ubuntu machine) but not by the driver (the Windows machine), which left Spark unable to load them. Apache Spark is the backbone of Delta Lake, so Delta Lake has the same restriction as Spark.
Caveat: I haven't tested the case where the driver, executors, and master are all Linux, and the files are cloned onto each machine, e.g. /etc/my_files. In that case, I'm not sure whether Spark can access the files even though the master is accessed via spark://some_host:some_port.
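One cheap sanity check along these lines is to test, from the driver machine, whether the warehouse directory is visible at all before building the session (a sketch; the path is the one from the question):

```python
import os

warehouse_location = "/home/ubuntu/spark-warehouse"

# The driver resolves warehouse paths against its own filesystem; a path
# that exists only on the remote cluster machines looks like an empty
# warehouse here, which matches the "no tables, no error" symptom.
if not os.path.isdir(warehouse_location):
    print(f"warning: {warehouse_location} is not visible from this machine")
```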

JavaPackage object is not callable error for pydeequ constraint suggestion

I'm getting a "JavaPackage object is not callable" error while trying to run the PyDeequ constraint suggestion method on Databricks.
I have tried running this code on an Apache Spark 3.1.2 cluster as well as an Apache Spark 3.0.1 cluster, but no luck.
suggestionResult = ConstraintSuggestionRunner(spark).onData(df).addConstraintRule(DEFAULT()).run()
print(suggestionResult)
Please refer to the second screenshot attached for the expanded error status.
PyDeequ error screenshot
Expanded PyDeequ error screenshot
I was able to combine some solutions found here, as well as other solutions, to get past the above JavaPackage error in Azure Databricks. Here are the details, if helpful for anyone.
From this link, I downloaded the appropriate JAR file to match my Spark version. In my case, that was deequ_2_0_1_spark_3_2.jar. I then installed this file using the JAR type under Libraries in my cluster configurations.
The following then worked, ran in different cells in a notebook.
%pip install pydeequ
%sh export SPARK_VERSION=3.2.1
df = spark.read.load("abfss://container-name@account.dfs.core.windows.net/path/to/data")
from pyspark.sql import SparkSession
import pydeequ
spark = (SparkSession
    .builder
    .getOrCreate())
from pydeequ.analyzers import *
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("column_name")) \
    .run()
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
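One caveat about the %sh export step above: %sh runs in a child shell, so the variable does not necessarily reach the Python process that imports pydeequ (which, as far as I recall, reads SPARK_VERSION at import time). Setting it from Python is a hedged alternative:

```python
import os

# pydeequ checks the SPARK_VERSION environment variable; setting it
# in-process avoids relying on a %sh export propagating to the kernel.
os.environ["SPARK_VERSION"] = "3.2.1"
print(os.environ["SPARK_VERSION"])
```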

Setting spark.local.dir in Pyspark/Jupyter

I'm using Pyspark from a Jupyter notebook and attempting to write a large parquet dataset to S3.
I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up.
I want to now edit spark.local.dir to point to a directory that has space.
How can I set this parameter?
Most solutions I found suggested setting it when using spark-submit. However, I am not using spark-submit and just running it as a script from Jupyter.
Edit: I'm using Sparkmagic to work with an EMR backend. I think spark.local.dir needs to be set in the config JSON, but I am not sure how to specify it there.
I tried adding it in session_configs but it didn't work.
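For Sparkmagic specifically, one hedged possibility is passing it through the %%configure magic's conf map (the directory path here is illustrative, and note that on YARN, spark.local.dir may be ignored in favor of the node managers' configured local dirs):

```
%%configure -f
{"conf": {"spark.local.dir": "/mnt/big-disk/spark-tmp"}}
```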
The answer depends on where your SparkContext comes from.
If you are starting Jupyter with pyspark:
PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark
then your SparkContext is already initialized when you receive your Python kernel in Jupyter. You should therefore pass a parameter to pyspark (at the end of the command above): --conf spark.local.dir=...
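Concretely, the launch command with the extra flag might look like this (the directory is illustrative):

```shell
PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark --conf spark.local.dir=/mnt/big-disk/spark-tmp
```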
If you are constructing a SparkContext in Python
If you have code in your notebook like:
import pyspark
sc = pyspark.SparkContext()
then you can configure the Spark context before creating it:
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.local.dir', '...')
sc = pyspark.SparkContext(conf=conf)
Configuring Spark from the command line:
It's also possible to configure Spark by editing a configuration file in bash. The file you want to edit is ${SPARK_HOME}/conf/spark-defaults.conf. You can append to it as follows (creating it if it doesn't exist):
echo 'spark.local.dir /foo/bar' >> ${SPARK_HOME}/conf/spark-defaults.conf

Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a Spark application to a standalone Spark (2.1.1) cluster of 3 VMs running in GCP from my Python 3 notebook (running on my local laptop), but for some reason the Spark session is throwing the error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".
Environment details: IPython and the Spark master are running in one GCP VM called "sparkmaster". 3 additional GCP VMs are running Spark workers and Cassandra clusters. I connect from my local laptop (MBP) using Chrome to the IPython notebook on the "sparkmaster" GCP VM.
Please note that it works from the terminal:
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000
Running it from Python Notebook:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate()  # This step works if I use .master('local')
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
    .option("subscribe", "gene") \
    .load()
so far I have tried these:
I have tried changing the Spark master node's spark-defaults.conf and spark-env.sh to add SPARK_MASTER_IP.
Tried to find the STANDALONE_SPARK_MASTER_HOST=`hostname -f` setting so that I can remove the "-f". For some reason my Spark master UI shows FQDN:7077, not hostname:7077.
Passed the FQDN as a parameter to .master() and os.environ["PYSPARK_SUBMIT_ARGS"].
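For completeness, the master URL can also be carried in PYSPARK_SUBMIT_ARGS itself, so it agrees with whatever .master() receives (a sketch; hostname and package as in the question):

```python
import os

# Hedged sketch: keep --master and --packages together in the submit
# args, ending with pyspark-shell as PySpark expects.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master spark://sparkmaster:7077 "
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell"
)
```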
Please let me know if you need more details.
After doing some more research I was able to resolve the issue. It was due to a simple environment variable called SPARK_HOME. In my case it was pointing to Conda's /bin (pyspark was running from this location), whereas my Spark setup was present in a different path. The simple fix was to add
export SPARK_HOME="/home/<<your location path>>/spark/" to the .bashrc file (I want this to be attached to my profile, not to the Spark session).
How I have done it:
Step 1: SSH to the master node; in my case it was the same as the IPython kernel/server VM in GCP.
Step 2:
cd ~
sudo nano .bashrc
scroll down to the last line and paste the below line
export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
Ctrl+X, then Y, then Enter to save the changes.
Note: I have also added a few more details to the environment section for clarity.

Connecting Jupyter notebook to Spark

I have a machine with Hadoop and Spark installed. Below is my current environment.
python3.6
spark1.5.2
Hadoop 2.7.1.2.3.6.0-3796
I was trying to connect a Jupyter notebook to Spark by building an IPython kernel.
3 new files were written:
/root/.ipython/profile_pyspark/ipython_notebook_config.py
/root/.ipython/profile_pyspark/startup/00-pyspark-setup.py
/root/anaconda3/share/jupyter/kernels/pyspark/kernel.json
kernel.json
{
  "display_name": "PySpark (Spark 2.0.0)",
  "language": "python",
  "argv": [
    "/root/anaconda3/bin/python3",
    "-m",
    "ipykernel",
    "--profile=pyspark"
  ],
  "env": {
    "CAPTURE_STANDARD_OUT": "true",
    "CAPTURE_STANDARD_ERR": "true",
    "SEND_EMPTY_OUTPUT": "false",
    "PYSPARK_PYTHON": "/root/anaconda3/bin/python3",
    "SPARK_HOME": "/usr/hdp/current/spark-client/"
  }
}
00-pyspark-setup.py
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/root/anaconda3/bin/python"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.8.2.1-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())
ipython_notebook_config.py
c = get_config()
c.NotebookApp.port = 80
Then, when i run the following
jupyter notebook --profile=pyspark
The notebook runs well. Then I change the kernel to 'PySpark (Spark 2.0.0)', which is supposed to provide the 'sc' Spark context. However, when I type 'sc', it does not show anything.
So, since sc cannot be initialized, when I want to run the following, it fails:
nums = sc.parallelize(xrange(1000000))
Can anybody help me how to configure jupyter notebook to talk to Spark?
Just FYI, Python 3.6 isn't supported until Spark 2.1.1. See JIRA https://issues.apache.org/jira/browse/SPARK-19019
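Relatedly, the snippet in the question would fail on Python 3 regardless of the Spark version, because xrange was removed from the language:

```python
# Python 3 removed xrange; range is now lazy, like Python 2's xrange was.
nums = list(range(5))
print(nums)  # [0, 1, 2, 3, 4]
```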
There are a number of issues with your question...
1) On top of the answer by Punskr above - Spark 1.5 only works with Python 2; Python 3 support was introduced in Spark 2.0.
2) Even if you switch to Python 2 or upgrade Spark, you will still need to import the relevant modules of Pyspark and initialize the sc variable manually in the notebook
3) You also seem to use an old version of Jupyter, since the profiles functionality is not available in Jupyter >= 4.
To initialise sc "automatically" in Jupyter >=4, see my answer here.
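One more fragile spot in the setup files above: the hardcoded py4j-0.8.2.1-src.zip name changes with every Spark release. A small sketch that discovers it instead (the SPARK_HOME default is the one from the question):

```python
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/usr/hdp/current/spark-client")

# Match whatever py4j version ships with this Spark rather than pinning one;
# glob returns a (possibly empty) list of full paths.
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
print(py4j_zips)
```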
You can make a few environment changes to have pyspark default to IPython or a Jupyter notebook.
Put the following in your ~/.bashrc
export PYSPARK_PYTHON=python3 ## for python3
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"
See: pyspark on GitHub
Next, run source ~/.bashrc
Then, when you launch pyspark (optionally on YARN), it will open up a notebook server for you to connect to.
On a local terminal that has ssh capabilities, run
ssh -N -f -L localhost:8000:localhost:7000 <username>@<host>
If you're on Windows, I recommend MobaXterm or Cygwin.
Open up a web browser, and enter the address localhost:8000 to tunnel into your notebook with Spark
Now, some precautions: I've never tried this with Python 3, so it may or may not work for you. Regardless, you should really be using Python 2 on Spark 1.5. My company uses Spark 1.5 as well, and no one uses Python 3 because of it.
Update:
Per @desertnaut's comments, setting
export PYSPARK_DRIVER_PYTHON=ipython
may cause issues if the user ever needs to use spark-submit. A workaround, if you want to have both notebooks and spark-submit available, is to create two new environment variables. Here is an example of what you may create:
export PYSPARK_PYTHON=python3 ## for python3
export ipyspark='PYSPARK_DRIVER_PYTHON=ipython pyspark'
export pynb='PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"'
where ipyspark and pynb are new commands on a bash terminal.