Can't import spark within DSX environment - python-3.x

I'm trying to import the KMeans and Vectors classes from spark.mllib. The platform is IBM Cloud (DSX) with Python 3.5 and a Jupyter notebook.
I've tried:
import org.apache.spark.mllib.linalg.Vectors
import apache.spark.mllib.linalg.Vectors
import spark.mllib.linalg.Vectors
I've found several examples/tutorials where the first import works for the author. I was able to confirm that the Spark library itself isn't loaded in the environment. Normally, I would download the package and then import it, but being new to VMs, I'm not sure how to make that happen.
I've also tried pip install spark without luck. It throws an error that reads:
The following command must be run outside of the IPython shell:
$ pip install spark
The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.
But this is in a VM where I don't see any way to access the CLI externally.
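As an aside, in a Jupyter notebook you can usually run pip from a cell with the ! shell escape rather than a separate terminal, for example (placeholder package name):
!pip install --user some-package
Note that even if it succeeded, pip install spark would not give you PySpark; the Python API ships with the Spark installation (the PyPI name for it is pyspark).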
I did find this, but I don't think I have a version-mismatch problem -- it covers the issue of importing into DSX, but I can't quite interpret it for my situation.
I think this is the actual issue I'm having, but it is for SparkR, not Python.

It looks like you are trying to use Scala code in a Python notebook.
To get the spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
This will print the version of Spark:
spark.version
To import the ML libraries:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
Note: This uses the spark.ml package. The spark.mllib package is the RDD-based library and is currently in maintenance mode. The primary ML library is now spark.ml (DataFrame-based).
https://spark.apache.org/docs/latest/ml-guide.html
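Once the session exists, a minimal clustering sketch with these classes could look like the following (the data and column names are made up purely for illustration):
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Toy DataFrame with two numeric feature columns (made-up data)
df = spark.createDataFrame([(1.0, 1.0), (1.5, 2.0), (9.0, 9.0), (8.5, 9.5)], ["x", "y"])

# Combine the numeric columns into a single "features" vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features_df = assembler.transform(df)

# Fit a 2-cluster KMeans model and show the cluster assignments
model = KMeans(k=2, seed=1).fit(features_df)
model.transform(features_df).show()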

DSX environments don't have Spark. When you create a new notebook, you have to decide whether it runs in one of the new environments (without Spark) or against the Spark backend.
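A quick way to check which kind of environment a notebook is running in is simply to try the import (just a sketch):
try:
    import pyspark
    print("Spark is available:", pyspark.__version__)
except ImportError:
    print("No pyspark in this environment; create the notebook against a Spark backend instead.")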

Related

PySpark session builder doesn't start

I have a problem with PySpark in a Jupyter notebook. I installed Java and Spark, added the path variables, and didn't get an error. However, when I call the builder it just keeps running and never starts the session. I waited more than 30 minutes, but it kept running. My code is below:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
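One thing worth checking (an assumption about the setup, not a confirmed fix) is whether the builder can resolve a master at all; pointing it explicitly at a local master avoids waiting on an external cluster manager:
from pyspark.sql import SparkSession

# Ask explicitly for a local master so the session does not wait on a cluster manager
spark = (SparkSession.builder
         .master("local[*]")
         .appName("Practise")
         .getOrCreate())
print(spark.version)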

Notebook to write Java jobs for Spark

I am writing my first Spark job using Java API.
I want to run it using a notebook.
I am looking into Zeppelin and Jupyter.
In the Zeppelin documentation I see support for Scala, IPySpark and SparkR. It is not clear to me whether using the two interpreters %spark.sql and %java would let me work with the Java API of Spark SQL.
Jupyter has an "IJava" kernel, but I see no Spark support for Java.
Are there other options?
@Victoriia Zeppelin 0.9.0 has a %java interpreter, with an example at zeppelin.apache.org.
I tried to start with it on Google Cloud, but had some problems...
Use the magic command %jars path/to/spark.jar in an IJava cell, as suggested by IJava's author,
then take a look at import org.apache.spark.sql.*, for example.

Spark exception 5063 in TensorFlow extended example code on GPU

I am trying to run the TensorFlow Extended example code at https://www.tensorflow.org/tfx/tutorials/transform/census on a Databricks GPU cluster.
My environment:
Databricks Runtime 7.1 ML (Spark 3.0.0, Scala 2.12, GPU)
Python 3.7
tensorflow 2.1.1
tensorflow-transform==0.22.0
apache_beam==2.21.0
When I run
transform_data(train, test, temp)
I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
It seems that this is a known issue with RDDs in Spark.
https://issues.apache.org/jira/browse/SPARK-5063
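For reference, the kind of code that triggers SPARK-5063 looks roughly like this (a contrived sketch, not taken from the TFX tutorial):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(10))

# The lambda closes over `sc`, so Spark would have to ship the SparkContext
# to the workers -- exactly what raises the SPARK-5063 exception.
nested = rdd.map(lambda x: sc.parallelize([x]).count())
# nested.collect()  # -> Exception: ...reference SparkContext from a ... transformation...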
I tried to search for solutions here, but none of them worked for me.
how to deal with error SPARK-5063 in spark
In the example code, I do not see where the SparkContext is accessed from a worker explicitly.
Is it called from Apache Beam?
Thanks

Getting : Error importing Spark Modules : No module named 'pyspark.streaming.kafka'

I have a requirement to push logs created from a PySpark script to Kafka. I am doing a POC, so I'm using the Kafka binaries on a Windows machine. My versions are: Kafka 2.4.0, Spark 3.0, and Python 3.8.1. I am using the PyCharm editor.
import sys
import logging
from datetime import datetime

try:
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
except ImportError as e:
    print("Error importing Spark Modules :", e)
    sys.exit(1)
Getting error
Error importing Spark Modules : No module named 'pyspark.streaming.kafka'
What am I missing here? Is some library missing? PySpark and Spark Streaming are otherwise working fine. I would appreciate it if someone could provide some guidance here.
Spark Streaming was deprecated as of Spark 2.4, and the pyspark.streaming.kafka module (the DStream-based Kafka integration) was removed in Spark 3.0, which is why the import fails.
You should be using Structured Streaming instead, via the pyspark.sql modules.
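For example, publishing a DataFrame of log lines to Kafka with the DataFrame API looks roughly like this (broker address, topic name, and data are placeholders, and the org.apache.spark:spark-sql-kafka-0-10 package must be on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LogsToKafka").getOrCreate()

# Hypothetical DataFrame of log messages to publish
logs = spark.createDataFrame([("app started",), ("app finished",)], ["message"])

# Kafka expects a string/binary "value" column
(logs.selectExpr("CAST(message AS STRING) AS value")
     .write
     .format("kafka")
     .option("kafka.bootstrap.servers", "localhost:9092")
     .option("topic", "logs")
     .save())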
The issue was with the versions of Python and Spark I was using.
I was using Python 3.8, which PySpark doesn't fully support, so I changed to Python 3.7. Also, Spark 3 was still in preview, so I changed to 2.4.5, and it worked.

How to have Apache Spark running on GPU?

I want to integrate Apache Spark with GPUs, but Spark works on the JVM while GPUs use CUDA/OpenCL, so how do we combine them?
It depends on what you want to do. If you want to distribute your computation across GPUs using Spark, you don't necessarily have to use Java. You could use Python (PySpark) with Numba, which has a CUDA module.
For example, you can apply the following code if you want your worker nodes to run an operation (here gpu_function) on every partition of your RDD.
rdd = rdd.mapPartitions(gpu_function)
with:
def gpu_function(x):
    ...
    input = f(x)              # build device-ready arrays from the partition
    output = ...
    gpu_cuda[grid_size, block_size](input, output)
    return output
and:
from numba import cuda

@cuda.jit("(float32[:],float32[:])")
def gpu_cuda(input, output):
    i = cuda.grid(1)              # one thread per element
    if i < input.size:
        output[i] = g(input[i])   # g: placeholder element-wise computation
I advise you to take a look at this SlideShare presentation, specifically slide 34: https://fr.slideshare.net/continuumio/gpu-computing-with-apache-spark-and-python
You only need Numba and the CUDA driver installed on every worker node.
There are a few libraries that help with this dilemma.
Databricks is working on a solution for Spark with TensorFlow that will let you use the GPUs of your cluster, or of your machine.
If you want to find out more, there is a presentation from Spark Summit Europe 2016 that shows a bit of how TensorFrames works.
There is also a post about TensorFrames on the Databricks blog.
And for more code, see the TensorFrames Git repository.
