Run TensorFlow and PySpark Together in the Same Program - apache-spark

When I run a program that uses TensorFlow and PySpark together with spark-submit sample.py, it says no module named tensorflow; but when I activate the tensorflow environment and run it with python sample.py, it says no module named pyspark.
import tensorflow as tf
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.sql('''select 'spark' as hello ''')
df.show()
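One common way to get both modules visible to the same interpreter (a sketch, not from the original question): pip install pyspark inside the activated tensorflow environment, and point Spark at that environment's interpreter before the session starts. The use of sys.executable below is an assumption for illustration.
import os
import sys
# Make the driver and the executors use this environment's Python,
# so that tensorflow and pyspark resolve from the same interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
import tensorflow as tf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("select 'spark' as hello").show()
print(tf.__version__)
Run it with python sample.py from the activated environment; the same interpreter can also be handed to spark-submit through the spark.pyspark.python configuration.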

Related

Debug PySpark in VS Code

I'm building a project in PySpark using VS Code. I have Python and Spark installed, and PySpark is correctly imported and running in a Jupyter notebook. To do so, I run:
import findspark
findspark.init()
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
[my code... ]
Now, how do I debug my PySpark code in VS Code? I don't want to run findspark in my project. Do I need to create a virtual environment and run a pre-script? What are the best practices?
So the steps I followed and what I learnt:
Make sure Python, Spark, and the environment variables are correctly installed and set.
Create a virtual environment, and pip install all required libraries (including pyspark) - see the sketch after this list.
Run the debugger. The debugger will just analyze the code and not run it, so a SparkContext and a SparkSession are not required.
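As a sketch of such a setup (assuming pyspark was pip-installed into the virtual environment selected in VS Code, so no findspark call is needed), a minimal entry point to step through:
from pyspark.sql import SparkSession
# pyspark comes from the virtual environment, so findspark is not required
spark = SparkSession.builder.master("local[*]").appName("debug-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()  # set a breakpoint here and launch the VS Code debugger
spark.stop()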

How to read avro file using pyspark

I am trying to read an Avro file in a Jupyter notebook but I am facing this issue.
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
and I can't seem to figure out where to get this dependency from.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.appName("readavro").master("local").getOrCreate()
result = spark.read.format('com.databricks.spark.avro').load("file:///C:/Downloads/part-r-00000.avro")
Make sure you add the org.apache.spark:spark-avro_2.12:2.4.5 jar to your classpath.
Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter. So try
result = spark.read.format('avro').load("file:///C:/Downloads/part-r-00000.avro")
and include the avro dependency when starting the shell:
$ bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
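Putting it together in a notebook session, here is a sketch; the file path is the one from the question, and the package coordinate assumes Spark 2.4.5 built for Scala 2.12:
import findspark
findspark.init()
from pyspark.sql import SparkSession
# Pull in the external spark-avro module when the session starts
spark = (SparkSession.builder
         .appName("readavro")
         .master("local")
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:2.4.5")
         .getOrCreate())
# With the package on the classpath, the built-in short name 'avro' works
result = spark.read.format("avro").load("file:///C:/Downloads/part-r-00000.avro")
result.show()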

How to fix usage of pyspark.dbutils on Databricks when the code was developed with databricks-connect?

We developed the code using databricks-connect and used from pyspark.dbutils import DBUtils. While packaging the code into a wheel file for Databricks, it fails with the error: no module named pyspark.dbutils.
There is no pip install pyspark.dbutils available.
How can this be fixed?
DBUtils should already be available with databricks-connect, so import it conditionally with a script like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
setting = spark.conf.get("spark.master")
if "local" in setting:
    # Running through databricks-connect, which provides pyspark.dbutils
    from pyspark.dbutils import DBUtils
    dbutils = DBUtils(spark.sparkContext)
else:
    print("Do nothing - dbutils should be available already")

Not able to import sparkdl in jupyter notebook

I am trying to use the Spark Deep Learning library (https://github.com/databricks/spark-deep-learning) in a Jupyter notebook.
When I try to import sparkdl in the Jupyter notebook I get the error "no module found".
When I run the command below in the CLI,
pyspark --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11
I am able to import sparkdl in the PySpark shell and it works.
How can I use this library in jupyter notebook?
Here is the snippet that I use with PySpark 2.4. You will need an internet connection for the package to be downloaded.
# Import libraries
from pyspark.sql import SparkSession
# Create a SparkSession that pulls in the spark-deep-learning package
spark = (SparkSession
         .builder
         .config('spark.jars.packages', 'databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11')
         .getOrCreate()
         )
# Import Spark Deep Learning Pipelines
import sparkdl
You can check a couple of points:
Use %conda list | grep "sparkdl" in a Jupyter notebook cell to check whether sparkdl is installed as you expect.
Check the virtual environment: is sparkdl installed into another virtual environment than the one the notebook uses?
Hope this helps.
First, download the sparkdl jar file using the command below:
wget https://repos.spark-packages.org/databricks/spark-deep-learning/1.5.0-spark2.4-s_2.11/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar
Second, install the sparkdl PyPI package using the command below:
pip install sparkdl
Then you can use the snippet below in a Jupyter notebook:
import findspark
findspark.init()
from pyspark.conf import SparkConf
from pyspark import SparkContext
# Point spark.jars at the jar downloaded above
conf = SparkConf().set("spark.jars", "./spark-deep-learning-1.5.0-spark2.4-s_2.11.jar")
conf.setAppName("ML")
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession(sc)
import sparkdl
This solution doesn't require a web connection once you have downloaded the jar file.

PySpark in jupyter notebook using spark-csv package [duplicate]

This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts, such as this one, describing how to use spark-csv.
But I am not able to initialize the IPython instance by including either the .jar file or the package extension at start-up, the way it can be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
These properties can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:
import os
packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
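For completeness, a sketch of actually reading a file once the package has been passed this way; the data.csv path and the local master are placeholders for illustration:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext('local[*]')
sqlContext = SQLContext(sc)
# spark-csv registers the com.databricks.spark.csv data source
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferSchema='true')
      .load('data.csv'))
df.show()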
