I have a problem with PySpark in a Jupyter notebook. I installed Java and Spark, added the path variables, and didn't get any errors. However, when I call the builder it just keeps running and the session never starts. I waited more than 30 minutes, but it kept running. The code is below:
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Practise').getOrCreate()
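One thing that sometimes helps narrow this down (a sketch only, not a guaranteed fix) is to make the intended local mode explicit, so the builder is not left waiting on any cluster manager:

from pyspark.sql import SparkSession

# Run the driver and executors in a single local JVM; no cluster manager is contacted
spark = (SparkSession.builder
         .master("local[*]")
         .appName('Practise')
         .getOrCreate())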
I'm launching pySpark sessions with the following code:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import *
spark = SparkSession.builder.getOrCreate()
I've noticed that if a notebook is running a pySpark query, and a second notebook tries to start a Spark session, the second Spark session will not start until the first one has finished (i.e. the first session is taking all the resources).
Is there some way to limit the resources of a Spark session or parallelize multiple sessions somehow?
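One knob that can help (a sketch; the right settings depend on your cluster manager, and the values here are placeholders) is to cap what each session asks for when it is created:

from pyspark.sql import SparkSession

# Placeholders: spark.cores.max applies to the standalone/Mesos schedulers,
# and dynamic allocation needs the external shuffle service (or shuffle
# tracking in Spark 3+) to hand idle executors back.
spark = (SparkSession.builder
         .appName("notebook-1")
         .config("spark.cores.max", "4")
         .config("spark.executor.memory", "2g")
         .config("spark.dynamicAllocation.enabled", "true")
         .getOrCreate())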
Not long ago, when I typed pyspark in my terminal, it would eventually drop into the interactive shell, like this:
some information
>>>
But now it starts Jupyter Notebook automatically.
This happened with spark-3.0.0-preview2-bin-hadoop3.2, and I have used many versions of Spark.
Is this behavior due to an error in my configuration or due to a change between Spark versions?
Thanks for your help.
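For context, pyspark launching Jupyter is usually driven by the driver environment variables rather than by the Spark release itself; a quick diagnostic sketch (assuming the standard variable names) is to print what is currently set:

import os

# If PYSPARK_DRIVER_PYTHON is set to "jupyter" (often together with
# PYSPARK_DRIVER_PYTHON_OPTS="notebook"), the pyspark launcher starts
# Jupyter instead of the plain >>> shell.
for var in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var))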
I have a requirement to push logs created by a PySpark script to Kafka. I am doing a POC, so I am using the Kafka binaries on a Windows machine. My versions are Kafka 2.4.0, Spark 3.0, and Python 3.8.1. I am using the PyCharm editor.
import sys
import logging
from datetime import datetime
try:
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
except ImportError as e:
    print("Error importing Spark Modules :", e)
    sys.exit(1)
I am getting this error:
Error importing Spark Modules : No module named 'pyspark.streaming.kafka'
What am I missing here? Is a library missing? PySpark and Spark Streaming otherwise work fine. I would appreciate it if someone could provide some guidance here.
Spark Streaming (the DStream API) was deprecated as of Spark 2.4, and the pyspark.streaming.kafka module was removed in Spark 3.0, which is why the import fails.
You should be using Structured Streaming instead, via the pyspark.sql modules.
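A minimal sketch of the Structured Streaming route for pushing log lines to Kafka (the paths, topic, and broker address are placeholders, and it assumes the org.apache.spark:spark-sql-kafka-0-10 package is on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logs-to-kafka").getOrCreate()  # placeholder app name

# Stream log lines from a directory; the text source yields a single string
# column named "value", which is exactly what the Kafka sink expects.
logs = spark.readStream.format("text").load("/path/to/log/dir")  # placeholder path

query = (logs.writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")    # placeholder broker
             .option("topic", "pyspark-logs")                        # placeholder topic
             .option("checkpointLocation", "/tmp/kafka-checkpoint")  # required by the Kafka sink
             .start())

query.awaitTermination()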
The issue was with the Python and Spark versions I was using.
I was using Python 3.8, which PySpark did not yet fully support, so I changed to 3.7. Also, Spark 3 was still in preview, so I changed it to 2.4.5, and it worked.
I'm trying to import the KMeans and Vectors classes from spark.mllib. The platform is IBM Cloud (DSX) with Python 3.5 and a Jupyter notebook.
I've tried:
import org.apache.spark.mllib.linalg.Vectors
import apache.spark.mllib.linalg.Vectors
import spark.mllib.linalg.Vectors
I've found several examples/tutorials where the first import works for the author. I was able to confirm that the Spark library itself isn't loaded in the environment. Normally, I would download the package and then import it, but being new to VMs, I'm not sure how to make that happen.
I've also tried pip install spark without luck. It throws an error that reads:
The following command must be run outside of the IPython shell:
$ pip install spark
The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.
But this is a VM where I don't see any way to access the CLI externally.
I did find this, but I don't think I have a mismatch problem -- the issue of importing into DSX is covered, but I can't quite interpret it for my situation.
I think this is the actual issue I'm having, but it is about SparkR and not Python.
It looks like you are trying to use Scala code in a Python notebook.
To get the spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
This will print the version of Spark:
spark.version
To import the ML libraries:
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
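For completeness, a minimal sketch of how these pieces fit together (the data and column names are made up for illustration):

# Toy data; VectorAssembler packs the numeric columns into the single
# "features" vector column that KMeans expects.
df = spark.createDataFrame([(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features = assembler.transform(df)

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(features)
print(model.clusterCenters())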
Note: This uses the spark.ml package. The spark.mllib package is the RDD-based library and is currently in maintenance mode. The primary ML library is now spark.ml (DataFrame-based).
https://spark.apache.org/docs/latest/ml-guide.html
DSX environments don't have Spark. When you create a new notebook, you have to decide whether it runs in one of the new environments, without Spark, or in the Spark backend.
I am developing a Spark application using pyspark shell.
I started the IPython notebook service using the command below; see here for how I created the profile:
IPYTHON_OPTS="notebook --port 8889 --profile pyspark" pyspark
Based on the documentation, there is an sc SparkContext object already created for me with some default configuration.
"In the PySpark shell, a special interpreter-aware SparkContext is
already created for you, in the variable called sc. Making your own
SparkContext will not work."
I basically have two questions here:
(1) How can I get a summary of the configuration for the default sc object?
I want to know how much memory has been allocated, how many cores I can use, etc. However, I only found a method called getLocalProperty on the sc object in the PySpark API, and I don't know which key argument I should pass.
(2) Is it possible to modify the SparkContext when working with the IPython notebook? If the configuration cannot be modified once the IPython notebook has started, is there a file somewhere that I can use to configure sc?
I am fairly new to Spark; the more information (resources) you can provide, the better. Thanks!
It is not required to use the pyspark shell: you can import the pyspark classes and then instantiate the SparkContext yourself.
from pyspark import SparkContext, SparkConf
Set up your custom config:
conf = SparkConf().setAppName(appName).setMaster(master)
# set values into conf here ..
sc = SparkContext(conf=conf)
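As for the first question, the configuration actually in effect can be read back off the context itself; a minimal sketch:

# Every key/value pair the running context was created with
for key, value in sc.getConf().getAll():
    print(key, "=", value)

print(sc.defaultParallelism)  # rough indication of how many cores are in play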
You may also want to look at the general spark-env.sh:
conf/spark-env.sh.template # copy to conf/spark-env.sh and then modify values as useful to you
e.g. some of the values you may customize:
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append