No module named pyspark Error when using generic function - apache-spark

I am building a project in the PyCharm IDE using PySpark.
Spark installed successfully and can be called easily from the command prompt.
The interpreter is also configured correctly in the project settings. I also tried pip install pyspark.
main.py looks like this:
import os
os.environ["SPARK_HOME"] = "/usr/local/spark"
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from genericFunc import genericFunction
from config import constants
spark = genericFunction.start_data_pipeline()
inputDf = genericFunction.read_json(constants.INPUT_FOLDER_PATH+"file-000.json")
inputDf1 = genericFunction.read_json(constants.INPUT_FOLDER_PATH+" file-001.json")
and the generic function module looks like this:
from pyspark.sql import SparkSession

print('w')

def start_data_pipeline():
    # setting up spark session
    '''
    This function will set the spark session and return it to the __main__
    function.
    '''
    try:
        spark = SparkSession\
            .builder\
            .appName("Nike ETL")\
            .getOrCreate()
        return spark
    except Exception as e:
        raise

def read_json(file_name):
    # setting up spark session
    '''
    This function will set the spark session and return it to the __main__
    function.
    '''
    try:
        spark = start_data_pipeline()
        spark = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true")\
            .json(file_name)
        return spark
    except Exception as e:
        raise

def load_as_csv(df, file_name):
    # setting up spark session
    '''
    This function will set the spark session and return it to the __main__
    function.
    '''
    try:
        df.repartition(1).write.format('com.databricks.spark.csv')\
            .save(file_name, header='true')
    except Exception as e:
        raise
Error:
Unresolved reference 'genericFunc'
"C:\Users\MY PC\PycharmProjects\pythonProject1\venv\Scripts\python.exe" C:/Capgemini/cv/tulsi/test-tulsi/main.py
Traceback (most recent call last):
File "C:/Capgemini/cv/tulsi/test-naveen/main.py", line 6, in <module>
from pyspark import SparkContext
ImportError: No module named pyspark
Process finished with exit code 1
Please help

The problem is that PyCharm creates its own virtual environment (venv) for the project, and that venv does not have the required packages installed, in this case pyspark. You need to point PyCharm to a Python interpreter where the packages are available.
Go to File -> Settings -> Project -> Python Interpreter
and change the Python interpreter to the one that has the packages. To find which Python your packages live under, run this in your Python shell:
>>> import os
>>> import sys
>>> os.path.dirname(sys.executable)
'C:\\Doc\\'
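As a quick sanity check, here is a minimal sketch you can run with the interpreter PyCharm is configured to use; it prints which interpreter is active and whether pyspark is visible to it (the messages are just illustrative):
import sys

# Show which interpreter is actually running this script
print("Interpreter:", sys.executable)

try:
    import pyspark
    print("pyspark found at:", pyspark.__file__)
except ImportError:
    print("pyspark is not installed in this environment; install it here "
          "(pip install pyspark) or point PyCharm at the interpreter that has it")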

Related

'SparkSession' object has no attribute 'textFile'

I am currently using SparkSession and was told that SparkContext is available within SparkSession. However, when I run the code, it shows an error that textFile does not exist on SparkSession.
Below is the code I have written:
import findspark
findspark.init()
from pyspark.sql import SparkSession, Row
import collections
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file://C:/temp").appName("SparkSQL").getOrCreate()
lines = spark.textFile('C:/Users/file.xslx')
The error is as follows:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_59944/722806425.py in <module>
----> 1 lines = spark.textFile('C:/Users/samue/bt4221_spark/exercise/week5/customer-orders.xslx')
AttributeError: 'SparkSession' object has no attribute 'textFile'
My current versions are:
findspark: 1.4.2
pyspark: 3.0.3
I don't think it is related to any version issue. Any help is greatly appreciated! :)
textFile is defined on the SparkContext class, not on SparkSession; access it through the session's sparkContext attribute:
spark.sparkContext.textFile('filepath')
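For example, a minimal sketch (the path below is a placeholder for a plain-text file; note that textFile reads text line by line and will not parse an Excel .xslx file):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# textFile lives on the SparkContext, which the session exposes as .sparkContext;
# it returns an RDD of lines.
lines = spark.sparkContext.textFile("C:/Users/file.txt")
print(lines.take(5))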

Can connect to local master but not remote, pyspark

I'm trying out this example from the Anaconda docs:
from pyspark import SparkConf
from pyspark import SparkContext
import findspark
findspark.init('/home/Snow/anaconda3/lib/python3.8/site-packages/pyspark')
conf = SparkConf()
conf.setMaster('local[*]')
conf.setAppName('spark')
sc = SparkContext(conf=conf)
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))
rdd = sc.parallelize(range(1000)).map(mod).take(10)
Locally the script runs fine, without errors. When I change the line conf.setMaster('local[*]') to conf.setMaster('spark://remote_ip:7077') I get the error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:281)
Why is this happening? I also added SPARK_MASTER_HOST=remote_ip and
SPARK_MASTER_PORT=7077 to ~/anaconda3/lib/python3.8/site-packages/pyspark/bin/load_spark_env.sh.
My Spark version is 3.0.1 and the server is running 3.0.0.
I can ping the remote_ip.

Pyspark ModuleNotFoundError: No module named 'mmlspark'

My environment: Ubuntu 64-bit, Spark 2.4.5, Jupyter Notebook.
With an internet connection this works fine and I don't get any error:
spark = SparkSession.builder \
.appName("Churn Scoring LightGBM") \
.master("local[4]") \
.config("spark.jars.packages","com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
.getOrCreate()
from mmlspark.lightgbm import LightGBMClassifier
But for use without an internet connection I downloaded the related jars beforehand (an approach recommended by the Cloudera docs):
import os
mmlspark_jars_dir = os.path.join(os.environ["SPARK_HOME"], "mmlspark_jars")
mmlspark_jars = [os.path.join(mmlspark_jars_dir, x) for x in os.listdir(mmlspark_jars_dir)]
print(mmlspark_jars)
['/home/erkan/spark/mmlspark_jars/com.jcraft_jsch-0.1.54.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.ml.spark_mmlspark_2.11-0.18.1.jar',
'/home/erkan/spark/mmlspark_jars/commons-codec_commons-codec-1.10.jar',
'/home/erkan/spark/mmlspark_jars/org.scalatest_scalatest_2.11-3.0.5.jar',
'/home/erkan/spark/mmlspark_jars/org.apache.httpcomponents_httpcore-4.4.10.jar',
'/home/erkan/spark/mmlspark_jars/org.openpnp_opencv-3.2.0-1.jar',
'/home/erkan/spark/mmlspark_jars/commons-logging_commons-logging-1.2.jar',
'/home/erkan/spark/mmlspark_jars/com.github.vowpalwabbit_vw-jni-8.7.0.2.jar',
'/home/erkan/spark/mmlspark_jars/org.apache.httpcomponents_httpclient-4.5.6.jar',
'/home/erkan/spark/mmlspark_jars/org.scala-lang_scala-reflect-2.11.12.jar',
'/home/erkan/spark/mmlspark_jars/org.scala-lang.modules_scala-xml_2.11-1.0.6.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.cntk_cntk-2.4.jar',
'/home/erkan/spark/mmlspark_jars/io.spray_spray-json_2.11-1.3.2.jar',
'/home/erkan/spark/mmlspark_jars/org.scalactic_scalactic_2.11-3.0.5.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.ml.lightgbm_lightgbmlib-2.2.350.jar']
And I had to modify SparkSession like this:
spark = SparkSession.builder \
.appName("Churn Scoring LightGBM") \
.master("local[4]") \
.config("spark.jars", ",".join(mmlspark_jars)) \
.getOrCreate()
From the terminal output everything seemed fine and the SparkSession was created. I also checked the Spark UI.
Then I tried the import:
from mmlspark.lightgbm import LightGBMClassifier
And got this error:
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-10-df498625321c> in <module>
----> 1 from mmlspark.lightgbm import LightGBMClassifier
ModuleNotFoundError: No module named 'mmlspark'
I don't understand why the import fails with the second method, even though I see the same jars in the Spark UI.

ValueError: Cannot run multiple SparkContexts at once in spark with pyspark

I am new to Spark and I am trying to run this code in PySpark:
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
but it gives me this error message:
Using Python version 3.5.2 (default, Jul 5 2016 11:41:13)
SparkSession available as 'spark'.
>>> from pyspark import SparkConf, SparkContext
>>> import collections
>>> conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
>>> sc = SparkContext(conf = conf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\python\pyspark\context.py", line 275, in _ensure_initialized
callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by getOrCreate at C:\spark\bin\..\python\pyspark\shell.py:43
>>>
I have Spark 2.1.1 and Python 3.5.2. I searched and found that the problem is with sc, which could not be created, but not why. Can anyone help?
You can try this:
sc = SparkContext.getOrCreate()
You can try:
sc = SparkContext.getOrCreate(conf=conf)
Your previous session is still running. You can stop it with
sc.stop()
This works in JupyterLab as well, but because the previous session is still active and local mode cannot run two contexts at once, use
sc = SparkContext.getOrCreate(conf=conf)
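Putting the suggestions above together, a minimal sketch (the master and app name are just the ones from the question):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")

# Reuse the SparkContext the PySpark shell already created instead of
# constructing a second one.
sc = SparkContext.getOrCreate(conf=conf)

# Or, to start over with a fresh context, stop the existing one first:
# sc.stop()
# sc = SparkContext(conf=conf)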

jupyter notebook NameError: name 'sc' is not defined

I am using a Jupyter notebook with PySpark, and my first command was:
rdd = sc.parallelize([2, 3, 4])
Then it showed:
NameError Traceback (most recent call last)
<ipython-input-1-c540c4a1d203> in <module>()
----> 1 rdd = sc.parallelize([2, 3, 4])
NameError: name 'sc' is not defined.
How do I fix this error that 'sc' is not defined?
Have you initialized the SparkContext?
You could try this:
# Initializing PySpark
from pyspark import SparkContext, SparkConf

# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
Try this:
import findspark
findspark.init()

import pyspark  # only run after findspark.init()
from pyspark import SparkContext, SparkConf

# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
myrdd = sc.parallelize([('roze', 60), ('Mary', 80), ('stella', 34)])
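As an alternative sketch, assuming a reasonably recent PySpark, you can also build a SparkSession and take the SparkContext from it (the app name here is just an example):
from pyspark.sql import SparkSession

# Create (or reuse) a session, then grab its underlying SparkContext
spark = SparkSession.builder.appName("sample_app").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([2, 3, 4])
print(rdd.collect())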
