'SparkSession' object has no attribute 'textFile' - apache-spark

I am currently using SparkSession and was told that SparkContext is contained within SparkSession. However, when I run the code, it shows an error that the attribute does not exist on SparkSession.
Below is the code I have written:
import findspark
findspark.init()
from pyspark.sql import SparkSession, Row
import collections
spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file://C:/temp").appName("SparkSQL").getOrCreate()
lines = spark.textFile('C:/Users/file.xslx')
The error is as follows:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_59944/722806425.py in <module>
----> 1 lines = spark.textFile('C:/Users/samue/bt4221_spark/exercise/week5/customer-orders.xslx')
AttributeError: 'SparkSession' object has no attribute 'textFile'
My current version of
findspark: 1.4.2
pyspark: 3.0.3
I don't think it's related to any version issue. Any help is greatly appreciated! :)

textFile is present in the SparkContext class, not in SparkSession. Access it through the session's underlying sparkContext:
spark.sparkContext.textFile('filepath')
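For example, a minimal sketch of the fix, assuming a plain-text input at an illustrative path (textFile reads line-oriented text, so an .xslx workbook would need a different reader):
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# textFile lives on the SparkContext that the session wraps
lines = spark.sparkContext.textFile('C:/Users/file.txt')  # illustrative path
print(lines.count())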

Related

No module named pyspark Error when using generic function

I am building a project in the PyCharm IDE using pyspark.
Spark installed successfully and can be called easily from the command prompt.
The interpreter is also configured correctly in the project settings. I also tried pip install pyspark.
The main.py looks like:
import os
os.environ["SPARK_HOME"] = "/usr/local/spark"
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from genericFunc import genericFunction
from config import constants
spark = genericFunction.start_data_pipeline()
inputDf = genericFunction.read_json(constants.INPUT_FOLDER_PATH+"file-000.json")
inputDf1 = genericFunction.read_json(constants.INPUT_FOLDER_PATH+" file-001.json")
and the generic function looks like:
from pyspark.sql import SparkSession
print('w')
def start_data_pipeline():
    #setting up spark session
    '''
    This function will set the spark session and return it to the __main__
    function.
    '''
    try:
        spark = SparkSession\
            .builder\
            .appName("Nike ETL")\
            .getOrCreate()
        return spark
    except Exception as e:
        raise

def read_json(file_name):
    #setting up spark session
    '''
    This function will set the spark session and return it to the __main__
    function.
    '''
    try:
        spark = start_data_pipeline()
        spark = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true")\
            .json(file_name)
        return spark
    except Exception as e:
        raise

def load_as_csv(df, file_name):
    #setting up spark session
    '''
    This function will set the spark session and return it to the __main__
    function.
    '''
    try:
        df.repartition(1).write.format('com.databricks.spark.csv')\
            .save(file_name, header = 'true')
    except Exception as e:
        raise
Error:
Unresolved reference 'genericFunc'
"C:\Users\MY PC\PycharmProjects\pythonProject1\venv\Scripts\python.exe" C:/Capgemini/cv/tulsi/test-tulsi/main.py
Traceback (most recent call last):
File "C:/Capgemini/cv/tulsi/test-naveen/main.py", line 6, in <module>
from pyspark import SparkContext
ImportError: No module named pyspark
Process finished with exit code 1
Please help
The problem is that PyCharm creates its own virtual environment (venv) before running a Python project, and that venv does not have the packages installed - in this case pyspark. So you need to point PyCharm to the correct Python interpreter where the packages are available.
Go to File -> Settings -> Project -> Python Interpreter
and change the Python interpreter to the one that has the packages. To find that Python, run this in your Python shell:
>>> import os
>>> import sys
>>> os.path.dirname(sys.executable)
'C:\\Doc\\'
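As a quick sanity check after switching the interpreter, a tiny sketch that just confirms pyspark is importable from the interpreter PyCharm now uses:
# If this runs without ImportError, the interpreter can see pyspark
import pyspark
print(pyspark.__version__)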

Cannot from pandas import Dataframe

from pandas import Dataframe
ImportError Traceback (most recent call last)
in ()
----> 1 from pandas import Dataframe
ImportError: cannot import name 'Dataframe'
I understand there are workarounds, but I need to do this for an assignment. I am using Jupyter with Python 3.6.
Thanks in advance.
from pandas import DataFrame
Note the capitalization of DataFrame.
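A minimal example with the corrected import (the data values are only illustrative):
from pandas import DataFrame  # capital D, capital F

df = DataFrame({'name': ['a', 'b'], 'score': [1, 2]})
print(df)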

AttributeError: 'SQLContext' object has no attribute 'jsonFile'

I ran into this problem on CentOS 7.0 with Spark 2.1.0 when performing the following actions. I am new to Spark. How can I fix it?
>>> from pyspark.sql import SQLContext
>>> ssc = SQLContext(sc)
>>> df = ssc.jsonFile('file:///root/work/person.json')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SQLContext' object has no attribute 'jsonFile'
Use SparkSession with newer versions of Spark and read the file with
df = spark.read.json('path to json')
jsonFile has been deprecated; use sqlContext.read.json instead.
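For example, a minimal sketch using the path from the question and the SparkSession reader:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("readJsonExample").getOrCreate()

# spark.read.json replaces the removed SQLContext.jsonFile
df = spark.read.json('file:///root/work/person.json')
df.show()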

jupyter notebook NameError: name 'sc' is not defined

I am using a Jupyter notebook with pyspark, and my first command was:
rdd = sc.parallelize([2, 3, 4])
Then it showed:
NameError Traceback (most recent call last)
<ipython-input-1-c540c4a1d203> in <module>()
----> 1 rdd = sc.parallelize([2, 3, 4])
NameError: name 'sc' is not defined.
How do I fix this "'sc' is not defined" error?
Have you initialized the SparkContext?
You could try this:
# Initializing PySpark
from pyspark import SparkContext, SparkConf

# Spark Config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
Or try this:
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark import SparkContext, SparkConf
# Spark Config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
myrdd = sc.parallelize([('roze', 60), ('Mary', 80), ('stella', 34)])

module 'pyspark_csv' has no attribute 'csvToDataframe'

I am new to Spark and facing an error while converting a .csv file to a dataframe. I am using the pyspark_csv module for the conversion, but it gives an error saying "module 'pyspark_csv' has no attribute 'csvToDataframe'".
Here is my code:
import findspark
findspark.init()
findspark.find()
import pyspark
sc=pyspark.SparkContext(appName="myAppName")
sqlCtx = pyspark.SQLContext
#csv to dataframe
sc.addPyFile('/usr/spark-1.5.0/python/pyspark_csv.py')
sc.addPyFile('https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py')
import pyspark_csv as pycsv
#skipping the header
def skip_header(idx, iterator):
    if(idx == 0):
        next(iterator)
    return iterator
#loading the dataset
data=sc.textFile('gdeltdata/20160427.CSV')
data_header = data.first()
data_body = data.mapPartitionsWithIndex(skip_header)
data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError Traceback (most recent call last)
<ipython-input-10-8e47cd9759e6> in <module>()
----> 1 data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError: module 'pyspark_csv' has no attribute 'csvToDataframe'
As mentioned at https://github.com/seahboonsiew/pyspark-csv, the method is named csvToDataFrame, with a capital F in Frame, not csvToDataframe.
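For illustration, the failing line from the question with only the method name corrected; this assumes the rest of the question's setup, and that sqlctx is an instantiated SQLContext (the snippet above assigns the bare SQLContext class, not an instance):
# same arguments as in the question; only the casing of the method name changes
data_df = pycsv.csvToDataFrame(sqlctx, data_body, sep=",", columns=data_header.split('\t'))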
