object has no attribute 'map' error in pyspark 2.4.4

I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm.
The input file contains encoded values in some columns, like the one given below.
.ʽ|!3-2-704A------------ (dotted line is space)
I am trying to get a result like
I tried the code below.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df = spark.read.csv("Customers_v01.csv", header=True, sep=",")
myres = df.map(lambda x: x[1].decode('utf-8'))
File "C:\spark\python\pyspark\sql\dataframe.py", line 1301, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
I am not sure what causes this error. Kindly help. Is there any other way to do it?

map is available on the Resilient Distributed Dataset (RDD), not on the DataFrame. Access the DataFrame's underlying RDD through df.rdd:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Python Spark").getOrCreate()
df = spark.read.csv("Customers_v01.csv", header=True, sep=",", encoding='utf-8')
myres = df.rdd.map(lambda x: x[1].encode().decode('utf-8'))
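The difference between calling map on the DataFrame and on its RDD can be sketched as follows. This is a minimal sketch rather than the poster's final code: decode_cell and decode_column are hypothetical helper names, and it assumes a cell may arrive as bytes, as already-decoded text, or as None.

```python
def decode_cell(value, encoding="utf-8"):
    """Return text for a cell that may be bytes, text, or None."""
    if value is None:
        return None
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_column(df, index, encoding="utf-8"):
    """Map over the DataFrame's underlying RDD; DataFrame itself has no map()."""
    return df.rdd.map(lambda row: decode_cell(row[index], encoding))
```

With the df from the answer above, decode_column(df, 1).collect() would return the decoded values of the second column as a Python list.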


Errors Loading csv file with spark-submit

I am new to PySpark. My jobs run smoothly in a Jupyter notebook, but I am having issues loading a CSV file when I run the job with spark-submit.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    # load csv file
    netflix_df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
The above code works perfectly in a Jupyter notebook, but it does not work when I save the same code in a Python file and run it with spark-submit.
I get the following error:
NameError: name 'spark' is not defined
When I replace spark.read.format("csv") with sc.read.format("csv"),
I get the following error:
AttributeError: 'SparkContext' object has no attribute 'read'
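You need to create a Spark session. The pyspark shell (and notebooks set up on top of it) predefine spark; a script run with spark-submit has to create it itself.
import pyspark
from pyspark.sql import SparkSession
# replace "local[1]" with a suitable master
spark = SparkSession.builder \
    .master("local[1]") \
    .getOrCreate()
# now you can use spark.read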
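Put together, a script meant for spark-submit could look roughly like this. It is only a sketch: build_session and load_csv are hypothetical helper names, and the CSV options are the ones from the question.

```python
def build_session(app_name="app"):
    """Create (or reuse) the SparkSession that a notebook would predefine."""
    # Imported here so that load_csv() can be imported without Spark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def load_csv(spark, path):
    """Read a CSV file with a header row, letting Spark infer column types."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
    )
```

In the script body this would be used as spark = build_session("app") followed by netflix_df = load_csv(spark, path), where path is the location of your CSV file.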

Calling JSON data from Restapi in pyspark throwing Error

I am trying to query a REST API and load the data into a DataFrame using PySpark,
but it throws this error:
File "C:/Users/QueryRestapi.py", line 30, in <module>
df = parse_dataframe(json_data)
File "C:/Users/QueryRestapi.py", line 22, in parse_dataframe
rdd = SparkContext.parallelize(mylist)
TypeError: unbound method parallelize() must be called with SparkContext instance as first argument (got list instance instead)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from urllib import urlopen
import json

spark = SparkSession \
    .builder \
    .appName("DataCleansing") \
    .getOrCreate()

def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    rdd = SparkContext.parallelize(mylist)
    df = sqlContext.jsonRDD(rdd)
    return df
url = "https://"mylink"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
df = parse_dataframe(json_data)
Please help me if I am missing something. Thanks a lot.
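The TypeError in the traceback comes from calling parallelize on the SparkContext class itself rather than on an instance; a SparkSession exposes its instance as spark.sparkContext. A minimal sketch of that change, assuming a SparkSession is passed in and the API returns a JSON array of objects. The names to_json_lines and parse_dataframe(spark, ...) are illustrative, and spark.read.json is swapped in for sqlContext.jsonRDD, which Spark 2.x no longer provides.

```python
import json


def to_json_lines(json_list):
    """Serialize each object as its own JSON document (one string per object)."""
    return [json.dumps(obj) for obj in json_list]


def parse_dataframe(spark, json_list):
    """Turn a list of JSON-compatible objects into a Spark DataFrame."""
    # parallelize is an instance method: call it on the session's context,
    # not on the SparkContext class.
    rdd = spark.sparkContext.parallelize(to_json_lines(json_list))
    # spark.read.json accepts an RDD of JSON strings.
    return spark.read.json(rdd)
```

With the session from the question, the last line of the script would become df = parse_dataframe(spark, json_data).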

Reading Excel (.xlsx) file in pyspark

I am trying to read a .xlsx file from a local path in PySpark.
I've written the below code:
from pyspark.shell import sqlContext
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('Planning') \
    .enableHiveSupport() \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()
df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()
TypeError: 'DataFrameReader' object is not callable
You can use pandas to read the .xlsx file and then convert the result to a Spark DataFrame.
from pyspark.sql import SparkSession
import pandas
spark = SparkSession.builder.appName("Test").getOrCreate()
# pandas infers column types itself; read_excel has no inferSchema option
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)
You could use the crealytics spark-excel package.
It needs to be added to Spark, either through its Maven coordinates or when starting the Spark shell, as below.
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
For Databricks users: add it as a library by navigating to
Cluster > 'clusterName' > Libraries > Install New, and provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.
df = spark.read
.option("dataAddress", "'Sheet1'!")
.option("header", "true")
.option("inferSchema", "true")
More options are available on the project's GitHub page.
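For PySpark, a complete read through this package might look like the sketch below. It assumes the package has been loaded (for example with --packages as shown above) and that the data source is registered under the name com.crealytics.spark.excel; read_excel_sheet and its parameters are hypothetical names used only for illustration.

```python
def read_excel_sheet(spark, path, sheet_name="Sheet1"):
    """Read one worksheet through the spark-excel data source."""
    return (
        spark.read.format("com.crealytics.spark.excel")
        # dataAddress selects the sheet, in the same form as the option above.
        .option("dataAddress", "'{}'!".format(sheet_name))
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
    )
```

Unlike the pandas route, this reads the workbook through Spark's DataFrameReader, so no intermediate pandas DataFrame is created.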

How do I make pyspark and ML (no RDD) working with large csv?

I'm working with a relatively large CSV file and trying to train a pyspark.ml.classification.LogisticRegression model on it. The code below works well when the sample file contains only a few lines (about 200). However, when I run the same code on the actual file (6e6 lines), I get a socket write exception. I've googled it but couldn't find any advice. Please help me with this exception on the large file.
This is the code that raises the exception:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import BinaryClassificationEvaluator as Evaluator
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import LogisticRegression
import warnings
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

def vectorizeData(data):
    return data.rdd.map(lambda r: [int(r[-1]), Vectors.dense(r[:-1])]).toDF(['label','features'])
in_file = "C:\\Users\\HCAOA911\\Desktop\\data\\small_sample.csv"
CV_data = spark.read.csv(in_file, header=True)
CV_data = CV_data[['step','amount','oldbalanceOrg','newbalanceOrig',
'oldbalanceDest','newbalanceDest','isFlaggedFraud', 'isFraud']]
training_data, testing_data = CV_data.randomSplit([0.8, 0.2])
xytrain = vectorizeData(training_data)
lr = LogisticRegression(regParam=0.01)
model = lr.fit(xytrain)
xytest = vectorizeData(testing_data)
predicted_train = model.transform(xytrain)
predicted_test = model.transform(xytest)
evaluator = Evaluator()
print("Train %s: %f" % (evaluator.getMetricName(), evaluator.evaluate(predicted_train)))
print("Test %s: %f" % (evaluator.getMetricName(), evaluator.evaluate(predicted_test)))
I'm working with
spark-submit --master local[*] .py
Python 3.6.4
Pyspark 2.2.1
Windows 7
Thank you in advance
I've solved the problem by using a better representation for ML models:
inside the vectorizeData() function I now use pyspark.ml.feature.VectorAssembler instead of mapping over the RDD.
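The poster's revised function is not shown, so the following is only a sketch of what a VectorAssembler-based vectorizeData() could look like. It assumes the label is the isFraud column and that every selected column is numeric; feature_columns is a hypothetical helper.

```python
def feature_columns(columns, label_col):
    """Every column except the label is treated as a feature."""
    return [c for c in columns if c != label_col]


def vectorizeData(data, label_col="isFraud"):
    """Build a (label, features) DataFrame without going through the RDD API."""
    # Imported lazily so feature_columns() can be used without Spark installed.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.functions import col

    # spark.read.csv(..., header=True) yields string columns unless a schema
    # is inferred or supplied, so cast everything to double first
    # (assumption: all selected columns are numeric).
    numeric = data.select([col(c).cast("double").alias(c) for c in data.columns])
    assembler = VectorAssembler(
        inputCols=feature_columns(data.columns, label_col),
        outputCol="features",
    )
    return assembler.transform(numeric).select(
        col(label_col).alias("label"), "features"
    )
```

The DataFrame operations here run inside Spark's JVM, whereas rdd.map with a Python lambda sends every row through a Python worker process, which may be why the RDD version struggled on the 6e6-line file.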

TypeError: 'Builder' object is not callable Spark structured streaming

When I run the Python example given in the Spark Structured Streaming programming guide,
I get the error below:
TypeError: 'Builder' object is not callable
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder()\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', 'localhost')\
    .option('port', 9999)\
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, ' ')
    ).alias('word')
)

# Generate running word count
wordCounts = words.groupBy('word').count()

# Start running the query that prints the running counts to the console
query = wordCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .start()
Error :
omkar#rudra:~/thesis/backUp$ spark-submit structured.py
Traceback (most recent call last):
File "/home/omkar/thesis/backUp/structured.py", line 8, in <module>
spark = SparkSession.builder()\
TypeError: 'Builder' object is not callable
spark = SparkSession.builder()\
Change .builder() to .builder, because builder is an attribute, not a method:
spark = SparkSession.builder\
Source : https://issues.apache.org/jira/browse/SPARK-18426
When running the Python example in the Structured Streaming Guide, you get the error:
spark = SparkSession.builder().master("local[1]").appName("Example").getOrCreate()
TypeError: 'Builder' object is not callable
This is fixed by changing .builder() to .builder
spark = SparkSession.builder.master("local[1]").appName("Demo").getOrCreate()
After removing the () after builder when creating the SparkSession, the code runs.
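Why the parentheses fail can be shown with a small pure-Python stand-in. This is not PySpark's actual implementation: Builder and Session below are hypothetical classes that only mimic the relevant shape, namely a builder object stored as an attribute instead of a method that returns one.

```python
class Builder:
    """Stand-in for the object that collects session settings."""

    def __init__(self):
        self.options = {}

    def appName(self, name):
        self.options["app.name"] = name
        return self  # returning self is what makes chaining possible


class Session:
    # A ready-made Builder instance stored as a class attribute,
    # so it is accessed as Session.builder, without parentheses.
    builder = Builder()


def call_builder():
    """Reproduce the mistake: calling the attribute as if it were a method."""
    try:
        Session.builder()
    except TypeError as exc:
        return str(exc)
    return None
```

Here Session.builder.appName("Demo") chains normally, while call_builder() returns the same message as in the traceback, because an instance of a class without __call__ cannot be called.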
