object has no attribute 'map' error in pyspark 2.4.4 - apache-spark

I am running Spark 2.4.4 with Python 2.7; the IDE is PyCharm.
The input file contains encoded values in some columns, like the one below.
.ʽ|!3-2-704A------------ (dotted line is space)
I am trying to get a result like
3-2-704A
I tried the code below.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv("Customers_v01.csv", header=True, sep=",")
myres = df.map(lambda x: x[1].decode('utf-8'))
print(myres.collect())
Error:
myres = df.map(lambda x :x[1].decode('utf-8'))
File "C:\spark\python\pyspark\sql\dataframe.py", line 1301, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
I am not sure what causes this error. Kindly help. Is there any other way to do it?

map is available on an RDD (Resilient Distributed Dataset), not on a DataFrame, so convert the DataFrame with .rdd first:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark").getOrCreate()

df = spark.read.csv("Customers_v01.csv", header=True, sep=",", encoding='utf-8')
# map is defined on the underlying RDD, so go through df.rdd
myres = df.rdd.map(lambda x: x[1].encode().decode('utf-8'))
print(myres.collect())
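If the goal is only to extract the 3-2-704A-style value, a DataFrame-only alternative avoids the RDD entirely. A sketch, assuming the target is the second column and the wanted value always looks like digits and letters joined by hyphens (the regex and the 'cleaned' column name are illustrative, not from the answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("Python Spark").getOrCreate()
df = spark.read.csv("Customers_v01.csv", header=True, sep=",", encoding='utf-8')

# second column of the file, as in x[1] from the question
second_col = df.columns[1]

# pull out the first substring that looks like 3-2-704A
cleaned = df.withColumn("cleaned", regexp_extract(col(second_col), r"(\d+-\d+-\w+)", 1))
cleaned.select("cleaned").show(truncate=False)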

Related

Errors Loading csv file with spark-submit

I am new to PySpark. I have been running jobs in a Jupyter notebook, where everything runs smoothly, but I am having issues loading a CSV file when running with spark-submit.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)

    # load csv file
    netflix_df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("netflix_titles.csv")
The above code works perfectly in the Jupyter notebook, but it doesn't work when I save the same code in a Python file and run it with spark-submit.
I get the following error:
NameError: name 'spark' is not defined
When I replace spark.read.format("csv") with sc.read.format("csv"),
I get the following error:
AttributeError: 'SparkContext' object has no attribute 'read'
You need to create a SparkSession.
import pyspark
from pyspark.sql import SparkSession

# replace the master with a value suitable for your environment
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("demo") \
    .getOrCreate()

# now you can use spark.read
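Putting it together, a minimal sketch of a self-contained script that runs under spark-submit (the local[1] master and the show() call are illustrative additions, not from the question):

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("app") \
        .getOrCreate()

    # load the csv file through the session, not through a SparkContext
    netflix_df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("netflix_titles.csv")

    netflix_df.show(5)
    spark.stop()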

Calling JSON data from Restapi in pyspark throwing Error

I am trying to query a REST API to get data into a DataFrame using PySpark,
but it is throwing the following error:
File "C:/Users/QueryRestapi.py", line 30, in <module>
df = parse_dataframe(json_data)
File "C:/Users/QueryRestapi.py", line 22, in parse_dataframe
rdd = SparkContext.parallelize(mylist)
TypeError: unbound method parallelize() must be called with SparkContext instance as first argument (got list instance instead)
Code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from urllib import urlopen
import json

spark = SparkSession \
    .builder \
    .appName("DataCleansing") \
    .getOrCreate()

def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    rdd = SparkContext.parallelize(mylist)
    df = sqlContext.jsonRDD(rdd)
    return df

url = "https://mylink"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
df = parse_dataframe(json_data)
Please help me if I am missing something. Thanks a lot.
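The immediate cause is that parallelize is an instance method being called on the SparkContext class itself. A minimal sketch of the likely fix (not from the thread; it reuses the spark session defined above and replaces the deprecated sqlContext.jsonRDD with spark.read.json):

def parse_dataframe(json_data):
    # one JSON object per line, as strings
    json_lines = [json.dumps(record) for record in json_data]
    # call parallelize on the session's SparkContext instance, not on the class
    rdd = spark.sparkContext.parallelize(json_lines)
    # spark.read.json accepts an RDD of JSON strings
    return spark.read.json(rdd)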

Reading Excel (.xlsx) file in pyspark

I am trying to read a .xlsx file from a local path in PySpark.
I've written the code below:
from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local') \
    .appName('Planning') \
    .enableHiveSupport() \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()
Error:
TypeError: 'DataFrameReader' object is not callable
You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.
from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

# read the sheet with pandas; Spark infers the schema from the pandas dtypes
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)
df.show()
You could use the crealytics spark-excel package.
You need to add it to Spark, either via Maven coordinates or when starting the Spark shell, as below.
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
For Databricks users: add it as a library by navigating to
Cluster - 'clusterName' - Libraries - Install New - and provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("dataAddress", "'Sheet1'!") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(r"C:\P_DATA\tyco_93_A.xlsx")
More options are available on the GitHub page below.
https://github.com/crealytics/spark-excel

How do I make pyspark and ML (no RDD) working with large csv?

I'm working with a relatively large CSV file and trying to train a pyspark.ml.classification.LogisticRegression model with it. The code below works well if a sample file contains only a few lines (about 200). However, if I run the same code with the actual, relatively large file (about 6e6 lines), I get a socket write exception. I've googled it but couldn't find any advice. Please help me with this exception on the large file:
This is the code that gives the Exception:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import BinaryClassificationEvaluator as Evaluator
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import LogisticRegression
import warnings

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
warnings.filterwarnings('ignore')

def vectorizeData(data):
    return data.rdd.map(lambda r: [int(r[-1]), Vectors.dense(r[:-1])]).toDF(['label', 'features'])

in_file = "C:\\Users\\HCAOA911\\Desktop\\data\\small_sample.csv"
CV_data = spark.read.csv(in_file, header=True)
CV_data = CV_data[['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                   'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud', 'isFraud']]

training_data, testing_data = CV_data.randomSplit([0.8, 0.2])
xytrain = vectorizeData(training_data)
lr = LogisticRegression(regParam=0.01)
model = lr.fit(xytrain)

xytest = vectorizeData(testing_data)
predicted_train = model.transform(xytrain)
predicted_test = model.transform(xytest)
evaluator = Evaluator()

print("Train %s: %f" % (evaluator.getMetricName(), evaluator.evaluate(predicted_train)))
print("Test %s: %f" % (evaluator.getMetricName(), evaluator.evaluate(predicted_test)))
I'm working with
spark-submit --master local[*] .py
Python 3.6.4
Pyspark 2.2.1
Windows 7
Thank you in advance
I solved the problem by using a better feature representation for the ML model:
https://github.com/iarroyof/dummy_fraud_detection/blob/master/fraud_pysparkML_test.py
In that example, I used pyspark.ml.feature.VectorAssembler inside a function called vectorizeData().
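For reference, a minimal sketch (not the linked code, and assuming the column names from the question) of vectorizeData() rebuilt around VectorAssembler, which keeps the work in the DataFrame API instead of a per-row Python lambda:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

feature_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']

def vectorizeData(data):
    # CSV columns are read as strings, so cast them to numeric types first
    casted = [col(c).cast('double') for c in feature_cols]
    casted.append(col('isFraud').cast('int').alias('label'))
    data = data.select(casted)
    # assemble the feature columns into a single 'features' vector column
    assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    return assembler.transform(data).select('label', 'features')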

TypeError: 'Builder' object is not callable Spark structured streaming

On running the example given in the programming guide for Python Spark Structured Streaming
(http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html),
I get the error below:
TypeError: 'Builder' object is not callable
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder()\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', 'localhost')\
    .option('port', 9999)\
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, ' ')
    ).alias('word')
)

# Generate running word count
wordCounts = words.groupBy('word').count()

# Start running the query that prints the running counts to the console
query = wordCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .start()

query.awaitTermination()
Error :
omkar#rudra:~/thesis/backUp$ spark-submit structured.py
Traceback (most recent call last):
File "/home/omkar/thesis/backUp/structured.py", line 8, in <module>
spark = SparkSession.builder()\
TypeError: 'Builder' object is not callable
For
spark = SparkSession.builder()\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()
change .builder() to .builder, as in:
spark = SparkSession.builder\
    .appName("StructuredNetworkWordCount")\
    .getOrCreate()
Source : https://issues.apache.org/jira/browse/SPARK-18426
When running the Python example in the Structured Streaming guide, I get the error:
spark = SparkSession.builder().master("local[1]").appName("Example").getOrCreate()
TypeError: 'Builder' object is not callable
This is fixed by changing .builder() to .builder:
spark = SparkSession.builder.master("local[1]").appName("Demo").getOrCreate()
After removing the () after builder when creating the SparkSession, the code runs.
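The underlying reason is that SparkSession.builder is already a Builder instance rather than a method, so adding () tries to call that instance. A quick way to see this (a sketch; the exact printed class name varies by Spark version):

from pyspark.sql import SparkSession

print(type(SparkSession.builder))   # a Builder instance, e.g. <class 'pyspark.sql.session.Builder'>
# SparkSession.builder()            # raises TypeError: 'Builder' object is not callable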
