Azure Databricks & pyspark - substring errors - python-3.x

Getting two errors with my Databricks Spark script with the following line:
df = spark.createDataFrame(pdDf).withColumn('month', substring(col('dt'), 0, 7))
The first one:
AttributeError: 'Series' object has no attribute 'substr'
and
NameError: name 'substr' is not defined
I wonder what I am doing wrong...

Turned out I had not imported pyspark.sql.functions:
from pyspark.sql.functions import *
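A minimal sketch of the fixed snippet; importing the two functions explicitly also works and keeps the namespace cleaner than a wildcard import (assuming 'dt' holds date strings like '2021-03-15'):
from pyspark.sql.functions import col, substring

# Spark's substring uses 1-based positions, so (1, 7) keeps 'yyyy-MM';
# 'dt' holding strings of that shape is an assumption here
df = spark.createDataFrame(pdDf).withColumn('month', substring(col('dt'), 1, 7))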

Related

Pyspark AttributeError: 'NoneType' object has no attribute 'split'

I am working in PySpark, using the flatMap function with split inside it, but I am getting an error which says:
AttributeError: 'NoneType' object has no attribute 'split'
I am replicating exactly what I see in a video; it works in the video, but I keep getting this error. Below is my code:
datasetfor2019.map(lambda col: col[Conditions])\
.filter(lambda x: x!='')\
.flatMap(lambda x: x.split(','))\
.map(lambda x: (x, 1))\
.reduceByKey(add)\
.sortBy(lambda x: x[1], ascending=False)\
.take(5)
I would like to know what I am doing wrong, or, if I need to import anything into my PySpark environment, what that could be.
Thanks in advance.
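Note that the filter x != '' still lets None values through (None != '' evaluates to True in Python), so .split(',') can end up being called on None, which raises exactly this AttributeError. A minimal sketch of a guard, assuming the Nones come from missing values in the source rows (and assuming from operator import add for reduceByKey):
from operator import add

datasetfor2019.map(lambda col: col[Conditions])\
    .filter(lambda x: x is not None and x != '')\
    .flatMap(lambda x: x.split(','))\
    .map(lambda x: (x, 1))\
    .reduceByKey(add)\
    .sortBy(lambda x: x[1], ascending=False)\
    .take(5)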

Unable to use pyspark udf

I am trying to format the strings in one of the columns using a PySpark UDF.
Below is my dataset:
+--------------------+--------------------+
| artists| id|
+--------------------+--------------------+
| ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
| ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
| ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
| ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
| ['Meetya']|06NUxS2XL3efRh0bl...|
| ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
| ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
| ['Justrock']|0DH1IROKoPK5XTglU...|
| ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
| ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+
And code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log
session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()
df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")
# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
# withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
session.udf.register("format_str", format_column)
df.select("id",format_column(df.artists)).show(truncate=False)
# schema = t.StructType(
# [
# t.StructField("artists", t.ArrayType(t.StringType()), True),
# t.StructField("id", t.StringType(), True)
#
# ]
# )
df.show(truncate=False)
The UDF is still not complete, but this error is blocking me from going further. When I run the above code I get the error below:
<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
  File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
    df.select("id",format_column(df.artists)).show(truncate=False)
  File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
    return st.upper()
TypeError: 'Column' object is not callable
The syntax looks fine and I am not able to figure out what is wrong with the code.
You get this error because you are calling the Python function format_column instead of the registered UDF format_str.
You should be using:
from pyspark.sql import functions as F
df.select("id", F.expr("format_str(artists)")).show(truncate=False)
Moreover, because of the way you registered the UDF, you can only use it from Spark SQL (as above), not with the DataFrame API. If you want to use it with the DataFrame API you should define the function like this:
from pyspark.sql.types import StringType

format_str = F.udf(format_column, StringType())
df.select("id", format_str(df.artists)).show(truncate=False)
Or using decorator syntax:
@F.udf("string")
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
df.select("id", format_column(df.artists)).show(truncate=False)
That said, you should use Spark built-in functions (upper in this case) unless you have a specific need that can't be done using Spark functions.
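A minimal sketch of that built-in route, which skips the Python UDF round trip entirely (assuming upper-casing is really all that is needed):
df.select("id", F.upper(F.col("artists")).alias("artists")).show(truncate=False)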
Well, I see that you are using a predefined Spark-style method (upper) inside the definition of a UDF, which is understandable since you said you are starting out with some examples. Your error means that there is no upper method on a Column: because you called the plain Python function rather than a registered UDF, st received a Column object instead of a string. You can correct that error using this definition:
@f.udf("string")
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()
For example (the original answer is cut off here; a plausible usage line follows):
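# hypothetical usage; format_column is now the decorated UDF defined above
df.select("id", format_column(df.artists)).show(truncate=False)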

Join two pyspark dataframes to select all the columns from the first df and some columns from the second df

I tried importing two functions as shown below but I get an error
from pyspark.sql.functions import regexp_replace, col
df1 = sales.alias('a').join(customer.alias('b'),col('b.ID') == col('a.ID'))\
.select([col('a.'+xx) for xx in sales.columns] + col('b.others')
TypeError: 'str' object is not callable
I really don't understand what's wrong with that line of code. Thanks.
PySpark's select accepts plain string column names as well as Column objects, so there is no need to wrap everything in col here. One caveat: ID exists in both dataframes, so bare names would be ambiguous after the join; qualify them with the aliases. So you could just do this instead:
from pyspark.sql.functions import regexp_replace, col
df1 = sales.alias('a').join(customer.alias('b'),col('b.ID') == col('a.ID'))\
    .select(['a.' + c for c in sales.columns] + ['b.others'])
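If you would rather keep the Column-object approach from the question, the immediate bugs there are the missing brackets around col('b.others') (a list and a Column cannot be concatenated) and the unclosed parenthesis; the TypeError: 'str' object is not callable itself usually means the name col was rebound to a string earlier in the session. A corrected sketch:
from pyspark.sql.functions import col

df1 = sales.alias('a').join(customer.alias('b'), col('b.ID') == col('a.ID'))\
    .select([col('a.' + xx) for xx in sales.columns] + [col('b.others')])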

How to get original Python error from a Py4JJavaError raised in a PySpark UDF

I am using PySpark UDFs to execute code on a Spark worker. If an exception is raised in the UDF this is wrapped in a Py4JJavaError and re-raised in Python. In order to process the error correctly I need the original error. Is there a way to get it from the Py4JJavaError?
The string representation of the original error is printed as part of the stack trace, so it would be possible to get at least the type of error by parsing the trace. However, this would be tedious and error-prone.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"A": [1, 2, 3]}))
@udf
def test(x):
    raise ValueError(f"Got {x}")
df = df.withColumn("B", test("A"))
df.show()
I would expect that I can extract the error that was originally raised, or at least the name of the error and/or the error message without parsing the stack trace.
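One option, assuming Spark 3.x (the question does not say which version): Python UDF failures there surface as pyspark.sql.utils.PythonException, whose message embeds the original Python traceback, so you can catch that instead of digging through the Py4JJavaError. A minimal sketch:
from pyspark.sql.utils import PythonException  # available in Spark 3.x

try:
    df.show()
except PythonException as e:
    # the message includes the original traceback, e.g. "ValueError: Got 1"
    print(str(e))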

pyspark error: 'DataFrame' object has no attribute 'map'

I am using pyspark 2.0 to create a DataFrame object by reading a csv using:
data = spark.read.csv('data.csv', header=True)
I find the type of the data using
type(data)
The result is
pyspark.sql.dataframe.DataFrame
I am trying to convert the some columns in data to LabeledPoint in order to apply a classification.
from pyspark.sql.types import *
from pyspark.sql.functions import loc
from pyspark.mllib.regression import LabeledPoint
data.select(['label','features']).
map(lambda row:LabeledPoint(row.label, row.features))
I came across this problem:
AttributeError: 'DataFrame' object has no attribute 'map'
Any idea on the error? Is there a way to generate a LabeledPoint from the DataFrame in order to perform classification?
Use .rdd.map:
>>> data.select(...).rdd.map(...)
DataFrame.map was removed in Spark 2.0; convert the DataFrame to an RDD with .rdd first.
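Putting it together for the LabeledPoint case, a minimal sketch (assuming label is numeric or a numeric string and features already holds vectors; note also that pyspark.sql.functions has no loc, so that import should be dropped):
from pyspark.mllib.regression import LabeledPoint

labeled = data.select(['label', 'features']).rdd \
    .map(lambda row: LabeledPoint(row.label, row.features))
labeled.take(1)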
