NameError: name 'average' is not defined - apache-spark

How do I filter the data from the DataFrame and use avg to replace null values?
When running this snippet of code:
df.select(colname).agg(avg(colname))
I receive this exception:
NameError: name 'avg' is not defined
What other command can I use?

Got it. I had to use:
from pyspark.sql.functions import *
from pyspark.sql.types import *

A cleaner solution would be:
import pyspark.sql.functions as F
df.select(colname).agg(F.avg(colname))
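For the original goal of replacing null values with the column average, a minimal sketch along these lines should work (df and colname are the placeholders from the question; that colname refers to a numeric column is an assumption):
import pyspark.sql.functions as F

# compute the column average (avg ignores nulls)
avg_value = df.select(F.avg(colname)).first()[0]

# fill the nulls in that column with the computed average
df_filled = df.na.fill({colname: avg_value})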

Related

How to correctly import pyspark.sql.functions?

from pyspark.sql.functions import isnan, when, count, sum, etc.
It is very tiresome to add all of these. Is there a way to import them all at once?
You can use from pyspark.sql.functions import *, but this can lead to namespace collisions, for example PySpark's sum function shadowing Python's built-in sum function.
A safer approach: import pyspark.sql.functions as F and call F.sum.
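To make the shadowing issue concrete, here is a small sketch (the df and the "amount" column are hypothetical):
# after a star import, pyspark's sum shadows Python's built-in sum
from pyspark.sql.functions import *
sum([1, 2, 3])           # TypeError: not a string or column, instead of 6

# with an alias, the built-in is untouched and Spark functions are still available
import pyspark.sql.functions as F
sum([1, 2, 3])           # 6
df.agg(F.sum("amount"))  # Spark aggregation on a hypothetical "amount" column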
Simply:
import pyspark.sql.functions as f
And use it as:
f.sum(...)
f.to_date(...)

get_json_object fails for selectExpr() but works for select() in PySpark

I am facing a strange issue. I am trying to display values from my JSON object; it works fine with select() but it doesn't work with selectExpr(), where I get a weird error. The following is my implementation:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("JsonPractice").getOrCreate()
my_json_df = spark.range(1).selectExpr(
"""'{"sample_json":{"sample_json1":["1st_vale","2nd_val"]}}' as my_json_column""")
my_json_df.selectExpr(get_json_object(col("my_json_column"), "$.sample_json.sample_json1[1]")).show(2)
my_select_expr = get_json_object(col('my_json_column'), '$.sample_json.sample_json1')
my_json_df.selectExpr(my_select_expr).show()
I am getting the following error:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
You don't need to wrap the column in col() when using selectExpr(); it expects a SQL expression string rather than a Column object:
my_select_expr = "get_json_object(my_json_column, '$.sample_json.sample_json1')"
my_json_df.selectExpr(my_select_expr).show(10,False)
#or
my_json_df.selectExpr("get_json_object(my_json_column,'$.sample_json.sample_json1')").show(10,False)
#+-----------------------------------------------------------+
#|get_json_object(my_json_column, $.sample_json.sample_json1)|
#+-----------------------------------------------------------+
#|["1st_vale","2nd_val"] |
#+-----------------------------------------------------------+
UPDATE:
from pyspark.sql.functions import *
my_select_expr=get_json_object(col('my_json_column'),'$.sample_json.sample_json1')
my_json_df.select(my_select_expr).show(10,False)
#+-----------------------------------------------------------+
#|get_json_object(my_json_column, $.sample_json.sample_json1)|
#+-----------------------------------------------------------+
#|["1st_vale","2nd_val"] |
#+-----------------------------------------------------------+
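In short, selectExpr() takes SQL expression strings while select() takes Column objects; expr() bridges the two. A quick sketch of the three equivalent forms, reusing my_json_df from above:
from pyspark.sql.functions import col, expr, get_json_object

# Column API: pass a Column to select()
my_json_df.select(get_json_object(col("my_json_column"), "$.sample_json.sample_json1")).show(truncate=False)

# SQL-expression API: pass a string to selectExpr()
my_json_df.selectExpr("get_json_object(my_json_column, '$.sample_json.sample_json1')").show(truncate=False)

# expr() turns a SQL string into a Column usable with select()
my_json_df.select(expr("get_json_object(my_json_column, '$.sample_json.sample_json1')")).show(truncate=False)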

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark.
I used the below code snippet:
from pyspark.sql import functions as F
strRecordStartTime="1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME",
    lit(strRecordStartTime).cast("timestamp")
)
It gives me the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointers are appreciated.
Try using a Python native datetime with lit; sorry, I don't have access to a machine right now.
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created a Spark DataFrame:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi","Gaurav","Ketan","Mahesh"], StringType()).toDF("Name")
Now let's add one new column to the existing DataFrame:
from pyspark.sql.functions import lit
import dateutil.parser
yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # print the DataFrame
You can validate your schema by using the command below.
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))
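As a self-contained sketch of the same fix (the sample data and names here are made up), adding a constant date column and checking the resulting schema:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("Ravi",), ("Gaurav",)], ["Name"])

# constant date column; use .cast("timestamp") instead if a timestamp is required
df2 = df1.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit("1901-01-01")))
df2.show()
df2.printSchema()  # RECORD_START_DATE_TIME is of type date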

How to register UDF with no argument in Pyspark

I have tried a Spark UDF with a parameter using a lambda function and registered it. But how can I create a UDF with no argument and register it? I have tried this; my sample code is expected to show the current time:
from datetime import datetime
from pyspark.sql.functions import udf
def getTime():
    timevalue = datetime.now()
    return timevalue

udfGateTime = udf(getTime, TimestampType())
But PySpark is showing
NameError: name 'TimestampType' is not defined
which probably means my UDF is not registered
I was comfortable with this format:
spark.udf.register('GATE_TIME', lambda():getTime(), TimestampType())
but does a lambda function take an empty argument list? I haven't tried it, and I am a bit confused. How should I write the code to register this getTime() function?
A lambda expression can be nullary. You're just using incorrect syntax:
spark.udf.register('GATE_TIME', lambda: getTime(), TimestampType())
There is nothing special about lambda expressions in the context of Spark. You can use getTime directly:
spark.udf.register('GetTime', getTime, TimestampType())
There is no need for an inefficient UDF at all. Spark provides the required function out of the box:
spark.sql("SELECT current_timestamp()")
or
from pyspark.sql.functions import current_timestamp
spark.range(0, 2).select(current_timestamp())
I have done a bit of a tweak here and it is working well for now:
import datetime
from pyspark.sql.types import *

def getTime():
    timevalue = datetime.datetime.now()
    return timevalue

def GetVal(x):
    if True:
        timevalue = getTime()
        return timevalue

spark.udf.register('GetTime', lambda x: GetVal(x), TimestampType())
spark.sql("select GetTime('currenttime') as value").show()
Instead of 'currenttime', any value can be passed; it will still return the current date and time.
The error "NameError: name 'TimestampType' is not defined" seems to be due to the lack of:
from pyspark.sql.types import TimestampType
For more info regarding TimestampType, see this answer: https://stackoverflow.com/a/30992905/5088142
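Putting the pieces together, a minimal corrected version of the original snippet could look like this (the key fix is importing TimestampType; registering getTime directly also avoids the lambda, and spark is assumed to be an existing SparkSession):
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def getTime():
    return datetime.now()

# for the DataFrame API
udfGetTime = udf(getTime, TimestampType())

# for SQL
spark.udf.register('GetTime', getTime, TimestampType())
spark.sql("SELECT GetTime() AS value").show(truncate=False)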

PySpark PythonUDF Missing input attributes

I'm trying to use a Spark SQL DataFrame to read some data in and apply a bunch of text clean-up functions to each row.
import langid
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
from pyspark.sql import HiveContext
hsC = HiveContext(sc)
df = hsC.sql("select * from sometable")
def check_lang(data_str):
    language = langid.classify(data_str)
    # only english
    record = ''
    if language[0] == 'en':
        # probability of correctly id'ing the language greater than 90%
        if language[1] > 0.9:
            record = data_str
    return record
check_lang_udf = udf(lambda x: check_lang(x), StringType())
clean_df = df.select("Field1", check_lang_udf("TextField"))
However, when I attempt to run this I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.select.
: java.lang.AssertionError: assertion failed: Unable to evaluate PythonUDF. Missing input attributes
I've spent a good deal of time trying to gather more information on this, but I can't find anything.
As a side note, I know the code below works, but I'd like to stick with DataFrames.
removeNonEn = data.map(lambda record: (record[0], check_lang(record[1])))
I haven't tried this code, but the API docs suggest this should work:
hsC.registerFunction("check_lang", check_lang)
clean_df = df.selectExpr("Field1", "check_lang(TextField)")
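On current PySpark versions the same idea is usually written against a SparkSession rather than a HiveContext; a rough sketch, reusing check_lang from the question (df and spark are assumed to be the question's DataFrame and an existing session):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# DataFrame API: wrap the function as a UDF and pass Column objects
check_lang_udf = udf(check_lang, StringType())
clean_df = df.select(col("Field1"), check_lang_udf(col("TextField")))

# or register it and use a SQL expression string (no quotes around the column name)
spark.udf.register("check_lang", check_lang, StringType())
clean_df = df.selectExpr("Field1", "check_lang(TextField)")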

Resources