Is there an inverse function for pyspark's expr? - apache-spark

I know there's a function called expr that turns your spark sql into a spark column with that expression:
>>> from pyspark.sql import functions as F
>>> F.expr("length(name)")
Column<b'length(name)'>
Is there a function that does the opposite - turn your Column into a pyspark SQL string? Something like:
>>> F.inverse_expr(F.length(F.col('name')))
'length(name)'
I found out that Column's __repr__ gives you an idea of what the column expression is (like Column<b'length(name)'>), but it doesn't seem to be usable programmatically without some hacky parsing and string replacing.

In Scala, we can use Column#expr to get the SQL-style expression, as below:
length($"entities").expr.toString()
// length('entities)
In PySpark:
print(F.length("name")._jc.expr.container)
# length(name)

I tried the accepted answer by @Som in Spark 2.4.2 and Spark 3.2.1, and it didn't work.
The following approach worked for me in pyspark:
import re

import pyspark
from pyspark.sql import Column

def inverse_expr(c: Column) -> str:
    """Convert a column from `Column` type to an equivalent SQL column expression (string)"""
    from packaging import version

    sql_expression = c._jc.expr().sql()
    if version.parse(pyspark.__version__) < version.parse('3.2.0'):
        # prior to Spark 3.2.0, F.col('a.b') would be converted to `a.b` instead of the correct `a`.`b`
        # this workaround fixes that issue
        sql_expression = re.sub(
            r'''(`[^"'\s]+\.[^"'\s]+?`)''',
            lambda x: x.group(0).replace('.', '`.`'),
            sql_expression,
            flags=re.MULTILINE,
        )
    return sql_expression
>>> from pyspark.sql import functions as F
>>> inverse_expr(F.length(F.col('name')))
'length(`name`)'
>>> inverse_expr(F.length(F.lit('name')))
"length('name')"
>>> inverse_expr(F.length(F.col('table.name')))
'length(`table`.`name`)'
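As a rough sanity check, the string produced by inverse_expr can be fed back into F.expr to rebuild an equivalent column. A minimal sketch, assuming a running SparkSession named spark and a dataframe with a name column:
>>> df = spark.createDataFrame([('Alice',)], ['name'])
>>> df.select(F.expr(inverse_expr(F.length(F.col('name'))))).first()[0]
5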

Related

How to convert a pyspark column (pyspark.sql.column.Column) to a pyspark dataframe?

I have a use case to map the elements of a pyspark column based on a condition.
Going through this documentation for pyspark column, I could not find a function on a pyspark column to execute a map function.
So I tried to use the pyspark DataFrame map function, but I am not able to convert the pyspark column to a dataframe.
Note: the reason I am using the pyspark column is that I get it as an input from a library (Great Expectations) which I use.
@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return column.isin([3])
    # need to replace the above logic with a map function
    # like column.map(lambda x: __valid_date(x))
The _spark function arguments are passed from the library.
What I have:
A pyspark column with timestamp strings
What I require:
A pyspark column with boolean (True/False) for each element, based on validating the timestamp format
Example for a dataframe:
df.rdd.map(lambda x: __valid_date(x)).toDF()
The __valid_date function returns True/False.
So, I either need to convert the pyspark column into a dataframe to use the above map function, or is there any map function available for the pyspark column?
Looks like you need to return a column object that the framework will use for validation.
I have not used Great Expectations, but maybe you can define a UDF for transforming your column. Something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T

valid_date_udf = F.udf(lambda x: __valid_date(x), T.BooleanType())

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return valid_date_udf(column)
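The __valid_date helper is not shown in the question; a minimal sketch of what such a validator might look like, assuming the expected formats are strptime-style strings (both the format list and the implementation here are assumptions):
from datetime import datetime

TS_FORMATS = ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d']  # assumed timestamp formats

def __valid_date(value):
    # True if the string parses with any of the expected formats
    if value is None:
        return False
    for fmt in TS_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False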

How to remove special characters from dataframe using udf function

I am a learner in Spark SQL. Could anyone please help with the below scenario?
package name: sparksql, class name: custommethod, method name: removespecialchar
1. Create a custom method in Scala which takes 1 string as an argument and returns 1 string.
   The method has to remove all special characters and numbers 0 to 9 - ? , / _ ( ) [ ] from one dataframe column using the replaceAll function.
   input: windows-X64 (os system)
   output: windows x os system
2. I have a dataframe called df1 with 6 columns inside another class called sparksql2.
3. Import the package, instantiate the custommethod method inside the sparksql2 class, and register the method generated in the above step as a UDF for invoking from the Spark SQL dataframe.
4. Call the above UDF in the DSL by passing a single column name as an argument to get the special characters removed from the dataframe, and save the result as JSON into an HDFS location.
You don't need UDFs for that; you can just use plain Spark and define it in a function with regexp_replace.
Take this example:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.regexp_replace

def removeFromColumn(spark: SparkSession, columnName: String, df: DataFrame) =
  df.select(regexp_replace(
    df(columnName),
    "[0-9]|\\[|\\]|\\-|\\?|\\(|\\)|\\,|_|/",
    ""
  ).as(columnName))

With this you can use it on a DataFrame without going into the trouble of registering a UDF:
import spark.implicits._

val df = Seq("2res012-?,/_()[]ult").toDF("columnName")
removeFromColumn(spark, "columnName", df)
Output:
+----------+
|columnName|
+----------+
| result|
+----------+
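Since most of this page is PySpark, a roughly equivalent version in PySpark might look like this (a sketch, assuming a running SparkSession named spark and a string column named columnName; the function name is just an illustration):
from pyspark.sql import DataFrame
from pyspark.sql.functions import regexp_replace

def remove_from_column(column_name: str, df: DataFrame) -> DataFrame:
    # strip digits and the listed special characters from the column
    return df.select(
        regexp_replace(df[column_name], r"[0-9\[\]\-?(),_/]", "").alias(column_name)
    )

df = spark.createDataFrame([("2res012-?,/_()[]ult",)], ["columnName"])
remove_from_column("columnName", df).show()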

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing dataframe using pyspark.
I used the below code snippet:
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn(
    "RECORD_START_DATE_TIME",
    lit(strRecordStartTime).cast("timestamp")
)
It gives me the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointer is appreciated.
Try to use Python's native datetime with lit; sorry, I don't have access to a machine right now.
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created one spark dataframe:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi", "Gaurav", "Ketan", "Mahesh"], StringType()).toDF("Name")
Now let's add one new column to the existing dataframe:
from pyspark.sql.functions import lit
import dateutil.parser

yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # to print the dataframe
You can validate your schema by using the below command:
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))
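If the intent of the original snippet was a timestamp rather than a date, the same pattern should work with F.to_timestamp (a minimal sketch, reusing the questioner's hashNonKeyData dataframe and strRecordStartTime variable):
recrodStartTime = hashNonKeyData.withColumn(
    "RECORD_START_DATE_TIME",
    F.to_timestamp(F.lit(strRecordStartTime))
)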

How to register UDF with no argument in Pyspark

I have tried a Spark UDF with a parameter, using a lambda function, and registered it. But how can I create a UDF with no argument and register it? I tried this; my sample code is expected to show the current time:
from datetime import datetime
from pyspark.sql.functions import udf
def getTime():
    timevalue = datetime.now()
    return timevalue

udfGateTime = udf(getTime, TimestampType())
But PySpark is showing
NameError: name 'TimestampType' is not defined
which probably means my UDF is not registered
I was comfortable with this format:
spark.udf.register('GATE_TIME', lambda():getTime(), TimestampType())
but does a lambda function take an empty argument? Though I didn't try it, I am a bit confused. How could I write the code for registering this getTime() function?
A lambda expression can be nullary. You're just using incorrect syntax:
spark.udf.register('GATE_TIME', lambda: getTime(), TimestampType())
There is nothing special about lambda expressions in the context of Spark. You can use getTime directly:
spark.udf.register('GetTime', getTime, TimestampType())
There is no need for an inefficient udf at all. Spark provides the required function out of the box:
spark.sql("SELECT current_timestamp()")
or
from pyspark.sql.functions import current_timestamp
spark.range(0, 2).select(current_timestamp())
I have done a bit of a tweak here and it is working well for now:
import datetime
from pyspark.sql.types import *

def getTime():
    timevalue = datetime.datetime.now()
    return timevalue

def GetVal(x):
    if(True):
        timevalue = getTime()
        return timevalue

spark.udf.register('GetTime', lambda x: GetVal(x), TimestampType())
spark.sql("select GetTime('currenttime') as value").show()
Instead of 'currenttime', any value can be passed; it will return the current date and time.
The error "NameError: name 'TimestampType' is not defined" seems to be due to the lack of:
import pyspark.sql.types.TimestampType
For more info regarding TimeStampType see this answer https://stackoverflow.com/a/30992905/5088142
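Putting the pieces together, a minimal corrected sketch of the original registration (assuming a running SparkSession named spark):
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def getTime():
    return datetime.now()

# register the zero-argument function directly; no lambda is needed
spark.udf.register('GATE_TIME', getTime, TimestampType())
spark.sql("SELECT GATE_TIME() AS value").show()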

Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a csv and want to apply a function to a column of the csv. I have some code that works, but it's very hacky. What is the proper way to do this?
My code
import sys
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row

SparkContext().addPyFile("myfile.py")
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2],
                                message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use User Defined Functions (udf) combined with withColumn:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

udf_myFunction = udf(myFunction, IntegerType())  # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3"))  # "_3" being the name of the column you want to consider
This will add (or replace, if it already exists) the "message" column in the dataframe df, containing the result of myFunction applied to that column.
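Since the CSV is read with header=True, the columns already have real names, so the same udf can also be applied by column name rather than by position (a sketch; the "message" column name and the StringType return type are assumptions):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

udf_myFunction = udf(myFunction, StringType())  # return type is an assumption about myFunction
df = df.withColumn("message", udf_myFunction("message"))  # apply by the column's name from the CSV header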
