How to remove special characters from dataframe using udf function - apache-spark

I am new to Spark SQL. Could anyone please help with the scenario below?
Package name: sparksql, class name: custommethod, method name: removespecialchar.
Create a custom method in Scala that takes one String argument and returns a String.
The method has to remove the digits 0 to 9 and the special characters - ? , / _ ( ) [ ] from one dataframe column using the replaceAll function.
Input: windows-X64 (os system)
Output: windows x os system
I have a dataframe called df1 with 6 columns inside another class called sparksql2.
Import the package, instantiate custommethod inside the sparksql2 class, and register the method created above as a UDF so it can be invoked on a Spark SQL dataframe.
Call the above UDF in the DSL, passing a single column name as an argument, to remove the special characters from the dataframe, and save the result as JSON to an HDFS location.

You don't need a UDF for that; you can just use plain Spark and define it in a function with regexp_replace.
Take this example:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.regexp_replace

def removeFromColumn(spark: SparkSession, columnName: String, df: DataFrame) =
  df.select(regexp_replace(
    df(columnName),
    "[0-9]|\\[|\\]|\\-|\\?|\\(|\\)|\\,|_|/",
    ""
  ).as(columnName))
With this you can use it on a DataFrame without going to the trouble of registering a UDF:
import spark.implicits._

val df = Seq("2res012-?,/_()[]ult").toDF("columnName")
removeFromColumn(spark, "columnName", df).show()
Output:
+----------+
|columnName|
+----------+
|    result|
+----------+
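If the exercise nonetheless requires registering the method as a UDF and writing the result to HDFS as JSON, a minimal sketch of that flow could look like the one below. The object name, sample data, and output path are placeholders of my own, not taken from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CustomMethodExample {

  // Stand-in for the "removespecialchar" method from the exercise: strips the
  // digits 0-9 and the characters - ? , / _ ( ) [ ] using replaceAll.
  def removeSpecialChar(s: String): String =
    s.replaceAll("[0-9\\[\\]\\-\\?\\(\\)\\,_/]", "")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RemoveSpecialChars").getOrCreate()
    import spark.implicits._

    val df1 = Seq("windows-X64 (os system)").toDF("columnName")

    // Register the Scala method as a UDF and apply it through the DataFrame DSL.
    val removeSpecialCharUdf = udf(removeSpecialChar _)
    val cleaned = df1.select(removeSpecialCharUdf(col("columnName")).as("columnName"))

    // Save the result as JSON; replace the path with your own HDFS location.
    cleaned.write.mode("overwrite").json("hdfs:///tmp/cleaned_output")
  }
}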

Related

How to pass more than one column as a parameter to Spark dataframe

I want to pass more than one column name as a parameter when operating on a dataframe.
val readData = spark.sqlContext
  .read.format("csv")
  .option("delimiter", ",")
  .schema(Schema)
  .load("emp.csv")

val cols_list1 = "emp_id,emp_dt"
val cols_list2 = "emp_num"

val RemoveDupli_DF = readData
  .withColumn("rnk", row_number().over(Window.partitionBy(s"$cols_list1").orderBy(s"$cols_list2")))
The above code works if I have one column name, whereas with two or more columns it gives the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'emp_id,emp_dt'
Using Scala 2.x version.
The partitionBy method has multiple signatures:
def partitionBy(colName: String, colNames: String*)
// or
def partitionBy(cols: Column*)
Your code provides the list of columns as a single string, which fails because there is no column called emp_id,emp_dt. Hence the error message.
You could define your column names (as Strings) in a collection
val cols_seq1 = Seq("emp_id", "emp_dt")
and then call partitionBy like this, mapping each name to a Column with org.apache.spark.sql.functions.col so that the call matches the Column* signature:
Window.partitionBy(cols_seq1.map(col): _*)
The notation : _* tells the compiler to pass each element of the sequence as its own argument into the partitionBy call rather than passing the whole collection as a single argument.
As an alternative you could also just use
Window.partitionBy("emp_id", "emp_dt")

PySpark UDF not recognizing number of arguments

I have defined a Python function "DateTimeFormat" which takes three arguments:
A Spark dataframe column that holds dates as strings
The input format of the column's values, e.g. yyyy-mm-dd (String)
The output format, i.e. the format in which the value has to be returned, e.g. yyyymmdd (String)
I have now registered this function as a UDF in PySpark.
udf_date_time = udf(DateTimeFormat,StringType())
I am trying to call this UDF in a dataframe select, and it seems to work fine as long as the input and output formats are different, like below:
df.select(udf_date_time('entry_date',lit('mmddyyyy'),lit('yyyy-mm-dd')))
But it fails when the input format and the output format are the same, with the following error:
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd')))
"DateTimeFormat" takes exactly 3 arguments. 2 given
But I'm clearly sending three arguments to the UDF.
I have tried the above example on Python 2.7 and Spark 2.1.
The function works as expected in plain Python when the input and output formats are the same:
>>>DateTimeFormat('10152019','mmddyyyy','mmddyyyy')
'10152019'
>>>
But the code below gives an error when run in Spark:
import datetime

# Standard date/timestamp formatter
# Takes a string date, its format and the output format as arguments
# Returns the formatted date as a string
def DateTimeFormat(col, in_frmt, out_frmt):
    date_formatter = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)
I call the UDF using the code below:
from pyspark.sql.functions import udf, lit
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Create the Spark session
spark = SparkSession.builder.appName("DateChanger").enableHiveSupport().getOrCreate()

df = spark.read.format("csv").option("header", "true").load(file_path)

# Register the UDF
udf_date_time = udf(DateTimeFormat, StringType())

df.select('exit_date', udf_date_time('exit_date', lit('yyyy-mm-dd'), lit('yyyy-mm-dd'))).show()
CSV file input: [screenshot of the input file]
The expected result is that the command
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
should NOT throw an error like
DateTimeFormat takes exactly 3 arguments but 2 given
I am not sure if there's a better way to do this, but you can try the following.
Here I have assumed that you want your dates in a particular format and have set a default for the output format (out_frmt='yyyy-mm-dd') in your DateTimeFormat function.
I have added a new function called udf_score to help with the conversion. That might interest you:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lit

df = spark.createDataFrame([
    ["10-15-2019"],
    ["10-16-2019"],
    ["10-17-2019"],
], ['exit_date'])

import datetime

def DateTimeFormat(col, in_frmt, out_frmt='yyyy-mm-dd'):
    date_formatter = {'yyyy': '%Y', 'mm': '%m', 'dd': '%d', 'HH': '%H', 'MM': '%M', 'SS': '%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)

def udf_score(in_frmt):
    return udf(lambda l: DateTimeFormat(l, in_frmt))

in_frmt = 'mm-dd-yyyy'
df.select('exit_date', udf_score(in_frmt)('exit_date').alias('new_dates')).show()
+----------+----------+
| exit_date| new_dates|
+----------+----------+
|10-15-2019|2019-10-15|
|10-16-2019|2019-10-16|
|10-17-2019|2019-10-17|
+----------+----------+

DataFrame object has no attribute 'col'

In Spark: The Definitive Guide it says:
If you need to refer to a specific DataFrame’s column, you can use the
col method on the specific DataFrame.
For example (in Python/Pyspark):
df.col("count")
However, when I run the latter code on a dataframe containing a column count I get the error 'DataFrame' object has no attribute 'col'. If I try column I get a similar error.
Is the book wrong, or how should I go about doing this?
I'm on Spark 2.3.1. The dataframe was created with the following:
df = spark.read.format("json").load("/Users/me/Documents/Books/Spark-The-Definitive-Guide/data/flight-data/json/2015-summary.json")
The book you're referring to describes the Scala / Java API. In PySpark, use []:
df["count"]
The book mixes the Scala and PySpark APIs.
In the Scala / Java API, df.col("column_name") or df.apply("column_name") returns the Column.
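For illustration, a short Scala sketch (assuming a dataframe df that has a count column, as in the book's flight data):
// Both forms return a Column that can then be used in select, filter, etc.
val countCol = df.col("count")
val countCol2 = df.apply("count") // df("count") is shorthand for apply

df.select(df.col("count")).show()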
In PySpark, use one of the following to get a column from a DataFrame:
df.colName
df["colName"]
Applicable to Python Only
Given a DataFrame such as
>>> df
DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]
You can access any column with dot notation
>>> df.DEST_COUNTRY_NAME
Column<'DEST_COUNTRY_NAME'>
You can also use key based indexing to do the same
>>> df['DEST_COUNTRY_NAME']
Column<'DEST_COUNTRY_NAME'>
However, if your column name clashes with a method name on DataFrame, the column will be shadowed when using dot notation.
>>> df['count']
Column<'count'>
>>> df.count
<bound method DataFrame.count of DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]>
In PySpark the col function can also be used:
from pyspark.sql.functions import col
df.select(col("count")).show()

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

I'm struggling to understand how the conversion among RDDs, DataSets and DataFrames works.
I'm pretty new to Spark, and I get stuck every time I need to go from one data model to another (especially from RDDs to Datasets and DataFrames).
Could anyone explain to me the right way to do it?
As an example, I now have an RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example KMeans (Spark Dataset MLlib). So I need to convert it to a Dataset with a single column named "features" containing Vector-typed rows. How should I do this?
All you need is an Encoder. Imports
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.ml.linalg
RDD:
val rdd = sc.parallelize(Seq(
  linalg.Vectors.dense(1.0, 2.0), linalg.Vectors.sparse(2, Array(), Array())
))
Conversion:
val ds = spark.createDataset(rdd)(ExpressionEncoder(): Encoder[linalg.Vector])
  .toDF("features")
ds.show
// +---------+
// | features|
// +---------+
// |[1.0,2.0]|
// |(2,[],[])|
// +---------+
ds.printSchema
// root
// |-- features: vector (nullable = true)
To convert an RDD to a dataframe, the easiest way is to use toDF() in Scala. To use this function, it is necessary to import the implicits, which is done using the SparkSession object. It can be done as follows:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = rdd.toDF("features")
toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they are implicitly converted, i.e. there is no need to do anything, and when the RDD has multiple columns there is no need to do anything either, since the RDD already contains tuples. However, in this special case you first need to convert RDD[org.apache.spark.ml.linalg.Vector] to RDD[Tuple1[org.apache.spark.ml.linalg.Vector]]. Therefore, it is necessary to do a conversion to a tuple as follows:
val df = rdd.map(Tuple1(_)).toDF("features")
The above will convert the RDD to a dataframe with a single column called features.
To convert to a dataset the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a dataframe, then do the following:
case class A(features: org.apache.spark.ml.linalg.Vector)
val ds = df.as[A]
To cover all possible conversions: the underlying RDD of a dataframe or dataset can be accessed using .rdd:
val rdd = df.rdd
Instead of converting back and forth between RDDs and dataframes/datasets, it's usually easier to do all the computations using the DataFrame API. If there is no suitable built-in function to do what you want, it's usually possible to define a UDF (user-defined function). See for example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
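As an illustration (a made-up vector-norm example of mine, not something from the linked page), a UDF lets you stay in the DataFrame API:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Compute the L2 norm of each feature vector with a UDF instead of
// converting the dataframe back to an RDD.
val l2Norm = udf { v: Vector => math.sqrt(v.toArray.map(x => x * x).sum) }

val withNorm = df.withColumn("norm", l2Norm(col("features")))
withNorm.show()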

Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works but it's very hacky. What is the proper way to do this?
My code:
SparkContext().addPyFile("myfile.py")

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2],
                                message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use a user-defined function (udf) combined with withColumn:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

udf_myFunction = udf(myFunction, IntegerType())  # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3"))  # "_3" being the name of the column you want to consider
This will add a new column to the dataframe df containing the result of myFunction(line[3]).
