How to pass whole Row to UDF - Spark DataFrame filter - apache-spark

I'm writing a filter function for a complex JSON dataset with lots of inner structures. Passing individual columns is too cumbersome.
So I declared the following UDF:
val records: DataFrame = sqlContext.jsonFile("...")
def myFilterFunction(r:Row):Boolean=???
sqlContext.udf.register("myFilter", (r: Row) => myFilterFunction(r))
Intuitively I'm thinking it will work like this:
records.filter("myFilter(*)=true")
What is the actual syntax?

You have to use the struct() function to construct the row when making the call to the function. Follow these steps.
Import Row:
import org.apache.spark.sql._
Define the UDF
def myFilterFunction(r:Row) = {r.get(0)==r.get(1)}
Register the UDF
sqlContext.udf.register("myFilterFunction", myFilterFunction _)
Create the DataFrame
val records = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")
Use the UDF
records.filter(callUdf("myFilterFunction",struct($"text",$"text2"))).show
When you want all columns to be passed to the UDF:
records.filter(callUdf("myFilterFunction",struct(records.columns.map(records(_)) : _*))).show
Result:
+------+------+
|  text| text2|
+------+------+
|sachin|sachin|
+------+------+

scala> inputDF
res40: org.apache.spark.sql.DataFrame = [email: string, first_name: string ... 3 more fields]
scala> inputDF.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- last_name: string (nullable = true)
Now, I would like to filter the rows based on the gender field. I can accomplish that by using .filter($"gender" === "Male"), but I would like to do it with .filter(function).
So I defined my anonymous functions:
val isMaleRow = (r: Row) => { r.getAs[String]("gender") == "Male" }
val isFemaleRow = (r: Row) => { r.getAs[String]("gender") == "Female" }
inputDF.filter(isMaleRow).show()
inputDF.filter(isFemaleRow).show()
I feel this requirement can be met in a better way, i.e. without declaring a UDF and invoking it.
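For reference, this direct approach works without registering a UDF because in Spark 2.x a DataFrame is just an alias for Dataset[Row], so filter accepts a Row => Boolean function. A minimal inline sketch of the same logic as above:
import org.apache.spark.sql.Row
// Filter with a plain Scala function instead of a registered UDF.
inputDF.filter((r: Row) => r.getAs[String]("gender") == "Male").show()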

In addition to the first answer: when we want all columns to be passed to the UDF, we can use
struct("*")

If you want to take an action over the whole row and process it in a distributed way, take the row from the DataFrame, send it to a function as a struct, and then convert it to a dictionary to execute the specific action. It is very important to call the collect method on the final DataFrame, because Spark evaluates lazily and does not work over the full data unless you tell it to explicitly.
In my case I needed to send each row of a DataFrame to be indexed as a dictionary object:
Import the libraries.
Declare the UDF; its lambda must receive the row structure.
Execute the specific function, in this case sending the row structure (converted to a dict) to the index.
Call withColumn on the source DataFrame, which tells Spark to apply the function to each row before collect is called; this lets the function run in a distributed way. Don't forget to assign the result to another DataFrame variable.
Call collect to run the process and distribute the function.
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType
myUdf = udf(lambda row: sendToES(row.asDict()), IntegerType())
dfWithControlCol = df.withColumn("control_col", myUdf(struct([df[x] for x in df.columns])))
dfWithControlCol.collect()

Related

How can I extract a date from a struct type column in a PySpark dataframe?

I'm dealing with PySpark dataframe which has struct type column as shown below:
df.printSchema()
#root
#|-- timeframe: struct (nullable = false)
#| |-- start: timestamp (nullable = true)
#| |-- end: timestamp (nullable = true)
So I tried to collect() and pass the end timestamps of the related column for a plotting task:
from pyspark.sql.functions import *
# method 1
ts1 = [val('timeframe.end') for val in df.select(date_format(col('timeframe.end'),"yyyy-MM-dd")).collect()]
# method 2
ts2 = [val('timeframe.end') for val in df.select('timeframe.end').collect()]
Normally, when the column is not a struct, I follow this answer, but in this case I couldn't find a better way except this post and this answer, which try to convert it to arrays. I'm not sure this is the best practice.
The two methods I tried above are unsuccessful; their outputs are below:
print(ts1) #[Row(2021-12-28='timeframe.end')]
print(ts2) #[Row(2021-12-28 00:00:00='timeframe.end')]
Expected outputs are below:
print(ts1) #[2021-12-28] just date format
print(ts2) #[2021-12-28 00:00:00] just timestamp format
How can I handle this matter?
You can access Row fields using brackets (row["field"]) or dot notation (row.field), not parentheses. Try this instead:
from pyspark.sql import Row
import pyspark.sql.functions as F
df = spark.createDataFrame([Row(timeframe=Row(start="2021-12-28 00:00:00", end="2022-01-06 00:00:00"))])
ts1 = [r["end"] for r in df.select(F.date_format(F.col("timeframe.end"), "yyyy-MM-dd").alias("end")).collect()]
# or
# ts1 = [r.end for r in df.select(F.date_format(F.col("timeframe.end"), "yyyy-MM-dd").alias("end")).collect()]
print(ts1)
#['2022-01-06']
When you do row("timeframe.end") you are actually calling the Row object itself (in PySpark, calling a Row builds a new Row, treating the existing row's values as field names); that's why you get those values.

Can IF statement work correctly to build spark dataframe?

I have the following code, which uses an IF statement to build a DataFrame conditionally.
Does this work as I expect?
df = sqlContext.read.option("badRecordsPath", badRecordsPath).json([data_path_1, s3_prefix + "batch_01/2/2019-04-28/15723921/15723921_15.json"])
if "scrape_date" not in df.columns:
df = df.withColumn("scrape_date", lit(None).cast(StringType()))
Is this what you are trying to do?
val result = <SOME Dataframe I previously created>
scala> result.printSchema
root
|-- VAR1: string (nullable = true)
|-- VAR2: double (nullable = true)
|-- VAR3: string (nullable = true)
|-- VAR4: string (nullable = true)
scala> result.columns.contains("VAR3")
res13: Boolean = true
scala> result.columns.contains("VAR9")
res14: Boolean = false
So the "result" dataframe has columns "VAR1", "VAR2" and so on.
The next line shows that it contains "VAR3" (result of expression is "true". But it does not contains a column called "VAR9" (result of the expression is "false").
The above is scala, but you should be able to do the same in Python (sorry I did not notice you were asking about python when I replied).
In terms of execution, the if statement will execute locally on the driver node. As a rule of thumb, if something returns an RDD, DataFrame or DataSet, it will be executed in parallel on the executor(s). Since DataFrame.columns returns an Array, any processing of the list of columns will be done in the driver node (because an Array is not an RDD, DataFrame nor DataSet).
Also note that RDDs, DataFrames and DataSets are executed "lazily". That is, Spark will "accumulate" the operations that generate these objects and will only execute them when you do something that doesn't generate an RDD, DataFrame or DataSet, for example a show, a count or a collect. Part of the reason for doing this is so Spark can optimise the execution of the process. Another is so it only does what is actually needed to generate the answer.
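As a rough sketch of the same conditional pattern in this answer's Scala register (reusing the result DataFrame above; the column name is only illustrative):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
// The columns.contains check runs locally on the driver; the withColumn call is
// only recorded until an action (show, count, collect, write) triggers execution.
val resultWithVar9 =
  if (result.columns.contains("VAR9")) result
  else result.withColumn("VAR9", lit(null).cast(StringType))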

How to write just the `row` value of a DataFrame to a file in spark?

I have a dataframe that has just one column, whose value is a JSON string. I'm trying to write just the values to a file with one record per line.
scala> selddf.printSchema
root
|-- raw_event: string (nullable = true)
The data looks like this:
scala> selddf.show(1)
+--------------------+
|           raw_event|
+--------------------+
|{"event_header":{...|
+--------------------+
only showing top 1 row
I am running the following to save it to file:
selddf.select("raw_event").write.json("/data/test")
The output looks like:
{"raw_event":"{\"event_header\":{\"version\":\"1.0\"...}"}
I would like the output to just say:
{\"event_header\":{\"version\":\"1.0\"...}
What am I missing?
The reason this happens is that when you write JSON you are writing the whole DataFrame, in which the column is raw_event, so each value gets wrapped in a raw_event field.
Your first option is to simply write it as text:
df.write.text(filename)
Another option (if your JSON schema is the same for all elements) is to use the from_json function to convert this to a proper DataFrame. Select the elements (the content of the column, which includes all members of the JSON) and only then save it:
val df = Seq("{\"a\": \"str\", \"b\": [1,2,3], \"c\": {\"d\": 1, \"e\": 2}}").toDF("raw_event")
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("a", StringType), StructField("b", ArrayType(IntegerType)), StructField("c", StructType(Seq(StructField("d", IntegerType), StructField("e", IntegerType))))))
df.withColumn("jsonData", from_json($"raw_event", schema)).select("jsonData.*").write.json("bla.json")
The advantage of the second option is that you can test for malformed rows (which would result in null) and therefore add a filter to remove them.
Note that in both cases you don't get escaping for the " characters. If you want that, you would need to use the first option and first apply a UDF that adds the escaping.
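A minimal sketch of that escaping UDF, assuming the selddf DataFrame from the question (the escaping rule and the output path are only illustrative):
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical escaping: backslash-escape double quotes before writing the column as plain text.
val escapeQuotes = udf((s: String) => if (s == null) null else s.replace("\"", "\\\""))
selddf.select(escapeQuotes(col("raw_event")).as("raw_event")).write.text("/data/test_escaped")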

get datatype of column using pyspark

We are reading data from a MongoDB collection. A collection column has values of two different types (e.g.: (bson.Int64, int), (int, float)).
I am trying to get the data types using PySpark.
My problem is that some columns have values of different data types.
Assume quantity and weight are the columns
quantity weight
--------- --------
12300 656
123566000000 789.6767
1238 56.22
345 23
345566677777789 21
Actually we didn't define a data type for any column of the Mongo collection.
When I query the count from the PySpark DataFrame
dataframe.count()
I get an exception like this:
"Cannot cast STRING into a DoubleType (value: BsonString{value=&apos;200.0&apos;})"
Your question is broad, thus my answer will also be broad.
To get the data types of your DataFrame columns, you can use dtypes, i.e.:
>>> df.dtypes
[('age', 'int'), ('name', 'string')]
This means your column age is of type int and name is of type string.
For anyone else who came here looking for an answer to the exact question in the post title (i.e. the data type of a single column, not multiple columns), I have been unable to find a simple way to do so.
Luckily it's trivial to get the type using dtypes:
def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]
get_dtype(my_df, 'column_name')
(note that this will only return the first matching column's type if there are multiple columns with the same name)
import pandas as pd
pd.set_option('max_colwidth', -1)  # to prevent truncating of columns in jupyter

def count_column_types(spark_df):
    """Count number of columns per type"""
    return pd.DataFrame(spark_df.dtypes).groupby(1, as_index=False)[0].agg({'count': 'count', 'names': lambda x: " | ".join(set(x))}).rename(columns={1: "type"})
Example output in jupyter notebook for a spark dataframe with 4 columns:
count_column_types(my_spark_df)
I don't know how you are reading from MongoDB, but if you are using the MongoDB connector, the data types will be automatically converted to Spark types. To get the Spark SQL types, just use the schema attribute like this:
df.schema
Looks like your actual data and your metadata have different types. The actual data is of type string while the metadata is double.
As a solution I would recommend you recreate the table with the correct data types.
df.dtypes to get a list of (colname, dtype) pairs, ex.
[('age', 'int'), ('name', 'string')]
df.schema to get a schema as StructType of StructField, ex.
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
df.printSchema() to get a tree view of the schema, ex.
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
data = [('A+','good','Robert',550,3000),
('A+','good','Robert',450,4000),
('A+','bad','James',300,4000),
('A','bad','Mike',100,4000),
('B-','not bad','Jenney',250,-1)
]
columns = ["A","B","C","D","E"]
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Temp-Example').getOrCreate()
df = spark.createDataFrame(data=data, schema = columns)
df.printSchema()
# root
# |-- A: string (nullable = true)
# |-- B: string (nullable = true)
# |-- C: string (nullable = true)
# |-- D: long (nullable = true)
# |-- E: long (nullable = true)
You can get the data types with this simple code:
# get datatype
from collections import defaultdict
import pandas as pd
data_types = defaultdict(list)
for entry in df.schema.fields:
    data_types[str(entry.dataType)].append(entry.name)
pd.DataFrame(list((i, len(data_types[i])) for i in data_types), columns=["datatype", "Nums"])
# datatype Nums
# 0 StringType() 3
# 1 LongType() 2
I am assuming you are looking to get the data type of the data you read.
input_data = [Read from Mongo DB operation]
You can use
type(input_data)
to inspect the data type

Spark sql how to execute sql command in a loop for every record in input DataFrame

I have a DataFrame with the following schema:
%> input.printSchema
root
|-- _c0: string (nullable = true)
|-- id: string (nullable = true)
I have another DataFrame on which I need to execute a SQL command:
val testtable = testDf.registerTempTable("mytable")
%>testDf.printSchema
root
|-- _1: integer (nullable = true)
sqlContext.sql(s"SELECT * from mytable WHERE _1=$id").show()
$id should come from the input DataFrame, and the SQL command should execute for all ids in the input table.
Assuming you can work with a single new DataFrame containing all the rows present in testDf that matches the values present in the id column of input, you can do an inner join operation, as stated by Alberto:
val result = input.join(testDf, input("id") === testDf("_1"))
result.show()
Now, if you want a new, different DataFrame for each distinct value present in testDf, the problem is considerably harder. If this is the case, I would suggest you make sure the data in your lookup table can be collected as a local list, so you can loop through its values and create a new DataFrame for each one as you already thought (this is not recommended):
val localArray: Array[Int] = input.map { case Row(_, id: Integer) => id }.collect
val result: Array[DataFrame] = localArray.map {
i => testDf.where(testDf("_1") === i)
}
Anyway, unless the lookup table is very small, I suggest that you adapt your logic to work with the single joined DataFrame of my first example.
