How to pass data into udf not from row - apache-spark

Let's say I have the following dataframe schema:
+-------+-------+
| body | rules |
+-------+-------+
I have a udf that takes in the body column and the rule-list column for each row, parses and evaluates the conditions of the rules against the row, and returns a list of booleans indicating whether each rule matches or not. Right now, every single row in the DF has a copy of these rules because I don't know any other way to pass the rules into the UDF. This feels very redundant and wasteful to me.
The rules are joined onto the row based on some join conditions, so each row doesn't have the exact same data, but there is still a lot of redundancy (each rule is probably listed ~5000 redundant times across 1 million rows). I would prefer to join only the ruleIds onto each row and pass a map(ruleId -> rule) into the udf. This map may be somewhat large, though, so whatever mechanism passes it in would have to handle that (ideally some sort of shared variable stored at the partition level).

You probably want to pass the map as an external parameter to your udf, and broadcast this map to each machine to avoid having a copy of it for each task.
For your UDF, you can do something along those lines:
def yourUDF(rulesMap: Map[String, XXX]): UserDefinedFunction = udf {
  (body: YYY, ruleId: String) => applyYourRules(body, rulesMap(ruleId))
} // XXX and YYY are the types you need; I don't know the problem you're trying to solve
And as your map is quite large, you can avoid duplicating it for every task by broadcasting the variable (you can access a broadcast variable with variable.value):
val rulesMapBroadcasted = spark.sparkContext.broadcast(rulesMap)
df.withColumn("new", yourUDF(rulesMap = rulesMapBroadcasted.value)(col("body"), col("ruleId")))
A broadcast variable is a read-only variable duplicated only once per machine (in comparison, a classical variable is duplicated once per task), so this is a perfect fit for a large lookup table.
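For reference, here is a rough PySpark sketch of the same pattern; the rule contents, the evaluation logic, and the sample data are placeholders, not from the question:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup table: ruleId -> rule definition
rules_map = {"r1": "error", "r2": "warning"}

# Broadcast once: each executor keeps a single read-only copy instead of one per task
rules_bc = spark.sparkContext.broadcast(rules_map)

def rule_matches(body, rule_id):
    # Placeholder evaluation logic: check whether the rule text occurs in the body
    return rules_bc.value[rule_id] in body

rule_matches_udf = F.udf(rule_matches, BooleanType())

df = spark.createDataFrame([("an error occurred", "r1"), ("all good", "r2")], ["body", "ruleId"])
df.withColumn("matched", rule_matches_udf(F.col("body"), F.col("ruleId"))).show()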

Related

How to check Spark DataFrame difference?

I need to check my solution for idempotency and see how much it differs from the previous solution.
First I tried this:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This gives me information about how much these tables differ (t1 is my solution, t2 is the original data). If there are many differences, I want to check where exactly they are.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
This is not good for me, because it takes about 1-2 hours (both tables have 30 columns and 30 million rows of data).
Do you guys have an idea how to calculate this quickly?
Except is essentially a join on all columns at the same time. Does your data have a primary key? It could even be composite, comprising multiple columns, but it's still much better than taking all 30 columns into account.
Once you figure out the primary key you can do the FULL OUTER JOIN and:
check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)
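A PySpark sketch of that approach, assuming a single key column named id (replace with your real, possibly composite, key):

from pyspark.sql import functions as F

# t1 / t2 are the tables from the question; "id" is an assumed primary key column
t1 = spark.table("t1").alias("a")
t2 = spark.table("t2").alias("b")
joined = t1.join(t2, F.col("a.id") == F.col("b.id"), "full_outer")

# Rows present only in t1 (no match on the right) and only in t2 (no match on the left)
only_in_t1 = joined.where(F.col("b.id").isNull()).count()
only_in_t2 = joined.where(F.col("a.id").isNull()).count()

# Matching rows can then be compared column by column, which is much cheaper after the join
matched = joined.where(F.col("a.id").isNotNull() & F.col("b.id").isNotNull())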
Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two dataframes once instead of looping over except: I assume your dataset has a key / index; otherwise there is no ordering in either dataframe and you can't reliably check the difference. Unless you have very limited resources, just do the join once to combine the two dataframes instead of running multiple except queries.
Check your data partitioning: whichever method you use, make sure the data is evenly distributed across an appropriate number of partitions. Most of the time, data skew is one of the critical factors that lowers performance. If your key is a string, use repartition; if it's a sequential number, use repartitionByRange.
Use the when-otherwise pair to check the difference: once you have joined the two dataframes, you can use a when-otherwise condition to compare them, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). This way you can calculate all the differences within one action instead of many, as shown in the sketch below.
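Putting point 3 together for all columns at once (a sketch; it assumes the two tables were joined on a key column named id, as in point 1):

from pyspark.sql import functions as func

joined = spark.table("t1").alias("df1").join(
    spark.table("t2").alias("df2"),
    func.col("df1.id") == func.col("df2.id"))

value_columns = [c for c in spark.table("t1").columns if c != "id"]
diff_counts = joined.select([
    func.sum(
        func.when(func.col(f"df1.{c}") != func.col(f"df2.{c}"), func.lit(1))
            .otherwise(func.lit(0))
    ).alias(f"diff_in_{c}_count")
    for c in value_columns
])
diff_counts.show()  # a single action computes the mismatch count for every column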

GroupByKey to fill values and then ungroup apache beam

I have CSV files with missing values within groups formed by a primary key (for every group, there is only one record where a certain field is populated, and I need that field to be populated for all records of the group). I'm processing the entire file with Apache Beam, and therefore I want to use GroupByKey to fill up the field for each group, and then ungroup it to restore the original data, now with the field filled. The equivalent in pandas would be:
dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()
I don't know how to achieve this with Apache Beam. I first tried the Apache Beam dataframe API, but that would take a lot of memory.
It's better to process your elements as a PCollection instead of a dataframe to avoid memory issues.
First read your CSV as a PCollection, then you can use GroupByKey, process the grouped elements, and yield the results with a separate transform.
It could be something like this:
(pcollection | 'Group by key' >> beam.GroupByKey()
| 'Process grouped elements' >> beam.ParDo(UngroupElements()))
The input PCollection should be a collection of tuples, each containing the key you want to group by and the element itself.
And the transform would look like this:
class UngroupElements(beam.DoFn):
    def process(self, element):
        k, v = element
        for elem in v:
            # process your element here
            yield elem
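A more complete sketch of that idea for this particular use case; the tuple layout, the dict-shaped rows, and the field names here are assumptions, not from the question:

import apache_beam as beam

class FillMissingValues(beam.DoFn):
    """Receives (key, iterable_of_rows) from GroupByKey, fills the sparse field, and re-emits the rows."""
    def __init__(self, field):
        self.field = field

    def process(self, element):
        key, rows = element
        rows = list(rows)
        # The single populated value within the group (the question says there is exactly one)
        fill_value = next((r[self.field] for r in rows if r.get(self.field) not in (None, "")), None)
        for row in rows:
            if row.get(self.field) in (None, ""):
                row = {**row, self.field: fill_value}
            yield row

with beam.Pipeline() as p:
    filled = (
        p
        | "Create" >> beam.Create([
            ("k1", {"pk": "k1", "column_to_be_filled": "X"}),
            ("k1", {"pk": "k1", "column_to_be_filled": None}),
        ])
        | "Group by key" >> beam.GroupByKey()
        | "Fill and ungroup" >> beam.ParDo(FillMissingValues("column_to_be_filled"))
        | "Print" >> beam.Map(print)
    )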
You can try to use exactly the same code as Pandas in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/
You can use read_csv to read your data into a dataframe, and then apply the same code that you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but that specific case with the group by key should work.

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed via show differ when I take a subset of the dataframe's columns to display?
Here is the original dataframe:
Here the dates are in the expected order, as you can see via show.
Now the order of rows displayed via show changes when I select a subset of predict_df's columns into a new dataframe.
Because a Spark dataframe itself is unordered. This is due to the parallel processing principles which Spark uses: different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
So you have to explicitly specify the order in the Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have an equal date value, then within each such date subset the records will still be unordered. So in that case, to obtain strongly ordered data, we have to perform orderBy on a set of columns whose combined values are unique across all rows. E.g.:
df.orderBy(col("date").asc, col("other_column").desc)
In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server generally return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. Even if we sometimes see the same result for one query, the DBMS does not guarantee that another execution will return the same result, especially when reading a large amount of data.
The situation occurs because show is an action that is called twice.
As no .cache is applied, the whole cycle starts again from the beginning. Moreover, I tried this a few times and got both the same order and a different order than the questioner observed: the processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the second action invokes processing from the start again. The basics are evident here as well, and maybe the bottom line is: always do the ordering explicitly, if it matters.
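For instance, a minimal sketch (the column names are placeholders; predict_df is the question's dataframe):

subset_df = predict_df.select("date", "prediction")
subset_df.cache()   # the first action materializes the result; later actions reuse it
subset_df.show()    # repeated show() calls now display the same row order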
Like @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is to use monotonically_increasing_id (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) to create an id column, which will let you keep your dataframe ordered. However, if your dataframe is big, be aware that this function might take some time to generate an id for each row.
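A sketch of that suggestion; the id should be added before any shuffling transformations so it reflects the order you want to keep, and the column names besides row_id are placeholders:

from pyspark.sql import functions as F

# Assign an increasing (but not consecutive) id per row, then sort by it whenever displaying
predict_df_with_id = predict_df.withColumn("row_id", F.monotonically_increasing_id())
predict_df_with_id.select("row_id", "date", "prediction").orderBy("row_id").show()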

Pandas Dataframe : inplace column substitution vs creating new dataframe with transformed column

Whenever I want to transform an existing column of a dataframe, I tend to use apply/transform, which gives me an entirely new series and does not modify the existing column in the dataframe.
Suppose the following code performs an operation on a column and returns me a series.
new_col1 = df.col1.apply(...)
After this I have two ways of substituting the new series into the dataframe:
modifying the existing col1:
df.col1 = new_col1
Or creating a new dataframe with the transformed column:
df.drop(columns=['col1']).join(new_col1)
I ask this because whenever I use mutable data structures in Python, like lists, I always try to create new lists using list comprehension rather than in-place substitution.
Is there any benefit of following this style in case of pandas dataframes ? What's more pythonic and which of the above two approaches do you recommend ?
Since you are modifying an existing column, the first approach would be faster. Remember that both drop and join return a copy of the data, so the second approach can be expensive if you have a big data frame with many columns.
Whenever you want to make changes to the original data frame itself, consider using the inplace=True argument in functions like drop, which by default return a new copy.
NOTE: Please keep in mind the cons of inplace:
inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace is a common pitfall for beginners, so removing this option will simplify the API
SOURCE: In pandas, is inplace = True considered harmful, or not?
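To make the comparison concrete, here is a small sketch of both approaches from the question, plus the chain-friendly assign alternative; the sample data and transformations are placeholders:

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

# Approach 1: overwrite the existing column; only the new series is created
df["col1"] = df["col1"].apply(lambda x: x * 10)

# Approach 2: drop + join; both calls copy the remaining columns as well
new_col1 = df["col1"].apply(lambda x: x * 100)
df2 = df.drop(columns=["col1"]).join(new_col1)

# A chain-friendly alternative that avoids inplace=True entirely
df3 = df.assign(col1=lambda d: d["col1"] * 1000)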

spark dataset : how to get count of occurrences of unique values from a column

I am trying out the Spark dataset APIs by reading a CSV file and counting the occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")
// breakdown by professions in descending order
// ***** DOES NOT WORK ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()
// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count") // WORKS
println ( s"\n\nbreakdown by profession \n")
breakdownByProfession.show()
Also, please let me know which approach is more efficient. My guess would be the first one (that's the reason I attempted it in the first place).
Also, what is the best way to save the output of such an operation to a text file using the dataset APIs?
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
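A quick way to check this, shown in PySpark for illustration (Dataset.explain works the same way in Scala); the in-memory sample data here is a hypothetical stand-in for the CSV in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [("alice", "engineer"), ("bob", "teacher"), ("carol", "engineer")],
    ["name", "profession"])

# Compare the physical plans: the projection is pushed down, so both variants read only "profession"
data.groupBy("profession").count().explain()
data.select("profession").groupBy("profession").count().explain()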
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count().collect()
