Remove rows from dataframe based on condition in pyspark - apache-spark

I have one dataframe with two columns:
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|   1| 2.1|
|   5|52.1|
|   2|62.9|
|  77|33.3|
+----+----+
I would like to create a new dataframe that keeps only the rows where
"value of col1" > "value of col2".
Note that col1 has long type and col2 has double type.
The result should look like this:
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+

I think the best way would be to simply use "filter".
df_filtered=df.filter(df.col1>df.col2)
df_filtered.show()
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+

Another possible way is to use the where function of the DataFrame (where is an alias for filter).
For example, this:
output = df.where("col1 > col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+

The best way to keep rows based on a condition is to use filter, as mentioned by others.
To answer the question as stated in the title, one option to remove rows based on a condition is to use a left_anti join in PySpark.
For example, to delete all rows with col1 > col2, use the following, where key_column stands for the name of a column that uniquely identifies each row:
rows_to_delete = df.filter(df.col1 > df.col2)
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')

You can also use SQL through sqlContext to simplify the task (with a SparkSession, spark.sql works the same way).
First register the dataframe as a temporary view, for example:
df.createOrReplaceTempView("tbl1")
Then run the SQL query:
sqlContext.sql("select * from tbl1 where col1 > col2")

Related

Show first occurrence(s) of a column

I want to use PySpark to create a new dataframe from the input that contains only the first occurrence of each distinct value in the VALUE column. Would row_number() or a window work? I'm not sure of the best approach, or whether Spark SQL would be better. The second table below is the output I want: if a value is repeated, only the first row seen for it is shown.
+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   20|TUES| BLUE|
|   30| WED| BLUE|
+-----+----+-----+
+-----+----+-----+
|VALUE| DAY|Color|
+-----+----+-----+
|   20| MON| BLUE|
|   30| WED| BLUE|
+-----+----+-----+
Here's how I'd do this without using a window. It will likely perform better on large data sets as it can use more of the cluster to do the work. In your case you would use 'VALUE' in place of 'Department' and 'DAY' in place of 'Salary'.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
        ("Robert","Sales",4100),("Maria","Finance",3000),
        ("Raman","Finance",3000),("Scott","Finance",3300),
        ("Jen","Finance",3900),("Jeff","Marketing",3000),
        ("Kumar","Marketing",2000)]
df = spark.createDataFrame(data, ["Name","Department","Salary"])

# Make a struct with all the record elements.
# Structs are compared field by field, so the array is sorted on Salary first.
unGroupedDf = df.select(
    df["Department"],
    f.struct(
        df["Salary"].alias("Salary"),
        df["Department"].alias("Dept"),
        df["Name"].alias("Name"),
    ).alias("record"),
)

(
    unGroupedDf.groupBy("Department")                 # group
    .agg(f.collect_list("record").alias("record"))    # gather all the elements in a group
    .select(
        # array_sort sorts ascending, reverse makes it descending,
        # and [0] grabs the "max" element of the array
        f.reverse(f.array_sort(f.col("record")))[0].alias("record")
    )
    .select(f.col("record.*"))                        # use the struct fields as columns
    .show()
)
+------+---------+-------+
|Salary|     Dept|   Name|
+------+---------+-------+
|  4600|    Sales|Michael|
|  3900|  Finance|    Jen|
|  3000|Marketing|   Jeff|
+------+---------+-------+
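It appears that you want to drop rows with a duplicated VALUE. If so, use dropDuplicates. Keep in mind that dropDuplicates does not guarantee which row of each group is kept, so the DAY shown may not always be the first one.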
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
| 20|MON| BLUE|
| 30|WED| BLUE|
+-----+---+-----+
Here's how to do it with a window. This example uses salary data. In your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to develop a way of ordering days, but I think you get that it's possible and you picked the correct tool. One warning I always give: a window shuffles the data so that all rows of a partition end up on the same executor, and a window without partitionBy pulls all the data to a single executor. That is not particularly efficient. On small datasets this is likely performant. On larger datasets, especially with skewed keys, it may take far too long to complete.

Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof of concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company  |
+--+------------+-------+---------+
|1 |123ABC      |system1|Acme     |
|2 |3285624     |system1|Ajax     |
|3 |CDE567      |system1|Emca     |
|4 |XX          |system2|Ajax     |
|5 |3285624     |system2|Ajax&Sons|
|6 |0147852     |system2|Ajax     |
|7 |123ABC      |system2|Acme     |
|8 |CDE567      |system2|Xaja     |
+--+------------+-------+---------+
The main grouping column is serialNumber. Ids 1 and 7 should match because the company is a full match. Ids 2 and 5 should match because the company in id 2 is a partial match of the company in id 5. Ids 3 and 8 should not match because the companies don't match.
I expect the end result to be something like this. Note that the sources are not fixed to just one or two; in the future there will be more sources.
+------+-----+------------+-----------------+----------------+
|uuid  |id   |serialNumber|source           |company         |
+------+-----+------------+-----------------+----------------+
|<uuid>|[1,7]|123ABC      |[system1,system2]|[Acme]          |
|<uuid>|[2,5]|3285624     |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3]  |CDE567      |[system1]        |[Emca]          |
|<uuid>|[4]  |XX          |[system2]        |[Ajax]          |
|<uuid>|[6]  |0147852     |[system2]        |[Ajax]          |
|<uuid>|[8]  |CDE567      |[system2]        |[Xaja]          |
+------+-----+------------+-----------------+----------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy on the serialNumber column and apply collect_list to all the other columns.
Code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
val ds = Seq((1,"123ABC", "system1", "Acme"),
(7,"123ABC", "system2", "Acme"))
.toDF("id", "serialNumber", "source", "company")
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_list("company").alias("company")
)
.show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_set("company").alias("company")
)
.show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+

Spark update column value from another column based if null or empty

I am a newbie in Spark but very curious about Spark performance tuning. I have a large data frame in which I need to update a column (say column A) with the value of another column (say column B) from the same data frame, if the current value of column A is null or empty. Here is my attempt:
val cleanDF = originDF.withColumn("A", when(col("A").isNull || col("A") === "", col("B")).otherwise(col("A")))
Here are my questions:
1) Is there a better way to do the null check? In the Java world, the Apache Commons library provides an API to check whether a String is blank (isBlank). Is there anything similar available in Spark?
2) What is the performance impact of the or (||) condition when validating a huge dataframe?
3) Is there a better-performing option for this column update task? I know UPDATE can be expensive in Spark, so I'm wondering what good practice is.
1) You can use the Commons library. Add --jars "path to your commons jar file" to your spark-shell or spark-submit command.
2) You are doing one pass over the Dataset (plus constant time per row to check colA and copy colB), so it should be O(N).
3) Your colA is replaced (if blank) by colB from the same Row, which is not expensive. I cannot confirm, but I am assuming all the data of a Row resides on the same node.
scala> val df = Seq(
| (null, "A"),
| ("", "B"),
| (" ", "C"),
| ("D", "DB"),
| ("E", null)
| ).toDF("colA","colB")
df: org.apache.spark.sql.DataFrame = [colA: string, colB: string]
scala> df.show
+----+----+
|colA|colB|
+----+----+
|null| A|
| | B|
| | C|
| D| DB|
| E|null|
+----+----+
scala> val myudf = udf{(x:String) => org.apache.commons.lang3.StringUtils.isBlank(x)}
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(StringType)))
scala> df.select(when (myudf($"colA"), $"colB").otherwise($"colA") as "colA", $"colB").show(false)
+----+----+
|colA|colB|
+----+----+
|A |A |
|B |B |
|C |C |
|D |DB |
|E |null|
+----+----+

How to join two pyspark data frames on Arraytype operation?

I have two DataFrames, A and B. Each has a column called 'names', and this column is of type ArrayType(StringType()).
Now I want to left join A and B on the condition that A['names'] and B['names'] have common elements.
Here is an example:
A:
+---------------+
| names|
+---------------+
|['Mike','Jack']|
| ['Peter']|
+---------------+
B:
+---------------+
| names|
+---------------+
|['John','Mike']|
| null|
+---------------+
after the left join, I should have:
+---------------+---------------+
| A_names| B_names|
+---------------+---------------+
|['Mike','Jack']|['John','Mike']|
| ['Peter']| null|
+---------------+---------------+
In your case you have to explode the values. explode produces one row per value in your arrays; you can then join on those values and reduce the final result back to your desired format.
In the code example, I exploded the names and joined the DataFrames on the newly created column (single_names). Finally, the result is grouped by "A_names" to remove the duplicates that were produced.
For the group by aggregate function, you can use pyspark.sql.functions.first() with the parameter ignorenulls set to True.
from pyspark.sql.functions import col, explode, first
test_list = [['Mike', 'Jack']], [['Peter']]
test_df = spark.createDataFrame(test_list, ["names"])
test_list2 = [["John","Mike"]],[["Kate"]]
test_df2 = spark.createDataFrame(test_list2, ["names"])
test_df2 = test_df2.select(
col("names").alias("B_names"),
explode("names").alias("single_names")
)
test_df.select(col("names").alias("A_names"), explode("names").alias("single_names"))\
.join(test_df2, on="single_names", how="left" )\
.groupBy("A_names").agg(first("B_names", ignorenulls=True).alias("B_names")).show()
Result:
+------------+------------+
| A_names| B_names|
+------------+------------+
|[Mike, Jack]|[John, Mike]|
| [Peter]| null|
+------------+------------+

Subtracting DataFrames by a single ID column - duplicate columns behave differently

I am trying to compare two DataFrames with the same schema (in Spark 1.6.0, using Scala) to determine which rows in the newer table have been added (i.e. are not present in the older table).
I need to do this by ID (i.e. examining a single column, not the whole row, to see what is new). Some rows may have changed between the versions, in that they have the same id in both versions, but the other columns have changed - I do not want these in the output, so I cannot simply subtract the two versions.
Based on various suggestions, I am doing a left-outer join on the chosen ID column, then selecting rows with nulls in columns from the right side of the join (indicating that they were not present in the older version of the table):
def diffBy(field:String, newer:DataFrame, older:DataFrame): DataFrame = {
  newer.join(older, newer(field) === older(field), "left_outer")
    .where(older(field).isNull)
  // TODO just select the leftmost columns, removing the nulls
}
However, this does not work (row 3 exists only in the newer version, so it should be output):
scala> newer.show
+---+-------+
| id| value|
+---+-------+
| 3| three|
| 2|two-new|
+---+-------+
scala> older.show
+---+-------+
| id| value|
+---+-------+
| 1| one|
| 2|two-old|
+---+-------+
scala> diffBy("id", newer, older).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
The join is working as expected:
scala> val joined = newer.join(older, newer("id") === older("id"), "left_outer")
scala> joined.show
+---+-------+----+-------+
| id| value| id| value|
+---+-------+----+-------+
| 2|two-new| 2|two-old|
| 3| three|null| null|
+---+-------+----+-------+
So the problem is in the selection of the column for filtering.
joined.where(older("id").isNull).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
Perhaps it is due to the duplicate id column names in the join? But if I use the value column (which is also duplicated) instead to detect nulls, it works as expected:
joined.where(older("value").isNull).show
+---+-----+----+-----+
| id|value| id|value|
+---+-----+----+-----+
| 3|three|null| null|
+---+-----+----+-----+
What is going on here - and why is the behaviour different for id and value?
You can solve the problem using a special Spark join type called "leftanti" (available since Spark 2.0). It is similar to MINUS in Oracle SQL, except that rows are matched only on the join condition rather than on every column.
val joined = newer.join(older, newer("id") === older("id"), "leftanti")
This will only select columns from newer.
I have found a solution to my problem, though not an explanation for why it occurs.
It seems to be necessary to create an alias in order to refer unambiguously to the rightmost id column, and then use a textual WHERE clause so that I can substitute in the qualified column name from the variable field:
newer.join(older.as("o"), newer(field) === older(field), "left_outer")
.where(s"o.$field IS NULL")
