Let's say I have two DataFrames -- df1 and df2 -- both with the columns foo and bar. The column foo is a CRC32 hash value like 123456, and the column bar is a boolean field that defaults to False.
In PySpark, what is an efficient way to compare the values of foo across the two DataFrames, setting the column bar to True in the event they do not match?
e.g., given the following two DataFrames:
# df1
foo | bar
-------|------
123456 | False
444555 | False
666777 | False
888999 | False
# df2
foo | bar
-------|------
938894 | False
129803 | False
666777 | False
888999 | False
I would like a new DataFrame that looks like the following, with bar set to True in the two rows where the hashes have changed:
# df3
foo | bar
-------|------
938894 | True <---
129803 | True <---
666777 | False
888999 | False
Any guidance would be much appreciated.
UPDATE 7/1/2018
After using the accepted answer successfully for quite some time, I encountered a situation that makes the solution not a great fit. If multiple rows from one of the joined DataFrames have the same value for foo as a row from the other DataFrame in the join, it results in a cartesian-product growth of rows on that shared value.
In my case, I had CRC32 hash values based on an empty string, which results in 0 for the hash. I also should have added that I do have a unique string to match the rows on, under id here (I may have oversimplified the situation), and perhaps this is the thing to join on.
It would create situations like this:
# df1
id |foo | bar
-----|-------|------
abc |123456 | False
def |444555 | False
ghi |0 | False
jkl |0 | False
# df2
id |foo | bar
-----|-------|------
abc |123456 | False
def |999999 | False
ghi |666777 | False
jkl |0 | False
And with the selected answer, I would get a DataFrame with more rows than desired:
# df3
id |foo | bar
-----|-------|------
abc |123456 | False
def |999999 | True <---
ghi |0 | False
jkl |0 | False
jkl |0 | False # extra row added through the join
I'm going to keep the answer as selected, because it's a great answer to the question as originally posed. But any suggestions for how to handle DataFrames where the column foo may match would be appreciated.
ANOTHER UPDATE 7/1/2018, ALTERNATE ANSWER
I was overcomplicating the issue by not joining on the id column. Once that is used, it's relatively straightforward to join and write the transformed column based on a direct comparison of the fingerprint column (foo):
df2.alias("df2").join(df1.alias("df1"), df1.id == df2.id, 'left')\
.select(f.col('df2.foo'), f.when(df1.fingerprint != df2.fingerprint, f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
.show(truncate=False)
An aliased left join of df2 with df1 and use of the when function to check for the not-matched logic should give you your desired output:
df2.alias("df2").join(df1.alias("df1"), df1.foo == df2.foo, 'left')\
.select(f.col('df2.foo'), f.when(f.isnull(f.col('df1.foo')), f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
.show(truncate=False)
which should give you
+------+-----+
|foo |bar |
+------+-----+
|129803|true |
|938894|true |
|888999|false|
|666777|false|
+------+-----+
In short, use a left join and write the logic so that when the joined df1.foo is null (i.e., no match) you output True, and otherwise keep the existing bar value.
I'm trying to obtain the following:
+---------+---------+
|work_time|day_shift|
+---------+---------+
| 00:45:40|       No|
| 10:05:47|      Yes|
| 15:25:28|      Yes|
| 19:38:52|       No|
+---------+---------+
where I classify the "work_time" into "day_shift".
"Yes" - if the time falls between 09:00:00 and 18:00:00
"No" - otherwise
My "work_time" is in datetime format showing only the time. I tried the following, but I'm just getting "No" for everything.
df = df.withColumn('day_shift', when(df.work_time >= to_timestamp(lit('09:00:00'), 'HH:mm:ss') & df.work_time <= to_timestamp(lit('18:00:00'), 'Yes').otherwise('No'))
You can use the Column class method between. It works for both timestamps and strings in the format "HH:mm:ss". Use this:
F.col("work_time").between("09:00:00", "18:00:00")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([('00:45:40',), ('10:05:47',), ('15:25:28',), ('19:38:52',)], ['work_time'])
day_shift = F.col("work_time").between("09:00:00", "18:00:00")
df = df.withColumn("day_shift", F.when(day_shift, "Yes").otherwise("No"))
df.show()
# +---------+---------+
# |work_time|day_shift|
# +---------+---------+
# | 00:45:40| No|
# | 10:05:47| Yes|
# | 15:25:28| Yes|
# | 19:38:52| No|
# +---------+---------+
First of all, Spark doesn't have a so-called "Time" data type; it only supports TimestampType and DateType. Therefore, I believe the work_time in your DataFrame is a string.
Secondly, if you check your func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss') in a select statement, it will show:
+--------------------------------+
|to_timestamp(09:00:00, HH:mm:ss)|
+--------------------------------+
|1970-01-01 09:00:00 |
+--------------------------------+
only showing top 1 row
The best way to achieve this is either to split your work_time column into hour, minute, and second columns and filter on those, or to add a date value to your work_time column before any timestamp filtering.
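For example, a minimal sketch of the second option, assuming work_time is an "HH:mm:ss" string (the sample DataFrame below is only for illustration):
from pyspark.sql import functions as F

df = spark.createDataFrame([('00:45:40',), ('10:05:47',), ('15:25:28',), ('19:38:52',)], ['work_time'])

# prepend an arbitrary date so to_timestamp yields a full, comparable timestamp
ts = F.to_timestamp(F.concat(F.lit('1970-01-01 '), F.col('work_time')), 'yyyy-MM-dd HH:mm:ss')

df = df.withColumn(
    'day_shift',
    F.when(
        (ts >= F.to_timestamp(F.lit('1970-01-01 09:00:00'), 'yyyy-MM-dd HH:mm:ss')) &
        (ts <= F.to_timestamp(F.lit('1970-01-01 18:00:00'), 'yyyy-MM-dd HH:mm:ss')),
        'Yes'
    ).otherwise('No')
)
df.show()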
I have a text file which looks like:
:1: some first row of first attribute
second row of first attribute
3rd value with test: 1,2,3
:55: first row of fifty fifth
:100: some other text
also other
another one
I would like to parse it in the following manner:
+----------+-----------------------------------+
| AttrNr | Row |
+----------+-----------------------------------+
| 1 | some first row of first attribute |
+----------+-----------------------------------+
| 1 | second row of first attribute |
+----------+-----------------------------------+
| 1 | 3rd value with test: 1,2,3 |
+----------+-----------------------------------+
| 55 | first row of fifty fifth |
+----------+-----------------------------------+
| 100 | some other text |
+----------+-----------------------------------+
| 100 | also other |
+----------+-----------------------------------+
| 100 | another one |
+----------+-----------------------------------+
Parsing should be done according to the :n: delimiter. The ":" symbol might appear in values.
The final output can be achieved using the Window functions available in Spark, but your data lacks essential details such as a partitioning column and a column by which to order the data so we know which row comes after which.
Assuming you are working on a distributed system here, the following answer might not work at all. It works for the provided example, but things will be different when you are in a distributed environment with a huge file.
Creating a DataFrame from the text file:
Reading the text file as an RDD:
val rdd = sc.parallelize(Seq(
(":1: some first row of first attribute"),
("second row of first attribute"),
(":55: first row of fifty fifth"),
(":100: some other text"),
("also other"),
("another one")
))
// Or use spark.sparkContext.textFile if you are reading from a file
Iterate over the RDD to split the columns in the required format and generate a DataFrame
val df = rdd.map{ c=>
if(c.startsWith(":")) (c.split(" ", 2)(0), c.split(" ", 2)(1))
else (null.asInstanceOf[String], c )
}.toDF("AttrNr", "Row")
//df: org.apache.spark.sql.DataFrame = [AttrNr: string, Row: string]
df.show(false)
// +------+---------------------------------+
// |AttrNr|Row |
// +------+---------------------------------+
// |:1: |some first row of first attribute|
// |null |second row of first attribute |
// |:55: |first row of fifty fifth |
// |:100: |some other text |
// |null |also other |
// |null |another one |
// +------+---------------------------------+
The following set of commands is just a hack, is not performant at all, and shouldn't be used in a production-like environment. last provides the last non-null value. Partitioning and ordering are done manually here because your data does not provide such columns.
df.withColumn("p", lit(1))
.withColumn("AttrNr",
last($"AttrNr", true).over(Window.partitionBy($"p").orderBy(lit(1)).rowsBetween(Window.unboundedPreceding, 0) ) )
// +------+---------------------------------+
// |AttrNr|Row |
// +------+---------------------------------+
// |:1: |some first row of first attribute|
// |:1: |second row of first attribute |
// |:55: |first row of fifty fifth |
// |:100: |some other text |
// |:100: |also other |
// |:100: |another one |
// +------+---------------------------------+
Actually I solved it with SQL, but I was wondering whether there is a simpler way. I'm using Spark 2.3 without higher-order functions.
import org.apache.spark.sql.expressions.Window
val df = Seq((":1: some first row of first attribute"),
("second row of first attribute"),
("3rd value with test: 1,2,3"),
(":55: first row of fifty fifth"),
(":100: some other text"),
("also other"),
("another one")).toDF("_c0")
df.createOrReplaceTempView("test1")
spark.sql("""select _c0, split(_c0, ":") arr from test1""").createOrReplaceTempView("test2")
val testDF = spark.sql("""
select arr[1] t0,
cast(arr[1] as int) t1,
case when arr[1] = cast(arr[1] as int)
then replace(concat_ws(":",arr),concat(concat(":",arr[1]),":"),"")
else concat_ws(":",arr)
end Row
,monotonically_increasing_id() mrn
from test2""")
val fnc = Window.orderBy("mrn")
val testDF2 = testDF.withColumn("AttrNr", last('t1,true).over(fnc))
testDF2.drop("t0","t1","mrn").show(false)
+----------------------------------+------+
|Row |AttrNr|
+----------------------------------+------+
| some first row of first attribute|1 |
|second row of first attribute |1 |
|3rd value with test: 1,2,3 |1 |
| first row of fifty fifth |55 |
| some other text |100 |
|also other |100 |
|another one |100 |
+----------------------------------+------+
Column "AttrNr" can be received with "regexp_extract" function:
df
.withColumn("AttrNr", regexp_extract($"_c0", "^:([\\d].*):", 0))
.withColumn("Row", when(length($"AttrNr") === lit(0), $"_c0").otherwise(expr("substring(_c0, length(AttrNr) + 2)")))
.withColumn("AttrNr", when(length($"AttrNr") === lit(0), null.asInstanceOf[String]).otherwise(expr("substring(_c0, 2, length(AttrNr) - 2)")))
// Window with no partitioning, bad for performance
.withColumn("AttrNr", last($"AttrNr", true).over(Window.orderBy(lit(1)).rowsBetween(Window.unboundedPreceding, 0)))
.drop("_c0")
I have a Spark DataFrame that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value.
An example DataFrame would be:
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam | [789,999]
789 | marc | [111]
555 | dan | [333]
From the above, I need to append all the related child IDs to the array column of the parent ID. The resulting DataFrame should look like:
ID | NAME | RELATED_IDLIST
--------------------------
123 | mike | [345,456,789,999,111]
345 | alen | [789,111]
456 | sam | [789,999,111]
789 | marc | [111]
555 | dan | [333]
I need help on how to do this. Thanks.
One way to handle this task is to do a self left-join, update the RELATED_IDLIST, and repeat for several iterations until some conditions are satisfied (this works only when the max depth of the whole hierarchy is small). For Spark 2.3, we can convert the ArrayType column into a comma-delimited StringType column, use the SQL builtin function find_in_set, and add a new column PROCESSED_IDLIST to set up the join conditions; see below for the main logic:
Functions:
from pyspark.sql import functions as F
import pandas as pd
# define a function which takes a dataframe as input, does a self left-join and then returns another
# dataframe with exactly the same schema as the input dataframe. repeat this until some conditions are satisfied
def recursive_join(d, max_iter=10):
    # function to find direct child-IDs and merge them into RELATED_IDLIST
    def find_child_idlist(_df):
        return _df.alias('d1').join(
            _df.alias('d2'),
            F.expr("find_in_set(d2.ID,d1.RELATED_IDLIST)>0 AND find_in_set(d2.ID,d1.PROCESSED_IDLIST)<1"),
            "left"
        ).groupby("d1.ID", "d1.NAME").agg(
            F.expr("""
                /* combine d1.RELATED_IDLIST with all matched entries from collect_list(d2.RELATED_IDLIST)
                 * and remove the trailing comma left when all d2.RELATED_IDLIST are NULL */
                trim(TRAILING ',' FROM
                    concat_ws(",", first(d1.RELATED_IDLIST), concat_ws(",", collect_list(d2.RELATED_IDLIST)))
                ) as RELATED_IDLIST"""),
            F.expr("first(d1.RELATED_IDLIST) as PROCESSED_IDLIST")
        )
    # below is the main code logic
    d = find_child_idlist(d).persist()
    if (d.filter("RELATED_IDLIST!=PROCESSED_IDLIST").count() > 0) & (max_iter > 1):
        d = recursive_join(d, max_iter-1)
    return d
# define pandas_udf to remove duplicate from an ArrayType column
get_uniq = F.pandas_udf(lambda s: pd.Series([ list(set(x)) for x in s ]), "array<int>")
Where:
in the function find_child_idlist(), the left join must satisfy the following two conditions (a quick check of find_in_set semantics is shown after this list):
d2.ID is in d1.RELATED_IDLIST: find_in_set(d2.ID,d1.RELATED_IDLIST)>0
d2.ID not in d1.PROCESSED_IDLIST: find_in_set(d2.ID,d1.PROCESSED_IDLIST)<1
quit recursive_join when no row satisfies RELATED_IDLIST != PROCESSED_IDLIST, or when max_iter counts down to 1
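For reference, find_in_set returns the 1-based position of a value in a comma-delimited string (0 if absent), which is what the two join conditions above rely on. A quick, illustrative check:
spark.sql("SELECT find_in_set('789', '345,456,789') AS pos").show()
# +---+
# |pos|
# +---+
# |  3|
# +---+
# pos > 0 means the ID is present in the comma-delimited list; 0 means it is not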
Processing:
set up dataframe:
df = spark.createDataFrame([
(123, "mike", [345,456]), (345, "alen", [789]), (456, "sam", [789,999]),
(789, "marc", [111]), (555, "dan", [333])
],["ID", "NAME", "RELATED_IDLIST"])
add a new column PROCESSED_IDLIST to keep track of the IDs already handled in previous joins (initialized from ID), convert RELATED_IDLIST to a comma-delimited string, and run recursive_join():
df1 = df.withColumn('RELATED_IDLIST', F.concat_ws(',','RELATED_IDLIST')) \
.withColumn('PROCESSED_IDLIST', F.col('ID'))
df_new = recursive_join(df1, 5)
df_new.show(10,0)
+---+----+-----------------------+-----------------------+
|ID |NAME|RELATED_IDLIST |PROCESSED_IDLIST |
+---+----+-----------------------+-----------------------+
|555|dan |333 |333 |
|789|marc|111 |111 |
|345|alen|789,111 |789,111 |
|123|mike|345,456,789,789,999,111|345,456,789,789,999,111|
|456|sam |789,999,111 |789,999,111 |
+---+----+-----------------------+-----------------------+
split RELATED_IDLIST into an array of integers and then use the pandas_udf function to drop duplicate array elements:
df_new.withColumn("RELATED_IDLIST", get_uniq(F.split('RELATED_IDLIST', ',').cast('array<int>'))).show(10,0)
+---+----+-------------------------+-----------------------+
|ID |NAME|RELATED_IDLIST |PROCESSED_IDLIST |
+---+----+-------------------------+-----------------------+
|555|dan |[333] |333 |
|789|marc|[111] |111 |
|345|alen|[789, 111] |789,111 |
|123|mike|[999, 456, 111, 789, 345]|345,456,789,789,999,111|
|456|sam |[111, 789, 999] |789,999,111 |
+---+----+-------------------------+-----------------------+
| Col1 | Col2 | Col3 |
|------|------|------|
| m | n | o |
| m | q | e |
| a | b | r |
Let's say I have a pandas DataFrame as shown above. Notice the Col1 values are the same for the 0th and 1st rows. Is there a way to find all the duplicate entries in the DataFrame based on Col1 only?
Additionally, I would also like to add another column, say is_duplicate, which would be True for all the duplicate instances in my DataFrame and False otherwise.
Note: I want to find the duplicates based only on the value in Col1; the other columns may or may not be duplicates, and they shouldn't be taken into consideration.
.duplicated() has exactly that functionality:
df['is_duplicate'] = df.duplicated('Col1')
I found it:
df["is_duplicate"] = df.Col1.duplicated(keep=False)
I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then, the next step consists of grouping only by CatA, to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume it is more efficient than aggregating the first DataFrame again. Is that a good assumption? Or does it make no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking, it looks like a good approach and should be more efficient than aggregating the original data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once. So when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce memory requirements for the aggregation buffer during the second aggregation.
Unfortunately, DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means you cannot compute both DataFrames using a single shuffle based on the value of catA, so the second aggregation will require a separate exchange (hash partitioning). I doubt that justifies switching to RDDs.
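To make the reuse concrete, here is a minimal PySpark sketch of the same two-step pattern (column names follow the question; the explicit cache is optional, since the answer above notes the first aggregation's shuffle output is reused implicitly anyway):
from pyspark.sql import functions as F

# df is assumed to already have the columns catA and catB
df2 = df.groupBy("catA", "catB").agg(F.count("*").alias("cnt")).cache()
df3 = df2.groupBy("catA").agg(F.sum("cnt").alias("cnt"))

df2.show()  # triggers the first shuffle on (catA, catB)
df3.show()  # reuses df2; only the second, smaller shuffle on catA is new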