Remove rows where groups of two columns have differences

Remove rows where groups of two columns have differences - apache-spark

Is it possible to remove rows if the values in the Block column occurs at least twice which has different values in the ID column?
My data looks like this:
ID
Block
1
A
1
C
1
C
3
A
3
B
In the above case, the value A in the Block column occurs twice, which has values 1 and 3 in the ID column. So the rows are removed.
The expected output should be:
ID
Block
1
C
1
C
3
B
I tried to use the dropDuplicates after the groupBy, but I don't know how to filter with this type of condition. It appears that I would need a set for the Block column to check with the ID column.

One way to do it is using window functions. The first one (lag) marks the row if it is different than the previous. The second (sum) marks all "Block" rows for previously marked rows. Lastly, deleting roes and the helper (_flag) column.
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[(1, 'A'),
(1, 'C'),
(1, 'C'),
(3, 'A'),
(3, 'B')],
['ID', 'Block'])
Script:
w1 = W.partitionBy('Block').orderBy('ID')
w2 = W.partitionBy('Block')
grp = F.when(F.lag('ID').over(w1) != F.col('ID'), 1).otherwise(0)
df = df.withColumn('_flag', F.sum(grp).over(w2) == 0) \
.filter('_flag').drop('_flag')
df.show()
# +---+-----+
# | ID|Block|
# +---+-----+
# | 3| B|
# | 1| C|
# | 1| C|
# +---+-----+

Use window functions. get ranks per group of blocks and through away any rows that rank higher than 1. Code below
(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('ID','Block')))#create an index to reorder after comps
.withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID'))).orderBy('index')#Rank per Block
.where(col('BlockRank')==1)
.drop('index','BlockRank')
).show()
+---+-----+
| ID|Block|
+---+-----+
| 1| A|
| 1| C|
| 1| C|
| 3| B|
+---+-----+

Related

Spark dataframe slice

I've a table with (millions of) entries along the lines of the following example read into a Spark dataframe (sdf):
Id
C1
C2
xx1
c118
c219
xx1
c113
c218
xx1
c118
c214
acb
c121
c201
e3d
c181
c221
e3d
c132
c252
abq
c141
c290
...
...
...
vy1
c13023
C23021
I'd like to get a smaller subset of these Id's for further processing. I identify the unique set of Id's in the table using sdf_id = sdf.select("Id").dropDuplicates().
What is the efficient way from here to filter data (C1, C2) related to, let's say, 100 randomly selected Id's?

There are several ways to achieve what you want.
My sample data
df = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e'),
(3, 'f'),
], ['id', 'col'])
The initial step is getting the sample IDs that you wanted
ids = df.select('id').distinct().sample(0.2) # 2 is 20%, you can adjust this
+---+
| id|
+---+
| 1|
+---+
Approach #1: using inner join
Since you have two dataframes, you can just perform a single inner join to get all records from df for each id in ids. Note that F.broadcast is to boost up the performance because ids suppose to be small enough. Feel free to take it away if you want to. Performance-wise, this approach is preferred.
df.join(F.broadcast(ids), on=['id'], how='inner').show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
+---+---+
Approach #2: using isin
You can't simply get the list of IDs via ids.collect(), because that would return a list of Row, you have to loop through it to get the exact column that you want (id in this case).
df.where(F.col('id').isin([r['id'] for r in ids.collect()])).show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
+---+---+

Since you already have the list of unique ids , you can further sample it to your desired fraction and filter based on that
There are other ways you can sample random ids , which can be found here
Sampling
### Assuming the DF is 1 mil records , 100 records would be 0.01%
sdf_id = sdf.select("Id").dropDuplicates().sample(0.01).collect()
Filter
sdf_filtered = sdf.filter(F.col('Id').isin(sdf_id))

Mapping key and list of values to key value using pyspark

I have a dataset which consists of two columns C1 and C2.The columns are associated with a relation of many to many.
What I would like to do is find for each C2 the value C1 which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2 so i would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
map(lambda line: (line.split(",")[0],list(line.split(",")[1]) ) ).\
reduceByKey(lambda x , y : x+y )
What this does is for each C1 value gather all the C2 matches,the count of this list is our desired matches column. What I would like now is somehow use each value in this list as a new key and have a mapping like :
(Key ,Value_list[value1,value2,...]) -->(value1 , key ),(value2 , key)...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!

The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
.count()
.join(df, 'C1')
.groupBy(F.col('C2').alias('Out1'))
.agg(
F.max(
F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
).alias('c')
)
.select('Out1', 'c.Out2', 'c.matches')
.orderBy('Out1')
)
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+

We can get the desired result easily using dataframe API.
from pyspark.sql import *
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
.groupby(fun.col('C2').alias('out1')) \
.agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|Out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+

Drop rows containing specific value in PySpark dataframe

I have a pyspark dataframe like:
A B C
1 NA 9
4 2 5
6 4 2
5 1 NA
I want to delete rows which contain value "NA". In this case first and the last row. How to implement this using Python and Spark?
Update based on comment:
Looking for a solution that removes rows that have the string: NA in any of the many columns.

Just use a dataframe filter expression:
l = [('1','NA','9')
,('4','2', '5')
,('6','4','2')
,('5','NA','1')]
df = spark.createDataFrame(l,['A','B','C'])
#The following command requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 4| 2| 5|
| 6| 4| 2|
+---+---+---+
#bluephantom: In the case you have hundreds of columns, just generate a string expression via list comprehension:
#In my example are columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()

In case if you want to remove the row
df = df.filter((df.A != 'NA') | (df.B != 'NA'))
But sometimes we need to replace with mean(in case of numeric column) or most frequent value(in case of categorical). for that you need to add column with same name which replace the original column i-e "A"
from pyspark.sql.functions import mean,col,when,count
df=df.withColumn("A",when(df.A=="NA",mean(df.A)).otherwise(df.A))

In Scala I did this differently, but got to this using pyspark. Not my favourite answer, but it is because of lesser pyspark knowledge my side. Things seem easier in Scala. Unlike an array there is no global match against all columns that can stop as soon as one found. Dynamic in terms of number of columns.
Assumptions made on data not having ~~ as part of data, could have split to array but decided not to do here. Using None instead of NA.
from pyspark.sql import functions as f
data = [(1, None, 4, None),
(2, 'c', 3, 'd'),
(None, None, None, None),
(3, None, None, 'z')]
df = spark.createDataFrame(data, ['k', 'v1', 'v2', 'v3'])
columns = df.columns
columns_Count = len(df.columns)
# colCompare is String
df2 = df.select(df['*'], f.concat_ws('~~', *columns).alias('colCompare') )
df3 = df2.filter(f.size(f.split(f.col("colCompare"), r"~~")) == columns_Count).drop("colCompare")
df3.show()
returns:
+---+---+---+---+
| k| v1| v2| v3|
+---+---+---+---+
| 2| c| 3| d|
+---+---+---+---+

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a spark column with a new column which is a binary flag.
I tried directly overwriting the column id2 but why is it not working like a inplace operation in Pandas?
How to do it without using withcolumn() to create new column and drop() to drop the old column?
I know that spark dataframe is immutable, is that the reason or there is a different way to overwrite without using withcolumn() & drop()?
df2 = spark.createDataFrame(
[(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
('session', "timestamp1", "id2"))
df2.select(df2.id2 > 0).show()
+---------+
|(id2 > 0)|
+---------+
| true|
| true|
| true|
| true|
| true|
| true|
| true|
+---------+
# Attempting to overwriting df2.id2
df2.id2=df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)','id2')
df2.show()
#Overwriting unsucessful
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
| 1| 1| NaN|
| 1| 2| 5.0|
| 1| 3| NaN|
| 1| 4| NaN|
| 1| 5|10.0|
| 1| 6| NaN|
| 1| 6| NaN|
+-------+----------+----+

You can use
d1.withColumnRenamed("colName", "newColName")
d1.withColumn("newColName", $"colName")
The withColumnRenamed renames the existing column to new name.
The withColumn creates a new column with a given name. It creates a new column with same name if there exist already and drops the old one.
In your case changes are not applied to the original dataframe df2, it changes the name of column and return as a new dataframe which should be assigned to new variable for the further use.
d3 = df2.select((df2.id2 > 0).alias("id2"))
Above should work fine in your case.
Hope this helps!

As stated above it's not possible to overwrite DataFrame object, which is immutable collection, so all transformations return new DataFrame.
The fastest way to achieve your desired effect is to use withColumn:
df = df.withColumn("col", some expression)
where col is name of column which you want to "replace". After running this value of df variable will be replaced by new DataFrame with new value of column col. You might want to assign this to new variable.
In your case it can look:
df2 = df2.withColumn("id2", (df2.id2 > 0) & (df2.id2 != float('nan')))
I've added comparison to nan, because I'm assuming you don't want to treat nan as greater than 0.

If you're working with multiple columns of the same name in different joined tables you can use the table alias in the colName in withColumn.
Eg. df1.join(df2, df1.id = df2.other_id).withColumn('df1.my_col', F.greatest(df1.my_col, df2.my_col))
And if you only want to keep the columns from df1 you can also call .select('df1.*')
If you instead do df1.join(df2, df1.id = df2.other_id).withColumn('my_col', F.greatest(df1.my_col, df2.my_col))
I think it overwrites the last column which is called my_col. So it outputs:
id, my_col (df1.my_col original value), id, other_id, my_col (newly computed my_col)

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark sql dataframe, consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id columnn is uniquely determined, whereas, looking at dat1 ... datn there may be duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this is basically correct, it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found .e.g this SO post. But this would require to construct a huge "string join condition".
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3

If number ids per group is relatively small you can groupBy and collect_list. Required imports
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
(1, "a", "b", 3),
(2, None, "f", None),
(3, "g", "h", 4),
(4, None, "f", None),
(5, "a", "b", 3)
]).toDF(["id"])
query:
(df
.groupBy(df.columns[1:])
.agg(collect_list("id").alias("ids"))
.where(size("ids") > 1))
and the result:
+----+---+----+------+
| _2| _3| _4| ids|
+----+---+----+------+
|null| f|null|[2, 4]|
| a| b| 3|[1, 5]|
+----+---+----+------+
You can apply explode twice (or use an udf) to an output equivalent to the one returned from join.
You can also identify groups using minimal id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
.select(
"*",
count("*").over(w).alias("_cnt"),
min("id").over(w).alias("group"))
.where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id| _2| _3| _4|_cnt|group|
+---+----+---+----+----+-----+
| 2|null| f|null| 2| 2|
| 4|null| f|null| 2| 2|
| 1| a| b| 3| 2| 1|
| 5| a| b| 3| 2| 1|
+---+----+---+----+----+-----+
You can further use group column for self join.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Remove rows where groups of two columns have differences - apache-spark

Related

Spark dataframe slice

Mapping key and list of values to key value using pyspark

Drop rows containing specific value in PySpark dataframe

How to overwrite entire existing column in Spark dataframe with new column?

Get IDs for duplicate rows (considering all other columns) in Apache Spark

Categories

Resources