Spark dataframe slice - apache-spark

I've a table with (millions of) entries along the lines of the following example read into a Spark dataframe (sdf):
Id
C1
C2
xx1
c118
c219
xx1
c113
c218
xx1
c118
c214
acb
c121
c201
e3d
c181
c221
e3d
c132
c252
abq
c141
c290
...
...
...
vy1
c13023
C23021
I'd like to get a smaller subset of these Id's for further processing. I identify the unique set of Id's in the table using sdf_id = sdf.select("Id").dropDuplicates().
What is the efficient way from here to filter data (C1, C2) related to, let's say, 100 randomly selected Id's?

There are several ways to achieve what you want.
My sample data
df = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e'),
(3, 'f'),
], ['id', 'col'])
The initial step is getting the sample IDs that you wanted
ids = df.select('id').distinct().sample(0.2) # 2 is 20%, you can adjust this
+---+
| id|
+---+
| 1|
+---+
Approach #1: using inner join
Since you have two dataframes, you can just perform a single inner join to get all records from df for each id in ids. Note that F.broadcast is to boost up the performance because ids suppose to be small enough. Feel free to take it away if you want to. Performance-wise, this approach is preferred.
df.join(F.broadcast(ids), on=['id'], how='inner').show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
+---+---+
Approach #2: using isin
You can't simply get the list of IDs via ids.collect(), because that would return a list of Row, you have to loop through it to get the exact column that you want (id in this case).
df.where(F.col('id').isin([r['id'] for r in ids.collect()])).show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
+---+---+

Since you already have the list of unique ids , you can further sample it to your desired fraction and filter based on that
There are other ways you can sample random ids , which can be found here
Sampling
### Assuming the DF is 1 mil records , 100 records would be 0.01%
sdf_id = sdf.select("Id").dropDuplicates().sample(0.01).collect()
Filter
sdf_filtered = sdf.filter(F.col('Id').isin(sdf_id))

Related

Remove rows where groups of two columns have differences

Is it possible to remove rows if the values in the Block column occurs at least twice which has different values in the ID column?
My data looks like this:
ID
Block
1
A
1
C
1
C
3
A
3
B
In the above case, the value A in the Block column occurs twice, which has values 1 and 3 in the ID column. So the rows are removed.
The expected output should be:
ID
Block
1
C
1
C
3
B
I tried to use the dropDuplicates after the groupBy, but I don't know how to filter with this type of condition. It appears that I would need a set for the Block column to check with the ID column.
One way to do it is using window functions. The first one (lag) marks the row if it is different than the previous. The second (sum) marks all "Block" rows for previously marked rows. Lastly, deleting roes and the helper (_flag) column.
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[(1, 'A'),
(1, 'C'),
(1, 'C'),
(3, 'A'),
(3, 'B')],
['ID', 'Block'])
Script:
w1 = W.partitionBy('Block').orderBy('ID')
w2 = W.partitionBy('Block')
grp = F.when(F.lag('ID').over(w1) != F.col('ID'), 1).otherwise(0)
df = df.withColumn('_flag', F.sum(grp).over(w2) == 0) \
.filter('_flag').drop('_flag')
df.show()
# +---+-----+
# | ID|Block|
# +---+-----+
# | 3| B|
# | 1| C|
# | 1| C|
# +---+-----+
Use window functions. get ranks per group of blocks and through away any rows that rank higher than 1. Code below
(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('ID','Block')))#create an index to reorder after comps
.withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID'))).orderBy('index')#Rank per Block
.where(col('BlockRank')==1)
.drop('index','BlockRank')
).show()
+---+-----+
| ID|Block|
+---+-----+
| 1| A|
| 1| C|
| 1| C|
| 3| B|
+---+-----+

Get the distinct elements of a column grouped by another column on a PySpark Dataframe

I have a pyspark DF of ids and purchases which I'm trying to transform for use with FP growth.
Currently i have multiple rows for a given id with each row only relating to a single purchase.
I'd like to transform this dataframe to a form where there are two columns, one for id (with a single row per id ) and the second column containing a list of distinct purchases for that id.
I've tried to use a User Defined Function (UDF) to map the distinct purchases onto the distinct ids but I get a "py4j.Py4JException: Method getstate([]) does not exist". Thanks to #Mithril
I see that "You can't use sparkSession object , spark.DataFrame object or other Spark distributed objects in udf and pandas_udf, because they are unpickled."
So I've implemented the TERRIBLE approach below (which will work but is not scalable):
#Lets create some fake transactions
customers = [1,2,3,1,1]
purschases = ['cake','tea','beer','fruit','cake']
# Lets create a spark DF to capture the transactions
transactions = zip(customers,purschases)
spk_df_1 = spark.createDataFrame(list(transactions) , ["id", "item"])
# Lets have a look at the resulting spark dataframe
spk_df_1.show()
# Lets capture the ids and list of their distinct pruschases in a
# list of tuples
purschases_lst = []
nums1 = []
import pyspark.sql.functions as f
# for each distinct id lets get the list of their distinct pruschases
for id in spark.sql("SELECT distinct(id) FROM TBLdf ").rdd.map(lambda row : row[0]).collect():
purschase = df.filter(f.col("id") == id).select("item").distinct().rdd.map(lambda row : row[0]).collect()
nums1.append((id,purschase))
# Lets see what our list of transaction tuples looks like
print(nums1)
print("\n")
# lets turn the list of transaction tuples into a pandas dataframe
df_pd = pd.DataFrame(nums1)
# Finally lets turn our pandas dataframe into a pyspark Dataframe
df2 = spark.createDataFrame(df_pd)
df2.show()
Output:
+---+-----+
| id| item|
+---+-----+
| 1| cake|
| 2| tea|
| 3| beer|
| 1|fruit|
| 1| cake|
+---+-----+
[(1, ['fruit', 'cake']), (3, ['beer']), (2, ['tea'])]
+---+-------------+
| 0| 1|
+---+-------------+
| 1|[fruit, cake]|
| 3| [beer]|
| 2| [tea]|
+---+-------------+
If anybody has any suggestions I'd greatly appreciate it.
That is a task for collect_set, which creates a set of items without duplicates:
import pyspark.sql.functions as F
#Lets create some fake transactions
customers = [1,2,3,1,1]
purschases = ['cake','tea','beer','fruit','cake']
# Lets create a spark DF to capture the transactions
transactions = zip(customers,purschases)
spk_df_1 = spark.createDataFrame(list(transactions) , ["id", "item"])
spk_df_1.show()
spk_df_1.groupby('id').agg(F.collect_set('item')).show()
Output:
+---+-----+
| id| item|
+---+-----+
| 1| cake|
| 2| tea|
| 3| beer|
| 1|fruit|
| 1| cake|
+---+-----+
+---+-----------------+
| id|collect_set(item)|
+---+-----------------+
| 1| [fruit, cake]|
| 3| [beer]|
| 2| [tea]|
+---+-----------------+

switch identifiers in spark to pseudonymize dataset

I have a dataframe like:
+---+-----+
| id|value|
+---+-----+
| a| 1|
| a| 2|
| b| 1|
| b| 3|
+---+-----+
val df = Seq(("a", 1), ("a", 2), ("b", 1), ("b", 3)).toDF("id", "value")
How can I efficiently switch / rotate IDs. Note, hashing is not what I want here, I explicitly want to rotated the identifiers. How could this be implemented in spark efficiently without a self join? Maybe some RDD zipWithIndex?
Not: my intention is to pseudonyme / anonymize the dataset by rotating identifiers. My requirement is to replace each a with another identifier, i.e. possibly b. They all need to be replaced to the same value.
edit
I had a first suggestion: https://spark.apache.org/docs/latest/ml-features.html#stringindexer but this would change data types and also not rotate identifiers something I want to prevent. I need a drop in but anon- /pseudo-nymized replacement.
Also, I expect about 8 million (constant) distinct values for ID.
Collecting all distinct elements and building a map using zip with a randomly permutated list of these distinct elements works.

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark sql dataframe, consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id columnn is uniquely determined, whereas, looking at dat1 ... datn there may be duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this is basically correct, it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found .e.g this SO post. But this would require to construct a huge "string join condition".
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3
If number ids per group is relatively small you can groupBy and collect_list. Required imports
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
(1, "a", "b", 3),
(2, None, "f", None),
(3, "g", "h", 4),
(4, None, "f", None),
(5, "a", "b", 3)
]).toDF(["id"])
query:
(df
.groupBy(df.columns[1:])
.agg(collect_list("id").alias("ids"))
.where(size("ids") > 1))
and the result:
+----+---+----+------+
| _2| _3| _4| ids|
+----+---+----+------+
|null| f|null|[2, 4]|
| a| b| 3|[1, 5]|
+----+---+----+------+
You can apply explode twice (or use an udf) to an output equivalent to the one returned from join.
You can also identify groups using minimal id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
.select(
"*",
count("*").over(w).alias("_cnt"),
min("id").over(w).alias("group"))
.where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id| _2| _3| _4|_cnt|group|
+---+----+---+----+----+-----+
| 2|null| f|null| 2| 2|
| 4|null| f|null| 2| 2|
| 1| a| b| 3| 2| 1|
| 5| a| b| 3| 2| 1|
+---+----+---+----+----+-----+
You can further use group column for self join.

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo',41,'US',3),
('Foo',39,'UK',1),
('Bar',57,'CA',2),
('Bar',72,'CA',2),
('Baz',22,'US',6),
('Baz',36,'US',6)])
What I would like to do is remove duplicate rows based on the values of the first,third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed
But how do I only remove duplicate rows based on columns 1, 3 and 4 only? i.e. remove either one one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In Python, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/Pyspark?
Pyspark does include a dropDuplicates() method, which was introduced in 1.4. https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=5, height=80), \
... Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
| 10| 80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 5| 80|Alice|
+---+------+-----+
From your question, it is unclear as-to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
return "{0}{1}{2}".format(x[0],x[2],x[3])
m = data.map(lambda x: (get_key(x),x))
Now, you have a key-value RDD that is keyed by columns 1,3 and 4.
The next step would be either a reduceByKey or groupByKey and filter.
This would eliminate duplicates.
r = m.reduceByKey(lambda x,y: (x))
I know you already accepted the other answer, but if you want to do this as a
DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:
myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")
Note that in this case, I chose the Max of col2, but you could do avg, min, etc.
Agree with David. To add on, it may not be the case that we want to groupBy all columns other than the column(s) in aggregate function i.e, if we want to remove duplicates purely based on a subset of columns and retain all columns in the original dataframe. So the better way to do this could be using dropDuplicates Dataframe api available in Spark 1.4.0
For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
I used inbuilt function dropDuplicates(). Scala code given below
val data = sc.parallelize(List(("Foo",41,"US",3),
("Foo",39,"UK",1),
("Bar",57,"CA",2),
("Bar",72,"CA",2),
("Baz",22,"US",6),
("Baz",36,"US",6))).toDF("x","y","z","count")
data.dropDuplicates(Array("x","count")).show()
Output :
+---+---+---+-----+
| x| y| z|count|
+---+---+---+-----+
|Baz| 22| US| 6|
|Foo| 39| UK| 1|
|Foo| 41| US| 3|
|Bar| 57| CA| 2|
+---+---+---+-----+
The below programme will help you drop duplicates on whole , or if you want to drop duplicates based on certain columns , you can even do that:
import org.apache.spark.sql.SparkSession
object DropDuplicates {
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-DropDuplicates")
.master("local[4]")
.getOrCreate()
import spark.implicits._
// create an RDD of tuples with some data
val custs = Seq(
(1, "Widget Co", 120000.00, 0.00, "AZ"),
(2, "Acme Widgets", 410500.00, 500.00, "CA"),
(3, "Widgetry", 410500.00, 200.00, "CA"),
(4, "Widgets R Us", 410500.00, 0.0, "CA"),
(3, "Widgetry", 410500.00, 200.00, "CA"),
(5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
(6, "Widget Co", 12000.00, 10.00, "AZ")
)
val customerRows = spark.sparkContext.parallelize(custs, 4)
// convert RDD of tuples to DataFrame by supplying column names
val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")
println("*** Here's the whole DataFrame with duplicates")
customerDF.printSchema()
customerDF.show()
// drop fully identical rows
val withoutDuplicates = customerDF.dropDuplicates()
println("*** Now without duplicates")
withoutDuplicates.show()
val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))
println("*** Now without partial duplicates too")
withoutPartials.show()
}
}
This is my Df contain 4 is repeated twice so here will remove repeated values.
scala> df.show
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
scala> val newdf=df.dropDuplicates
scala> newdf.show
+-----+
|value|
+-----+
| 1|
| 3|
| 5|
| 4|
| 18|
+-----+

Resources