Switch identifiers in Spark to pseudonymize dataset - apache-spark

I have a dataframe like:
+---+-----+
| id|value|
+---+-----+
|  a|    1|
|  a|    2|
|  b|    1|
|  b|    3|
+---+-----+
val df = Seq(("a", 1), ("a", 2), ("b", 1), ("b", 3)).toDF("id", "value")
How can I efficiently switch / rotate the IDs? Note that hashing is not what I want here; I explicitly want to rotate the identifiers. How could this be implemented efficiently in Spark without a self join? Maybe some RDD zipWithIndex?
Note: my intention is to pseudonymize / anonymize the dataset by rotating identifiers. My requirement is to replace each a with another identifier, i.e. possibly b, and every occurrence of a needs to be replaced with the same value.
edit
I had a first suggestion: https://spark.apache.org/docs/latest/ml-features.html#stringindexer but this would change the data types and also not rotate the identifiers, which is something I want to prevent. I need a drop-in, but anonymized / pseudonymized, replacement.
Also, I expect about 8 million (constant) distinct values for ID.

Collecting all distinct elements and building a map by zipping them with a randomly permuted list of those same distinct elements works.
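A minimal Scala sketch of that approach, assuming the ids are strings and the ~8 million distinct values fit into driver memory (the names mappingBc, rotate and pseudonymized are illustrative):
import scala.util.Random
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Collect the distinct ids and zip them with a randomly shuffled copy of themselves.
// Note: a random permutation may map some ids to themselves; reshuffle if that matters.
val distinctIds = df.select("id").distinct().as[String].collect()
val mapping = distinctIds.zip(Random.shuffle(distinctIds.toSeq)).toMap

// Broadcast the mapping and apply it as a drop-in replacement that keeps the original data type.
val mappingBc = spark.sparkContext.broadcast(mapping)
val rotate = udf((id: String) => mappingBc.value(id))
val pseudonymized = df.withColumn("id", rotate(col("id")))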

Related

Spark dataframe slice

I have a table with (millions of) entries along the lines of the following example, read into a Spark dataframe (sdf):
Id  | C1     | C2
----+--------+-------
xx1 | c118   | c219
xx1 | c113   | c218
xx1 | c118   | c214
acb | c121   | c201
e3d | c181   | c221
e3d | c132   | c252
abq | c141   | c290
... | ...    | ...
vy1 | c13023 | C23021
I'd like to get a smaller subset of these Id's for further processing. I identify the unique set of Id's in the table using sdf_id = sdf.select("Id").dropDuplicates().
What is the efficient way from here to filter data (C1, C2) related to, let's say, 100 randomly selected Id's?
There are several ways to achieve what you want.
My sample data
from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])
The initial step is getting the sample of IDs that you want:
ids = df.select('id').distinct().sample(0.2) # 0.2 is 20%, you can adjust this
+---+
| id|
+---+
|  1|
+---+
Approach #1: using inner join
Since you have two dataframes, you can simply perform a single inner join to get all records from df for each id in ids. Note that F.broadcast is there to boost performance, because ids is supposed to be small; feel free to drop it if you want. Performance-wise, this approach is preferred.
df.join(F.broadcast(ids), on=['id'], how='inner').show()
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
Approach #2: using isin
You can't simply pass ids.collect() to isin, because that returns a list of Row objects; you have to loop through it to extract the exact column that you want (id in this case).
df.where(F.col('id').isin([r['id'] for r in ids.collect()])).show()
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
Since you already have the list of unique ids, you can further sample it down to your desired fraction and filter based on that.
There are other ways to sample random ids, which can be found here.
Sampling
### Assuming ~1 million distinct Ids, a fraction of 0.0001 yields roughly 100 of them
sdf_id = [r['Id'] for r in sdf.select("Id").dropDuplicates().sample(0.0001).collect()]
Filter
sdf_filtered = sdf.filter(F.col('Id').isin(sdf_id))

Spark union column order

I've come across something strange recently in Spark. As far as I understand, given the column-based storage model of Spark DataFrames, the order of the columns really doesn't have any meaning; they're like keys in a dictionary.
During a df.union(df2), does the order of the columns matter? I would have assumed that it shouldn't, but according to the wisdom of SQL forums it does.
So we have df1
df1
+---+----+
|  a|   b|
+---+----+
|  1| asd|
|  2|asda|
|  3| f1f|
+---+----+
df2
+----+---+
|   b|  a|
+----+---+
| asd|  1|
|asda|  2|
| f1f|  3|
+----+---+
result
+----+----+
|   a|   b|
+----+----+
|   1| asd|
|   2|asda|
|   3| f1f|
| asd|   1|
|asda|   2|
| f1f|   3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have been appended following the column order of the original dataframes.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
Code to create the test set if anyone wants to try:
import pandas as pd

d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1 = pd.DataFrame(d1)
pdf2 = pd.DataFrame(d2)
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)
test = df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark >= 2.3 you can use unionByName to union two dataframes, with the columns resolved by name.
In Spark, union is not done on the metadata of the columns, and the data is not shuffled around the way you might think. Rather, the union is done by column position: if you are unioning two DataFrames, both must have the same number of columns, and you have to take the positions of your columns into consideration before doing the union. Unlike SQL, Oracle, or other RDBMSs, the underlying files in Spark are physical files. Hope that answers your question.
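A short Scala sketch of both fixes mentioned above (the select-based reordering from the question and unionByName), assuming Spark >= 2.3; df1 and df2 mirror the frames from the question:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df1 = Seq((1, "asd"), (2, "asda"), (3, "f1f")).toDF("a", "b")
val df2 = Seq(("asd", 1), ("asda", 2), ("f1f", 3)).toDF("b", "a")

// union resolves columns by position, so reorder df2's columns to match df1 first.
df1.union(df2.select(df1.columns.map(col): _*)).show()

// Spark >= 2.3: unionByName resolves the columns by name directly.
df1.unionByName(df2).show()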

How to find first non-null values in groups? (secondary sorting using dataset api)

I am working on a dataset which represents a stream of events (like tracking events fired from a website). All the events have a timestamp. One use case we often have is trying to find the first non-null value for a given field. So, for example, something like the following gets us most of the way there:
val eventsDf = spark.read.json(jsonEventsPath)
case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... )
val projectedEventsDs = eventsDf.select(
  eventsDf("message.visit.id").alias("visitId"),
  eventsDf("message.property.user_id").alias("userId"),
  eventsDf("message.property.timestamp"),
  ...
).as[ProjectedFields]
projectedEventsDs.groupBy($"visitId").agg(first($"userId", true))
The problem with the above code is that the order of the data being fed into that first aggregation function is not guaranteed. I would like it to be sorted by timestamp to ensure that it is the first non-null userId by timestamp rather than any random non-null userId.
Is there a way to define the sorting within a grouping?
Using Spark 2.10
BTW, the way suggested for Spark 2.10 in SPARK DataFrame: select the first row of each group is to do the ordering before the grouping -- that doesn't work. For example, the following code:
case class OrderedKeyValue(key: String, value: String, ordering: Int)
val ds = Seq(
  OrderedKeyValue("a", null, 1),
  OrderedKeyValue("a", null, 2),
  OrderedKeyValue("a", "x", 3),
  OrderedKeyValue("a", "y", 4),
  OrderedKeyValue("a", null, 5)
).toDS()
ds.orderBy("ordering").groupBy("key").agg(first("value", true)).collect()
Will sometimes return Array([a,y]) and sometimes Array([a,x])
Use my beloved window functions (...and experience how much simpler your life becomes!)
import org.apache.spark.sql.expressions.Window
val byKeyOrderByOrdering = Window
.partitionBy("key")
.orderBy("ordering")
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
import org.apache.spark.sql.functions.first
val firsts = ds.withColumn("first",
first("value", ignoreNulls = true) over byKeyOrderByOrdering)
scala> firsts.show
+---+-----+--------+-----+
|key|value|ordering|first|
+---+-----+--------+-----+
|  a| null|       1|    x|
|  a| null|       2|    x|
|  a|    x|       3|    x|
|  a|    y|       4|    x|
|  a| null|       5|    x|
+---+-----+--------+-----+
NOTE: Somehow, Spark 2.2.0-SNAPSHOT (built today) could not give me the correct answer without the explicit rangeBetween, which I thought should have been the default unbounded range.
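If you would rather stay with groupBy, an alternative (not from the answer above, just a sketch of the idea) is to drop the nulls and take the minimum of an (ordering, value) struct per key; structs compare field by field, so the smallest struct carries the first non-null value by ordering:
import org.apache.spark.sql.functions.{min, struct}

val firstNonNull = ds
  .filter($"value".isNotNull)
  .groupBy($"key")
  .agg(min(struct($"ordering", $"value")).as("first"))
  .select($"key", $"first.value".as("firstValue"))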

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark sql dataframe, consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id column is unique, whereas, looking at dat1 ... datn, there may be duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this approach is basically correct, but it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found e.g. this SO post. But this would require constructing a huge "string join condition" (a programmatic way to build such a null-safe condition is sketched right after this question).
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3
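For reference, that null-safe join condition does not have to be assembled as one big string; it can be built programmatically from the column list with the null-safe equality operator (<=> in Scala, eqNullSafe in PySpark). A rough Scala sketch, assuming df and dupDf correspond to the frames above:
// Fold the data columns into a single null-safe join condition.
val dataCols = df.columns.drop(1)
val joinCond = dataCols.map(c => df(c) <=> dupDf(c)).reduce(_ && _)

val duplicateIds = df.join(dupDf, joinCond).select(df("id"))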
If the number of ids per group is relatively small you can groupBy and collect_list. Required imports:
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3)
]).toDF(["id"])  # remaining columns keep the default names _2, _3, _4
query:
(df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1))
and the result:
+----+---+----+------+
|  _2| _3|  _4|   ids|
+----+---+----+------+
|null|  f|null|[2, 4]|
|   a|  b|   3|[1, 5]|
+----+---+----+------+
You can apply explode (or use a udf) to get an output equivalent to the one returned by the join.
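A Scala sketch of that step (PySpark's explode behaves the same), with grouped standing in for the aggregated frame from the query above:
import org.apache.spark.sql.functions.explode

// One row per duplicate id together with its data columns -- the same shape as the join-based result.
val duplicates = grouped.withColumn("id", explode($"ids")).drop("ids")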
You can also identify groups using the minimum id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id|  _2| _3|  _4|_cnt|group|
+---+----+---+----+----+-----+
|  2|null|  f|null|   2|    2|
|  4|null|  f|null|   2|    2|
|  1|   a|  b|   3|   2|    1|
|  5|   a|  b|   3|   2|    1|
+---+----+---+----+----+-----+
You can further use the group column for a self join.
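One way that self join could look (a Scala sketch, with annotated standing in for the windowed result above) is to pair up the ids that duplicate each other:
val keyed = annotated.select($"id", $"group")

// Each pair of distinct ids that share a group, i.e. rows that duplicate each other.
val pairs = keyed.as("l")
  .join(keyed.as("r"), $"l.group" === $"r.group" && $"l.id" < $"r.id")
  .select($"l.id".as("id1"), $"r.id".as("id2"))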

How to efficiently rename columns in Datasets (Spark 2.0)

With DataFrames, one can simply rename columns by using df.withColumnRenamed("oldName", "newName"). In Datasets, since every field is typed and named, this doesn't seem possible. The only workaround I can think of is to use map on the Dataset:
case class Orig(a: Int, b: Int)
case class OrigRenamed(a: Int, bNewName: Int)
val origDS = Seq(Orig(1,2), Orig(3,4)).toDS
origDS.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
// To rename with map
val origRenamedDS = origDS.map{ case Orig(x,y) => OrigRenamed(x,y) }
origRenamedDS.show
+---+--------+
|  a|bNewName|
+---+--------+
|  1|       2|
|  3|       4|
+---+--------+
This seems a very round-about and inefficient way just to rename a column. Is there a better way?
A slightly more concise solution would be something like this:
origDS.toDF("a", "bNewName").as[OrigRenamed]
but in practice renaming is simply not meaningful on a statically typed Dataset. While it uses the same columnar representation as a DataFrame (Dataset[Row]), the semantics are completely different here.
The name of a column corresponds to a specific field of the stored objects, so it is not something that can be dynamically renamed. In other words, Datasets are not statically typed DataFrames but collections of objects.
You can make it slightly more concise, while maintaining semantics:
origDS.map(o => OrigRenamed(o.a, o.b)).show()

Resources