Finding the duplicate using Row Number in Pyspark - apache-spark

I have written an SQL query that finds the duplicate elevation values in the table along with the other (unique) columns. Here is my query; I want to convert it into PySpark.
dup_df = spark.sql('''
    SELECT g.pbkey,
           g.lon,
           g.lat,
           g.elevation
    FROM DATA AS g
    INNER JOIN
        (SELECT elevation,
                COUNT(elevation) AS NumOccurrences
         FROM DATA
         GROUP BY elevation
         HAVING (COUNT(elevation) > 1)) AS a ON (a.elevation = g.elevation)
''')

In Scala this can be implemented with a Window; I guess it can be converted to Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._  // for .toDF and $

val data = Seq(1, 2, 3, 4, 5, 7, 3).toDF("elevation")
val elevationWindow = Window.partitionBy("elevation")

data
  .withColumn("elevationCount", count("elevation").over(elevationWindow))
  .where($"elevationCount" > 1)
  .drop("elevationCount")
Output is:
+---------+
|elevation|
+---------+
|3 |
|3 |
+---------+
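A rough PySpark translation of that Window approach (a sketch, assuming the table is available as a DataFrame, e.g. df = spark.table("DATA")):
from pyspark.sql import Window
from pyspark.sql import functions as F

# keep only the rows whose elevation occurs more than once
elevation_window = Window.partitionBy("elevation")
dup_df = (df
          .withColumn("elevationCount", F.count("elevation").over(elevation_window))
          .where(F.col("elevationCount") > 1)
          .drop("elevationCount"))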

Related

how can we force dataframe repartitioning to be balanced in spark?

I created a synthetic dataset and I am trying to experiment with repartitioning based on one column. The objective is to end up with a balanced (equal-size) set of partitions, but I cannot achieve this. Is there a way it could be done, preferably without resorting to RDDs or saving the dataframe?
Example code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
import pandas as pd
import random

spark = SparkSession.builder.appName('learn').getOrCreate()

nr = 500
data = {'id': [random.randint(0, 5) for _ in range(nr)], 'id2': [random.randint(0, 5) for _ in range(nr)]}
data = pd.DataFrame(data)
df = spark.createDataFrame(data)
# df.show()
df = df.repartition(3, 'id')
schema = df.schema

# see the different partitions
for ipart in range(3):
    print(f'partition {ipart}')

    def fpart(partition_idx, iterator, target_partition_idx=ipart):
        # keep only the rows that live in the partition we are inspecting
        if partition_idx == target_partition_idx:
            return iterator
        else:
            return iter(())

    res = df.rdd.mapPartitionsWithIndex(fpart)
    res = res.toDF(schema=schema)
    # res.show(n=5, truncate=False)
    print(f"number of rows {res.count()}, unique ids {res.select('id').drop_duplicates().toPandas()['id'].tolist()}")
It produces:
partition 0
number of rows 79, unique ids [3]
partition 1
number of rows 82, unique ids [0]
partition 2
number of rows 339, unique ids [5, 1, 2, 4]
so the partitions are clearly not balanced.
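As an aside, a shorter way to get the same per-partition summary than the mapPartitionsWithIndex loop above is to tag each row with spark_partition_id(); a minimal sketch:
df.groupBy(f.spark_partition_id().alias('partition')) \
  .agg(f.count('*').alias('rows'), f.collect_set('id').alias('unique_ids')) \
  .show()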
I saw in How to guarantee repartitioning in Spark Dataframe that this is explainable because assigning to partitions is based on the hash of column id modulo 3 (the number of partitions):
df.select('id', f.expr("hash(id)"), f.expr("pmod(hash(id), 3)")).drop_duplicates().show()
that produces
+---+-----------+-----------------+
| id| hash(id)|pmod(hash(id), 3)|
+---+-----------+-----------------+
| 3| 519220707| 0|
| 0|-1670924195| 1|
| 1|-1712319331| 2|
| 5| 1607884268| 2|
| 4| 1344313940| 2|
| 2| -797927272| 2|
+---+-----------+-----------------+
but I find this strange. The point of specifying the column in the repartition function is to split the values of id across different partitions. If the column id had more than 6 unique values in this example it would work better, but still.
Is there a way to achieve this?
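One possible workaround (not from the original thread, and a sketch rather than a guaranteed fix) is to switch from hash partitioning to range partitioning with repartitionByRange (Spark 2.4+). It splits the sampled value range of id into contiguous buckets, so for a roughly uniform id column like this one it tends to give much more even partition sizes; the per-partition counts can then be checked the same way as above.
# range partitioning buckets contiguous id ranges instead of hashing, so the 6 ids spread over 3 partitions
df_balanced = df.repartitionByRange(3, 'id')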

Replace null values with other Dataframe in PySpark

I have some data about products (DF), but some rows don't have a description. I have an Excel file with the descriptions of some of them (loaded as Map). I would now like to fill the missing values in DF with those from Map, keeping the rows that already have a description untouched, using PySpark.
DF
Id | Desc
01 | 'desc1'
02 | null
03 | 'desc3'
04 | null
Map
Key | Value
2 | 'desc2'
4 | 'desc4'
Output
Id | Desc
1 | 'desc1'
2 | 'desc2'
3 | 'desc3'
4 | 'desc4'
Thanks in advance
You'll want to make sure the DF.Id field and the Map.Key field have the same type/values (currently they don't look like it, given the leading 0), then do a left join, and then select the desired columns with coalesce(). My PySpark is a bit rusty, so I'll provide the solution in Scala; the logic should be the same.
val df = Seq(
  (1, "desc1"),
  (2, null),
  (3, "desc3"),
  (4, null)
).toDF("Id", "Desc")

val map = Seq(
  (2, "desc2"),
  (4, "desc4")
).toDF("Key", "Value")

df.show()
map.show()

df.join(map, df("Id") === map("Key"), "left")
  .select(
    df("Id"),
    coalesce(df("Desc"), $"Value").as("Desc")
  )
  .show()
Yields:
+---+-----+
| Id| Desc|
+---+-----+
| 1|desc1|
| 2| null|
| 3|desc3|
| 4| null|
+---+-----+
+---+-----+
|Key|Value|
+---+-----+
| 2|desc2|
| 4|desc4|
+---+-----+
+---+-----+
| Id| Desc|
+---+-----+
| 1|desc1|
| 2|desc2|
| 3|desc3|
| 4|desc4|
+---+-----+
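For reference, a rough PySpark equivalent of that join-and-coalesce approach (a sketch, assuming the PySpark DataFrames are named df and map_df and the Id/Key types already match):
from pyspark.sql import functions as F

# left join keeps every row of df; coalesce takes the existing Desc, falling back to the mapped Value
result = (df.join(map_df, df["Id"] == map_df["Key"], "left")
            .select(df["Id"], F.coalesce(df["Desc"], map_df["Value"]).alias("Desc")))
result.show()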
In PySpark, with the help of a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("Index", IntegerType(), True),
                     StructField("Desc", StringType(), True)])
DF = sc.parallelize([(1, "desc1"), (2, None), (3, "desc3"), (4, None)]).toDF(schema)

myMap = {
    2: "desc2",
    4: "desc4"
}
myMapBroadcasted = sc.broadcast(myMap)

@udf(StringType())
def fillNone(Index, Desc):
    # fall back to the broadcast lookup only when the description is missing
    if Desc is None:
        if Index in myMapBroadcasted.value:
            return myMapBroadcasted.value[Index]
    return Desc

DF.withColumn('Desc', fillNone(col('Index'), col('Desc'))).show()
It's hard to know the cardinality of the datasets that you've provided... some examples of how that might change a solution here are:
If "DF" and "Map" have overlapping Desc... how should we prioritize which table has the "right" description?
Does the final dataframe that you are looking to create need to be fully inclusive of a list of ID's or descriptions? Do either of these dataframes have the full list? This could also change the solution.
I've made some assumptions so that you can determine for yourself what is the right approach here:
I'm assuming that "DF" contains the whole list of IDs
I'm assuming that "Map" only has a subset of IDs and is not wholly inclusive of the broader set of IDs that exist within "DF"
I'm using PySpark here (assuming Map's key column has been renamed to match DF's id column, e.g. Map = Map.withColumnRenamed("Key", "Id")):
DF = DF.na.drop()  # we'll eliminate the rows with missing descriptions from the parent dataframe
DF_Output = DF.join(Map, on="Id", how="outer")  # the outer join brings the dropped Ids back in with their Value from Map
We can divide DF into two dataframes, operate on them separately, and then union them:
val df = Seq(
  (1, "desc1"),
  (2, null),
  (3, "desc3"),
  (4, null)
).toDF("Id", "Desc")

val Map = Seq(
  (2, "desc2"),
  (4, "desc4")
).toDF("Key", "Value")

val nullDF = df.where(df("Desc").isNull)
val nonNullDF = df.where(df("Desc").isNotNull)

val joinedWithKeyDF = nullDF.drop("Desc")
  .join(Map, nullDF("Id") === Map("Key"))
  .withColumnRenamed("Value", "Desc")
  .drop("Key")

val outputDF = joinedWithKeyDF.union(nonNullDF)

How to create an unique autogenerated Id column in a spark dataframe

I have a dataframe where I have to generate a unique Id in one of the columns. This id has to be generated with an offset.
I need to persist this dataframe with the autogenerated id, so if new data comes in later, the newly generated ids should not collide with the existing ones.
I checked the monotonically_increasing_id function, but it does not accept any offset.
This is what I tried:
df=df.coalesce(1);
df = df.withColumn(inputCol,functions.monotonically_increasing_id());
But is there a way to make monotonically_increasing_id() start from a given offset?
You can simply add to it to provide a minimum value for the id. Note that it is not guaranteed that the values will start exactly at the minimum value:
.withColumn("id", monotonically_increasing_id() + 123)
Explanation: Operator + is overloaded for columns https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L642
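In PySpark the same trick would look roughly like this (a sketch; 123 is just an example offset):
from pyspark.sql import functions as F

# shift the generated ids so they start at (at least) the chosen offset
df = df.withColumn("id", F.monotonically_increasing_id() + 123)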
Or, if you don't want to restrict your program to only one partition with df.coalesce(1), you can use zipWithIndex, which starts with index = 0, as follows:
lines = [["a1", "a2", "a3"],
["b1", "b2", "b3"],
["c1", "c2", "c3"]]
cols = ["c1", "c2", "c3"]
df = spark.createDataFrame(lines, cols)
start_indx = 10
df = df.rdd.zipWithIndex() \
.map(lambda (r, indx): (indx + start_indx, r[0], r[1], r[2])) \
.toDF(["id", "c1", "c2", "c3"])
df.show(10, False)
In this case I set start_indx = 10, and this will be the output:
+---+---+---+---+
|id |c1 |c2 |c3 |
+---+---+---+---+
|10 |a1 |a2 |a3 |
|11 |b1 |b2 |b3 |
|12 |c1 |c2 |c3 |
+---+---+---+---+
You could add a row number to your columns and then add that to the maximum existing identity value, or to your offset. Once the id is set, drop the row_number attribute.
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

# Could also grab the existing max ID value here instead of a fixed offset
seed_value = 123

df = df.withColumn("row_number", sf.row_number().over(Window.partitionBy(sf.col("natural_key")).orderBy(sf.col("anything"))))
df = df.withColumn("id", sf.col("row_number") + seed_value)
Remember to drop the row_number attribute.
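A sketch of those remaining steps, assuming the already-persisted rows live in a DataFrame called existing_df (a hypothetical name) with an id column:
# use the current maximum id as the seed (0 if the table is empty), then drop the helper column
seed_value = existing_df.agg(sf.coalesce(sf.max("id"), sf.lit(0)).alias("max_id")).collect()[0]["max_id"]
df = df.withColumn("id", sf.col("row_number") + seed_value).drop("row_number")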

Iterate cols PySpark

I have a SQL table containing 40 columns (ID, Product, Product_ID, Date, etc.) and would like to iterate over all columns to get their distinct values.
Customer table (sample):
ID Product
1 gadget
2 VR
2 AR
3 hi-fi
I have tried using dropDuplicates within a function that loops over all columns, but the resulting output only spits out one distinct value per column instead of all possible distinct values.
Expected Result:
Column Value
ID 1
ID 2
ID 3
Product gadget
Product VR
Product AR
Product hi-fi
Actual Result:
Column Value
ID 1
Product gadget
The idea is to use collect_set() to fetch the distinct elements of a column and then explode the dataframe.
from pyspark.sql.functions import collect_set

# All columns which need to be aggregated should be added here in col_list.
col_list = ['ID','Product']
exprs = [collect_set(x) for x in col_list]
Let's start aggregating.
from pyspark.sql.functions import lit , collect_set, explode, array, struct, col, substring, length, expr
df = spark.createDataFrame([(1,'gadget'),(2,'VR'),(2,'AR'),(3,'hi-fi')], schema = ['ID','Product'])
df = df.withColumn('Dummy',lit('Dummy'))
#While exploding later, the datatypes must be the same, so we have to cast ID as a String.
df = df.withColumn('ID',col('ID').cast('string'))
#Creating the list of distinct values.
df = df.groupby("Dummy").agg(*exprs)
df.show(truncate=False)
+-----+---------------+-----------------------+
|Dummy|collect_set(ID)|collect_set(Product) |
+-----+---------------+-----------------------+
|Dummy|[3, 1, 2] |[AR, VR, hi-fi, gadget]|
+-----+---------------+-----------------------+
def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ['Dummy']).drop('Dummy')
df.show()
+--------------------+--------------------+
| key| val|
+--------------------+--------------------+
| collect_set(ID)| [3, 1, 2]|
|collect_set(Product)|[AR, VR, hi-fi, g...|
+--------------------+--------------------+
df = df.withColumn('val', explode(col('val')))
df = df.withColumnRenamed('key', 'Column').withColumnRenamed('val', 'Value')
df = df.withColumn('Column', expr("substring(Column,13,length(Column)-13)"))
df.show()
+-------+------+
| Column| Value|
+-------+------+
| ID| 3|
| ID| 1|
| ID| 2|
|Product| AR|
|Product| VR|
|Product| hi-fi|
|Product|gadget|
+-------+------+
Note: all columns that are not strings should be cast to string, e.g. df = df.withColumn('ID', col('ID').cast('string')). Otherwise, you will get an error.

Pyspark parallelized loop of dataframe column

I have a raw PySpark DataFrame with a nested (encapsulated) column. I need to loop over all columns to unwrap them. I don't know the column names and they could change, so I need a generic algorithm. The problem is that I can't use a classic (for) loop, because I need parallelized code.
Example of Data:
Timestamp | Layers
1456982 | [[1, 2],[3,4]]
1486542 | [[3,5], [5,5]]
Layers is a column which contains other columns (with their own column names). My goal is to have something like this:
Timestamp | label | number1 | text | value
1456982   | 1     | 2       | 3    | 4
1486542   | 3     | 5       | 5    | 5
How can I loop over the columns with PySpark functions?
Thanks for any advice.
You can use the reduce function for this. I don't know exactly what you want to do, but let's suppose you want to add 1 to all columns:
from functools import reduce
from pyspark.sql import functions as F

def add_1(df, col_name):
    return df.withColumn(col_name, F.col(col_name) + 1)  # using the same column name will update the column

reduce(add_1, df.columns, df)
Edit:
I am not sure how to solve it without converting to an RDD. Maybe this can be helpful:
from pyspark.sql import Row

# flatten a column holding a list of lists into one flat list
flatF = lambda col: [item for l in col for item in l]

# col_names is assumed to hold the names for the unwrapped values
df \
    .rdd \
    .map(lambda row: Row(timestamp=row['timestamp'],
                         **dict(zip(col_names, flatF(row['layers']))))) \
    .toDF()
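For concreteness, a self-contained sketch of how that could be wired up against the sample data (the column names label/number1/text/value are assumptions taken from the expected output, not known to the code):
from pyspark.sql import Row

data = [(1456982, [[1, 2], [3, 4]]), (1486542, [[3, 5], [5, 5]])]
df = spark.createDataFrame(data, ["timestamp", "layers"])

col_names = ["label", "number1", "text", "value"]  # assumed target column names
flatF = lambda col: [item for l in col for item in l]

result = df.rdd.map(lambda row: Row(timestamp=row["timestamp"],
                                    **dict(zip(col_names, flatF(row["layers"]))))).toDF()
result.show()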
