PySpark: concatenate two Spark DataFrames side by side without a join, efficiently

Hi, I have a sparse DataFrame that was loaded with the mergeSchema option:
DF
name  A1    A2    B1    B2   .....  partitioned_name
A     1     1     null  null        partition_a
B     2     2     null  null        partition_a
A     null  null  3     4           partition_b
B     null  null  3     4           partition_b
which I want to transform to:
DF
name  A1  A2  B1  B2  .....
A     1   1   3   4
B     2   2   3   4
Any good ideas that avoid a join, for efficiency (and avoid RDDs, because the data is huge)? I was thinking of something like pandas concat(axis=1), since all the tables are sorted.

If that pattern repeats and you don't mind hardcoding the column names:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ('A', '1', '1', 'null', 'null', 'partition_a'),
        ('B', '2', '2', 'null', 'null', 'partition_a'),
        ('A', 'null', 'null', '3', '4', 'partition_b'),
        ('B', 'null', 'null', '3', '4', 'partition_b')
    ],
    ['name', 'A1', 'A2', 'B1', 'B2', 'partitioned_name']
)\
    .withColumn('A1', F.col('A1').cast('integer'))\
    .withColumn('A2', F.col('A2').cast('integer'))\
    .withColumn('B1', F.col('B1').cast('integer'))\
    .withColumn('B2', F.col('B2').cast('integer'))
df.show()

# value columns to aggregate (everything except the key and partition columns)
cols_to_agg = [c for c in df.columns if c not in ["name", "partitioned_name"]]

df\
    .groupby('name')\
    .agg(F.sum('A1').alias('A1'),
         F.sum('A2').alias('A2'),
         F.sum('B1').alias('B1'),
         F.sum('B2').alias('B2'))\
    .show()
# +----+----+----+----+----+----------------+
# |name| A1| A2| B1| B2|partitioned_name|
# +----+----+----+----+----+----------------+
# | A| 1| 1|null|null| partition_a|
# | B| 2| 2|null|null| partition_a|
# | A|null|null| 3| 4| partition_b|
# | B|null|null| 3| 4| partition_b|
# +----+----+----+----+----+----------------+
# +----+---+---+---+---+
# |name| A1| A2| B1| B2|
# +----+---+---+---+---+
# | A| 1| 1| 3| 4|
# | B| 2| 2| 3| 4|
# +----+---+---+---+---+
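If there are many A*/B* columns, the same idea generalizes without hardcoding each alias, using the cols_to_agg list computed above (a minimal sketch, not from the original answer; F.first(c, ignorenulls=True) would work just as well here, since each group holds at most one non-null value per column):
import pyspark.sql.functions as F

cols_to_agg = [c for c in df.columns if c not in ["name", "partitioned_name"]]

# one null-ignoring aggregate per value column; sum is safe because each
# (name, column) pair has at most one non-null value across the partitions
df.groupby('name')\
    .agg(*[F.sum(c).alias(c) for c in cols_to_agg])\
    .show()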

Related

How to set the value of a Pyspark column based on two conditions of the value of another column

Say I have a dataframe:
+---+---+---+
| id|foo|bar|
+---+---+---+
|  1|baz|  0|
|  2|baz|  0|
|  3|333|  2|
|  4|444|  1|
+---+---+---+
I want to set the 'foo' column to a value depending on the value of bar.
If bar is 2: set the value of foo for that row to 'X',
else if bar is 1: set the value of foo for that row to 'Y'
And if neither condition is met, leave the foo value as it is.
pyspark's when seems like the closest method, but it doesn't seem to work based on another column's value.
when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'foo',
    F.when(F.col('bar') == 2, 'X')
    .when(F.col('bar') == 1, 'Y')
    .otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
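For reference, the same conditional logic can also be written as a SQL CASE expression through F.expr (a minimal sketch, assuming the same df as above):
import pyspark.sql.functions as F

# equivalent CASE expression; rows matching neither condition keep their foo value
df2 = df.withColumn(
    'foo',
    F.expr("CASE WHEN bar = 2 THEN 'X' WHEN bar = 1 THEN 'Y' ELSE foo END")
)
df2.show()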
We can solve this using when or a UDF in Spark to set the new column value based on a condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using When:
from pyspark.sql.functions import when
df2 = df.withColumn("foo", when(df.bar == 2,"X")
.when(df.bar == 1,"Y")
.otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def executeRule(value):
    if value == 2:
        return 'X'
    elif value == 1:
        return 'Y'
    else:
        return value

# Converting the function to a UDF
ruleUDF = F.udf(executeRule, StringType())

df3 = df.withColumn("foo", ruleUDF("bar"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1| 0| 0|
| 2| 0| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
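Note that in the UDF output above, rows where no rule matched show 0 rather than baz: the UDF only receives bar, so its else branch returns bar's value (cast to a string). A possible fix, sketched here with hypothetical names (executeRuleKeepFoo, ruleKeepFooUDF), is to pass both columns so unmatched rows keep foo:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def executeRuleKeepFoo(foo, bar):
    # return the replacement when a rule matches, otherwise the original foo
    if bar == 2:
        return 'X'
    elif bar == 1:
        return 'Y'
    return foo

ruleKeepFooUDF = F.udf(executeRuleKeepFoo, StringType())

df4 = df.withColumn("foo", ruleKeepFooUDF("foo", "bar"))
df4.show()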

Pyspark: Dropping columns with no distinct values only using transformations [duplicate]

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count = 1 has only one distinct value across all rows.
from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# select the cols with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

# drop the selected columns
df.drop(*cols_to_drop).show()
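One subtlety worth noting (my addition, not part of the answer): countDistinct ignores nulls, so a column that is entirely null gets a count of 0, not 1, and would survive the == 1 test. If all-null columns should be dropped as well, widen the test, reusing col_counts from the snippet above:
# also catch all-null columns, whose non-null distinct count is 0
cols_to_drop = [c for c in df.columns if col_counts[c] <= 1]
df.drop(*cols_to_drop).show()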
You can use the approx_count_distinct function to count the number of distinct elements in a column. If there is just one distinct value, remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting the result into a dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# Store the columns that have just 1 distinct value in a list.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns that have only one distinct value:
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+
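The same idea can be written more compactly by collecting the single aggregation row into a dict and dropping in one pass (a sketch, starting again from the original df and the approx_count_distinct import above):
# one aggregation row: column name -> approximate distinct count
counts = df.agg(*[approx_count_distinct(c).alias(c) for c in df.columns]).first().asDict()

# drop every column whose (approximate) distinct count is 1
df = df.drop(*[c for c, n in counts.items() if n == 1])
df.show()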

How to convert PySpark pipeline rdd (tuple inside tuple) into Data Frame?

I have a PySpark pipeline RDD like below:
(1, ([1,2,3,4], [5,3,4,5]))
(2, ([1,2,4,5], [4,5,6,7]))
I want to generate a DataFrame like below:
Id  sid  cid
1   1    5
1   2    3
1   3    4
1   4    5
2   1    4
2   2    5
2   4    6
2   5    7
Please help me on this.
If you have an RDD like this one,
rdd = sc.parallelize([
    (1, ([1,2,3,4], [5,3,4,5])),
    (2, ([1,2,4,5], [4,5,6,7]))
])
I would just use RDDs:
rdd.flatMap(lambda rec:
    ((rec[0], sid, cid) for sid, cid in zip(rec[1][0], rec[1][1]))
).toDF(["id", "sid", "cid"]).show()
# +---+---+---+
# | id|sid|cid|
# +---+---+---+
# | 1| 1| 5|
# | 1| 2| 3|
# | 1| 3| 4|
# | 1| 4| 5|
# | 2| 1| 4|
# | 2| 2| 5|
# | 2| 4| 6|
# | 2| 5| 7|
# +---+---+---+
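If you prefer to control the resulting column types instead of relying on toDF's inference, the same flatMap can feed createDataFrame with an explicit schema (a sketch; the LongType fields are an assumption based on the sample data):
from pyspark.sql.types import StructType, StructField, LongType

# an explicit schema avoids the sampling pass Spark would otherwise run to infer types
schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("sid", LongType(), nullable=False),
    StructField("cid", LongType(), nullable=False),
])

flat = rdd.flatMap(
    lambda rec: [(rec[0], sid, cid) for sid, cid in zip(rec[1][0], rec[1][1])]
)
spark.createDataFrame(flat, schema).show()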

PySpark : change column names of a df based on relations defined in another df

I have two Spark DataFrames loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name  old_name
A         aa
B         bb
C         cc
and
aa  bb  cc  dd
1   2   3   43
12  21  4   37
to be transformed into:
A   B   C   D
1   2   3
12  21  4
Since dd didn't have any mapping in the original table, the D column should have all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (this would mean I have to collect the mapping_fields and check, which kind of contradicts my use-case of distributedly handling all the datasets)
Thanks!
With melt borrowed from here (a minimal version of that helper is sketched after the output below), you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))
df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
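For completeness, melt is not a built-in PySpark function; the snippet above assumes the commonly shared helper, which looks roughly like this (a sketch of that recipe):
from pyspark.sql import functions as f

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    """Unpivot value_vars into (var_name, value_name) rows, keeping id_vars."""
    # one struct per unpivoted column: (column name, column value)
    vars_and_vals = f.array(*(
        f.struct(f.lit(c).alias(var_name), f.col(c).alias(value_name))
        for c in value_vars))
    tmp = df.withColumn("_vars_and_vals", f.explode(vars_and_vals))
    cols = id_vars + [f.col("_vars_and_vals")[x].alias(x)
                      for x in [var_name, value_name]]
    return tmp.select(*cols)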
but nothing in your description justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore unmatched columns:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the missing columns with null values:
cols = df2.columns
for i in cols:
    val = df1.where(df1['old_name'] == i).first()
    if val is not None:
        df2 = df2.withColumnRenamed(i, val['new_name'])
    else:
        df2 = df2.withColumn(i, F.lit(None))
df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part to:
    else:
        df2 = df2.drop(i)
df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
Note that this overwrites the original df2 DataFrame, though.

How to aggregate on one column and take maximum of others in pyspark?

I have columns X (string), Y (string), and Z (float).
And I want to
aggregate on X
take the maximum of column Z
report ALL the values for columns X, Y, and Z
If there are multiple values for column Y that correspond to the maximum for column Z, then take the maximum of those values in column Y.
For example, my table (table1) is like:
col X  col Y  col Z
A      1      5
A      2      10
A      3      10
B      5      15
resulting in:
A      3      10
B      5      15
If I were using SQL, I would do it like this:
select X, Y, Z
from table1
join (select max(Z) as max_Z from table1 group by X) table2
on table1.Z = table2.max_Z
However, how do I do this when (1) column Z is a float, and (2) I'm using PySpark SQL?
The following two solutions are in Scala, but I honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.
The only question is which structured query is more performant/effective.
Window Aggregate Function: rank
val df = Seq(
  ("A", 1, 5),
  ("A", 2, 10),
  ("A", 3, 10),
  ("B", 5, 15)
).toDF("x", "y", "z")
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
// describe the window specification: best z first, ties broken by best y
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc, $"y".desc)
// use rank to calculate the best X
scala> df.withColumn("rank", rank over byX)
.select("x", "y", "z")
.where($"rank" === 1) // <-- take the first row
.orderBy("x")
.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and dropDuplicates
I've always been thinking about the alternatives to rank function and first usually sprung to mind.
// use first and dropDuplicates
scala> df.
  withColumn("y", first("y") over byX).
  withColumn("z", first("z") over byX).
  dropDuplicates.
  orderBy("x").
  show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
You can consider using a Window function. My approach here is to create a window specification that partitions the DataFrame by X and orders rows by Z descending, breaking ties with Y descending.
We can then simply select rank == 1 for the rows we're interested in.
Or we can use first and drop_duplicates to achieve the same task.
PS. Thanks Jacek Laskowski for the comments and Scala solution that leads to this solution.
Create toy example dataset
from pyspark.sql.window import Window
import pyspark.sql.functions as func

data = [('A', 1, 5),
        ('A', 2, 10),
        ('A', 3, 10),
        ('B', 5, 15)]
df = spark.createDataFrame(data, schema=['X', 'Y', 'Z'])
Window Aggregate Function: rank
Apply windows function with rank function
w = Window.partitionBy(df['X']).orderBy(func.col('Z').desc(), func.col('Y').desc())
df_max = df.select('X', 'Y', 'Z', func.rank().over(w).alias("rank"))
df_final = df_max.where(func.col('rank') == 1).select('X', 'Y', 'Z').orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and drop_duplicates
This task can also be achieved by using first and drop_duplicates as follows
df_final = df.select('X', func.first('Y').over(w).alias('Y'), func.first('Z').over(w).alias('Z'))\
    .drop_duplicates()\
    .orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Let's create a DataFrame from your sample data:
data = [('A', 1, 5),
        ('A', 2, 10),
        ('A', 3, 10),
        ('B', 5, 15)]
df = spark.createDataFrame(data, schema=['X', 'Y', 'Z'])
df.show()
output:
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
from pyspark.sql.functions import col

# create an intermediate dataframe that finds the max of Z per X
df1 = df.groupby('X').max('Z').toDF('X2', 'max_Z')

# create a 2nd intermediate dataframe that finds the max of Y where Z = max of Z
df2 = df.join(df1, df.X == df1.X2)\
    .where(col('Z') == col('max_Z'))\
    .groupBy('X')\
    .max('Y').toDF('X', 'max_Y')

# join the above two to form the final result
result = df1.join(df2, df1.X2 == df2.X)\
    .select('X', 'max_Y', 'max_Z')\
    .orderBy('X')
result.show()
+---+-----+-----+
| X|max_Y|max_Z|
+---+-----+-----+
| A| 3| 10|
| B| 5| 15|
+---+-----+-----+
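Another common trick for this argmax-style pattern, avoiding both a window and an extra join, is to take the max of a struct whose first field is Z: structs compare field by field, so ties on Z fall back to the larger Y. A sketch in PySpark, assuming the same df as above:
import pyspark.sql.functions as F

# max over struct(Z, Y) keeps the largest Z, breaking ties by the largest Y
result = (df.groupBy('X')
            .agg(F.max(F.struct('Z', 'Y')).alias('m'))
            .select('X', F.col('m.Y').alias('Y'), F.col('m.Z').alias('Z'))
            .orderBy('X'))
result.show()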
