PySpark: Subtract Dataframe Ignoring Some Columns - apache-spark

I want to perform a subtract between 2 dataframes in PySpark. The challenge is that I have to ignore some columns while subtracting, but the end dataframe should still have all the columns, including the ignored ones.
Here is an example:
from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent#email.com',
        date1=u'2017-02-08'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace#email.com',
        date1=u'2017-02-09'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh#email.com',
        date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace#email.com',
        date1=u'2017-02-11'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh#email.com',
        date1=u'2017-02-12')
]).toDF()
Expected:
ActiveDF = userLeft.subtract(userRight)  # ignore the "date1" column while subtracting
The end result should look something like this, including the "date1" column:
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent#email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+

It seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent#email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+

You can also use a full join and keep only the rows where one of the date1 columns is null:
import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh#email.com| null| 3| hh|2017-02-10| null|
|marge.hh#email.com| null| 3| hh| null|2017-02-12|
| s.kent#email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use joins, whether leftanti or full, you'll need to substitute default values for the nulls in your joining columns (I think we discussed this in a previous thread).
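For illustration, here is a minimal sketch of that idea, assuming a sentinel value such as "NA" never appears in the real data:
# hypothetical sketch: fill nulls in the join columns with a sentinel before the anti-join,
# so rows with a null first_name can still be matched
join_cols = [c for c in userLeft.columns if c != "date1"]
left_filled = userLeft.fillna("NA", subset=join_cols)
right_filled = userRight.fillna("NA", subset=join_cols)
left_filled.join(right_filled, join_cols, "leftanti").show()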
You can also just drop the column that bothers you, subtract, and then join back:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent#email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+

Related

Using Spark, divide a dataset into 2 based on the sum of a column value

My Spark dataset is shown below. I want to divide it into 2 datasets such that the sums of their value columns are the same, or almost the same.
id,value
1,20
2,30
3,50
4,10
5,20
6,10
I want to divide this into 2 datasets such that the first dataset looks like this (sum of value is 70):
id,value
1,20
2,30
4,10
6,10
Second dataset (sum of value is 70):
id,value
3,50
5,20
The idea is that the 2 datasets should contain approximately equal sums of the value column.
Any help? It looks like a complicated one.
Split the dataframe using the average. It's not as good as your answer, but should give a good enough solution for a general case.
import pyspark.sql.functions as F

mean = df.agg(F.avg('value')).collect()[0][0]
df1 = df.filter(F.col('value') > mean)
df2 = df.filter(F.col('value') <= mean)  # <= so rows exactly equal to the mean are not dropped
>>> df1.show()
+---+-----+
| id|value|
+---+-----+
| 2| 30|
| 3| 50|
+---+-----+
>>> df2.show()
+---+-----+
| id|value|
+---+-----+
| 1| 20|
| 4| 10|
| 5| 20|
| 6| 10|
+---+-----+
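If you need the two sums to be closer, here is a hedged sketch of an alternating-assignment heuristic (not part of the answer above): sort by value in descending order and send alternating rows to each half, which usually balances the sums better than a single mean cut-off.
import pyspark.sql.functions as F
from pyspark.sql import Window

# rank rows by descending value, then alternate them between the two halves
w = Window.orderBy(F.col('value').desc())
ranked = df.withColumn('rn', F.row_number().over(w))
df1 = ranked.filter(F.col('rn') % 2 == 1).drop('rn')
df2 = ranked.filter(F.col('rn') % 2 == 0).drop('rn')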

Python Spark join two dataframes and fill column

I have two dataframes that need to be joined in a particular way that I am struggling with.
dataframe 1:
+--------------------+---------+----------------+
| asset_domain| eid| oid|
+--------------------+---------+----------------+
| test-domain...| 126656| 126656|
| nebraska.aaa.com| 335660| 335660|
| netflix.com| 460| 460|
+--------------------+---------+----------------+
dataframe 2:
+--------------------+--------------------+---------+--------------+----+----+------------+
| asset| asset_domain|dns_count| ip| ev|post|form_present|
+--------------------+--------------------+---------+--------------+----+----+------------+
| sub1.test-domain...| test-domain...| 6354| 11.11.111.111| 1| 1| null|
| netflix.com| netflix.com| 3836| 22.22.222.222|null|null| null|
+--------------------+--------------------+---------+--------------+----+----+------------+
desired result:
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| asset|dns_count| ip| ev|post|form_present| eid| oid|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
| sub1.test-domain...| 5924|111.11.111.11| 1| 1| null| 126656| 126656|
| nebraska.aaa.com| null| null|null|null| null| 335660| 335660|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
Basically – it should join df1 and df2 on asset_domain but if that doesn't exist in df2, then the resulting asset should be the asset_domain from df1.
I tried df = df2.join(df1, ["asset_domain"], "right").drop("asset_domain") but that obviously leaves null in the asset column for nebraska.aaa.com since it does not have a matching domain in df2. How do I go about adding those to the asset column for this particular case?
You can use the coalesce function after the join to create the asset column:
from pyspark.sql.functions import coalesce

df2.join(df1, ["asset_domain"], "right") \
    .select(coalesce("asset", "asset_domain").alias("asset"),
            "dns_count", "ip", "ev", "post", "form_present", "eid", "oid") \
    .orderBy("asset").show()
#+----------------+---------+-------------+----+----+------------+------+------+
#| asset|dns_count| ip| ev|post|form_present| eid| oid|
#+----------------+---------+-------------+----+----+------------+------+------+
#|nebraska.aaa.com| null| null|null|null| null|335660|335660|
#| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
#|sub1.test-domain| 6354|11.11.111.111| 1| 1| null|126656|126656|
#+----------------+---------+-------------+----+----+------------+------+------+
After the join, you can use the isNull() function:
import pyspark.sql.functions as F
tst1 = sqlContext.createDataFrame([('netflix',1),('amazon',2)],schema=("asset_domain",'xtra1'))
tst2= sqlContext.createDataFrame([('netflix','yahoo',1),('amazon','yahoo',2),('flipkart',None,2)],schema=("asset_domain","asset",'xtra'))
tst_j = tst1.join(tst2,on='asset_domain',how='right')
#%%
tst_res = tst_j.withColumn("asset",F.when(F.col('asset').isNull(),F.col('asset_domain')).otherwise(F.col('asset')))
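A quick check on the toy data (assuming the snippet above is run as-is): flipkart had a null asset, so after the fill its asset should now equal its asset_domain.
# hypothetical check: the filled asset column should mirror asset_domain for the unmatched row
tst_res.select("asset_domain", "asset").show()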

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
However, you do seem to have a keyColumn and a sortKey, so I'd suggest doing the following:
import pyspark
import pyspark.sql.functions as f
w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
    withColumn("rn_asc", f.row_number().over(w_asc)). \
    withColumn("rn_desc", f.row_number().over(w_desc)). \
    where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
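If you only need the rows themselves, a small follow-up (reusing the same helper column names) can split them out and drop the markers:
# split out the boundary rows and drop the helper columns
first_rows = res_df.filter("rn_asc = 1").drop("rn_asc", "rn_desc")
last_rows = res_df.filter("rn_desc = 1").drop("rn_asc", "rn_desc")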
Scala: I think repartition here is not by a key column; it requires an integer for how many partitions you want to set. I found a way to select the first and last row by using Spark's Window function.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then I use the Window function twice, because I cannot easily find the last row directly, but the reverse order makes it easy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, when}

val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")
df.withColumn("marker", when(rank.over(a) === 1, "Y").otherwise("N"))
  .withColumn("marker", when(rank.over(d) === 1, "Y").otherwise(col("marker")))
  .filter(col("marker") === "Y")
  .drop("marker").show
The final result is then,
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach using mapPartitions from the RDD API. We iterate over the elements of each partition until we reach the end. I would expect this iteration to be very fast since we merely skip past all the elements of the partition except the two edges. Here is the code:
df = spark.createDataFrame([
    ["Tom", "a"],
    ["Dick", "b"],
    ["Harry", "c"],
    ["Elvis", "d"],
    ["Elton", "e"],
    ["Sandra", "f"]
], ["name", "toy"])
def get_first_last(it):
    try:
        first = last = next(it)
    except StopIteration:
        return []  # guard against empty partitions
    for last in it:
        pass
    # Attention: if first equals last by reference return only one!
    if first is last:
        return [first]
    return [first, last]
# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)
spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()
# +------+---+
# | name|toy|
# +------+---+
# | Tom| a|
# | Harry| c|
# | Elvis| d|
# |Sandra| f|
# +------+---+
PS: Odd positions will contain the first element of a partition and even positions the last one. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the new createDataFrame call.

Count of rows containing null values in pyspark

Consider a pyspark dataframe for example
columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0),(None, 0, 1),(5,None,9)]
df=spark.createDataFrame(vals,columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
|null| 0| 1|
| 5|null| 9|
+----+----+----+
I want to write code that returns 2 as the number of rows containing null values.
df.subtract(df.dropna()).count()
The df.dropna() returns a new dataframe where any row containing a null is removed; this dataframe is then subtracted (the equivalent of SQL EXCEPT) from the original dataframe to keep only the rows with nulls in them.
This is obviously not as pretty as if you were only looking at a single column, but this is the simplest way I know to do this when all columns are involved.
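For comparison, here is a hedged alternative sketch that filters on a combined null check across all columns (using the column names from the example above):
from functools import reduce
import pyspark.sql.functions as F

# build one boolean expression that is true if any column in the row is null
any_null = reduce(lambda a, b: a | b, [F.col(c).isNull() for c in df.columns])
df.filter(any_null).count()  # should also return 2 for the example data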

Add columns on a Pyspark Dataframe

I have a Pyspark Dataframe with this structure:
+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+
I originally had two dataframes, but I outer-joined them using user as the key, so there could also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+----+----+
|user| A/B| C|
+----+----+----+
| 1 | 1| 3|
| 2 | 4| 2|
+----+----+----+
Also note that there could be many equally named columns, so selecting each column literally is not an option. In pandas this was possible by using "user" as the index and then adding both dataframes. How can I do this in Spark?
I have a workaround for this:
val dataFrameOneColumns=df1.columns.map(a=>if(a.equals("user")) a else a+"_1")
val updatedDF=df1.toDF(dataFrameOneColumns:_*)
Now do the join; the output will contain the same values under different column names.
Then build the list of column-name tuples to be combined:
val newlist = df1.columns.filterNot(_.equals("user")).zip(dataFrameOneColumns.filterNot(_.equals("user")))
Then combine the values of the columns within each tuple to get the desired output!
PS: I am guessing you can write the logic for combining, so I am not spoon-feeding!
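For completeness, here is a hedged PySpark sketch of the same idea, assuming the two original dataframes are df1 and df2 and both have the columns "user", "A/B" and "C" (names are illustrative):
import pyspark.sql.functions as F

# rename df2's non-key columns with a suffix so the joined result has unique names
df2_renamed = df2.select([F.col(c).alias(c if c == "user" else c + "_2") for c in df2.columns])
joined = df1.join(df2_renamed, "user", "outer")

# sum each pair of same-named columns, treating nulls from the outer join as 0
summed = joined.select(
    "user",
    *[(F.coalesce(F.col(c), F.lit(0)) + F.coalesce(F.col(c + "_2"), F.lit(0))).alias(c)
      for c in df1.columns if c != "user"]
)
summed.show()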
