Excel: Store the difference of two columns in another column - excel

I have two email lists. One list is in column A, the other is in column B.
I want to remove all emails that are in B from A and then
store the results in column C.
I searched for a solution, but the ones I found only highlight the differences;
I want to remove them instead.

Assume that your two lists are in column A and column B respectively, and that both lists start in the second row (i.e. A2 and B2). Put this formula in cell C2 and fill down:
=IF(ISERROR(VLOOKUP(A2,B:B,1,FALSE)),A2,"")
If column A contains duplicate items, you can extract the unique values by putting this formula in cell D2 and filling down:
=IFERROR(INDEX($C$2:$C$1000,MATCH(0,INDEX(COUNTIF($D$1:D1,$C$2:$C$1000),0,0),0)),"")
You can change the 1000 in $C$2:$C$1000 according to the length of your list.
See my example:
row | column A  | column B | column C
1   |           |          |
2   | apple     | banana   | =IF(ISERROR(VLOOKUP(A2,B:B,1,FALSE)),A2,"")
3   | banana    | grape    | =IF(ISERROR(VLOOKUP(A3,B:B,1,FALSE)),A3,"")
4   | orange    | melon    | ...
5   | pineapple | limon    | =IF(ISERROR(VLOOKUP(A5,B:B,1,FALSE)),A5,"")
6   | orange    |          | ...
7   | limon     |          |
8   | apple     |          |
9   | grape     |          |
10  | melon     |          |
11  | peach     |          | =IF(ISERROR(VLOOKUP(A11,B:B,1,FALSE)),A11,"")
column D
=IFERROR(INDEX($C$2:$C$1000,MATCH(0,INDEX(COUNTIF($D$1:D1,$C$2:$C$1000),0,0),0)),"")
=IFERROR(INDEX($C$2:$C$1000,MATCH(0,INDEX(COUNTIF($D$1:D2,$C$2:$C$1000),0,0),0)),"")
...
=IFERROR(INDEX($C$2:$C$1000,MATCH(0,INDEX(COUNTIF($D$1:D10,$C$2:$C$1000),0,0),0)),"")
Example result:
column A  | column B | column C  | column D
apple     | banana   | apple     | apple
banana    | grape    |           | orange
orange    | melon    | orange    | pineapple
pineapple | limon    | pineapple | peach
orange    |          | orange    |
limon     |          |           |
apple     |          | apple     |
grape     |          |           |
melon     |          |           |
peach     |          | peach     |
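For readers who want to check the logic outside Excel, here is a minimal plain-Python sketch of what the two formulas compute, using the sample data from the example. The variable names are only illustrative. One difference to keep in mind: VLOOKUP matches text case-insensitively, while this sketch compares strings exactly.

```python
# Sample data copied from the example above.
column_a = ["apple", "banana", "orange", "pineapple", "orange",
            "limon", "apple", "grape", "melon", "peach"]
column_b = ["banana", "grape", "melon", "limon"]

# Column C: keep the value from A when it is not found in B, else "".
lookup = set(column_b)
column_c = [value if value not in lookup else "" for value in column_a]

# Column D: first occurrence of each non-empty value in C, in original order.
column_d = list(dict.fromkeys(value for value in column_c if value))

print(column_c)
print(column_d)
```

The lists printed here correspond to columns C and D of the example result.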

Related

excel pivot table - query based on multiple columns value

I have a pivot table in excel that has an attribute say animals that holds 3 values (dogs, cats, birds)
and the rows represent persons.
     | dogs | cats | birds |
-----+------+------+-------+
John |    0 |    1 |     1 |
Jack |    1 |    1 |     1 |
Jim  |    0 |    0 |     1 |
Pam  |    1 |    0 |     1 |
Tess |    2 |    1 |     0 |
I'd like to select all persons who have at least a dog and a cat, or exactly 2 dogs and a bird.
I am trying to use calculated fields, but I can't seem to select the proper values:
=IF(AND(animals['dogs']>1, animals['cats'] >1),"YEAH", ":-(")
OK, that was fairly easy after figuring out what to search for.
I used a calculated item and was able to write the exact formula I was looking for.
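The final formula is not shown, so the following is only a plain-Python sketch of the selection rule described in the question, not the calculated item itself. It assumes "a bird" means at least one bird. Note that "at least one" translates to `>= 1`, whereas the `>1` in the attempted formula means "more than one".

```python
# Counts per person, copied from the table in the question.
animals = {
    "John": {"dogs": 0, "cats": 1, "birds": 1},
    "Jack": {"dogs": 1, "cats": 1, "birds": 1},
    "Jim":  {"dogs": 0, "cats": 0, "birds": 1},
    "Pam":  {"dogs": 1, "cats": 0, "birds": 1},
    "Tess": {"dogs": 2, "cats": 1, "birds": 0},
}

def matches(counts):
    """At least one dog and one cat, or exactly two dogs and at least one bird."""
    has_dog_and_cat = counts["dogs"] >= 1 and counts["cats"] >= 1
    two_dogs_and_bird = counts["dogs"] == 2 and counts["birds"] >= 1
    return has_dog_and_cat or two_dogs_and_bird

selected = [name for name, counts in animals.items() if matches(counts)]
print(selected)
```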

Calculate Spark column value depending on another row value on the same column

I'm working on Apache spark 2.3.0 cloudera4 and I have an issue processing a Dataframe.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
That is, from an iterative perspective: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? I need a value computed for one row to calculate the value for another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of a column. Other than that, I'm creating a couple of helper columns to get the max id and get the ordering right.
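import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, sum, when}
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark
val result = df
  // highest id in the whole DataFrame, repeated on every row
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  // the row with the highest id contributes d1, every other row contributes d2
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  // running total from the highest id downwards
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show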
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+
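To make the arithmetic behind the window functions easier to follow, here is a small plain-Python sketch (not Spark code) that walks the rows from the highest id downwards and keeps a running total. The variable names are illustrative only.

```python
# Input rows copied from the question; None stands for an empty cell.
rows = [
    {"id": 1, "d1": None, "d2": 2.0},
    {"id": 2, "d1": None, "d2": -4.0},
    {"id": 3, "d1": None, "d2": 6.0},
    {"id": 4, "d1": 3.0, "d2": None},
]

max_id = max(row["id"] for row in rows)

running = 0.0
r = {}
# Walk from the highest id to the lowest, like Window.orderBy($"id".desc).
for row in sorted(rows, key=lambda row: row["id"], reverse=True):
    # The row with the highest id contributes d1, all others contribute d2.
    r1 = row["d1"] if row["id"] == max_id else row["d2"]
    # Treat missing values as 0, mirroring how sum() skips nulls.
    running += r1 if r1 is not None else 0.0
    r[row["id"]] = running

for row_id in sorted(r):
    print(row_id, r[row_id])
```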

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation based on the value of type. Adding otherwise(0) makes an ID with no rows of a given type sum to 0 rather than null.
from pyspark.sql import functions as F
result = df.withColumn("sales_a", F.when(F.col("type") == "a", F.col("amount")).otherwise(0)) \
    .withColumn("sales_b", F.when(F.col("type") == "b", F.col("amount")).otherwise(0)) \
    .groupBy("ID") \
    .agg(F.sum("amount").alias("sales"),
         F.sum("sales_a").alias("sales_a"),
         F.sum("sales_b").alias("sales_b"))
Alternatively, aggregate the total and pivot on type separately, then join the two results:
from pyspark.sql import functions as F
# keep the original df intact so it can still be filtered on type
dfTotal = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))
res = dfTotal.join(dfPivot, on="ID", how="left")
Then replace null with 0.
This is a generic solution that works irrespective of the values in the type column, so if a type c is added to the dataframe, the pivot will create a column for it as well.
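As a quick way to check the expected numbers, the conditional sums can also be sketched in plain Python (this is not PySpark, just the same arithmetic on the dummy data from the question):

```python
from collections import defaultdict

# (ID, type, amount) rows copied from the question; None stands for null.
rows = [(1, "a", 55), (2, "b", 1455), (2, "a", 20), (2, "b", 100), (3, None, 230)]

totals = defaultdict(lambda: {"sales": 0, "sales_a": 0, "sales_b": 0})
for row_id, row_type, amount in rows:
    # Every row counts towards the overall total.
    totals[row_id]["sales"] += amount
    # Only rows of type a or b count towards the per-type totals.
    if row_type in ("a", "b"):
        totals[row_id]["sales_" + row_type] += amount

result = dict(totals)
for row_id in sorted(result):
    print(row_id, result[row_id])
```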

Pyspark: Dropping columns with no distinct values only using transformations [duplicate]

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function to each column to get the number of distinct values per column. A column with count=1 has only one value across all rows.
from pyspark.sql.functions import col, countDistinct
# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# collect the columns with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
You can use the approx_count_distinct function to count the number of distinct elements in a column. If a column has just one distinct value, remove that column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting the result into a dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# Store the keys (column names) that have just 1 distinct value.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns that have a single distinct value
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+
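The same idea, count the distinct values per column and drop the columns where that count is 1, can be sketched in plain Python on the sample data. This is only an illustration of the logic with made-up variable names, not a replacement for the Spark code.

```python
# Column names and rows copied from the example DataFrame above.
columns = ["value1", "value2", "value3", "value4"]
rows = [(1, 2, 2, 0), (2, 2, 2, 0), (3, 2, 2, 0), (4, 2, 2, 0), (3, 1, 2, 0)]

# Number of distinct values per column.
distinct_counts = {
    name: len({row[i] for row in rows}) for i, name in enumerate(columns)
}

# Keep only the columns that have more than one distinct value.
keep = [i for i, name in enumerate(columns) if distinct_counts[name] > 1]
kept_columns = [columns[i] for i in keep]
kept_rows = [tuple(row[i] for i in keep) for row in rows]

print(distinct_counts)
print(kept_columns)
print(kept_rows)
```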

Keep track of the previous row values with additional condition using pyspark

I'm using pyspark to generate a dataframe where I need to update 'amt' column with previous row's 'amt' value only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column will contain the previous row's non-zero value; otherwise there is no change.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous row's value, but I need help with rows where multiple consecutive 0 amt values appear (for example, id = 2, 3).
The code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window
my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the below DF
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the approach below to get the output, and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window
my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's amt value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# replace a 0 amt with the previous row's value; consecutive 0 rows are still 0 after this step
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null for the rows where both amt and amt_adjusted are 0 (logic for consecutive rows with 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there any other better approach?
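One way to think about a simpler approach: treat every 0 as "missing" and carry the last non-zero amt forward, which is what the final last(..., ignorenulls=True) step already does for null values. The plain-Python sketch below shows only that carry-forward logic on the sample data. It is not PySpark code and the variable names are illustrative.

```python
# (id, amt) rows copied from the question.
rows = [(1, 5), (2, 0), (3, 0), (4, 6), (5, 0), (6, 3)]

last_non_zero = None  # stays None until the first non-zero amt is seen
modified = []
for row_id, amt in sorted(rows):  # process rows in id order
    if amt != 0:
        last_non_zero = amt
    # A zero amt takes the most recent non-zero value seen so far.
    modified.append((row_id, amt, last_non_zero))

for row in modified:
    print(row)
```

If the very first row had amt = 0, there would be no earlier value to carry forward, so its modi_amt would stay None in this sketch.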

Resources