how to combine different categorical attribute values in dataframe - python-3.x

I am working on the NYC property sales dataset (https://www.kaggle.com/new-york-city/nyc-property-sales).
There is one column, "BUILDING CLASS CATEGORY", which has several different categorical string values. What I want to do is keep only the top 4 categories with the most occurrences and combine all the remaining values into a single one.
For example:
> dataset["BUILDING CLASS CATEGORY"].value_counts()
01 ONE FAMILY DWELLINGS 12686
10 COOPS - ELEVATOR APARTMENTS 11518
02 TWO FAMILY DWELLINGS 9844
13 CONDOS - ELEVATOR APARTMENTS 7965
09 COOPS - WALKUP APARTMENTS 2504
03 THREE FAMILY DWELLINGS 2318
07 RENTALS - WALKUP APARTMENTS 1743
What I want is for all instances of the top 4 categories to be replaced by integer values, like so:
01 ONE FAMILY DWELLINGS instances are replaced by 0
10 COOPS - ELEVATOR APARTMENTS instances are replaced by 1
02 TWO FAMILY DWELLINGS instances are replaced by 2
13 CONDOS - ELEVATOR APARTMENTS instances are replaced by 3
and all other instances are replaced by the integer 4.
So the next time I run the command, it should output something like this:
> dataset["BUILDING CLASS CATEGORY"].value_counts()
0 12686
1 11518
2 9844
3 7965
4 6565 #sum of all the other instances
I have tried using LabelEncoder, but my approach is getting too long, so if there is a more efficient way to do this, please tell me.

Let's just call your series like this for short:
building_cat = dataset["BUILDING CLASS CATEGORY"]
This is what you already did:
vc = building_cat.value_counts()
Now get the list of the top 4 categories:
top4 = vc[:4].index.tolist()
And map it over your series:
building_cat = building_cat.map(lambda x: top4.index(x) if x in top4 else 4)
I didn't download the dataset, so if this doesn't work, let me know and I'll try it locally.
You can change the type if needed:
building_cat = building_cat.astype("category")
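Putting the pieces together, a minimal runnable sketch, using a small made-up series in place of the Kaggle column:

```python
import pandas as pd

# Hypothetical stand-in for dataset["BUILDING CLASS CATEGORY"]
building_cat = pd.Series(["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E", "F"])

vc = building_cat.value_counts()      # counts, sorted descending
top4 = vc[:4].index.tolist()          # the 4 most frequent labels
building_cat = building_cat.map(lambda x: top4.index(x) if x in top4 else 4)

print(building_cat.value_counts())    # 0..3 for the top 4, 4 for everything else
```

Note that top4.index(x) is a linear list lookup; on a large frame it would be faster to build a dict, e.g. {label: i for i, label in enumerate(top4)}, and pass that dict to .map.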

Related

Rank groups without duplicates [duplicate]

I am trying to get a unique rank value (e.g. {1, 2, 3, 4}) from a subgroup in my data. SUMPRODUCT will produce ties ({1, 1, 3, 4}), so I am trying to add COUNTIFS at the end to adjust the duplicate ranks away.
subgroup
col B col M rank
LMN 01 1
XYZ 02
XYZ 02
ABC 03
ABC 01
XYZ 01
LMN 02 3
ABC 01
LMN 03 4
LMN 03 4 'should be 5
ABC 02
XYZ 02
LMN 01 1 'should be 2
So far, I've come up with this.
=SUMPRODUCT(($B$2:$B$38705=B2)*(M2>$M$2:$M$38705))+countifs(B2:B38705=B2,M2:M38705=M2)
What have I done wrong here?
The good news is that you can throw away the SUMPRODUCT function and replace it with a pair of COUNTIFS functions. COUNTIFS can use full-column references without detriment, and it is vastly more efficient than SUMPRODUCT, even when the SUMPRODUCT cell ranges are limited to the extent of the data.
In N2 as a standard function,
=COUNTIFS(B:B, B2,M:M, "<"&M2)+COUNTIFS(B$2:B2, B2, M$2:M2, M2)
Fill down as necessary.
      
[screenshot: Filtered Results]
        
Solution based on the OP's approach
Since your post asked for alternatives, I got interested in a solution based on your original approach via the SUMPRODUCT function.
IMO it could be of interest for the sake of the art:
Applied method
Get
a) the count of all rows in the current group with a value lower than or equal to the current value,
MINUS
b) the number of rows in the current group with an identical value, counting from the current row down,
PLUS
c) an increment of 1.
Formula example, e.g. in cell N5:
=SUMPRODUCT(($B$2:$B$38705=$B5)*($M$2:$M$38705<=$M5))-COUNTIFS($B5:$B$38705,$B5,$M5:$M$38705,$M5)+1
P.S.
Of course, I too prefer the solution posted above :+)
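For cross-checking the COUNTIFS logic outside Excel, here is a small pandas sketch of the same idea (count of strictly smaller values within the group, plus a running count of equal values down to the current row). The column names B and M and the sample values are taken from the LMN rows above:

```python
import pandas as pd

df = pd.DataFrame({"B": ["LMN"] * 5, "M": [1, 2, 3, 3, 1]})

def countifs_rank(df):
    ranks = []
    for i in range(len(df)):
        b, m = df["B"].iloc[i], df["M"].iloc[i]
        # COUNTIFS(B:B, B2, M:M, "<"&M2): strictly smaller values in the same group
        smaller = int(((df["B"] == b) & (df["M"] < m)).sum())
        # COUNTIFS(B$2:B2, B2, M$2:M2, M2): running count of exact duplicates so far
        running_equal = int(((df["B"].iloc[:i + 1] == b) & (df["M"].iloc[:i + 1] == m)).sum())
        ranks.append(smaller + running_equal)
    return ranks

print(countifs_rank(df))
```

This reproduces the duplicate-free ranks the question asks for: the two "LMN 03" rows get 4 and 5, and the second "LMN 01" row gets 2.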

compare values of two dataframes based on certain filter conditions and then get count

I am new to Spark. I am writing PySpark code where I have two dataframes such that:
DATAFRAME-1:
NAME BATCH MARKS
A 1 44
B 15 50
C 45 99
D 2 18
DATAFRAME-2:
NAME MARKS
A 36
B 100
C 23
D 67
I want my output to be a comparison between these two dataframes, such that I can store the counts as variables. For instance:
improvedStudents = 1 (D belongs to batch 1-15 and has improved his score)
badPerformance = 2 (A and B have performed badly: they belong to batch 1-15 and their marks are lower than before)
neutralPerformance = 1 (C, because even though his marks went down, he belongs to batch 45, which we don't want to consider)
This is just an example out of a complex problem I'm trying to solve.
Thanks
If the data is as in your example, why don't you just join them and create a new column for every metric that you have:
val df = df1.withColumnRenamed("MARKS", "PRE_MARKS")
  .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), Seq("NAME"))
  .withColumn("Evaluation",
    when(col("BATCH") > 15, lit("neutral"))
      .when(col("PRE_MARKS") > col("POST_MARKS"), lit("bad"))
      .when(col("POST_MARKS") > col("PRE_MARKS"), lit("improved"))
      .otherwise(lit("neutral")))
  .groupBy("Evaluation")
  .count()
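The question mentions PySpark, but the answer's snippet is Scala. Purely as a self-contained illustration of the same join-and-classify logic, here is a pandas sketch (np.select plays the role of the when/otherwise chain, and value_counts of groupBy(...).count(); in PySpark the analogues are join, when/otherwise, and groupBy):

```python
import pandas as pd
import numpy as np

# Toy frames mirroring DATAFRAME-1 and DATAFRAME-2 from the question
df1 = pd.DataFrame({"NAME": ["A", "B", "C", "D"],
                    "BATCH": [1, 15, 45, 2],
                    "MARKS": [44, 50, 99, 18]})
df2 = pd.DataFrame({"NAME": ["A", "B", "C", "D"],
                    "MARKS": [36, 100, 23, 67]})

merged = (df1.rename(columns={"MARKS": "PRE_MARKS"})
             .merge(df2.rename(columns={"MARKS": "POST_MARKS"}), on="NAME"))

# First matching condition wins, just like a when/otherwise chain
merged["Evaluation"] = np.select(
    [merged["BATCH"] > 15,
     merged["PRE_MARKS"] > merged["POST_MARKS"],
     merged["POST_MARKS"] > merged["PRE_MARKS"]],
    ["neutral", "bad", "improved"],
    default="neutral")

counts = merged["Evaluation"].value_counts().to_dict()
improvedStudents = counts.get("improved", 0)
badPerformance = counts.get("bad", 0)
neutralPerformance = counts.get("neutral", 0)
print(improvedStudents, badPerformance, neutralPerformance)
```

Note that under this strictly-by-marks rule, B (50 → 100) counts as improved, which differs from the counts given in the question.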

manipulating a pandas dataframe column containing a list

I have used the following code, together with the unique() function in pandas, to create a column containing a list of unique values:
import pandas as pd
from collections import OrderedDict
dct = OrderedDict([
('referencenum',['10','10','20','20','20','30','30','40']),
('Month',['Jan','Jan','Jan','Feb','Feb','Feb','Feb','Mar']),
('Category',['good','bad','bad','bad','bad','good','bad','bad'])
])
df = pd.DataFrame.from_dict(dct)
This gives the following sample dataset:
referencenum Month Category
0 10 Jan good
1 10 Jan bad
2 20 Jan bad
3 20 Feb bad
4 20 Feb bad
5 30 Feb good
6 30 Feb bad
7 40 Mar bad
Then I summarise as follows:
dfsummary = pd.DataFrame(df.groupby(['referencenum', 'Month'])['Category'].unique())
dfsummary = dfsummary.reset_index()
To give the summary dataframe with "Category" column containing a list
referencenum Month Category
0 10 Jan [good, bad]
1 20 Feb [bad]
2 20 Jan [bad]
3 30 Feb [good, bad]
4 40 Mar [bad]
My question is: how do I obtain another column containing the len(), i.e. the number of items, of the "Category" list column?
Also, how do I extract the first or second item of the list into another column?
Can I do these manipulations within pandas, or do I somehow need to drop out to list manipulations and then come back to pandas?
Many thanks!
You should check out the accessors.
Basically, they're ways to handle the values contained in a Series that are specific to their type (datetime, string, etc.).
In this case, you would use df['Category'].str.len().
If you wanted the first element, you would use df['Category'].str[0].
To generalise: you can treat the elements of a Series as a collection of objects by referring to its .str property.
If you want to get the number of elements of each entry in Category column, you should use len() method with apply():
dfsummary['Category_len'] = dfsummary['Category'].apply(len)
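Both suggestions can be combined into one runnable sketch, using the sample data from the question (the new column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "referencenum": ["10", "10", "20", "20", "20", "30", "30", "40"],
    "Month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb", "Mar"],
    "Category": ["good", "bad", "bad", "bad", "bad", "good", "bad", "bad"],
})

dfsummary = (df.groupby(["referencenum", "Month"])["Category"]
               .unique()
               .reset_index())

# Number of items in each list: .str.len() or .apply(len) both work
dfsummary["Category_len"] = dfsummary["Category"].str.len()
# First element of each list (.str[1] would give the second, NaN if absent)
dfsummary["first_cat"] = dfsummary["Category"].str[0]
print(dfsummary)
```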


How to create Power View report tiled by multiple KPIs?

I am trying to create a Power View report tiled by KPIs, is this possible?
An example of my raw data:
Company ID Employee ID Measure numerator denominator
1 01 1 2 5
2 04 1 3 6
3 02 1 0 5
4 03 1 1 2
1 01 2 4 4
2 04 2 2 3
2 06 2 0 6
4 01 2 1 4
I have created a calculated column in Power Pivot using the following DAX function:
RATE:=[numerator]/[denominator]
From this, I want to create KPIs for each measure (each measure has different targets), and use these KPIs as tiles in a Power View report filtered by Company ID and/or Employee ID.
Can this be done?
You cannot tile by a KPI. Your KPI has to be a measure, and you cannot tile by a measure. You can tile by a calculated column or by any original field in your dataset, so you can drop RATE into the TILE BY box.
To use the Measure column or a KPI type icon as the Tile field the best approach might be to use an image for your measure column. See more on including images in Power View here: https://support.office.com/en-ca/article/Images-in-Power-View-84e91b90-b1da-4913-83d6-beada5591fce?ui=en-US&rs=en-CA&ad=CA
