How to display the rows with the most number of occurrences in a column of a dataframe? - python-3.x

I have a data frame with 6 columns:
taken person quant reading personal family
0 1 lake rad 9.7 Anderson Lake
1 1 lake sal 0.21 Anderson Lake
2 5 Lim sal 0.08 Andy Lim
3 2 Lim rad 9.82 Andy Lim
4 2 Lim sal 0.13 Andy Lim
5 3 dyer rad 7.7 William Dyer
Output I want:
taken person quant reading personal family
0 5 Lim sal 0.08 Andy Lim
1 2 Lim rad 9.82 Andy Lim
2 2 Lim sal 0.13 Andy Lim
Basically, I want to display all the rows in the df whose value in the personal column occurs the most often. This is what I've tried, but it doesn't work:
test = df.personal.mode()
test1 = df.loc[df.personal == test]
display(test1)

You can combine value_counts and boolean indexing:
df[df['person'] == df['person'].value_counts().index[0]]
Output:
taken person quant reading personal family
2 5 Lim sal 0.08 Andy Lim
3 2 Lim rad 9.82 Andy Lim
4 2 Lim sal 0.13 Andy Lim
Note that this keeps only one person if several persons have the same number of appearances. If you want to keep all of them, mode and isin are a better choice:
df[df['person'].isin(df['person'].mode())]
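If you would rather not call value_counts or mode at all, an equivalent sketch (assuming the same df and, as in the question, the personal column) uses groupby with transform so that ties are kept automatically:
# Count how often each value of 'personal' appears, aligned to the original rows
counts = df.groupby('personal')['personal'].transform('size')
# Keep every row whose 'personal' value occurs the maximum number of times (ties included)
most_frequent = df[counts == counts.max()]
display(most_frequent)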

Related

How to Replace Multiple Strings in a Data Frame Using Python

I have a data frame with 73k rows, and here's the following sample data :
Index Customers' Name States
0 Alpha Oregon
1 Alpha Oregon
2 Bravo Utah
3 Bravo Utah
4 Charlie Alabama
5 Charlie Alabama
6 Alpha Oregon
7 Alpha Oregon
8 Bravo Utah
The data contain identifying values, but I am not allowed to delete or remove them because they are needed for my research. Instead, I would like to replace the customers' names with specific pseudonyms so the result looks like this:
Index Customers' Name States
0 z1 Oregon
1 z1 Oregon
2 z2 Utah
3 z2 Utah
4 z3 Alabama
5 z3 Alabama
6 z1 Oregon
7 z1 Oregon
8 z2 Utah
I'm still a beginner; I've been learning Python for around 3 months. So, how can I make this change in bulk, keeping in mind that I have 73k rows like this? I assume it has to be done with a for loop. I already tried, but I can't wrap it up well. Please help me finish/solve this.
You can use .groupby() with .ngroup():
df["Customers' Name"] = "z" + (
df.groupby("Customers' Name").ngroup() + 1
).astype("str")
print(df)
Prints:
Customers' Name States
0 z1 Oregon
1 z1 Oregon
2 z2 Utah
3 z2 Utah
4 z3 Alabama
5 z3 Alabama
6 z1 Oregon
7 z1 Oregon
8 z2 Utah
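A small alternative sketch, assuming the same column name, uses pd.factorize, which numbers the labels in order of first appearance rather than alphabetically:
# codes are 0-based group numbers assigned in order of first appearance
codes, uniques = pd.factorize(df["Customers' Name"])
df["Customers' Name"] = "z" + pd.Series(codes + 1, index=df.index).astype(str)
Either way, every occurrence of the same customer gets the same pseudonym without an explicit for loop.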

Count unique values in a MS Excel column based on values of other column

I am trying to find the number of unique Customers, O (Orders), Q (Quotations) and D (Drafts) our team has dealt with on a particular day from this sample dataset. Please note that there are repeated "Quote/Order #"s in the dataset, so I need to figure out the unique numbers of Q/O/D on a given day.
I have figured out all the values except the fields highlighted in light orange in my expected-output table. Can someone help me figure out the MS Excel formulas for those four values?
Below is the given dataset. Please note that there can be empty values against a date, but those will always be found in the bottom few rows of the table:
Date      Job #  Job Type  Quote/Ordr #  Parts  Customer  man-hr
4-Apr-22  1      O         307585        1      FRU       0.35
4-Apr-22  2      D         307267        28     ATM       4.00
4-Apr-22  2      D         307267        25     ATM       3.75
4-Apr-22  2      D         307267        6      ATM       0.17
4-Apr-22  3      D         307438        3      ELCTRC    0.45
4-Apr-22  4      D         307515        7      ATM       0.60
4-Apr-22  4      D         307515        5      ATM       0.55
4-Apr-22  4      D         307515        4      ATM       0.35
4-Apr-22  5      O         307587        4      PULSE     0.30
4-Apr-22  6      O         307588        3      PULSE     0.40
5-Apr-22  1      O         307623        1      WST       0.45
5-Apr-22  2      O         307629        4      CG        0.50
5-Apr-22  3      O         307630        10     SUPER     1.50
5-Apr-22  4      O         307631        3      SUPER     0.60
5-Apr-22  5      O         307640        7      CAM       0.40
5-Apr-22  6      Q         307527        6      WG        0.55
5-Apr-22  6      Q         307527        3      WG        0.30
5-Apr-22
To figure out the unique "Number of Jobs" on Apr 4, I used the Excel formula:
=MAXIFS($K$3:$K$20,$J$3:$J$20,R3), where R3 = '4-Apr-22'
To figure out the unique "Number of D (Draft) Jobs" I used the Excel formula:
=SUMIFS($P$3:$P$20,$J$3:$J$20,R3,$L$3:$L$20,"D")
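Outside Excel, and only for reference since the rest of this page uses pandas, a rough sketch of the same per-day unique counts (the file name and the exact column labels are assumptions based on the table above):
import pandas as pd

df = pd.read_excel("jobs.xlsx")  # hypothetical file containing the table shown above

# Number of distinct Quote/Ordr # per Date and Job Type (O / Q / D)
unique_jobs = df.groupby(["Date", "Job Type"])["Quote/Ordr #"].nunique()

# Number of distinct customers per Date
unique_customers = df.groupby("Date")["Customer"].nunique()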

How to compare values in a data frame in pandas [duplicate]

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This is a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
    Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
    df2['difference'] = (df2['Gold']-df2['Gold.1']).abs()/df2['Gold.2']
    return df2['diff gold %'].idxmax()

answer()
Try this code after substituting in your own function and variable names. I'm new to Python, but I think the issue was that the return had to use the same column you created (df1['difference']), with .idxmax() added to the end. I don't think you need the first line inside the function either, since you never use the local variable Gold_Y. FYI, I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold']-df1['Gold.1']).abs()/df1['Gold.2']
    return df1['difference'].idxmax()

answer_three()
def answer_three():
    # "at least one gold" in both seasons means a count greater than 0
    atleast_one_gold = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1'])/atleast_one_gold['Gold.2']).idxmax()

answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).argmax()

answer_three()
This looks like a question from the programming assignment of the Coursera course
"Introduction to Data Science in Python"
Having said that, if you are not cheating, "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator keeps countries that have won gold in either the Summer or the Winter Olympics, not necessarily both.
With the & filter you should not get a NaN in your diff gold % column.
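For reference, a sketch of the corrected filter, reusing the asker's names (and switching to > 0, since "at least 1 gold" means one or more):
# Countries with at least one gold medal in BOTH summer and winter
Gold_Y = df2[(df2['Gold'] > 0) & (df2['Gold.1'] > 0)]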
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs() / df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    return x['relativegold'].idxmax(axis=0)

answer_three()
I am pretty new to Python, and to programming as a whole, so my solution is probably the most novice one here!
I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    a = df.loc[df['Gold'] > 0, 'Gold']
    # Boolean masking that keeps only the 'Gold' values matching the condition in the question,
    # i.e. countries with at least one gold medal in the summer Olympics.
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # Same as above, but 'Gold.1' is gold medals in the winter Olympics.
    dif = abs(a - b)
    # The absolute value of the difference between a and b.
    tots = a + b
    # I only realised later that this step wasn't essential, because the data frame
    # already sums it up in the column 'Gold.2'.
    result = dif.dropna() / tots.dropna()
    # dropna() returns a copy rather than modifying in place, so it is applied right here.
    return result.idxmax()
    # Returns the index value (country) of the max result.
def answer_two():
    df2 = pd.Series.max(df['Gold'] - df['Gold.1'])
    df2 = df[df['Gold'] - df['Gold.1'] == df2]
    return df2.index[0]

answer_two()
def answer_three():
    mask = (df['Gold'] > 0) & (df['Gold.1'] > 0)
    return ((df[mask]['Gold'] - df[mask]['Gold.1']) / df[mask]['Gold.2']).argmax()
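Putting the fixes above together, a minimal sketch (assuming a DataFrame df indexed by country with columns 'Gold', 'Gold.1' and 'Gold.2'; the function name is only illustrative):
def biggest_relative_gap(df):
    # Keep countries with at least one gold in both summer and winter
    both = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    # Absolute summer/winter difference relative to the total gold count
    ratio = (both['Gold'] - both['Gold.1']).abs() / both['Gold.2']
    # Index label (country name) of the largest relative difference
    return ratio.idxmax()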

How to have a cross tabulation for categorical data in Pandas (Python)?

I have the following code, for example:
import numpy as np
import pandas as pd

df = pd.DataFrame(dtype="category")
df["Gender"] = np.random.randint(2, size=100)
df["Q1"] = np.random.randint(3, size=100)
df["Q2"] = np.random.randint(3, size=100)
df["Q3"] = np.random.randint(3, size=100)
df[["Gender", "Q1", "Q2", "Q3"]] = df[["Gender", "Q1", "Q2", "Q3"]].astype('category')
pd.pivot_table(data=df, index=["Gender"])
I want to have a pivot table with percentages over gender for all the other columns, in fact like the following.
How can I achieve this?
The above code gives an error saying that there are
No numeric types to aggregate
I don't have any numerical columns. I just want to find the frequency of each category under male and female, and the percentage of each over the male and female totals respectively.
As suggested by your question, you can use pd.crosstab to build the cross tabulation you need.
You just need a quick preprocessing step, which is to melt the Q columns into rows (see details below):
df = df.melt(id_vars='Gender',
             value_vars=['Q1', 'Q2', 'Q3'],
             var_name='Question', value_name='Answer')
Then you can use pd.crosstab and calculate the percentages as needed (here the percentage for each Question per Gender per Answer is shown):
pd.crosstab(df.Question, columns=[df.Gender, df.Answer]).apply(lambda row: row/row.sum(), axis=1)
Gender 0 1
Answer 0 1 2 0 1 2
Question
Q1 0.13 0.18 0.18 0.13 0.19 0.19
Q2 0.09 0.21 0.19 0.22 0.13 0.16
Q3 0.19 0.10 0.20 0.16 0.18 0.17
Details
df.head()
Gender Q1 Q2 Q3
0 1 0 2 0
1 1 0 0 1
2 0 2 0 2
3 0 0 2 0
4 0 1 1 1
df.melt(id_vars='Gender', value_vars=['Q1', 'Q2', 'Q3'], var_name='Question', value_name='Answer').head()
Gender Question Answer
0 1 Q1 0
1 1 Q1 0
2 0 Q1 2
3 0 Q1 0
4 0 Q1 1
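As a side note, pd.crosstab also accepts a normalize argument, which can replace the manual apply above; a short sketch on the melted frame:
# normalize='index' divides each row by its row total, giving the same percentages
pd.crosstab(index=df.Question, columns=[df.Gender, df.Answer], normalize='index')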

how to compare strings in two data frames in pandas (python 3.x)?

I have two DFs like so:
df1:
ProjectCode ProjectName
1 project1
2 project2
3 projc3
4 prj4
5 prjct5
and df2 as
VillageName
v1
proj3
pro1
prjc3
project1
What I have to do is compare each ProjectName with each VillageName and also add the percentage match. The percentage is to be calculated as:
No. of matching characters / total characters * 100
The village data, i.e. df2, has more than 10 million records, while the project data, i.e. df1, contains around 1200 records.
What I have done so far:
import pandas as pd

df1 = pd.read_excel("C:\\Users\\Desktop\\distinctVillage.xlsx")
df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")

for idx, row in df.iteritems():
    for idx1, row1 in df1.iteritems():
I don't know how to proceed with this: how to find the matching characters and build a third df with the percentage match for each string. I think it may not be feasible, since each Project record has to be matched against every Village value, which will produce a huge result.
Is there any better way to find out which project names match which village names, and how good the match is?
Expected output:
ProjectName VillageName charactersMatching PercentageMatch
project1 v1 1 whateverPercent
project1 proj3 4 whateverPercent
The expected output can be changed depending on the feasibility and solution.
The following code assumes you don't care about repeated characters (since it's taking the set on both sides).
percentage_match = df1['ProjectName'].apply(
    lambda x: df2['VillageName'].apply(
        lambda y: len(set(y).intersection(set(x))) / len(set(x + y))
    )
)
Output:
0 1 2 3 4
ProjectCode
1 0.111111 0.444444 0.500000 0.444444 1.000000
2 0.000000 0.444444 0.333333 0.444444 0.777778
3 0.000000 0.833333 0.428571 0.833333 0.555556
4 0.000000 0.500000 0.333333 0.500000 0.333333
5 0.000000 0.375000 0.250000 0.571429 0.555556
If you want the 'best match' for each Project:
percentage_match.idxmax(axis = 1)
Output:
1 4
2 4
3 1
4 1
5 3
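If the order of characters matters rather than just the set of shared characters, the standard library's difflib gives an order-aware similarity ratio. A minimal sketch, not part of the original answer, and far too slow to run pairwise against 10 million villages without some blocking or pre-filtering:
from difflib import SequenceMatcher

# Similarity ratio in [0, 1] for every ProjectName/VillageName pair
similarity = df1['ProjectName'].apply(
    lambda x: df2['VillageName'].apply(lambda y: SequenceMatcher(None, x, y).ratio())
)
best_match = similarity.idxmax(axis=1)  # closest VillageName index per project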
