How to calculate mean by skipping String Value in Numeric Column? - python-3.x

Name Emp_ID Salary Age
0 John Abc 21000 31
1 mark Abn 34000 82
2 samy bbc thirty 78
3 Johny Ajc 21000 34
4 John Ajk 2100.28 twentyone
How do I calculate the mean of the 'Age' column without changing the string values in that column? Basically, I want to loop through the Age column, collect the numerical values, and compute the mean of that list, skipping any string values.

Use pd.to_numeric with the argument errors='coerce', which turns any value it cannot convert to a number into NaN. Then use Series.mean, which skips NaN by default:
pd.to_numeric(df['Age'], errors='coerce').mean()
#Out
56.25
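A minimal, self-contained sketch of the same call, assuming the DataFrame is built from the table in the question. It also shows that the original 'Age' column keeps its string value, because pd.to_numeric returns a new Series:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "mark", "samy", "Johny", "John"],
    "Emp_ID": ["Abc", "Abn", "bbc", "Ajc", "Ajk"],
    "Salary": [21000, 34000, "thirty", 21000, 2100.28],
    "Age": [31, 82, 78, 34, "twentyone"],
})

# Coerce a temporary copy of the column; df["Age"] itself is not modified.
age_numeric = pd.to_numeric(df["Age"], errors="coerce")

# Series.mean skips NaN by default (skipna=True).
age_mean = age_numeric.mean()

print(age_mean)             # 56.25
print(df["Age"].tolist())   # the string 'twentyone' is still in the column
```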

Related

How to drop records containing cell values equals to the header in pandas

I have read in this dataframe (called df):
As you can see there is a record that contains the same values as the header (ltv and age).
How do I drop that record in pandas?
Data:
df = pd.DataFrame({'ltv':[34.56, 50, 'ltv', 12.3], 'age':[45,56,'age',45]})
Check with
out = df[~df.eq(df.columns).any(axis=1)]
Out[203]:
ltv age
0 34.56 45
1 50 56
3 12.3 45
One way is to just filter it out (assuming the strings match the column name they are in):
out = df[df['ltv']!='ltv']
Another could be to use to_numeric + dropna:
out = df.apply(pd.to_numeric, errors='coerce').dropna()
Output:
ltv age
0 34.56 45
1 50 56
3 12.3 45
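The approaches select the same rows but leave different dtypes behind. A short sketch, using the df from the question, that compares the header-comparison filter with the to_numeric + dropna variant:

```python
import pandas as pd

df = pd.DataFrame({"ltv": [34.56, 50, "ltv", 12.3], "age": [45, 56, "age", 45]})

# Approach 1: drop rows where any cell equals the name of its own column.
by_header = df[~df.eq(df.columns).any(axis=1)]

# Approach 2: coerce every column to numbers and drop rows that became NaN.
by_numeric = df.apply(pd.to_numeric, errors="coerce").dropna()

print(by_header.index.tolist())    # [0, 1, 3]
print(by_numeric.index.tolist())   # [0, 1, 3]
print(by_numeric.dtypes)           # both columns are float64
```

The first approach only filters rows, so the columns keep their original mixed dtype. The second also converts the columns to numbers, but it drops any row containing a non-numeric value, not just the repeated header.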

In Excel how can a formula verify whether the column location or column element has taken the correct data from its header name?

The input data is in Sheet1 and the calculated output is in Sheet2.
The Sheet1 data can be changed by the user, so the columns 'Units1' and 'Units2' may not always sit in columns C and D. Suppose a new user enters the data with 'Avocado' and 'Banana' in columns C and D: the 'Output' calculation in Sheet2 will then be incorrect, because we always want to use Units1 and Units2 for the calculation.
How can I fix this, so that every time data is entered the formula checks whether the correct columns have been taken for the calculation?
Is there a way to use INDEX, the LOOKUP family of functions, or any other function for this?
Maybe by creating a new sheet with a table of indexes that refer to (or point to) the column names of the data sheet?
Location      Dates     Units1  Units2  Avocado  Banana
New York      05-01-18  10      12      1        2
Los Angeles   02-02-18  20      23      1        2
Chicago       08-03-18  30      34      1        2
Houston       05-04-18  40      45      1        2
Phoenix       02-05-18  50      56      1        2
Philadelphia  08-06-18  60      67      1        2
San Antonio   05-07-18  70      78      1        2
San Diego     02-08-18  80      89      1        2
Dallas        08-09-18  90      99      1        2
San Jose      05-10-18  100     112     1        2
Use INDEX/MATCH:
=INDEX(2:2,1,MATCH("Units2",$1:$1,0))/INDEX(2:2,1,MATCH("Units1",$1:$1,0))
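For readers who think in code, the same idea (find the column by its header text, then read the cell at that position) can be sketched in pandas. This is only an illustration of what MATCH and INDEX do in the formula above, not a replacement for it; the shuffled column order and the name sheet1 are assumptions made for the example:

```python
import pandas as pd

# First two data rows from the question. The column order is deliberately
# shuffled to mimic a user entering the data in a different layout.
sheet1 = pd.DataFrame({
    "Location": ["New York", "Los Angeles"],
    "Avocado": [1, 1],
    "Units2": [12, 23],
    "Dates": ["05-01-18", "02-02-18"],
    "Units1": [10, 20],
    "Banana": [2, 2],
})

# Like MATCH("Units2", $1:$1, 0): position of the header (0-based here).
units2_pos = sheet1.columns.get_loc("Units2")
units1_pos = sheet1.columns.get_loc("Units1")

# Like INDEX(2:2, 1, position): read each row's cell at that position.
output = sheet1.iloc[:, units2_pos] / sheet1.iloc[:, units1_pos]

print(output.tolist())   # [1.2, 1.15]
```

Because the lookup is by header text, the result stays the same wherever the user places the Units1 and Units2 columns.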

Pandas condition-based row elimination in DataFrame

I have a DataFrame with information stored in a column up to an unknown row number. After this row, the column only stores NaN values. However, some random NaN values also appear throughout the column. I want to count how many NaN values occur in a row to determine the last row storing information.
My code is as follows:
first, I create a NaN checker that accumulates the number of NaN values row after row
next, I check whether the NaN checker exceeds a certain threshold (3 in this case)
last, if the threshold is exceeded, the subsequent rows are eliminated
Check_NaN = (
    Fruits['bananas'].isnull().astype(int)
    .groupby(Fruits['bananas'].notnull().astype(int).cumsum())
    .sum()
)
for row in Fruits:
    for cell in row['bananas']:
        if cell(Check_NaN) < 3:
            sum_Fruits.update(Fruits)
        else:
            row.dropna(subset=['bananas'])
Below is a data sample for Fruits['bananas'], rows 110-130. The end of the Excel information in the DataFrame is indicated by the beginning of the NaN values.
110 banana red
111 banana green
112 banana white
113 banana yellow
114 banana black
115 banana orange
116 banana purple
117 banana pink
118 banana blue
119 banana silver
120 banana grey
121 banana gold
122 banana white
123 banana orange
124 --
125 NaN
126 NaN
127 NaN
128 NaN
129 NaN
However, I run into a problem at for cell in row['bananas']:, which gives TypeError: string indices must be integers.
This confuses me, because it means I cannot iterate over the rows that I want to eliminate. I need reusable code, as the beginning of the NaN values is different for each Excel sheet. How can I write my script so that the threshold of 3 NaN values is respected and the rest of the rows are eliminated?
To achieve this, look at the shift function in pandas: shift twice and check whether all three values are NaN.
Try this:
# Find the rows where the row itself and the two subsequent rows are null in the bananas column
All_three_null = Fruits['bananas'].isna() & Fruits['bananas'].shift(-1).isna() & Fruits['bananas'].shift(-2).isna()
# Find the index of the first row where this happens
First_instance = Fruits[All_three_null].index.min()
# Keep only the rows before that first row, which removes the trailing null rows
Good_data = Fruits[Fruits.index < First_instance]
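A runnable sketch of the shift idea on a small made-up sample (the data below is hypothetical, not the asker's sheet). It contains one stray NaN inside the data and a trailing run of NaN values, and it keeps every row before the first run of three:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one stray NaN in the data, then a trailing NaN run.
Fruits = pd.DataFrame({
    "bananas": ["red", "green", np.nan, "white", "yellow",
                np.nan, np.nan, np.nan, np.nan]
})

col = Fruits["bananas"]

# True where a row and the two rows after it are all NaN.
# Values shifted in from past the end of the column count as NaN.
three_in_a_row = col.isna() & col.shift(-1).isna() & col.shift(-2).isna()

if three_in_a_row.any():
    first_label = three_in_a_row.idxmax()   # label of the first True
    good_data = Fruits[Fruits.index < first_label]
else:
    good_data = Fruits                      # no run found, keep everything

print(good_data.index.tolist())   # [0, 1, 2, 3, 4]
```

The stray NaN at index 2 survives because it is not followed by two more NaN values. The explicit check with any() guards against sheets that contain no run at all.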
Another option, which scales better if you want to move from 3 NaNs in a row to 30:
The basic idea is to put each run of consecutive NaN occurrences into a uniquely identifiable group, then find the first group whose size reaches the set limit and use that group to filter the original DataFrame.
NaN_in_a_Row = 3
Fruits['Row_Not_NaN'] = Fruits['bananas'].notna()
Fruits['First_NaN_After_Not_NaN'] = Fruits['bananas'].isna() & Fruits['bananas'].shift(1).notna()
Fruits['Group_ID'] = (Fruits['Row_Not_NaN'] | Fruits['First_NaN_After_Not_NaN']).cumsum()
Fruits['Number_of_Rows'] = 1
# Series of group sizes, indexed by Group_ID
Filter = Fruits.groupby(['Group_ID'])['Number_of_Rows'].sum()
# First Group_ID whose size reaches the limit
Filter = Filter[Filter >= NaN_in_a_Row].index.min()
Fruits = Fruits[Fruits.Group_ID < Filter]
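The grouping idea can also be wrapped in a reusable function that does not add helper columns to the DataFrame. This is a variant sketch, not the code above: the function name trim_after_nan_run and the sample data are made up for illustration, and groups are formed wherever the null/not-null state changes:

```python
import numpy as np
import pandas as pd

def trim_after_nan_run(frame, column, run_length):
    """Drop all rows from the first run of at least `run_length` NaNs onward."""
    is_null = frame[column].isna()
    # A new group starts wherever the null/not-null state changes.
    group_id = (is_null != is_null.shift()).cumsum()
    # Length of the run that each row belongs to.
    run_size = is_null.groupby(group_id).transform("size")
    long_null_run = is_null & (run_size >= run_length)
    if not long_null_run.any():
        return frame                            # no qualifying run
    cutoff = long_null_run.to_numpy().argmax()  # position of the first True
    return frame.iloc[:cutoff]

# Hypothetical sample: one stray NaN, then a trailing run of four NaNs.
Fruits = pd.DataFrame({
    "bananas": ["red", "green", np.nan, "white", "yellow",
                np.nan, np.nan, np.nan, np.nan]
})

trimmed = trim_after_nan_run(Fruits, "bananas", 3)
print(trimmed.index.tolist())   # [0, 1, 2, 3, 4]
```

Changing the threshold from 3 to 30 only requires a different run_length argument.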

AGGREGAT with criteria and duplicates in array

I have the following Excel spreadsheet:
A B C D E
1 ProdID Price Unique ProdID 1. Biggest 2. Biggest
2 2606639 40 2606639 50 50
3 2606639 50 4633523 45 35
4 2606639 20 3911436 25 25
5 2606639 50
6 4633523 45
7 4633523 20
8 4633523 35
9 3911436 20
10 3911436 25
11 3911436 25
12 3911436 15
In Cells D2:E4 I want to show the 1. biggest and 2. biggest price of each ProdID in Column A. Therefore, I use the following formula:
D2 =AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),1)
E2 =AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),2)
This formula works as long as the prices are unique in Column B, as you can see for the second ProdID (4633523).
However, once a price is not unique in Column B (for example 50 for ProdID 2606639 and 25 for ProdID 3911436), the formulas in Cells D2:E4 do not show the right results.
Do you have an idea how to solve this with the AGGREGAT formula and without using an array formula?
You could count the occurrences of the combination of the ProdID and its biggest price, and use that count in the last argument of the AGGREGAT function. So instead of
=AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),2)
you would have
=AGGREGAT(14,6,$B$2:$B$12/($A$2:$A$12=$C2),2+COUNTIFS(A:A,C2,B:B,D2)-1)
Of course you can just write "1+COUNTIFS...", but I wrote it this way to make clear that it uses position 2 plus the number of occurrences of the ProdID/biggest-price combination after the first occurrence.
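The position arithmetic can be checked outside Excel. Below is a small Python sketch using the data from the question; kth_largest is a hypothetical helper that stands in for AGGREGAT(14, 6, ..., k), and values.count stands in for COUNTIFS:

```python
from collections import defaultdict

rows = [
    (2606639, 40), (2606639, 50), (2606639, 20), (2606639, 50),
    (4633523, 45), (4633523, 20), (4633523, 35),
    (3911436, 20), (3911436, 25), (3911436, 25), (3911436, 15),
]

# Collect the prices per ProdID.
prices = defaultdict(list)
for prod_id, price in rows:
    prices[prod_id].append(price)

def kth_largest(values, k):
    """k-th largest value, 1-based, with duplicates counted separately."""
    return sorted(values, reverse=True)[k - 1]

result = {}
for prod_id, values in prices.items():
    biggest = kth_largest(values, 1)
    ties = values.count(biggest)                 # COUNTIFS(A:A, C2, B:B, D2)
    second = kth_largest(values, 2 + ties - 1)   # skip the duplicates
    result[prod_id] = (biggest, second)

print(result)
# {2606639: (50, 40), 4633523: (45, 35), 3911436: (25, 20)}
```

If every price of a product is identical there is no second distinct value, and in this sketch the index runs past the end of the list.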

Returning next match to an equal value in a column

I often need to search through columns to find the match for a value and then return the corresponding value.
My issue is that INDEX and MATCH always return the first match in the column.
Example: I have 7 car dealers and these are last month's sales. Oslo and Berlin sold the same amount, and INDEX(D:E,MATCH(B1,E:E,0),1) in column C returns the first hit from column D for both.
A B C D E
rank Sales Dealer
1 409 London | Tokyo 272
2 272 Tokyo | London 409
3 257 Hawaii | oslo 248
4 255 Stockholm | numbai 240
5 248 Oslo | Berlin 248
6 248 Oslo | hawaii 257
7 240 Numbai | Stockholm 255
At the moment my best solution is to first find, with MATCH(B1,E:E,0), the row in E for each value in B and put the result in a new column (column F). Then I add another formula in the next column, which is what I currently have to do:
=IF(F2=F1;MATCH(F2;INDIRECT("F"&(1+F1)):$F$7;0))+F2
Is there a better approach to this?
In B2 use the following standard formula,
=IFERROR(LARGE(E$2:E$8, ROW(1:1)), "")
Fill down as necessary.
In C2 use the following standard formula,
=INDEX(D$2:D$8, AGGREGATE(15, 6, ROW($1:$7)/(E$2:E$8=B2), COUNTIF(B$2:B2, B2)))
Fill down as necessary.
        
[Optional] - Repair the ranking in column A.
In A2 use the following formula,
=SUMPRODUCT((B$2:B$8>=B2)/(COUNTIFS(B$2:B$8, B$2:B$8&"")))
Fill down as necessary.
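To see why the COUNTIF term resolves the tie, here is a plain Python sketch of the three columns, using the dealer data from the question (names spelled as in columns D and E). It mirrors the logic of the formulas rather than their exact evaluation:

```python
dealers = ["Tokyo", "London", "oslo", "numbai", "Berlin", "hawaii", "Stockholm"]
sales = [272, 409, 248, 240, 248, 257, 255]

# Column B: LARGE filled down gives the sales sorted in descending order.
sorted_sales = sorted(sales, reverse=True)

# Column C: take the k-th row whose sales equal the value, where k is how
# often that value has appeared so far in column B (the COUNTIF part).
matched = []
for i, value in enumerate(sorted_sales):
    k = sorted_sales[: i + 1].count(value)
    positions = [p for p, s in enumerate(sales) if s == value]
    matched.append(dealers[positions[k - 1]])

# Column A: dense rank, so equal sales share a rank (the SUMPRODUCT part).
distinct_desc = sorted(set(sales), reverse=True)
ranks = [distinct_desc.index(v) + 1 for v in sorted_sales]

print(matched)  # ['London', 'Tokyo', 'hawaii', 'Stockholm', 'oslo', 'Berlin', 'numbai']
print(ranks)    # [1, 2, 3, 4, 5, 5, 6]
```

The first 248 gets k = 1 and returns oslo; the second 248 gets k = 2 and returns Berlin instead of repeating the first hit.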
        
