How can i remove repeated values in a single column? - python-3.x

i have a dataframe like:
shops prod_id atv_y1
company_b A 56.3
company_b B 4.3
company_b C 136.3
company_b D 89.3
company_c A 7.3
company_c B 64.0
company_c A 34.7
For the purpose of plotting i would like to remove the repeated company_b/company_c values so that it takes only the first time it is referenced like below:
shops prod_id atv_y1
company_b A 56.3
B 4.3
C 136.3
D 89.3
company_c A 7.3
B 64.0
A 34.7
how can i do this in pandas ?

You might be able to manage this within plots itself by the way. But if you really want the df transformed like you asked, then you could try something like below.
It may not be the best way, but does the job.
shops = df.groupby('shops').first().reset_index()['shops']
for i in shops:
l = np.where(df['shops'] == i)[0]
df.loc[l[1]:l[len(l)-1],'shops'] = ''
print(df)
prints
shops prod_id atv_y1
0 company_b A 56.3
1 B 4.3
2 C 136.3
3 D 89.3
4 company_c A 7.3
5 B 64.0
6 A 34.7

Related

how to get quartiles and classify a value according to this quartile range

I have this df:
d = pd.DataFrame({'Name':['Andres','Lars','Paul','Mike'],
'target':['A','A','B','C'],
'number':[10,12.3,11,6]})
And I want classify each number in a quartile. I am doing this:
(d.groupby(['Name','target','number'])['number']
.quantile([0.25,0.5,0.75,1]).unstack()
.reset_index()
.rename(columns={0.25:"1Q",0.5:"2Q",0.75:"3Q",1:"4Q"})
)
But as you can see, the 4 quartiles are all equal because the code above is calculating per row so if there's one 1 number per row all quartiles are equal.
If a run instead:
d['number'].quantile([0.25,0.5,0.75,1])
Then I have the 4 quartiles I am looking for:
0.25 9.000
0.50 10.500
0.75 11.325
1.00 12.300
What I need as output(showing only first 2 rows)
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.30 1
1 Lars A 12.3 9.0 10.5 11.325 12.30 4
you can see all quartiles has the the values considering tall values in the number column. Besides that, now we have a column names Rank that classify the number according to it's quartile. ex. In the first row 10 is within the 1st quartile.
Here's one way that build on the quantiles you've created by making it a DataFrame and joining it to d. Also assigns "Rank" column using rank method:
out = (d.join(d['number'].quantile([0.25,0.5,0.75,1])
.set_axis([f'{i}Q' for i in range(1,5)], axis=0)
.to_frame().T
.pipe(lambda x: x.loc[x.index.repeat(len(d))])
.reset_index(drop=True))
.assign(Rank=d['number'].rank(method='dense')))
Output:
Name target number 1Q 2Q 3Q 4Q Rank
0 Andres A 10.0 9.0 10.5 11.325 12.3 2.0
1 Lars A 12.3 9.0 10.5 11.325 12.3 4.0
2 Paul B 11.0 9.0 10.5 11.325 12.3 3.0
3 Mike C 6.0 9.0 10.5 11.325 12.3 1.0

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc). Not the whole row, just the cell. And would like to do it for each of the 'red', 'yellow', 'green' columns.
Is there way to do this without breaking the dataframe into a whole bunch of sub dataframes with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If need replace outliers by missing values use GroupBy.transform with DataFrame.quantile, then compare for lower and greater values by DataFrame.lt and DataFrame.gt, chain masks by | for bitwise OR and set missing values in DataFrame.mask, default replacement, so not specified:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]

Rearrange columns in DataFrame

Having a DataFrame structured as follows:
country A B C D
0 Albany 5.2 4.7 253.75 4
1 China 7.5 3.4 280.72 3
2 Portugal 4.6 7.5 320.00 6
3 France 8.4 3.6 144.00 3
4 Greece 2.1 10.0 331.00 6
I wanted to get something like this:
cost A B
country C D C D
Albany 2.05 4 1.85 4
China 2.67 3 1.21 3
Portugal 1.44 6 2.34 6
France 5.83 3 2.50 3
Greece 0.63 6 3.02 6
I mean, get the columns A and B as headers over C and D, keeping D the same with its constant value, and calculating in C the percentage resulting of the header over C. Example for Albany:
value C in A: (5.2/253.75)*100 = 2.05
value C in B: (4.7/253.75)*100 = 1.85
Is there any way to do it?
Thanks!
You can divide multiple columns, here A and B by DataFrame.div, then DataFrame.reindex by MultiIndex created by MultiIndex.from_product and last set D columns by original with MultiIndex slicers:
cols = ['A','B']
mux = pd.MultiIndex.from_product([cols, ['C', 'D']])
df1 = df[cols].div(df['C'], axis=0).mul(100).reindex(mux, axis=1, level=0)
idx = pd.IndexSlice
df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].to_numpy()
#pandas bellow 0.24
#df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].values
print (df1)
A B
C D C D
0 2.049261 4 1.852217 4
1 2.671701 3 1.211171 3
2 1.437500 6 2.343750 6
3 5.833333 3 2.500000 3
4 0.634441 6 3.021148 6

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
# number of missing values
count_missing = data_final.isnull().sum().values
# total records
total = data_final.shape[0]
# percentage of missing
ratio_missing = count_missing/total
# return a dataframe to show: feature name, # of missing and % of missing
return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing},
index=data.columns.values)
find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
In the output I would just want to see :
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum
to count by columns. We can also use DataFrame.isnull instead DataFrame.isna.
new_df = (df.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(df))
.loc[df.isna().any()] )
print(new_df)
We can also use pd.concat instead DataFrame.assign
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
count.div(len(df))
.rename('missing_ratio')],axis = 1)
.loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing and total count to two variables do some basic math and assign back to a df.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)
missing_df = pd.DataFrame({'missing_vals' : a,
'missing_ratio' : b})
print(missing_df)
missing_vals ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
you can filter out columns that don't have any missing vals
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s,s/len(df)],1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667

Split rows with same name into multiple columns

My excel sheet looks like this:
Name C.p Value
a 1 1.75
b 1 2.35
c 1 1.32
d 1 2.45
a 2 2.7
b 2 1.85
c 2 1.9
d 2 2.6
a 3 3.2
b 3 4.5
c 3 9.2
d 3 5.01
Like this 4~5 names 50 ~ 60 check points and values at those check points
I want the excel to look like
C.p a b c d
1 1.75 2.35 1.32 2.45
2 2.7 1.85 1.9 2.6
3 3.2 4.5 9.2 5.01
Here C.p is check point. it is not always 1 2 3 .. it changes values form sheet to sheet
Could Some one help with the code
thank you
If that is the only thing you want to do,You can do it quickly by pivot table in excel itself. You will get some extra columns like Grand Total Which you can remove. As far as effort for removing the unwanted columns to the code it will be quite less.
see the below pic.

Resources