reproduce/break rows based on field value - python-3.x

I have my dataframe as:
id date value
1 2016 3
2 2016 1
1 2018 1
1 2016 1.1
Now I want to reproduce rows for some weird reason with logic as:
if value > 1
reproduce row value times - 1
with value = 1
or fraction left for last unit
for better understanding consider only 1st row of dataframe i.e. :
id date value
1 2016 3
which I have broken down into 3 rows as:
id date value
1 2016 1
1 2016 1
1 2016 1
but consider last row i.e.:
id date value
1 2016 1.1
Which is broken as:
id date value
1 2016 1
1 2016 0.1
i.e. if any fraction is there, this is broken separately, else in one unit
and then group by id and sort by date is obviously easy.
i.e. new dataframe will look like:
id date value
1 2016 1
1 2016 1
1 2016 1
1 2016 1
1 2016 0.1
1 2018 1
2 2016 1
The main problem is reproducing rows.
UPDATED
A sample dataframe code:
df = pd.DataFrame([[1,2018,5.1],[2,2018,2],[1,2016,1]], columns=["id", "date", "value"])

generator
def f(df):
for i, *t, v in df.itertuples():
while v > 0:
yield t + [min(v, 1)]
v -= 1
pd.DataFrame([*f(df)], columns=df.columns)
id date value
0 1 2018 1.0
1 1 2018 1.0
2 1 2018 1.0
3 1 2018 1.0
4 1 2018 1.0
5 1 2018 0.1
6 2 2018 1.0
7 2 2018 1.0
8 1 2016 1.0

Using // and % with pandas repeat
s1=df.value//1
s2=df.value%1
s=pd.concat([s1.loc[s1.index.repeat(s1.astype(int))],s2[s2!=0]]).sort_index()
s.loc[s>=1]=1
newdf=df.reindex(df.index.repeat((s1+(s2).ne(0)).astype(int)))
newdf['value']=s.values
newdf
Out[236]:
id date value
0 1 2016 1.0
0 1 2016 1.0
0 1 2016 1.0
1 2 2016 1.0
2 1 2018 1.0
3 1 2016 1.0
3 1 2016 0.1

Related

Groupby one column and forward replace values in multiple columns based on condition using Pandas

Given a dataframe as follows:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 xd dt 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh pd 2020 5
Say there are typo errors in columns city and district for rows in the year columns which is 2020, so I want groupby id and ffill those columns with previous values.
How could I do that in Pandas? Thanks a lot.
The desired output will like this:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5
The following code works, but I'm not sure if it's the best solutions.
If you have others, welcome to share. Thanks.
df.loc[df['year'].isin(['2020']), ['city', 'district']] = np.nan
df[['city', 'district']] = df[['city', 'district']].fillna(df.groupby('id')[['city', 'district']].ffill())
Out:
id city district year price
0 1 bj cy 2018 8
1 1 bj cy 2019 6
2 1 bj cy 2020 7
3 2 sh hp 2018 4
4 2 sh hp 2019 3
5 2 sh hp 2020 5

Read excel and reformat the multi-index headers in Pandas

Given a excel file with format as follows:
Reading with pd.read_clipboard, I get:
year 2018 Unnamed: 2 2019 Unnamed: 4
0 city quantity price quantity price
1 bj 10 2 4 7
2 sh 6 8 3 4
Just wondering if it's possible to convert to the following format with Pandas:
year city quantity price
0 2018 bj 10 2
1 2019 bj 4 7
2 2018 sh 6 8
3 2019 sh 3 4
I think here is best convert excel file to DataFrame with MultiIndex in columns and first column as index:
df = pd.read_excel(file, header=[0,1], index_col=[0])
print (df)
year 2018 2019
city quantity price quantity price
bj 10 2 4 7
sh 6 8 3 4
print (df.columns)
MultiIndex([('2018', 'quantity'),
('2018', 'price'),
('2019', 'quantity'),
('2019', 'price')],
names=['year', 'city'])
Then reshape by DataFrame.stack, change order of levels by DataFrame.swaplevel, set index and columns names by DataFrame.rename_axis and last convert index to columns, and if encessary convert year to integers:
df1 = (df.stack(0)
.swaplevel(0,1)
.rename_axis(index=['year','city'], columns=None)
.reset_index()
.assign(year=lambda x: x['year'].astype(int)))
print (df1)
year city price quantity
0 2018 bj 2 10
1 2019 bj 7 4
2 2018 sh 8 6
3 2019 sh 4 3

binning with months column

i have data frame which contains fields casenumber , count and credated date .here created date is months which are in numerical i want to make dataframe as arrenge the ranges to the count acoording to createddate column
Here i used below code but i didnot match my requirement.i have data frame which contains fields casenumber , count and credated date .here created date is months which are in numerical i want to make dataframe as arrenge the ranges to the count acoording to createddate column
i have data frame as below
casenumber count CREATEDDATE
3820516 1 jan
3820547 1 jan
3820554 2 feb
3820562 1 feb
3820584 1 march
4226616 1 april
4226618 2 may
4226621 2 may
4226655 1 june
4226663 1 june
Here i used below code but i didnot match my requirement.i have data frame which contains fields casenumber , count and credated date .here created date is months which are in numerical i want to make dataframe as arrenge the ranges to the count acoording to createddate column
import pandas as pd
import numpy as np
df = pd.read_excel(r"")
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = df.groupby(pd.cut(df['CREATEDDATE'],bins,labels=names))['casenumber'].size().reset_index(name='No_of_times_statuschanged')
CREATEDDATE No_of_times_statuschanged
0 0-1 2092
1 1-4 9062
2 4-8 12578
3 8-15 3858
4 15+ 0
I got the above data as out put but my expected should be range for month on month based on the cases per month .
expected output should be like
CREATEDDATE jan feb march april may june
0-1 1 2 3 4 5 6
1-4 3 0 6 7 8 9
4-8 4 6 3 0 9 2
8-15 0 3 4 5 8 9
I got the above data as out put but my expected should be range for month on month based on the cases per month .
expected output should be like
Use crosstab with change CREATEDDATE to count for pd.cut and change order of column by subset by list of columns names:
#add another months if necessary
months = ["jan", "feb", "march", "april", "may", "june"]
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names), df['CREATEDDATE'])[months]
print (df1)
CREATEDDATE jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
Another idea is use ordered categoricals:
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names),
pd.Categorical(df['CREATEDDATE'], ordered=True, categories=months))
print (df1)
col_0 jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0

how to use group by in filter condition in pandas

I have below data stored in a dataframe and I want to remove the rows that has id equal to finalid and for the same id, i have multiple rows.
example:
df_target
id finalid month year count_ph count_sh
1 1 1 2012 12 20
1 2 1 2012 6 18
1 32 1 2012 6 2
2 2 1 2012 2 6
2 23 1 2012 2 6
3 3 1 2012 2 2
output
id finalid month year count_ph count_sh
1 2 1 2012 6 18
1 32 1 2012 6 2
2 23 1 2012 2 6
3 3 1 2012 2 2
functionality is something like:
remove records and get the final dataframe
(df_target.groupby(['id','month','year']).size() > 1) & (df_target['id'] == df_target['finalid'])
I think need transform for same Series as origonal DataFrame and ~ for invert final boolean mask:
df = df_target[~((df_target.groupby(['id','month','year'])['id'].transform('size') > 1) &
(df_target['id'] == df_target['finalid']))]
Alternative solution:
df = df_target[((df_target.groupby(['id','month','year'])['id'].transform('size') <= 1) |
(df_target['id'] != df_target['finalid']))]
print (df)
id finalid month year count_ph count_sh
1 1 2 1 2012 6 18
2 1 32 1 2012 6 2
4 2 23 1 2012 2 6
5 3 3 1 2012 2 2

python/pandas - converting column headers into index

The data I have to deal with treats hourly data as the columns. I want to convert this as an index. Sample looks like this:
year month day 1 2 3 4 5 ... 24
2015 1 1 a b ................... c
2015 1 2 d e ................... f
2015 1 3 g h ................... i
I want to make the output file something like this:
year month day hour value
2015 1 1 1 a
2015 1 1 2 b
. . . . .
2015 1 1 24 c
2015 1 2 1 d
. . . . .
Currently using python 3.4 with the pandas module
Use set_index with stack:
print (df.set_index(['year','month','day'])
.stack()
.reset_index(name='value')
.rename(columns={'level_3':'hour'}))
year month day hour value
0 2015 1 1 1 a
1 2015 1 1 2 b
2 2015 1 1 24 c
3 2015 1 2 1 d
4 2015 1 2 2 e
5 2015 1 2 24 f
6 2015 1 3 1 g
7 2015 1 3 2 h
8 2015 1 3 24 i
Another solution with melt and sort_values:
print (pd.melt(df, id_vars=['year','month','day'], var_name='hour')
.sort_values(['year', 'month', 'day','hour']))
year month day hour value
0 2015 1 1 1 a
3 2015 1 1 2 b
6 2015 1 1 24 c
1 2015 1 2 1 d
4 2015 1 2 2 e
7 2015 1 2 24 f
2 2015 1 3 1 g
5 2015 1 3 2 h
8 2015 1 3 24 i

Resources