How to split rows in pandas with special condition of date? - python-3.x

I have a DataFrame like:
  Code    Date  sales
     1  2/2013     10
     1  3/2013     11
     2  3/2013     12
     2  4/2013     14
   ...
I want to convert it into a DataFrame with a timeline, code, and sales of each type of item:
    Date  Code  Sales1  Code  Sales2
  2/2013     1      10    NA      NA
  3/2013     1      11     2      12
  4/2013    NA      NA     2      14
  ...
or, in a simpler layout:
    Date  Code  Sales1    Date  Code  Sales2  ...
  2/2013     1      10  3/2013     2      12
  3/2013     1      11  4/2013     2      14
or, simplest of all, by splitting it into many small DataFrames.

IIUC, use concat with the groupby result:
df.index = df.groupby('Code').cumcount()  # create the key for concat
pd.concat([x for _, x in df.groupby('Code')], axis=1)
Out[392]:
   Code    Date  sales  Code    Date  sales
0     1  2/2013     10     2  3/2013     12
1     1  3/2013     11     2  4/2013     14
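For reference, a minimal self-contained version of this approach, with the sample data reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'Code': [1, 1, 2, 2],
                   'Date': ['2/2013', '3/2013', '3/2013', '4/2013'],
                   'sales': [10, 11, 12, 14]})

# number the rows within each Code so the per-Code blocks align on the index
df.index = df.groupby('Code').cumcount()

# place the blocks side by side; shorter blocks get padded with NaN
wide = pd.concat([g for _, g in df.groupby('Code')], axis=1)
print(wide)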

Actually, splitting the data that way was a bad idea; I rethought it and solved the problem with pivot_table:
pd.pivot_table(df, values=['sales'], index=['Code'], columns=['Date'])
and the result should look like:
       sales
Date  2/2013  3/2013  4/2013  ...
Code
1       10.0    11.0     NaN
2        NaN    12.0    14.0
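A runnable sketch of the same idea, reusing the sample df constructed above (aggfunc='sum' is an assumption; any aggregator gives the same numbers here since each Code/Date pair occurs once, and the NaN padding upcasts the counts to floats):
wide = pd.pivot_table(df, values='sales', index='Code',
                      columns='Date', aggfunc='sum')
print(wide)
# Date    2/2013  3/2013  4/2013
# Code
# 1         10.0    11.0     NaN
# 2          NaN    12.0    14.0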

Related

How to find the index again after pivoting a dataframe?

I created a dataframe from a csv file containing data on the number of deaths by year (running from 1946 to 2021) and month (within year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
The first rows (out of 902...) of the output are:
dataD
   Year  Month  Deaths
0  2021      2   55500
1  2021      1   65400
2  2020     12   62800
3  2020     11   64700
4  2020     10   56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe so as to have only one row per year, with the months as columns, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is: what do I have to change in the pivoting code above so that I get back an index numbered 0, 1, 2, etc. when I output the pivoted file? I understand I need to specify index=*** for the pivot call to run, but afterwards I would like to recover a "usual" index, exactly like in my original dataD.
Is that possible?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that the "Month" here might look like the index name but is actually df.columns.name. You can unset it if preferred:
df.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
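Putting the two steps together as one chain; rename_axis(columns=None) should be an equivalent way to clear the leftover columns name in recent pandas versions:
dataDW = (dataD.pivot(index='Year', columns='Month', values='Deaths')
               .reset_index()
               .rename_axis(columns=None))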

How to replace a column in a dataframe with the result of a function

Currently I have a dataframe with a column named age, which holds the person's age in days. I would like to convert this value to years; how could I achieve that?
At the moment, if one runs this command
df['age']
the result is something like
0     18393
1     20228
2     18857
3     17623
4     17474
5     21914
6     22113
7     22584
8     17668
9     19834
10    22530
11    18815
12    14791
13    19809
I would like to change the value in each row to the current value / 365 (which would convert days to years).
As suggested:
>>> df['age'] / 365
0    50.391781
1    55.419178
2    51.663014
3    48.282192
4    47.873973
Name: age, dtype: float64
Or, if you need whole years, use floor division:
>>> df['age'] // 365
0    50
1    55
2    51
3    48
4    47
Name: age, dtype: int64
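To actually replace the column, as the title asks, assign the result back:
df['age'] = df['age'] // 365  # or / 365 to keep the fractional part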

Iterating through a data frame and grouping values in a range

I have a pandas data frame of weekly data like this:
Week  Val
   1   11
   2   11
   3   11
   4   11
   5    9
   6    9
   7    9
   8    9
I would like to create an output table like this:
Week 1  Week 2  Val
     1       4   11
     5       8    9
Apologies, I am quite new to python and its iterative tools, and I am not sure how to solve this problem.
I tried comparing adjacent rows, but I do not know how to go further:
df['Match'] = df['Val'].eq(df['Val'].shift(-1))
You want to group by the consecutive blocks of Val, so you can use cumsum on the points where the value changes to number the blocks:
blocks = df['Val'].ne(df['Val'].shift(1)).cumsum()
(df.groupby(blocks, as_index=False)
   .agg(Week1=('Week', 'min'), Week2=('Week', 'max'), Val=('Val', 'first'))
)
Or you can chain:
(df.groupby(df['Val'].ne(df['Val'].shift(1)).cumsum(), as_index=False)
   .agg(Week1=('Week', 'min'), Week2=('Week', 'max'), Val=('Val', 'first'))
)
Output:
   Week1  Week2  Val
0      1      4   11
1      5      8    9
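For reference, the intermediate grouping key on the sample data looks like this; each run of equal Val values gets its own block number:
blocks = df['Val'].ne(df['Val'].shift(1)).cumsum()
print(blocks.tolist())  # [1, 1, 1, 1, 2, 2, 2, 2]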

How can I add previous column values to get a new value in Excel?

I am working on a graph and need data in the format below. I have data in COL A and need to calculate the COL B values shown in the picture below.
What is the formula for obtaining this in Excel?
You can do this with cumsum and shift:
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'COL A': np.arange(11)})
df['COL B'] = df['COL A'].shift(fill_value=0).cumsum()
Output:
    COL A  COL B
0       0      0
1       1      0
2       2      1
3       3      3
4       4      6
5       5     10
6       6     15
7       7     21
8       8     28
9       9     36
10     10     45
Or use a simple Excel technique: the formula =(A3*A2)/2 for COL B, filled down (this assumes COL A holds consecutive integers, as in the sample).
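For what it's worth, the Excel formula and the pandas version agree because the running sum of 0, 1, ..., n-1 is the triangular number n(n-1)/2. A quick sketch checking this, assuming COL A holds consecutive integers starting at 0:
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL A': np.arange(11)})
shifted_cumsum = df['COL A'].shift(fill_value=0).cumsum()
closed_form = df['COL A'] * (df['COL A'] - 1) // 2  # n * (n - 1) / 2
assert shifted_cumsum.equals(closed_form)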

pandas combine a data frame with another groupby dataframe

I have two data frames with the structure given below.
>>> df1
   IID   NAME   TEXT
0   10    One  AA,AB
1   11    Two  AB,AC
2   12  Three     AB
3   13   Four     AC
>>> df2
   IID TEXT
0   10   aa
1   10   ab
2   11  abc
3   11  a,c
4   11   ab
5   12   AA
6   13   AC
7   13   ad
8   13  abc
I want to combine them such that the new data frame is a copy of df1, where the TEXT values appearing in df2 for the corresponding IID are appended to the TEXT field of df1, with duplicates removed (case-insensitive duplicate check).
My expected output is
>>> df1
   IID   NAME           TEXT
0   10    One          AA,AB
1   11    Two  AB,AC,ABC,A,C
2   12  Three          AB,AA
3   13   Four      AC,AD,ABC
I tried groupby on df2, but how can I join the grouped object back to a dataframe?
I believe you need concat with groupby.agg to create the skeleton (with duplicates), then Series.explode with groupby + unique for de-duplicating:
out = (pd.concat((df1, df2), sort=False)
         .groupby('IID')
         .agg({'NAME': 'first', 'TEXT': ','.join})
         .reset_index())
out['TEXT'] = (out['TEXT'].str.upper().str.split(',').explode()
               .groupby(level=0).unique().str.join(','))
print(out)
   IID   NAME           TEXT
0   10    One          AA,AB
1   11    Two  AB,AC,ABC,A,C
2   12  Three          AB,AA
3   13   Four      AC,AD,ABC
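If you want to reproduce this, here is a minimal reconstruction of the inputs from the printouts above (note that Series.explode requires pandas >= 0.25):
import pandas as pd

df1 = pd.DataFrame({'IID': [10, 11, 12, 13],
                    'NAME': ['One', 'Two', 'Three', 'Four'],
                    'TEXT': ['AA,AB', 'AB,AC', 'AB', 'AC']})
df2 = pd.DataFrame({'IID': [10, 10, 11, 11, 11, 12, 13, 13, 13],
                    'TEXT': ['aa', 'ab', 'abc', 'a,c', 'ab', 'AA', 'AC', 'ad', 'abc']})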
I took the reverse approach: first combine the rows having the same IID into a list, then merge, and then combine the two TEXT columns into a single column.
df1:
   IID   NAME   TEXT
0   10    One  AA,AB
1   11    Two  AB,AC
2   12  Three     AB
3   13   Four     AC
df2:
   IID TEXT
0   10   aa
1   10   ab
2   11  abc
3   11  a,c
4   11   ab
5   12   AA
6   13   AC
7   13   ad
8   13  abc
df3 = pd.DataFrame(df2.groupby("IID")['TEXT'].apply(list)
                      .transform(lambda x: ','.join(x).upper())
                      .reset_index())
df3:
   IID        TEXT
0   10       AA,AB
1   11  ABC,A,C,AB
2   12          AA
3   13   AC,AD,ABC
df4 = pd.merge(df1,df3,on='IID')
df4:
   IID   NAME TEXT_x      TEXT_y
0   10    One  AA,AB       AA,AB
1   11    Two  AB,AC  ABC,A,C,AB
2   12  Three     AB          AA
3   13   Four     AC   AC,AD,ABC
df4['TEXT'] = df4[['TEXT_x', 'TEXT_y']].apply(
    lambda x: ','.join(pd.unique(','.join(x).split(','))),
    axis=1
)
df4 = df4.drop(['TEXT_x', 'TEXT_y'], axis=1)  # drop returns a new frame, so assign it back
Or, working from the merged df4 before the columns are dropped:
df5 = df1.assign(TEXT=df4.apply(
    lambda x: ','.join(pd.unique(','.join(x[['TEXT_x', 'TEXT_y']]).split(','))),
    axis=1))
df4/df5:
   IID   NAME           TEXT
0   10    One          AA,AB
1   11    Two  AB,AC,ABC,A,C
2   12  Three          AB,AA
3   13   Four      AC,AD,ABC
