I have a data set like this:
S.No.,Year of birth,year of death
1, 1, 5
2, 3, 6
3, 2, -
4, 5, 7
I need to calculate the population alive in each year, for example:
year population
1 1
2 2
3 3
4 3
5 4
6 3
7 2
8 1
How can I solve this in pandas? I am not very experienced with pandas, so any help would be appreciated.
First, it is necessary to choose a maximum year to use where year of death is missing; 8 is used in the solutions below. Then convert the year of death values to numeric and replace missing values with that year. The first solution repeats each row for the difference between the death and birth columns using Index.repeat with GroupBy.cumcount, and counts the repeated years with Series.value_counts.
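The code below assumes df is built from the sample data as follows (this construction is my assumption, not part of the question; the missing death year is stored as '-'):
import pandas as pd

# assumed construction of the question's sample data
df = pd.DataFrame({'S.No.': [1, 2, 3, 4],
                   'Year of birth': [1, 3, 2, 5],
                   'year of death': [5, 6, '-', 7]})
With that in place, the first solution is: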
# if working with real calendar years, use the current year instead:
# today_year = pd.to_datetime('now').year
today_year = 8
# convert death years to numeric; the missing value ('-') becomes today_year
df['year of death'] = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
# repeat each row once per year the person was alive (death - birth + 1)
df = df.loc[df.index.repeat(df['year of death'].add(1).sub(df['Year of birth']).astype(int))]
# turn the repeated birth year into a running year per person
df['Year of birth'] += df.groupby(level=0).cumcount()
df1 = (df['Year of birth'].value_counts()
.sort_index()
.rename_axis('year')
.reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Another solution uses a list comprehension with range to repeat the years:
# if working with real calendar years, use the current year instead:
# today_year = pd.to_datetime('now').year
today_year = 8
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year).astype(int)
L = [x for b, e in zip(df['Year of birth'], s) for x in range(b, e + 1)]
df1 = (pd.Series(L).value_counts()
.sort_index()
.rename_axis('year')
.reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Similar to the previous solution, only Counter is used to build the dictionary for the final DataFrame:
from collections import Counter
# if working with real calendar years, use the current year instead:
# today_year = pd.to_datetime('now').year
today_year = 8
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year).astype(int)
d = Counter([x for b, e in zip(df['Year of birth'], s) for x in range(b, e + 1)])
print (d)
Counter({5: 4, 3: 3, 4: 3, 6: 3, 2: 2, 7: 2, 1: 1, 8: 1})
df1 = pd.DataFrame({'year': list(d.keys()),
                    'population': list(d.values())})
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
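If you want the years sorted explicitly (the dict insertion order only happens to be ascending here), a small variant of the last step, reusing the same d, could be:
# build the frame straight from the Counter and sort the years explicitly
df1 = (pd.Series(d, name='population')
         .sort_index()
         .rename_axis('year')
         .reset_index())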
I want to flatten my dataframe column by column into a single sequence, keeping only the last value of the last column.
Here is my dataframe:
df=pd.DataFrame({'flag_1':[1,2,3,1,2,500],'dd':[1,1,1,7,7,8],'x':[1,1,1,7,7,8]})
print(df)
flag_1 dd x
0 1 1 1
1 2 1 1
2 3 1 1
3 1 7 7
4 2 7 7
5 500 8 8
df_out:
1 2 3 1 2 500 1 1 1 7 7 8 8
Assuming you want a list as output, you can mask the initial values of the last column and stack:
import numpy as np
out = (df
       .assign(**{df.columns[-1]: np.r_[[pd.NA]*(len(df)-1), [df.iloc[-1, -1]]]})
       .T.stack().to_list()
       )
Output:
[1, 2, 3, 1, 2, 500, 1, 1, 1, 7, 7, 8, 8]
For a wide dataframe with a single row, use .to_frame().T in place of to_list() (here with a MultiIndex):
  flag_1                  dd                  x
       0  1  2  3  4    5  0  1  2  3  4  5  5
0      1  2  3  1  2  500  1  1  1  7  7  8  8
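A minimal sketch of that wide variant, assuming the same df and imports as above (only the tail changes from .to_list() to .to_frame().T):
out_wide = (df
            .assign(**{df.columns[-1]: np.r_[[pd.NA]*(len(df)-1), [df.iloc[-1, -1]]]})
            .T.stack().to_frame().T
            )
print(out_wide)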
Given this Dataframe:
df2 = pd.DataFrame([[3,3,3,3,3,3,5,5,5,5],[2,2,2,2,8,8,8,8,6,6]], columns=list('ABCDEFGHIJ'))
A B C D E F G H I J
0 3 3 3 3 3 3 5 5 5 5
1 2 2 2 2 8 8 8 8 6 6
I created 2 new columns which give, for each row, the max_freq and the max_freq_value:
df2["max_freq_val"] = df2.apply(lambda x: x.mode().agg(list), axis=1)
df2["max_freq"] = df2.loc[:, df2.columns != "max_freq_val"].apply(lambda x: x.value_counts().max(), axis=1)
A B C D E F G H I J max_freq_val max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
EDIT: I've edited my code, inspired by the answer given by @rhug123.
Thanks to all of you for your answers.
Try this; it uses mode():
df2.assign(max_freq=pd.Series(df2.mode(axis=1).stack().groupby(level=0).agg(list)),
           max_freq_value=df2.eq(df2.mode(axis=1)[0].squeeze(), axis=0).sum(axis=1))
or
df2.assign(freq=df2.eq((s := df2.mode(axis=1).stack().groupby(level=0).agg(list)).str[0], axis=0).sum(axis=1),
           val=s)
We can try stack, then compute the per-row frequencies and use agg to put multiple maximum values into a list:
s = df2.stack().groupby(level=0).value_counts()
# note: Series.max(level=0) is deprecated in newer pandas (use a groupby max instead)
s = (s[s.eq(s.max(level=0), level=0)]
       .reset_index(level=1)
       .groupby(level=0)
       .agg(val=('level_1', list), fre=(0, 'first')))
df2 = df2.join(s)
df2
Out[156]:
A B C D E F G H I J val fre
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Perhaps you could use this function:
import numpy as np

def give_back_maximums(a=[2, 2, 2, 2, 8, 8, 8, 8, 6, 6]):
    # count each unique value and return every value that reaches the maximum count
    values, counts = np.unique(a, return_counts=True)
    return values[counts >= counts.max()].tolist()
The two new columns should be computed from the original data columns only; otherwise the second apply would also see the list column added by the first:
cols = list(df2.columns)  # the original data columns
df2["max_freq_value"] = df2[cols].apply(lambda x: give_back_maximums(x), axis=1)
df2["max_freq"] = df2[cols].apply(lambda x: x.value_counts().max(), axis=1)
print(df2)
A B C D E F G H I J max_freq_value max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Hope it helps : )
Given two data frames: one contains a column of repeated values ('a' in this case), and the other records what each of those values corresponds to (in this example, some 'd' values). How do I efficiently add a new column to the first data frame, whose values are derived from an existing column according to the rule recorded in the other data frame? Here is example code that works, but really slowly:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i, 'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is the original d1:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is a resultant d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map, which builds only the values you need.
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
I need to create an incremental series for a given column of a dataframe in Python.
Any help is much appreciated.
Suppose I have this dataframe column:
df['quadrant']
Out[6]:
0 4
1 4
2 4
3 3
4 3
5 3
6 2
7 2
8 2
9 1
10 1
11 1
I want to create a new column such that
index quadrant new value
0 4 1
1 4 5
2 4 9
3 3 2
4 3 6
5 3 10
6 2 3
7 2 7
8 2 11
9 1 4
10 1 8
11 1 12
Using NumPy, you can create the array as:
import numpy as np
def value(q, k=1):
    diff_quadrant = np.diff(q)
    j = 0
    ramp = []
    # build a sawtooth that restarts at 0 every time the quadrant changes
    for i in np.where(diff_quadrant != 0)[0]:
        ramp.extend(list(range(i - j + 1)))
        j = i + 1
    ramp.extend(list(range(len(q) - j)))
    ramp = np.array(ramp) * k             # sawtooth-shaped array, scaled by k
    a = np.ones([len(q)], dtype=int) * 5  # 5 = max quadrant + 1
    return a - q + ramp
quadrant = np.array([3, 3, 3, 3, 4, 4, 4, 2, 2, 1, 1, 1])
b = value(quadrant, 4)
# [ 2 6 10 14 1 5 9 3 7 4 8 12]
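If the data lives in a pandas DataFrame as in the question, a pandas-only sketch is also possible. It assumes the pattern visible in the expected output: quadrant q starts at max+1-q and grows in steps of the number of distinct quadrants:
import pandas as pd

df = pd.DataFrame({'quadrant': [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]})

n = df['quadrant'].nunique()                        # 4 distinct quadrants
start = df['quadrant'].max() + 1 - df['quadrant']   # 4 -> 1, 3 -> 2, 2 -> 3, 1 -> 4
df['new value'] = df.groupby('quadrant').cumcount() * n + start
# df['new value']: 1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12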
I want to find the argmax of the values in a matrix by column, e.g.:
1 2 3
4 5 6   ->   2 3 3
3 7 8
I feel like I should just be able to map an argmax/posmax function over the columns, but I don't see a particularly intuitive way to do this in Octave.
Read the max function documentation.
[max_values indices] = max(input);
Example:
input =
1 2 3
4 5 6
3 7 8
[max_values indices] = max(input)
max_values =
4 7 8
indices =
2 3 3
In Octave, if
A =
1 3 2
6 5 4
7 9 8
1) For each column, the max value and its corresponding index can be found by:
>> [max_values,indices] =max(A,[],1)
max_values =
7 9 8
indices =
3 3 3
2) For each row, the max value and its corresponding index can be found by:
>> [max_values,indices] =max(A,[],2)
max_values =
3
6
9
indices =
2
1
2
Similarly, for the minimum value:
>> [min_values,indices] =min(A,[],1)
min_values =
1 3 2
indices =
1 1 1
>> [min_values,indices] =min(A,[],2)
min_values =
1
4
7
indices =
1
3
1