Similar random variation for two columns in pandas - python-3.x

data = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABCD') )
data[['B', 'C']] = data[['B', 'C']].apply(lambda x: x + (-1)**random.randrange(2)*1)
I want to randomly vary columns B and C such that the variation is the same for both columns: if column B increases by one, column C must increase by one too. However, for each row the value can increase or decrease randomly. The code above doesn't work. Then I tried this with a random seed:
data['B'] = data['B'].apply(lambda x: x + (-1)**random.randrange(2)*1)
data['C'] = data['C'].apply(lambda x: x + (-1)**random.randrange(2)*1)
Each row varies randomly, but the changes in column B and column C are not the same. How do I do this?
expected output
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 2.0 2.0 1.0
3 1.0 2.0 2.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 0.0 0.0 1.0
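For reference (not part of the original question), one way to get this kind of output is to draw a single random step per row and add it to both columns, e.g. with numpy.random.choice - a minimal sketch:
import numpy as np
import pandas as pd

data = pd.DataFrame(1.0, index=[1, 2, 3, 4, 5], columns=list('ABCD'))

# One random step per row (0 included so a row may also stay unchanged);
# the same step is added to B and to C, so both columns always move together.
step = np.random.choice([-1, 0, 1], size=len(data))
data['B'] += step
data['C'] += step
print(data)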

How to count values greater than or equal to 0.5 occurring continuously for 5 or more rows in Python

I am trying to count values in column x that are greater than or equal to 0.5 and occur continuously for 5 or more rows. I also need to use the groupby function for my data.
I used the following; it works fine, but it cannot count continuous occurrences of a value, it just counts all values greater than or equal to 0.5:
data['points_greater_0.5'] = data[abs(data['x'])>=0.5].groupby(['y','z','n'])['x'].count()
But I want to count only where values greater than or equal to 0.5 occur continuously 5 times or more.
As the source DataFrame I took:
x y z n
0 0.1 1.0 1.0 1.0
1 0.5 1.0 1.0 1.0
2 0.6 1.0 1.0 1.0
3 0.7 1.0 1.0 1.0
4 0.6 1.0 1.0 1.0
5 0.5 1.0 1.0 1.0
6 0.1 1.0 1.0 1.0
7 0.5 1.0 1.0 1.0
8 0.6 1.0 1.0 1.0
9 0.7 1.0 1.0 1.0
10 0.1 1.0 1.0 1.0
11 0.5 1.0 1.0 1.0
12 0.6 1.0 1.0 1.0
13 0.7 1.0 1.0 1.0
14 0.7 1.0 1.0 1.0
15 0.6 1.0 1.0 1.0
16 0.5 1.0 1.0 1.0
17 0.1 1.0 1.0 1.0
18 0.5 2.0 1.0 1.0
19 0.6 2.0 1.0 1.0
20 0.7 2.0 1.0 1.0
21 0.6 2.0 1.0 1.0
22 0.5 2.0 1.0 1.0
(one group for (y, z, n) == (1.0, 1.0, 1.0) and another for (2.0, 1.0, 1.0)).
Start from import itertools as it.
Then define the following function to get the count of your "wanted"
elements from the current group:
def getCnt(grp):
    return sum(filter(lambda x: x >= 5,
                      [len(list(group))
                       for key, group in it.groupby(grp.x, lambda elem: elem >= 0.5)
                       if key]))
Note that it uses it.groupby, i.e. the groupby function from itertools
(not the pandasonic version of it).
The difference is that the itertools version starts a new group on each change
of the grouping key (by default, the value of the source element).
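For illustration (my addition, not part of the original answer), a minimal example of how it.groupby splits a sequence into consecutive runs:
import itertools as it

vals = [0.1, 0.5, 0.6, 0.1, 0.7]
# A new (key, group) pair starts every time the key changes:
runs = [(key, len(list(group))) for key, group in it.groupby(vals, lambda v: v >= 0.5)]
print(runs)   # [(False, 1), (True, 2), (False, 1), (True, 1)]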
Steps:
- it.groupby(grp.x, lambda elem: elem >= 0.5) - create an iterator returning pairs (key, group) from the x column of the current (pandas) group. The key states whether the current run (from the itertools grouping) consists of your "wanted" elements (>= 0.5), and the group contains those elements.
- [ len(list(group)) for key, group in … if key ] - get a list of the lengths of the runs, excluding runs of "smaller" elements.
- filter(lambda x: x >= 5, …) - filter the above list, leaving only counts of runs with 5 or more members.
- sum(…) - sum the above counts.
Then, to get your expected result, as a DataFrame, apply this function to
each group of rows, this time grouping with the pandasonic version of
groupby.
Then set the name of the resulting Series (it will be the column name
in the final result) and reset the index, to convert it to a DataFrame.
The code to do it is:
result = df.groupby(['y','z','n']).apply(getCnt).rename('Cnt').reset_index()
The result is:
y z n Cnt
0 1.0 1.0 1.0 11
1 2.0 1.0 1.0 5

DataFrame of Dates into sequential dates

I would like to turn a dataframe like the one below into a dataframe of sequential dates.
Date
01/25/1995
01/20/1995
01/20/1995
01/23/1995
into
Date Value Cumsum
01/20/1995 2 2
01/21/1995 0 2
01/22/1995 0 2
01/23/1995 1 3
01/24/1995 0 3
01/25/1995 1 4
Try this:
df['Date'] = pd.to_datetime(df['Date'])
df_out = df.assign(Value=1).set_index('Date').resample('D').asfreq().fillna(0)
df_out = df_out.assign(Cumsum=df_out['Value'].cumsum())
print(df_out)
Output:
Value Cumsum
Date
1995-01-20 1.0 1.0
1995-01-21 0.0 1.0
1995-01-22 0.0 1.0
1995-01-23 1.0 2.0
1995-01-24 0.0 2.0
1995-01-25 1.0 3.0
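Note that the expected output above counts the duplicated 01/20/1995 as Value 2. If duplicate dates should be counted, a small variation (a sketch, not part of the original answer) using resample('D').sum() gets closer:
import pandas as pd

df = pd.DataFrame({'Date': ['01/25/1995', '01/20/1995', '01/20/1995', '01/23/1995']})
df['Date'] = pd.to_datetime(df['Date'])

# Count occurrences per calendar day (missing days sum to 0), then accumulate.
df_out = df.assign(Value=1).set_index('Date').resample('D')['Value'].sum().to_frame()
df_out['Cumsum'] = df_out['Value'].cumsum()
print(df_out)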

Joining Time Series Dataframes where duplicate columns contain the same values

I'm trying to combine multiple dataframes that contain time series data. These dataframes can have up to 100 columns and roughly 5000 rows. Two sample dataframes are
df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'], 'Test1':[1, 2, 3, 4], 'Gender': ['M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'], 'Test2': [1, 2, 3, 4, 5], 'Gender': ['M', 'M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1, 1]})
df1
SubjectID Date Test1 Gender StudyID
0 A 2010-05-08 1 M 1
1 A 2010-05-10 2 M 1
2 B 2010-05-08 3 M 1
3 C 2010-05-08 4 F 1
df2
SubjectID Date Test2 Gender StudyID
0 A 2010-05-08 1 M 1
1 A 2010-05-09 2 M 1
2 A 2010-05-10 3 M 1
3 B 2010-05-08 4 M 1
4 C 2010-05-09 5 F 1
My expected output is
SubjectID Date Test1 Gender StudyID Test2
0 A 2010-05-08 1.0 M 1.0 1.0
1 A 2010-05-09 NaN M 1.0 2.0
2 A 2010-05-10 2.0 M 1.0 3.0
3 B 2010-05-08 3.0 M 1.0 4.0
4 C 2010-05-08 4.0 F 1.0 NaN
5 C 2010-05-09 NaN F 1.0 5.0
I'm joining the dataframes by
merged_df = df1.set_index(['SubjectID', 'Date']).join(df2.set_index(['SubjectID', 'Date']), how = 'outer', lsuffix = '_l', rsuffix = '_r').reset_index()
but my output is
SubjectID Date Test1 Gender_l StudyID_l Test2 Gender_r StudyID_r
0 A 2010-05-08 1.0 M 1.0 1.0 M 1.0
1 A 2010-05-09 NaN NaN NaN 2.0 M 1.0
2 A 2010-05-10 2.0 M 1.0 3.0 M 1.0
3 B 2010-05-08 3.0 M 1.0 4.0 M 1.0
4 C 2010-05-08 4.0 F 1.0 NaN NaN NaN
5 C 2010-05-09 NaN NaN NaN 5.0 F 1.0
Is there a way to combine columns while joining the dataframes if all the values in both dataframes are equal? I can do it after the merge, but that will get tedious for my large datasets.
It depends on how you want to implement the logic of resolving information that may not exactly match. If you have merged several frames, I think taking the modal value is appropriate. Taking your merged_df, we can resolve it as:
merged_df = merged_df.groupby([x.split('_')[0] for x in merged_df.columns], 1).apply(lambda x: x.mode(1)[0])
Date Gender StudyID SubjectID Test1 Test2
0 2010-05-08 M 1.0 A 1.0 1.0
1 2010-05-09 M 1.0 A NaN 2.0
2 2010-05-10 M 1.0 A 2.0 3.0
3 2010-05-08 M 1.0 B 3.0 4.0
4 2010-05-08 F 1.0 C 4.0 NaN
5 2010-05-09 F 1.0 C NaN 5.0
Or perhaps you want to give priority to the non-null value in the first frame; then this is .combine_first:
df1.set_index(['SubjectID', 'Date']).combine_first(df2.set_index(['SubjectID', 'Date']))
Gender StudyID Test1 Test2
SubjectID Date
A 2010-05-08 M 1.0 1.0 1.0
2010-05-09 M 1.0 NaN 2.0
2010-05-10 M 1.0 2.0 3.0
B 2010-05-08 M 1.0 3.0 4.0
C 2010-05-08 F 1.0 4.0 NaN
2010-05-09 F 1.0 NaN 5.0
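If Gender and StudyID are guaranteed to agree wherever both frames have a row (true for the sample data, but an assumption in general), another option - a sketch of my own, not part of the original answer - is to include them in the join keys so suffixed duplicates never appear:
merged_df = df1.merge(df2, on=['SubjectID', 'Date', 'Gender', 'StudyID'], how='outer')
With df1 and df2 as above, this reproduces the expected output (up to row order and integer vs. float display).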
If you have to merge many DataFrames it may be best to use reduce from functools.
from functools import reduce
merged_df = reduce(lambda l, r: l.merge(r, on=['SubjectID', 'Date'], how='outer', suffixes=['_l', '_r']),
                   [df1, df2, df1, df2, df2])
You'll have lots of overlapping columns, but still can resolve them:
merged_df.groupby([x.split('_')[0] for x in merged_df.columns], 1).apply(lambda x: x.mode(1)[0])
Date Gender StudyID SubjectID Test1 Test2
0 2010-05-08 M 1.0 A 1.0 1.0
1 2010-05-10 M 1.0 A 2.0 3.0
2 2010-05-08 M 1.0 B 3.0 4.0
3 2010-05-08 F 1.0 C 4.0 NaN
4 2010-05-09 M 1.0 A NaN 2.0
5 2010-05-09 F 1.0 C NaN 5.0

Get number of rows for all combinations of attribute levels in Pandas

I have a dataframe with a bunch of categorical variables, where each row corresponds to a product. I wanted to find the number of rows for every combination of attribute levels and decided to run the following:
att1=list(frame_base.columns.values)
f1=att.groupby(att1,as_index=False).size().rename('counts').to_frame()
att1 is the list of all attributes. f1 does not seem to provide the correct value, as f1.counts.sum() is not equal to the length of the dataframe before the groupby. Why doesn't this work?
One possible problem is a row of NaN values, but maybe there is also a typo - you need att instead of frame_base:
att = pd.DataFrame({'A':[1,1,3,np.nan],
                    'B':[1,1,6,np.nan],
                    'C':[2,2,9,np.nan],
                    'D':[1,1,5,np.nan],
                    'E':[1,1,6,np.nan],
                    'F':[1,1,3,np.nan]})
print (att)
A B C D E F
0 1.0 1.0 2.0 1.0 1.0 1.0
1 1.0 1.0 2.0 1.0 1.0 1.0
2 3.0 6.0 9.0 5.0 6.0 3.0
3 NaN NaN NaN NaN NaN NaN
att1=list(att.columns.values)
f1=att.groupby(att1).size().reset_index(name='counts')
print (f1)
A B C D E F counts
0 1.0 1.0 2.0 1.0 1.0 1.0 2
1 3.0 6.0 9.0 5.0 6.0 3.0 1
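On pandas 1.1 or later, another option (my addition, assuming the NaN rows should be counted rather than dropped) is the dropna=False parameter of groupby, so that the counts add up to the original number of rows:
# with att and att1 as above
f1 = att.groupby(att1, dropna=False).size().reset_index(name='counts')
print(f1)
print(f1.counts.sum() == len(att))   # True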

Pandas Pivot and Summarize For Multiple Rows Vertically

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site':['a','a','a','b','b','b'],
                   'x':[1,1,0,1,0,0],
                   'y':[1,np.nan,0,1,1,0]
                   })
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 0
5 b 0.0 0
I am looking for the most efficient way, for each numerical column (y and x), to produce a percent per group, label the column name, and stack them in one column.
Here's how I accomplish this for 'y':
df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
Now all I need is a way to add 2 more rows to this; the results for 'x' if I had pivoted with its values above, like this:
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1 2 x 50.0
b 1 3 x 33.3
In reality, I have 48 such numerical data columns that need to be stacked as such.
Thanks in advance!
First you can use notnull. Then omit the values parameter in pivot_table, then stack and sort_values by the new column Item. Last, you can use the pandas round function:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
Another solution, if it is necessary to define the values columns in pivot_table:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', values=['y', 'x'], aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
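For comparison, here is a groupby/melt sketch of my own (not from the original answer) that should produce the same table on recent pandas versions, assuming the same df as above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Site': ['a','a','a','b','b','b'],
                   'x': [1,1,0,1,0,0],
                   'y': [1,np.nan,0,1,1,0]})
df = df.loc[df['y'].notnull()]

# Reshape to long form, then aggregate per (Site, Item).
long = df.melt(id_vars='Site', var_name='Item')
t = (long.groupby(['Site', 'Item'])['value']
         .agg(sum='sum', len='count')
         .reset_index(level='Item')
         .sort_values('Item', ascending=False))
t['Perc'] = (t['sum'] / t['len'] * 100).round(1)
t = t[['sum', 'len', 'Item', 'Perc']]
print(t)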
