I want to replace values in columns using a loop with a condition:
If the value in column [D] does not match any of the values in [A, B, C], replace the first NaN in that row with the value of D; if there is no NaN in the row, create a new column [E] and put the value from column [D] there.
ID A B C D
0 22 32 NaN 22
1 25 13 NaN 15
2 27 NaN NaN 20
3 29 10 16 29
4 12 92 33 55
I want output to be:
ID   A   B    C   D   E
 0  22  32  NaN  22
 1  25  13   15  15
 2  27  20  NaN  20
 3  29  10   16  29
 4  12  92   33  55  55
My code example:
List = [[22, 32, None, 22],
        [25, 13, None, 15],
        [27, None, None, 20],
        [29, 10, 16, 29],
        [12, 92, 33, 55]]

for Row in List:
    Target_C = Row[3]
    if Row.count(Target_C) < 2:        # skip rows where D already matches another value
        None_Found = False             # small bool to check later whether a None was found
        for Index, Value in enumerate(Row):
            if Value is None:          # if there is a None in the row
                Row[Index] = Target_C  # replace it with the value of column D
                None_Found = True
                break                  # stop after the first replacement
        if not None_Found:             # if no None was found, add a new column
            Row.append(Target_C)
You can do it this way:
a = df.isnull()
b = a[a.any(axis=1)].idxmax(axis=1)    # first NaN column label for each row that has one
nanindex = b.index                     # rows containing at least one NaN
check = (df.A != df.D) & (df.B != df.D) & (df.C != df.D)
commonind = check[~check].index        # rows where D already matches A, B or C
replace_ind_list = list(nanindex.difference(commonind))  # rows where a NaN gets D's value
new_col_list = df.index.difference(list(set(commonind.tolist() + nanindex.tolist()))).tolist()  # rows that need E
df['E'] = ''
for index, row in df.iterrows():
    for val in new_col_list:
        if index == val:
            df.at[index, 'E'] = df['D'][index]
    for val in replace_ind_list:
        if index == val:
            df.at[index, b[val]] = df['D'][index]
df
Output
   ID   A     B     C   D   E
0   0  22  32.0   NaN  22
1   1  25  13.0  15.0  15
2   2  27  20.0   NaN  20
3   3  29  10.0  16.0  29
4   4  12  92.0  33.0  55  55
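For reference, here is a row-wise sketch of the same logic with DataFrame.apply; it assumes the column names A, B, C, D from the question and is just one possible approach:

import numpy as np
import pandas as pd

def place_d(row):
    # Leave the row alone when D already matches A, B or C.
    if row['D'] in row[['A', 'B', 'C']].values:
        return row
    nans = row[['A', 'B', 'C']].isna()
    if nans.any():
        row[nans.idxmax()] = row['D']   # idxmax picks the first NaN column label
    else:
        row['E'] = row['D']             # no NaN left: spill D into the extra column E
    return row

df['E'] = np.nan
df = df.apply(place_d, axis=1)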
I have a data frame with numbers in multiple columns listed by date. What I'm trying to do is find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob

def lotnorm(pdobject):
    # Clean up special characters in the column names and make the Date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List files in the data directory with a csv filename.
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in lotimport()['ozlotto'].columns.tolist():
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
    print(lotcomb)
This solution might be the one you are looking for.
import numpy as np

freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)  # maximum of each column
df.max(axis=1)  # maximum of each row (along the index)
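A tiny illustration with a made-up 2x2 frame, just to show which way each axis runs:

import pandas as pd

t = pd.DataFrame({'x': [1, 4], 'y': [3, 2]})
print(t.max(axis=0))  # per column: x -> 4, y -> 3
print(t.max(axis=1))  # per row: 0 -> 3, 1 -> 4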
Ok so the final answer I came up with was a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file, clean up the dates and the column names, and convert it to a pandas DataFrame.
Then create a new pandas Series and append each column to it, ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and turn the values into the index; after that I sort the column by count in descending order and return the top 10 values.
Below is the resulting code; I hope it helps someone else.
import pandas as pd
import glob
import numpy as np

def lotnorm(pdobject):
    # Clean up special characters in the column names and make the Date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List files in the data directory with a csv filename.
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in lotimport()['ozlotto'].columns.tolist():
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)

freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'], ascending=False).head(10)
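Note that Series.append was deprecated and later removed in pandas 2.0. On newer versions the same combining step can be sketched with pd.concat plus value_counts (which already sorts in descending order); here df stands for the cleaned ozlotto frame:

lotcomb = pd.concat([df[c] for c in df.columns], ignore_index=True)
lotop = lotcomb.value_counts().rename_axis("Numbers").to_frame("Frequency")
print(lotop.head(10))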
I have a DataFrame like this:
import pandas as pd

data = {'TYPE': ['X', 'Y', 'Z'], 'A': [11, 12, 13], 'B': [21, 22, 23], 'C': [31, 32, 34]}
df = pd.DataFrame(data)
TYPE A B C
0 X 11 21 31
1 Y 12 22 32
2 Z 13 23 34
I like to get the following DataFrame:
TYPE A A_added B B_added C C_added
0 X 11 15 21 25 31 35
1 Y 12 18 22 28 32 38
2 Z 13 20 23 30 34 41
For each column (next to TYPE column), here A,B,C:
add a new column with the name column_name_added
if TYPE = X add 4, if TYPE = Y add 6, if Z add 7
The idea is to add a helper Series, created by Series.map with a dictionary, to the value columns with DataFrame.add, join the result to the original with DataFrame.join, and last change the order of columns with DataFrame.reindex:
d = {'X':4,'Y':6, 'Z':7}
# interleave each original column with its *_added counterpart
cols = df.columns[:1].tolist() + [i for x in df.columns[1:] for i in (x, x + '_added')]
df1 = df.iloc[:, 1:].add(df['TYPE'].map(d), axis=0, fill_value=0).add_suffix('_added')
df2 = df.join(df1).reindex(cols, axis=1)
print (df2)
TYPE A A_added B B_added C C_added
0 X 11 15 21 25 31 35
1 Y 12 18 22 28 32 38
2 Z 13 20 23 30 34 41
EDIT: For values not matched by the dictionary, missing values are created, so adding Series.fillna returns the value 7 for all other values:
d = {'X':4,'Y':6}
cols = df.columns[:1].tolist() + [i for x in df.columns[1:] for i in (x, x + '_added')]
df1 = df.iloc[:, 1:].add(df['TYPE'].map(d).fillna(7).astype(int), axis=0).add_suffix('_added')
df2 = df.join(df1).reindex(cols, axis=1)
print (df2)
TYPE A A_added B B_added C C_added
0 X 11 15 21 25 31 35
1 Y 12 18 22 28 32 38
2 Z 13 20 23 30 34 41
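For comparison, a plain-loop sketch over the value columns produces the same added numbers (it assumes the d mapping above; the interleaved column order would still need the reindex step):

add = df['TYPE'].map(d).fillna(7).astype(int)   # per-row amount to add
for c in ['A', 'B', 'C']:
    df[c + '_added'] = df[c] + add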
I have a DataFrame with an ID column and a Value column that contains only (0, 1, 2). I want to capture only those rows where there is a transition from (0-1) or (1-2) in the Value column. This process has to be done for each ID separately.
I tried to group by ID and use a difference aggregation function, so that I could take those rows for which the difference of values is 1, but it fails in certain conditions.
df=df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()
Given DataFrame:
Index UniqID Value
1 a 1
2 a 0
3 a 1
4 a 0
5 a 1
6 a 2
7 b 0
8 b 2
9 b 1
10 b 2
11 b 0
12 b 1
13 c 0
14 c 1
15 c 2
16 c 2
Expected Output:
2 a 0
3 a 1
4 a 0
5 a 1
6 a 2
9 b 1
10 b 2
11 b 0
12 b 1
13 c 0
14 c 1
15 c 2
Only expecting those rows when there is a transition from either 0-1 or 1-2.
Thank you in advance.
Use my solution; it works per group with tuples of patterns:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0, 1, 2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)

pat = [(0, 1), (1, 2)]
a = np.array(pat)

# flag the second row of every rolling window that matches one of the patterns
s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None, :] == a, axis=1).any(), raw=True))

# back-fill one step so the first row of each matched pair is kept too
mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))

df = df[mask]
print (df)
UniqID Value
99 a 1
98 a 2
12 a 1
63 a 2
38 a 0
41 a 1
9 a 1
72 a 2
64 b 1
67 b 2
33 b 0
68 b 1
57 b 1
71 b 2
10 b 0
8 b 1
61 c 1
66 c 2
46 c 0
0 c 1
40 c 2
21 d 0
74 d 1
15 d 1
85 d 2
6 d 1
88 d 2
91 d 0
83 d 1
4 d 1
34 d 2
96 d 0
48 d 1
29 d 0
84 d 1
32 e 0
62 e 1
37 e 1
55 e 2
16 e 0
23 e 1
Assuming the transition is strictly from 0 -> 1 and from 1 -> 2 (this assumption is valid as well).
Similar Sample data:
index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2
Load this into a pandas DataFrame. Then, using the code below:

def grp_trns(x):
    # mark rows whose value increased by exactly 1, and keep both rows of each pair
    x['dif'] = x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif == 1]['index'] - 1) + list(x[x.dif == 1]['index']))

target_index = df.groupby('id').apply(lambda x: grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id', 'value']])
It gives the desired DataFrame based on that assumption:
index id value
1 2 a 0
2 3 a 1
3 4 a 0
4 5 a 1
5 6 a 2
8 9 b 1
9 10 b 2
10 11 b 0
11 12 b 1
12 13 c 0
13 14 c 1
14 15 c 2
Edit: To include the transition 1 -> 0, below is the updated function:

def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    index1 = list(x[x.dif == 1]['index'] - 1) + list(x[x.dif == 1]['index'])
    index2 = list(x[(x.dif == -1) & (x.value == 0)]['index'] - 1) + list(x[(x.dif == -1) & (x.value == 0)]['index'])
    return pd.DataFrame(index1 + index2)
My version uses shift and diff() to delete all lines whose diff value equals 0, 2 or -2:

import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
                   'UniqId': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                   'Value': [1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 0, 1, 2, 2]})
df['diff'] = np.NaN
# per-group diff of Value
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId'] == element] = df.loc[df['UniqId'] == element]['Value'].diff()
df['diff'] = df['diff'].shift(-1)
df = df.loc[(df['diff'] != -2) & (df['diff'] != 2) & (df['diff'] != 0)]
print(df)
(Still waiting for clarification on whether the 2-1 direction should be treated like 1-2.)
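For what it's worth, a compact mask-based sketch of the same idea, assuming the UniqID/Value columns from the question; each matching transition keeps both of its rows:

prev = df.groupby('UniqID')['Value'].shift()    # previous value within the group
nxt = df.groupby('UniqID')['Value'].shift(-1)   # next value within the group
is_end = ((prev == 0) & (df['Value'] == 1)) | ((prev == 1) & (df['Value'] == 2))
is_start = ((df['Value'] == 0) & (nxt == 1)) | ((df['Value'] == 1) & (nxt == 2))
print(df[is_end | is_start])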
Given that I have a df like this:
ID Date Amount
0 a 2014-06-13 12:03:56 13
1 b 2014-06-15 08:11:10 14
2 a 2014-07-02 13:00:01 15
3 b 2014-07-19 16:18:41 22
4 b 2014-08-06 09:39:14 17
5 c 2014-08-22 11:20:56 55
...
129 a 2016-11-06 09:39:14 12
130 c 2016-11-22 11:20:56 35
131 b 2016-11-27 09:39:14 42
132 a 2016-12-11 11:20:56 18
I need to create a column df['Checking'] to show that ID will appear in next month or not and i tried the code as below:
df['Checking']= df.apply(lambda x: check_nextmonth (x.Date,
x.ID), axis=1)
where
def check_nextmonth(date, id)=
x= id in df['user_id'][df['Date'].dt.to_period('M')== ((date+
relativedelta(months=1))).to_period('M')].values
return x
but it take too long to process a single row.
How can i improve this code or another way to achieve what i want?
Using pd.to_datetime with some timestamp tricks:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# month number of "one month before" each row's date
df['tmp'] = (df['Date'] - pd.DateOffset(months=1)).dt.month
# a row's month appears in the group's tmp values exactly when the same ID has a record next month
s = df.groupby('ID').apply(lambda x: x['Date'].dt.month.isin(x['tmp']))
df['Checking'] = s.reset_index(level=0)['Date']
Output:
ID Date Amount tmp Checking
0 a 2014-06-13 12:03:56 13 5 True
1 b 2014-06-15 08:11:10 14 5 True
2 a 2014-07-02 13:00:01 15 6 False
3 b 2014-07-19 16:18:41 16 6 True
4 b 2014-08-06 09:39:14 17 7 False
5 c 2014-08-22 11:20:56 18 7 False
Here's one method: check whether each grouped ID's next-row month equals the current month + 1, then assign the result back after sorting by ID.
check = (df.groupby('ID')
           .apply(lambda x: x['Date'].dt.month.shift(-1) == x['Date'].dt.month + 1)
           .stack().values)
df = df.sort_values('ID').assign(checking=check).sort_index()
ID Date Amount checking
0 a 2014-06-13 12:03:56 13 True
1 b 2014-06-15 08:11:10 14 True
2 a 2014-07-02 13:00:01 15 False
3 b 2014-07-19 16:18:41 16 True
4 b 2014-08-06 09:39:14 17 False
5 c 2014-08-22 11:20:56 18 False
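Another possible sketch uses period arithmetic and an explicit set of (ID, month) pairs; it assumes Date is already a datetime column:

df['month'] = df['Date'].dt.to_period('M')
pairs = set(zip(df['ID'], df['month']))
# True when the same ID also has a record in the following calendar month
df['Checking'] = [(i, m + 1) in pairs for i, m in zip(df['ID'], df['month'])]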
This may be a really simple question, but I am new to Python 3 and I have a DataFrame with multiple columns. I would like to add a new column to the existing DataFrame which does the following calculation:
New Column = Max((Column A/Column B), (Column C/Column D), (Column E/Column F))
I can do a max based on the following code, but wanted to check how I can do the division along with it.
df['Max'] = df[['Column A','Column B','Column C', 'Column D', 'Column E', 'Column F']].max(axis=1)
Column A Column B Column C Column D Column E Column F Max
3600 36000 22 11 3200 3200 36000
2300 2300 13 26 1100 1200 2300
1300 13000 15 33 1000 1000 13000
Thanks
You can divide the df by itself by slicing the columns in steps of two and then take the max:
In [105]:
# even-positioned columns (A, C, E) divided by odd-positioned columns (B, D, F)
df['Max'] = df.iloc[:, ::2].div(df.iloc[:, 1::2].values, axis=1).max(axis=1)
df
Out[105]:
Column A Column B Column C Column D Column E Column F Max
0 3600 36000 22 11 3200 3200 2
1 2300 2300 13 26 1100 1200 1
2 1300 13000 15 33 1000 1000 1
Here are the intermediate values (computed on the six original columns):
In [108]:
df.iloc[:, :6:2].div(df.iloc[:, 1:6:2].values, axis=1)
Out[108]:
Column A Column C Column E
0 0.1 2.000000 1.000000
1 1.0 0.500000 0.916667
2 0.1 0.454545 1.000000
You can try something like the following:
df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float),
                                   v['C'] / v['D'].astype(float),
                                   v['E'] / v['F'].astype(float)), axis=1)
Example
In [14]: df
Out[14]:
A B C D E F
0 1 11 1 11 12 98
1 2 22 2 22 67 1
2 3 33 3 33 23 4
3 4 44 4 44 11 10
In [15]: df['Max'] = df.apply(lambda v: max(v['A'] / v['B'].astype(float),
                                            v['C'] / v['D'].astype(float),
                                            v['E'] / v['F'].astype(float)), axis=1)
In [16]: df
Out[16]:
A B C D E F Max
0 1 11 1 11 12 98 0.122449
1 2 22 2 22 67 1 67.000000
2 3 33 3 33 23 4 5.750000
3 4 44 4 44 11 10 1.100000
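A vectorized sketch with plain numpy arrays is another option; it assumes the six columns pair up as A/B, C/D, E/F as in the example above:

import numpy as np

num = df[['A', 'C', 'E']].to_numpy(dtype=float)   # numerators
den = df[['B', 'D', 'F']].to_numpy(dtype=float)   # denominators
df['Max'] = (num / den).max(axis=1)               # row-wise max of the three ratios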