Keeping rows from the previous year - Excel

I am just going to give it a try, as I know there are some smart people here who might have R code for this.
I won't be able to code this by myself.
I have a dataset that contains names and year-months between 2000-01 and 2008-12, looking like this:
Name Date
A 2000-01
A 2000-02
A ...
A 2008-12
B 2000-01
B ...
B 2008-12
C and so on..
In the best case, each name has one value in my key column for every year; that's the best I can ask for. Unfortunately, some years don't have a value in my key column.
Going further into my dataset, looking only at Name A:
If I do not have an observation for every year between 2000 and 2008, I want to get back a row for the missing year and month, where the month is taken from the next year's observation.
In this example:
If 2003-02 has a value in my key column and 2002-02 does not, I want to get back the row for date 2002-02 and Name A.
In a nutshell: "Keeping rows from the previous year based on key column from the next year"
Is there some easy way to code this?
Thank you :)

There's no straightforward and easy way to code what you're describing, but it's certainly possible to break the problem down into easier parts. The core part of the problem is as follows. Given a dataframe of rows with non-NA values, e.g.
year month
1 2002 12
2 2005 11
3 2006 01
4 2008 07
for each row, check the dataframe to see if the previous year exists; if yes, return the row; if no, return an additional row with the previous year and the same month. Here's what a function to do that might look like:
check_ym <- function(y, m, dat) {
  if ((y - 1) %in% dat$year) {
    return(data.frame(Date = paste(y, m, sep = "-"), stringsAsFactors = FALSE))
  } else {
    return(data.frame(Date = paste(c(y - 1, y), c(m, m), sep = "-"), stringsAsFactors = FALSE))
  }
}
Now, let's make some fake data.
library(dplyr)
library(tidyr)
library(purrr)
# Simulate data
set.seed(123)
x <- data.frame(Date = paste(sample(2000:2008, 4),
                             sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                             sep = "-"),
                KeyColumn = floor(runif(4, 1, 10)))
d <- data.frame(Date = paste(rep(2000:2008, each = 12),
                             sprintf("%02d", rep(1:12, times = 9)),
                             sep = "-")) %>%
  left_join(x)
Identify the non-NA rows:
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
# year month KeyColumn
# 1 2002 12 5
# 2 2005 11 5
# 3 2006 01 5
# 4 2008 07 9
Then, we run the function above, iterating through the year and month columns. This gives us:
out <- map2_df(dd$year, dd$month, .f = check_ym, dat = dd)
out
# Date
# 1 2001-12
# 2 2002-12
# 3 2004-11
# 4 2005-11
# 5 2006-01
# 6 2007-07
# 7 2008-07
Finally, we join this with our original data:
inner_join(out, d)
# Joining, by = "Date"
# Date KeyColumn
# 1 2001-12 NA
# 2 2002-12 5
# 3 2004-11 NA
# 4 2005-11 5
# 5 2006-01 5
# 6 2007-07 NA
# 7 2008-07 9
This is just for one Name. We can also do this for many Names. First create some fake data:
# Simulate data
set.seed(123)
d <- map_df(setNames(1:3, LETTERS[1:3]), function(...) {
  x <- data.frame(Date = paste(sample(2000:2008, 4),
                               sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                               sep = "-"),
                  KeyColumn = floor(runif(4, 1, 10)))
  data.frame(Date = paste(rep(2000:2008, each = 12),
                          sprintf("%02d", rep(1:12, times = 9)),
                          sep = "-")) %>%
    left_join(x)
}, .id = "Name")
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
# Name year month KeyColumn
# 1 A 2002 12 5
# 2 A 2005 11 5
# 3 A 2006 01 5
# 4 A 2008 07 9
# 5 B 2000 04 6
# 6 B 2004 01 7
# 7 B 2005 12 9
# 8 B 2006 03 9
# 9 C 2003 12 1
# 10 C 2005 04 7
# 11 C 2006 11 5
# 12 C 2008 02 8
Now, use split() to split the dataframe into three dataframes by Name; for each sub-dataframe we apply check_ym(), then combine the results and join them with the original data:
lapply(split(dd, dd$Name), function(dat) {
  map2_df(dat$year, dat$month, .f = check_ym, dat = dat)
}) %>%
  bind_rows(.id = "Name") %>%
  inner_join(d)
# Joining, by = c("Name", "Date")
# Name Date KeyColumn
# 1 A 2001-12 NA
# 2 A 2002-12 5
# 3 A 2004-11 NA
# 4 A 2005-11 5
# 5 A 2006-01 5
# 6 A 2007-07 NA
# 7 A 2008-07 9
# 8 B 2000-04 6
# 9 B 2003-01 NA
# 10 B 2004-01 7
# 11 B 2005-12 9
# 12 B 2006-03 9
# 13 C 2002-12 NA
# 14 C 2003-12 1
# 15 C 2004-04 NA
# 16 C 2005-04 7
# 17 C 2006-11 5
# 18 C 2007-02 NA
# 19 C 2008-02 8

Related

Want to calculate the count of pass instances of a data set using Python pandas

import pandas as pd

x = []
y1 = []
r1 = len(df)
L1 = len(df.columns)
for i in range(r1):
    ll = df.loc[i, 'LL']
    ul = df.loc[i, 'UL']
    count1 = 0
    for j in range(5, L1):
        # replace any string cell with 0 before comparing
        if isinstance(df.iloc[i, j], str):
            df.iloc[i, j] = 0
        if ll <= df.iloc[i, j] <= ul:
            count1 = count1 + 1
    if count1 == (L1 - 5):
        x.append('Pass')
    else:
        x.append('Fail')
    y1.append(count1)
se = pd.Series(x)
se1 = pd.Series(y1)
min1 = df.iloc[:, 5:].min(axis=1)
mean1 = df.iloc[:, 5:].astype(float).mean(axis=1, skipna=True)
median1 = df.iloc[:, 5:].astype(float).median(axis=1, skipna=True)
max1 = df.iloc[:, 5:].max(axis=1)
count1 = df.iloc[:, 5:].count(axis=1)
df['Min'] = min1.values
df['Mean'] = mean1.values
df['Median'] = median1.values
df['Max'] = max1.values
df['Pass Count'] = se1.values
df['Result'] = se.values
yield1 = []
for i in range(len(se1)):
    yd1 = (se1[i] / (L1 - 3)) * 100
    yield1.append(yd1)
se2 = pd.Series(yield1)
df['Yield'] = se2.values
df1 = df.loc[:, ['PARAMETER', 'Min', 'Mean', 'Median', 'Max', 'Result', 'Pass Count', 'Yield']]
df1
Below is my data set; it is daily sensor data. Daily data should be within the Lower Limit (LL) and Upper Limit (UL). I want to count how many days the sensor data is within LL and UL.
I am not able to calculate this number of days using pandas. How can I calculate the number of days the sensor data is within LL and UL?
A few key ideas:
- build a list of the columns that go into the calculation (daycols)
- transpose these columns, then test against the limits, which gives a boolean array
- sum this boolean array and you have your desired count
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""sensor location,LL,UL,day1,day2,day3,day4,day5,day6,day7,number of days sensor data within LL and UL
A,1,10,12,6,9,4,9,7,15,5
B,1,12,4,15,7,1,11,1,7,6
C,1,15,13,13,13,10,7,13,13,7
D,1,10,12,1,14,12,15,4,4,3
E,1,20,11,15,8,14,1,14,14,7"""))
daycols = [c for c in df.columns if "day" in c and "number" not in c]
df = df.assign(
    # True counts as 1, so summing a boolean array gives the count;
    # the transpose lets the LL/UL Series align against the row labels
    daysBetween=lambda dfa: ((dfa.loc[:, daycols].T >= dfa["LL"]) &
                             (dfa.loc[:, daycols].T <= dfa["UL"])).sum()
)
print(df.to_string(index=False))
Output:
sensor location LL UL day1 day2 day3 day4 day5 day6 day7 number of days sensor data within LL and UL daysBetween
A 1 10 12 6 9 4 9 7 15 5 5
B 1 12 4 15 7 1 11 1 7 6 6
C 1 15 13 13 13 10 7 13 13 7 7
D 1 10 12 1 14 12 15 4 4 3 3
E 1 20 11 15 8 14 1 14 14 7 7
Speed up
If you have many columns, you can use slicing to identify them and turn them into integer indexes so that iloc can be used. The transpose is also unnecessary if you use the ge()/le() comparison methods with axis=0, which align the LL/UL Series on the row index:
dayi = [df.columns.get_loc(c) for c in df.columns[3:-1]]
df = df.assign(
    # True counts as 1, so summing the boolean frame row-wise gives the count
    daysBetween=lambda dfa: (dfa.iloc[:, dayi].ge(dfa["LL"], axis=0) &
                             dfa.iloc[:, dayi].le(dfa["UL"], axis=0)).sum(axis=1)
)
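Alternatively, if the day columns follow a regular naming pattern, DataFrame.filter can select them without any index arithmetic. A minimal sketch, assuming the columns are literally named day1 through day7:
# select the day columns by name pattern (assumes they are named day1..day7)
days = df.filter(regex=r'^day\d+$')
# ge/le with axis=0 align the LL/UL Series on the row index, so no transpose is needed
df['daysBetween'] = (days.ge(df['LL'], axis=0) & days.le(df['UL'], axis=0)).sum(axis=1)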

How to perform a lookup on NaN rows in a column without overwriting the other values (Python 3.7)

My goal is to look up the "Team" information from a Master dataset based on Year + Month + Name as a key;
if the result is NaN, use only Year + Name as a second key to fill the NaN rows.
Goal:
# dataset with lookuped column "Team"
Name Year Month KEY KEY_ND Team
0 Paul 2019 2 20192Paul 2019Paul A
1 Paul 2019 1 20191Paul 2019Paul A
2 Paul 2018 2 20182Paul 2018Paul C
3 Paul 2018 1 20181Paul 2018Paul B
4 Sue 2019 1 20191Sue 2019Sue A
Sample data and the script I have tried so far:
import pandas as pd

Master = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Sue"],
                       "Team": ["A", "B", "C", "A"],
                       "Year": ["2019", "2018", "2018", "2019"],
                       "Month": [1, 1, 2, 1]})
xx = pd.DataFrame({"Name": ["Paul", "Paul", "Paul", "Paul", "Sue"],
                   "Year": ["2019", "2019", "2018", "2018", "2019"],
                   "Month": [2, 1, 2, 1, 1]})
# Make first key
Master_KEY = Master.assign(KEY = Master['Year'].astype(str) +
                                 Master['Month'].astype(str) + Master['Name'].astype(str))
xx['KEY'] = xx['Year'] + xx['Month'].astype(str) + xx['Name']
# Make second key
Master_KEY = Master_KEY.assign(KEY_ND = Master['Year'].astype(str) + Master['Name'].astype(str))
xx['KEY_ND'] = xx['Year'] + xx['Name']
# First lookup with first key: Year + Month + Name
xx = pd.merge(xx, Master_KEY[['KEY', 'Team']], on = 'KEY', how = 'left')
# Mask for NaN rows
x_mask = xx['Team'].isnull()
# Second lookup with second key: Year + Name
xx.loc[x_mask, 'Team'] = pd.merge(xx, Master_KEY[['KEY_ND', 'Team']],
                                  on = 'KEY_ND', how = 'left')
Problem:
the second lookup doesn't return the expected result, as NaN values still exist.
xx
Name Year Month KEY KEY_ND Team
0 Paul 2019 2 20192Paul 2019Paul NaN
1 Paul 2019 1 20191Paul 2019Paul A
2 Paul 2018 2 20182Paul 2018Paul C
3 Paul 2018 1 20181Paul 2018Paul B
4 Sue 2019 1 20191Sue 2019Sue A
This part of the script is the problem:
# Second lookup with second key: Year + Name
xx.loc[x_mask, 'Team'] = pd.merge(xx, Master_KEY[['KEY_ND', 'Team']],
                                  on = 'KEY_ND', how = 'left')
Apparently this is long and inefficient code; I'd appreciate any recommendation that is cleaner and faster.
You can use DataFrame.merge twice with different on parameters; for the second merge, remove duplicates with DataFrame.drop_duplicates and replace missing values with DataFrame.fillna:
Master1 = Master[['Name','Year', 'Team']].drop_duplicates(subset=['Name','Year'])
df1 = xx[['Name','Year']].merge(Master1, how='left')
df2 = xx.merge(Master, on=['Name','Year', 'Month'], how='left').fillna({'Team': df1['Team']})
print (df2)
Name Year Month Team
0 Paul 2019 2 A
1 Paul 2019 1 A
2 Paul 2018 2 C
3 Paul 2018 1 B
4 Sue 2019 1 A
Your solution can be changed to use Series.map with the key columns, replacing missing values with Series.fillna:
Master = Master.assign(K1 = Master['Year'].astype(str) +
                            Master['Month'].astype(str) +
                            Master['Name'].astype(str),
                       K2 = Master['Year'].astype(str) +
                            Master['Name'].astype(str))
xx = xx.assign(K1 = xx['Year'].astype(str) +
                    xx['Month'].astype(str) +
                    xx['Name'].astype(str),
               K2 = xx['Year'].astype(str) +
                    xx['Name'].astype(str))
s1 = xx['K1'].map(Master.set_index('K1')['Team'])
s2 = xx['K2'].map(Master.drop_duplicates('K2').set_index('K2')['Team'])
xx['Team'] = s1.fillna(s2)
print (xx)
Name Year Month K1 K2 Team
0 Paul 2019 2 20192Paul 2019Paul A
1 Paul 2019 1 20191Paul 2019Paul A
2 Paul 2018 2 20182Paul 2018Paul C
3 Paul 2018 1 20181Paul 2018Paul B
4 Sue 2019 1 20191Sue 2019Sue A
A slightly cleaner and more readable solution is below. Define a transform function that adds the Team column to each row based on your condition, and apply it to the dataframe. It is readable and easily extensible to even more complex conditions:
def transform(x):
    master_row = Master[(Master.Name == x.Name) & (Master.Year == x.Year)]
    if len(master_row) > 1:
        # prefer an exact month match when one exists
        temp_rows = master_row[master_row.Month == x.Month]
        master_row = temp_rows if len(temp_rows) > 0 else master_row
    x["Team"] = master_row.iloc[0].Team
    return x

xx.apply(transform, axis=1)
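Note that apply does not modify xx in place; it builds a new dataframe from the rows returned by transform, so assign the result back:
xx = xx.apply(transform, axis=1)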

How do I copy to a range, rather than a list, of columns?

I am looking to append several columns to a dataframe.
Let's say I start with this:
import pandas as pd
dfX = pd.DataFrame({'A': [1,2,3,4],'B': [5,6,7,8],'C': [9,10,11,12]})
dfY = pd.DataFrame({'D': [13,14,15,16],'E': [17,18,19,20],'F': [21,22,23,24]})
I am able to append the dfY columns to dfX by defining the new columns in list form:
dfX[[3,4]] = dfY.iloc[:,1:3].copy()
...but I would rather do so this way:
dfX.iloc[:,3:4] = dfY.iloc[:,1:3].copy()
The former works! The latter executes, returns no errors, but does not alter dfX.
Are you looking for
dfX = pd.concat([dfX, dfY], axis = 1)
It returns
A B C D E F
0 1 5 9 13 17 21
1 2 6 10 14 18 22
2 3 7 11 15 19 23
3 4 8 12 16 20 24
And you can concatenate several dataframes this way, like pd.concat([dfX, dfY, dfZ], axis = 1).
If you need to append, say, only columns D and E from dfY to dfX, go for:
pd.concat([dfX, dfY[['D', 'E']]], axis = 1)
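As for why dfX.iloc[:,3:4] = ... executes but changes nothing: dfX only has columns at positions 0 through 2, and out-of-range positional slices are clipped, so iloc[:,3:4] is an empty selection; assigning to it is a no-op, and unlike loc with new labels, iloc cannot enlarge a DataFrame. A minimal demonstration:
import pandas as pd

dfX = pd.DataFrame({'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]})

# positions 3:4 are beyond the last column, so the clipped selection is empty:
# there is nothing to assign to, which is why the assignment silently does nothing
print(dfX.iloc[:, 3:4].shape)  # (4, 0)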

How to sum columns in pandas and add the result into a new row?

In this code I want to sum each column and add it as a new row.
It does the sum but it does not show the new row.
df = pd.DataFrame(g, columns=('AWA', 'REM', 'S1', 'S2'))
df['xSujeto'] = df.sum(axis=1)
xEstado = df.sum(axis=0)
df.append(xEstado, ignore_index=True)
df
I think you can use loc:
import pandas as pd

df = pd.DataFrame({'AWA': [1,2,3],
                   'REM': [4,5,6],
                   'S1': [7,8,9],
                   'S2': [1,3,5]})
#add 1 to last index value
print (df.index[-1] + 1)
3
df.loc[df.index[-1] + 1] = df.sum()
print (df)
AWA REM S1 S2
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
3 6 15 24 9
Or use append, as in the comment from Nickil Maveli:
xEstado = df.sum()
df = df.append(xEstado, ignore_index=True)
print (df)
AWA REM S1 S2
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
3 6 15 24 9

Change code to fit strings - Matlab

I have the following code:
NI1 = [NI{:,1} NI{:,2} NI{:,3}];
[~, NI2] = sort(NI1(:,2));
NI1 = NI1(NI2,:);
NI1((NI1(:,3) == 0),:) = [];
NI1 = unique(NI1(:,1:3), 'rows');
NI3 = unique(NI1(:,1:2), 'rows')
for mj = 1:size(NI3,1)
    NI3(mj,3) = sum(NI1(:,1) == NI3(mj,1) & NI1(:,2) == NI3(mj,2));
end
My initial cell array NI1 has in its columns: 1) the year; 2) a code that corresponds to a bank; 3) a code that corresponds to the workers of the bank. Example:
c1 c2 c3
1997 3 850
1997 3 1024
1997 3 5792
My output NI3 counts how many analysts (c3) are working in each bank (c2) in the different years (c1), for instance:
c1 c2 c3
1997 3 14
1997 7 84
1997 11 15
1998 4 1
1998 15 10
1998 3 12
1999 11 17
Now I am trying to apply exactly the same code, but my last column (c3) is a string, so the initial cell array fir_ins is the following:
1997 3 'ACAD'
1997 3 'ADCT'
1997 3 'ADEX'
I want to obtain exactly the same output as in NI3, but I have to change the code, since my last column is a string.
I am only missing the last part; this is the code I have so far:
ESTIMA=num2cell(I{:,6});
ANALY=num2cell(I{:,7});
YEAR = num2cell(T_ANNDAT3);
fir_ins=[YEAR ESTIMA I{:,1}];
fir_ins= sortrows(fir_ins,2);
[~, in2,~] = unique(strcat(fir_ins(:,2),fir_ins(:, 3)));
fir_ins = fir_ins(in2,:);
fir_ins= sortrows(fir_ins,[1 2]);
fir_ins2=fir_ins(:,1:2);
fir_ins2=unique(cell2mat(fir_ins2(:,1:2)),'rows');
This part is not working:
for jm = 1:size(fir_ins2,1)
    fir_ins2(jm,3) = sum(cell2mat(fir_ins(:,1))) == fir_ins2(jm,1) & cell2mat(fir_ins(:,2)) == cell2mat(fir_ins2(jm,2));
end
You can perform this "aggregation" more efficiently with the help of the accumarray function. The idea is to map the first two columns (the row primary keys) into subscripts (indices starting from 1), then pass those subscripts to accumarray to do the counting.
Below is an example to illustrate. First I start by generating some random data resembling yours:
% here are the columns
n = 150;
c1 = sort(randi([1997 1999], [n 1])); % years
c2 = sort(randi([3 11], [n 1])); % bank code
c3 = randi(5000, [n 1]); % employee ID as a number
c4 = cellstr(char(randi(['A' 'Z']-0, [n,4]))); % employee ID as a string
% combine records (NI)
X = [c1 c2 c3]; % the one with numeric worker ID
X2 = [num2cell([c1 c2]) c4]; % the one with the string worker ID
Note that for our purposes, it doesn't matter if the workers ID column is expressed as numbers or string; we won't be using them, only the first two columns that represent the "primary keys" of the rows are used:
% find the unique primary keys and their subscript mapping
[years_banks,~,ind] = unique([c1 c2], 'rows');
% count occurrences (as in SQL: SELECT COUNT(..) FROM .. GROUP BY ..)
counts = accumarray(ind, 1);
% build final matrix: years, bank codes, counts
M = [years_banks counts];
I got the following result with my fake data:
>> M
M =
1997 3 13
1997 4 11
1997 5 15
1997 6 14
1997 7 4
1998 7 11
1998 8 24
1998 9 15
1999 9 1
1999 10 22
1999 11 20
