Change code to fit strings - Matlab - string

I have the following code:
NI1=[NI{:,1} NI{:,2} NI{:,3}];
[~,NI2]=sort(NI1(:,2));
NI1=NI1(NI2,:);
NI1((NI1(:,3) == 0),:) = [];
NI1=unique(NI1(:,1:3),'rows');
NI3= unique(NI1(:,1:2),'rows')
for mj=1:size(NI3,1)
NI3(mj,3)=sum(NI1(:,1) == NI3(mj,1) & NI1(:,2)==NI3(mj,2));
end
My initial cell array NI1 has the following columns: 1) the year; 2) a code that corresponds to a bank; 3) a code that corresponds to the workers of the bank. Example:
c1 c2 c3
1997 3 850
1997 3 1024
1997 3 5792
My output NI3 counts how many analysts (c3), for the different years (c1) are working in each bank (c2), for instance:
c1 c2 c3
1997 3 14
1997 7 84
1997 11 15
1998 4 1
1998 15 10
1998 3 12
1999 11 17
Now I am trying to apply exactly the same code, but my last column (c3) is a string, so my initial cell array fir_ins looks like this:
1997 3 'ACAD'
1997 3 'ADCT'
1997 3 'ADEX'
I want to obtain exactly the same output as in NI3, but I have to change the code, since my last column is a string.
I am only missing the last part, this is the code I have so far.
ESTIMA=num2cell(I{:,6});
ANALY=num2cell(I{:,7});
YEAR = num2cell(T_ANNDAT3);
fir_ins=[YEAR ESTIMA I{:,1}];
fir_ins= sortrows(fir_ins,2);
[~, in2,~] = unique(strcat(fir_ins(:,2),fir_ins(:, 3)));
fir_ins = fir_ins(in2,:);
fir_ins= sortrows(fir_ins,[1 2]);
fir_ins2=fir_ins(:,1:2);
fir_ins2=unique(cell2mat(fir_ins2(:,1:2)),'rows');
This part is not working:
for jm=1:size(fir_ins2,1)
fir_ins2(jm,3)=sum(cell2mat(fir_ins(:,1))) == fir_ins2(jm,1) & cell2mat(fir_ins(:,2))==cell2mat(fir_ins2(jm,2));
end

You can perform this "aggregation" more efficiently with the accumarray function. The idea is to map the first two columns (the rows' primary keys) into subscripts (indices starting from 1), then pass those subscripts to accumarray to do the counting.
Below is an example to illustrate. First I start by generating some random data resembling yours:
% here are the columns
n = 150;
c1 = sort(randi([1997 1999], [n 1])); % years
c2 = sort(randi([3 11], [n 1])); % bank code
c3 = randi(5000, [n 1]); % employee ID as a number
c4 = cellstr(char(randi(['A' 'Z']-0, [n,4]))); % employee ID as a string
% combine records (NI)
X = [c1 c2 c3]; % the one with numeric worker ID
X2 = [num2cell([c1 c2]) c4]; % {c1 c2 c4} % the one with string worker ID
Note that for our purposes it doesn't matter whether the worker ID column is expressed as numbers or as strings; we won't be using it. Only the first two columns, which represent the "primary keys" of the rows, are used:
% find the unique primary keys and their subscript mapping
[years_banks,~,ind] = unique([c1 c2], 'rows');
% count occurrences (as in SQL: SELECT COUNT(..) FROM .. GROUP BY ..)
counts = accumarray(ind, 1);
% build final matrix: years, bank codes, counts
M = [years_banks counts];
I got the following result with my fake data:
>> M
M =
1997 3 13
1997 4 11
1997 5 15
1997 6 14
1997 7 4
1998 7 11
1998 8 24
1998 9 15
1999 9 1
1999 10 22
1999 11 20
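For readers who do not use MATLAB, the same unique-keys-then-accumulate idea can be sketched in plain Python. This is only an illustration under assumptions: the function name count_by_key and the sample rows are made up for the example, and the worker ID is ignored exactly as in the answer above, so it may be a number or a string.

```python
def count_by_key(rows):
    """Count rows per (year, bank) key; rows are (year, bank, worker) tuples."""
    # Unique primary keys, sorted like unique(..., 'rows')
    keys = sorted({(year, bank) for year, bank, _ in rows})
    # Map each key to a subscript (0-based here, 1-based in MATLAB)
    index_of = {key: i for i, key in enumerate(keys)}
    # Accumulate a 1 for every row at its key's subscript
    counts = [0] * len(keys)
    for year, bank, _ in rows:
        counts[index_of[(year, bank)]] += 1
    return [(year, bank, n) for (year, bank), n in zip(keys, counts)]


# Hypothetical sample data with string worker IDs
rows = [
    (1997, 3, "ACAD"),
    (1997, 3, "ADCT"),
    (1997, 3, "ADEX"),
    (1997, 7, "ACAD"),
    (1998, 3, "ADCT"),
]
result = count_by_key(rows)
print(result)
```

If duplicate (year, bank, worker) rows should be counted once, as the original code does with unique(...,'rows'), the input can be de-duplicated first with set(rows).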

Related

Is there a way to sort a list so that rows with the same value in one column are evenly distributed?

Hoping to sort (below left) by sector but distribute evenly (below right):
Name  Sector     Name  Sector
A     1          A     1
B     1          E     2
C     1          H     3
D     4          D     4
E     2          B     1
F     2          F     2
G     2          J     3
H     3          I     4
I     4          C     1
J     3          G     2
Real data is 70+ rows with 4 sectors.
I've worked around it manually but would love to figure out how to do it with a formula in excel.
Here's a more complete (and hopefully more accurate) idea - the carouselOrder is the column I'd like to generate via a formula.
guestID  guestSector  carouselOrder
1        1            1
2        1            5
3        1            9
4        1            13
5        2            2
6        2            6
7        2            10
8        2            14
9        3            3
10       3            7
11       3            11
12       2            18
13       1            17
14       1            20
15       1            23
16       2            21
17       2            24
18       2            27
19       1            26
20       1            29
21       1            30
22       1            31
23       3            15
24       3            19
25       3            22
26       3            25
27       3            28
28       1            32
29       4            4
30       4            8
31       4            12
32       4            16
When using Office 365 you can use the following in D2: =MOD(SEQUENCE(COUNTA(A2:A11),,0),4)+1
This creates a repeating counter over the sectors 1 to 4, with one value per row of your data.
In C2 use the following:
=BYROW(D2#,LAMBDA(x,
INDEX(
FILTER($A$2:$A$11,$B$2:$B$11=x),
SUM(--(D$2:x=x)))))
This filters the Names that equal the sector of mentioned row and indexes it to show only the result where the row in the filter result equals the count of the same sector (D2#) up to current row.
Let's try the following approach, which doesn't require creating a helper column. I would like to explain first the logic behind the recurrence, then the Excel formula that builds it.
If we sort the input data (Name and Sector.) by Sector. in ascending order, the new positions of the Name values (letters) can be calculated as follows (Table 1):
Name  Sector.Sorted  Position
A     1              1+4*0=1
B     1              1+4*1=5
C     1              1+4*2=9
E     2              2+4*0=2
F     2              2+4*1=6
G     2              2+4*2=10
H     3              3+4*0=3
J     3              3+4*1=7
D     4              4+4*0=4
I     4              4+4*1=8
The new positions of Name (letters) follows this pattern (Formula 1):
position = Sector.Sorted + groupSize * factor
where groupSize is 4 in our case and factor counts how many times the same Sector.Sorted value is repeated, starting from 0. Think about Sector.Sorted as groups, where each set of repeated values represents a group: 1,2,3 and 4.
If we are able to build the Position values we can sort Name, based on the new positions via SORTBY(array, by_array1) function. Check SORTBY documentation for more information how this function works.
Here is the formula to get the Name sorted in cell E2:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
seq0, SEQUENCE(ROWS(sSector),,0), mapResult,
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW")))), factor,
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Here is the output (the names spilled from E2, top to bottom): A, E, H, D, B, F, J, I, C, G.
Explanation
The name sorted represents the input data sorted by Sector. in ascending order, i.e.: SORT(A2:B11,2). The names sName and sSector represent each column of sorted.
To identify each group we need the following sequence (seq0) starting from 0, i.e. SEQUENCE(ROWS(sSector),,0).
Now we need to identify when a new group starts. We use MAP function for that and the result is represented by the name mapResult:
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW"))))
The logic is the following: if we are at the beginning of the sequence (the first value of seq0), we return SAME. Otherwise we check the current value of sSector (a) against the previous one, given by INDEX(sSector,b). If they are equal, we are still in the same group; otherwise a new group has started.
The intermediate result of mapResult is:
Name  Sector Sorted  mapResult
A     1              SAME
B     1              SAME
C     1              SAME
E     2              NEW
F     2              SAME
G     2              SAME
H     3              NEW
J     3              SAME
D     4              NEW
I     4              SAME
The first two columns are shown just for illustrative purpose, but mapResult only returns the last column.
Now we just need to create the counter based on every time we find NEW. In order to do that we use SCAN function and the result is stored under the name factor. This value represents the factor we use to multiply by 4 within each group (see Table 1):
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0)))
The accumulator starts at -1 because the counter starts at 0. Every time we find SAME, the previous value is incremented by 1. When we find NEW (anything other than SAME), the accumulator is reset to 0.
Here is the intermediate result of factor:
Name  Sector Sorted  mapResult  factor
A     1              SAME       0
B     1              SAME       1
C     1              SAME       2
E     2              NEW        0
F     2              SAME       1
G     2              SAME       2
H     3              NEW        0
J     3              SAME       1
D     4              NEW        0
I     4              SAME       1
The first three columns are shown for illustrative purpose.
Now we have all the elements to build our pattern for the new positions represented with the name pos:
MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n))
where m represents each element of Sector.Sorted and factor the previous calculated values. As you can see the formula in Excel represents the generic formula (Formula 1 see above). The intermediate result will be:
Name  Sector Sorted  mapResult  factor  pos
A     1              SAME       0       1
B     1              SAME       1       5
C     1              SAME       2       9
E     2              NEW        0       2
F     2              SAME       1       6
G     2              SAME       2       10
H     3              NEW        0       3
J     3              SAME       1       7
D     4              NEW        0       4
I     4              SAME       1       8
The previous columns are shown just for illustrative purpose. Now we have the new positions, so we are ready to sort based on the new positions for Name via:
SORTBY(sName,pos)
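The steps above (sort by sector, count repeats within each sector, compute pos, sort by pos) can also be sketched outside Excel. The following Python version is only an illustration of Formula 1, not a translation of the LET formula; the function name distribute_evenly is made up for the example, and it assumes the sector values are the integers 1 to group_size.

```python
def distribute_evenly(names, sectors, group_size):
    """Reorder names so that sectors repeat in a round-robin pattern."""
    # Stable sort of row indices by sector, like SORT(A2:B11, 2)
    order = sorted(range(len(names)), key=lambda i: sectors[i])
    seen = {}      # how many times each sector has appeared so far
    keyed = []
    for i in order:
        factor = seen.get(sectors[i], 0)        # 0 for the first row of a group
        seen[sectors[i]] = factor + 1
        pos = sectors[i] + group_size * factor  # Formula 1
        keyed.append((pos, names[i]))
    keyed.sort(key=lambda pair: pair[0])        # like SORTBY(sName, pos)
    return [name for _, name in keyed]


# The ten rows from the question
names = list("ABCDEFGHIJ")
sectors = [1, 1, 1, 4, 2, 2, 2, 3, 4, 3]
ordered = distribute_evenly(names, sectors, group_size=4)
print(ordered)  # ['A', 'E', 'H', 'D', 'B', 'F', 'J', 'I', 'C', 'G']
```

Because every sector value lies between 1 and group_size, no two rows can receive the same pos, so the final sort is unambiguous.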
Update
The first MAP can be removed by giving SCAN an input array that carries both the sSector value and the index position used to find the previous element. SCAN only accepts a single array as its input argument, so we combine both pieces of information into one new array. This formula can be used instead:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
factor, SCAN(-1,sSector&"-"&SEQUENCE(ROWS(sSector),,0),
LAMBDA(aa,b, LET(s, TEXTSPLIT(b,"-"),item, INDEX(s,,1),
idx, INDEX(s,,2), IF(aa=-1, 0, IF(1*item=INDEX(sSector, idx), aa+1,0))))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Inside SCAN we use a LET function to calculate everything the LAMBDA needs for the comparison. We extract the item and the idx position used to find the previous element of sSector, and then with:
1*item=INDEX(sSector, idx)
we can compare each element of sSector with the previous one, starting from the second element. We multiply item by 1 because TEXTSPLIT returns text; without the conversion the comparison would fail.

want to calculate the count of pass instances of data set using python pandas

x=[]
y1=[]
r1=len(df)
L1=len(df.columns)
for i in range(r1):
    ll=(df.loc[i,'LL'])
    ul=(df.loc[i,'UL'])
    count1 =0
    for j in range(5,L1):
        if isinstance(df.iloc[i,j],str):
            df.loc[i,j]=0
        if ll<=df.iloc[i,j]<=ul:
            count1=count1+1
    if count1==(L1-5):
        x.append('Pass')
    else:
        x.append('Fail')
    y1.append(count1)
se = pd.Series(x)
se1=pd.Series(y1)
df['Min']=min1.values
df['Mean']=mean1.values
df['Median']=median1.values
df['Max']=max1.values
df['Pass Count']=se1.values
df['Result']=se.values
min1 = df.iloc[:,5:].min(axis=1)
mean1=df.iloc[:,5:].astype(float).mean(axis=1,skipna = True)
median1=df.iloc[:,5:].astype(float).median(axis=1,skipna = True)
max1=df.iloc[:,5:].max(axis=1)
count1=df.iloc[:,5:].count(axis=1)
yield1=[]
for i in range(len(se1)):
    yd1=(se1[i]/(L1-3))*100
    yield1.append(yd1)
se2=pd.Series(yield1)
df['Yield']=se2.values
df1=df.loc[:,['PARAMETER','Min','Mean','Median','Max','Result','Pass Count','Yield']]
df1
Below is my data set, it is sensor data on daily basis. Daily data should be within the Lower Limit (LL) and Upper Limit(UL). I want to count how many days sensors data is within the LL and UL.
I am not able to calculate the number of days for sensor data within LL and UL using Pandas. How can I calculate the number of days for sensor data within LL and UL?
A few key ideas:
you need a list of the columns that go into the calculation (daycols)
transpose these columns, then test them against LL and UL, which gives a boolean array
sum this boolean array and you have the desired count
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""sensor location,LL,UL,day1,day2,day3,day4,day5,day6,day7,number of days sensor data within LL and UL
A,1,10,12,6,9,4,9,7,15,5
B,1,12,4,15,7,1,11,1,7,6
C,1,15,13,13,13,10,7,13,13,7
D,1,10,12,1,14,12,15,4,4,3
E,1,20,11,15,8,14,1,14,14,7"""))
daycols = [d for i,d in enumerate(df.columns) if "day" in d and "number" not in d]
df = df.assign(
    # use fact true is 1 so sum a truth array gives the answer
    daysBetween=lambda dfa: ((dfa.loc[:,daycols].T>=dfa["LL"]) &
                             (dfa.loc[:,daycols].T<=dfa["UL"])).sum()
)
print(df.to_string(index=False))
output
sensor location LL UL day1 day2 day3 day4 day5 day6 day7 number of days sensor data within LL and UL daysBetween
A 1 10 12 6 9 4 9 7 15 5 5
B 1 12 4 15 7 1 11 1 7 6 6
C 1 15 13 13 13 10 7 13 13 7 7
D 1 10 12 1 14 12 15 4 4 3 3
E 1 20 11 15 8 14 1 14 14 7 7
speed up
If you have many columns, you can slice df.columns to identify them and turn them into positional indexes so iloc can be used. The transpose is then unnecessary, provided the limits are compared row-wise (axis=0) and the result is summed across the columns (axis=1).
# day1..day7 sit at positions 3 to 9 of the loaded frame
dayi = [df.columns.get_loc(c) for c in df.columns[3:10]]
df = df.assign(
    # use fact true is 1 so sum a truth array gives the answer
    daysBetween=lambda dfa: (dfa.iloc[:,dayi].ge(dfa["LL"], axis=0) &
                             dfa.iloc[:,dayi].le(dfa["UL"], axis=0)).sum(axis=1)
)
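The question also asks for a pass count and a Pass/Fail result per row. A minimal sketch of that, assuming the same sample data as above (without the hand-computed column) and using NumPy broadcasting instead of a transpose, might look like this. The column names 'Pass Count' and 'Result' come from the question; the rule that a row passes only when every day is within limits is taken from the question's loop.

```python
import io

import numpy as np
import pandas as pd

csv = """sensor location,LL,UL,day1,day2,day3,day4,day5,day6,day7
A,1,10,12,6,9,4,9,7,15
B,1,12,4,15,7,1,11,1,7
C,1,15,13,13,13,10,7,13,13
D,1,10,12,1,14,12,15,4,4
E,1,20,11,15,8,14,1,14,14"""
df = pd.read_csv(io.StringIO(csv))

day_cols = [c for c in df.columns if c.startswith("day")]
values = df[day_cols].to_numpy()        # shape (rows, days)
lower = df["LL"].to_numpy()[:, None]    # shape (rows, 1), broadcast across days
upper = df["UL"].to_numpy()[:, None]

within = (values >= lower) & (values <= upper)   # boolean array, one cell per reading
df["Pass Count"] = within.sum(axis=1)            # True counts as 1
df["Result"] = np.where(within.all(axis=1), "Pass", "Fail")
print(df[["sensor location", "Pass Count", "Result"]].to_string(index=False))
```

For this data the pass counts are 5, 6, 7, 3 and 7, so only sensors C and E pass on all seven days.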

Groupby and create a new column by randomly assign multiple strings into it in Pandas

Let's say I have students infos id, age and class as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business, or science to it, so that all rows of the same class get the same major.
We may need to use apply(lambda x: random.choice..) to achieve this, but I don't know how to do it. Thanks for your help.
Output expected:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d
Use numpy.random.choice, with the number of values equal to the length of the DataFrame:
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: to get the same major value within each group, use Series.map with a dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math
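If the assignment needs to be reproducible, one option is a seeded NumPy Generator. The sketch below rebuilds the sample frame from the question and applies the same map-with-dictionary idea; the seed value 0 is an arbitrary assumption, and the drawn majors will differ from the output shown above.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [23, 24, 25, 22, 16, 16],
    "class": ["a", "a", "b", "b", "c", "d"],
})

majors = ["math", "art", "business", "science"]
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatable draws

classes = df["class"].unique()
picked = rng.choice(majors, size=len(classes))         # one draw per class
df["major"] = df["class"].map(dict(zip(classes, picked)))
print(df)
```

By default the draws are made with replacement, so two classes can share a major. Passing replace=False to rng.choice gives each class a distinct major, which works only while there are no more classes than majors.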

Keeping rows from the previous year

I am just going to give it a try, as I know there are some smart people here who might have R code for this.
I won't be able to code this by myself.
So I got a dataset that contains the names and years-months between 2000-01 and 2008-12. Looking like this:
Name Date
A 2000-01
A 2000-02
A ...
A 2008-12
A 2000-01
B 2000-01
B ...
B 2008-12
C and so on..
It can happen that for each name there is one value in my key column for each year. That's the best I can ask for. Unfortunately, some years don't have a value in my key column.
Looking further into my dataset, only at Name A:
So, if I do not have one observation for every year between 2000 and 2008, I want to get the row for the year and month that has no value in my key column, where the month is taken from the next year's observation.
In this example:
2003-02 has a value for my keycolumn and 2002-02 does not, I want to get back the row from the date 2002-02 and Name A.
In a nutshell: "Keeping rows from the previous year based on key column from the next year"
Is there some easy way to code this?
Thank you :)
There's no straightforward and easy way to code what you're describing, but it's certainly possible to break the problem down into easier parts. The core part of the problem is as follows. Given a dataframe of rows with non-NA values, e.g.
year month
1 2002 12
2 2005 11
3 2006 01
4 2008 07
for each row, check the dataframe to see if the previous year exists; if yes, return the row, if no, return an additional row with the previous year and the same month. Here's what a function to do that might look like
check_ym <- function(y, m, dat) {
  if ((y - 1) %in% dat$year) {
    return(data.frame(Date = paste(y, m, sep = "-"), stringsAsFactors = FALSE))
  } else {
    return(data.frame(Date = paste(c(y - 1, y), c(m, m), sep = "-"), stringsAsFactors = FALSE))
  }
}
Now, let's make some fake data.
library(dplyr)
library(tidyr)
library(purrr)
# Simulate data
set.seed(123)
x <- data.frame(Date = paste(sample(2000:2008, 4),
                             sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                             sep = "-"),
                KeyColumn = floor(runif(4, 1, 10)))
d <- data.frame(Date = paste(rep(2000:2008, each = 12),
                             sprintf("%02d", rep(1:12, times = 9)),
                             sep = "-")) %>%
  left_join(x)
Identify the non-NA rows:
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
# year month KeyColumn
# 1 2002 12 5
# 2 2005 11 5
# 3 2006 01 5
# 4 2008 07 9
Then, we run the function above, iterating through the year and month columns. This gives us
out <- map2_df(dd$year, dd$month, .f = check_ym, dat = dd)
out
# Date
# 1 2001-12
# 2 2002-12
# 3 2004-11
# 4 2005-11
# 5 2006-01
# 6 2007-07
# 7 2008-07
Finally, we join this with our original data:
inner_join(out, d)
# Joining, by = "Date"
# Date KeyColumn
# 1 2001-12 NA
# 2 2002-12 5
# 3 2004-11 NA
# 4 2005-11 5
# 5 2006-01 5
# 6 2007-07 NA
# 7 2008-07 9
This is just for one Name. We can also do this for many Names. First create some fake data:
# Simulate data
set.seed(123)
d <- map_df(setNames(1:3, LETTERS[1:3]), function(...) {
  x <- data.frame(Date = paste(sample(2000:2008, 4),
                               sprintf("%02d", sample(1:12, 4, replace = TRUE)),
                               sep = "-"),
                  KeyColumn = floor(runif(4, 1, 10)))
  data.frame(Date = paste(rep(2000:2008, each = 12),
                          sprintf("%02d", rep(1:12, times = 9)),
                          sep = "-")) %>%
    left_join(x)
}, .id = "Name")
dd <- d %>%
  na.omit() %>%
  separate(Date, into = c("year", "month")) %>%
  mutate(year = as.numeric(year))
dd
# Name year month KeyColumn
# 1 A 2002 12 5
# 2 A 2005 11 5
# 3 A 2006 01 5
# 4 A 2008 07 9
# 5 B 2000 04 6
# 6 B 2004 01 7
# 7 B 2005 12 9
# 8 B 2006 03 9
# 9 B 2000 04 6
# 10 C 2003 12 1
# 11 C 2005 04 7
# 12 C 2006 11 5
# 13 C 2008 02 8
Now, use split to split the dataframe into three dataframes by Name; for each sub-dataframe, we apply check_ym(), and then we combine the results together and join it with the original data:
lapply(split(dd, dd$Name), function(dat) {
  map2_df(dat$year, dat$month, .f = check_ym, dat = dat)
}) %>%
  bind_rows(.id = "Name") %>%
  inner_join(d)
# Joining, by = c("Name", "Date")
# Name Date KeyColumn
# 1 A 2001-12 NA
# 2 A 2002-12 5
# 3 A 2004-11 NA
# 4 A 2005-11 5
# 5 A 2006-01 5
# 6 A 2007-07 NA
# 7 A 2008-07 9
# 8 B 2000-04 6
# 9 B 2003-01 NA
# 10 B 2004-01 7
# 11 B 2005-12 9
# 12 B 2006-03 9
# 13 C 2002-12 NA
# 14 C 2003-12 1
# 15 C 2004-04 NA
# 16 C 2005-04 7
# 17 C 2006-11 5
# 18 C 2007-02 NA
# 19 C 2008-02 8
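For comparison, the rule that check_ym() applies can be sketched in a few lines of plain Python. This is an illustration for a single Name only; the function name rows_to_keep is made up for the example, and the input is the list of (year, month) pairs that do have a key-column value, taken from the Name A output above.

```python
def rows_to_keep(observed):
    """Return (year, month) pairs to keep, adding the previous year when it is missing.

    observed: (year, month) pairs that have a key-column value, for one Name.
    """
    years = {year for year, _ in observed}
    keep = []
    for year, month in observed:
        if (year - 1) not in years:
            # No observation in the previous year: keep that year's row, same month
            keep.append((year - 1, month))
        keep.append((year, month))
    return keep


# Non-NA rows for Name A from the example above
observed_a = [(2002, 12), (2005, 11), (2006, 1), (2008, 7)]
kept_a = rows_to_keep(observed_a)
for year, month in kept_a:
    print(f"{year}-{month:02d}")
```

As in the R version, the previous-year test looks only at the year, not at the month, so 2006-01 adds no extra row because 2005 already has an observation.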

How to find exponential formula coefficients?

I have the following pairs of values:
X Y
1 2736
2 3124
3 3560
4 4047
5 4594
6 5205
7 5890
8 6658
9 7518
10 8480
18 21741
32 108180
35 152237
36 170566
37 191068
38 214087
39 239838
40 268679
When I put these pairs in Excel, I get an exponential formula:
Y = 2559*e^(0.1167*X)
with an accuracy of 99,98%.
Is there a way to ask from Excel to provide a formula in the following format:
Y = (A/B)*C^X-D
If not, is it possible to convert the above formula to the wanted one?
Note, that I am not familiar with Matlab.
You already have it!
A = 2559
B = 1
C = exp(0.1167)
D = 0
You'll see that it is equivalent to your formula Y = 2559*e^(0.1167*X), because e^(0.1167*X) = (e^0.1167)^X
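The equivalence can be checked numerically. The short Python sketch below evaluates both forms at the X values from the question and compares them; the function names excel_form and target_form are made up for the example, and the coefficients are the ones given in the answer.

```python
import math

# Coefficients from the answer above
A = 2559.0
B = 1.0
C = math.exp(0.1167)
D = 0.0


def excel_form(x):
    """Trendline as reported by Excel: Y = 2559*e^(0.1167*X)."""
    return 2559.0 * math.exp(0.1167 * x)


def target_form(x):
    """Requested shape: Y = (A/B)*C^X - D."""
    return (A / B) * C ** x - D


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 18, 32, 35, 36, 37, 38, 39, 40]
max_rel_diff = max(abs(excel_form(x) - target_form(x)) / excel_form(x) for x in xs)
print(round(C, 4), max_rel_diff)
```

C comes out at roughly 1.1238, and the two forms agree up to floating-point rounding. Note that this only rewrites Excel's fit: with D fixed at 0 and B at 1 it does not search for a better-fitting D.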
