Using xlswrite in MATLAB - excel

I am working with three datasets in MATLAB, e.g.,
Dates:
There are D dates that are chars each, but saved in a cell array.
{'01-May-2019','02-May-2019','03-May-2019'....}
Labels:
There are 100 labels that are strings each, but saved in a cell array.
{'A','B','C',...}
Values:
[0, 1, 2,...]
This is one row of the Values matrix of size D×100.
I would like the following output in Excel:
date labels Values
01-May-2019 A 0
01-May-2019 B 1
01-May-2019 C 2
and so on until the same date has been repeated 100 times. The next date then follows (again repeated 100 times), with the 100 labels in the second column and the second row of the Values matrix, transposed, in the third column. This repeats until all D dates have been written.
For the first date, I used:
c_1 = {datestr(datenum(dates(1))*ones(100,1))}
c_2 = labels
c_3 = num2cell(Values(1,:)')
xlswrite('test.xls',[c_1, c_2, c_3])
Unfortunately, this put everything in one column: the dates, then the labels, then the first row of the Values array. I need these in three columns.
I also think the above needs to go in a for loop over each day that I am considering. I tried the table function, but didn't have much luck with it.
How can I solve this efficiently?

You can use repmat and reshape to build your columns and (optionally) add them to a table for exporting.
For example:
dates = {'01-May-2019','02-May-2019'};
labels = {'A','B', 'C'};
values = [0, 1, 2];
n_dates = numel(dates);
n_labels = numel(labels);
dates_repeated = reshape(repmat(dates, n_labels, 1), [], 1);
labels_repeated = reshape(repmat(labels, n_dates, 1).', [], 1);
values_repeated = reshape(repmat(values, n_dates, 1).', [], 1);
full_table = table(dates_repeated, labels_repeated, values_repeated);
Gives us the following table:
>> full_table
full_table =
6×3 table
dates_repeated labels_repeated values_repeated
______________ _______________ _______________
'01-May-2019' 'A' 0
'01-May-2019' 'B' 1
'01-May-2019' 'C' 2
'02-May-2019' 'A' 0
'02-May-2019' 'B' 1
'02-May-2019' 'C' 2
Which should export to a spreadsheet with writetable as desired.
What we're doing with repmat and reshape is "stacking" the values and then converting them into a single column:
>> repmat(dates, n_labels, 1)
ans =
3×2 cell array
{'01-May-2019'} {'02-May-2019'}
{'01-May-2019'} {'02-May-2019'}
{'01-May-2019'} {'02-May-2019'}
We transpose the labels and values so that they interleave across dates (e.g. [0, 1, 2, 0, 1, 2] rather than [0, 0, 1, 1, 2, 2]), because reshape reads elements in column-major order.
If you don't want the intermediate table, you can use num2cell to create a cell array from values so you can concatenate all 3 cell arrays together for xlswrite (or writecell, added in R2019a, the release from which xlswrite is no longer recommended):
values_repeated = num2cell(reshape(repmat(values, n_dates, 1).', [], 1));
full_array = [dates_repeated, labels_repeated, values_repeated];
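To check the expected row ordering outside MATLAB, here is a minimal Python sketch of the same long-format expansion. It assumes one row of values per date, as in the question; the second row of values is made up for illustration. For a full D×100 Values matrix in MATLAB, reshape(Values.', [], 1) should give the same ordering.

```python
from itertools import product

dates = ["01-May-2019", "02-May-2019"]
labels = ["A", "B", "C"]
# One row of values per date (D x n_labels); the second row is hypothetical.
values = [[0, 1, 2],
          [3, 4, 5]]

# product() varies the last index fastest, so each date is paired with
# every label before moving on to the next date.
rows = [
    (dates[d], labels[l], values[d][l])
    for d, l in product(range(len(dates)), range(len(labels)))
]

for row in rows:
    print(*row)
```

Each tuple in rows corresponds to one spreadsheet row (date, label, value).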

Related

How to do a row-wise sum in an array formula in Excel?

Say I have the following basic spreadsheet:
     A   B   C   D
1   -2   4   2  12
2   -1   1   0
3    0   0   0  22
4    1   1   2  12
5    2   4   6
6    3   9  12
The A column has integers from -2 to 3.
The B column has the A column value squared.
The C column is the row sum of A and B so C1 is =SUM(A1:B1).
D1 has =MAX(C1:C6) and this max is the result I need to get with a single formula.
D3 is =MAX(SUM(A1:B6)) entered with Ctrl+Shift+Enter, but it just results in a regular sum.
D4 is =MAX(A1:A6+B1:B6) with ctrl+shift+enter, and this works and gives the correct result of 12.
However the problem with D4 is that I need to be able to handle large dynamic ranges without entering endless sums. Say SUM(A1:Z1000) would be A1:A1000+B1:B1000+....+Z1:Z1000 which is not a reasonable formula.
So how can I do something like =MAX(SUM(A1:Z1000)) such that it would sum the rows A1:Z1 to A1000:Z1000 and give me the final row-wise max?
I can only use base Excel, so no helper columns and no VBA function.
UPDATE
Since there have not been any successful answers, I have to assume it is not possible with current Excel versions.
So I am trying to build this function in VBA and this is what I have so far.
Function MAXROWSUM(Ref As Range) As Double
    Dim Result, tempSum As Double
    Dim Started As Boolean
    Started = False
    Result = 0
    For Each Row In Ref.Rows
        tempSum = Application.WorksheetFunction.Sum(Row)
        If (Started = False) Then
            Result = tempSum
            Started = True
        ElseIf tempSum > Result Then
            Result = tempSum
        End If
    Next Row
    MAXROWSUM = Result
End Function
This function works and is quite fast with less than 100k rows, but if the row count in the range approaches the possible 1 million, the function becomes very slow taking several seconds, even if most of the range is empty.
Is there a way to significantly optimize the current code, by possibly filtering out any empty rows?
In my example if I enter MAXROWSUM(A1:B1000000) it will work, but will be slow, can I make this very fast?
Your solution is Matrix Multiplication, via the MMULT function
How this works is as follows: currently, you have an X*N array/matrix, which you want to change into an X*1 matrix (where each new row is the sum of the rows in the old matrix), and then take the maximum value. To do this, you need to multiply it by an N*1 matrix: the two Ns "cancel out".
The first step is simple: every value in your second matrix is 1
Example: [6*2] ∙ [2*1] = [6*1]
[[-2,  4],               [[ 2],
 [-1,  1],                [ 0],
 [ 0,  0],   ∙  [[1],  =  [ 0],
 [ 1,  1],       [1]]     [ 2],
 [ 2,  4],                [ 6],
 [ 3,  9]]                [12]]
We then MAX this:
=MAX(MMULT(A1:B6,{1;1}))
To generate our second array dynamically (i.e. for any size), we can use the first Row of our table, convert it entirely to 1 (for example, "Column number > 0"), and then TRANSPOSE this to be a column instead of a row. This can be written as TRANSPOSE(--(COLUMN(A1:B1)>0))
Put it all together, and we get:
=MAX(MMULT(A1:B6, TRANSPOSE(--(COLUMN(A1:B1)>0))))
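Since MMULT works with arrays, like SUMPRODUCT, the version with the constant {1;1} should not need to be entered as an Array Formula with Ctrl+Shift+Enter. In Excel versions without dynamic arrays, the TRANSPOSE version may still need Ctrl+Shift+Enter.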
If you wanted to do this column-wise instead, then you would need to swap the arrays around - Matrix Multiplication is not commutative:
=MAX(MMULT(TRANSPOSE(--(ROW(A1:A6)>0)), A1:B6))
[1*6] ∙ [6*2] = [1*2]
                            [[-2,  4],
                             [-1,  1],
[[1, 1, 1, 1, 1, 1]]   ∙     [ 0,  0],   =  [[3, 19]]
                             [ 1,  1],
                             [ 2,  4],
                             [ 3,  9]]
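The two products above can be checked outside Excel. The following Python sketch uses a small hand-written matmul helper (hypothetical, not an Excel or library function) to multiply the question's data by a vector of ones, first on the right for row sums and then on the left for column sums.

```python
# Data from columns A and B of the question's spreadsheet.
matrix = [[-2, 4], [-1, 1], [0, 0], [1, 1], [2, 4], [3, 9]]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


n_rows, n_cols = len(matrix), len(matrix[0])

# (6x2) . (2x1) -> 6x1: one sum per row.
row_sums = matmul(matrix, [[1]] * n_cols)
# (1x6) . (6x2) -> 1x2: one sum per column.
col_sums = matmul([[1] * n_rows], matrix)

max_row_sum = max(r[0] for r in row_sums)
print(max_row_sum)  # 12
print(col_sums)     # [[3, 19]]
```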
SUMPRODUCT (The Unsung Hero)
The following are not array formulas.
=SUMPRODUCT(MAX(A:A+A:A^2))
If it doesn't work (older Excel), change to the exact range e.g.:
=SUMPRODUCT(MAX(A$1:A$1000+A$1:A$1000^2))
I could not find a way to do the function with basic Excel, so I had to use VBA.
I was able to create a reasonably fast VBA function that works.
Function MAXROWSUM(Ref As Range) As Double
    Dim Result As Double, tempSum As Double
    Dim Started As Boolean
    Dim LastDataRow As Long
    Dim Row As Range
    Started = False
    ' Last populated row in the range; rows below it are skipped
    LastDataRow = Ref.Find("*", SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row
    Result = 0
    For Each Row In Ref.Rows
        ' Stop before summing any row past the last row that holds data
        If Row.Row > LastDataRow Then
            Exit For
        End If
        tempSum = Application.WorksheetFunction.Sum(Row)
        If (Started = False) Then
            Result = tempSum
            Started = True
        ElseIf tempSum > Result Then
            Result = tempSum
        End If
    Next Row
    MAXROWSUM = Result
End Function
A key line is: LastDataRow = Ref.Find("*", SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row which finds the last populated row of the data range so that the function stops once it passes it. Entering a range with a million rows where only 100 or 1,000 rows have data will therefore not cause excessive useless calculation.

selecting different columns each row

I have a dataframe with 500K rows, seven day columns (D1 to D7), and a start day and end day for each row.
For each row I want to search for a value (for example, 0) in the columns from startDay to endDay.
For id_1, startDay=1 and endDay=7, so I should search columns D1 to D7.
For id_2, startDay=4 and endDay=7, so I should search columns D4 to D7.
However, I have not managed to search a different column range for each row.
The rules are:
if startDay > endDay, I should get -999;
otherwise, I need to find the first zero within the day range. For id_3, the first zero is in column D2 (day 2) and startDay is 1, so I want 2-1=1 (D2 - startDay);
if there is no 0 in the range, I want 8.
Here is my data;
data = {
'D1':[0,1,1,0,1,1,0,0,0,1],
'D2':[2,0,0,1,2,2,1,2,0,4],
'D3':[0,0,1,0,1,1,1,0,1,0],
'D4':[3,3,3,1,3,2,3,0,3,3],
'D5':[0,0,3,3,4,0,4,2,3,1],
'D6':[2,1,1,0,3,2,1,2,2,1],
'D7':[2,3,0,0,3,1,3,2,1,3],
'startDay':[1,4,1,1,3,3,2,2,5,2],
'endDay':[7,7,6,7,7,7,2,1,7,6]
}
data_idx = ['id_1','id_2','id_3','id_4','id_5',
'id_6','id_7','id_8','id_9','id_10']
df = pd.DataFrame(data, index=data_idx)
What I want to see;
df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)
You can build a boolean array that marks, for each row, the Dx columns that are at or after startDay, at or before endDay, and equal to 0. For the first two conditions, you can use np.ufunc.outer with np.less_equal and np.greater_equal (passing the underlying NumPy arrays, because newer pandas versions may reject a Series here):
import numpy as np
days = np.arange(1, 8)  # day numbers of columns D1..D7
arr_bool = (np.less_equal.outer(df.startDay.values, days)      # startDay <= day
            & np.greater_equal.outer(df.endDay.values, days)   # endDay >= day
            & (df.filter(regex='D[0-9]').values == 0))         # value in Dx is 0
Then you can use np.argmax to find the first True in each row. Adding 1 and subtracting startDay gives the values you are looking for. Finally, use np.select to handle the other conditions: -999 where df.startDay >= df.endDay, and 8 where a row of arr_bool has no True:
df_need = pd.DataFrame((np.argmax(arr_bool, axis=1) + 1 - df.startDay).values,
                       index=data_idx, columns=['need'])
df_need['need'] = np.select(condlist=[df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                            choicelist=[-999, 8],
                            default=df_need['need'])
print (df_need)
need
id_1 0
id_2 1
id_3 1
id_4 0
id_5 8
id_6 2
id_7 -999
id_8 -999
id_9 8
id_10 1
One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select rather than df.startDay > df.endDay as in your question. You can change it to the strict comparison, in which case id_7 gives 8 instead of -999.
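For spot-checking the vectorized result on a few rows, the rules from the question can also be written as a plain-Python reference function. This is only a sketch: the function name and its parameters are hypothetical, and it uses the strict startDay > endDay test from the question.

```python
def first_zero_offset(days, start_day, end_day, missing=8, invalid=-999):
    """days holds the D1..D7 values of one row; start_day and end_day are 1-based."""
    if start_day > end_day:
        return invalid
    for day in range(start_day, end_day + 1):
        if days[day - 1] == 0:
            # Offset of the first zero relative to the start day.
            return day - start_day
    return missing


# Rows id_1, id_2, id_7 and id_8 from the question's data.
r1 = first_zero_offset([0, 2, 0, 3, 0, 2, 2], 1, 7)
r2 = first_zero_offset([1, 0, 0, 3, 0, 1, 3], 4, 7)
r7 = first_zero_offset([0, 1, 1, 3, 4, 1, 3], 2, 2)
r8 = first_zero_offset([0, 2, 0, 0, 2, 2, 2], 2, 1)
print(r1, r2, r7, r8)  # 0 1 8 -999
```

A row-by-row loop like this is far slower than the array approach on 500K rows, so it is only meant for checking a sample.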

how to get a kind of "maximum" in a matrix, efficiently

I have the following problem: I have a matrix loaded with the pandas module, where each cell holds a number between -1 and 1. I want to find, for each row, the largest possible value that is not also the maximum of another row in the same column.
If, for example, two rows have their maximum value in the same column, I compare both values and keep the bigger one. For the row whose maximum is smaller than the other row's, I take its second-largest value instead (and repeat the same analysis again and again).
To explain this better, consider my code:
import numpy as np
import pandas as pd
matrix = pd.read_csv("matrix.csv")
# this matrix has an id (or name) for each column
# ... and the first column has the id of each row
results = pd.DataFrame(np.empty((len(matrix), 3), dtype=object), columns=['id1', 'id2', 'max_pos'])
l = len(matrix.columns)  # number of columns
again = True  # loop flag (named next originally, which shadows a Python built-in)
while again:
    again = False
    for i in range(0, len(matrix)):
        max_column = str(0)
        for j in range(1, l):  # 1 because the first column is an id
            if matrix[max_column][i] < matrix[str(j)][i]:
                max_column = str(j)
        results.at[i, 'id1'] = str(i)  # I could also put matrix['0'][i] here
        results.at[i, 'id2'] = max_column
        results.at[i, 'max_pos'] = matrix[max_column][i]
    for i in range(0, len(results)):  # now check whether two or more rows have the same max column
        for ii in range(0, len(results)):
            # if two id1 have their max in the same column, keep the one with the biggest
            # ... max value and change the other to -1 to iterate again
            if (results['id2'][i] == results['id2'][ii]) and (results['max_pos'][i] < results['max_pos'][ii]):
                matrix.at[i, results['id2'][i]] = -1
                again = True
Putting an example:
#consider
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[4, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})
a b c d
0 1 4 3 1
1 2 5 3 0
2 5 1 4 0
3 0 0 2 1
#at the first iteration I will have the following result
0 b 4 # this means that the row 0 has its maximum at column 'b' and its value is 4
1 b 5
2 a 5
3 c 2
#the problem is that column b is the maximum of row 0 and 1, but I know that the maximum of row 1 is bigger than row 0, so I take the second maximum of row 0, then:
0 c 3
1 b 5
2 a 5
3 c 2
#now I solved the problem for row 0 and 1, but I have that the column c is the maximum of row 0 and 3, so I compare them and take the second maximum in row 3
0 c 3
1 b 5
2 a 5
3 d 1
#now I'm done. If two rows have the same column as maximum and also the same value, nothing happens and I keep those values.
#what if the matrix would be
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[5, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})
a b c d
0 1 5 3 1
1 2 5 3 0
2 5 1 4 0
3 0 0 2 1
#then, at the first iteration the result will be:
0 b 5
1 b 5
2 a 5
3 c 2
#then, given that the max value of row 0 and 1 is at the same column, I should compare the maximum values
# ... but in this case the values are the same (both are 5), this would be the end of iterating
# ... because I can't choose between row 0 and 1 and the other rows have their maximum at different columns...
This code works perfectly for me with a matrix of, for example, 100x100. But if the matrix size goes to 50,000x50,000, the code takes too much time to finish. I know that my code could be the most inefficient way to do it, but I don't know how to deal with this.
I have been reading about threads in Python that could help, but starting 50,000 threads doesn't help because my computer doesn't use more CPU. I also tried functions such as .max(), but I wasn't able to get the column of the maximum and compare it with the other maximums.
If anyone could help me or give me a piece of advice to make this more efficient, I would be very grateful.
Going to need more information on this. What are you trying to accomplish here?
This will help you get some of the way, but in order to fully achieve what you're doing I need more context.
We'll import numpy, random, and Counter from collections:
import numpy as np
import random
from collections import Counter
We'll create a random 50k x 50k matrix of numbers between -10M and +10M
mat = np.random.randint(-10000000,10000000,(50000,50000))
Now to get the maximums for each row we can just do the following list comprehension:
maximums = [max(mat[x,:]) for x in range(len(mat))]
Now we want to find out which ones are not maximums in any other row. We can use Counter on our maximums list to find out how many of each there are. Counter returns an object that works like a dictionary, with the maximum as the key and the number of times it appears as the value.
We then use a dictionary comprehension to keep the entries whose value is equal to 1. That gives us the maximums that only show up once. We use the .keys() method to grab the numbers themselves, and then turn the result into a list.
c = Counter(maximums)
{9999117: 15,
9998584: 2,
9998352: 2,
9999226: 22,
9999697: 59,
9999534: 32,
9998775: 8,
9999288: 18,
9998956: 9,
9998119: 1,
...}
k = list( {x: c[x] for x in c if c[x] == 1}.keys() )
[9998253,
9998139,
9998091,
9997788,
9998166,
9998552,
9997711,
9998230,
9998000,
...]
Lastly, we can use the following list comprehension to iterate through the original maximums list and get the indices of these rows.
indices = [i for i, x in enumerate(maximums) if x in k]
Depending on what else you're looking to do we can go from here.
It's not the speediest program, but finding the maximums, the counter, and the indices takes 182 seconds on a 50,000 by 50,000 matrix that is already loaded.
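The same steps can be tried on the first small matrix from the question. This is a minimal sketch that follows the answer in comparing maximum values (not the columns in which they occur), and it swaps the list k for a set so that the membership test in the last step is constant-time.

```python
from collections import Counter

# First example matrix from the question (columns a, b, c, d).
rows = [[1, 4, 3, 1],
        [2, 5, 3, 0],
        [5, 1, 4, 0],
        [0, 0, 2, 1]]

maximums = [max(row) for row in rows]
counts = Counter(maximums)

# Maximum values that occur in exactly one row.
unique = {m for m, n in counts.items() if n == 1}

# Indices of the rows whose maximum is not shared with another row.
indices = [i for i, m in enumerate(maximums) if m in unique]
print(maximums, indices)  # [4, 5, 5, 2] [0, 3]
```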

How to get data in groupby like SQL having with pandas.

I have data like below.
id, name, password, note, num
1, hoge, xxxxxxxx, aaaaa, 2
2, hoge, xxxxxxxx, bbbbb, 1
3, moge, yyyyyyyy, ccccc, 2
4, zape, zzzzzzzz, ddddd, 3
I would like to build a DataFrame by grouping rows with the same name and password. In this case, 1,hoge and 2,hoge are treated as the same data. Then I would like to get the total 3 from the num column.
I tried like below.
df1 = pd.read_csv("sample.csv")
df2 = df1.groupby(['name','password']).count()
print(df2[df2['note'] > 1])
It goes like this.
name, password, note, num
hoge, xxxxxxxx, 2, 2
How can I get sum of num value?
I believe you need GroupBy.size (or count, to exclude NaN rows) with transform, which returns a new Series of the same size as the original DataFrame, so you can filter with it and then sum:
s = df1.groupby(['name','password'])['note'].transform('size')
# or, to ignore rows where note is NaN:
# s = df1.groupby(['name','password'])['note'].transform('count')
out = df1.loc[s > 1, 'num'].sum()
print (out)
3
If you only want to count duplicated rows, filter with DataFrame.duplicated, specifying the columns to check for duplicates:
out = df1.loc[df1.duplicated(['name','password'], keep=False), 'num'].sum()
print (out)
3
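To make the HAVING-style logic explicit, here is a plain-Python sketch of the same computation without pandas: group by (name, password), keep groups with more than one row, then sum num. The records list is typed in by hand from the sample data.

```python
from collections import defaultdict

# Sample rows from the question.
records = [
    {"id": 1, "name": "hoge", "password": "xxxxxxxx", "note": "aaaaa", "num": 2},
    {"id": 2, "name": "hoge", "password": "xxxxxxxx", "note": "bbbbb", "num": 1},
    {"id": 3, "name": "moge", "password": "yyyyyyyy", "note": "ccccc", "num": 2},
    {"id": 4, "name": "zape", "password": "zzzzzzzz", "note": "ddddd", "num": 3},
]

# GROUP BY name, password
groups = defaultdict(list)
for rec in records:
    groups[(rec["name"], rec["password"])].append(rec["num"])

# HAVING COUNT(*) > 1, then SUM(num) for each surviving group.
sums = {key: sum(nums) for key, nums in groups.items() if len(nums) > 1}
total = sum(sums.values())
print(sums, total)  # {('hoge', 'xxxxxxxx'): 3} 3
```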

Sum values based on first occurrence of other column using excel formula

Let's say I have the following two columns in excel spreadsheet
A B
1 10
1 10
1 10
2 20
3 5
3 5
and I would like to sum the values from B-column that represents the first occurrence of the value in A-column using a formula. So I expect to get the following result:
result = B1+B4+B5 = 35
i.e., sum column B over the rows where each distinct column A value first appears. In my case, if Ai = Aj then Bi = Bj, where i and j are row positions: if two rows have the same value in column A, their corresponding values in column B are also the same. I can sort the data by column A, but I would prefer a formula that works regardless of sorting.
I found this post that refers to the same problem, but I am not able to understand the proposed solution.
Use SUMPRODUCT and COUNTIF:
=SUMPRODUCT(B1:B6/COUNTIF(A1:A6,A1:A6))
Here is the step-by-step explanation:
COUNTIF(A1:A6, A1:A6) produces an array with the frequency of each value in A1:A6. In our case it is {3, 3, 3, 1, 2, 2}.
Then we do the following division: {10, 10, 10, 20, 5, 5}/{3, 3, 3, 1, 2, 2}. The result is {3.33, 3.33, 3.33, 20, 2.5, 2.5}. Each value is replaced by its value divided by the size of its group.
Summing the result gives (3.33+3.33+3.33) + 20 + (2.5+2.5) = 10 + 20 + 5 = 35.
Because the B values are equal within each group, this trick gives the same result as summing just the first element of each group in column A.
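The arithmetic above can be verified with a short Python sketch that computes both the weighted sum (each B value divided by its group's frequency) and the plain sum of first occurrences. Exact fractions are used to avoid the rounding visible in 3.33.

```python
from collections import Counter
from fractions import Fraction

a = [1, 1, 1, 2, 3, 3]
b = [10, 10, 10, 20, 5, 5]

# Equivalent of COUNTIF(A1:A6, A1:A6): frequency of each value in column A.
counts = Counter(a)

# Equivalent of SUMPRODUCT(B1:B6 / COUNTIF(...)).
weighted = sum(Fraction(bi, counts[ai]) for ai, bi in zip(a, b))

# Direct sum of B at the first occurrence of each A value.
seen = set()
first_sum = 0
for ai, bi in zip(a, b):
    if ai not in seen:
        seen.add(ai)
        first_sum += bi

print(weighted, first_sum)  # 35 35
```

The two totals agree only because B is constant within each group of A, as stated in the question.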
To make this dynamic, so it grows and shrinks with the data set use this:
=SUMPRODUCT($B$1:INDEX(B:B,MATCH(1E+99,B:B))/COUNTIF($A$1:INDEX(A:A,MATCH(1E+99,B:B)),$A$1:INDEX(A:A,MATCH(1E+99,B:B))))
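... or just SUMPRODUCT, comparing each cell in column A with the one above it (this assumes the data is sorted by column A and starts in row 2, below a header row):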
=SUMPRODUCT(B2:B7, --(A2:A7<>A1:A6))
