Averaging in Matlab while excluding NaN AND other specific values - excel

I have an Excel spreadsheet with columns of values that represent different variables in an experimental setup. For example, one column in my data may be called "reaction time" and consequently contain values representative of time in milliseconds. If a problem occurs during the trial and no value is recorded for the reaction time, Matlab calls this "NaN." I know that I can use:
data = xlsread('filename.xlsx')
reaction_time = data(:,3)
average_reaction_time = mean(reaction_time, 'omitnan')
This returns the average of the values in the "reaction time" column of my spreadsheet (column 3). It skips over anything that isn't a number (NaN, in the case of an error during the experiment).
Here's what I need help with:
In addition to excluding NaNs, I also need to leave out certain other values. For example, one type of error results in a "1 ms" reaction time being recorded in the spreadsheet. How can I specify that I want to leave out NaNs, "1"s, and any other values I choose?
Thanks in advance,
Mickey

One option for you might be to try the standardizeMissing function to replace the values that you want to exclude with NaN prior to using mean with 'omitnan'. For instance:
>> x = 1:10;
>> x = standardizeMissing(x, [3 4 5]); % Treat 3, 4, and 5 as missing values
x =
1 2 NaN NaN NaN 6 7 8 9 10
>> y = mean(x, 'omitnan');
If you read your Excel sheet into a table, standardizeMissing can replace the values with NaN only in the column you care about if you use the DataVariables Name-Value pair.
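For readers working outside Matlab, the same idea (drop NaN plus a chosen set of sentinel values, then average what is left) can be sketched in Python with only the standard library. The function name mean_excluding and the sample reaction times are hypothetical, not taken from the question's spreadsheet.

```python
import math

def mean_excluding(values, excluded=()):
    """Average values, skipping NaN and anything listed in `excluded`."""
    # NaN is filtered first, so the membership test never sees a NaN.
    kept = [v for v in values if not math.isnan(v) and v not in excluded]
    if not kept:
        return math.nan  # nothing left to average
    return sum(kept) / len(kept)

# Hypothetical data: NaN marks a missing trial, 1.0 marks the "1 ms" error code.
reaction_time = [350.0, math.nan, 1.0, 410.0, 1.0, 380.0]
average_reaction_time = mean_excluding(reaction_time, excluded={1.0})
print(average_reaction_time)
```

Passing the sentinel values as a set keeps the call site close to the standardizeMissing example: the list of values to ignore is an argument rather than something hard-coded.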

Related

Finding Closest Available Non-0 or NA value

I have an excel dataset that looks something like this:
Variable 1.2018 2.2018 3.2018 ...
A 4 5 8 ...
B 4 5 n.a ...
C 4 0 5 ...
D 4 n.a 9 ...
On a separate sheet I have a summary table that extracts numbers from this dataset using an index match function.
However, I would like the function to skip 0 and n.a values. For example, I would ideally compare growth between A and B at 3.2018, but variable B contains n.a there, so the comparison wouldn't be very useful. In this case I would rather compare A and B at 2.2018 instead.
Variable 3.2017 3.2018 Growth
A 5 8 60%
B 5 n.a #VALUE
Variable 2.2017 2.2018 Growth
A 3 5 66%
B 4 5 25%
In the other case, say I were comparing between C and D. If I were to compare them at 3.2018, I would have no problems because they do not contain 0 or n.a values. However if I were to compare them at 2.2018, then I would want the formula to take the values from 1.2018 instead.
In the above cases, I would also like to know when it is the case that the values do not come from the 'ideal' time frame.
I tried to do an "if" before the index match but in the case of the first example it will only change the number of B and not A. It also does not work if I have 2 or more 0's or na's in a row.
Use an IF() function in which you check whether either of your 3.2018 values is 0 or #N/A (assuming these are actual Excel #N/A values, and not a string representation like "n.a."). If either check is true, use the 2.2018 values; otherwise use the 3.2018 values:
=IF( OR(IFNA(D3=0, TRUE), IFNA(D2=0, TRUE)), C2=C3, D2=D3)
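The formula above only steps back one period. The more general behaviour the question asks for (walk back until both variables have a usable value, and report when the period used is not the preferred one) can be sketched in Python. This is an illustration of the logic rather than an Excel solution; the function name latest_usable_period is hypothetical, and "usable" is assumed to mean a number other than 0.

```python
def latest_usable_period(periods, row_a, row_b, preferred):
    """Walk back from the preferred period until both rows hold a usable value.

    Returns (period, fell_back), where fell_back is True when the period
    returned is not the preferred one. Returns (None, True) if no period works.
    """
    def usable(value):
        # Assumption: missing entries are stored as text such as "n.a".
        return isinstance(value, (int, float)) and value != 0

    start = periods.index(preferred)
    for i in range(start, -1, -1):
        if usable(row_a[i]) and usable(row_b[i]):
            return periods[i], periods[i] != preferred
    return None, True

# Data copied from the question's example table.
periods = ["1.2018", "2.2018", "3.2018"]
A = [4, 5, 8]
B = [4, 5, "n.a"]
C = [4, 0, 5]
D = [4, "n.a", 9]

print(latest_usable_period(periods, A, B, "3.2018"))
print(latest_usable_period(periods, C, D, "2.2018"))
```

The second element of the result is the flag the question asks for: it tells you when the values do not come from the 'ideal' time frame.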

Column of NaN created when concatenating series into dataframe

I've created an output variable 'a = pd.Series()', then run a number of simulations using a for loop that append the results of the simulation, temporarily stored in 'x', to 'a' in successive columns, each renamed to coincide with the simulation number, starting at the zero-th position, using the following code:
a = pandas.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
0 0 1 2
0 NaN 0.136799 0.135325 -0.174987
1 NaN -0.010517 0.108798 0.003726
2 NaN 0.116757 0.030352 0.077443
3 NaN 0.148347 0.045051 0.211610
4 NaN 0.014309 0.074419 0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you are creating an empty dataset. This is carried over in the concatenation, with the empty dataset's size being defined as the same size (well, same number of rows) as x[sim_count]. The only way Python/Pandas knows to represent this "empty" series is by using a series of NaN values. When you concatenate you are effectively saying: I want to add my new dataframe/series onto the "empty" series...and the empty series just gets NaN.
A more effective way of doing this is to initialise "a" as an empty DataFrame and then concatenate:
import pandas as pd
a = pd.DataFrame()
a = pd.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works while pd.Series() forces a column of NaNs. My understanding is that the dataframe creates an empty place in memory for the data to be added (i.e. you are putting your new data INTO an empty dataframe), whereas when you do pd.concat([pd.Series(), x.rename(sim_count)], axis=1) you are telling pandas that the empty series (pd.Series()) is important and should be retained, and that the new data should be added ONTO "a". Hence the column of NaNs.
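A further option, not mentioned in the answer, is to avoid the empty starting object altogether: collect each simulation's Series in a list and call pd.concat once after the loop. The sketch below assumes three simulations, and the values in x are placeholders for the real simulation output.

```python
import pandas as pd

results = []
for sim_count in range(3):
    # Placeholder for the real simulation output (an assumption for this sketch).
    x = pd.Series([0.1 * (sim_count + 1), 0.2, 0.3])
    # Name each Series after its simulation number so it becomes the column label.
    results.append(x.rename(sim_count))

# One concat at the end: no empty Series, so no leading NaN column.
a = pd.concat(results, axis=1)
print(a.columns.tolist())
```

Concatenating once also avoids rebuilding the whole frame on every iteration, which matters when the number of simulations is large.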

How to match two sets of data by dates which do not synchronise and include missing values in Excel

Please forgive any errors or shortcomings in this question, it's my first on stackoverflow.
I have two sets of data in Excel of differing lengths and frequency, and would like to be able to place a value of 0 for where they don't synchronise, and match the rest.
For example, dataset 1 could be:
Date Set1
01-01-2010 10
01-03-2010 4
01-04-2010 8
01-05-2010 5
01-06-2010 10
01-09-2010 12
01-10-2010 9
01-11-2010 4
And dataset 2 could be:
Date Set2
01-03-2010 102
01-06-2010 104
01-10-2010 102
I'm looking for an output table that displays the values alongside each other for dates matching, 0 otherwise, like so:
Date Set1 Set2
01-01-2010 10 0
01-03-2010 4 102
01-04-2010 8 0
01-05-2010 5 0
01-06-2010 10 104
01-09-2010 12 0
01-10-2010 9 102
01-11-2010 4 0
I can't seem to be able to crack this with my limited knowledge and the lack of synchronisation in the data. Any help would be much appreciated, thanks.
You can do this using a VLOOKUP nested in an IFERROR statement.
The two equations used (and dragged down to last unique date row) are:
H3: =IFERROR(VLOOKUP(G3,A:B,2,0),0)
I3: =IFERROR(VLOOKUP(G3,D:E,2,0),0)
This will not work if you have duplicate dates in the same data set with varying values since VLOOKUP will always return the first matched value (reading top down).
Place Set1 in A1:B9 (header in row 1). Add a column of zeros next to it in column C, so A2:A9 is dates, B2:B9 is values and C2:C9 is zeros.
Place Set2 (without the header) in A10:B12; move the Set2 data to column C and put zeros in column B, so A10:A12 is dates, B10:B12 is zeros, C10:C12 is values.
Sort the range A2:C12 by Date (column A).
Easier to show with a screenshot but newbies are not allowed to post images.
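The lookup-with-default idea behind IFERROR(VLOOKUP(...),0) can also be sketched outside Excel. The Python below uses the two example datasets from the question and assumes the dates are in day-month-year order; the helper name parse is hypothetical.

```python
from datetime import datetime

# Example data copied from the question.
set1 = {
    "01-01-2010": 10, "01-03-2010": 4, "01-04-2010": 8, "01-05-2010": 5,
    "01-06-2010": 10, "01-09-2010": 12, "01-10-2010": 9, "01-11-2010": 4,
}
set2 = {"01-03-2010": 102, "01-06-2010": 104, "01-10-2010": 102}

def parse(date_text):
    # Assumption: dates are written day-month-year.
    return datetime.strptime(date_text, "%d-%m-%Y")

# Every date that appears in either set, in chronological order.
all_dates = sorted(set(set1) | set(set2), key=parse)

# dict.get(date, 0) plays the role of IFERROR(VLOOKUP(...), 0).
merged = [(d, set1.get(d, 0), set2.get(d, 0)) for d in all_dates]

for row in merged:
    print(*row)
```

Like VLOOKUP, a dictionary holds a single value per date, so duplicate dates within one dataset would need to be resolved before this step.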

How to fill missing data in excel time series

I need a hand with this problem: in an Excel workbook I recorded 10 time series (with monthly frequency) of 10 titles that should cover the past 15 years. Unfortunately, not all titles cover the full 15 years. For example, one title only goes back to 2003, so in the column for that title the first 5 years show "Not Available" instead of a value. Once I have imported the data into Matlab, NaN naturally appears in the column of the title with the shorter series wherever there are no values.
>> Prices = xlsread('PrezziTitoli.xls');
>> whos
Name Size Bytes Class Attributes
Prices 182x10 6360 double
My goal is to estimate the variance-covariance matrix; however, because of the missing data, I cannot perform the calculation. I thought of interpolating before computing the variance-covariance matrix, to fill in the values that Matlab returns as NaN, for example with "fillts", but I am having difficulty using it.
Is there some code that could be useful to me? Can you help me?
Thanks!
Do you have the statistics toolbox installed? In that case, the solution is simple:
>> x = randn(10,4); % x is a 10x4 matrix of random numbers
>> x(randi(40,10,1)) = NaN; % set some random entries to NaN
>> disp(x)
-1.1480 NaN -2.1384 2.9080
0.1049 -0.8880 NaN 0.8252
0.7223 0.1001 1.3546 1.3790
2.5855 -0.5445 NaN -1.0582
-0.6669 NaN NaN NaN
NaN -0.6003 0.1240 -0.2725
-0.0825 0.4900 1.4367 1.0984
-1.9330 0.7394 -1.9609 -0.2779
-0.4390 1.7119 -0.1977 0.7015
-1.7947 -0.1941 -1.2078 -2.0518
>> nancov(x) % Compute covariances after removing every row that contains a NaN
1.2977 0.0520 1.6248 1.3540
0.0520 0.5359 -0.0967 0.3966
1.6248 -0.0967 2.2940 1.6071
1.3540 0.3966 1.6071 1.9358
>> nancov(x, 'pairwise') % Compute covariances pairwise, ignoring NaNs
1.9195 -0.5221 1.4491 -0.0424
-0.5221 0.7325 -0.1240 0.2917
1.4491 -0.1240 2.1454 0.2279
-0.0424 0.2917 0.2279 2.1305
If you don't have the statistics toolbox, we need to think harder - let me know!
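Without the toolbox, the pairwise approach can be written by hand. The following Python sketch illustrates the idea only and is not a reimplementation of nancov: for each pair of columns it keeps the rows where both entries are present and computes the sample covariance (dividing by n - 1) over those rows. The function names and the small example columns are hypothetical.

```python
import math

def pairwise_cov(col_a, col_b):
    """Sample covariance over the rows where neither column is NaN."""
    pairs = [(a, b) for a, b in zip(col_a, col_b)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    if n < 2:
        return math.nan  # not enough overlapping observations
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (n - 1)

def cov_matrix(columns):
    """Covariance matrix built one column pair at a time."""
    return [[pairwise_cov(ci, cj) for cj in columns] for ci in columns]

# Two short hypothetical series, each with one missing month.
columns = [
    [1.0, 2.0, 3.0, math.nan],
    [2.0, 4.0, math.nan, 8.0],
]
for row in cov_matrix(columns):
    print(row)
```

Because each entry is computed from a different subset of rows, a matrix built this way is not guaranteed to be positive semi-definite, which is worth checking before using it in a portfolio calculation.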

How do I get rid of NaNs in MATLAB?

I have files which have many empty cells which appear as NaNs when I use cell2mat, but the problem is when I need to get the average values I cannot work with this as it shows error with NaN. In excel it overlooks NaN values, so how do I do the same in MATLAB?
In addition, I am writing a file using xlswrite:
xlswrite('test.xls',M);
I have data in all rows except 1. How do I write:
M(1,:) = ('time', 'count', 'length', 'width')
In other words, I want M(1,1)='time', M(1,2)='count', and so on. I have data from M(2,1) to M(10,20). How can I do this?
As AP correctly points out, you can use the function isfinite to find and keep only finite values in your matrix. You can also use the function isnan. However, removing values from your matrix can have the unintended consequence of reshaping your matrix into a row or column vector:
>> mat = [1 2 3; 4 NaN 6; 7 8 9] % A sample 3-by-3 matrix
mat =
1 2 3
4 NaN 6
7 8 9
>> mat = mat(~isnan(mat)) % Removing the NaN gives you an 8-by-1 vector
mat =
1
4
7
2
8
3
6
9
Another alternative is to use some functions from the Statistics Toolbox (if you have access to it) that are designed to deal with matrices containing NaN values. Since you mention taking averages, you may want to check out nanmean:
>> mat = [1 2 3; 4 NaN 6; 7 8 9];
>> nanmean(mat)
ans =
4 5 6 % The column means computed by ignoring NaN values
EDIT: To answer your additional question on the use of xlswrite, this sample code should illustrate one way you can write your data:
C = {'time','count','length','width'}; % A cell array of strings
M = rand(10,20); % A 10-by-20 array of random values
xlswrite('test.xls',C); % Writes C to cells A1 through D1
xlswrite('test.xls',M,'A2:T11'); % Writes M to cells A2 through T11
Use the isfinite function to get rid of all NaNs and infinities:
A=A(isfinite(A))
%create the cell array containing the column headers
columnHeader = {'Column 1', 'Column 2', 'Column 3', 'Column 4', 'Column 5',' '};
%write the column headers first
xlswrite('myFile1.xls', columnHeader );
% write the data directly underneath the column headers
xlswrite('myFile1.xls',M,'Sheet1','A2');
Statistics Toolbox has several statistical functions to deal with NaN values. See nanmean, nanmedian, nanstd, nanmin, nanmax, etc.
You can set NaN's to an arbitrary number like so:
mat(isnan(mat))=7 % my lucky number of choice.
May be too late, but...
x = [1 2 3; 4 inf 6; 7 -inf NaN];
x(find(x == inf)) = 0; % for inf
x(find(x == -inf)) = 0; % for -inf
x(find(isnan(x))) = 0; % for NaN
