Matlab - filter out identical cell entries - string

I have a 10000 x 2 cell array of strings. It looks like this:
Ann Beth
Bob Pete
Sam Sam
Jen Ted
...
There are many rows with identical names in both columns (like Sam). I need only the rows whose two names differ. I thought of a for loop with ismember or a string comparison, but that is very slow, and I have several matrices like this.
Another option is to take the unique values of the first column, loop over them with find, and delete the rows where the two values are identical. However, this is slow as well. Please help me optimize this.
Thanks

You can use strcmp to compare the 1st column with the 2nd. It returns a logical array that is 1 for every row whose two entries are identical; you then remove the rows where it is 1.
Example:
C = {'Ann' 'Beth';
'Bob' 'Pete';
'Sam' 'Sam';
'Jen' 'Ted'};
idx = strcmp(C(:,1),C(:,2))
Here idx looks like this:
idx =
0
0
1
0
Hence the 3rd row contains identical names. Now remove those:
C(idx,:) = [];
C =
'Ann' 'Beth'
'Bob' 'Pete'
'Jen' 'Ted'
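For readers working outside MATLAB, the same logical-mask idea can be sketched in Python. This is only an illustration of the technique, not part of the original answer; the names rows, same and filtered are made up for the example.

```python
# Rows of name pairs, mirroring the cell array C from the answer above.
rows = [("Ann", "Beth"), ("Bob", "Pete"), ("Sam", "Sam"), ("Jen", "Ted")]

# Logical mask: True where both columns hold the same string (like strcmp).
same = [first == second for first, second in rows]

# Keep only the rows where the mask is False.
filtered = [row for row, is_same in zip(rows, same) if not is_same]
print(filtered)
```

As in the MATLAB version, the comparison is done once per row with no nested search, so the cost grows linearly with the number of rows.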

Related

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    #listname = i[4:] + '_norm'
    df2 = pd.read_excel(i,header=0,index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j/maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this, and you can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1':[1, 2, 3, 4],
'col2':[5, 6, 7, 8]})
print(df)
   col1  col2
0     1     5
1     2     6
2     3     7
3     4     8
Now we can use DataFrame.apply to divide each column by its maximum, use add_suffix to give the new columns a _norm suffix, and then concat the original and normalised columns into one final dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
   col1  col2  col1_norm  col2_norm
0     1     5       0.25      0.625
1     2     6       0.50      0.750
2     3     7       0.75      0.875
3     4     8       1.00      1.000
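To connect this back to the per-file loop in the question, the normalisation can be wrapped in a small helper. This is a minimal sketch: the function name add_normalised_columns is hypothetical, and it assumes only the numeric columns should be normalised.

```python
import pandas as pd


def add_normalised_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df plus one '<name>_norm' column per numeric column."""
    numeric = df.select_dtypes(include="number")
    # Dividing a DataFrame by a Series aligns on column labels,
    # so each column is divided by its own maximum.
    normalised = numeric.div(numeric.max()).add_suffix("_norm")
    return pd.concat([df, normalised], axis=1)


# Same example frame as in the answer above.
example = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]})
result = add_normalised_columns(example)
print(result)
```

Inside the loop over spreadsheets, the helper would be called once per file, after dropna and before to_excel, which removes the need for the inner loops over columns and elements.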
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't notable.
Thanks for your help @Erfan

How to ignore values already counted in previous rows for cumulative count

I have a dataset that looks like this:
Sample Species1 Species2 Species3 Cumulative count
1 1 1
2 1 1 2
3 1 2
4 2 2
5 1 2 1 3
I would like to count every new species added by each sample. So in the example above, samples 3 and 4 don't add any new species to the total number of species, so their cumulative count remains the same (I am trying to create a species accumulation curve).
I have tried this, but cannot get it to work with numbers >0 (for instance), rather than text:
How to ignore data previously counted by countif and return a specific value in cell
Essentially, I need something to check if the species in the current row were already present in previous rows.
The goal is to produce a graph like this, so I can determine where sampling effort begins to have a diminishing return (in number of species):
Is there an excel formula I could use to fill the 'Cumulative count' column and return the results above? I should also mention that a short solution would be best, because I have 35+ species and the formulas can get long and complicated very quickly. Any assistance would be appreciated.
Comparing whole rows at once is easy and gives a smaller formula:
=SUMPRODUCT((B3:D3>0)*1,(B$2:D2<=0)*1)+E2
.......
=SUMPRODUCT((B6:D6>0)*1,(B$2:D2<=0)*1,(B$3:D3<=0)*1,(B$4:D4<=0)*1,(B$5:D5<=0)*1)+E5
I found a method but this probably can be refined a bit:
For the first 'Cumulative count' row:
=COUNTIF(B2:D2,"<>" & "")
For each thereafter:
=IF(AND(B3>0,SUM(B$2:B2)<=0),1,0)+IF(AND(C3>0,SUM(C$2:C2)<=0),1,0)+IF(AND(D3>0,SUM(D$2:D2)<=0),1,0)+E2
.....................
=IF(AND(B6>0,SUM(B$2:B5)<=0),1,0)+IF(AND(C6>0,SUM(C$2:C5)<=0),1,0)+IF(AND(D6>0,SUM(D$2:D5)<=0),1,0)+E5
A shorter solution is a UDF, and is provided here by JvdV:
stackoverflow.com/questions/51980149/count-column-if-it-contains-a-filled-cell-in-excel/51980258
Well, I guess if you want to work without any helper rows to use the COUNTA function, a smooth way could be a UDF, possibly like so:
Function CountColumns(RNG As Range) As Long
    Dim COL As Range
    For Each COL In RNG.Columns
        If Application.WorksheetFunction.CountA(COL) > 0 Then CountColumns = CountColumns + 1
    Next COL
End Function
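The underlying logic (count a species the first time any sample records it, then carry the running total forward) can be sketched outside Excel as well. The sketch below is in Python and is not one of the answers above. The sample data is an assumption: the question's table lost its blank cells, so the placement of values in the species columns is a guess chosen to reproduce the stated cumulative counts.

```python
def cumulative_species(samples):
    """Return the running number of distinct species seen after each sample."""
    seen = set()      # column indexes of species already observed
    totals = []
    for counts in samples:
        for species, count in enumerate(counts):
            # A blank cell is represented as None; only positive counts matter.
            if count and count > 0:
                seen.add(species)
        totals.append(len(seen))
    return totals


# Hypothetical layout of the five samples (None marks an empty cell).
samples = [
    [1, None, None],
    [1, 1, None],
    [None, 1, None],
    [2, None, None],
    [1, 2, 1],
]
print(cumulative_species(samples))
```

Because the set only grows when a previously unseen column has a positive count, samples that add nothing new leave the total unchanged, which is the behaviour a species accumulation curve needs.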

How do you group data in columns?

I have numeric data in fifty sample columns that are mostly similar. I want to count identical columns and compute statistics on them. There are too many rows to select them (37,888). The data looks like this:
Sample 1 Sample 2 Sample 3 ........ Sample 50
4 4 0
4 4 0
4 4 ...
0 0
0 0
0 0
0 0
... ...
up to thousands of rows for each sample.
There is a column for date/time as well, would be nice if I could include that in the grouping.
In this snippet, there are many rows. Sample 1 and 2 are identical hence should be grouped together. Sample three would form another group and so on.
While I'm not sure what "There are too many rows to select them" means in this context (there is no limit on the number of rows or items that can be selected and included in a formula), this looks like a job for array formulas.
If you want to determine (for instance) whether columns C and D are equal, from rows 1 through 37888, you can use this formula:
=AND(C1:C37888=D1:D37888)
To make Excel treat this as an array formula, you need to press CTRL-SHIFT-ENTER (Windows) or CMD-ENTER (Mac) after typing the formula. The "AND" function will return TRUE if and only if all corresponding entries are equal: C1=D1, C2=D2, C3=D3, ..., C37888=D37888. It returns FALSE if any corresponding entries disagree.
Exactly what you do next will depend on the nature of the statistics that you want to compute for each group, but this formula will at least help you figure out which columns belong in the same group together.
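If the data can be loaded into pandas, the pairwise comparison can be replaced by grouping columns on their contents. This is a rough sketch rather than an Excel solution: the function name group_identical_columns and the sample values are invented for illustration, and it assumes the columns contain no missing values (NaN never compares equal to itself).

```python
import pandas as pd


def group_identical_columns(df: pd.DataFrame) -> list:
    """Group column names whose values are identical in every row."""
    groups = {}
    for name in df.columns:
        # The full column, as a tuple, serves as the grouping key.
        key = tuple(df[name].tolist())
        groups.setdefault(key, []).append(name)
    return list(groups.values())


# Hypothetical data: Sample 1 and Sample 2 match, Sample 3 differs.
data = pd.DataFrame({
    "Sample 1": [4, 4, 4, 0, 0],
    "Sample 2": [4, 4, 4, 0, 0],
    "Sample 3": [0, 0, 1, 0, 0],
})
groups = group_identical_columns(data)
print(groups)
```

Each column is read once, so fifty columns of 37,888 rows need fifty passes rather than one comparison per pair of columns.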

Find the top n values in a range while keeping the sum of values in another range under x value

I'd like to accomplish the following task. There are three columns of data. Column A represents price, where the sum needs to be kept under $100,000. Column B represents a value. Column C represents a name tied to columns A & B.
Out of >100 rows of data, I need to find the highest 8 values in column B while keeping the sum of the prices in column A under $100,000. And then return the 8 names from column C.
Can this be accomplished?
EDIT:
I attempted the Solver solution w/ no luck. 200 rows looks to be the max w/ Solver, and that is what I'm using now. Here are the steps I've taken:
Create a column called rank RANK(B2,$B$2:$B$200) (used column D -- what is the purpose of this?)
Create a column called flag just put in zeroes (used column E)
Create 3 total cells total_price (=SUM(A2:A200)), total_value (=SUM(B2:B200)) and total_flag (=(E2:E200))
Use solver to minimize total_value (shouldn't this be maximize??)
Add constraints -Total_price<=100000 -Total_flag=8 -Flag cells are binary
Using Simplex LP, it simply changes the flags for the first 8 values. However, the total price for the first 8 values is >$100,000 ($140k). I've tried changing some options in the Solver Parameters as well as using different solving methods to no avail. I'd like to post an image of the parameter settings, but don't have enough "reputation".
EDIT #2:
The first 5 rows looks like this, price goes down to ~$6k at the bottom of the table.
Price Value Name Rank Flag
$22,538 42.81905675 Blow, Joe 1 0
$22,427 37.36240932 Doe, Jane 2 0
$17,158 34.12127693 Hall, Cliff 3 0
$16,625 33.97654031 Povich, John 4 0
$15,631 33.58212402 Cow, Holy 5 0
I'll give you the Solver solution as a starting point. It involves creating some extra columns and total cells. Note that Solver is limited in the number of cells it can handle, but it will work with 100 anyway.
Create a column called rank RANK(B2,$B$2:$B$100)
Create a column called flag just put in zeroes
Create 3 total cells total_price, total_value and total_flag
Use solver to minimize total_value
Add constraints
-Total_price<=100000
-Total_flag=8
-Flag cells are binary
This will flag the rows you want and you can grab the names however you want.
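As a cross-check on the Solver setup, the same selection problem can be sketched in Python with a small dynamic program. This is an alternative technique, not the answerer's method. It assumes the objective is to maximize the total value, that exactly count rows must be chosen, and that the total price may equal the budget. The function name pick_best is hypothetical; the rows are the five shown in the question, with a reduced count and budget so the example stays small.

```python
def pick_best(rows, count, budget):
    """Choose exactly `count` rows with total price <= budget and maximum value.

    rows is a list of (price, value, name) tuples.
    Returns (total_value, names) or None if no selection is feasible.
    """
    # best[(k, spent)] = (total_value, names) for the best way to pick
    # k rows that cost exactly `spent`.
    best = {(0, 0): (0.0, ())}
    for price, value, name in rows:
        # Iterate over a snapshot so each row is used at most once.
        for (k, spent), (total, names) in list(best.items()):
            if k == count or spent + price > budget:
                continue
            key = (k + 1, spent + price)
            candidate = (total + value, names + (name,))
            if key not in best or candidate[0] > best[key][0]:
                best[key] = candidate
    finals = [v for (k, _), v in best.items() if k == count]
    return max(finals, key=lambda v: v[0]) if finals else None


rows = [
    (22538, 42.81905675, "Blow, Joe"),
    (22427, 37.36240932, "Doe, Jane"),
    (17158, 34.12127693, "Hall, Cliff"),
    (16625, 33.97654031, "Povich, John"),
    (15631, 33.58212402, "Cow, Holy"),
]
# Illustrative parameters: pick 3 rows for at most $60,000.
result = pick_best(rows, count=3, budget=60000)
print(result)
```

With these parameters the three highest values cost $62,123 in total, so simply taking the top rows is infeasible; the search instead has to trade the second-highest value for a cheaper row. The same effect explains why flagging the first 8 rows in the question exceeded the $100,000 limit.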

Excel Count matches between 3 criteria

I have 2 columns of data (Names = DataA & DataB).
I have 2 variable sets of codes for which I want to count matches (Names = DataC & DataD).
DataA (Col A)
a
b
b
c
e
.....8000 records
DataB (Col B)
John
Fred
Gerry
Alice
etc.... 8000 records
DataA Variables to match a c ..... (up to 20 - RangeName=DataC)
DataB Variables to match John Fred ... (up to 20 - RangeName=DataD)
I can count the number of matches DataA to DataC using:
SUMPRODUCT((DataA=DataC)*1)
But if I try to add the DataB to DataD criteria, it doesn't work.
I can do it using multiple Countifs, one for each variable in turn but with larger numbers of variables it gets very messy (example with 4 variables):
COUNTIFS(DataA,$U$72,dataB,AA71)+COUNTIFS(DataA,$V$72,dataB,AA71)+COUNTIFS(DataA,$W$72,dataB,AA71)+COUNTIFS(DataA,$X$72,dataB,AA71)
I don't want to use Pivot Tables and would like a more elegant solution - driving me nuts for 2 days now - hope it doesn't do the same for you!
I may have misunderstood what you are trying to do but using a two criteria SUMPRODUCT works for me.
=SUMPRODUCT(--(DataA=DataC),--(DataB=DataD))
Note that rather than using *1 to coerce the logical results to numbers, you should try to use -- (double unary minus) inside the SUMPRODUCT function.
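To make the intended count explicit, here is a small Python sketch of one reading of the question: count the rows whose DataA value appears anywhere in DataC and whose DataB value appears anywhere in DataD. That reading is an assumption, as is the pairing of the sample values into rows; the fifth name is invented to complete the example.

```python
# Paired columns (row i of data_a belongs with row i of data_b).
data_a = ["a", "b", "b", "c", "e"]
data_b = ["John", "Fred", "Gerry", "Alice", "John"]

# Codes to match; sets give fast membership tests.
codes_c = {"a", "c"}
codes_d = {"John", "Fred"}

# A row counts only if both of its values are in the respective code sets.
matches = sum(
    1 for a, b in zip(data_a, data_b) if a in codes_c and b in codes_d
)
print(matches)
```

Under this reading the result equals the sum of the separate COUNTIFS terms from the question, one for every combination of a DataC code with a DataD code, which is a useful check when testing any single-formula replacement.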
