How do you group data in columns? - excel

I have numeric data under fifty samples that are mostly similar. I want to count identical columns and give statistics on the same. There are too many rows to select them (37,888). Data looks like:
Sample 1 Sample 2 Sample 3 ........ Sample 50
4 4 0
4 4 0
4 4 ...
0 0
0 0
0 0
0 0
... ...
upto thousands of rows for each sample.
There is a column for date/time as well, would be nice if I could include that in the grouping.
In this snippet, there are many rows. Sample 1 and 2 are identical hence should be grouped together. Sample three would form another group and so on.

While I'm not sure what "There are too many rows to select them" means in this context (there is no limit on the number of rows or items that can be selected and included in a formula), this looks like a job for array formulas.
If you want to determine (for instance) whether columns C and D are equal, from rows 1 through 37888, you can use this formula:
=AND(C1:C37888=D1:D37888)
To make Excel treat this as an array formula, you need to press CTRL-SHIFT-ENTER (Windows) or CMD-ENTER (Mac) after typing the formula. The "AND" function will return TRUE if and only if all corresponding entries are equal: C1=D1, C2=D2, C3=D3, ..., C37888=D37888. It returns FALSE if any corresponding entries disagree.
Exactly what you do next will depend on the nature of the statistics that you want to compute for each group, but this formula will at least help you figure out which columns belong in the same group together.

Related

How to match two sets of data by dates which do not synchronise and include missing values in Excel

Please forgive any errors or shortcomings in this question, it's my first on stackoverflow.
I have two sets of data in Excel of differing lengths and frequency, and would like to be able to place a value of 0 for where they don't synchronise, and match the rest.
For example, dataset 1 could be:
Date Set1
01-01-2010 10
01-03-2010 4
01-04-2010 8
01-05-2010 5
01-06-2010 10
01-09-2010 12
01-10-2010 9
01-11-2010 4
And dataset 2 could be:
Date Set2
01-03-2010 102
01-06-2010 104
01-10-2010 102
I'm looking for an output table that displays the values alongside each other for dates matching, 0 otherwise, like so:
Date Set1 Set2
01-01-2010 10 0
01-03-2010 4 102
01-04-2010 8 0
01-05-2010 5 0
01-06-2010 10 104
01-09-2010 12 0
01-10-2010 9 102
01-11-2010 4 0
I can't seem to be able to crack this with my limited knowledge and the lack of synchronisation in the data. Any help would be much appreciated, thanks.
You can do this using a VLOOKUP nested in an IFERROR statement.
The two equations used (and dragged down to last unique date row) are:
H3 = IFERROR(VLOOKUP(G3,A:B,2,0),0)) & I3 = IFERROR(VLOOKUP(G3,D:E,2,0),0))
This will not work if you have duplicate dates in the same data set with varying values since VLOOKUP will always return the first matched value (reading top down).
Place Set1 in A1:B9 (header in row 1). Add a column of zeros next to it in column C, so A2:A9 is dates, B2:B9 is values and C2:C9 is zeros.
Place Set2 (without the header) in A10:B12; move the Set2 data to column C and put zeros in column B, so A10:A12 is dates, B10:B12 is zeros, C10:C12 is values.
Sort the range A2:C12 by Date (column A).
Easier to show with a screenshot but newbies are not allowed to post images.

How to count the values in ranges in Excel

I have two columns as shown below. Group values is 0,1,2,3,4 and scores is from 0 to 80. I want to count how many 0s (1s, 2s, 3s, 4s) are present for scores between 0 and 10; 10 and 20; 20 and 30 etc.
I am thinking to use Excel pivot table. But I am stuck - how could I achieve this?
Group scores
1 8.56163
2 34.3649
2 12.2291
0 8.75357
2 8.75967
2 5.87806
0 9.33751
2 32.0303
0 43.5567
2 11.1044
2 24.9266
1 18.9314
-------- result should look like below --------
scores group count
0-10 0 2
0-10 1 1
0-10 2 2
0-10 3 0
0-10 4 0
10-20 0 0
10-20 1 1
10-20 2 2
...
------ PS I have solved this problem using matlab. But it would be nice to see someone do it in Excel.
---------------- thanks for all the anwers. I really appreciate it. I have accepted the 1st answer.
I apologize for my previous answer. You can do binning with PivotTable.
select your whole two columns (A1:B13), insert PivotTable
under rows, put your "Group"
under columns, put your "scores"
under values, put your "Group"
click that last one ("Group" within the values quadrant) and change it to count, not sum
intermediate result:
Now in the resulting pivot table, right click on a colum and select "Group and show detail". You can configure your bins there.
Result:
One option is to use a pivot table, but another option is to use COUNTIFS, e.g.:
=COUNTIFS($A$2:$A$13,"="&$F2,$B$2:$B$13,">="&D2,$B$2:$B$13,"<="&E2)
In practice:
You could just use simple countif formulas:
Type out first criteria into cells. D1 = 0, E1 = 1, F1 = 2 etc.
Now you can just say =COUNTIF($A$2:$A$13,D1) and just drag that out.
The other column would require countifs.
Lets say D3 is blank E3 = ">10", F3 = ">20", etc.
Now D4 = "<=10", E4 = "<=20", F4 = "<=30", etc.
Now you can use =COUNTIF($B$2:$B$13,D4) for your first criteria and =COUNTIFS($B$2:$B$13,E4,$B$2:$B$13,E3) for the next criteria and just drag that out.
Hope this helps, good luck!
Please try:
This uses Grouping (by decade) for the Row labels.
To Group, right-click on one of the entries under Row Labels and Group..., then select enter Starting at:, Ending at: and By: to suit:

Count the number of rows with non-zero values in two columns in MS-Excel

I have an excel file with two columns. Let it be the ratings that different users have given for two movies. The movies represent the columns and the users represent the rows. I need to find the number of instances where a user has rated both the movies, in other words, how to find the number of rows where the two columns are both non-zeros.
[User] [Star Wars (1977)] [Star Wars (1983)]
755 1 5
5277 5 3
1577
4388 3
1202 4 3
3823 2 4
5448
5347 4
4117 5 1
2765 4 2
The answer must be 6. I preferably am looking for something which gives 6 right away without any intermediate results.
You can use COUNTIFS function:
=COUNTIFS(B1:B10,">0",C1:C10,">0")
The COUNTIFS function applies criteria to cells across multiple ranges and counts the number of times all criteria are met. See here for details.
Excel has a function COUNTBLANK to test if a field is empty.
You could use this formula
= COUNTBLANK(B1) + COUNTBLANK(C1)
and if it returns 0 then you know there are no blanks.

EXCEL - Comparing Two Columns - Removing Repeats

If two households share, they create a tie and this tie has a kinship rank that does not change, no matter how often two households share with each other.
KINSHIP RANK EXAMPLE
As you can see, it doesn't matter in which "direction" the tie happened whether it was household 5 who shared to household 3 or vice versa, the kinship rank is still 1
HH1 HH2 RANK
5 3 1
3 5 1
Therefore, I do not need every tie that occurs between two households, but only the first instance that a tie occurred between the two households.
So here is a sample list of many households who shared with each other, sometimes sharing resources with themselves, sharing only once, or sharing multiple times with the same household.
TWO HOUSEHOLD WITH REPEATED TIES
COL.A COL.B
ROW HH1 HH2
1 1 1
2 1 2
3 1 3
4 2 1
5 2 4
6 3 1
7 3 2
8 3 4
9 4 2
This is what I need it to look like:
TWO HOUSEHOLDS WITHOUT REPEATED TIES
COL.A COL.B
ROW HH1 HH2
1 1 1
2 1 2
3 1 3
4 2 4
5 3 2
6 3 4
What I have done
I wrote a simple command for placing the HH1 and HH2 information into the same cell:
=A1&"|"&B1
In the case of the second row, this looks like 1|2 inside cell C2
HH1 and HH2 are combined in column C so how will I be able to compare all of the households in column C to each other? Perhaps a highlighting rule if a repeat happens? Or in another column list if it is a delete or a keep?
Thank you for your assistance everyone.
I suggest a simple COUNTIFS to do the job like this:
=(COUNTIFS(A$1:A1,B2,B$1:B1,A2)+COUNTIFS(A$1:A1,A2,B$1:B1,B2))>0
starting in C2 and then copy down. It will show TRUE for each row which is within the range above it and false if not. Ich checks for both x/y and y/x (the order doesn't matter)
Now simply filter col C to only show rows with TRUE in it. Then simply select and delete it.
This also works with non numerical values like names.
If you still have any questions, just ask ;)
You also can wrap it up to get more informations like this:
=IF((COUNTIFS(A$1:A1,B2,B$1:B1,A2)+COUNTIFS(A$1:A1,A2,B$1:B1,B2)),"",COUNTIFS(A:A,B2,B:B,A2)+COUNTIFS(A:A,A2,B:B,B2))
For C2 and copy down. C1 gets:
=COUNTIFS(A:A,B2,B:B,A2)+COUNTIFS(A:A,A2,B:B,B2)
This will show you only at the first occurrence how many times it is within the whole range.
All done by phone, may contain errors
Use =((A1*B1)/(A1+B1))*((A1*B1)+(A1+B1)) to create unique identifiers. Then use Remove Duplicates in the Data Tools Pane of the Data Tab to remove all rows containing duplicates. Or, alternatively, use something like =IF(IFNA(MATCH(A2,A$1:A1,0),TRUE())=TRUE,"First Share","") dragged and dropped from row 2 to identify First Shares.

Find the top n values in a range while keeping the sum of values in another range under x value

I'd like to accomplish the following task. There are three columns of data. Column A represents price, where the sum needs to be kept under $100,000. Column B represents a value. Column C represents a name tied to columns A & B.
Out of >100 rows of data, I need to find the highest 8 values in column B while keeping the sum of the prices in column A under $100,000. And then return the 8 names from column C.
Can this be accomplished?
EDIT:
I attempted the Solver solution w/ no luck. 200 rows looks to be the max w/ Solver, and that is what I'm using now. Here are the steps I've taken:
Create a column called rank RANK(B2,$B$2:$B$200) (used column D -- what is the purpose of this?)
Create a column called flag just put in zeroes (used column E)
Create 3 total cells total_price (=SUM(A2:A200)), total_value (=SUM(B2:B200)) and total_flag (=(E2:E200))
Use solver to minimize total_value (shouldn't this be maximize??)
Add constraints -Total_price<=100000 -Total_flag=8 -Flag cells are binary
Using Simplex LP, it simply changes the flags for the first 8 values. However, the total price for the first 8 values is >$100,000 ($140k). I've tried changing some options in the Solver Parameters as well as using different solving methods to no avail. I'd like to post an image of the parameter settings, but don't have enough "reputation".
EDIT #2:
The first 5 rows looks like this, price goes down to ~$6k at the bottom of the table.
Price Value Name Rank Flag
$22,538 42.81905675 Blow, Joe 1 0
$22,427 37.36240932 Doe, Jane 2 0
$17,158 34.12127693 Hall, Cliff 3 0
$16,625 33.97654031 Povich, John 4 0
$15,631 33.58212402 Cow, Holy 5 0
I'll give you the solver solution as a starting point. It involves the creation of some extra columns and total cells. Note solver is limited in the amount of cells it can handle but will work with 100 anyway.
Create a column called rank RANK(B2,$B$2:$B$100)
Create a column called flag just put in zeroes
Create 3 total cells total_price, total_value and total_flag
Use solver to minimize total_value
Add constraints
-Total_price<=100000
-Total_flag=8
-Flag cells are binary
This will flag the rows you want and you can grab the names however you want.

Resources