I have a CSV file with 2,000,000 entries. Some of the rows are duplicates, which I want to find and remove.
Logic
The data is in the format below:
column 1...column 2 ...column 3 ...column 4
XXX ...123 ...abc ...a
YYY ...456 ...def ...b
ZZZ ...123 ...abc ...c
ZZZ ...123 ...abc ...d
I want to find duplicates without considering the last column. In the table above, the last 2 entries are duplicates of each other.
Of these duplicates, I want to remove the first entry and keep the second one.
Please suggest which method would best achieve this.
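One possible approach outside Excel, as a minimal sketch only: read the rows with Python's csv module, key each row on every column except the last, and let later rows replace earlier ones. The function name dedupe_keep_last and the inline sample are illustrative assumptions; a real run would read from your file instead of a string.

```python
import csv
import io


def dedupe_keep_last(rows):
    """Drop rows that repeat in all columns but the last, keeping the later row."""
    latest = {}
    for row in rows:
        key = tuple(row[:-1])   # compare on every column except the last
        latest.pop(key, None)   # forget the earlier duplicate, if there was one
        latest[key] = row       # the most recent row wins
    return list(latest.values())


# Sample taken from the table in the question, written as comma-separated text.
sample = "XXX,123,abc,a\nYYY,456,def,b\nZZZ,123,abc,c\nZZZ,123,abc,d\n"
rows = list(csv.reader(io.StringIO(sample)))
result = dedupe_keep_last(rows)
for row in result:
    print(row)
```

With the sample above, the row ending in c is dropped and the row ending in d is kept. A dictionary of 2,000,000 keys should fit in memory on a typical machine, but that depends on how wide the rows are.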
Related
I have two columns like this:
Initial Table
COL-A and COL-B come from two different files. I have to do two things: a) match these two columns, and b) find which data is missing. What I do now: I insert a third column, COL-C, containing VALUE(LEFT(B2,6)). Then I sort COL-A on its own. After that, I sort COL-B and COL-C together based on the value in COL-C. Then I subtract COL-A from COL-C in COL-D and move data manually to find the missing values. Finally, it looks like this:
Final Table
I work with this data every day. The number of rows changes daily: today I may have 250, tomorrow it may be 400, and this is very important to keep in mind. I was wondering if anyone can tell me how to get this done in a single click. I am willing to use VBA if needed. My Excel version is 2016. Thanks.
Sorry, this is not elegant, but it will work without manually rearranging COL-A or COL-B.
In Col C, do as you planned and extract the first 6 characters of B:
=IFERROR(VALUE(LEFT(B2,6)),0)
Then in Col D use a VLOOKUP to flag each ID in column A that does not appear in column C:
=IFERROR(VLOOKUP(A2,C:C,1,FALSE),"Missing")
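For comparison, the same lookup logic can be sketched in Python. This is only an illustration: the function name find_missing and the sample values are made up, since the original tables are not shown. Unlike the first formula, which writes 0 for values that cannot be converted, this sketch simply skips them.

```python
def find_missing(col_a, col_b):
    """Return the values of col_a that match no 6-character prefix of col_b."""
    prefixes = set()
    for text in col_b:
        try:
            prefixes.add(int(str(text)[:6]))  # rough analogue of VALUE(LEFT(B2,6))
        except ValueError:
            continue  # non-numeric prefix: skipped here (the formula writes 0)
    # Rough analogue of the VLOOKUP: keep the IDs with no exact match.
    return [a for a in col_a if int(a) not in prefixes]


# Hypothetical sample data, not taken from the question.
col_a = [100001, 100002, 100003]
col_b = ["100001-X", "100003-Y", "n/a"]
missing = find_missing(col_a, col_b)
print(missing)
```

Because the lookup uses a set, the order and length of the two columns do not matter, so no sorting is needed when the row count changes from day to day.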
I have a homework assignment where I have to merge the data of two Excel sheets, performing some cleansing operations with formulas.
Sheet 1:
OrderID | Full Name | Customer Status
1001 | Waqar Hussain | Silver
2002 | Ali Moin | Gold
Sheet 2:
OrderID | First Name | Last Name | Customer Status
A1003 | Junaid | Ali | 2
A2004 | Kamran | Hussain | 1
Sheet 3 (Combined Sheet, expected):
OrderID | Full Name | Customer Status
1001 | Waqar Hussain | Silver
2002 | Ali Moin | Gold
1003 | Junaid Ali | Silver
2004 | Kamran Hussain | Gold
There are probably a lot of ways to do this. First make sure the data is clean. If you are already 100% positive the data is clean, you can skip this step. If you aren't sure, it's better to be safe than sorry. For each column, create a new column using the CLEAN and TRIM functions to remove any non-printable characters and any extra spaces. Something similar to =TRIM(CLEAN(A2)). Then drag the formula down so it fills every row.
After this, in order to merge the data together, we need something to join on. The full name seems to make the most sense. On Sheet 2 we'll write a new formula to join the first name and last name together. The CONCAT function should work:
=CONCAT(First Name, " " ,Last Name). Note the space between the quotes; it makes the result match the Full Name format from Sheet 1. It looks like we'll also need to strip the letter from the Order ID in Sheet 2. I'm going to assume that all Order IDs are 5 characters long. If this isn't true, you'll need a different solution. You can use =RIGHT(A2,4). This grabs the rightmost 4 characters of the text string.
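As a rough illustration of these cleansing steps outside Excel, here is a minimal Python sketch. The names tidy, normalise_sheet2_row and STATUS_BY_CODE are hypothetical, and the mapping of status 1 to Gold and 2 to Silver is an assumption inferred from the expected Sheet 3 in the question, not something the question states.

```python
# Assumption: status codes inferred from the expected combined sheet.
STATUS_BY_CODE = {"1": "Gold", "2": "Silver"}


def tidy(value):
    """Approximate =TRIM(CLEAN(A2)): drop control characters, collapse spaces."""
    printable = "".join(ch for ch in str(value) if ord(ch) >= 32)
    return " ".join(printable.split())


def normalise_sheet2_row(order_id, first_name, last_name, status_code):
    """Convert one Sheet 2 row into the Sheet 1 layout."""
    clean_id = tidy(order_id)[-4:]                        # like =RIGHT(A2,4)
    full_name = f"{tidy(first_name)} {tidy(last_name)}"   # like =CONCAT(B2," ",C2)
    status = STATUS_BY_CODE[tidy(status_code)]            # assumed code mapping
    return clean_id, full_name, status


sheet2 = [("A1003", "Junaid", "Ali", 2), ("A2004", "Kamran", "Hussain", 1)]
converted = [normalise_sheet2_row(*row) for row in sheet2]
for row in converted:
    print(row)
```

Like the RIGHT formula, the slice assumes every Order ID ends in the four digits you want to keep.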
At this point let's create a distinct list. Copy the Full Names from Sheet 1 and paste them onto Sheet 3. Copy the Full Names we created on Sheet 2 and Paste VALUES onto Sheet 3 below the full names from Sheet 1. Then select all the rows in the column and go to the Data tab. Click "Remove Duplicates". This leaves a distinct list of values.
We can now merge the data together using INDEX MATCH. There are lots of great tutorials on how to use INDEX and MATCH in combination. It's a little long to explain in this thread, but the article linked below explains how it works. It's worth taking 10 minutes to fully understand it, because it is a formula you will use thousands of times throughout your life.
https://www.deskbright.com/excel/using-index-match/
Let me know if I can clarify anything.
Best,
Brett
I have a spreadsheet with six columns, and I want to remove any line whose values in columns 3 and 4 match the values in columns 3 and 4 of another line. For example, these two lines would need to be deduped:
alpha.txt, beta.txt, 03/12/15, exit, gamma.txt, bravo.txt
gamma.txt, bravo.txt, 03/12/15, exit, alpha.txt, beta.txt
Since columns 3 and 4 match, I want these to be deduped. I tried using the Data > Remove Duplicates feature within Excel (selecting the entire table but removing duplicates only by columns 3 and 4) but it fails to remove all of the duplicates.
Does anyone have an alternative method in Excel or perhaps via sort or some other Linux utility?
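One alternative, sketched in Python under the assumption that the sheet is exported as comma-separated text: keep the first line seen for each (column 3, column 4) pair and drop the rest. The function name dedupe_on_columns is hypothetical. Stripping whitespace from the key is a guess at why Remove Duplicates might miss some rows; the question does not confirm the cause.

```python
import csv
import io


def dedupe_on_columns(rows, key_columns=(2, 3)):
    """Keep the first row for each combination of the given 0-based columns."""
    seen = set()
    kept = []
    for row in rows:
        # Strip stray spaces so "exit" and "exit " count as the same value.
        key = tuple(row[i].strip() for i in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


sample = (
    "alpha.txt, beta.txt, 03/12/15, exit, gamma.txt, bravo.txt\n"
    "gamma.txt, bravo.txt, 03/12/15, exit, alpha.txt, beta.txt\n"
)
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
result = dedupe_on_columns(rows)
for row in result:
    print(row)
```

With the two example lines, only the first one survives, because both share 03/12/15 and exit in columns 3 and 4.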
I have a file which contains Last Name, First Name MI for about 5000 people.
I need to split these into 3 separate columns.
The issue I am facing is that sometimes there is more than one first name; for example, I have a person listed as Davis, Mary Ann L.
I want Davis in one column,
Mary Ann in another column, and L in the third column. Basically, for each word after the comma, check whether its number of characters is greater than 1. If it is greater than 1, treat it as part of the first name. If it is equal to 1, treat it as the Middle Initial.
How can I achieve this?
In your case, I would start with the "Text to Columns" command. Just select the whole column, then choose Data -> Text to Columns. Choose "Delimited", click Next, then select "Space".
After this, I would look through the processed data to get a picture of it. I assume that most records will already be fine at this point, and the records that are exceptions to the standard pattern should be easily identifiable. You could even filter for them.
Only then, in a third step, I'd write a formula which processes the columns you have created in the first step.
Or, possibly a formula is not necessary at all. Possibly you can just easily filter and process some of the exceptions manually.
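If a script is an option, the rule described in the question (one-character words after the comma are the middle initial, longer words belong to the first name) can be sketched in Python. The function name split_name is hypothetical, and treating a trailing period as not counting toward the length is an assumption based on the "L." example.

```python
def split_name(full_name):
    """Split 'Last, First [First2 ...] [MI]' into (last, first, middle_initial)."""
    last, _, rest = full_name.partition(",")
    first_parts = []
    middle_initial = ""
    for token in rest.split():
        bare = token.rstrip(".")   # assumption: "L." counts as one character
        if len(bare) == 1:
            middle_initial = bare
        else:
            first_parts.append(token)
    return last.strip(), " ".join(first_parts), middle_initial


print(split_name("Davis, Mary Ann L."))
print(split_name("Smith, John"))
```

Names that break the pattern, such as a one-letter first name, would still need the manual review described above.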
I have two very large files we'll call Old and New. New contains many of the entries that Old contains. What I need to do is remove from New any entry that Old also contains. There are 9,459 entries in Old with 55 columns. New contains 11,983 entries with 76 columns. I need to make the comparison based on 5 columns: 'name_last', 'name_first', 'name_middle', 'street', and 'type'.
I'm using Excel 2010, I'm very new to it, and haven't got a clue where to start.
Make up a concatenated column in each file to "glue" together 'name_last', 'name_first', 'name_middle', 'street', and 'type'. Something like this:
=LOWER(A2&B2&C2&D2&E2)
(The LOWER will let you run a case insensitive search)
Add a formula like this (change sheet names and columns to suit)
=ISNA(MATCH(F2,[old.xlsx]Sheet2!$F:$F,0))
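to look up each value in column F of "new.xlsx" against the entire list of concatenated values in "old.xlsx". The formula returns TRUE for rows of New that have no match in Old, and FALSE for rows that do.
AutoFilter on FALSE to show the rows of New that also exist in Old, then delete those rows. The rows marked TRUE are the non-matches you want to keep.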
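The same comparison can be sketched in Python, assuming each file has been loaded as a list of dictionaries keyed by column name (for example with csv.DictReader). The names make_key and rows_only_in_new and the sample records are hypothetical. The sketch keeps only the rows of New that have no match in Old, which is the result the question asks for.

```python
KEY_COLUMNS = ("name_last", "name_first", "name_middle", "street", "type")


def make_key(record):
    """Build a case-insensitive key from the five comparison columns."""
    return tuple(str(record.get(col, "")).strip().lower() for col in KEY_COLUMNS)


def rows_only_in_new(old_rows, new_rows):
    """Return the rows of new_rows whose key does not occur in old_rows."""
    old_keys = {make_key(row) for row in old_rows}
    return [row for row in new_rows if make_key(row) not in old_keys]


# Hypothetical sample records, not taken from the question.
old_rows = [{"name_last": "Doe", "name_first": "Jane", "name_middle": "Q",
             "street": "1 Main St", "type": "A"}]
new_rows = [
    {"name_last": "DOE", "name_first": "jane", "name_middle": "Q",
     "street": "1 Main St", "type": "A", "extra": "x"},
    {"name_last": "Roe", "name_first": "Rick", "name_middle": "",
     "street": "2 Oak Ave", "type": "B", "extra": "y"},
]
remaining = rows_only_in_new(old_rows, new_rows)
print(remaining)
```

A tuple key keeps the five fields separate, so values such as "ab" + "c" and "a" + "bc" are not confused, which can happen when strings are glued together without a delimiter.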