Remove duplicates by line, based on specific columns - excel

I have a spreadsheet with six columns, and I want to remove duplicates for each line which contains values within columns 3 and 4 that match the values of 3 and 4 on another line. For example, these two lines would need to be deduped:
alpha.txt, beta.txt, 03/12/15, exit, gamma.txt, bravo.txt
gamma.txt, bravo.txt, 03/12/15, exit, alpha.txt, beta.txt
Since columns 3 and 4 match, I want these to be deduped. I tried using the Data > Remove Duplicates feature within Excel (selecting the entire table but removing duplicates only by columns 3 and 4) but it fails to remove all of the duplicates.
Does anyone have an alternative method in Excel or perhaps via sort or some other Linux utility?
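Outside Excel, the same dedup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is comma-separated text like the two sample lines above, columns 3 and 4 sit at zero-based indexes 2 and 3, and the first row seen for each key is the one to keep. The function name dedupe_by_columns is hypothetical, and stripping whitespace from the key values is an added guess, not something the question states.

```python
import csv
import io


def dedupe_by_columns(rows, key_columns=(2, 3)):
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    kept = []
    for row in rows:
        # Assumption: stray whitespace should not make two keys differ.
        key = tuple(row[i].strip() for i in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# The two sample lines from the question.
sample = """alpha.txt, beta.txt, 03/12/15, exit, gamma.txt, bravo.txt
gamma.txt, bravo.txt, 03/12/15, exit, alpha.txt, beta.txt
"""

rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
deduped = dedupe_by_columns(rows)
for row in deduped:
    print(", ".join(row))
```

For a real file, the io.StringIO wrapper would be replaced by an open file handle.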

Related

How to remove duplicates rows from CSV file using Node.js

I have a CSV file with 2,000,000 entries. Some of the rows are duplicates, which I want to find and remove.
Logic
The data will be like in the below format
column 1...column 2 ...column 3 ...column 4
XXX ...123 ...abc ...a
YYY ...456 ...def ...b
ZZZ ...123 ...abc ...c
ZZZ ...123 ...abc ...d
I want to find duplicates without considering the last column. In the table above, the last two entries are duplicates of each other.
Of these duplicates, I want to remove the first entry and retain the second.
Please suggest which method would best achieve this.
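The rule described above (compare every column except the last, keep the later row) can be sketched as follows. The sketch is in Python rather than Node.js, purely as an illustration of the logic. It assumes the rows are already parsed into lists and that one row per unique key fits in memory. The name keep_last_duplicate is hypothetical.

```python
def keep_last_duplicate(rows):
    """Drop earlier duplicates, comparing all columns except the last."""
    latest = {}
    for row in rows:
        key = tuple(row[:-1])
        # Remove any earlier occurrence so the later row takes its place
        # at the position where it appears in the input.
        latest.pop(key, None)
        latest[key] = row
    return list(latest.values())


# The four sample rows from the question.
rows = [
    ["XXX", "123", "abc", "a"],
    ["YYY", "456", "def", "b"],
    ["ZZZ", "123", "abc", "c"],
    ["ZZZ", "123", "abc", "d"],
]

result = keep_last_duplicate(rows)
for row in result:
    print(row)
```

The dictionary holds a single row per distinct key, so memory use grows with the number of unique rows, not the total row count.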

Feature Selection to add specific columns

I want to run Python code that selects all features in a dataset, excluding the target variables (the last 2 columns) and the first column. What would the code for that line be? I tried the code below and got an error. The dataframe has 38 columns in total.
In short, I want to define 'features' as every column except the first and the last 2 columns.
features = df_model.loc[:,0:38]
Any help would be appreciated.
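One likely cause of the error is that .loc selects by label, so an integer slice such as 0:38 fails when the column labels are strings. Selecting by position with .iloc avoids that: df_model.iloc[:, 1:-2] skips the first column and stops before the last two. The slice arithmetic itself can be checked without pandas. In the sketch below, the column names col_0 to col_37 and the helper feature_columns are made up for illustration.

```python
def feature_columns(columns):
    """Return every column label except the first and the last two."""
    return list(columns[1:-2])


# Hypothetical labels standing in for the 38 columns of df_model.
columns = [f"col_{i}" for i in range(38)]
features = feature_columns(columns)

print(len(features))              # 35
print(features[0], features[-1])  # col_1 col_35

# With pandas, the same positional slice would be written as:
#     features = df_model.iloc[:, 1:-2]
```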

Excel - rearrange rows with same columns value into columns in the same row

I have data from other software that outputs in the format shown attached. This is what it initially looks like. I have removed data from all other columns that were not relevant to this task (i.e. columns A - H, J, K, M, N and P).
All items have different ID #s. However, each item has the same categories. Each category can have between 1 and 3 values, either numerical or alphabetical.
The actual data I'm working with has close to 500 items.
I'm looking for a way to rearrange the data so it looks like this.
In response to a comment:
I want to do several things
1. Move everything over to start at A1; in a separate sheet is good
2. Rearrange data so the only columns are ID | CAT | Value 1 | Value 2 | Value 3
3. Have each CAT only listed once for each item
4. Move each Value with the same ID and CAT to be listed on one row
If any further elucidation is needed, please inform me.
Thanks to all!
Have you tried creating a PivotChart on a separate sheet and filtering out the blanks?
Or, have you considered filtering the information?
I think we might need some more clarification -- from your message, it seems like you just want to create headers and delete the blank rows/columns.
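If the data can be exported as plain rows, the rearrangement listed in points 2 to 4 of the question can be sketched in Python. This rests on an assumption the post cannot confirm, since the screenshots are not available: each input record is taken to be an (ID, CAT, value) triple. The function name widen and the sample records are hypothetical.

```python
from collections import defaultdict


def widen(records, max_values=3):
    """Collect the values for each (ID, CAT) pair onto one row."""
    grouped = defaultdict(list)
    for item_id, cat, value in records:
        grouped[(item_id, cat)].append(value)

    rows = []
    for (item_id, cat), values in grouped.items():
        # Pad with blanks up to max_values; extra values would be dropped.
        padded = (values + [""] * max_values)[:max_values]
        rows.append([item_id, cat, *padded])
    return rows


# Made-up records, one value per line, as assumed above.
records = [
    (1, "Color", "red"),
    (1, "Color", "blue"),
    (1, "Size", 10),
    (2, "Color", "green"),
]

wide = widen(records)
print(["ID", "CAT", "Value 1", "Value 2", "Value 3"])
for row in wide:
    print(row)
```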

Attempting to sort a list that is generated from a unique point extraction of an array

The issue is sorting an array that is generated automatically from a data source using a formula that extracts unique data points. (The data points are date/time values.)
The data is being extracted with this formula.
=INDEX(Table_ExternalData_1[SampleDateTime],MATCH(0,INDEX(COUNTIF($G$2:G2,Table_ExternalData_1[SampleDateTime]),0,0),0))
Once extracted, the data is not sorted right away. The current data is extracted from a database via an SQL string that pulls in data corresponding to the date and time at which the data point was created.
Because of this, the extracted points are not in the correct order. I am attempting to sort the extracted data points from earliest to latest to continue with the data sorting, but need the date/times to be sorted in a separate row.
I have attempted to use a pivot table, but it isn't exactly what I need and ends up being a messier end product than I need.
All assistance is appreciated.
Example is below.
1
2
3
5
1
2
3
4
6
5
3
I need this.
1
2
3
4
5
6
I did end up finding a solution that I will be able to modify. Using a single row of a pivot table, I took just the date/time column and had the PivotTable function sort the data to be utilized as necessary.
Thank you.
The fact that the range in the example you give:
1) Consists of entries of a numeric datatype only
2) Does not contain any blanks
means that the solution is relatively simple.
Assuming that data is in A1:A11, first use a single cell somewhere within the worksheet to count the number of expected returns. For example, using B1 for this purpose, enter this formula in that cell:
=SUM(IF(FREQUENCY(A1:A11,A1:A11),1))
Your main formula is then:
=IF(ROWS($1:1)>B$1,"",SMALL(IF(FREQUENCY(A$1:A$11,A$1:A$11),A$1:A$11),ROWS($1:1)))
the latter being copied down until you start to get blanks for the results.
Regards
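For comparison, the same result (unique values, sorted ascending) can be sketched outside Excel in Python. This is only an illustration using the example numbers from the question. The function name unique_sorted is hypothetical, and the approach assumes the values are hashable and mutually comparable, which also holds for date/time values of a single type.

```python
def unique_sorted(values):
    """Return the distinct values in ascending order."""
    return sorted(set(values))


# The example list from the question.
values = [1, 2, 3, 5, 1, 2, 3, 4, 6, 5, 3]
result = unique_sorted(values)
print(result)  # [1, 2, 3, 4, 5, 6]
```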

Removing entries from another file

I have two very large files we'll call Old and New. New contains many entries that Old contains. What I need to do is remove any entry from New that Old contains. There are 9,459 entries in Old with 55 columns. New contains 11,983 entries with 76 columns. I need to make the comparison based on 5 columns: 'name_last', 'name_first', 'name_middle', 'street', and 'type'.
I'm using Excel 2010, I'm very new to it, and haven't got a clue where to start.
Make up a concatenated column in each file to "glue" together 'name_last', 'name_first', 'name_middle', 'street', and 'type'. Something like this:
=LOWER(A2&B2&C2&D2&E2)
(The LOWER will let you run a case insensitive search)
Add a formula like this (change sheet names and columns to suit)
=ISNA(MATCH(F2,[old.xlsx]Sheet2!$F:$F,0))
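to look up each value in column F of "new.xlsx" against the entire list of concatenated values in "old.xlsx". The formula returns TRUE for rows that have no match in Old and FALSE for rows that Old also contains.
AutoFilter on FALSE to show the rows that also appear in Old, then delete these rows. The TRUE rows are the non-matches that should remain in New.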
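The same comparison can be sketched in Python, assuming both files have been read into lists of dictionaries keyed by column name (for example with csv.DictReader). The sample records and the names make_key and remove_known are hypothetical. The key is a tuple instead of a concatenated string, a small deviation from the formula above that avoids false matches such as "ab" + "c" versus "a" + "bc".

```python
KEY_FIELDS = ("name_last", "name_first", "name_middle", "street", "type")


def make_key(record):
    """Build a case-insensitive key from the five comparison columns."""
    return tuple(record[field].lower() for field in KEY_FIELDS)


def remove_known(new_records, old_records):
    """Return the records from New whose key does not appear in Old."""
    old_keys = {make_key(record) for record in old_records}
    return [r for r in new_records if make_key(r) not in old_keys]


# Made-up sample records for illustration only.
old = [
    {"name_last": "Smith", "name_first": "Ann", "name_middle": "B",
     "street": "1 Main St", "type": "X"},
]
new = [
    {"name_last": "SMITH", "name_first": "ann", "name_middle": "b",
     "street": "1 main st", "type": "x"},
    {"name_last": "Jones", "name_first": "Carl", "name_middle": "D",
     "street": "2 Oak Ave", "type": "Y"},
]

remaining = remove_known(new, old)
print([r["name_last"] for r in remaining])  # ['Jones']
```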

Resources