Remove Duplicates with or without sorting

I have a large column of text entries (5-digit integers concatenated with two letters, like 12345AB) and numeric values (positive integers of up to 8 digits, like 12345678). The list has around 12,200 entries in total, and when I run Remove Duplicates it reduces to 7015. If I sort the result and then run Remove Duplicates again, I am left with 6324 entries. On the other hand, if I sort first and then run Remove Duplicates, I am left with 6324 entries straight away.
Is it a common issue that, when numbers and text are mixed, Remove Duplicates works only after sorting?
I can upload my file if this is not a common issue and is instead a problem with my file. I'm guessing that if the column starts with numbers (or text), Excel's search only goes down the column until it stops seeing numbers (or text), so we miss the duplicates that show up later?
I shudder at the thought that I've been using remove duplicates incorrectly all this while.
Please help. Thanks.

It seems like what you want is to ensure that they're all the same type, no? An easy way to coerce a cell to text is:
=A1 & ""
and to a number is:
=A1 * 1
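To illustrate why coercing to one type matters, here is a minimal sketch in Python rather than Excel. The column values are made up, and the sketch does not claim to reproduce how Remove Duplicates works internally; it only shows that the number 12345678 and the text "12345678" are different values until both are turned into text, which is what =A1 & "" does.

```python
# Hypothetical column mixing text entries, numbers, and numbers stored as text.
column = ["12345AB", 12345678, "12345678", "12345AB", 87654321]

def dedupe(values):
    """Drop repeats while keeping the first occurrence of each value."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result

# Without coercion, 12345678 (number) and "12345678" (text) both survive.
mixed = dedupe(column)

# Coercing everything to text first, like =A1 & "", makes them compare equal.
as_text = dedupe([str(value) for value in column])

print(len(mixed), len(as_text))
```

With this sample data the mixed column keeps one entry more than the text-only column, because the number and its text twin are only recognized as the same value after coercion.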

I was able to accomplish this by using the Text to Columns option.
Select the column (B)
Select Text to Columns on the Data tab
Select Delimited and click Next
Click Next again, as there are no delimiters
Under Column data format select Text, then click Finish
Then remove duplicates
I ran into this issue with VLOOKUP before as well; converting the column this way ensures that all of its data is formatted consistently.

Related

Add blank column to query

Is there a way to add a blank column to a query in Query Studio? I tried to use a calculation on an existing column but the only options that I get are for First Characters, Last Characters, Concatenation, and Remove Trailing Spaces. None of these options allow you to enter a decode, case or IF statement.
Any assistance is greatly appreciated. Thanks.

It's a bit of a hack as Query Studio is really all about making it easy to get data and doing anything with layout is really a job for Report Studio, but you can do the following:
a) create a calculated column on a text field. Select 'Concatenation' as the operation and put a space as the preceding text. Click ok
b1) right-click on the new column and select 'Format', then 'Text' and enter 1 for the number of characters
or
b2) create another calculated column from the first calculated column, set it to 'first characters' and enter 1 for the number of characters. The first calculated column can now be deleted.
Both of these approaches will give a column that only contains a single space - not actually blank but close enough for most purposes. The first approach is a little quicker but may result in the text still existing in some output versions (e.g. csv) - I'd need to do more testing to confirm.
The column title can be edited (to be set to blank) by double clicking it, of course.

Find Duplicates in a column with large number (as text)

I have a spreadsheet with a column of large numbers stored as text. When I check for duplicates (I do not use any formula; I am using the Excel 2010 built-in functionality "Conditional Formatting" --> "Highlight Cells Rules" --> "Duplicate Values"), even distinct values are shown as duplicates.
For example:
If I just have following values in a column of spread sheet:
26200008004000800000000000000001
26200008004000800000000000000002
26200008004000800000000000000003
It shows all 3 values as duplicates.
How do I fix this and check for duplicates among these large numbers in Excel?
P.S.: I know Excel has a 15-digit precision limit, but is there a workaround or another application to find duplicates?

It seems that the DupUnique property is converting the value to a number. I also note similar behavior with COUNTIF. Accordingly, I would suggest that, in this situation, you use the conditional format option to use a formula. The formula I would suggest (assuming that the range to check for duplicates is A2:A10) would be:
=SUMPRODUCT(--($A2=$A$2:$A$10))>1
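A short Python sketch (not Excel) of the underlying problem: once the three example values are converted to limited-precision numbers they become indistinguishable, while comparing them as text keeps them distinct. Python's float is used here only as a stand-in for a number type with limited precision; it is not meant to model Excel's internals exactly.

```python
from collections import Counter

values = [
    "26200008004000800000000000000001",
    "26200008004000800000000000000002",
    "26200008004000800000000000000003",
]

# Converted to limited-precision numbers, all three collapse to one value.
as_numbers = {float(v) for v in values}

# Compared as text, every value is distinct, so nothing counts as a duplicate.
text_counts = Counter(values)
duplicates = [v for v, n in text_counts.items() if n > 1]

print(len(as_numbers), duplicates)
```

The formula above works for the same reason: the comparison $A2=$A$2:$A$10 compares the cells as text instead of converting them to numbers first.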

I use a helper column in which I concatenate the number with a letter to make it an alphanumeric entry.
=concatenate("a",'large number cell')
or
="a"&'large number cell'
a26200008004000800000000000000001
I hope this works for you.

When pasting the numbers into Excel, put an apostrophe in front of the number to convert the number to text like this
'26200008004000800000000000000001
Thereafter you can do duplicate checks using Data -> Remove Duplicates.
If you already have that kind of data in Excel, it may appear in exponential notation, and chances are that Excel has already truncated it to 15 digits of numeric precision. You may have to re-enter the large values with an apostrophe in front of them.

Sorting a long list in excel numerically with numbers formatted as non-numeric values mixed in

I am trying to sort a large column of numbers in Excel. This list contains many blank rows. When I highlight the column and select the sort function, I am only given the option to sort from "A-Z" and not from "smallest to largest", which is how I wish to sort my list. I am assuming this is because, although the list appears to be made solely of numbers, some of the numbers must have been added in another format (such as text) and therefore cannot be sorted from smallest to largest. Unfortunately, though I have searched, I cannot find a way, short of re-typing each number on the list (which is not feasible), to determine which numbers are in numeric form and which are not.
I have tried to follow the suggestion found here (http://excel.tips.net/T002922_Sorting_Huge_Lists.html), which recommends typing the number "1" into a blank cell, copying it, highlighting the data you wish to convert to numeric form, and using Paste Special with the Multiply operation, so that the resulting list is a reproduction of your original list formatted solely in numeric form. The problem is that my list contains many blank cells, and with this strategy all blanks are treated as "0", so after the multiplication all of my blanks have been converted to "0". This is unacceptable, as "0" is one of the possible numbers in my data set, so confusing blanks with true "0"s cannot happen.
I have also tried to highlight the list and change the formatting via the formatting bar on the home tab to "numbers" from "general" but this has not helped.
Is there another way to either convert this column of numbers to a column of numbers solely in numeric form or to otherwise sort this list from smallest to largest?
OR could there be another reason such as the length of the list (3000 rows long) or the multiple blank spaces that are causing the list not to be able to sort from smallest to largest?
Please keep in mind I am a VERY VERY novice excel coder and cannot use macros. Any suggestions must be basic or well explained in a step by step manner.
Thank you so much!
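The goal described above, converting text-formatted numbers to real numbers while leaving blanks blank, can be sketched outside Excel in Python. This is only an illustration of the logic with made-up data, not an Excel procedure; the column contents and the to_number helper are assumptions.

```python
# Hypothetical column: real numbers, numbers stored as text, and blanks.
column = [10, "3", "", 0, " 7 ", None, "0"]

def to_number(cell):
    """Return a number for numeric cells, or None for blanks (never 0)."""
    if cell is None:
        return None
    if isinstance(cell, str):
        cell = cell.strip()
        if cell == "":
            return None  # keep blanks distinct from a true 0
        return float(cell) if "." in cell else int(cell)
    return cell

converted = [to_number(cell) for cell in column]

# Sort smallest to largest, pushing blanks to the end.
numbers = sorted(value for value in converted if value is not None)
blanks = [value for value in converted if value is None]
sorted_column = numbers + blanks

print(sorted_column)
```

The key design choice is that a blank maps to a separate "no value" marker instead of 0, so true zeros and blanks never get mixed up during the sort.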

How do I remove duplicate content within a single excel cell

I have individual cells in excel with the following content in each of them
http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/m1423.jpg|http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/m1423.jpg
http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/rt2899.jpg|http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/rt2899.jpg
This is one cell in a long row of a data dump of products from an ecommerce site. A data migration has somehow added the same image more than once to the same product. Each image is separated by the pipe "|" symbol.
I want to search each cell in this column of the sheet and remove the duplicated image reference and the Pipe symbol.
So the examples above become
http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/m1423.jpg
and
http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/rt2899.jpg

The suggested answer of finding the pipe with SEARCH is a good general answer. However, in this instance the source string is always twice the length of the desired result plus one character for the pipe, so we can just chop it in half with the formula below and drag it down.
=LEFT(A1,(LEN(A1)-1)/2)
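The same halving idea in Python, as a minimal sketch. The halve function relies on the same assumption as the formula, namely that each cell holds exactly two identical strings separated by one pipe. The dedupe_pipe function is a more general variant added here for illustration (it is not part of the answer) that drops repeats among any number of pipe-separated entries.

```python
def halve(cell):
    """Mirror =LEFT(A1,(LEN(A1)-1)/2): keep the part before the middle pipe."""
    return cell[: (len(cell) - 1) // 2]

def dedupe_pipe(cell):
    """Drop repeated pipe-separated entries, keeping first occurrences in order."""
    return "|".join(dict.fromkeys(cell.split("|")))

url = "http://www.teng.mossdemo.com.au/wp-content/uploads/images/products/m1423.jpg"
cell = url + "|" + url

print(halve(cell))
print(dedupe_pipe(cell))
```

For the cells in the question both functions return the single image URL; only dedupe_pipe stays correct if a cell contains different images or more than two entries.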

In addition to a formula, you can use Data > Text to Columns, which is a good thing to know about. Select the entire column and then open the dialog. In step one choose "Delimited", and in step two choose "Other" and enter the pipe symbol as the delimiter.
When you're finished, delete the first column.

I figured out that this works for some more complex scenarios. I think it should work for this one as well.
=IFERROR(LEFT(C2,(FIND(LEFT(C2,20),C2,2)-2)),C2)
I entered this into D2 and copied it all the way down the column. I then copied and pasted the values back into Column C.
The problem I had was that not all of the cells in my column had duplicate text. Of those that did, the duplications were not delineated by any unique character (There was a single space in front of each duplication.), and the duplicated text was often an incomplete duplication so the length was not consistently symmetrical.
The "20" is an arbitrary number of characters I picked for excel to use from the front of the text to identify where the text started to repeat. There are enough people here who know excel better than I who can explain what the rest of the formula does. I figured it out by poking around.
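For readers who want the formula spelled out, here is a step-by-step sketch of the same logic in Python. The function name strip_repeat and the sample text are made up for illustration; the comments map each step back to the part of the formula it mirrors.

```python
def strip_repeat(text, probe_len=20):
    """Mirror =IFERROR(LEFT(C2,(FIND(LEFT(C2,20),C2,2)-2)),C2)."""
    probe = text[:probe_len]      # LEFT(C2,20): the first 20 characters
    # FIND(probe, C2, 2): look for the probe again, starting at the 2nd character.
    index = text.find(probe, 1)   # 0-based index; -1 when there is no repeat
    if index == -1:
        return text               # IFERROR(..., C2): leave the cell unchanged
    # FIND's 1-based position is index + 1, so LEFT(C2, position - 2)
    # keeps everything before the single space that precedes the repeat.
    return text[: index - 1]

# Hypothetical cell: full text, one space, then an incomplete duplication.
sample = "Stainless steel torque wrench set Stainless steel torque wr"
print(strip_repeat(sample))
print(strip_repeat("No repeat here"))
```

As in the formula, the probe length of 20 is arbitrary: it needs to be long enough that the opening characters do not occur again by chance, and cells shorter than the probe are returned unchanged.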

How can I batch format string attributes in a CSV file?

So I have a csv file of around 15,000 rows. I only need to edit one of the 10 columns which is a Postcode. None of the columns have headers. It is currently in the format 'AB101AA' which I need to change to 'AB10 1AA'.
First off, is there a method for which I can do this for every row?
Then it gets more complicated in that Postcodes vary in format to these four types;
'A1 1AA',
'A10 1AA',
'AB1 1AA' and
'AB10 1AA'.
What I'm trying to do is to find a way to run through every row and first of all test the format to check whether it is as above and then edit if needs be, to force that space.
Any help would be much appreciated.
Cheers.

How about opening it in Excel, then:
Use a formula to add a column which takes the first LEN(A1)-3 characters, a space, and the last three characters, for example =LEFT(A1,LEN(A1)-3)&" "&RIGHT(A1,3), assuming the postcode is in column A and does not contain a space yet (copy/paste, or drag the + on the lower right corner of the cell, to make sure the formula is replicated in every row of that new column)
Copy the extra column
Paste the values over the original column
Delete the extra column
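The steps above can also be sketched as a small Python script for doing the whole file in one pass. This is a sketch under stated assumptions: the postcode column index, the sample rows, and the helper names are hypothetical, and the function only handles the four formats listed in the question (outward part followed by three final characters).

```python
import csv
import io

POSTCODE_COLUMN = 2  # hypothetical: zero-based index of the postcode column

def format_postcode(raw):
    """Insert a single space before the last three characters."""
    compact = raw.replace(" ", "")
    if len(compact) < 5:
        return raw  # too short to be one of the four formats; leave untouched
    return compact[:-3] + " " + compact[-3:]

def reformat_rows(rows, column=POSTCODE_COLUMN):
    """Yield each row with its postcode column reformatted."""
    for row in rows:
        row = list(row)
        row[column] = format_postcode(row[column])
        yield row

# Made-up sample standing in for the real headerless CSV file.
sample = "1,Smith,AB101AA\n2,Jones,A1 1AA\n3,Brown,AB11AA\n"
reader = csv.reader(io.StringIO(sample))
output = io.StringIO()
csv.writer(output, lineterminator="\n").writerows(reformat_rows(reader))
print(output.getvalue())
```

Removing any existing space before re-inserting it means rows that are already correctly formatted pass through unchanged, so there is no need to test the format first.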
