Removing duplicates within a cell - excel

I can't find a way to remove duplicate values inside a same cell in Excel. For example, in A1, I have:
DOG DOG DOG
I want to have only DOG.
Code output: my script writes values like these into the Excel cell (37, 4):
2000 3000 0300 0300 2000
I am lost as to how to delete the repeated values in the cell.

You can use a regex with a backreference to match duplicated words or phrases. The nongreedy pattern ^(.+?)\s(\1\s*)+$ will match any duplicating phrase with whitespace in between.
Function RemoveDupes(strText)
    ' Default: return the input unchanged when nothing repeats
    RemoveDupes = strText
    With New RegExp
        .Pattern = "^(.+?)\s(\1\s*)+$"
        If .Test(strText) Then RemoveDupes = .Replace(strText, "$1")
    End With
End Function
Tests:
WScript.Echo RemoveDupes("Dog Cat Dog Cat") ' => Dog Cat
WScript.Echo RemoveDupes("Dog Dog") ' => Dog
WScript.Echo RemoveDupes("Dog Dog Dog") ' => Dog
WScript.Echo RemoveDupes("Dog Dog Dog Dog") ' => Dog
WScript.Echo RemoveDupes("Dog Cat Dog Cat Dog Cat") ' => Dog Cat
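For comparison, here is a minimal Python sketch of the same backreference idea, using the standard re module. The function name remove_dupes is only illustrative. It returns the input unchanged when the text is not a perfect repetition, which is an assumption about the desired behaviour.

```python
import re

# Same pattern as above: a lazily matched phrase followed by one or more repeats of it.
_REPEATED = re.compile(r"^(.+?)\s(\1\s*)+$")

def remove_dupes(text: str) -> str:
    """Collapse a perfectly repeating phrase to one copy; otherwise return text as-is."""
    match = _REPEATED.match(text)
    return match.group(1) if match else text

for sample in ("Dog Cat Dog Cat", "Dog Dog Dog", "2000 3000 0300 0300 2000"):
    print(repr(sample), "->", repr(remove_dupes(sample)))
```

The last sample does not repeat perfectly, so the pattern does not match and the text comes back untouched.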
Edit:
I see you've added some additional examples that don't repeat perfectly as your original examples did. In that case, you'll need to use an alternative method. If your values are always separated by a space, consider splitting your values into an array and storing them into a Dictionary to keep track of unique values. For example:
Dim d, a, i
Set d = CreateObject("Scripting.Dictionary")
a = Split(objExcel.ActiveSheet.Range("A1"), " ")
For i = 0 To UBound(a)
If Not d.Exists(a(i)) Then d.Add a(i), ""
Next
Now your Dictionary should have unique values from your cell and you can recombine them:
objExcel.ActiveSheet.Range("A1") = Join(d.Keys, " ")
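The split-and-track-uniques approach can be sketched in Python as well. This is an illustration rather than a drop-in replacement for the script above: unique_tokens is a hypothetical helper, and it assumes the values are separated by single spaces and that the first occurrence of each value should be kept.

```python
def unique_tokens(cell_text: str, sep: str = " ") -> str:
    """Keep the first occurrence of each token, preserving the original order."""
    # Skip empty tokens produced by doubled separators.
    tokens = [t for t in cell_text.split(sep) if t]
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates in order.
    return sep.join(dict.fromkeys(tokens))

print(unique_tokens("2000 3000 0300 0300 2000"))
```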

Something like this should work. We need to find the position of the first space in DOG DOG and take everything to its left.
Excel:
=LEFT(A1,SEARCH(" ",A1)-1)
VBS:
Left(string, length)
Ref. https://msdn.microsoft.com/en-us/library/sk3xcs8k%28v=vs.84%29.aspx
InStr:
SOMESTRING = "DOG DOG DOG"
InStr (1, SOMESTRING, " ") '-- look for the position of the space
Ref. https://msdn.microsoft.com/en-us/library/wybb344c%28v=vs.84%29.aspx
Putting the two together we get:
SOMESTRING = "DOG DOG DOG"
RESULTS = Left(SOMESTRING, InStr (1, SOMESTRING, " ")-1)
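One caveat with the Left/InStr combination: it assumes the string contains at least one space. When there is none, InStr returns 0 and Left is asked for -1 characters, which fails. A small Python sketch of the same "take the first word" idea that also tolerates a single-word value (first_word is an illustrative name):

```python
def first_word(text: str) -> str:
    """Return everything before the first space, or the whole string if there is no space."""
    # partition returns (head, separator, tail); head is the full string when " " is absent.
    head, _sep, _tail = text.partition(" ")
    return head

print(first_word("DOG DOG DOG"))
print(first_word("DOG"))
```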
Hope this helps!

Related

cut important parts of string in a column

I have a column called Dateiname which contains a string. My goal is to extract only the word Gruen, Gelb or Orange from that column and store it in a new column, so that each row shows which of the three its Dateiname contains.
I tried this code:
result['Y'] = result.Dateiname.str[-10:-4]
Because these words differ in length, I get a leading 4_ or 1_ or just _, depending on whether the file name contains Gruen or Gelb. Is there a way to extract Gruen, Gelb or Orange from the column Dateiname and save it in the column Y?
the goal would be this:
Use str.extract:
result['Y'] = result.Dateiname.str[-10:-4].str.extract('(Gruen|Gelb|Orange)')
Another solution is to split on _ or . and take the second value from the end by indexing:
result.Dateiname.str.split(r'_|\.').str[-2]
Or, if you want to search the whole string:
result['Y'] = result.Dateiname.str.extract('(Gruen|Gelb|Orange)')
If your data always has the required word immediately followed by .csv, use str.extract with a regex:
For example:
result = pd.DataFrame({'Dateiname':['asdfjaskld_3242_34.fsdf_450_Violet.csv',
'asdfjaskld_3242_34.fsdf_450_Green.csv',
'asdfjaskld_3242_34.fsdf_450_Indigo.csv',
'asdfjaskld_3242_34.fsdf_450_Red.csv']})
result['Y'] = result.Dateiname.str.extract(r'([a-zA-Z]+)\.csv$')
print(result)
Dateiname Y
0 asdfjaskld_3242_34.fsdf_450_Violet.csv Violet
1 asdfjaskld_3242_34.fsdf_450_Green.csv Green
2 asdfjaskld_3242_34.fsdf_450_Indigo.csv Indigo
3 asdfjaskld_3242_34.fsdf_450_Red.csv Red
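When extracting the word in front of .csv, it is worth treating the dot as a literal and anchoring the pattern at the end of the name. The standard-library sketch below shows the difference on a made-up file name (both the name and the helper word_before_csv are hypothetical). The same patterns can be passed to str.extract.

```python
import re

LOOSE = re.compile(r"([a-zA-Z]+).csv")      # '.' matches any character here
STRICT = re.compile(r"([a-zA-Z]+)\.csv$")   # literal dot, anchored at the end of the name

def word_before_csv(name, pattern):
    """Return the first captured group, or None when the pattern does not match."""
    found = pattern.search(name)
    return found.group(1) if found else None

name = "data_csv_450_Red.csv"  # hypothetical file name containing "csv" twice
print(word_before_csv(name, LOOSE))   # picks up the earlier "_csv"
print(word_before_csv(name, STRICT))  # picks up the word before the extension
```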
You can use:
result['Y'] = result['Dateiname'].str.split('_').str[-1].str[:-4]

comparing two address columns

I have addresses in columns U and V. I want to check whether they are roughly similar: if they are, return "Update"; if not, return "Omit".
For example 246 N High street in U and 246 North High St in V would return a value of Update.
246 N High Street in U and 458 Auburn Drive in V would return a value of Omit.
Any ideas?
There are many algorithms for fuzzy matching. One of the easier ones to implement in Excel is n-gram matching.
To perform an n-gram match, we break each address into a list of overlapping substrings of n characters each (spaces count as characters). A 2-gram list of your address 246 N High street would look like 24,46,6 , N,N , H,Hi,ig,gh,h , s,st,tr,re,ee,et. We could do the same with a 3-gram: 246,46 ,6 N, N ,N H, Hi,Hig,igh,gh ,h s, st,str,tre,ree,eet
We do this with both addresses, then we can check each item in the first address's list to see if it appears in the second address's list; count the matches and divide that by the number of items in the first list. That will give you a percentage of how close they are.
You could get fancy with cell formulas mid() and countif() to do this with sheet formulas, but I think it's easier to just write it out in VBA and make it a UDF.
Function NGramCompare(string1 As String, string2 As String, intGram As Integer) As Double
'Take in two strings and the N-gram
Dim intChar As Integer, intGramMatch As Integer
Dim ngramList1 As String, ngramList2 As String, nGram As Variant
Dim nGramArr1 As Variant
'split the first string into a list of ngrams
For intChar = 1 To Len(string1) - (intGram-1)
If ngramList1 <> "" Then ngramList1 = ngramList1 & ","
ngramList1 = ngramList1 & Mid(string1, intChar, intGram)
Next intChar
'split the second string into a list of ngrams
For intChar = 1 To Len(string2) - (intGram-1)
If ngramList2 <> "" Then ngramList2 = ngramList2 & ","
ngramList2 = ngramList2 & Mid(string2, intChar, intGram)
Next intChar
'Split the ngramlist1 into an array through which we can iterate
nGramArr1 = Split(ngramList1, ",")
'Iterate through array and compare values to ngramlist2
For Each nGram In nGramArr1
If InStr(1, ngramList2, nGram) Then
'we found a match, add to the counter
intGramMatch = intGramMatch + 1
End If
Next nGram
'output the percentage of grams matching.
NGramCompare = intGramMatch / (UBound(nGramArr1) + 1)
End Function
If you've never used a UDF:
Go to visual basic editor (VBE) with Alt+F11
In the VBA Project window, find your workbook and right click on the name
Choose: Insert>>Module
Double click the new module in the list to bring up its code window
Paste this function in and save your workbook
Then, assuming address1 is in A1 and address2 is in B1 you can put, in C1:
=NGramCompare(A1, B1, 2)
For your first pair of addresses, this returns 56%, which seems like a reasonably good match. If you find you are getting too many positive hits, you can change your 2-gram to a 3-gram by changing that last parameter.
To take it a step further so it will say "Update" or "Omit" you could do:
=If(NGramCompare(A1, B1, 2)>.30, "Update", "Omit")
I just set that so that it will consider a match anything above 30%, but you can adjust as necessary. No matter where you set it, you will probably end up with a percentage of compares that are false positives or false negatives, but that's the way fuzzy matching goes.
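To make the scoring concrete, here is a rough Python sketch of the same n-gram comparison and threshold. It is not a port of the UDF: it looks grams up in a set rather than searching a comma-joined string, and the names ngram_compare and classify are illustrative. Like the default VBA comparison, it is case-sensitive.

```python
def ngrams(text: str, n: int) -> list:
    """All overlapping substrings of length n, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_compare(first: str, second: str, n: int = 2) -> float:
    """Fraction of the first string's n-grams that also occur in the second string."""
    grams_first = ngrams(first, n)
    if not grams_first:
        return 0.0  # string shorter than n: nothing to compare
    grams_second = set(ngrams(second, n))
    hits = sum(1 for gram in grams_first if gram in grams_second)
    return hits / len(grams_first)

def classify(first: str, second: str, threshold: float = 0.30) -> str:
    """Return "Update" when the score exceeds the threshold, else "Omit"."""
    return "Update" if ngram_compare(first, second) > threshold else "Omit"

print(ngram_compare("246 N High street", "246 North High St"))  # 9 of 16 bigrams match
print(classify("246 N High street", "246 North High St"))
print(classify("246 N High Street", "458 Auburn Drive"))
```

Lower-casing both strings before comparing would raise the score for pairs such as street/St. Whether that is desirable depends on your data.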
Some naive approaches are to compare the first few characters:
=LEFT(A1,5)=LEFT(B1,5)
or to replace parts until they match:
=(SUBSTITUTE(SUBSTITUTE(LOWER(A2)," street"," ST")," north "," N ")=SUBSTITUTE(SUBSTITUTE(LOWER(B2)," street"," ST")," north "," N "))
Both will probably turn into a big, ugly formula once you adjust for most cases.

Inserting a Word in Excel Cell At Specified Position

In Excel I have five columns: A, B, C, D and E. Each row of column A contains a paragraph. Columns B, C, D and E hold four different words in the same row as each paragraph in column A. I want to insert these four words into the paragraph in the column A cell, spaced evenly throughout it: one word at the beginning of the paragraph, and the other three distributed evenly through the rest.
I have removed the leading and trailing spaces by applying "TRIM" function. The paragraph is composed of multiple lines with line breaks and multiple sub paragraphs.
Note: If the solution is flexible enough to handle more words, e.g. 7, 8 or 9, that would be great.
Following code might help. This is a UDF.
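Function InsertWord(Source As String, InsWord As String, Pos As Integer) As String
    'Insert InsWord after the word at 1-based position Pos
    Dim arr() As String
    Dim lastIndex As Long
    arr = Split(Source, " ")
    lastIndex = UBound(arr) 'zero-based index of the last word
    If lastIndex < 1 Or (Pos - 1) > lastIndex Or Pos < 1 Then
        'Single-word text or out-of-range position: return the input unchanged
        InsertWord = Source
    Else
        arr(Pos - 1) = arr(Pos - 1) & " " & InsWord
        InsertWord = Join(arr, " ")
    End If
End Function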
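The UDF inserts one word at one position. The even spacing asked for in the question can be sketched as follows, in Python for illustration. This assumes that "equally spaced" means placing word i in front of the token at index (i * token_count) // word_count, so the first word opens the paragraph. insert_evenly is a hypothetical helper, not part of the answer above.

```python
def insert_evenly(paragraph: str, words: list) -> str:
    """Spread the given words through the paragraph at roughly equal token intervals."""
    if not words:
        return paragraph
    # Split on single spaces only, so line breaks stay attached to their tokens.
    tokens = paragraph.split(" ")
    token_count, word_count = len(tokens), len(words)
    # Map each target token index to the word(s) that should be placed in front of it.
    slots = {}
    for i, word in enumerate(words):
        slots.setdefault((i * token_count) // word_count, []).append(word)
    out = []
    for index, token in enumerate(tokens):
        out.extend(slots.get(index, []))
        out.append(token)
    return " ".join(out)

print(insert_evenly("a b c d e f g h", ["W", "X", "Y", "Z"]))
```

The list can hold any number of words, which covers the 7, 8 or 9 word case as long as the paragraph has at least that many tokens.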
See image for reference:

Separating Data in the same excel column

I have a column of data with multiple value types in it. I am trying to separate each value type into its own column. Below is an example of the data:
6 - Cutler, Jay (Ovr: 83)
22 - Forte, Matt (Ovr: 88)
86 - Miller, Zach (Ovr: 80)
I tried to separate the data by a) going to data and clicking text to columns; however, the "Ovr: 80" portion of the data does not separate "Ovr" from 80. I also tried b) to convert to .csv file, but again was unable to separate "Ovr" from "80". Is there a formula I can use to separate this portion of the data from the rest?
I would like the data to be separated into different columns as shown below:
6  | Cutler | Jay  | Ovr | 83
22 | Forte  | Matt | Ovr | 88
86 | Miller | Zach | Ovr | 80
Any insight is much appreciated!
Select the cells you wish to process and run this macro to place results in the cells to the right of the selected cells:
Option Explicit
Sub dural()
Dim r As Range, s As String, ary
Dim i As Long, a
For Each r In Selection
s = r.Value
If s <> "" Then
s = Replace(Replace(s, "-", " "), ",", " ")
s = Replace(Replace(s, "(", " "), ")", " ")
s = Application.WorksheetFunction.Trim(Replace(s, ":", " "))
ary = Split(s, " ")
i = 1
For Each a In ary
r.Offset(0, i).Value = a
i = i + 1
Next a
End If
Next r
End Sub
Using the method above, you could do something like this:
First, clean the text so it is more manageable. Enter this formula in a helper column and copy it down; it turns each value into a space-delimited string:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,"- ",""),",",""),"(",""),")",""),":","")
From there, copy the values the formula gives you (to a new sheet, if you like) and then use Text to Columns to split them into columns.
For the record, I do not recommend this method if you are willing to use the Text to Columns option.
Functions used for this solution are:
LEFT function
FIND function
MID function
for your first column of text use the following:
=left(A1,find(" ",A1))*1
That will pull out the first number presuming you do not have any leading spaces. The *1 converts from text to a number.
For your second column, the last names, use the following:
=MID(A1,FIND("-",A1)+2,FIND(",",A1)-(FIND("-",A1)+2))
Provided you have a comma and a dash as in your example data, you will not get an error, and it should pull the last name without the comma.
For your third column of first names follow the same general technique as last names with the following,
=MID(A1,FIND(",",A1)+2,FIND("(",A1)-2-(FIND(",",A1)+2)+1)
Follow a similar pattern to get your Ovr column:
=MID(A1,FIND("(",A1)+1,FIND(":",A1)-1-(FIND("(",A1)+1)+1)
And finally, to get the column with the number after "Ovr:", use this:
=MID(A1,FIND(":",A1)+2,FIND(")",A1)-1-(FIND(":",A1)+2)+1)
copy the above formulas down as far as you need to go.
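If the data can be processed outside Excel, the five pieces can also be pulled out with a single regular expression. The sketch below assumes every row follows the "number - Last, First (Label: number)" layout from the question. split_row is an illustrative name, and rows that do not fit the layout yield None.

```python
import re

# number - last name , first name ( label : number )
ROW = re.compile(r"^\s*(\d+)\s*-\s*([^,]+?)\s*,\s*(.+?)\s*\((\w+):\s*(\d+)\)\s*$")

def split_row(text: str):
    """Return (number, last, first, label, rating) as strings, or None if the row does not fit."""
    found = ROW.match(text)
    return found.groups() if found else None

for row in ("6 - Cutler, Jay (Ovr: 83)", "22 - Forte, Matt (Ovr: 88)", "86 - Miller, Zach (Ovr: 80)"):
    print(split_row(row))
```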

How to concatenate a list of words into a sentence with "and" before last item in Excel?

I want to join a list of words in Excel (not in VBA... with an Excel formula in the worksheet) to the following specifications:
Formula should ignore empty cells.
Formula should concatenate the words with "and" before final item if there is more than one item in the array of cells.
Formula should add "," between items if there are more than two items.
Examples:
A1=dog
A2=cat
A3=bird
A4=fish
Result would be: dog, cat, bird, and fish
A1=dog
A2=cat
A3=(empty cell)
A4=fish
Result would be: dog, cat, and fish
A1=dog
A2=(empty cell)
A3=bird
A4=(empty cell)
Result would be: dog and bird
A1=dog
A2=(empty cell)
A3=(empty cell)
A4=(empty cell)
Result would be: dog
Pretty please? I promise I've searched and searched for the answer.
Edit: Thank you, ExcelArchitect, I got it! This was the first time I'd ever used a custom function. You use it just like any other function in the worksheet! This is so great.
Not to push my luck, but how do I get two cells to concatenate with my result if there is only one word in the result and two other cells if there is more than one word? Example: If the function you made for me returns just "dog", I'd want it to concatenate a cell with the text (B1) "My favorite thing to wear is a " and then "dog" and then another cell (B2) that says " costume." to make the sentence "My favorite thing to wear is a dog costume." But if it returns more than one animal, it would concatenate two other cells like this: Cell C1 "My favorite things to wear are " and "dog, cat, and bird" and Cell C2 " costumes." so that it would say "My favorite things to wear are dog, cat, and bird costumes."
If you're curious, my data really has nothing to do with animals or costumes. I am writing a program that will score a psychological test and then create an interpretive report from the test scores (I'm a psychologist).
-Mary Anne
Mary Anne:
This would be a great time to use VBA! But if you don't want to, there is a way to accomplish your goal without it.
You have to account for all of the possible outcomes here. With 4 cells that can each be filled or empty, that means 15 outcomes (2^4 - 1 combinations with at least one animal).
Your equation just has to take into account all 15. It is VERY long and drawn out as a result. As such, if you have more than 4 animals that you'd like to turn into phrases, you should go the VBA route.
Here is my set up:
The formula in A7 is the following:
=IF(AND(A2<>"", A3="", A4="", A5=""), A2, IF(AND(A2="", A3<>"", A4="", A5=""), A3, IF(AND(A2="", A3="", A4<>"", A5=""), A4, IF(AND(A2="", A3="", A4="", A5<>""), A5, IF(AND(A2<>"", A3<>"", A4="", A5=""), A2&" and "&A3, IF(AND(A2<>"", A3="", A4<>"", A5=""), A2&" and "&A4, IF(AND(A2<>"", A3="", A4="", A5<>""), A2&" and "&A5, IF(AND(A2="", A3<>"", A4<>"", A5=""),A3&" and "&A4, IF(AND(A2="", A3<>"", A4="", A5<>""), A3&" and "&A5, IF(AND(A2="", A3="", A4<>"", A5<>""),A4&" and "&A5, IF(AND(A2<>"", A3<>"", A4<>"", A5=""), A2&", "&A3&", and "&A4, IF(AND(A2<>"", A3<>"", A4="", A5<>""), A2&", "&A3&", and "&A5, IF(AND(A2<>"", A3="", A4<>"", A5<>""), A2&", "&A4&", and "&A5, IF(AND(A2="", A3<>"", A4<>"", A5<>""), A3&", "&A4&", and "&A5, A2&", "&A3&", "&A4&", and "&A5))))))))))))))
Here it is via Excel:
Mary Anne - I'm such a nerd that I had to do this. Here is the VBA solution, and you can have as many names as you want! Paste this code into a new module in the workbook (go to Developer -> Visual Basic, then Insert -> New Module, and paste), then you can use it in your worksheet like a regular function. Just give it the range where the names are and you should be good to go! -Matt
Function CreatePhrase(NamesRng As Range) As String
'Creates a comma-separated phrase given a list of words or names
Dim Cell As Range
Dim l As Long
Dim cp As String
'Add commas between the values in the cells
For Each Cell In NamesRng
If Not IsEmpty(Cell) And Not Cell.Value = "" And Not Cell.Value = " " Then
cp = cp & Cell.Value & ", "
End If
Next Cell
'Remove trailing comma and space
If Right(cp, 2) = ", " Then cp = Left(cp, Len(cp) - 2)
'If there is only one value (no commas) then quit here
If InStr(1, cp, ",", vbTextCompare) = 0 Then
CreatePhrase = cp
Exit Function
End If
'Add "and" to the end of the phrase
For l = 1 To Len(cp)
If Mid(cp, Len(cp) - l + 1, 1) = "," Then
cp = Left(cp, Len(cp) - l + 2) & "and" & Right(cp, l - 1)
Exit For
End If
Next l
'If there are only two words or names (only one comma) then remove the comma
If InStr(InStr(1, cp, ",", vbTextCompare) + 1, cp, ",", vbTextCompare) = 0 Then
cp = Left(cp, InStr(1, cp, ",", vbTextCompare) - 1) & Right(cp, Len(cp) - InStr(1, cp, ",", vbTextCompare))
End If
CreatePhrase = cp
End Function
Hope that helps!
Matt, via ExcelArchitect.com
VBA is simpler. A formula is quite complicated, since Excel has no native functions allowing concatenation of a range. However, given that you have written that you would have up to eight animals, it is doable with the following formula which concatenates the contents of A1:A8 according to your rules. You can change those locations in the formula in the obvious locations.
I made one change: I may be wrong, but I believe English rules indicate that the comma preceding the last and should be omitted, so I did so. It could be added in if necessary. EDIT: Further investigation reveals a difference between US and UK rules: US rules are as you requested, UK rules omit the comma before the conjunction. I will modify the formulas and UDF to comply with US conventions.
In the formulas, the modification is to place a comma immediately prior to the and. The change in the UDF is likewise minor.
The formula was constructed from the following sequences:
So putting those formulas together, so as only to refer to A1:A8, we wind up with this monster:
=SUBSTITUTE(IFERROR(SUBSTITUTE(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2),",",",and ",LEN(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2))-LEN(SUBSTITUTE(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2),",",""))),MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","),2,LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CONCATENATE(",",A1,",",A2,",",A3,",",A4,",",A5,",",A6,",",A7,",",A8,","),",,",","),",,",","),",,",","))-2)),",",", ")
Here is a VBA solution which allows for any number of items; it concatenates according to the same rules as above.
Option Explicit
Function ConcatRangeWithAnd(RG As Range, Optional Delim As String = ", ")
Dim COL As Collection
Dim C As Range
Dim S As String
Dim I As Long
Set COL = New Collection
For Each C In RG
If Len(C.Text) > 0 Then COL.Add C.Text
Next C
Select Case COL.Count
Case 0
Exit Function
Case 1
ConcatRangeWithAnd = COL(1)
Case 2
ConcatRangeWithAnd = COL(1) & " and " & COL(2)
Case Else
For I = 1 To COL.Count - 1
S = S & COL(I) & ", "
Next I
ConcatRangeWithAnd = S & "and " & COL(COL.Count)
End Select
End Function
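The case analysis in this UDF (zero, one, two, or more items) carries over directly to other languages. A minimal Python sketch, with join_with_and as an illustrative name, assuming blanks arrive as empty strings or None:

```python
def join_with_and(items, delim: str = ", ") -> str:
    """Join non-empty items with delim, putting "and" before the last one."""
    words = [str(item) for item in items if item is not None and str(item) != ""]
    if len(words) <= 1:
        return "".join(words)              # zero or one item: nothing to join
    if len(words) == 2:
        return f"{words[0]} and {words[1]}"  # two items: no comma
    # Three or more: comma after every item, including the one before "and".
    return delim.join(words[:-1]) + delim + "and " + words[-1]

print(join_with_and(["dog", "cat", "bird", "fish"]))
print(join_with_and(["dog", "", "bird", ""]))
```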
With the new TEXTJOIN function, this can be done very easily.
Step 1: Use the TEXTJOIN function with the ", " delimiter and set ignore_empty to TRUE. This gives you a comma-separated, concatenated string that skips the blank values.
Step 2: Count the non-blank entries in the list with the COUNTA function and subtract 1. Floor the result at 1 with the MAX function so that a single-item list still yields a valid instance number.
Step 3: Use the SUBSTITUTE function to replace the last comma, whose instance number was calculated in Step 2, with " and ".
Putting it all together:
=SUBSTITUTE(TEXTJOIN(", ",TRUE,A1:A14),", "," and ",MAX(1,COUNTA(A1:A14)-1))
Substitute any range you want for A1:A14 in the formula above, and you will get a comma-separated concatenation with an "and" before the last word.
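To see what the three steps produce, the sketch below imitates them in Python. substitute_nth and textjoin_and are hypothetical helpers that mimic SUBSTITUTE with an instance number and the combined formula; they are not Excel APIs. Note that replacing the last ", " with " and " yields no comma before the "and", unlike the examples in the question.

```python
def substitute_nth(text: str, old: str, new: str, n: int) -> str:
    """Replace only the n-th (1-based) occurrence of old; return text unchanged if absent."""
    if n < 1:
        return text
    position = 0
    index = -1
    for _ in range(n):
        index = text.find(old, position)
        if index == -1:
            return text
        position = index + len(old)
    return text[:index] + new + text[position:]

def textjoin_and(cells) -> str:
    """Join non-empty cells with ", " and turn the last separator into " and "."""
    filled = [cell for cell in cells if cell != ""]   # Step 1: ignore empty cells
    joined = ", ".join(filled)
    instance = max(1, len(filled) - 1)                # Step 2: index of the last separator
    return substitute_nth(joined, ", ", " and ", instance)  # Step 3

print(textjoin_and(["dog", "cat", "bird", "fish"]))
print(textjoin_and(["dog", "", "bird", ""]))
```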
Regarding duplicates:
Firstly, I really love Matt's solution and I've added this to my collection of custom functions.
What I do miss though is the possibility to remove duplicates from the phrase without removing them from the original range.
As you can't create a virtual range (a range that you can just play with in VBA independently from your source data), the solution would probably involve converting the range to an array, running some deduplication code and then creating the phrase from that.
My solution (albeit inelegant) is just to use the UNIQUE and FILTER functions to get a deduplicated list elsewhere on the spreadsheet (can be hidden if it bothers you) and to use Matt's function on that.
=UNIQUE(FILTER(yourRange,yourRange<>""))
