Find specific text in a string - excel

=IFERROR(MID(I46,FIND("-",I46,8)-2,5),"")
I have a vast string of information copied into one cell from another program. There are several parts of this string that mean absolutely nothing for my current purpose. I want to extract the relevant information, it is always formatted the same way, but not always in the same position within the string. I am working with the above code which returns what I'm looking for, but also other variations when "-" is found, which can be often.
The information I want to extract will always be LetterLetter-NumberNumber, ex: AA-01 AA-02 AB-01, and so on, always 5 characters.
How can I extract just this?
Then, in another row if need be, I want to remove duplicate instances (there will nearly always be duplicates).
What I get now is;
HA-03
HA-03
T - S
Y - R
HA-03
HA-03
HA-03
HA-03
Y - R
HA-06
HA-06
R - S
HA-07
HA-09
HA-09
HA-09
First, I'd like to just get;
HA-03
HA-03
HA-03
HA-03
HA-03
HA-03
HA-06
HA-06
HA-07
HA-09
HA-09
HA-09
Then convert that into;
HA-03
HA-06
HA-07
HA-09
If there is a way to skip the middle-man, I'm all ears =)
Thank You.

You could write a macro using Range.RemoveDuplicates to de-duplicate your list and then Like or regular expressions (see the code I posted in Parsing String Mixed with HTML, Words, Numbers, and Dates) in a loop over the remaining cells to check if cells meet your criteria and delete those that do not match.
De-duplicating then matching would probably be a bit more efficient, but you could do it either way around.
If you're new to VBA and want some code, add a comment and I'll post some.
EDIT:
You'll need to go to the Visual Basic editor (Alt-F11), select menu item Tools/References..., find and check "Microsoft VBScript Regular Expressions 5.5", then click OK. Then in the project explorer (Ctrl- R), right-click on the VBA Project for your workbook and Insert > Module.
Add the following code:
Public Function RegEx(strInput As String, strRegEx As String, Optional bIgnoreCase As Boolean = True, Optional bMultiLine As Boolean = False) As Boolean
Dim RegExp As VBScript_RegExp_55.RegExp
Set RegExp = New VBScript_RegExp_55.RegExp
With RegExp
.MultiLine = bMultiLine
.IgnoreCase = bIgnoreCase
.Pattern = strRegEx
End With
RegEx = RegExp.test(strInput)
Set RegExp = Nothing
End Function
(You can add the other Regular Expression code from Parsing String Mixed with HTML, Words, Numbers, and Dates if you think you might use it later)
Add the following code (Assuming that the data that you want to delete is in Column A):
Public Sub DedupeAndFilter()
Dim RCtr As Long
ActiveSheet.Range("A:A").RemoveDuplicates Columns:=1, Header:=xlNo
For RCtr = ActiveSheet.UsedRange.Rows.Count To 1 Step -1
If ActiveSheet.Range("A:A").Rows(RCtr).Text = "" Then
ActiveSheet.Range("A:A").Rows(RCtr).Delete xlShiftUp
ElseIf Not RegEx(ActiveSheet.Range("A1").Rows(RCtr), "[A-Z]{2}-\d\d", True) Then
ActiveSheet.Range("A:A").Rows(RCtr).Delete xlShiftUp
End If
Next
End Sub
then, with the cursor in the DedupeAndFilter code block, press F5 or click the green Run ">" triangle. The code will then remove duplicates, blank cells and non-conforming cells from Column A.
If you want to change the column affected, change "A:A" in ActiveSheet.Range("A:A") to any other column reference, or substitute Activesheet.Selection and select the column you want.

If you want to avoid VBA, then try this:
Column A
asdfHA-03asdfasdf
HA-03sadfsa
asdfT - S
Y - Rasdfsad
asdfHA-03adf
asdHA-04
asdfsadf
Then on use this formula:
=IF(ISERROR(FIND(" ",IFERROR(MID(A1,FIND("-",A1)-2,5),""))),IFERROR(MID(A1,FIND("-",A1)-2,5),""),"")
This should exclude the ones with spaces, then you can just copy paste as values, and Remove Duplicates

Related

How do I extract a series of numbers along with a single letter followed by another series of numbers?

The problem that I'm facing is that I have an entire column that has text separated by _ that contains pixel size that I want to be able to extract but currently can't. For example:
A
Example_Number_320x50_fifty_five
Example_Number_One_300x250_hundred
Example_Number_two_fifty_728x49
I have tried using Substitute function to grab the numbers which works but only grabs the numbers when I need something like: 320x50 instead I'm getting 0, as I'm not sure how to exactly extract something like this. If it was consistent I could easily do LEFT or RIGHT formula's to grab it but as you can see the data varies.
The result that I'm looking for is something along the lines of:
A | B
Example_Number_320x50_fifty_five | 320x50
Example_Number_One_300x250_hundred | 300x200
Example_Number_two_fifty_728x49 | 728x49
Any help would be much appreciated! If any further clarification is needed please let me know and I'll try to explain as best as I can!
-Maykid
I would probably use a Regular Expressions UDF to accomplish this.
First, open up the VBE by pressing Alt + F11.
Right-Click on VBAProject > Insert > Module
Then you can paste the following code in your module:
Option Explicit
Public Function getPixelDim(RawTextValue As String) As String
With CreateObject("VBScript.RegExp")
.Pattern = "\d+x\d+"
If .Test(RawTextValue) Then
getPixelDim = .Execute(RawTextValue)(0)
End If
End With
End Function
Back to your worksheet, you would use the following formula:
=getPixelDim(A1)
Looking at the pattern \d+x\d+, an escaped d (\d) refers to any digit, a + means one or more of \d, and the x is just a literal letter x. This is the pattern you want to capture as your function's return value.
Gosh, K Davis was just so fast! Here's an alternate method with similar concept.
Create a module and create a user defined function like so.
Public Function GetPixels(mycell As Range) As String
Dim Splitter As Variant
Dim ReturnValue As String
Splitter = Split(mycell.Text, "_")
For i = 0 To UBound(Splitter)
If IsNumeric(Mid(Splitter(i), 1, 1)) Then
ReturnValue = Splitter(i)
Exit For
End If
Next
GetPixels = ReturnValue
End Function
In your excel sheet, type in B1 the formula =GetPixels(A1) and you will get 320x50.
How do you create a user defined function?
Developer tab
Use this URL to add Developer tab if you don't have it: https://www.addintools.com/documents/excel/how-to-add-developer-tab.html
Click on the highlighted areas to get to Visual Basic for Applications (VBA) window.
Create module
Click Insert > Module and then type in the code.
Use the user defined function
Note how the user defined function is called.

How to count exact text contain in string [Excel]

I already use these below formula to count exact text contain in string but still formula wrongly counted it. For example, i would like to count "ZIKA" test code in table, the answer should be two. But the formula count ZIKA2 as ZIKA also. How to ignore ZIKA2 from count it?
TEST
HS2, CCAL, EGFR, AFB
ZIKA, AG21
PPB, ZIKA2
ZIKA, AG21
I already try these formulas:
=SUMPRODUCT(--(ISNUMBER(FIND("ZIKA",F:F))))
and also
=COUNTIF(F:F,"ZIKA")
you could count exact zika, and comma-separated vriations
=COUNTIF(F:F,"ZIKA")+COUNTIF(F:F,"ZIKA,*")+COUNTIF(F:F,"*, ZIKA")+COUNTIF(F:F,"*, ZIKA,*")
I assume your data follow this format
xxx, yyy, zzz
space after comma
You may need to split your formula into 3 parts
=COUNTIF(F:F,"ZIKA,*")+COUNTIF(F:F,"*, ZIKA")+COUNTIF(F:F,"ZIKA")
The first part will count those start with ZIKA, second part count those end with ZIKA, last we should count those only with ZIKA
Try this regex, it may need a helpercolumn. I have not tested it that much yet.
Press ALT + F11 to open VBA editor.
Click Insert -> module and copy paste the code below.
Function Regex(Cell, Search)
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
RE.Pattern = "(\b" & Search & "\b)"
RE.Global = True
RE.IgnoreCase = True
Set Matches = RE.Execute(Cell)
For Each res In Matches
Regex = Regex & "," & res
Next res
Regex = Mid(Regex, 2)
End Function
It will return "ZIKA" if it finds ZIKA in the cell you run it on.
And then you just count the ZIKAs in the helper column.
Updated with a new code that you can change the search in.
Use it with =regex(A1, "ZIKA")

Remove text appearing between two characters - multiple instances - Excel

In Microsoft Excel file, I have a text in rows that appears like this:
1. Rc8 {[%emt 0:00:05]} Rxc8 {[%emt 0:00:01]} 2. Rxc8 {[%emt 0:00:01]} Qxc8 {} 3. Qe7# 1-0
I need to remove any text appearing within the flower brackets { and }, including the brackets themselves.
In the above example, there are three instances of such flower brackets. But some rows might have more than that.
I tried =MID(LEFT(A2,FIND("}",A2)-1),FIND("{",A2)+1,LEN(A2))
This outputs to: {[%emt 0:00:05]}. As you see this is the very first instance of text between those flower brackets.
And if we use this to within SUBSTITUTE like this: =SUBSTITUTE(A2,MID(LEFT(A2,FIND("}",A2)),FIND("{",A2),LEN(A2)),"")
I get an output like this:
1. Rc8 Rxc8 {[%emt 0:00:01]} 2. Rxc8 {[%emt 0:00:01]} Qxc8 {} 3. Qe7# 1-0
If you have noticed, only one instance is removed. How do I make it work for all instances? thanks.
Highlight everything
Go to replace
enter {*} in text to replace
leave replace with blank
This should replace all flower brackets and anything in between them
It is not that easy without VBA, but there is still a way.
Either (as suggested by yu_ominae) just use a formula like this and auto-fill it:
=IFERROR(SUBSTITUTE(A2,MID(LEFT(A2,FIND("}",A2)),FIND("{",A2),LEN(A2)),""),A2)
Another way would be iterative calculations (go to options -> formulas -> check the "enable iterative calculations" button)
To do it now in one cell, you need 1 helper-cell (for my example we will use C1) and the use a formula like this in B2 and auto-fill down:
=IF($C$1,A2,IFERROR(SUBSTITUTE(B2,MID(LEFT(B2,FIND("}",B2)),FIND("{",B2),LEN(B2)),""),B2))
Put "1" in C1 and all formulas in B:B will show the values of A:A. Now go to C1 and hit the del-key several times (you will see the "{}"-parts disappearing) till all looks like you want it.
EDIT: To do it via VBA but without regex you can simply put this into a module:
Public Function DELBRC(ByVal str As String) As String
While InStr(str, "{") > 0 And InStr(str, "}") > InStr(str, "{")
str = Left(str, InStr(str, "{") - 1) & Mid(str, InStr(str, "}") + 1)
Wend
DELBRC = Trim(str)
End Function
and then in the worksheet directly use:
=DELBRC(A2)
If you still have any questions, just ask ;)
Try a user defined function. In VBA create a reference to "Microsoft VBScript Regular Expressions 5.5. Then add this code in a module.
Function RemoveTags(ByVal Value As String) As String
Dim rx As New RegExp
rx.Global = True
rx.Pattern = " ?{.*?}"
RemoveTags = Trim(rx.Replace(Value, ""))
End Function
On the worksheet in the cell enter: =RemoveTags(A1) or whatever the address is where you want to remove text.
If you want to test it in VBA:
Sub test()
Dim a As String
a = "Rc8 {[%emt 0:00:05]} Rxc8 {[%emt 0:00:01]}"
Debug.Print RemoveTags(a)
End Sub
Outputs "Rc8 Rxc8"

Please assist with writing a Macro to clean up data entries

Quick question that should be easy enough for a lot of you but I'm not well versed in VBA or code in general for that matter but I have a problem that only a Macro or piece of VBA code can resolve for me. I have to edit a large number of data entries in a spreadsheet, cell by cell.
So on to the question. Could you please show me an example or provide a complete macro for me to use to edit these cells?
The editing that I require is as follows:
I need to read each cell in the following range: B2 to Q383. A typical entry that needs to be examined and edited looks like this: 629.64\3.00\01:30
What needs to happen now is for everything to the left of the first "\" and everything to the right of the second "\" needs to be removed, including the "\", from each cell.
I've tried fiddling around with the LEFT and RIGHT commands and I can output the data that needs to be removed from the cells with something like this
=left(B11, Find("\", B11) - 1)
So what would be the delete command in a macro to target that data selection in that cell? Or how do I use a delete command with those parameters?
Thanks in advance for any advice or answers!
Not 100% sure what you're after - you say you want to remove the bits from the left and right, but the formula returns the bit on the left.
Anyway, here's 3 formula to do it:
Left: =TRIM(LEFT(B11,FIND("\",B11)-1))
Middle: =LEFT(MID(B11,FIND("\",B11)+1,LEN(B11)),FIND("\",MID(B11,FIND("\",B11)+1,LEN(B11)))-1)
Right: =MID(B11,FIND("\",SUBSTITUTE(B11,"\","~",1))+1,LEN(B11))
And three VBA functions to do it
(use these as you would formula - =leftbit(B11), or if you're looking for something other than backslashes - =leftbit(B11,"|") will find the I-bar as a divider. )
Public Function LeftBit(target As Range, Optional Divider As String = "\") As String
LeftBit = Trim(Left(target, InStr(target, Divider) - 1))
End Function
Public Function MiddleBit(target As Range, Optional Divider As String = "\") As String
Dim First As Long, Second As Long
First = InStr(target, Divider)
Second = InStr(First + 1, target, Divider)
MiddleBit = Mid(target, First + 1, Second - First - 1)
End Function
Public Function RightBit(target As Range, Optional Divider As String = "\") As String
RightBit = Right(target, Len(target) - InStrRev(target, Divider))
End Function

Extract tables from pdf (to excel), pref. w/ vba

I am trying to extract tables from pdf files with vba and export them to excel. If everything works out the way it should, it should go all automatic. The problem is that the table are not standardized.
This is what I have so far.
VBA (Excel) runs XPDF, and converts all .pdf files found in current folder to a text file.
VBA (Excel) reads through each text file line by line.
And the code:
With New Scripting.FileSystemObject
With .OpenTextFile(strFileName, 1, False, 0)
If Not .AtEndOfStream Then .SkipLine
Do Until .AtEndOfStream
//do something
Loop
End With
End With
This all works great. But now I am getting to the issue of extracting the tables from the text files.
What I am trying to do is VBA to find a string e.g. "Year's Income", and then output the data, after it, into columns. (Until the table ends.)
The first part is not very difficult (find a certain string), but how would I go about the second part. The text file will look like this Pastebin. The problem is that the text is not standardized. Thus for example some tables have 3-year columns (2010 2011 2012) and some only two (or 1), some tables have more spaces between the columnn, and some do not include certain rows (such as Capital Asset, net).
I was thinking about doing something like this but not sure how to go about it in VBA.
Find user defined string. eg. "Table 1: Years' Return."
a. Next line find years; if there are two we will need three columns in output (titles +, 2x year), if there are three we will need four (titles +, 3x year).. etc
b. Create title column + column for each year.
When reaching end of line, go to next line
a. Read text -> output to column 1.
b. Recognize spaces (Are spaces > 3?) as start of column 2. Read numbers -> output to column 2.
c. (if column = 3) Recognize spaces as start of column 3. Read numbers -> output to column 3.
d. (if column = 4) Recognize spaces as start of column 4. Read numbers -> output to column 4.
Each line, loop 4.
Next line does not include any numbers - End table. (probably the easiet just a user defined number, after 15 characters no number? end table)
I based my first version on Pdf to excel, but reading online people do not recommend OpenFile but rather FileSystemObject (even though it seems to be a lot slower).
Any pointers to get me started, mainly on step 2?
You have a number of ways to dissect a text file and depending on how complex it is might cause you to lean one way or another. I started this and it got a bit out of hand... enjoy.
Based on the sample you've provided and the additional comments, I noted the following. Some of these may work well for simple files but can get unwieldy with bigger more complex files. Furthermore, there may be slightly more efficient methods or tricks to what I have used here but this will definitely get you going an achieve the desired outcome. Hopefully this makes sense in conjunction with the code provided:
You can use booleans to help you determine what 'section' of the text file you are in. Ie use InStr on the current line to
determine you are in a Table by looking for the text 'Table' and then
once you know you are in the 'Table' section of the file start
looking for the 'Assets' section etc
You can use a few methods to determine the number of years (or columns) you have. The Split function along with a loop will do
the job.
If your files always have constant formatting, even only in certain parts, you can take advantage of this. For example, if you know your
file line will always have a dollar sign in front of the them, then
you know this will define the column widths and you can use this on
subsequent lines of text.
The following code will extract the Assets details from the text file, you can mod it to extract other sections. It should handle multiple rows. Hopefully I've commented it sufficient. Have a look and I'll edit if needs to help out further.
Sub ReadInTextFile()
Dim fs As Scripting.FileSystemObject, fsFile As Scripting.TextStream
Dim sFileName As String, sLine As String, vYears As Variant
Dim iNoColumns As Integer, ii As Integer, iCount As Integer
Dim bIsTable As Boolean, bIsAssets As Boolean, bIsLiabilities As Boolean, bIsNetAssets As Boolean
Set fs = CreateObject("Scripting.FileSystemObject")
sFileName = "G:\Sample.txt"
Set fsFile = fs.OpenTextFile(sFileName, 1, False)
'Loop through the file as you've already done
Do While fsFile.AtEndOfStream <> True
'Determine flag positions in text file
sLine = fsFile.Readline
Debug.Print VBA.Len(sLine)
'Always skip empty lines (including single spaceS)
If VBA.Len(sLine) > 1 Then
'We've found a new table so we can reset the booleans
If VBA.InStr(1, sLine, "Table") > 0 Then
bIsTable = True
bIsAssets = False
bIsNetAssets = False
bIsLiabilities = False
iNoColumns = 0
End If
'Perhaps you want to also have some sort of way to designate that a table has finished. Like so
If VBA.Instr(1, sLine, "Some text that designates the end of the table") Then
bIsTable = False
End If
'If we're in the table section then we want to read in the data
If bIsTable Then
'Check for your different sections. You could make this constant if your text file allowed it.
If VBA.InStr(1, sLine, "Assets") > 0 And VBA.InStr(1, sLine, "Net") = 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = False
If VBA.InStr(1, sLine, "Liabilities") > 0 Then bIsAssets = False: bIsLiabilities = True: bIsNetAssets = False
If VBA.InStr(1, sLine, "Net Assests") > 0 Then bIsAssets = True: bIsLiabilities = False: bIsNetAssets = True
'If we haven't triggered any of these booleans then we're at the column headings
If Not bIsAssets And Not bIsLiabilities And Not bIsNetAssets And VBA.InStr(1, sLine, "Table") = 0 Then
'Trim the current line to remove leading and trailing spaces then use the split function to determine the number of years
vYears = VBA.Split(VBA.Trim$(sLine), " ")
For ii = LBound(vYears) To UBound(vYears)
If VBA.Len(vYears(ii)) > 0 Then iNoColumns = iNoColumns + 1
Next ii
'Now we can redefine some variables to hold the information (you'll want to redim after you've collected the info)
ReDim sAssets(1 To iNoColumns + 1, 1 To 100) As String
ReDim iColumns(1 To iNoColumns) As Integer
Else
If bIsAssets Then
'Skip the heading line
If Not VBA.Trim$(sLine) = "Assets" Then
'Increment the counter
iCount = iCount + 1
'If iCount reaches it's limit you'll have to redim preseve you sAssets array (I'll leave this to you)
If iCount > 99 Then
'You'll find other posts on stackoverflow to do this
End If
'This will happen on the first row, it'll happen everytime you
'hit a $ sign but you could code to only do so the first time
If VBA.InStr(1, sLine, "$") > 0 Then
iColumns(1) = VBA.InStr(1, sLine, "$")
For ii = 2 To iNoColumns
'We need to start at the next character across
iColumns(ii) = VBA.InStr(iColumns(ii - 1) + 1, sLine, "$")
Next ii
End If
'The first part (the name) is simply up to the $ sign (trimmed of spaces)
sAssets(1, iCount) = VBA.Trim$(VBA.Mid$(sLine, 1, iColumns(1) - 1))
For ii = 2 To iNoColumns
'Then we can loop around for the rest
sAssets(ii, iCount) = VBA.Trim$(VBA.Mid$(sLine, iColumns(ii) + 1, iColumns(ii) - iColumns(ii - 1)))
Next ii
'Now do the last column
If VBA.Len(sLine) > iColumns(iNoColumns) Then
sAssets(iNoColumns + 1, iCount) = VBA.Trim$(VBA.Right$(sLine, VBA.Len(sLine) - iColumns(iNoColumns)))
End If
Else
'Reset the counter
iCount = 0
End If
End If
End If
End If
End If
Loop
'Clean up
fsFile.Close
Set fsFile = Nothing
Set fs = Nothing
End Sub
I cannot examine the sample data as the PasteBin has been removed. Based on what I can glean from the problem description, it seems to me that using Regular Expressions would make parsing the data much easier.
Add a reference to the Scripting Runtime scrrun.dll for the FileSystemObject.
Add a reference to the Microsoft VBScript Regular Expressions 5.5. library for the RegExp object.
Instantiate a RegEx object with
Dim objRE As New RegExp
Set the Pattern property to "(\bd{4}\b){1,3}"
The above pattern should match on lines containing strings like:
2010
2010 2011
2010 2011 2012
The number of spaces between the year strings is irrelevant, as long as there is at least one (since we're not expecting to encounter strings like 201020112012 for example)
Set the Global property to True
The captured groups will be found in the individual Match objects from the MatchCollection returned by the Execute method of the RegEx object objRE. So declare the appropriate objects:
Dim objMatches as MatchCollection
Dim objMatch as Match
Dim intMatchCount 'tells you how many year strings were found, if any
Assuming you've set up a FileSystemObject object and are scanning the text file, reading each line into a variable strLine
First test to see if the current line contains the pattern sought:
If objRE.Test(strLine) Then
'do something
Else
'skip over this line
End If
Set objMatches = objRe.Execute(strLine)
intMatchCount = objMatches.Count
For i = 0 To intMatchCount - 1
'processing code such as writing the years as column headings in Excel
Set objMatch = objMatches(i)
e.g. ActiveCell.Value = objMatch.Value
'subsequent lines beneath the line containing the year strings should
'have the amounts, which may be captured in a similar fashion using an
'additional RegExp object and a Pattern such as "(\b\d+\b){1,3}" for
'whole numbers or "(\b\d+\.\d+\b){1,3}" for floats. For currency, you
'can use "(\b\$\d+\.\d{2}\b){1,3}"
Next i
This is just a rough outline of how I would approach this challenge. I hope there is something in this code outline that will be of help to you.
Another way to do this I have some success with is to use VBA to convert to a .doc or .docx file and then search for and pull tables from the Word file. They can be easily extracted into Excel sheets. The conversion seems to handle tables nicely. Note however that it works on a page by page basis so tables extending over a page end up as separate tables in the word doc.

Resources