Stata flag when word found, not strpos - string

I have some data with strings, and I want to flag when a word is found. A word is defined as text at the start of the string, at the end, or separated by spaces. strpos will match whenever the substring is present, but I am looking for something similar to subinword. Does Stata have a way to use the functionality of subinword without having to replace anything, and instead flag the word?
clear
input id str50 strings
1 "the thin th man"
2 "this old then"
3 "th to moon"
4 "moon blank th"
end
gen th_pos = 0
replace th_pos = 1 if strpos(strings, "th") > 0
The code above flags every observation, since they all contain "th", but my desired output is:
ID strings th_sub
1 "the thin th man" 1
2 "this old then" 0
3 "th to moon" 1
4 "moon blank th" 1

A small trick is that "th" as a word will be preceded and followed by a space, except if it occurs at the beginning or the end of string. The exceptions are no challenge really, as
gen wanted = strpos(" " + strings + " ", " th ") > 0
works around them. Otherwise, there is a rich set of regular expression functions to play with.
Incidentally, the code that doesn't do what you want condenses to one line:
gen th_pos = strpos(strings, "th") > 0
A more direct answer is that you don't have to replace anything. You just have to get Stata to tell you what would happen if you did:
gen WANTED = strings != subinword(strings, "th", "", .)
If removing the word changes the string, the word must have been present.

Regular expressions can be useful for this type of exercise, with word boundaries allowing you to search for whole words indicated by \b, as in "\bword\b".
gen wanted = ustrregexm(strings, "\bth\b")
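For readers who want to check the logic outside Stata, here is a minimal Python sketch of the same two ideas (padding with spaces, and a regex word boundary) applied to the example data. The function names are illustrative only and do not correspond to any Stata function.

```python
import re

strings = ["the thin th man", "this old then", "th to moon", "moon blank th"]

def flag_padded(s, word):
    # Pad both ends so a word at the start or end is also surrounded by spaces.
    return int(f" {word} " in f" {s} ")

def flag_regex(s, word):
    # \b matches at a word boundary; re.escape guards against regex metacharacters.
    return int(re.search(rf"\b{re.escape(word)}\b", s) is not None)

flags_padded = [flag_padded(s, "th") for s in strings]
flags_regex = [flag_regex(s, "th") for s in strings]
print(flags_padded, flags_regex)
```

The two approaches agree on this data, but they can differ when punctuation is involved: a word boundary also matches "th," while the space-padding test does not.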

Related

Extract Uppercase Words on Excel Function

I have supplier name together with product name in one cell as a string.
Each cell has a word that's all uppercase (sometimes with a digit or a number).
Data | I need to extract
3LAB Anti - Aging Oil 30ml | 3LAB
3LAB Aqua BB SPF40 #1 14g | 3LAB
3LAB SAMPLE Perfect Neck Cream 6ml | 3LAB
3LAB SAMPLE Super h" Serum Super Age-Defying Serum 3ml" | 3LAB
3LAB TTTTT Perfect Mask Lifting Firming Brightening 28ml | 3LAB
3LAB The Cream 50ml | 3LAB
3LAB The Serum 40ml | 3LAB
4711 Acqua Colonia Intense Floral Fields Of Ireland EDC spray 170ml | EDC
4711 Acqua Colonia Intense Pure Brezze Of Himalaya EDC spray 50m" | EDC
I need to extract only that UPPERCASE supplier name to a new cell.
I've tried to create a User Defined Function like this one, but it's not working.
It's returning #NAME? error.
Public Function UpperCaseWords(S As String) As String
    Dim X As Long
    Dim TempText As String
    TempText = " " & S & " "
    For X = 2 To Len(TempText) - 1
        If Mid(TempText, X, 1) Like "[!A-Z ]" Or Mid(TempText, X - 1, 3) Like "[!A-Z][A-Z][!A-Z]" Then
            Mid(TempText, X) = " "
        End If
    Next
    UpperCaseWords = Application.Trim(TempText)
End Function
Any idea how to correct it and make it work?
I've found it here: https://www.mrexcel.com/board/threads/formula-to-extract-upper-case-words-in-a-text-string.684934/page-2#posts
And why in this macro, in line For X = 2 To Len(TempText) - 1 the X is set to 2?
Instead of a custom-made UDF, try to use what Excel offers through built-in functionality, for example FILTERXML():
Formula used in B1:
=FILTERXML("<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>","//s[.*0!=0][translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789','')='']")
If you are using Microsoft 365, add an @ before the function, e.g. =@FILTERXML(...), or add [1] as a third predicate in the XPath expression to tell the function to return only the first node that satisfies the previous two rules.
Let's have an analysis on the formula:
FILTERXML() - We can utilize this function to "split" a string on a particular delimiter, a space in this case.
"<t><s>"&SUBSTITUTE(A1," ","</s><s>")&"</s></t>" - Here is the part where we create a parent-child structure of start/end-tags; a valid XML-string.
//s - In the 2nd parameter of FILTERXML() we start a valid XPath expression. We want all the s-nodes (children of the t-node) that comply with the following two rules:
[.*0!=0] - Select all nodes that, when multiplied by zero, do not equal zero. Non-numeric text evaluates to NaN, which is never equal to zero, so this excludes purely numeric substrings.
[translate(.,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789','')=''] - When we translate() all the characters mentioned in the 2nd parameter of this function to nothing, the result should also be nothing, meaning the node is only made out of uppercase alpha chars and numbers.
A more in-depth explanation of how FILTERXML() works when you want to extract a particular substring can be found here. Happy coding!
How about this: split the text into words, ignore each word that is numeric, and return the first word that is the same as its upper-case version:
Public Function UpperCaseWords(S As String) As String
    Dim i As Long
    Dim word As String
    Dim words() As String
    words = Split(S, " ")
    For i = 0 To UBound(words)
        word = words(i)
        If Not IsNumeric(word) And word = UCase$(word) Then
            UpperCaseWords = word
            Exit Function
        End If
    Next
End Function
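To sanity-check that split-and-compare logic against the sample data, here is a rough Python sketch of the same idea. The function name is hypothetical, and str.isdigit() is only a simplified stand-in for VBA's IsNumeric, which accepts more formats.

```python
def first_uppercase_word(text):
    """Return the first space-separated word that is upper case and not purely digits."""
    for word in text.split(" "):
        # str.isdigit() is an assumption standing in for VBA's IsNumeric.
        if word and not word.isdigit() and word == word.upper():
            return word
    return ""

samples = [
    "3LAB Anti - Aging Oil 30ml",
    "3LAB Aqua BB SPF40 #1 14g",
    "4711 Acqua Colonia Intense Floral Fields Of Ireland EDC spray 170ml",
]
results = [first_uppercase_word(s) for s in samples]
print(results)
```

One thing this makes visible: a token with no letters at all, such as "-" or "#1", is also equal to its upper-case version, so it would be returned if it came before the supplier name.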

String Operations Confusion? ELI5

I'm extremely new to python and I have no idea why this code gives me this output. I tried searching around for an answer but couldn't find anything because I'm not sure what to search for.
An explain-like-I'm-5 explanation would be greatly appreciated
astring = "hello world"
print(astring[3:7:2])
This gives me : "l"
Also
astring = "hello world"
print(astring[3:7:3])
gives me : "lw"
I can't wrap my head around why.
This is string slicing in python.
Slicing is similar to regular string indexing, but it can return just a section of a string.
Using two parameters in a slice, such as [a:b] will return a string of characters, starting at index a up to, but not including, index b.
For example:
"abcdefg"[2:6] would return "cdef"
Using three parameters works similarly, but the third parameter is a step: the slice returns only every step-th character. For example, [2:6:2] will return every second character beginning at index 2, up to index 5.
i.e. "abcdefg"[2:6:2] will return "ce", as it takes only every second character.
In your case, astring[3:7:3], the slice begins at index 3 (the second l) and moves forward the specified 3 characters (the third parameter) to w at index 6. The next step would land on index 9, which is past the stop index of 7, so the slice ends there, returning lw.
In fact when using only two parameters, the third defaults to 1, so astring[2:5] is the same as astring[2:5:1].
Python Central has some more detailed explanations of cutting and slicing strings in python.
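To see exactly which positions a slice picks, a small sketch like the following can help. The helper name slice_indices is made up for illustration; slice.indices() and range() are standard Python.

```python
astring = "hello world"

def slice_indices(s, start, stop, step):
    # slice.indices() resolves the slice against the string length and
    # returns (start, stop, step), which range() turns into index positions.
    return list(range(*slice(start, stop, step).indices(len(s))))

for step in (2, 3):
    idx = slice_indices(astring, 3, 7, step)
    # repr() makes whitespace characters in the result visible.
    print(step, idx, repr(astring[3:7:step]))
```

With a step of 2 the selected indices are 3 and 5, and index 5 is the space, so the result is "l " with a trailing space, which looks like a lone "l" when printed.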
I have a feeling you are overcomplicating this slightly.
Since the string astring is set statically, you could more easily do the following:
# Store each character of the string in its own variable
# (hyphens are not valid in Python identifiers, so the names use underscores)
letter_1 = "h"
letter_2 = "e"
letter_3 = "l"
letter_4 = "l"
letter_5 = "o"
letter_6 = " "
letter_7 = "w"
letter_8 = "o"
letter_9 = "r"
letter_10 = "l"
letter_11 = "d"
# Print the characters you want to see on screen
print(letter_3 + letter_6 + letter_3)
This way it's much more easily readable to human users and it should mitigate your error.

Python 3.5: Is it possible to align punctuation (e.g. £, $) to the left side of a word using regex?

As part of my code, I need to align things like the pound sign to the left of a string. For example my code starts with:
"A price of £ 8 is roughly the same as $ 10.23!"
and needs to end with:
"A price of £8 is roughly the same as $10.23!"
I've created the following function to solve this however I feel that it is very inefficient and was wondering if there was a way to do this with regular expressions in Python?
for i in sentence:
    if i == "(" or i == "{" or i == "[" or i == "£" or i == "$":
        if i != len(sentence):
            corrected_sentence.append(" ")
        corrected_sentence.append(i)
    else:
        corrected_sentence.append(i)
What this does right now is go through the 'sentence' list, in which I have split up all of the words and punctuation, and rebuild it in another list, adding a space after each item EXCEPT where the listed characters are used; that list is then joined into a single string again.
I only want to do this with the characters I have listed above (so I need to ignore things like full stops or exclamation marks etc).
Thanks!
I'm not sure what you want to do with the brackets, but from the description you can use a regex to find and replace whitespace preceded by the characters (lookbehind) and followed by a digit (lookahead).
>>> import re
>>> print(re.sub(r"(?<=[\{\[£\$])\s+(?=\d)", "", "A price of £ 8 is roughly the same as $ 10.23!"))
A price of £8 is roughly the same as $10.23!
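If the opening brackets from the question should be handled the same way, one possible variation is sketched below. It assumes every listed symbol should attach to whatever token follows it, not only to digits; that assumption and the names OPENERS and attach_left are illustrative, not taken from the question.

```python
import re

# Symbols from the question that should attach to the token that follows them.
OPENERS = "({[£$"

def attach_left(text):
    # Lookbehind for one of the symbols, then remove the whitespace after it.
    pattern = r"(?<=[" + re.escape(OPENERS) + r"])\s+"
    return re.sub(pattern, "", text)

fixed = attach_left("A price of £ 8 is roughly the same as $ 10.23!")
print(fixed)
```

Because only whitespace that directly follows one of the listed symbols is removed, full stops, exclamation marks, and closing brackets are left untouched.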

Lua frontier pattern match (whole word search)

Can someone help me with this, please?
s_test = "this is a test string this is a test string "
String = String or {}  -- table that holds the helper function
function String.Wholefind(Search_string, Word)
  -- gsub returns the new string and the number of substitutions made
  local _, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]', "")
  return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds the whole word no problem and gsub counts the whole words no problem but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work only if the numbers aren't at the start or end of the search string.
Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
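If you need to search for fragments that include letters and numbers, then your pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'. Note that a frontier treats the beginning and end of the subject as the character '\0', which is not whitespace, so this pattern only matches a word that has whitespace on both sides; pad the subject with a space at each end if the word can occur at the very start or end.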
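For comparison, the same "delimited by whitespace" idea can be sketched in Python with lookarounds. This is an analogue rather than a translation of the Lua pattern: the function name count_whole is hypothetical, and unlike a Lua frontier these lookarounds also accept the start and end of the string as delimiters.

```python
import re

def count_whole(search_string, word):
    # (?<!\S) and (?!\S) require whitespace or a string edge on each side.
    # re.escape keeps characters in the word from being read as regex syntax.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return len(re.findall(pattern, search_string))

a_test = count_whole("this is a test string this is a test string ", "string")
b_test = count_whole(" 123test 123test 123", "123test")
c_test = count_whole(" 123test 123test 123", "123")
print(a_test, b_test, c_test)
```

Searching for "123" counts only the final token, because the other two occurrences are followed by a letter rather than by whitespace.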

How to parse a string (by a "new" markup) with R?

I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.
For example, let's say we have the following two variables:
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
Say that I want to parse "Seq" According to "Str", by using the legend here
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |          Stem 1           Stem 2                Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                      Stem 0
Assume that we always have 4 stems (0 to 3), but that the number of letters before and after each of them can vary.
The output should be something like the following list structure:
list(
  "Stem 0 opening" = "GCCTCGA",
  "before Stem 1" = "TA",
  "Stem 1" = list(opening = "GCTC",
                  inside = "AGTTGGGA",
                  closing = "GAGC"
  ),
  "between Stem 1 and 2" = "G",
  "Stem 2" = list(opening = "TACGA",
                  inside = "CTGAAGA",
                  closing = "TCGTA"
  ),
  "between Stem 2 and 3" = "AGGtC",
  "Stem 3" = list(opening = "ACCAG",
                  inside = "TTCGATC",
                  closing = "CTGGT"
  ),
  "After Stem 3" = "",
  "Stem 0 closing" = "TCGGGGC"
)
I don't have any experience with programming a parser, and would like advice on what strategy to use when programming something like this (and any recommended R commands to use).
What I was thinking of is to first get rid of the "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into:
1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem
Where the "after stem" will then be recursively entered into the same function ("seperate.stem")
The thing is that I am not sure how to try and do this coding without using a loop.
Any advice will be most welcome.
Update: someone sent me a bunch of questions; here they are.
Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence?
A: Yes
Q: Does the parsing always start with a partial stem 0 as your example shows?
A: No. Sometimes it will start with a few "."
Q: Is there a way of making sure you have the right sequences when you start?
A: I am not sure I understand what you mean.
Q: Is there a chance of error in the middle of the string that you have to restart from?
A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...
Q: How long are these strings that you want to parse?
A: Each string has between 60 to 150 characters (and I have tens of thousands of them...)
Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters?
A: each sequence is self contained.
Q: Is there always at least one '.' between stems?
A: No.
Q: A full set of rules as to how the parsing should be done would be useful.
A: I agree. But since I don't have even a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help.
Q: Do you have the BNF syntax for parsing?
A: No. Your e-mail is the first time I came across it (http://en.wikipedia.org/wiki/Backus–Naur_Form).
You can simplify the task by using run length encoding.
First, convert Str to be a vector of individual characters, then call rle.
split_Str <- strsplit(Str, "")[[1]]
rle_Str <- rle(split_Str)
rle_Str
Run Length Encoding
lengths: int [1:14] 7 2 4 8 4 1 5 7 5 5 ...
values : chr [1:14] ">" "." ">" "." "<" "." ">" "." "<" "." ">" "." "<" "."
Now you just need to parse rle_Str$values, which is perhaps simpler. For instance, an inner stem will always look like ">" "." "<".
I think the main thing that you need to think about is the structure of the data. Does a "." always have to come between ">" and "<", or is it optional? Can you have a "." at the start? Do you need to be able to generalise to stems within stems within stems, or even more complex structures?
Once you have this solved, constructing your list output should be straightforward.
Also, don't worry about using loops, they are in the language because they are useful. Get the thing working first, then worry about speed optimisations (if you really have to) afterwards.
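As a language-neutral illustration of the run-length idea, here is a short Python sketch that pairs each run of structure symbols with the matching slice of the sequence. The function name rle_segments is made up for this example, and it does no stem matching; it only produces the runs that a parser would then walk over.

```python
from itertools import groupby

Seq = "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str = ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

def rle_segments(seq, structure):
    """Pair each run of identical structure symbols with the matching slice of seq."""
    segments = []
    pos = 0
    for symbol, run in groupby(structure):
        length = len(list(run))
        segments.append((symbol, length, seq[pos:pos + length]))
        pos += length
    return segments

segments = rle_segments(Seq, Str)
for symbol, length, chunk in segments:
    print(symbol, length, chunk)
```

Note that the last run of "<" covers the closing of Stem 3 and the closing of Stem 0 together, so a parser still has to split it using the lengths of the corresponding opening runs.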
