Parsing formatted strings in Go

The Problem
I have a slice of string values in which each value is formatted according to a template. In my particular case, I am trying to parse Markdown links as shown below:
- [What did I just commit?](#what-did-i-just-commit)
- [I wrote the wrong thing in a commit message](#i-wrote-the-wrong-thing-in-a-commit-message)
- [I committed with the wrong name and email configured](#i-committed-with-the-wrong-name-and-email-configured)
- [I want to remove a file from the previous commit](#i-want-to-remove-a-file-from-the-previous-commit)
- [I want to delete or remove my last commit](#i-want-to-delete-or-remove-my-last-commit)
- [Delete/remove arbitrary commit](#deleteremove-arbitrary-commit)
- [I tried to push my amended commit to a remote, but I got an error message](#i-tried-to-push-my-amended-commit-to-a-remote-but-i-got-an-error-message)
- [I accidentally did a hard reset, and I want my changes back](#i-accidentally-did-a-hard-reset-and-i-want-my-changes-back)
What do I want to do?
I am looking for ways to parse this into a value of type:
type Entity struct {
	Statement string
	URL       string
}
What have I tried?
As you can see, all the items follow the pattern: - [{{ .Statement }}]({{ .URL }}). I tried using the fmt.Sscanf function to scan each string as:
var statement, url string
fmt.Sscanf(s, "[%s](%s)", &statement, &url)
This results in:
statement = "I"
url = ""
The issue is with the scanner storing space-separated values only. I do not understand why the URL field is not getting populated based on this rule.
How can I get the Markdown values as mentioned above?
EDIT: As suggested by Marc, I will add a couple of clarifications:
This is a general-purpose question on parsing strings based on a format. In my particular case a Markdown parser might help, but my intention is to learn how to handle such cases in general, where a library might not exist.
I have read the official documentation before posting here.
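A note on why the fmt.Sscanf attempt stops early: the %s verb reads a run of non-space characters, so the first %s takes only "I". The next character of the format is a literal ], but the next character of the input is a space, so scanning stops at that mismatch and the second %s is never reached. The snippet in the question discards the return values of Sscanf, which report this. The sketch below keeps them; scanLink is a hypothetical helper name, not part of the original code.

```go
package main

import "fmt"

// scanLink (hypothetical helper) repeats the Sscanf call from the question
// but also returns the item count and the error that Sscanf reports.
func scanLink(s string) (statement, url string, n int, err error) {
	n, err = fmt.Sscanf(s, "[%s](%s)", &statement, &url)
	return
}

func main() {
	inputs := []string{
		// Contains spaces: the first %s stops at the first space.
		"[I wrote the wrong thing in a commit message](#i-wrote-the-wrong-thing-in-a-commit-message)",
		// Contains no spaces: the first %s swallows the whole input,
		// including "](" and ")", so nothing is left for the rest of the format.
		"[text](#url)",
	}
	for _, in := range inputs {
		statement, url, n, err := scanLink(in)
		fmt.Printf("statement=%q url=%q n=%d err=%v\n", statement, url, n, err)
	}
}
```

In both cases only one item is scanned and err is non-nil, which is why Sscanf with %s is a poor fit for values that are delimited by punctuation rather than whitespace.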

Note: The following solution only works for "simple", non-escaped input markdown links. If this suits your needs, go ahead and use it. For full markdown-compatibility you should use a proper markdown parser such as gopkg.in/russross/blackfriday.v2.
You could use regexp to get the link text and the URL out of a markdown link.
So the general input text is in the form of:
[some text](somelink)
A regular expression that models this:
\[([^\]]+)\]\(([^)]+)\)
Where:
\[ is the literal [
([^\]]+) captures the "some text": one or more characters other than the closing square bracket
\] is the literal ]
\( is the literal (
([^)]+) captures the "somelink": one or more characters other than the closing parenthesis
\) is the literal )
Example:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Group 1 captures the link text, group 2 the URL.
	r := regexp.MustCompile(`\[([^\]]+)\]\(([^)]+)\)`)
	inputs := []string{
		"[Some text](#some/link)",
		"[What did I just commit?](#what-did-i-just-commit)",
		"invalid",
	}
	for _, input := range inputs {
		fmt.Println("Parsing:", input)
		allSubmatches := r.FindAllStringSubmatch(input, -1)
		if len(allSubmatches) == 0 {
			fmt.Println(" No match!")
		} else {
			// Use the first match: index 0 is the whole match, 1 and 2 are the groups.
			parts := allSubmatches[0]
			fmt.Println(" Text:", parts[1])
			fmt.Println(" URL: ", parts[2])
		}
	}
}
Output (try it on the Go Playground):
Parsing: [Some text](#some/link)
 Text: Some text
 URL:  #some/link
Parsing: [What did I just commit?](#what-did-i-just-commit)
 Text: What did I just commit?
 URL:  #what-did-i-just-commit
Parsing: invalid
 No match!
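To get from the submatches to the Entity type in the question, a minimal sketch might look like the following. The optional "- " list marker, the ^ and $ anchors, and the names linkRe and parseEntity are assumptions added for illustration. The same limitation applies: escaped or nested brackets are not handled.

```go
package main

import (
	"fmt"
	"regexp"
)

// Entity is the target type from the question.
type Entity struct {
	Statement string
	URL       string
}

// linkRe matches one whole line: an optional "- " list marker followed by
// [text](url). Group 1 is the text, group 2 is the URL.
var linkRe = regexp.MustCompile(`^\s*(?:-\s+)?\[([^\]]+)\]\(([^)]+)\)\s*$`)

// parseEntity (hypothetical helper) returns the parsed Entity and true,
// or a zero Entity and false if the line does not match.
func parseEntity(line string) (Entity, bool) {
	m := linkRe.FindStringSubmatch(line)
	if m == nil {
		return Entity{}, false
	}
	return Entity{Statement: m[1], URL: m[2]}, true
}

func main() {
	lines := []string{
		"- [What did I just commit?](#what-did-i-just-commit)",
		"- [Delete/remove arbitrary commit](#deleteremove-arbitrary-commit)",
		"invalid",
	}
	var entities []Entity
	for _, line := range lines {
		if e, ok := parseEntity(line); ok {
			entities = append(entities, e)
		}
	}
	for _, e := range entities {
		fmt.Printf("%+v\n", e)
	}
}
```

FindStringSubmatch is enough here because each line holds at most one link; FindAllStringSubmatch is only needed when several links can appear in one string.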

You could write a simple lexer in pure Go for this use case. Rob Pike's talk "Lexical Scanning in Go" (2011) goes into the design of the text/template lexer, and the same approach is applicable here. That implementation chains a series of state functions into an overall state machine and delivers the tokens through a channel (from a goroutine) for later processing.
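A much-reduced sketch of the state-function idea, without the goroutine and channel, could look like this. The lexer type, the helper names, and the fixed "- [" prefix are assumptions made for illustration; this is not the text/template implementation, only the same pattern applied to the input format from the question.

```go
package main

import (
	"fmt"
	"strings"
)

// Entity is the target type from the question.
type Entity struct {
	Statement string
	URL       string
}

// lexer holds the input, the current offset, and the result so far.
type lexer struct {
	input string
	pos   int
	out   Entity
	err   error
}

// stateFn is one state of the machine; it returns the next state, or nil to stop.
type stateFn func(*lexer) stateFn

// expect consumes the literal lit at the current offset or records an error.
func (l *lexer) expect(lit string) bool {
	if strings.HasPrefix(l.input[l.pos:], lit) {
		l.pos += len(lit)
		return true
	}
	l.err = fmt.Errorf("expected %q at offset %d", lit, l.pos)
	return false
}

// until returns the text before the next delim and consumes the delim as well.
func (l *lexer) until(delim byte) (string, bool) {
	i := strings.IndexByte(l.input[l.pos:], delim)
	if i < 0 {
		l.err = fmt.Errorf("missing %q after offset %d", delim, l.pos)
		return "", false
	}
	text := l.input[l.pos : l.pos+i]
	l.pos += i + 1
	return text, true
}

// lexPrefix consumes the "- [" that starts every item.
func lexPrefix(l *lexer) stateFn {
	if !l.expect("- [") {
		return nil
	}
	return lexStatement
}

// lexStatement reads the text up to "]" and then requires "(".
func lexStatement(l *lexer) stateFn {
	text, ok := l.until(']')
	if !ok || !l.expect("(") {
		return nil
	}
	l.out.Statement = text
	return lexURL
}

// lexURL reads the text up to ")" and ends the machine.
func lexURL(l *lexer) stateFn {
	if text, ok := l.until(')'); ok {
		l.out.URL = text
	}
	return nil
}

// parse runs the state machine over one line. Anything after ")" is ignored.
func parse(line string) (Entity, error) {
	l := &lexer{input: line}
	for state := stateFn(lexPrefix); state != nil; {
		state = state(l)
	}
	if l.err != nil {
		return Entity{}, l.err
	}
	return l.out, nil
}

func main() {
	for _, line := range []string{
		"- [I want to delete or remove my last commit](#i-want-to-delete-or-remove-my-last-commit)",
		"invalid",
	} {
		e, err := parse(line)
		fmt.Printf("%+v %v\n", e, err)
	}
}
```

For a format this small the regular expression is shorter, but the state-function layout scales better once the grammar grows (escapes, nested brackets, several token kinds), because each new rule becomes one more state rather than a longer pattern.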

Related

How to remove all punctuation from a string in Godot?

I'm building a command parser and I've successfully managed to split strings into separate words and get it all working, but the one thing I'm a bit stumped at is how to remove all punctuation from the string. Users will input characters like , . ! ? often, but with those characters there, it doesn't recognize the word, so any punctuation will need to be removed.
So far I've tested this:
func process_command(_input: String) -> String:
	var words:Array = _input.replace("?", "").to_lower().split(" ", false)
It works fine and successfully removes question marks, but I want it to remove all punctuation. Hoping this will be a simple thing to solve! I'm new to Godot so still learning how a lot of the stuff works in it.
You could remove the unwanted characters by putting them in an array and then doing what you are already doing for each one:
var str_result = input
var unwanted_chars = [".", ",", ":", "?"] # and so on
for c in unwanted_chars:
	str_result = str_result.replace(c, "")
I am not sure what you want to achieve in the long run, but parsing strings can be easier with regular expressions. If you want to search strings for specific patterns, you should look into Godot's RegEx class.
Given some input, which I'll just write here as example:
var input := "Hello, It's me!!"
We want to get a modified version where we have filtered the characters:
var output := ""
We need to know what we will filter. For example:
var deny_list := [",", "!"]
We could use a list of accepted characters instead; you would just flip the conditional later on.
And then we can iterate over the string, for each character decide if we want to keep it, and if so add it to the output:
for position in input.length():
	var current_character := input[position]
	if not deny_list.has(current_character):
		output += current_character

Regular expression match / break

I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.
What I want: To count the number of words in the 10-K only before the first instance of ‘</DOCUMENT>’
How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.
What is happening: No matter where I put my regex match break, the program counts all of the words in the entire document. I have no error, however I cannot get the desired results.
How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, I yield about 750,000 fewer words.
What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe that the two functions may be performed over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only performs on the 10-K, but I have had no luck.
The code below is a snippet from a larger function. Its purpose is to (1) strip HTML tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.
The remaining code (not shown) extracts firm and report identifiers, (e.g., ‘file’ or ‘cik’) from the header section between tags ‘<SEC-HEADER>’ and ‘</SEC-HEADER>’. Using the same logic, when extracting header information, I use a regex match break logic and it works perfectly. I need help trying to understand why this same logic isn’t working when I try to count the number of words and how to correct my code. Any help is appreciated.
regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)
for line in f:
    def get_text_from_html(html:str):
        doc = lxml.html.fromstring(html)
        for table in doc.xpath('.//table'): # optional: removes tables from HTML source code
            table.getparent().remove(table)
        for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
            for element in doc.findall(tag):
                if element.text:
                    element.text = element.text + "\n"
                else:
                    element.text = "\n"
        return doc.text_content()
    to_clean = f.read()
    clean = get_text_from_html(to_clean)
    #print(clean[:20000])
    def count_words(clean):
        words = re.findall(r"\b[a-zA-Z\'\-]+\b",clean)
        word_count = len(words)
        return word_count
    header_vars["words"] = count_words(clean)
    match = regex_end10k.search(line) # This should do it, but it doesn't.
    if match:
        break
You don't need a regex for this. Split the original string at the closing tag and count the words in the part before it. A simple example:
text = 'Text before </DOCUMENT> text after'
# Everything before the first </DOCUMENT> is element 0 of the split.
splited_text = text.split('</DOCUMENT>')
splited_text_before = splited_text[0]
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)
output
Text before
2

Formatting string in Powershell but only first or specific occurrence of replacement token

I have a regular expression that I use several times in a script, where a single word gets changed but the rest of the expression remains the same. Normally I handle this by just creating a regular expression string with a format like the following example:
# Simple regex looking for exact string match
$regexTemplate = '^{0}$'
# Later on...
$someString = 'hello'
$someString -match ( $regexTemplate -f 'hello' ) # ==> True
However, I've written a more complex expression where I need to insert a variable into the expression template and... well regex syntax and string formatting syntax begin to clash:
$regexTemplate = '(?<=^\w{2}-){0}(?=-\d$)'
$awsRegion = 'us-east-1'
$subRegion = 'east'
$awsRegion -match ( $regexTemplate -f $subRegion ) # ==> Error
Which results in the following error:
InvalidOperation: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.
I know what the issue is, it's seeing one of my expression quantifiers as a replacement token. Rather than opt for a string-interpolation approach or replace {0} myself, is there a way I can tell PowerShell/.NET to only replace the 0-indexed token? Or is there another way to achieve the desired output using format strings?
If a format-string template contains literal { or } characters, double them ({{ and }}) so that the -f operator treats them as literal braces rather than as numbered placeholders.
Try
$regexTemplate = '(?<=^\w{{2}}-){0}(?=-\d$)'

Replacing a certain part of string with a pre-specified Value

I am fairly new to Puppet and Ruby. Most likely this question has been asked before but I am not able to find any relevant information.
In my puppet code I will have a string variable retrieved from the fact hostname.
$n="$facts['hostname'].ex-ample.com"
I am expecting to get the values like these
DEV-123456-02B.ex-ample.com,
SCC-123456-02A.ex-ample.com,
DEV-123456-03B.ex-ample.com,
SCC-999999-04A.ex-ample.com
I want to perform the following action: change the string to lowercase and then replace -02, -03 or -04 with -01.
So my output would be like
dev-123456-01b.ex-ample.com,
scc-123456-01a.ex-ample.com,
dev-123456-01b.ex-ample.com,
scc-999999-01a.ex-ample.com
I figured I would need to use .downcase on $n to make everything lowercase, but I am not sure how to replace the digits. I was thinking of .gsub or split, but I am not sure how to apply them. I would prefer to do this in a one-liner.
If you really want a one-liner, you could run this against each string:
str
  .downcase
  .split('-')
  .map
  .with_index { |substr, i| i == 2 ? substr.gsub(/0[0-9]/, '01') : substr }
  .join('-')
Without knowing what format your input list is taking, I'm not sure how to advise on how to iterate through it, but maybe you have that covered already. Hope it helps.
Note that Puppet and Ruby are entirely different languages and the other answers are for Ruby and won't work in Puppet.
What you need is:
$h = downcase(regsubst($facts['hostname'], '..(.)$', '01\1'))
$n = "${h}.ex-ample.com"
notice($n)
Note:
The downcase and regsubst functions come from stdlib.
The regsubst call matches ..(.)$, that is, two characters followed by one captured character at the end of the string, and replaces the match with 01 followed by the captured character.
All of that is then downcased.
If the -02 to -04 part always sits at the same string index, you can use that index to replace the content.
original = 'DEV-123456-02B.ex-ample.com'
#                  11 -^
string = original.downcase # creates a new downcased string
string[11, 2] = '01' # replace from index 11, 2 characters
string #=> "dev-123456-01b.ex-ample.com"

Gradle: How to filter and search through text?

I'm fairly new to gradle. How do I filter text in the following manner?
Pretend that the output/result I want to filter will be the two URLs below.
"http://localhost/artifactory/appNameIwant/moreStuffHereThatsDynamic"
> I want this URL
"http://localhost/artifactory/differentAppName"
> I don't want this URL
I want to put up a "match" variable that would be something like
variable = http://localhost/artifactory/appnameIwant
So essentially, the string will not be a perfect match. I want it to filter and provide back any URLs that start with the variable listed above. It cannot be a perfect match as the characters after the /appnameIwant/ will be changing.
I want to use a for loop to cycle through an array, with an if then statement to return any matches. For instance.
for (i=0; i < results.length; i++){
if (results[i] strings matches (http://localhost/artifactory/appnameIwant) {
return results[i] }
I am just filtering the URL strings themselves, not anything complicated inside the webpages.
Let me know if further explanation would be helpful.
Thanks so much for your time and help!
I figured it out - I just used
if (string.startsWith("texthere")) { println string }
A lot easier than I thought!
