How can I check whether a sentence contains certain combinations? For example, consider the sentence:
John appointed as new CEO for google.
I need to write a rule to check whether the sentence contains < 'new' + 'Jobtitle' >.
How can I achieve this? I tried the following, but I still need to check whether 'new' appears before the job title:
Rule: CustomRules
(
  {
    Sentence contains {Lookup.majorType == "organization"},
    Sentence contains {Lookup.majorType == "jobtitle"},
    Sentence contains {Lookup.majorType == "person_first"}
  }
)
One way to handle this is to invert the approach: focus on the sequence you need first, and then get the covering Sentence:
(
  {Token.string == "new"}
  {Lookup.majorType == "jobtitle"}
):newJT
You should also check the edge case where a Sentence boundary falls right after "new", i.e. "new" and the job title end up in different sentences:
new
CEO
You can use something like this:
{Token ... }
{!Sentence, Lookup.majorType ...}
And then get the sentence (if you really need it) in the Java RHS:
// offsets spanned by the matched <'new' + jobtitle> sequence
long start = newJTAnnots.firstNode().getOffset();
long end = newJTAnnots.lastNode().getOffset();
// the Sentence annotation(s) covering that span
AnnotationSet sentences = inputAS.get("Sentence", start, end);
I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.
What I want: To count the number of words in the 10-K only before the first instance of ‘</DOCUMENT>’
How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.
What is happening: No matter where I put my regex match break, the program counts all of the words in the entire document. I have no error, however I cannot get the desired results.
How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, I yield about 750,000 fewer words.
Current output: (screenshot posted as a link; it shows the count for the entire document)
What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe the two functions are being run over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only runs on the 10-K, but I have had no luck.
The code below is a snippet from a larger function. Its purpose is to (1) strip HTML tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.
The remaining code (not shown) extracts firm and report identifiers (e.g., 'file' or 'cik') from the header section between the tags '<SEC-HEADER>' and '</SEC-HEADER>'. When extracting header information I use the same regex-match-and-break logic, and it works perfectly. I need help understanding why that logic isn't working when I try to count the number of words, and how to correct my code. Any help is appreciated.
regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)

for line in f:
    def get_text_from_html(html: str):
        doc = lxml.html.fromstring(html)
        for table in doc.xpath('.//table'):  # optional: removes tables from HTML source code
            table.getparent().remove(table)
        for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
            for element in doc.findall(tag):
                if element.text:
                    element.text = element.text + "\n"
                else:
                    element.text = "\n"
        return doc.text_content()

    to_clean = f.read()
    clean = get_text_from_html(to_clean)
    #print(clean[:20000])

    def count_words(clean):
        words = re.findall(r"\b[a-zA-Z\'\-]+\b", clean)
        word_count = len(words)
        return word_count

    header_vars["words"] = count_words(clean)

    match = regex_end10k.search(line)  # This should do it, but it doesn't.
    if match:
        break
You don't need regex; just split your original string and count the words in the part before the tag. Simple example below:
text = 'Text before <DOCUMENT> text after'
splited_text = text.split('<DOCUMENT>')
splited_text_before = splited_text[0]
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)
output
Text before
2
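The reason the original loop counts everything: f.read() inside the loop consumes the rest of the file in a single call, so the whole submission has already been stripped and counted before the regex ever triggers the break. Applying the split idea to the filing itself might look like this (the file name is hypothetical, and get_text_from_html is the function from the question):
import re

with open("filing.txt") as f:  # hypothetical file name
    full_submission = f.read()

# keep only the text before the first </DOCUMENT>, then strip tags and count
tenk_only = full_submission.split("</DOCUMENT>", 1)[0]
clean = get_text_from_html(tenk_only)
word_count = len(re.findall(r"\b[a-zA-Z'\-]+\b", clean))
print(word_count)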
I am trying to perform a smart dynamic lookup with strings in Python for an NLP-like task. I have a large number of similarly structured sentences that I would like to parse through, tokenizing parts of each sentence. For example, I might first parse a string such as "bob goes to the grocery store".
I am taking this string in, splitting it into words and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
# do dynamic length lookups
Now the issue is this: sometimes my sentences might simply be "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" for sure but if there are descriptive words such as "grocery" or "computer" before the word store I would like to capture that as well. That is why I have the keyphrases list as well. I am trying to figure out a way to basically capture a keyword at the very least then if there are words related to it that might be a possible "phrase" I want to capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing these sort of variable length lookups where I look at more than just a single word if one is captured, or is there an entirely different method I should be considering?
Here is how you can use a nested for loop and a formatted string:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']
for kw in keywords:
    for kp in keyphrases:
        if f"{kp} {kw}" in sample:
            # Do something
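Building on that, a minimal sketch (assuming the goal is to report the most specific match per keyword, which is my reading of the question) could record the bare keyword as a fallback and upgrade it when a modifier precedes it:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
modifiers = ['grocery', 'computer', 'coffee']

matches = []
words = sample.split()
for kw in keywords:
    if kw in words:
        phrase = kw  # fall back to the bare keyword
        for mod in modifiers:
            if f"{mod} {kw}" in sample:
                phrase = f"{mod} {kw}"  # prefer the longer, more specific phrase
                break
        matches.append(phrase)

print(matches)  # ['grocery store']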
I was trying to make the bot replace multiple words in one sentence with another word.
ex: User will say "Today is a great day"
and the bot shall answer "Today is a bad night"
the words "great" and "day" were replaced by the words "bad" and "night" in this example.
I've been searching for similar code, but unfortunately all I could find are "word-blacklisting" scripts.
I tried to do some coding with it, but I'm not an expert with node.js and the code came out really badly; it's honestly not worth showing.
The user will say some sentence, the bot will recognize certain predetermined words in it, and it will replace those words with other words I decide on in the script.
We can use String.replace() combined with Regular Expressions to match and replace single words of your choosing.
Consider this example:
function antonyms(string) {
  return string
    .replace(/(?<![A-Z])terrible(?![A-Z])/gi, 'great')
    .replace(/(?<![A-Z])today(?![A-Z])/gi, 'tonight')
    .replace(/(?<![A-Z])day(?![A-Z])/gi, 'night');
}

const original = 'Today is a tErRiBlE day.';
console.log(original);

const altered = antonyms(original);
console.log(altered);

const testStr = 'Daylight is approaching.'; // Note that this contains 'day' *within* a word.
const testRes = antonyms(testStr);          // The lookarounds in the regex prevent replacement.
console.log(testRes);                       // If this isn't the desired behavior, you can remove them.
I am having great trouble with JAPE grammars. I have a small token dictionary of words that need to be matched against 5 types of documents.
One dictionary per type: for the Job type, for example, the dictionary would contain { "Engineer", "Doctor", "Manager" }. I need to read this dictionary and create JAPE rules from it. This is my first try:
Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true
Rule: Jobs
(
  {Lookup.majorType == "Doctor"}
  (
    {Lookup.majorType == "Engineer"}
  )?
):jobs
-->
:jobs.JobTitle = {rule = "Jobs"}
Is there any way to automatically create JAPE rules like this, ones that do nothing but search documents for the tokens in a dictionary?
Why not use a standard gazetteer, where the last parameter in the .def file can be a custom annotation type like "Doctor" or "Engineer"?
Something like: keywords.lst:Doctor:Doctor::Doctor
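For illustration, each line of a gazetteer .def file has the form listfile:majorType:minorType:language:annotationType, so a minimal setup (the file names here are hypothetical) could be:
lists.def:
doctors.lst:Doctor:Doctor::Doctor
engineers.lst:Engineer:Engineer::Engineer
doctors.lst:
Doctor
Physician
Surgeon
With the annotation type set in the last field, each match is annotated directly as Doctor or Engineer rather than as a generic Lookup, so no hand-written JAPE rule is needed per dictionary.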
I'm trying to split a sentence into words using Stanford CoreNLP.
I'm having problems with words that contain an apostrophe.
For example, the sentence:
I'm 24 years old.
Splits like this:
[I] ['m] [24] [years] [old]
Is it possible to split it like this using Stanford coreNLP?:
[I'm] [24] [years] [old]
I've tried using tokenize.whitespace, but then it doesn't split on other punctuation marks, like '?' and ','.
Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens (regarding "I'm" as a reduced form of "I am" by making it the two "words" [I] ['m]). It sounds like you want a different type of tokenization.
While there are some tokenization options, there isn't one to change this, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer, changing (deleting) the treatment of REDAUX and SREDAUX trailing contexts.
You can also join contractions via post processing as #dhg suggests, but you'd want to do it a little more carefully in the "if" so it didn't join on quotes.
How about if you just re-concatenate tokens that are split by an apostrophe?
Here's an implementation in Java:
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public static List<String> tokenize(String s) {
    PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
            new StringReader(s), new CoreLabelTokenFactory(), "");
    List<String> sentence = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();
    for (CoreLabel label; ptbt.hasNext(); ) {
        label = ptbt.next();
        String word = label.word();
        if (word.startsWith("'")) {
            // token is a split piece like 'm or 's: glue it back onto the previous token
            sb.append(word);
        } else {
            if (sb.length() > 0)
                sentence.add(sb.toString());
            sb = new StringBuilder();
            sb.append(word);
        }
    }
    if (sb.length() > 0)
        sentence.add(sb.toString());
    return sentence;
}

public static void main(String[] args) {
    System.out.println(tokenize("I'm 24 years old.")); // [I'm, 24, years, old, .]
}
There are possessives and contractions. Your example is a contraction. Just looking for an apostrophe won't tell you the difference between the two. "This is Pete's answer. I'm sure you knew that." In these two sentences we have one of each case.
With the part-of-speech tags we can tell the difference. With the Tsurgeon syntax you can assemble those, change them, and so forth. The syntax is listed here: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/package-summary.html. I've found Tsurgeon to be really useful for pulling apart NP groups, as I like to break them up over conjunctions.
Alternatively, does 'm stem to "am"? You might want to look for those tokens, check the stem tag, and simply revert each one to its stem value. Stemming is extremely useful in many other aspects of machine learning and analysis.
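To illustrate that POS-tag distinction concretely, here is a minimal sketch using NLTK rather than the Stanford tools (my substitution, for brevity; it assumes the punkt and averaged_perceptron_tagger models are downloaded):
import nltk

# Penn Treebank tags mark the possessive 's as POS and the contraction 'm as a verb (VBP)
tokens = nltk.word_tokenize("This is Pete's answer. I'm sure you knew that.")
for word, tag in nltk.pos_tag(tokens):
    print(word, tag)
Since "'s" comes back tagged POS while "'m" comes back tagged VBP, a re-join step can merge only the true contractions and leave possessives alone.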