How to extract relationships from a text - nlp

I am new to NLP and need guidance on how to solve this problem.
I am currently doing a filtering technique where I need to brand data in a database as either being correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that follows those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[0-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).

This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
    "are bigger than" : ">",
    "are smaller than" : "<",
    "contain" : "LIKE",
    ...
}
value_desc = {
    "numeric characters" : "%[0-9]%",
    ...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
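The lexicon approach above could be sketched in Python roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name parse_rule, the shortened relation phrases, the handling of the "values in" prefix, and the use of [0-9] for numeric characters are all assumptions made for illustration.

```python
# Hypothetical lexicons, adapted from the example above.
COLUMN_TRIGGER = "Values in the column"
RELATIONS = {
    "bigger than": ">",
    "smaller than": "<",
    "contain": "LIKE",
}
VALUE_DESC = {"numeric characters": "%[0-9]%"}


def parse_rule(line):
    """Turn one condition sentence into an 'if ...' rule, or None if it does not fit."""
    if not line.startswith(COLUMN_TRIGGER):
        return None
    # The first word after the trigger is taken to be the column name.
    column, _, rest = line[len(COLUMN_TRIGGER):].strip().partition(" ")
    for phrase, symbol in RELATIONS.items():
        _, found, target = rest.partition(phrase)
        if not found:
            continue
        target = target.strip()
        # "values in StartDate" refers to another column: keep only its name.
        if target.startswith("values in "):
            target = target[len("values in "):]
        # Replace a descriptive target with its symbolic form when one is known.
        target = VALUE_DESC.get(target, target)
        return f"if {column} {symbol} {target}"
    return None


conditions = [
    "Values in the column ID which are bigger than 99",
    "Values in the column Cash which are smaller than 10000",
    "Values in the column EndDate that are smaller than values in StartDate",
    "Values in the column Name that contain numeric characters",
]
rules = [parse_rule(c) for c in conditions]
```

Sentences that do not start with the trigger phrase, or that use a relation missing from the lexicon, return None, so they can be flagged for manual review instead of being silently mistranslated.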

Search the best match comparing prefixes

I have number codes and text codes as in table1 below, and numbers to search for as in table2.
For each number I want the best match on a prefix of minimum length 3, comparing from left to right, and to show the corresponding TEXT CODE as the answer.
If there is an exact match, that is the answer.
If no value shares a prefix of at least 3 characters, the answer is "not found".
Next to table2 I show some comments explaining the conditions applied to get the expected answer for each number to search.
My current attempt shows the exact matches, but I'm not sure how to compare the values under the other conditions, when there is no exact match.
ncode = ["88271","1893","107728","4482","3527","71290","404","5081","7129","33751","3","40489","107724"]
tcode = ["RI","NE","JH","XT","LF","NE","RI","XT","QS","XT","YU","WE","RP"]
tosearch = ["50923","712902","404","10772"]
out = []
out.append([])
out.append([])
for code in tosearch:
    for nc in ncode:
        if code == nc:
            indexOfMatched = ncode.index(nc)
            out[0].append(nc)
            out[1].append(tcode[indexOfMatched])
>>> out
[['404'], ['RI']]
The expected output would be
out = [
['50923', '712902', '404', '10772'],
['NOT FOUND', 'NE', 'RI', 'JH' ]
]
A simple solution you might consider would be the fuzzy-match library. It compares strings and calculates a similarity score. It really shines with strings rather than numbers, but it could easily be applied to find similar results in your prefix numbers.
Check out fuzzy-match here.
Here is a well written fuzzy-match tutorial.
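For the stated rule itself (exact match first, otherwise the longest shared prefix of at least 3 characters), a plain standard-library sketch may be enough. Note that this is not the fuzzy-match library mentioned above. The function name best_prefix_match, the min_len parameter, and the choice to break ties by first occurrence are assumptions, since the question's tables are not shown.

```python
from os.path import commonprefix  # character-wise common prefix of strings

ncode = ["88271", "1893", "107728", "4482", "3527", "71290", "404",
         "5081", "7129", "33751", "3", "40489", "107724"]
tcode = ["RI", "NE", "JH", "XT", "LF", "NE", "RI",
         "XT", "QS", "XT", "YU", "WE", "RP"]
tosearch = ["50923", "712902", "404", "10772"]


def best_prefix_match(code, ncode, tcode, min_len=3):
    """Return the text code of the best match for code, or 'NOT FOUND'."""
    # An exact match always wins.
    if code in ncode:
        return tcode[ncode.index(code)]
    best_len, best_text = 0, "NOT FOUND"
    for nc, tc in zip(ncode, tcode):
        shared = len(commonprefix([code, nc]))
        # Strict '>' keeps the first entry when several share the same length.
        if shared >= min_len and shared > best_len:
            best_len, best_text = shared, tc
    return best_text


out = [tosearch, [best_prefix_match(c, ncode, tcode) for c in tosearch]]
```

With the sample data this reproduces the expected output shown in the question, including 'JH' for '10772', where '107728' and '107724' tie and the earlier entry is kept.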

FuzzyWuzzy for very similar records in Python

I have a dataset with which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Where t is the string and dev2 is the list to compare against. My problem is that the data sometimes contains very similar records, and the options provided by FuzzyWuzzy seem to be lacking. I've tested token_sort, token_set, partial_token sort and set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one wanted is obviously the second one, but character by character the first one is closer, and it is a different league.
This happens with teams as well. If, say, I have the string Buchtholz, I obtain Buchtholz II before I get TSV Buchtholz.
My main guess now would be to try and weight the presence and absence of several characters more heavily, like single capital letters at the end of the string, so if there is a difference in the letter or an absence it is weighted as less close. Or for () and special characters.
I don't know if there is a way to take this into account or you guys have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions, the number sequences were more important than the text. E.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000", the 250 and 2000 parts of the string are important; otherwise I get "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON", which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio). 2. Get the number sequences of these results and pass them through another round of matching with a stricter scorer (token_sort_ratio). Doing this gave me the better result in the example I showed above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers from a string. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    numbers = ' '.join([nr for nr in numbers.split()])
    return numbers
and here is a portion of the code I used to put the description match through two rounds:
try:
    # get close match from goods move that has material numbers
    df_material = pd.DataFrame(process.extract(description,
                                               corpus_material,
                                               scorer=fuzz.token_set_ratio),
                               columns=['Similar Text','Score']
                               )
    if df_material['Score'][df_material['Score']>=cut_off_accuracy_materials].count()>=1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers>4:
            # if there are multiple matches found, then get best number combination match
            df_material = df_material[df_material['Score']>=cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(process.extract(description_numbers,
                                                      new_corpus,
                                                      scorer=fuzz.token_sort_ratio),
                                      columns=['numbers','Score']
                                      )
            similar_text = df_material['Similar Text'][df_material['numbers']==df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
hope it helps, and good luck
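For readers who want to try the two-round idea without the surrounding data frames, here is a self-contained sketch. It swaps in the standard library's difflib.SequenceMatcher for the FuzzyWuzzy scorers and uses a simple set overlap of digit runs for the second round, so it illustrates the shape of the approach rather than the code above. The names two_stage_match, number_overlap and loose_cutoff, and the cutoff value 0.3, are assumptions.

```python
import re
from difflib import SequenceMatcher


def numbers_in(text):
    """Return the set of digit runs found in text, e.g. {'250', '2000'}."""
    return set(re.findall(r"\d+", text))


def number_overlap(a, b):
    """Jaccard overlap of the digit runs in a and b (0.0 if neither has any)."""
    na, nb = numbers_in(a), numbers_in(b)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def two_stage_match(query, corpus, loose_cutoff=0.3):
    """Round 1: loose character similarity. Round 2: agreement on numbers."""
    # Round 1 (optimistic): keep every candidate that is loosely similar.
    candidates = [c for c in corpus
                  if SequenceMatcher(None, query, c).ratio() >= loose_cutoff]
    if not candidates:
        return None
    # Round 2 (strict): among the survivors, prefer the best number agreement.
    return max(candidates, key=lambda c: number_overlap(query, c))


corpus = [
    "SLURRY VALVE 50MM RAGMAX 2000",
    "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON",
]
best = two_stage_match("SLURRY VALVE 250MM RAGMAX 2000", corpus)
```

On the valve example, the first candidate is closer character by character, but the second one shares both 250 and 2000 with the query and is therefore chosen in the second round. For the league and team names in the question, the second round would need a different domain-specific signal, such as the trailing letter or suffix.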

Replace words marked with offsets

I have a sentence like this:
"My name is Bond. It's a fake name."
and I have to replace some words given in a list, together with the offsets of each word:
name, 29-33; Bond, 11-15; name, 3-7
In addition, each word must be replaced with a specific word:
name -> noun
Bond -> proper
I have to obtain this output:
"My noun is proper. It's a fake noun."
I tried to manage the offsets with a post-offset variable that I update after each replacement, but that does not work because the list is unordered. Note that the find method is not an option either, because names are repeated. Is there an algorithm to do this? Is there any vectorized implementation (String, Numpy, NLTK) that computes it in one step?
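Check this one:
string = "My name is Bond. It's a fake name."
words = string.split(" ")  # break the string into words
Now traverse the list and replace the matching words. split() leaves punctuation attached (e.g. "Bond." and "name."), so compare with the trailing punctuation stripped:
replacements = {"name": "noun", "Bond": "proper"}
for idx, word in enumerate(words):
    core = word.rstrip(".,!?")
    if core in replacements:
        words[idx] = replacements[core] + word[len(core):]
Now the list values are changed; use the join() method to turn the list back into a string:
result = " ".join(words)
Note that this replaces every occurrence of the listed words and does not use the offsets.
For more, please refer to https://www.tutorialspoint.com/python/python_strings.htm
This page covers string processing in Python.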
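If the offsets themselves must drive the replacement, one common approach is to apply the replacements from right to left, so that earlier offsets are not shifted by later edits. The following is a minimal sketch under the assumption that offsets are 0-based with an exclusive end, which is consistent with the example (3-7 covers "name"). The function name replace_by_offsets is hypothetical.

```python
def replace_by_offsets(text, spans, mapping):
    """Replace each (word, start, end) span in text using mapping[word].

    Spans may arrive in any order; they are processed from the highest
    start offset to the lowest so pending offsets remain valid.
    """
    for word, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        # Guard against offsets that do not point at the expected word.
        if text[start:end] != word:
            raise ValueError(f"offsets {start}-{end} do not cover {word!r}")
        text = text[:start] + mapping[word] + text[end:]
    return text


sentence = "My name is Bond. It's a fake name."
spans = [("name", 29, 33), ("Bond", 11, 15), ("name", 3, 7)]
mapping = {"name": "noun", "Bond": "proper"}
result = replace_by_offsets(sentence, spans, mapping)
```

Sorting once and working backwards avoids the post-offset bookkeeping described in the question, and it handles repeated words because only the given positions are touched.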

Access list element using get()

I'm trying to use get() to access a list element in R, but am getting an error.
example.list <- list()
example.list$attribute <- c("test")
get("example.list") # Works just fine
get("example.list$attribute") # breaks
## Error in get("example.list$attribute") :
## object 'example.list$attribute' not found
Any tips? I am looping over a vector of strings which identify the list names, and this would be really useful.
Here's the incantation that you are probably looking for:
get("attribute", example.list)
# [1] "test"
Or perhaps, for your situation, this:
get("attribute", eval(as.symbol("example.list")))
# [1] "test"
# Applied to your situation, as I understand it...
example.list2 <- example.list
listNames <- c("example.list", "example.list2")
sapply(listNames, function(X) get("attribute", eval(as.symbol(X))))
# example.list example.list2
# "test" "test"
Why not simply:
example.list <- list(attribute="test")
listName <- "example.list"
get(listName)$attribute
# or, if both the list name and the element name are given as arguments:
elementName <- "attribute"
get(listName)[[elementName]]
If your strings contain more than just object names, e.g. operators like here, you can evaluate them as expressions as follows:
> string <- "example.list$attribute"
> eval(parse(text = string))
[1] "test"
If your strings are all of the type "object$attribute", you could also parse them into object/attribute, so you can still get the object, then extract the attribute with [[:
> parsed <- unlist(strsplit(string, "\\$"))
> get(parsed[1])[[parsed[2]]]
[1] "test"
flodel's answer worked for my application, so I'm gonna post what I built on it, even though this is pretty uninspired. You can access each list element with a for loop, like so:
#============== List with five elements of non-uniform length ================#
example.list=
list(letters[1:5], letters[6:10], letters[11:15], letters[16:20], letters[21:26])
#===============================================================================#
#====== for loop that names and concatenates each consecutive element ========#
derp=c(); for(i in 1:length(example.list))
{derp=append(derp,eval(parse(text=example.list[i])))}
derp #Not a particularly useful application here, but it proves the point.
I'm using code like this for a function that calls certain sets of columns from a data frame by the column names. The user enters a list with elements that each represent different sets of column names (each set is a group of items belonging to one measure), and the big data frame containing all those columns. The for loop applies each consecutive list element as the set of column names for an internal function* applied only to the currently named set of columns of the big data frame. It then populates one column per loop of a matrix with the output for the subset of the big data frame that corresponds to the names in the element of the list corresponding to that loop's number. After the for loop, the function ends by outputting that matrix it produced.
Not sure if you're looking to do something similar with your list elements, but I'm happy I picked up this trick. Thanks to everyone for the ideas!
"Second example" / tangential info regarding application in graded response model factor scoring:
Here's the function I described above, just in case anyone wants to calculate graded response model factor scores* in large batches... Each column of the output matrix corresponds to an element of the list (i.e., a latent trait with ordinal indicator items specified by column name in the list element), and the rows correspond to the rows of the data frame used as input. Each row should presumably contain mutually dependent observations, as from a given individual, to whom the factor scores in the same row of the output matrix belong. Also, I feel I should add that if all the items in a given list element use the exact same Likert scale rating options, the graded response model may be less appropriate for factor scoring than a rating scale model (cf. http://www.rasch.org/rmt/rmt143k.htm).
'grmscores'=function(ColumnNameList,DataFrame) {require(ltm) #(Rizopoulos,2006)
x = matrix ( NA , nrow = nrow ( DataFrame ), ncol = length ( ColumnNameList ))
for(i in 1:length(ColumnNameList)) #flodel's magic featured below!#
{x[,i]=factor.scores(grm(DataFrame[, eval(parse(text= ColumnNameList[i]))]),
resp.patterns=DataFrame[,eval(parse(text= ColumnNameList[i]))])$score.dat$z1}; x}
Reference
*Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses, Journal of Statistical Software, 17(5), 1-25. URL: http://www.jstatsoft.org/v17/i05/

In Clojure, what is a good way to compare user input Strings with Longs?

In Clojure I am building a card game. Cards have a suit and a score.
{:suit 1 :score 9}
The cards are created using ranges, e.g. (range suitTotal), so the class of the values of :suit and :score is Long.
Players send command strings, e.g. "discard1.9" is a discard request.
Using a regex to parse this:
(re-seq #"[0-9]+" command)
results in String items "1" and "9". A card created with these results would be
{:suit "1" :score "9"}
I would like this to compare as equal with the original card. At the moment I am using (Integer/parseInt) to convert the strings.
The suit value could be built from a different type, such as a keyword, but the score value is used as a number elsewhere.
Use read-string.
DEMO
user=> (read-string "1")
1
A good approach would be to parse the strings as numbers and then use = to compare.
user=> (Integer/parseInt "1")
1
The advantage of this over read-string is that it is more restrictive: it won't parse strings that look like Clojure data structures.
