I have the following two English sentences:
(1) Each student has to follow some courses and, specifically,
undergraduate students can follow only undergraduate courses but postgraduate
students can follow postgraduate courses and up to two undergraduate courses.
(2) Each student can have only a single mark for each course.
I have created an ontology in the Protege tool, but I don't know how to express the qualified cardinality restrictions "up to two" and "single" (exactly one).
What I have done so far:
First I created four classes, "Undergraduate Courses", "Undergraduate Student", "Postgraduate Courses" and "Postgraduate Student", and an ObjectProperty named followCourse. According to sentence (1), undergraduate students can follow only undergraduate courses, whereas postgraduate students can follow postgraduate courses and some undergraduate courses. Thus, I have written the following for the Postgraduate Student class:
followCourse some Postgraduate_Courses or Undergraduate_Courses.
I am not sure that this satisfies the "up to two" (≤2) limit on undergraduate courses, because some in description logic means "at least one" (screenshot 1).
For the second sentence about a "single mark", I have added the following to the Undergraduate_courses and Postgraduate_courses classes:
hasMark max 1 where hasMark is a DataProperty for the Curriculum class (screenshot 2).
I believe hasMark max 1 is wrong in this case, because it effectively says that a course can have at most one mark overall, so the same course could not have different marks for two different students. That is wrong, because two students can take the same course and receive separate marks.
(screenshot 1)
(screenshot 2)
Sentence 1
As for sentence 1, you are correct regarding using max 2 Undergraduate_Courses.
However, watch out for
isFollowedBy some Undergraduate_Student or Postgraduate_Student.
As you have it, it means
(isFollowedBy some Undergraduate_Student) or (Postgraduate_Student)
but what you really want is
isFollowedBy some (Undergraduate_Student or Postgraduate_Student).
Also, for Undergraduate_Student you specify followCourse only Undergraduate_Course which will correctly ensure that undergraduate students can only take undergraduate courses. However, as is, it will allow for undergraduate students that take zero undergraduate courses. Thus, to force undergraduate students to take at least 1 undergraduate course you have to change the statement as follows:
followCourse only Undergraduate_Course and followCourse some Undergraduate_Course.
Sentence 2
Here it is probably best to introduce a new concept, say StudentCourseMark, which is used to link a single student, a single course and a single mark. Assuming you have the object properties hasStudent and hasCourse and the data property hasMark, you can define StudentCourseMark as a subclass of
hasStudent exactly 1 Student and
hasCourse exactly 1 Course and
hasMark exactly 1 xsd:integer
However, this still allows 2 different individuals of StudentCourseMark to have the same student and course but different marks. To avoid this, you can add a key on student and course for StudentCourseMark:
HasKey:
hasCourse,
hasStudent
So,
Due to cultural differences, people in Hispanic countries have quite a number of surnames.
Taking someone else's surname isn't the norm; in most cases you just combine your surnames:
1st husband, 1st bride, 2nd husband, 2nd bride, 3rd husband, 3rd bride, 4th husband, 4th bride.
You have to add a second surname to get Spanish nationality, and some people just repeat their last name because they refuse to understand how culturally important this is in Spain.
Athletic Bilbao can get away with saying all of their players have Basque origins if they start tracing back the multiple surnames, and they have been known to do so and to approach foreign players with Basque surnames among the never-ending list to ask if they would be interested in joining.
This can be quite problematic in some cases but it makes it easy to differentiate people:
There can be a large number of Thomas Smiths in your city, but there are hardly ever two Thomas Smith matchingCommonSecondSurname in the same area.
Because of this, people in Hispanic countries are used to using at least two of their surnames unless their name is unique enough.
On to my issue:
My Dialogflow agent asks someone to identify themselves in order to provide some extra information to the business.
I have added multiple examples with several surnames. They are identified correctly by the training process, but in actual conversation the agent struggles with them, picking either the second surname alone or the first surname alone as the person entity, never the full name.
Neither option is valid in a Hispanic country, where I would be using this solution.
Anything I can do to improve this?
Creating a custom entity for a person seems like an arduous task to me.
It is not vital and I could do without the extra tidbit as I am storing their email already. It just seems like a basic thing that should be doable and I am struggling to believe I am the first person to face this issue.
I have a txt file with around 100 multiple-choice questions. I want to split it into 100 txt files, each containing one question. The delimiter is "number.", for example 1. for the first question, 2. for the next, and so on. The issue is that a "number." can also appear in the middle of a question, so a delimiter should be accepted only if "(d)" was encountered before it.
Sample text -
1.Consider the following statements with regard to the State Council of Ministers:1. The constitution specifies the size of the state council of ministers and the ranking of ministers.2.The advice tendered by the ministers to the Governor shall not be inquired into in any court. Which of the statements given above is/ are correct? (a)1 only(b)2 only(c)Both 1 and 2(d)Neither 1 nor 22.Consider the following statements with reference to Public Accounts Committee:1. The committee was set up under the provisions of the Government of India Act1919.2.Speaker is the ex- Officio Chairman of the committee. A minister cannot be a member of the committee. Which of the statements given above is/are not correct? (a)1 only(b)2 and 3 only(c)1 and 2 only(d)2 only
The text is like mentioned above, and it should be split like this -
1.Consider the following statements with regard to the State Council of Ministers:1. The constitution specifies the size of the state council of ministers and the ranking of ministers.2.The advice tendered by the ministers to the Governor shall not be inquired into in any court. Which of the statements given above is/ are correct? (a)1 only(b)2 only(c)Both 1 and 2(d)Neither 1 nor 2
2.Consider the following statements with reference to Public Accounts Committee:1. The committee was set up under the provisions of the Government of India Act1919.2.Speaker is the ex- Officio Chairman of the committee. A minister cannot be a member of the committee. Which of the statements given above is/are not correct? (a)1 only(b)2 and 3 only(c)1 and 2 only(d)2 only
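The rule described above can be sketched in Python as follows. This is only a minimal sketch under two assumptions: the questions are numbered consecutively starting at 1, and every question ends with a "(d)" option. The names split_questions and write_questions and the output file names are made up for illustration. Because the next expected number is searched for only after a "(d)", a fused ending such as "nor 22.Consider" is still split between the two 2s.

```python
from pathlib import Path


def split_questions(text):
    """Split text into questions numbered 1., 2., 3., ...

    A "number." is accepted as a delimiter only if a "(d)" option was
    seen since the start of the current question (assumption: every
    question has a "(d)" option and numbering is consecutive).
    """
    questions = []
    start = 0  # index where the current question begins
    n = 1      # number of the current question
    while True:
        d_pos = text.find("(d)", start)
        if d_pos == -1:
            # No complete question left; keep any trailing text as-is.
            rest = text[start:].strip()
            if rest:
                questions.append(rest)
            break
        # Look for the next question number only after the "(d)" option.
        next_start = text.find(f"{n + 1}.", d_pos)
        if next_start == -1:
            questions.append(text[start:].strip())
            break
        questions.append(text[start:next_start].strip())
        start = next_start
        n += 1
    return questions


def write_questions(questions, out_dir="questions"):
    """Write each question to its own file (hypothetical file naming)."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, question in enumerate(questions, start=1):
        (out / f"question_{i}.txt").write_text(question, encoding="utf-8")
```

Note that this would still split in the wrong place if option (d) itself contained the next question number followed by a dot.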
I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or things would get caught, like mustang the horse versus mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "Coca-Cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point. Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own prop database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that contains every brand name and simply match it against the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: Many "misspellings" stem from opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
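As a rough illustration of the form-similarity and hand-made-mapping ideas (phonetic indexing is left out), here is a minimal Python sketch using only the standard library's difflib. The product list, the alias table, the 0.8 cutoff and the function name canonicalize are all assumptions made for the example, not recommended values.

```python
import difflib

# Hypothetical canonical product names (would come from your database).
CANONICAL = ["snickers", "coca-cola", "pepsi", "mustang"]

# Hand-made mappings for aliases that string similarity cannot catch.
HAND_MADE = {"coke": "coca-cola"}


def canonicalize(token, cutoff=0.8):
    """Map a raw token to a canonical product name, or None if no match."""
    word = token.lower().lstrip("#")
    # 1. Exceptions and aliases from the hand-made list win.
    if word in HAND_MADE:
        return HAND_MADE[word]
    # 2. Otherwise fall back to a string-similarity check.
    matches = difflib.get_close_matches(word, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A lower cutoff catches more misspellings but also lets in more of the noise mentioned above (e.g. chic vs. chick), so it needs tuning against real data.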
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
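To make the idea of comparing lexical contexts concrete, here is a toy Python sketch. It represents the context of a word as a bag of the words within a small window around it and compares two contexts by cosine similarity. The window size, the whitespace tokenisation and the function names are assumptions for illustration; real systems build such vectors from large corpora.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=3):
    """Count the words within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - window):i])   # preceding words
            counts.update(tokens[i + 1:i + 1 + window])   # succeeding words
    return counts


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Given one context vector built from car-related sentences and one from horse-related sentences, a new mention of mustang can be assigned to whichever sense its own context vector is more similar to.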
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
I got a list of sentences like below:
They are some sentences I extracted from job descriptions. I want to extract information like: degree type, major, required or preferred.
There are
The result should be like :
{
degree: Bachelor,
major : Computer Science,
required: True
}
There are no obvious rules in these sentences. How can I achieve this goal?
Bachelor ’ s degree in Computer Science or equivalent
Pursuing B.S. or advanced degree in computer science or related technical/engineering degree .
Bachelor 's Degree in Computer Science or equivalent experience
Youre educated ( BS/MS in Computer Science or other technical degree ) .
•BS in Computer Science , Digital Media or similar technical degree with 3 + years of experience
· Bachelors degree .
Bachelor 's degree in computer science , design or related field
Ability to absorb , master and leverage emerging technologies
BA/BS degree or equivalent practical experience
Education Required : Bachelors Degree
• Bachelor 's degree in related field , OR four ( 4 ) years of experience in a directly related field .
So you are dealing with unstructured data. I hope that with the following steps you can reach a decent accuracy level.
Create a lookup table of all keywords that may occur for each of your required variables, like degree, education, etc. You need to mine various online sources to grab these keywords.
Split your data into sentences or lines and iterate over the list.
While iterating, look for the keywords from your lookup tables to find the useful lines.
Create hierarchical rules to accurately extract the variables, and append them to your result.
Overview of hierarchical rules:
For example, a degree name will be completely alphabetic.
Experience will be alphanumeric.
Terms like pursuing will point towards the variable Major.
Try to modify these rules on each iteration of code. Keep adding new rules.
This is just the basic approach. I believe that if you iterate on your methodology a few times, you will be able to extract the information.
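The steps above can be sketched in Python roughly as follows. The lookup tables are tiny, hypothetical examples, and the rules are assumptions chosen for illustration: abbreviations always count as a degree, full words such as "master" count only when the line also contains the word "degree", and a line is treated as required unless it contains a "preferred"-style cue.

```python
import re

# Hypothetical lookup tables; in practice these would be mined from real data.
ABBREVIATIONS = {"bs": "Bachelor", "ba": "Bachelor", "ms": "Master"}
FULL_FORMS = {"bachelor": "Bachelor", "bachelors": "Bachelor",
              "master": "Master", "masters": "Master"}
MAJORS = ["Computer Science", "Digital Media", "Design"]
PREFERRED_CUES = ("preferred", "desired", "a plus")


def extract(line):
    """Return a dict with degree, major and required, or None if no degree."""
    lowered = line.lower()
    # Tokenise on letters and dots, then drop dots so "b.s." becomes "bs".
    tokens = [t.replace(".", "") for t in re.findall(r"[a-z.]+", lowered)]
    has_degree_word = "degree" in tokens

    degrees = []
    for tok in tokens:
        name = ABBREVIATIONS.get(tok)
        # Rule: full words only count next to "degree", so the verb
        # "master" in "absorb, master and leverage" is ignored.
        if name is None and has_degree_word:
            name = FULL_FORMS.get(tok)
        if name and name not in degrees:
            degrees.append(name)
    if not degrees:
        return None

    majors = [m for m in MAJORS if m.lower() in lowered]
    required = not any(cue in lowered for cue in PREFERRED_CUES)
    return {"degree": degrees, "major": majors, "required": required}
```

Running extract over each line and discarding the None results gives one record per useful line; new rules can then be added as mismatches show up.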
You probably need to gather a list of majors and degrees (for example: http://en.wikipedia.org/wiki/List_of_tagged_degrees) to extract the degree and major. Then, based on some general rules (or by designing a classifier), decide on "required" or "not required".
Another suggestion to do this would be:
First: clean up the data by removing all punctuation, stop words, unwanted symbols, etc.
Second: make a list of the keywords you are interested in.
Third: split your data into words (word_tokenize in nltk).
Fourth: make a dictionary of the values you are looking for.
Fifth: look up each word in the dictionary as you read the word list, matching it against your keyword list, and append the hits to a new output dictionary.
Hope this helps.
Is it possible to count how many times an entity has been mentioned in an article? For example
ABC Company is one of the largest car manufacturers in the
world. It is also the largest
company in terms of annual production.
It is also the second largest exporter of luxury cars, after XYZ
company. Both ABC and XYZ
together produces over n% of total car
production in the country.
mentions ABC company 4 times.
Yes, this is possible. It's a combination of
named-entity recognition (NER), which for English is practically a solved problem, and
coreference resolution, which is the subject of ongoing research (but give this package a try)