How to break a line into known words - Excel

I need to break a line of text into different columns in Excel. Here is the input that I get.
Input:
37006 II Semester P.G. Diploma in Clinical Research and Clinical Data Management Examination, July/August 2012 Pharma Regulatory Affairs Time : 3 Hours Max. Marks : 100
Output: CSV record with structure (Code, Sem/Year, Subject, Course, Exam Date, Time, Marks)
37006 , II Semester, P.G. Diploma in Clinical Research and Clinical Data Management, Pharma Regulatory Affairs, July/August 2012 , 3 Hours , 100
I have data in different sets which construct the above lines. For example:
Grammar (this is an array / dictionary):
Semesters[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
Years[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
Subjects[P.G. Diploma in Clinical Research and Clinical Data Management, LL.B]
Courses[Pharma Regulatory Affairs,Law - Jurisprudence]
ExamDates[ July/August 2012 , Jan./Feb. 2013 ]
Time[3 Hours]
MaxMarks[30,40,50,60,70,80,90,100]
FYI,
I'm not sure that I can use any delimiters to break it up, as the text is highly unpredictable and not dependable.
I'm not sure the text will be in the same order in each line, and there is no fixed length of characters or words.
My assumption is: read word by word and try to match each word against the words in the arrays I have. If it matches, categorize that word under the corresponding category and add it to the relevant column in Excel.
I know how to handle the data and everything else; what I don't know is the optimal / best way to
determine which category each word falls under.
Is there any lexical analysis expert who can share some thoughts on this?

You should use regular expressions for matching such complicated text patterns.

Please take a look at a lexical analyzer like ANTLR. If you know Java or other languages that read regular expressions, you will be able to parse these with ease after an afternoon (or week) of torture. You can also write the regexp in Java, but I would nudge you toward the ANTLR interface, which you may use from Eclipse. It will show you how the lines are being parsed.
Have ANTLR or the Java code write out a CSV file. The CSV will be your vehicle for getting your data into the Excel spreadsheet.
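If you want something lighter than a full grammar, here is a rough Python sketch (not ANTLR) of the dictionary-matching idea from the question: greedy longest-phrase matching against the known lists, then writing one CSV record. The lists and file name below are just the sample grammar from the question, standing in for the real data.

# Rough sketch, not ANTLR: greedy longest-phrase match against the known lists,
# then write one CSV record.
import csv

GRAMMAR = {
    "Subject": ["P.G. Diploma in Clinical Research and Clinical Data Management", "LL.B"],
    "Course": ["Pharma Regulatory Affairs", "Law - Jurisprudence"],
    "ExamDate": ["July/August 2012", "Jan./Feb. 2013"],
    "Time": ["3 Hours"],
    "Semester": ["I Semester", "II Semester", "III Semester", "IV Semester"],
    "MaxMarks": ["30", "40", "50", "60", "70", "80", "90", "100"],
}

def categorize(line):
    record, rest = {}, line
    # Try the longest phrases first so "III Semester" wins over "I Semester".
    phrases = sorted(((p, cat) for cat, ps in GRAMMAR.items() for p in ps),
                     key=lambda pc: -len(pc[0]))
    for phrase, cat in phrases:
        if cat not in record and phrase in rest:
            record[cat] = phrase
            rest = rest.replace(phrase, " ", 1)
    # NOTE: purely numeric phrases (MaxMarks) would need word-boundary matching in practice.
    record["Code"] = rest.split()[0] if rest.split() else ""
    return record

line = ("37006 II Semester P.G. Diploma in Clinical Research and Clinical Data "
        "Management Examination, July/August 2012 Pharma Regulatory Affairs "
        "Time : 3 Hours Max. Marks : 100")
rec = categorize(line)
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerow([rec.get(k, "") for k in
                            ("Code", "Semester", "Subject", "Course",
                             "ExamDate", "Time", "MaxMarks")])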


Excel gets incomplete line from CSV

Excel displays the following line incompletely:
Actual line:
"- Guides students through the new course and assessment structure with advice at the start of the book to help them understand what's required- Enables students to put their knowledge in context and helps them start analysing business data with case studies of real businesses included throughout- Prepares students for assessment with the 'Your turn' feature that contains practice questions including multiple choice, case study and data response, and those that test their quantitative skills"
Excel displays the incomplete line:
- Guides students through the new course and assessment structure with advice at the start of the book to help them understand what's required- Enables students to put their knowledge in context and helps them start analysing business data with case studi
Please help.
Regards,
Kris
Okay, this is apparently a bug in Excel.
Because the text has a leading dash "-", Excel first tries to parse it as a number, which is limited to 255 characters, but when that fails and it switches back to General, it never restores the 32,767-character text limit.
See this answer at SuperUser for a more thorough explanation and a workaround.
One workaround is to avoid having a leading dash. The fix is to declare the column as "Text" instead of "General" in the CSV importer.
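If you need to automate the "avoid a leading dash" workaround, one possible (untested) preprocessing step is to rewrite the CSV before importing so that no cell starts with a dash; the file names here are only placeholders.

# Untested sketch: prefix a space to any cell that starts with "-" so Excel
# never takes the number-parsing path. File names are made up for illustration.
import csv

with open("books.csv", newline="", encoding="utf-8") as src, \
     open("books_fixed.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([(" " + cell) if cell.startswith("-") else cell for cell in row])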

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify thousands of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solving our problem.
I'll highlight the current process and our challenges, and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports being generated has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the 1st step of our process, i.e., checking each document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords, i.e., to check for compliance we use the following logic:
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
The current approach of RegEx seems very primitive and limited, so I wanted to ask the community if there is a better way to solve this problem using something like NLTK or CoreNLP.
Thanks in advance.
Interesting problem; I believe it's a thorough research problem! In natural language processing, there are a few techniques that learn and extract a template from text and then use it as a gold annotation to identify whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). But in your case it's more difficult, as you need to learn the structure from a report. From a Natural Language Processing point of view, your problem is harder to address (no simple NLP task matches your problem definition), but you may not need any fancy (complex) model to resolve it.
You can start with simple document matching and computing a similarity score. If you have a large collection of positive examples (well formatted and specified reports), you can construct a dictionary based on tf-idf weights. Then you can check for the presence of the dictionary tokens. You can also think of this problem as a binary classification problem. There are good machine learning classifiers such as SVM and logistic regression which work well for text data. You can use Python and scikit-learn to build programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
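As a minimal sketch of the tf-idf + classifier idea with scikit-learn (the two example reports and labels below are made up; you would train on your curated collection of compliant and non-compliant reports):

# Minimal sketch: binary classification of reports (1 = follows template, 0 = does not).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = ["This document introduces techniques for applying best practices ...",  # compliant
           "We visited the farm last week and the weather was nice ..."]            # non-compliant
labels = [1, 0]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(reports, labels)
print(model.predict(["This study is intended to help low-income farmers ..."]))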
Since the reports will be generated by field workers and there are a few questions that will be answered by the reports (you mentioned 3 specific components), I guess simple keyword matching techniques will be a good start for your research. You can gradually move in different directions based on your observations.
This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.
If you don't give yourself hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to some format for the document, like having 3 parts, each of which has a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum filler text.
3. Take away
What can the reader do after reading it?
You parse this document from .doc format, for instance, and extract the three parts. Then you can run spell checking, grammar checking, and text complexity algorithms. And finally you can extract, for instance, Named Entities (cf. Named Entity Recognition) and low TF-IDF words.
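For instance, a rough sketch of splitting the text on those pre-defined titles once it is out of the .doc file (the headings and the sample report here are just the ones from the outline above):

# Rough sketch: split a report on the agreed numbered headings.
import re

HEADINGS = ["Purpose", "Topic / Problem", "Take away"]

def split_sections(text):
    # One pattern that matches any numbered heading, e.g. "1. Purpose".
    pattern = r"\d+\.\s*(" + "|".join(re.escape(h) for h in HEADINGS) + r")"
    parts = re.split(pattern, text)
    # re.split returns [preamble, heading1, body1, heading2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

report = """1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem.
3. Take away
What can the reader do after reading it?"""
print(split_sections(report))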
I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise: CoreNLP and OpenNLP, the libraries I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether the word is a verb, a noun, an adjective or an adverb. POS taggers and the parsers in these libraries can tell you the part of speech (POS) of the word, and maybe you only care about the verbs that start with "inten", or, more strictly, verbs in the 3rd person singular.
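A small sketch of that idea with NLTK's off-the-shelf tagger; the example sentence is taken from the question, and the tag shown in the comment is only what I would expect, not guaranteed:

# Keep only words starting with "inten" that the tagger marks as verbs (VB* tags).
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run only

sentence = "This study is intended to help low-income farmers."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
hits = [(w, t) for w, t in tagged if w.lower().startswith("inten") and t.startswith("VB")]
print(hits)  # e.g. [('intended', 'VBN')]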
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past-tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

How to extract information from these sentences

I got a list of sentences like below:
These are some sentences I extracted from job descriptions. I want to extract information like: degree type, major, required or preferred.
The result should be like:
{
  degree: Bachelor,
  major: Computer Science,
  required: True
}
There are no obvious rules in these sentences. How can I achieve this goal? Here are the sentences:
Bachelor ’ s degree in Computer Science or equivalent
Pursuing B.S. or advanced degree in computer science or related technical/engineering degree .
Bachelor 's Degree in Computer Science or equivalent experience
Youre educated ( BS/MS in Computer Science or other technical degree ) .
•BS in Computer Science , Digital Media or similar technical degree with 3 + years of experience
· Bachelors degree .
Bachelor 's degree in computer science , design or related field
Ability to absorb , master and leverage emerging technologies
BA/BS degree or equivalent practical experience
Education Required : Bachelors Degree
• Bachelor 's degree in related field , OR four ( 4 ) years of experience in a directly related field .
So you are dealing with unstructured data; I hope that using the following steps you can reach a decent accuracy level.
Create a lookup table listing all keywords that may occur for each of your required variables, like degree, education, etc. You need to mine various online sources to grab these keywords.
Split your data into sentences, or line by line, and iterate over the list.
While iterating, look the keywords up in your lookup tables and find the useful lines.
Create hierarchical rules to accurately extract the variables, and append the results to your output.
Overview of hierarchical rules:
For example, a degree name will be completely alphabetic.
Experience will be alphanumeric.
Terms like "pursuing" will point towards the variable Major.
Try to modify these rules on each iteration of your code, and keep adding new rules.
This is just a basic approach, but I believe that if you do some iterations over your methodology, you will be able to extract the information.
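To make the lookup-table idea concrete, here is a first-pass Python sketch; the keyword lists are tiny examples and would need to be mined from real sources as described above:

# First-pass sketch: keyword lookup tables plus a couple of simple rules.
import re

DEGREES = {"bachelor": "Bachelor", "bachelors": "Bachelor", "bs": "Bachelor",
           "ba": "Bachelor", "ms": "Master", "master": "Master", "masters": "Master"}
MAJORS = ["Computer Science", "Digital Media", "Design"]

def extract(sentence):
    result = {"degree": None, "major": None, "required": True}
    for token in re.findall(r"[A-Za-z.]+", sentence):
        key = token.lower().replace(".", "")      # "B.S." -> "bs"
        if key in DEGREES:
            result["degree"] = DEGREES[key]
            break
    for major in MAJORS:
        if major.lower() in sentence.lower():
            result["major"] = major
            break
    # Heuristic: words like "preferred" or "pursuing" usually signal the degree
    # is not strictly required.
    if re.search(r"preferred|pursuing", sentence, re.IGNORECASE):
        result["required"] = False
    return result

print(extract("Bachelor 's degree in Computer Science or equivalent"))
# -> {'degree': 'Bachelor', 'major': 'Computer Science', 'required': True}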
You probably need to gather a list of majors and degrees (for example: http://en.wikipedia.org/wiki/List_of_tagged_degrees ) to extract the degree and major. Then, based on some general rules (or by designing a classifier), decide on "required" or "not required".
Another suggestion to do this would be:
First: clean up the data - remove all punctuation, stop words, unwanted symbols, etc.
Second: make a list of the keywords you are interested in.
Third: split your data into words (word_tokenize in nltk).
Fourth: make a dictionary of the values you are looking for.
Fifth: look up the words in the dictionary as you read the word list, matching them against your keyword list, and then append the matches to a new output dictionary.
Hope this helps.

What features do NLP practitioners use to pick out English names?

I am trying named entity recognition for the first time. I'm looking for features that will pick out English names. I am using the methods outlined in the coursera nlp course (week three) and the nltk book. In other words: I am defining features, identifying features of words and then running those words/features through a classifier that I train on labeled data.
What features are used to pick out English names?
I can imagine that you'd look for two capitalized words in a row, or a capitalized word, then an initial, then another capitalized word (e.g. John Smith or James P. Smith).
But what other features are used for NER?
Some common features:
Word lists for common names (John, Adam, etc)
casing
contains symbol or numeric characters (names generally don't)
person prefixes (Mr., Mrs., etc...)
person postfixes (Jr., Sr., etc...)
single-letter abbreviations (i.e., (J.) Smith).
analysis of surrounding words (you may find some words have a high probability of appearing near names).
Named Entities previously recognized (often it is easy to identify NE in some parts of the corpus based on context, but very hard in other parts. If previously identified, this is an excellent hint towards NER)
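To make this concrete, here is a sketch of a per-token feature dictionary in the style of the NLTK book's classifier examples; the word lists are placeholders you would swap for real gazetteers:

# Sketch of per-token features for an NER-style classifier.
COMMON_NAMES = {"john", "adam", "james", "mary"}
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii"}

def name_features(tokens, i):
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "is_capitalized": word[:1].isupper(),
        "in_name_list": word.lower() in COMMON_NAMES,
        "has_digit_or_symbol": any(not c.isalpha() and c not in ".-'" for c in word),
        "is_initial": len(word) == 2 and word[0].isupper() and word[1] == ".",
        "prev_is_prefix": prev_word.lower() in PREFIXES,
        "next_is_suffix": next_word.lower() in SUFFIXES,
        "prev_is_capitalized": prev_word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
    }

tokens = "Yesterday Mr. John P. Smith Jr. arrived".split()
features = [name_features(tokens, i) for i in range(len(tokens))]
# Pair each feature dict with a label and feed the pairs to nltk.NaiveBayesClassifier.train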
Depending on what language you are working with, there may be more language-specific features as well. Frankly, you can turn up a wealth of information with a simple Google query; I'm really not sure why you haven't turned there. Some starting points, however:
Google
A survey of named entity recognition and classification
Named entity recognition without gazetteers
I did something similar back in school using machine learning. I suppose that you will use a supervised algorithm and classify every single word independently rather than words in combination. In that case I would choose some features for the word itself, like the ones you mentioned (whether the word begins with a capital letter, whether the word is an abbreviation), but I would add some more features, like whether the previous or the next words also start with a capital letter, or whether they are abbreviations. This way you can add some context and overcome the problems related to your basic independence assumption.
If you want, have a look here. In the machine learning section you can find some more information and examples (the problem is slightly different, but the method should be similar).
Whatever features you choose, it is important that you use some measure to evaluate their relevance and possibly reduce them to the useful ones to avoid over-fitting. One of the measures you can use to evaluate them is the gain ratio, but there are many more. Here you can find some basic information about feature extraction.
Hope it helps!

Algorithms for splitting personal names into parts

I'm looking for references on separating a name such as "John A. Doe" into parts: first=John, middle=A., last=Doe. In Mexico we have paternal, maternal, first and second given names, and they can be written in different permutations, so the problem is quite complex.
As it depends on the data, we are working with matching software that calculates a score for every word so we can make decisions (it is based on a big database). The input data is not clean; it is imported from some government web pages and is human-filtered, so it could have junk that has to be recognized as well. Any suggestions?
[Edit]
Examples:
name:
Javier Abdul Córdoba Gándara
common permutations (or as it may appear in government data referring to the same person):
Córdoba Gándara Javier Abdul
Javier A. Córdoba Gándara
Javier Abdul Córdoba G.
paternal=Córdoba
maternal=Gándara
first given:Javier
second given:Abdul
name: María de la Luz Sánchez Martínez
paternal:Sánchez
maternal: Martínez
first given: María de la Luz
name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin
As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names other than from the score.
We have a very strong database (80 million records or so), so we can get some use out of the scoring system. I am designing an algorithm that uses it, but I am looking for other references.
Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.
Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:
10% for general string manipulation
30% for the specific nature of the data (Mexican name formats, data input quirks)
60% to cater for data quality / lack of quality
And I believe that's quite generous towards the general string manipulation. Of course it depends whether you need quality results for all records, or only the 'clean' records etc, and if you are able to ignore the 'difficult' records it makes it a lot simpler.
Some general tips
If they are not required, remove non-alphanumeric / whitespace characters
Split on spaces
Use hyphens / punctuation to identify surnames or family names
Initials (which are generally single letters) are not surnames; i.e. they must be first / middle names
Determine the level of confidence that you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc.)
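As a very rough illustration of these tips, here is a Python sketch; SURNAME_SCORES is a tiny stand-in for the scoring database mentioned in the question, and the cleaning / initial rules are only heuristics:

# Rough sketch: strip punctuation, split on spaces, exclude initials from the
# surname candidates, then pick the two most surname-like words by score.
import re

SURNAME_SCORES = {"córdoba": 0.95, "gándara": 0.93, "sánchez": 0.97,
                  "martínez": 0.96, "javier": 0.10, "abdul": 0.15}

def split_name(raw):
    cleaned = re.sub(r"[^\w\s.]", " ", raw)   # \w is Unicode-aware, so accents survive
    words = cleaned.split()
    # Initials (single letter plus dot) can only be given names, never surnames.
    candidates = [w for w in words if not re.fullmatch(r"[A-ZÁÉÍÓÚÑ]\.", w)]
    scored = sorted(candidates, key=lambda w: SURNAME_SCORES.get(w.lower(), 0.0), reverse=True)
    paternal, maternal = scored[:2] if len(scored) >= 2 else (None, None)
    given = [w for w in words if w not in (paternal, maternal)]
    return {"paternal": paternal, "maternal": maternal, "given": " ".join(given)}

print(split_name("Javier Abdul Córdoba Gándara"))
print(split_name("Córdoba Gándara Javier Abdul"))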
You may need to add some natural language processing or machine learning to check. The problem of identifying author names (e.g. in scientific papers) is difficult, as they can be reported in differing orders, with varying degrees of abbreviation, elisions, etc. If your database is dirty you will end up with ambiguity whatever you do.
