Recognize words in a sequence of characters [closed]

Closed 11 years ago.
I need an algorithm that can recognize words (dictionary-based)
in a sequence of characters that has no whitespace.
Let's say, for example, the sequence is:
spaceless
It should recognize space and less.
There might also be situations where more than one combination of words can be recognized.
It's hard to give such an example, but I'll give it a try:
example: spaceslight
recognized words: space and slight (1)
recognized words: spaces and light (2)
So the algorithm should be able to find those kinds of variations too.
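For concreteness, here is a minimal backtracking sketch of the desired behavior in Python (a simple recursive approach, not taken from any answer below; the tiny dictionary is illustrative only):

    def segmentations(s, dictionary):
        """Return every way to split s into dictionary words."""
        results = []

        def walk(start, words):
            if start == len(s):
                results.append(list(words))
                return
            for end in range(start + 1, len(s) + 1):
                piece = s[start:end]
                if piece in dictionary:
                    words.append(piece)
                    walk(end, words)
                    words.pop()

        walk(0, [])
        return results

    # Illustrative dictionary only; use a real word list in practice.
    dictionary = {"space", "spaces", "less", "light", "slight"}
    print(segmentations("spaceless", dictionary))    # [['space', 'less']]
    print(segmentations("spaceslight", dictionary))  # [['space', 'slight'], ['spaces', 'light']]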

If you need multiple queries on the same string, a suffix trie is a good solution. It stores the string very efficiently and allows lookups in O(n), where n is the length of the query (note that you cannot do better unless you have more knowledge of the queries).
If a suffix trie still uses too much space, you can use a DAWG, but this is much more complicated to build.
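For illustration, a naive nested-dict sketch of the idea (this simple construction is O(m^2) in the text length; linear-time construction needs something like Ukkonen's algorithm):

    def build_suffix_trie(text):
        """Insert every suffix of text into a dict-of-dicts trie."""
        root = {}
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node.setdefault(ch, {})
        return root

    def contains(trie, query):
        """Is query a substring of the original text? O(len(query))."""
        node = trie
        for ch in query:
            if ch not in node:
                return False
            node = node[ch]
        return True

    trie = build_suffix_trie("spaceslight")
    print(contains(trie, "slight"))  # True
    print(contains(trie, "spacex"))  # False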

You could also try the Knuth-Morris-Pratt algorithm. It searches for a pattern string in a text and, if I remember correctly, has linear complexity. Have a look here:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
PS: You might need to tweak it a little to your needs...
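For reference, a compact sketch of the standard algorithm (textbook version; searching for one dictionary word at a time):

    def kmp_search(text, pattern):
        """Return all start indices where pattern occurs in text."""
        # Failure function: length of the longest proper prefix of
        # pattern[:i+1] that is also a suffix of it.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k

        matches, k = [], 0
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                matches.append(i - k + 1)
                k = fail[k - 1]
        return matches

    print(kmp_search("spaceslight", "light"))  # [6]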

You might want to look at the Rabin-Karp algorithm; it allows a single pass through the text to search for all the n-letter words in the dictionary, for some value of n. Standard Rabin-Karp will find overlaps: spaceslight -> spaces, a, ace, aces, slight, light, i. You would need to modify it if you didn't want overlapping words.
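A sketch of that multi-pattern variant, assuming all dictionary words share one length n (the base and modulus are just common choices):

    def rabin_karp_multi(text, words):
        """One pass over text, finding all occurrences of equal-length words."""
        n = len(next(iter(words)))            # all words must share length n
        base, mod = 256, (1 << 61) - 1
        pow_n = pow(base, n, mod)

        def h(s):
            value = 0
            for ch in s:
                value = (value * base + ord(ch)) % mod
            return value

        targets = {}
        for w in words:
            targets.setdefault(h(w), set()).add(w)

        hits, window_hash = [], 0
        for i, ch in enumerate(text):
            window_hash = (window_hash * base + ord(ch)) % mod
            if i >= n:                        # drop the char leaving the window
                window_hash = (window_hash - ord(text[i - n]) * pow_n) % mod
            if i >= n - 1 and window_hash in targets:
                w = text[i - n + 1:i + 1]     # verify to rule out hash collisions
                if w in targets[window_hash]:
                    hits.append((i - n + 1, w))
        return hits

    print(rabin_karp_multi("spaceslight", {"space", "light"}))
    # [(0, 'space'), (6, 'light')]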

Related

Framework for minimizing time complexity of generalized search

I have training in pure math but not in statistics, computer science, and information theory so I am a bit lost here and would really appreciate any guidance.
I am looking for some helpful ways to frame a general search approach which would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20 questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly, and I want to develop a strategy that minimizes my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have a yes-or-no answer or even a descriptive answer. So, for instance, it is allowed to ask the first name of the person.
Intuitively, it seems to me that immediately (as a first question) asking something like "Is it Barack Obama?" is a terrible question, because it splits the sample space (or search space) into two sets: one with 1 person, namely the former US President, and a second containing the rest of the US population.
Asking their sex (or, old school, their gender) may be a better question, as it splits the sample space into sets of roughly equal size.
Instead of asking a binary question, asking an n-ary question is likely better because it will split the sample space into n sub-spaces of varying sizes and if the sizes are similar then that's fantastic. For instance, the question could be, what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have their last name begin with "J" rather than "X".
Of course, I could conceivably ask a 329.5-million-ary question whereby I'll have the answer in one shot.
My questions for you are as follows:
1) If we fix n, so asking only binary, ternary, or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions that divide the sample space into n roughly equal parts. How can I prove this? What is the right approach or mathematical framework for proving it, assuming I am only minimizing time complexity, i.e. the average number of questions I need to ask to get to the solution?
2) If we don't fix n, then what would be a general way to frame this mathematically? Now I have two variables over which I am operating, n and the relative sizes of the subsets into which an n-ary question splits the sample space, to minimize the time complexity. How can I frame this problem mathematically?
3) Is my intuition even correct? Or are there faster ways to approach this?
4) What I am describing sounds an awful lot like a classification decision tree in machine learning. Is minimizing entropy the right way to frame my question?
5) Who would know or think about this type of stuff? Information theorists? Computer scientists? Statisticians? Probability theorists? Machine learning folks? Someone else?
6) What's the right forum on the internet to get help with this question? Reddit? A specific Stack Exchange? Anything else?
Thanks
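On the entropy framing in question 4: a question whose k answers occur with probabilities p_1, ..., p_k yields an expected information gain of H = -(p_1 log2 p_1 + ... + p_k log2 p_k) bits, which is maximized when the split is even. A small illustrative sketch (numbers taken from the example above):

    import math

    def entropy(probs):
        """Expected information (in bits) gained from one question."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    N = 329.5e6                               # US population from the example

    # A balanced yes/no question gains a full bit...
    print(entropy([0.5, 0.5]))                # 1.0
    # ...while "Is it Barack Obama?" gains almost nothing.
    print(entropy([1 / N, 1 - 1 / N]))        # ~1e-7 bits

    # So even with perfectly balanced binary questions you need at least
    # log2(N) questions to single out one person.
    print(math.log2(N))                       # ~28.3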

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines), and the text is usually badly written and contains many spelling mistakes.
So I must do some text cleaning and vectorizing. To do so, I considered two approaches:
First approach:
cleaning the text by replacing misspelled words using the hunspell package, which is a spell checker and morphological analyzer
+
tokenization
+
converting sentences to vectors using tf-idf
The problem here is that sometimes hunspell fails to provide the correct word and replaces the misspelled word with another word that doesn't have the same meaning. Furthermore, hunspell does not recognize acronyms or abbreviations (which are very important in my case) and tends to replace them.
Second approach:
tokenization
+
using some embedding method (like word2vec) to convert words into vectors without cleaning the text
I need to know if there is some (theoretical or empirical) way to compare these two approaches :)
Please do not hesitate to respond if you have any ideas to share; I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in longer form and give you a bit more commentary. Not sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point out a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work out, for example, is US vs. UK spelling, or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should) is that you are not looking to build a machine learning model here. You could consider the word embeddings you generate a sort of machine learning model, but they're not. They're just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using hunspell introduces new mistakes. It will no doubt also be the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task, as #lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly we use another task which is dependent on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It is also a chance for you to see if they do more harm than good by comparing to the baseline of "dirty" data. Remember that it's about relative differences and the actual performance of the task is of no importance.
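A minimal sketch of such an external evaluation with scikit-learn, assuming you already have docs and labels; clean_hunspell is a hypothetical stand-in for your hunspell pipeline:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def evaluate(preprocess, docs, labels):
        """Score a preprocessing choice through a downstream classifier."""
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        return cross_val_score(model, [preprocess(d) for d in docs], labels, cv=5).mean()

    # Baseline of "dirty" data vs. the cleaned variant (docs, labels and
    # clean_hunspell are assumed to exist):
    # score_dirty = evaluate(lambda d: d, docs, labels)
    # score_clean = evaluate(clean_hunspell, docs, labels)

What matters is the relative difference between the two scores, not their absolute values.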

String meaning comparison [closed]

Closed 5 years ago.
Is there some sort of algorithm out there or concept that can help with the following problem?
Say I have two snippets of text, snippet 1 and snippet 2.
Snippet 1 reads as follows:
"The dog was too scared to go out into the storm"
Snippet 2 reads as follows:
"The canine was intimidated to venture into the rainy weather"
Is there a way to compare those snippets using some sort of algorithm, or maybe some sort of string theory system? I want to know if there are any kinds of systems that have solved this problem before I tackle it.
UPDATE:
Okay, to give a more specific example: say I wanted to reduce the number of bugs in a ticketing system, and I wanted to do some sort of scan to see if there are any related or similar tickets. I want to know the best systematic way of determining the issue based on the body of a ticket. The Levenshtein distance algorithm doesn't work particularly well here, since it wouldn't know the difference between wet and dry.
Well, this is a very famous problem in NLP; to be more precise, you are comparing the semantics of two sentences.
Maybe you can look into libraries like gensim or Wordnet::Similarity, which provide ways to retrieve semantically similar documents.
Here's another semantically similar SO question.
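A minimal sketch of the embedding route with gensim, assuming pretrained GloVe vectors from gensim's downloader (the model name and the simple vector averaging are illustrative choices, not the only ones):

    import numpy as np
    import gensim.downloader as api

    model = api.load("glove-wiki-gigaword-100")  # downloads on first use

    def sentence_vector(sentence):
        """Average the vectors of the in-vocabulary words."""
        words = [w for w in sentence.lower().split() if w in model]
        return np.mean([model[w] for w in words], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    s1 = sentence_vector("The dog was too scared to go out into the storm")
    s2 = sentence_vector("The canine was intimidated to venture into the rainy weather")
    print(cosine(s1, s2))  # relatively high despite almost no shared words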
An option here could be the Levenshtein Distance between two strings.
It is a measure of the number of operations required to get from one string to another. So, the larger the distance, the less similar the two strings.
This kind of algorithm is great for spell checking or voice recognition because the given string and expected string generally only differ by just a couple words/characters.
For your example, the Levenshtein Distance is 32 (you can try this calculator) which indicates that the strings are not very similar (since the strings are not much longer than the distance of 32).
This algorithm is not great for context-sensitive comparisons, but your example is kind of an extreme case. Very likely there would be more words in common, which would result in a smaller Levenshtein distance. You could use this algorithm in conjunction with some other methods (see: What are some algorithms for comparing how similar two strings are?) to try to get a better comparison.
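For reference, a compact sketch of the distance itself (the standard two-row dynamic program):

    def levenshtein(a, b):
        """Minimum number of single-character edits turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("The dog was too scared to go out into the storm",
                      "The canine was intimidated to venture into the rainy weather"))
    # the answer above reports 32 for these snippets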

NLP - subject of sentence [closed]

Closed 10 years ago.
I am trying to get the main subject of a sentence, i.e. what a sentence is talking about (not the grammatical subject, which may be different).
So far I have got
1.) OpenNLP in Java, which gives me sentence detection, POS tagging, parsing, tokenization, and a name finder.
2.) MaltParser, Stanford Parser - which can give the grammatical subject of a simple sentence by dependency parsing.
I think a noun or a noun phrase will always be the subject in this more general sense, but a sentence can have many nouns and noun phrases.
Any help would be much appreciated.
As you correctly pointed out, syntax is not sufficient. One would have to use some form of shallow semantic analysis to identify what you call the "subject". I believe it is more often referred to as the Agent in the context of SRL (Semantic Role Labeling). There are open source tools (e.g. the UIUC SRL parser) to perform semantic role labeling, at least for English, but they usually work on separate predicates, of which there may be several in a sentence, so one has to somehow figure out which "subject" is the "main" one.
In fact, I do not think that the latter notion is well defined, as in a complex sentence it might not be clear which subject is the "main" one. It might make more sense for particular kinds of sentences, but not in general. I think it would help if you described the data you're working with and/or gave some examples.
P.S. you might consider asking this on https://linguistics.stackexchange.com/
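As a concrete starting point on the dependency-parsing side, a sketch with spaCy (one parser choice alongside those already mentioned; en_core_web_sm is the standard small English model). Note this finds the grammatical subject; deciding which of several subjects is the "main" one remains the open part of the question:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")

    # Grammatical subjects attach to their verb with the nsubj/nsubjpass
    # relation; token.subtree expands to the full noun phrase.
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            print(" ".join(t.text for t in token.subtree))  # "The quick brown fox"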

Best Algorithmic Approach to Sentiment Analysis [closed]

Closed 9 years ago.
My requirement is taking in news articles and determining whether they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. All that I have read points to NLP detecting opinion versus fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work, and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again, I don't see that occurring much in the type of news we will be getting.)
2) How would NLP help, and why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words):
1) Count number of positive and negative words in article
2) If a negation word is found within 2 or 3 words of the positive or negative word (e.g.: NOT the best), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
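A minimal sketch of steps 1-4 as described (the tiny word lists are placeholders for your dictionaries; weights default to 1.0 as in step 3):

    # Placeholder dictionaries; substitute your real word lists.
    POSITIVE = {"good", "best", "gain", "growth"}
    NEGATIVE = {"bad", "worst", "loss", "decline"}
    NEGATION = {"not", "never", "no"}
    WEIGHTS = {}  # manually assigned per-word weights; 1.0 by default

    def sentiment_score(text, window=3):
        tokens = text.lower().split()
        score = 0.0
        for i, tok in enumerate(tokens):
            if tok in POSITIVE or tok in NEGATIVE:
                value = WEIGHTS.get(tok, 1.0) * (1 if tok in POSITIVE else -1)
                # Step 2: negate if a negation word appears within `window`
                # tokens before this one (e.g. "NOT the best").
                if any(t in NEGATION for t in tokens[max(0, i - window):i]):
                    value = -value
                score += value  # step 4: running positive/negative total
        return score

    print(sentiment_score("this is not the best product"))    # -1.0
    print(sentiment_score("strong growth and good results"))  #  2.0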
I don't think there's anything particularly wrong with your algorithm; it's a fairly straightforward and practical way to go, but there are a lot of situations where it will make mistakes:
1. Ambiguous sentiment words - "This product works terribly" vs. "This product is terribly good"
2. Missed negations - "I would never in a million years say that this product is worth buying"
3. Quoted/indirect text - "My dad says this product is terrible, but I disagree"
4. Comparisons - "This product is about as useful as a hole in the head"
5. Anything subtle - "This product is ugly, slow and uninspiring, but it's the only thing on the market that does the job"
I'm using product reviews for examples instead of news stories, but you get the idea. In fact, news articles are probably harder because they will often try to show both sides of an argument and tend to use a certain style to convey a point. The final example is quite common in opinion pieces, for example.
As far as NLP helping you with any of this: word sense disambiguation (or even just part-of-speech tagging) may help with (1), syntactic parsing might help with the long-range dependencies in (2), and some kind of chunking might help with (3). It's all research-level work though; there's nothing that I know of that you can directly use. Issues (4) and (5) are a lot harder; I throw up my hands and give up at this point.
I'd stick with the approach you have and look at the output carefully to see if it is doing what you want. Of course, that then raises the issue of what you understand the definition of "sentiment" to be in the first place...
My favorite example is "just read the book". It contains no explicit sentiment word and is highly dependent on context. If it appears in a movie review, it means the-movie-sucks-it's-a-waste-of-your-time-but-the-book-is-good. However, if it is in a book review, it delivers a positive sentiment.
And what about "this is the smallest [mobile] phone in the market"? Back in the '90s, it was great praise. Today it may indicate that it is way too small.
I think this is the place to start in order to grasp the complexity of sentiment analysis: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html (by Lillian Lee of Cornell).
You may find the OpinionFinder system and the papers describing it useful.
It is available at http://www.cs.pitt.edu/mpqa/ with other resources for opinion analysis.
It goes beyond polarity classification at the document level, and tries to find individual opinions at the sentence level.
I believe the best answer to all of the questions that you mentioned is reading the book "Sentiment Analysis and Opinion Mining" by Professor Bing Liu. This book is the best of its kind in the field of sentiment analysis. It is amazing. Just take a look at it and you will find the answer to all your 'why' and 'how' questions!
Machine-learning techniques are probably better.
Whitelaw, Garg, and Argamon have a technique that achieves 92% accuracy, using an approach similar to yours for dealing with negation, plus support vector machines for text classification.
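A minimal sketch of that kind of supervised setup with scikit-learn; the four-line dataset is purely illustrative, and a real system would train on labelled news articles:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["strong growth and record profits",
             "shares plunge amid fraud allegations",
             "the outlook remains positive",
             "the company faces mounting losses"]
    labels = ["pos", "neg", "pos", "neg"]

    # Tf-idf word/bigram features feeding a linear SVM classifier.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["profits beat expectations, outlook positive"]))
    # expected: ['pos'] on this toy data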
Why don't you try something similar to how the SpamAssassin spam filter works? There's really not much difference between intention mining and opinion mining.
