How to automatically detect code snippet from a text sample? - nlp

I'm doing some analysis on GitHub comments. But for that, I need to exclude the code samples and error messages from the comments automatically from a large set.
The other easier way to say this would be, I can keep only the English part of the comments. Although there are few libraries to detect the language of a sentence, there are few challenges in my case too. 1) the comment part does not always follow proper English grammar, 2) the code sample and error message mainly consist of English words too.
So what should be my best approach. The results don't need to be 100% accurate, I just want to know the best approach that can give me a satisfactory result at least. Any idea?

This question is old, but my Google search led me to this question; so offering this answer in case anyone stumbles into this question, too.

Related

Semantics based code search

We have a large number of repositories. We want to implement a semantics(functionality) based code search on those repositories. Right now, we already have implemented keyword based code search in which we crawled through all the repository files and indexed them using elasticsearch. But that doesn't solve our problem as some of the repositories are poorly commented and documented, thus searching for specific codes/libraries become difficult.
So my question is: Is there any opensource libraries or any previous work done in this field which could help us index the semantics of the repository files, so that searching the code becomes easy and this would also help us in re-usability of the codes. I have found some research papers like Semantic code browsing, Semantics-based code search etc. but were of no use as there was no actual implementation given. So can you please suggest some good libraries or projects which could help me in achieving the same.
P.S:-Moreover, companies like Koders, Google, cocycles.com etc. started their code search based on functionality. But most of them have shut down their operations without giving any proper feedback, can anyone please tell me what kind of difficulties they are facing.
not sure if this is what you're looking for, but I wrote https://github.com/google/zoekt , which uses ctags-based understanding of code to improve ranking.
Take a look at insight.io
It provides semantic search and browsing

Simple toolkits for emotion (sentiment) analysis (not using machine learning)

I am looking for a tool that can analyze the emotion of short texts. I searched for a week and I couldn't find a good one that is publicly available. The ideal tool is one that takes a short text as input and guesses the emotion. It is preferably a standalone application or library.
I don't need tools that is trained by texts. And although similar questions are asked before no satisfactory answers are got.
I searched the Internet and read some papers but I can't find a good tool I want. Currently I found SentiStrength, but the accuracy is not good. I am using emotional dictionaries right now. I felt that some syntax parsing may be necessary but it's too complex for me to build one. Furthermore, it's researched by some people and I don't want to reinvent the wheels. Does anyone know such publicly/research available software? I need a tool that doesn't need training before using.
Thanks in advance.
I think that you will not find a more accurate program than SentiStrength (or SoCal) for this task - other than machine learning methods in a specific narrow domain. If you have a lot (>1000) of hand-coded data for a specific domain then you might like to try a generic machine learning approach based on your data. If not, then I would stop looking for anything better ;)
Identifying entities and extracting precise information from short texts, let alone sentiment, is a very challenging problem specially with short text because of lack of context. Hovewer, there are few unsupervised approaches to extracting sentiments from texts mainly proposed by Turney (2000). Look at that and may be you can adopt the method of extracting sentiments based on adjectives in the short text for your use-case. It is hovewer important to note that this might require you to efficiently POSTag your short text accordingly.
Maybe EmoLib could be of help.

Evaluate the content of a paragraph

We are building a database of scientific papers and performing analysis on the abstracts. The goal is to be able to say "Interest in this topic has gone up 20% from last year". I've already tried key word analysis and haven't really liked the results. So now I am trying to move onto phrases and proximity of words to each other and realize I'm am in over my head. Can anyone point me to a better solution to this, or at very least give me a good term to google to learn more?
The language used is python but I don't think that really affects your answer. Thanks in advance for the help.
It is a big subject, but a good introduction to NLP like this can be found with the NLTK toolkit. This is intended for teaching and works with Python - ie. good for dabbling and experimenting. Also there's a very good open source book (also in paper form from O'Reilly) on the NLTK website.
This is just a guess; not sure if this approach will work. If you're looking at phrases and proximity of words, perhaps you can build up a Markov Chain? That way you can get an idea of the frequency of certain phrases/words in relation to others (based on the order of your Markov Chain).
So you build a Markov Chain and frequency distribution for the year 2009. Then you build another one at the end of 2010 and compare the frequencies (of certain phrases and words). You might have to normalize the text though.
Other than that, something that comes to mind is Natural-Language-Processing techniques (there is a lot of literature surrounding the topic!).

How to choose a Feature Selection Algorithm? - advice

Is there a research paper/book that I can read which can tell me for the problem at hand what sort of feature selection algorithm would work best.
I am trying to simply identify twitter messages as pos/neg (to begin with). I started out with Frequency based feature selection (having started with NLTK book) but soon realised that for a similar problem various individuals have choosen different algorithms
Although I can try Frequency based, mutual information, information gain and various other algorithms the list seems endless.. and was wondering if there an efficient way then trial and error.
any advice
Have you tried the book I recommended upon your last question? It's freely available online and entirely about the task you are dealing with: Sentiment Analysis and Opinion Mining by Pang and Lee. Chapter 4 ("Extraction and Classification") is just what you need!
I did an NLP course last term, and it came pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.
There's quite a lot of research going on regarding this, some of it commercial and thus not open to the public. I can't point you to any research papers but the book we used for the course was this (google books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.
The only other thing I can point you towards is to try googling around, maybe in scholar.google.com for "sentiment analysis" or "opinion mining".
Have a look at the NLTK movie_reviews corpus. The reviews are already pos/neg categorized and might help you with training your classifier. Although the language you find in Twitter is probably very different from those.
As a last note, please post any successes (or failures for that matter) here. This issue will come up later for sure at some point.
Unfortunately, there is no silver bullet for anything when dealing with machine learning. It's usually referred to as the "No Free Lunch" theorem. Basically a number of algorithms work for a problem, and some do better on some problems and worse on others. Over all, they all perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse for a given data set. For a different data set, the situation could be completely reversed.
Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.

Effective strategies for studying frameworks/ libraries partially [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I remember the old effective approach of studying a new framework. It was always the best way to read a good book on the subject, say MFC. When I tried to skip a lot of material to speed up coding it turned out later that it would be quicker to read the whole book first. There was no good ways to study a framework in small parts. Or at least I did not see them then.
The last years a lot of new things happened: improved search results from Google, programming blogs, much more people involved in Internet discussions, a lot of open source frameworks.
Right now when we write software we much often depend on third-party (usually open source) frameworks/ libraries. And a lot of times we need to know only a small amount of their functionality to use them. It's just about finding the simplest way of using a small subset of the library without unnecessary pessimizations.
What do you do to study as less as possible of the framework and still use it effectively?
For example, suppose you need to index a set of documents with Lucene. And you need to highlight search snippets. You don't care about stemmers, storing the index in one file vs. multiple files, fuzzy queries and a lot of other stuff that is going to occupy your brain if you study Lucene in depth.
So what are your strategies, approaches, tricks to save your time?
I will enumerate what I would do, though I feel that my process can be improved.
Search "lucene tutorial", "lucene highlight example" and so on. Try to estimate trust score of unofficial articles ( blog posts ) based on publishing date, the number and the tone of the comments. If there is no a definite answer - collect new search keywords and links on the target.
Search for really quick tutorials/ newbie guides on official site
Estimate how valuable are javadocs for a newbie. (Read Lucene highlight package summary)
Search for simple examples that come with a library, related to what you need. ( Study "src/demo/org/apache/lucene/demo")
Ask about "simple Lucene search highlighting example" in Lucene mail list. You can get no answer or even get a bad reputation if you ask a silly question. And often you don't know whether you question is silly because you have not studied the framework in depth.
Ask it on Stackoverflow or other QA service "could you give me a working example of search keywords highlighting in Lucene". However this question is very specific and can gain no answers or a bad score.
Estimate how easy to get the answer from the framework code if it's open sourced.
What are your study/ search routes? Write them in priority order if possible.
I use a three phase technique for evaluating APIs.
1) Discovery - In this phase I search StackOverflow, CodeProject, Google and Newsgroups with as many different combination of search phrases as possible and add everything that might fit my needs into a huge list.
2) Filter/Sort - For each item I found in my gathering phase I try to find out if it suits my needs. To do this I jump right into the API documentation and make sure it has all of the features I need. The results of this go into a weighted list with the best solutions at the top and all of the cruft filtered out.
3) Prototype - I take the top few contenders and try to do a small implementation hitting all of the important features. Whatever fits the project best here wins. If for some reason an issue comes up with the best choice during implementation, it's possible to fall back on other implementations.
Of course, a huge number of factors go into choosing the best API for the project. Some important ones:
How much will this increase the size of my distribution?
How well does the API fit with the style of my existing code?
Does it have high quality/any documentation?
Is it used by a lot of people?
How active is the community?
How active is the development team?
How responsive is the development team to bug patch requests?
Will the development team accept my patches?
Can I extend it to fit my needs?
How expensive will it be to implement overall?
... And of course many more. It's all very project dependent.
As to saving time, I would say trying to save too much here will just come back to bite you later. The time put into selecting a good library is at least as important as the time spent implementing it. Also, think down the road, in six months would you rather be happily coding or would you rather be arguing with a xenophobic dev team :). Spending a couple of extra days now doing a thorough evaluation of your choices can save a lot of pain later.
The answer to your question depends on where you fall on the continuum of generality/specificity. Do you want to solve an immediate problem? Are you looking to develop a deep understanding of the library? Chances are you’re somewhere between those extremes. Jeff Atwood has a post about how programmers move between these levels, based on their need.
When first getting started, read something on the high-level design of the framework or library (or language, or whatever technology it is), preferably by one of the designers. Try to determine what problems they are trying to address, what the organizing principles behind the design are, and what the central features are. This will form the conceptual framework from which future understanding will hang.
Now jump in to it. Create something. Do not copy and paste somebody's code. Instead, when things don’t work, read the error messages in detail, and the help on those error messages, and figure out why that error occurred. It can be frustrating, when things don’t work, but it forces you to think, and that’s when you learn.
1) Search Google for my task
2) look at examples with a few different libraries, no need to tie myself down to Lucene for example, if I don't know what other options I have.
3) Look at the date of last update on the main page, if it hasn't been updated in 6-months leave (with some exceptions)
4) Search for sample task with library (don't read tutorials yet)
5) Can I understand what's going on without a tutorial? If yes continue if no start back at 1
6) Try to implement the task
7) Watch myself fail
8) Read a tutorial
9) Try to implement the task
10) Watch myself fail and ask on StackOverflow, or mail the authors, post on user group (if friendly looking)
11) If I could get the task done, I'll consider the framework worthy of study and read up the main tutorial for 2 hours (if it doesn't fit in 2 hours I just ignore what's left until I need it)
I have no recipe, in the sense of a set of steps I always follow, that's largely because everything I learn is different. Some things are radically new to me (Dojo for example, I have no fluency in Java script so that's a big task), some just enhancements of previous knowledge (Iknow EJB 2 well, so learning EJB 3 while on the surface is new with all its annotations, its building on concepts.)
My general strategy though is I'd describe as "Spiral and Park". I try to circle the landscape first, understand the general shape, I Park concepts that I don't get just yet, don't let it worry me. Then i go a little deeper into some areas, but again try not to get obsessed with one, Spiralling down into the subject. Hopefully I start to unpark and understand, but also need to park more things.
Initially I want answers to questions such as:
What's it for?
Why would I use this rather than that other thing I already know
What's possible? Any interesting sweet spots. "Eg. ooh look at that nice AJAX-driven update"
I do a great deal of skim reading.
Then I want to do more exploring on the hows. I start to look for gotchas and good advice. (Eg. in java: why is "wibble".equals(var) a useful construct?)
Specific techniques and information sources:
Most important: doing! As early as possible I want to work a tutorial or two. I probably have to get the first circuit of the spiral done, but then I want to touch and experiment.
Overview documents
Product documents
Forums and discussion groups, learning by answering questions is my favourite technique.
if at all possible I try to find gurus. I'm fortunate in having in my immediate colleagues a wealth of knowledge and experience.
Quick-start guides.
A quick look at the API documentation if available.
Reading sample codes.
Messing around YOU HAVE TO MESS AROUND (sorry for the caps).
If it's a small library/API with a small or no community you can always contact the developer himself and ask for help 'cause he'll probably be more than happy to help you; he's happy that one more person is using his API.
Mailing lists are a great resource as long as you do your homework first before asking questions.
Mailing list archives are invaluable for most of the questions I've had on CoreAudio related stuff.
I would never read javadoc. As there often is none. And when there is, most likely it isnt up to date. So one gets confused at the best.
Start with the simplest possible tutorial you find within some minutes.
Often the tutorial will lead you to further sources at the end, so then most of the time one is on a path that goes on and on, deeper and deeper.
It really depends on what the topic is and how much info is on it. Learning by example is a good way to start a topic brand new to you, especially if you're knowledgeable in other similar libraries or languages. You can take a topic you're familiar with, and say "I understand how to implement using X, lets see how it's done using Y".
So what are your strategies, approaches, tricks to save your time?
Well, I search. I generally never ask questions, preferring to research myself. If worse comes to worse I'll read the documentation. In some cases (say, when I was doing some work with SharpSVN) I had to look at the source, specifically the test cases, to get some information about how the API worked.
Generally, I have to be honest, most of my 'study' and 'learning' is by accident.
For example, just a few seconds ago, I discovered how to get the "Recent" folder in C#. I had no idea how to do that before seeing the question, considering it interesting, and then searching.
So for me the real 'trick' is that I hang around on forums, answer questions, and accidentally pick up knowledge. Then when it comes time for me to research something; chances are I know a bit about it, and searching is easier and I can focus on the implementation [typically implementing a test program first] and progressing from there.

Resources