I'm inheriting some code from a third-party system. The vendor will supply me with their documentation, but if their language is a variant of a more general language, it would be useful to know. A couple of snippets are below:
for var i = 0 to short_q.getSize()-1
{
sumq+=short_q[i];
sumpq+=short_q[i]*short_p[i];
}
Number trd_qty_short := trd_qty_short1 + trd_qty_short2;
You can try SourceClassifier:
SourceClassifier identifies the programming language of a piece of source code using a Bayesian classifier trained on a corpus generated from the Computer Language Benchmarks Game. It is written in Ruby and available as a gem. To train the classifier to identify new languages, download the sources from GitHub.
Out of the box, SourceClassifier recognises CSS, C, Java, JavaScript, Perl, PHP, Python and Ruby.
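SourceClassifier itself is a Ruby gem, but the underlying technique is easy to sketch. Below is a minimal naive Bayes classifier over source-code tokens in Python; the class name, tokenizer, and tiny training snippets are made up for illustration, and a real classifier would be trained on a sizeable corpus per language:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(src):
    # Crude tokenizer: identifiers, integer literals, single-char punctuation.
    return re.findall(r"[A-Za-z_]\w*|\d+|[\[\]{}()<>+\-*/=:;,.@]", src)

class SourceLanguageClassifier:
    """Tiny naive Bayes classifier over source-code tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.doc_counts = Counter()               # language -> training samples
        self.vocab = set()                        # all tokens seen, for smoothing

    def train(self, language, source):
        tokens = tokenize(source)
        self.doc_counts[language] += 1
        self.token_counts[language].update(tokens)
        self.vocab.update(tokens)

    def classify(self, source):
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, float("-inf")
        for lang, counts in self.token_counts.items():
            # log P(lang) + sum of log P(token | lang), with add-one smoothing
            score = math.log(self.doc_counts[lang] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

clf = SourceLanguageClassifier()
clf.train("ruby", "def size; @items.length end; puts size")
clf.train("c", "int main(void) { int i; for (i = 0; i < 10; i++) sum += q[i]; }")
print(clf.classify("for (j = 0; j < n; j++) { total += a[j]; }"))  # -> c
```

With two training samples this is only a toy, but it shows why the approach works for code: languages differ sharply in their token distributions (keywords, operators, punctuation), so even short snippets carry a strong signal.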
Related
I'm currently working on a project where I will create a program/model to translate my native dialect into English. Are there any books or other resources you can recommend for creating my project?
On the NLP side of things there's this course: Natural Language Processing with spaCy & Python - Course for Beginners, and this older course: Natural Language Processing (NLP) Tutorial with Python & NLTK, both on freeCodeCamp, which is generally a good place to start. Their courses provide in-depth explanations of concepts along with good examples.
On the translation side of things, the DeepL translator is easy to use across multiple languages and offers a free API. It also offers an incredibly easy-to-use Python library, if that's the language you intend to use (which you should, since Python has the strongest ecosystem for NLP).
I hope this helps, but dennlinger is right - you shouldn't typically ask broad recommendation questions on StackOverflow!
I am new to Devanagari NLP. Are there any groups or resources that would help me get started with NLP for Devanagari-script languages (mostly Nepali, or similar languages like Hindi)? I also want to be able to develop fonts for Devanagari and build a font-processing application. If anyone working in this field could give me some advice, it would be highly appreciated.
Thanks in advance
I am new to Devanagari NLP. Are there any groups or resources that would help me get started with NLP for Devanagari-script languages (mostly Nepali, or similar languages like Hindi)?
You can use the embeddings provided by fastText [https://fasttext.cc/docs/en/pretrained-vectors.html#content] together with deep-learning RNN models such as LSTMs for text classification and sentiment analysis.
You can find some datasets for named entity recognition here [http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5]
For processing Indian languages, you can refer to the Indic NLP Library [https://github.com/anoopkunchukuttan/indic_nlp_library]
NLTK supports Indian languages; for POS tagging and related NLP tasks, you can refer here [http://www.nltk.org/_modules/nltk/corpus/reader/indian.html]
Is there any group or resources that would help me get started with NLP in Devnagaric language?
The Bhasa Sanchar project under Madan Puraskar Pustakalaya has developed a Nepali corpus. You may request the corpus for non-commercial purposes from the contact provided in the link above.
Python's NLTK includes a Hindi corpus. You may import it using
from nltk.corpus import indian
To gain insight into Devanagari-based NLP, I suggest you go through research papers. Nepali is an under-resourced language; much work remains to be done, and it may be difficult to find material for it.
To grasp the basics, you should probably look into language detection, text classification and sentiment analysis, among others (preferably based on a POS-tagging library built from the corpus).
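As a very first step toward language detection for Devanagari text, you can check which Unicode block the characters fall in. This is a minimal stdlib sketch (the function name is made up); note it identifies the script only, not the language - telling Nepali apart from Hindi, which share the script, requires a trained model:

```python
def devanagari_ratio(text):
    """Fraction of alphabetic characters that lie in the Devanagari
    Unicode block (U+0900 - U+097F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    deva = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return deva / len(letters)

print(devanagari_ratio("नमस्ते संसार"))  # -> 1.0 (all letters are Devanagari)
print(devanagari_ratio("Hello world"))   # -> 0.0
```

A simple ratio like this is enough to route documents to a Devanagari-specific pipeline before doing heavier per-language processing.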
For the second part of the question
I am pretty sure font development doesn't come under the domain of Natural Language Processing. Did you mean something else?
Do you have any tips for well-documented, developer-friendly NLP libraries for text analysis (morphology, text concepts) of Slavic languages like Czech, Polish, etc.?
The API could be in any language - Java, Python, C, Node, whatever.
A nice example of a stemming library is this one: https://github.com/dundalek/czech-stemmer
I am studying the best options for text analysis. I want to be able to get the most out of a sentence on a specific topic. Say I have a medical sentence: using the dictionary words in my database, I want to analyze it with an NLP algorithm.
Thanks!
Try polyglot; it supports both Polish and Czech.
I'm wondering if it is possible to use Stanford CoreNLP to detect which language a sentence is written in? If so, how precise can those algorithms be?
Almost certainly there is no language identification in Stanford CoreNLP at this moment. 'Almost', because nonexistence is much harder to prove.
EDIT: Nevertheless, here is some circumstantial evidence:
there is no mention of language identification on the main page, on the CoreNLP page, or in the FAQ (although there is a question 'How do I run CoreNLP on other languages?'), nor in the 2014 paper by CoreNLP's authors;
tools that combine several NLP libraries, including Stanford CoreNLP, use another library for language identification - for example, DKPro Core ASL; likewise, other users discussing language identification and CoreNLP don't mention this capability;
the CoreNLP source contains Language classes, but nothing related to language identification - you can check manually all 84 occurrences of the word 'language' there.
Try TIKA, TextCat, or the Language Detection Library for Java (the latter reports over 99% precision for 53 languages).
In general, quality depends on the size of input text: if it is long enough (say, at least several words and not specially chosen), then precision can be pretty good - about 95%.
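Most of these tools work on character n-gram statistics. Here is a toy sketch of the idea in Python (trigram profiles compared by overlap; the profiles below are built from two made-up sentences, whereas real tools like TextCat train on large corpora and use better scoring):

```python
from collections import Counter

def trigrams(text):
    # Pad with spaces so word boundaries produce trigrams too.
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Toy "training" profiles; real identifiers are trained on large corpora.
profiles = {
    "english": trigrams("the quick brown fox jumps over the lazy dog and then the end"),
    "german":  trigrams("der schnelle braune fuchs springt ueber den faulen hund und dann das ende"),
}

def identify(text):
    grams = trigrams(text)

    def overlap(lang):
        # Sum of shared trigram counts between input and the language profile.
        profile = profiles[lang]
        return sum(min(n, profile[g]) for g, n in grams.items())

    return max(profiles, key=overlap)

print(identify("the dog jumps"))      # -> english
print(identify("der hund springt"))   # -> german
```

The toy also illustrates the precision point above: the shorter the input, the fewer trigrams there are to overlap, so very short or atypical strings are easily misclassified.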
Stanford CoreNLP doesn't have language ID (at least not yet); see http://nlp.stanford.edu/software/corenlp.shtml
There are loads more language detection/identification tools. But do take the reported precision with a pinch of salt. It is usually evaluated narrowly, bounded by:
a fixed list of languages,
test sentences of substantial length, all in the same language, and
a skewed proportion of training to testing instances.
Notable language ID tools include:
TextCat (http://cran.r-project.org/web/packages/textcat/index.html)
CLD2 (https://code.google.com/p/cld2/)
LingPipe (http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html)
LangID (https://github.com/saffsd/langid.py)
CLD3 (https://github.com/google/cld3)
For an exhaustive list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/
Noteworthy language-identification shared tasks (with training/testing data) include:
Native Language ID (NLI 2013)
Discriminating Similar Languages (DSL 2014)
TweetID (2015)
Also take a look at:
Language Identification: The Long and the Short of the Matter
The Problems of Language Identification within Hugely Multilingual Data Sets
Selecting and Weighting N-Grams to Identify 1100 Languages
Indigenous Tweets
Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text
Named Entity Extraction (extract ppl, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning the doc....Bayesian)
Text extraction (HTML page cleaning)
Are there libraries that I can use to do any of the above NLP functions?
I don't really feel like forking out cash to AlchemyAPI.
There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:
Python: Natural Language Toolkit NLTK
Java: OpenNLP, Gate, and Stanford's JavaNLP
.NET: Sharp NLP
If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.
You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
What the Alchemy people call structured data extraction looks like it's just HTML scraping that is robust against changes to the underlying HTML, as long as the page still visually renders the same way. So it's not really an NLP task.
For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.
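boilerpipe is a Java library, but the basic task is easy to sketch with Python's stdlib. Below is a minimal extractor that strips tags and skips <script>/<style> contents (boilerpipe goes much further, using text-density heuristics to also drop navigation and other boilerplate):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Some article text.</p></body></html>")
print(extract_text(page))  # -> Title Some article text.
```

This is fine as a preprocessing step before the NLP tasks listed above, but for production-quality boilerplate removal the dedicated libraries are worth the dependency.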
The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.