I'm trying to use NLP within a web application. What I want to do is some light information extraction on Persian sentences, so I need RTL-friendly NLP tools. I've tried Python's NLTK before, but I don't know whether it supports RTL languages as well. It would be great if it does, because I already work with Django. Any information on this topic is appreciated.
I have never tried using it for RTL, but I think it is perfectly capable of serving your needs, as it is a toolkit, not a system per se.
I could not find any restrictions in this regard. In fact, I have found references to other people using it for Arabic:
Tokenization of Arabic words using NLTK
Python Arabic NLP
Now, you do need to find some Persian corpora. I could not find any during my brief research, but you can always hit the NLTK Users Mailing List.
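For what it's worth, NLTK operates on Unicode strings, so text direction is mostly a rendering concern rather than a processing one. A minimal sketch (the Persian sentence is just an illustrative example):
import nltk
# nltk.download('punkt')  # default tokenizer models, needed once

# A Persian sentence; NLTK treats it like any other sequence of Unicode tokens.
sentence = u'من به پردازش زبان فارسی علاقه دارم.'

tokens = nltk.word_tokenize(sentence)
print(tokens)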
Please suggest a downloadable English corpus that contains informal, playful words such as 'gonna', 'LOL', and 'wanna'.
I don't know of such a corpus, but alternatively you could try to build one:
Get the vocabulary V1 of a Twitter or other web/chat corpus.
Get the vocabulary V2 of a literary corpus.
The lexicon you want might be V1 \ V2, i.e. all the words of V1 that are not in V2.
Using Python, NLTK provides suitable corpora (see nltk.corpus.webtext). Moreover, as @mbatchkarov said in the comments, Twitter is full of informal language.
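A rough sketch of that V1 \ V2 idea using corpora that ship with NLTK (the choice of webtext for the informal side and gutenberg for the literary side is just an assumption for illustration):
import nltk
from nltk.corpus import webtext, gutenberg
# nltk.download('webtext'); nltk.download('gutenberg')  # needed once

# V1: vocabulary of web/chat text, V2: vocabulary of literary text
v1 = {w.lower() for w in webtext.words() if w.isalpha()}
v2 = {w.lower() for w in gutenberg.words() if w.isalpha()}

informal = v1 - v2  # V1 \ V2: words seen online but not in the literary corpus
print(sorted(informal)[:50])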
You could also try 'NetLingo'. It has rich content :)
I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreprocessor; it worked quite well for English, but not for Chinese.
Can you please let me know of any good sentence splitters for Chinese, preferably in Java or Python?
You can use some regex tricks in Python (cf. a modified regex from Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):
import re

paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

def zng(paragraph):
    # Yield one sentence at a time: a run of non-terminator characters
    # followed by an optional sentence-final punctuation mark.
    for sent in re.findall(u'[^!?。\.\!\?]+[!?。\.\!\?]?', paragraph, flags=re.U):
        yield sent

list(zng(paragraph))
Regex explanation: https://regex101.com/r/eNFdqM/2
Either of these open-source projects should be useful, AFAIK:
HanLP https://github.com/hankcs/HanLP
FudanNLP https://github.com/FudanNLP/fnlp
For unsegmented text, using the Stanford libraries, you probably want to use their Chinese CoreNLP. This isn't as well documented as the base CoreNLP, but it will work for your task.
http://nlp.stanford.edu/software/corenlp-faq.shtml#languages
http://nlp.stanford.edu/software/corenlp.shtml
You will want the segmenter and the sentence splitter ("segment, ssplit"); the other annotators are not relevant.
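Since Python is also acceptable, here is a rough sketch of driving the CoreNLP command-line pipeline from Python with subprocess; the jar locations, file names and memory setting are assumptions for your setup:
import subprocess

# Assumes the CoreNLP jars plus the Chinese models jar sit in the current
# directory; StanfordCoreNLP-chinese.properties is bundled in the models jar.
subprocess.check_call([
    'java', '-Xmx3g', '-cp', '*',
    'edu.stanford.nlp.pipeline.StanfordCoreNLP',
    '-props', 'StanfordCoreNLP-chinese.properties',
    '-annotators', 'segment,ssplit',
    '-file', 'chinese_input.txt',
    '-outputFormat', 'text',
])
# The split sentences are written to chinese_input.txt.out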
Alternatively, you can use the WordToSentenceSplitter class in edu.stanford.nlp.process.WordToSentenceSplitter directly. If you do that, you can look at how it is used in WordsToSentencesAnnotator.
I am doing a project on news classification. Basically, the system will classify news articles based on pre-defined topics (e.g. sports, politics, international). To build the system, I need free data sets for training it.
So far, after a few hours of googling and following links from here, the only suitable data set I could find is this. While it will hopefully be enough, I will try to find more.
Note that the data sets I want must:
Contain full news articles, not just titles
Be in English
Be in .txt format, not XML or a database
Can anybody help me?
Have you tried Reuters-21578? It is the most common dataset for text classification. It is formatted in SGML, but it is quite simple to parse and transform into a txt format.
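As a rough sketch of that SGML-to-txt step, assuming BeautifulSoup and the standard Reuters-21578 tag names (REUTERS, TOPICS, BODY):
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Reuters-21578 ships as reut2-000.sgm ... reut2-021.sgm
with open('reut2-000.sgm', encoding='latin-1') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for i, article in enumerate(soup.find_all('reuters')):
    # Topic labels live in <TOPICS><D>...</D></TOPICS>; the article text in <BODY>
    topics = [d.get_text() for d in article.topics.find_all('d')] if article.topics else []
    body = article.body.get_text() if article.body else ''
    with open('article_%05d.txt' % i, 'w', encoding='utf-8') as out:
        out.write(' '.join(topics) + '\n' + body)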
You can also build it yourself: write a Python/Perl/PHP script that runs a search and then isolates the attributes you need with regexes... I think that is the best option. It is not easy, but it should be fun, and in the end you can share the dataset with us.
I have a really big (~50 MB) file of Spanish sentences. I want to check which of these don't contain foreign words. To achieve that, I am planning to filter out sentences containing words that are not in a spellchecker dictionary. Does such a tool exist? Is it worth playing around with search trees and hash tables to create an efficient spellchecker myself?
You can try the spell checker in Whoosh, via a short Python script as described here:
http://pythonhosted.org/Whoosh/spelling.html
or use Pyenchant:
http://pythonhosted.org/pyenchant/tutorial.html
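As a rough sketch of the filtering described in the question, using PyEnchant with a Spanish dictionary (the dictionary code, file names and naive tokenization are assumptions):
import enchant  # pip install pyenchant

d = enchant.Dict('es_ES')  # requires a Spanish dictionary to be installed

def all_spanish(sentence):
    # Keep the sentence only if every alphabetic token passes the spell check.
    words = [w.strip('.,;:¡!¿?"()') for w in sentence.split()]
    return all(d.check(w) for w in words if w.isalpha())

with open('sentences.txt', encoding='utf-8') as f_in, \
     open('spanish_only.txt', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        if line.strip() and all_spanish(line):
            f_out.write(line)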
You could use Hunspell, the spell checker of OpenOffice, Mozilla Firefox and Google Chrome. It is an open source C++ library with bindings for Java, Perl, Python, .NET and Ruby.
I've noticed that the wiki transcriptions for some of the recent Stack Overflow Podcasts are kind of weak. Clearly, this task calls for a computer program. Is transcribing audio to text (ideally with speaker labels so we know who said what) something that could feasibly be accomplished in software? Are there any active open-source software projects attempting to implement such functionality?
Believe me, I have searched for this before. There are slim to no speech-to-text engines that are open source or free to use; from my search there weren't any free speech-to-text systems at all. These things are so hard to build and so expensive that they can't really be made with an open-source approach. If you really need this, you would have to purchase it from a company (although I don't know any off the top of my head).
I've looked into this a little. I tried the Microsoft Speech API but got very poor results. I've been wanting to look into the CMU Sphinx project, especially the Transcriber demo.
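If you end up experimenting with CMU Sphinx from Python, here is a minimal sketch using the SpeechRecognition wrapper around PocketSphinx; accuracy on podcast audio will be rough, and it does no speaker labelling:
import speech_recognition as sr  # pip install SpeechRecognition pocketsphinx

r = sr.Recognizer()
with sr.AudioFile('podcast_episode.wav') as source:  # WAV/AIFF/FLAC input
    audio = r.record(source)

# Offline recognition via CMU PocketSphinx (default English acoustic model)
print(r.recognize_sphinx(audio))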