Splitting a Chinese document into sentences [closed] - nlp

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English, but not for Chinese.
Can you please point me to any good sentence splitters for Chinese, preferably in Java or Python?

Using some regex tricks in Python (cf. a modified version of the regex from Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):
import re

# Two sentences about Tropical Storm Chantal (2001), each ending in
# the fullwidth stop 。
paragraph = u'热带风暴尚塔尔是2001年大西洋飓风季的一场在8月穿越了加勒比海的北大西洋热带气旋。尚塔尔于8月14日由热带大西洋的一股东风波发展而成，其存在的大部分时间里都在快速向西移动，退化成东风波后穿越了向风群岛。'

def zng(paragraph):
    # A sentence is a run of non-terminator characters followed by an
    # optional terminator (ASCII .!? or the fullwidth stop 。).
    for sent in re.findall(u'[^!?。\.\!\?]+[!?。\.\!\?]?', paragraph, flags=re.U):
        yield sent

list(zng(paragraph))
Regex explanation: https://regex101.com/r/eNFdqM/2

Either of these open-source projects should be useful, AFAIK:
HanLP: https://github.com/hankcs/HanLP
FudanNLP: https://github.com/FudanNLP/fnlp
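As a quick illustration, HanLP's Python distribution includes a rule-based sentence splitter. A minimal sketch, assuming HanLP 2.x and its split_sentence utility (the API has moved around between versions, so check the current docs):

# pip install hanlp  (HanLP 2.x assumed)
from hanlp.utils.rules import split_sentence

text = '我喜欢自然语言处理。你呢？一起学习吧！'
# split_sentence is a rule-based splitter keyed on sentence-final punctuation
for sent in split_sentence(text):
    print(sent)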

For unsegmented text, using the Stanford libraries, you probably want their Chinese CoreNLP. It isn't as well documented as the base CoreNLP, but it will work for your task.
http://nlp.stanford.edu/software/corenlp-faq.shtml#languages
http://nlp.stanford.edu/software/corenlp.shtml
You will want the segmenter and the sentence splitter ("segment, ssplit"); the other annotators are not relevant here.
Alternatively, you can use the WordToSentenceSplitter class in edu.stanford.nlp.process.WordToSentenceSplitter directly. If you do that, you can look at how it is used in WordsToSentencesAnnotator.
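If you would rather call the Stanford stack from Python, the stanza library wraps the same segment-and-split functionality for Chinese. A minimal sketch (note this uses stanza rather than the Java classes above; the models download on first run):

# pip install stanza
import stanza

stanza.download('zh')                               # fetch Chinese models once
nlp = stanza.Pipeline('zh', processors='tokenize')  # tokenize includes sentence splitting

doc = nlp('热带风暴尚塔尔是2001年大西洋飓风季的一场热带气旋。它于8月14日由一股东风波发展而成。')
for sentence in doc.sentences:
    print(sentence.text)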

Related

Is there a downloadable corpus (dictionary/lexicon) for informal, playful words such as 'gonna', 'LOL', 'wanna' in English? [closed]

Please suggest a downloadable English corpus that contains informal, playful words such as 'gonna', 'LOL', and 'wanna'.
I don't know of such a lexicon, but you can try the following instead:
Get the vocabulary V1 of a Twitter or other web-and-chat corpus.
Get the vocabulary V2 of a literary corpus.
The lexicon you want might be V1 \ V2, i.e. all the words of V1 that are not in V2.
In Python, NLTK provides suitable corpora (see nltk.corpus.webtext). Moreover, as @mbatchkarov said in the comments, Twitter is full of informal language. A sketch of the set-difference idea follows below.
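For instance, a rough sketch of that set difference using two corpora that ship with NLTK (webtext as the informal side, gutenberg as the literary side; any comparable corpora would do):

import nltk
from nltk.corpus import webtext, gutenberg

nltk.download('webtext')
nltk.download('gutenberg')

# V1: vocabulary of informal web/chat text
v1 = {w.lower() for w in webtext.words() if w.isalpha()}
# V2: vocabulary of literary text
v2 = {w.lower() for w in gutenberg.words() if w.isalpha()}

# Candidate informal lexicon: words in V1 that never occur in V2
informal = v1 - v2
print(sorted(informal)[:50])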
Try 'NetLingo'. They have rich content :)

NLP tools for right-to-left languages? [closed]

I'm trying to use NLP within a web application. What I want to do is a little information extraction on Persian sentences, so I need some RTL-friendly NLP tools. I've tried Python's NLTK before, but I don't know whether it supports RTL languages as well. It would be great if it does, because I have a good relationship with Django too. Any information on this topic is appreciated.
I have never tried using it for RTL, but I think it is perfectly capable of serving your needs, as it is a toolkit, not a system per se.
I could not find any restrictions in this regard. In fact, I have found some references to people using it for Arabic:
Tokenization of Arabic words using NLTK
Python Arabic NLP
Now, you do need to find some Persian corpora. I could not find any during my brief research, but you can always ask on the NLTK Users mailing list.
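As a small sanity check that NLTK is script-agnostic at the tokenization level, here is a sketch with an example Persian sentence of my choosing (higher-level components such as taggers would still need Persian models):

from nltk.tokenize import wordpunct_tokenize

# 'I am interested in natural language processing.' (Persian)
sentence = 'من به پردازش زبان طبیعی علاقه دارم.'
print(wordpunct_tokenize(sentence))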

Speech/Music classification [closed]

I want to determine which parts of an audio file contain speech and which contain music. I hope someone has made something like this or can tell me where to start. Can you please suggest a method or tutorial for doing this?
Thank you.
Check out the pyAudioAnalysis Python library. Among other things, it has a pre-trained speech/music classifier and two segmentation-classification methods (one based on fixed-size windows and another based on HMMs).
You can extract speech and music parts of an audio recording quite easily, e.g.:
from pyAudioAnalysis import audioSegmentation as aS

# Classify fixed-size mid-term windows of the file with the bundled
# SVM speech/music model; flagsInd holds the per-window class indices.
[flagsInd, classesAll, acc] = aS.mtFileClassification("data/scottish.wav", "data/svmSM", "svm", True, 'data/scottish.segments')
(The original answer illustrated the result with a plot of the predicted class labels over time.)
There's lots of prior art in this area, but I'd suggest browsing through some of Dan Ellis's papers. The slides for this talk have some good background. In short, it all comes down to picking the right feature vectors.
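To make "feature vectors" concrete, here is a sketch extracting MFCCs with librosa (librosa and the file name are my additions, not from the answer); speech typically shows more frame-to-frame cepstral variation than music:

# pip install librosa
import librosa
import numpy as np

# Load a mono recording and compute MFCCs, a standard feature for
# speech/music discrimination; mfcc has shape (13, n_frames).
y, sr = librosa.load('audio.wav', sr=22050)  # hypothetical file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# A crude discriminative statistic: frame-to-frame variability of the
# cepstral coefficients, usually higher for speech than for music.
variability = np.var(np.diff(mfcc, axis=1), axis=1)
print(variability)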

News Article Data Sets [closed]

I am doing a project on news classification. Basically, the system will classify news articles into pre-defined topics (e.g. sports, politics, international). To build the system, I need free data sets for training it.
So far, after a few hours of googling and following links from here, the only suitable data set I could find is this one. While it will hopefully be enough, I think I will try to find more.
Note that the data sets I want:
Contain full news articles, not just titles
Are in English
Are in .txt format, not XML or a database
Can anybody help me?
Have you tried Reuters21578? It is the most common dataset for text classification. It is formatted in SGML, but it is quite simple to parse and transform into a txt format; a sketch using NLTK's copy of the corpus is below.
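If you'd rather skip the SGML parsing entirely, NLTK ships the ApteMod version of Reuters-21578. A sketch dumping it to plain .txt files, one per article (the output layout here is my own choice):

import os
import nltk
from nltk.corpus import reuters

nltk.download('reuters')
os.makedirs('articles', exist_ok=True)

for fileid in reuters.fileids():            # e.g. 'training/9865'
    topics = reuters.categories(fileid)     # pre-assigned topic labels
    name = fileid.replace('/', '_') + '.txt'
    with open(os.path.join('articles', name), 'w', encoding='utf-8') as f:
        f.write(' '.join(topics) + '\n')    # first line: topic labels
        f.write(reuters.raw(fileid))        # full article text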
Alternatively, you can build it yourself: write a Python/Perl/PHP script that runs a search, and when you find the articles, isolate the relevant attributes with regexes. I think that is the best option. It is not easy, but it should be fun, and you could share the resulting dataset with us afterwards.

Fast automated spellchecking [closed]

I have a really big (~50MB) file of Spanish sentences. I want to check which of these don't contain foreign words. To achieve that, I am planning to filter out sentences containing words that don't exist in a spellchecker dictionary. Does such a tool exist? Is it worth playing around with search trees and hash tables to create an efficient spellchecker myself?
You can try the spell checker in Whoosh, via a short Python script as described here:
http://pythonhosted.org/Whoosh/spelling.html
or use PyEnchant:
http://pythonhosted.org/pyenchant/tutorial.html
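For the filtering step itself, a minimal sketch with PyEnchant, assuming the es_ES dictionary is installed and that sentences.txt holds one sentence per line (both are assumptions, not givens from the question):

import enchant

d = enchant.Dict('es_ES')  # requires an installed Spanish dictionary

def looks_spanish(sentence):
    # Keep the sentence only if every alphabetic token is in the dictionary.
    words = (w.strip('.,;:!?¡¿"()') for w in sentence.split())
    return all(d.check(w) for w in words if w.isalpha())

with open('sentences.txt', encoding='utf-8') as f:
    spanish_only = [line.strip() for line in f if looks_spanish(line)]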
You could use Hunspell, the spell checker of OpenOffice, Mozilla Firefox and Google Chrome. It is an open source C++ library with bindings for Java, Perl, Python, .NET and Ruby.
