List of Acronyms for Text Mining - text

I'm trying to do some text mining, and my sample text has a lot of acronyms in it. I would like to expand the acronyms into the phrases they stand for.
Does anyone know where I can get a list? http://www.acronymslist.com is a pretty awesome site, but I am looking for an open list that I could download.
Thanks,
mj

http://en.wikipedia.org/wiki/List_of_acronyms_and_initialisms
Wikipedia content is freely reusable, subject to the terms of the Creative Commons Attribution-ShareAlike License.
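Once a list is available, the expansion step itself is small. Here is a minimal sketch in Python, assuming the list has already been loaded into a dictionary; the ACRONYMS table and the expand_acronyms name are made up for illustration and are not taken from any of the sites above.

```python
import re

# Hypothetical sample mapping; a real list would be loaded from a downloaded source.
ACRONYMS = {
    "NLP": "natural language processing",
    "API": "application programming interface",
}

def expand_acronyms(text, mapping):
    """Replace whole-word acronyms in text with their long forms."""
    if not mapping:
        return text
    # Try longer acronyms first so a short one never shadows a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

print(expand_acronyms("The NLP API is stable.", ACRONYMS))
```

The match is case-sensitive and limited to whole words, so "APIs" or a lowercase "api" are left alone. Acronyms with several possible expansions would need context to resolve, which this sketch does not attempt.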

Related

Do I need to add updated phoneme sequence of words to .dict file while adapting AM using cmusphinx?

I am trying to adapt the en-us acoustic model with Indian English accent recordings. Since many words are pronounced differently in this accent, do I need to add updated phoneme representations of those words? Currently I am following this link: https://cmusphinx.github.io/wiki/tutorialadapt/#accumulating-observation-counts and nothing there mentions updating the .dict file.
PS: Should I add new words directly to the dictionary?
There is an Indian English model in the downloads; you should use it instead. It comes with an Indian English dictionary.

Search the sentence in large text sentence corpus

I am a beginner, and I want to know if there is a way to search for a sentence in a large corpus of text sentences (say 1 million) so that equivalent wordings also match. For example, when a user types:
I shouldn't be there
then it should also search for a sequence like this:
I should not be there
and similarly, from this:
I gonna go there.
to
I going to go there.
I have been thinking for a couple of days, trying to figure out a solution to this problem.
If you know anything about how to deal with this problem, please provide a solution; even just a hint would be more than enough. Thank you.
I would first go through both the sentence and the text and replace all contractions with their long forms. After that, use the Knuth-Morris-Pratt algorithm to search for the sentence in the text.
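A minimal sketch of that idea in Python follows. The CONTRACTIONS table is a tiny hypothetical sample (a real one would be much larger), and the normalize and kmp_find names are made up for illustration. Both the query and the corpus go through the same normalization before the Knuth-Morris-Pratt search.

```python
import re

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {
    "shouldn't": "should not",
    "gonna": "going to",
    "can't": "cannot",
}

def normalize(text):
    """Lowercase the text and expand the contractions listed in CONTRACTIONS."""
    text = text.lower()
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")(?!\w)"
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

def kmp_find(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once, falling back through the failure table on mismatch.
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

corpus = normalize("Honestly, I should not be there today.")
query = normalize("I shouldn't be there")
print(kmp_find(query, corpus))
```

For a corpus of a million sentences that is queried repeatedly, it may be worth normalizing the corpus once and storing the result, so that only the query has to be normalized at search time.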

text to phonemes converter

I'm searching for a tool that converts text to phonemes (like text-to-speech software does).
I could program one myself, but it would not be free of errors and would take a lot of time!
So my question is:
Is there a simple tool for converting e.g.
"hello" to "HH AH0 L OW1"
Maybe some command-line tool, so I can capture the stdout?
I'm looking for phonemes in 'Arpabet' style (see the 'hello' example).
espeak does something like that, but the output is not in Arpabet style and the phonemes are not separated by a delimiter.
If you had searched for Arpabet on Wikipedia, you would have found your answer. The CMU guys have prepared scripts which convert most English words to their respective Arpabet phonetic breakdown.
If you want the phone sequence of a couple of words, you can use their interface here. But if you want it for a big file, then you might have to run their scripts on your own. They used to have a working page here, but it does not seem to be working now.
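For a big file, one option is a plain dictionary lookup. The sketch below assumes pronunciation entries in the layout used by CMU Pronouncing Dictionary files (a word, whitespace, then space-separated phonemes, with ";;;" comment lines and "(1)"-style markers for alternate pronunciations). The SAMPLE excerpt and the function names are illustrative assumptions, not the CMU scripts themselves, and words missing from the dictionary are simply reported as unknown.

```python
import re

# Hypothetical excerpt in the assumed dictionary layout; a real run would
# read the downloaded dictionary file instead.
SAMPLE = """\
;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""

def load_pronunciations(lines):
    """Map each lowercase word to a list of phoneme lists."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        word, phones = line.split(None, 1)
        # Drop alternate-pronunciation markers such as "(1)".
        word = re.sub(r"\(\d+\)$", "", word).lower()
        entries.setdefault(word, []).append(phones.split())
    return entries

def to_arpabet(text, entries):
    """Return the first listed pronunciation per word, or None if unknown."""
    return [entries[w][0] if w in entries else None for w in text.lower().split()]

entries = load_pronunciations(SAMPLE.splitlines())
print(" ".join(to_arpabet("hello", entries)[0]))
```

Printing to stdout keeps the script usable from the command line, which is what the question asks for.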

Text mining MS Word documents?

I have about 30 .docx documents (résumés) with data about people's names, skills and so forth. I need to populate a spreadsheet with some of this information, and to reduce manual work I thought I could use a text mining approach.
Are there any tools or approaches that would be useful in mining (sort of semi-structured) information from these documents?
The best I can come up with is using Perl, as I know you can pull text from Word documents (though that in itself can be tricky) and populate XML spreadsheets using Perl modules.
I haven't written Perl in anger in a long time, so I can't offer examples of how to do this, but if I were to put something together for this, I would recommend Perl. I am sure someone will say there are equivalent functions in Python, and maybe even in Ruby, but Perl is what I've used, and I've found it very effective for manipulating, matching, parsing and processing text.
You can try the catdoc tool (http://www.wagner.pp.ru/~vitus/software/catdoc/), which extracts the text contents from an MS Word file; after that you can do whatever text processing you want. I'd probably just grep the output of catdoc for the existence of certain words in the résumé. No point in over-engineering a solution.
There are multiple ways to read a Word file, whether .docx or .doc.
A .docx file is nothing but a fancy container (a ZIP archive of XML files), but a .doc file is a little trickier to extract from.
Here are some ways to extract text from Word documents:
.doc/.docx >> open with the OpenOffice suite >> use PyUNO with Python and get your data.
.doc/.docx >> use the Python docx module or Textract to extract the data.
.doc/.docx >> use R, which has many packages such as officer and ReporteRs >> extract the data.
Then use text mining to convert the text from one form to another.
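Because a .docx file is a ZIP archive whose main text lives in word/document.xml, the text can also be pulled out with the Python standard library alone. The following is a minimal sketch under that assumption; docx_paragraphs and find_keywords are hypothetical helper names, and the in-memory document at the bottom is a stripped-down stand-in for a real résumé file. It only reads the main document body and ignores headers, footers and formatting.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs

def find_keywords(paragraphs, keywords):
    """Return the keywords that occur (case-insensitively) in the paragraphs."""
    blob = " ".join(paragraphs).lower()
    return [k for k in keywords if k.lower() in blob]

# Build a tiny in-memory stand-in for a .docx file, for demonstration only.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Jane Doe</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Skills: </w:t></w:r><w:r><w:t>Python, SQL</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DEMO_XML)
buffer.seek(0)

paras = docx_paragraphs(buffer)
print(paras)
print(find_keywords(paras, ["python", "java"]))
```

For about 30 résumés, looping over the files and writing one row per file with the csv module would be enough to fill a spreadsheet. The binary .doc format is not a ZIP archive, so this approach does not apply to it.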

Text indexer search tool which can filter by punctuation?

This is not a programming question per se, but a question about searching source code files, which helps me in programming.
I use a search tool, X1, which quickly tells me which source code files contain the keywords I am looking for. However, it doesn't work well for keywords which have punctuation attached to them. For example, if I search for "show()", X1 shows everything that has "show" in it, including far too many results from "MessageBox.Show(.....)", which I don't want to see.
Another example: I need to filter for ".parent" (notice the dot) and not show everything that has "parent" (no dot) in it.
Does anyone know a text search tool which can filter by keywords that have punctuation? I would really prefer a desktop app over a web-based tool like Google (I find it clunky).
I am looking for a tool which indexes words, not a general file searcher like Windows File Explorer.
If you want to search code files efficiently for keywords and punctuation,
consider the SD Source Code Search Engine. It indexes each source language according
to language-specific rules, so it knows exactly which identifiers, keywords,
strings, comments and operators occur in that language, and it indexes the code according to
those elements. It will handle a wide variety of languages (C, C++, Java, VB6, C#, COBOL),
all at once.
Your first query would be posed as:
I=show - I=MessageBox ... '('
(locate identifiers named "show" but eliminate those that are overlapped by
MessageBox leftparen).
Your second query would be posed simply as
'.' I=parent
See http://www.semanticdesigns.com/Products/SearchEngine/index.html
This seems to be a job for tools like ctags and cscope.
Ctags is used to index declarations in source files (many languages are supported), and Cscope is for in-depth analysis of C files.
In my opinion, these tools are better suited to per-project use. Moreover, you may need another tool to make use of these indexes; I use vim myself for this purpose, but many text editors support ctags.
The tool from DTSearch.com.
