Text mining MS Word documents? - text

I have about 30 .docx documents (Résumés) with data about people's names, skills and so forth. I need to populate a spreadsheet with some of this information, and to reduce manual work I thought I could use a text mining approach.
Are there any tools or approaches that would be useful in mining (sort of semi-structured) information from these documents?

The best I can come up with is Perl, as I know you can pull text from Word documents (though that in itself can be tricky) and populate XML spreadsheets using Perl modules.
I haven't written Perl in anger in a long time, so I can't offer examples, but if I were to put something together for this, I would use Perl. I am sure someone will say there are equivalent functions in Python, and maybe even in Ruby, but Perl is what I've used, and I've found it very effective for manipulating, matching, parsing and processing text.

You can try using the catdoc http://www.wagner.pp.ru/~vitus/software/catdoc/ tool which will extract the text contents from an MS Word file, and then after that do whatever text processing you want. I'd probably just grep for the existence of certain words in the resume against the output of catdoc. No point in over-engineering a solution.
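As a rough sketch of the "extract, then grep" idea in Python: the helper names below (catdoc_text, find_keywords) are made up for illustration, and the code assumes catdoc is installed, on your PATH, and able to read your files.

```python
import re
import subprocess


def catdoc_text(path):
    """Return the plain text that catdoc prints for one Word file."""
    # Assumption: the catdoc binary is installed and on the PATH.
    result = subprocess.run(
        ["catdoc", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def find_keywords(text, keywords):
    """Map each keyword to True/False depending on whether it occurs in text."""
    found = {}
    for keyword in keywords:
        # Case-insensitive match that must not sit inside a longer word.
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        found[keyword] = re.search(pattern, text, re.IGNORECASE) is not None
    return found
```

Calling find_keywords(catdoc_text("resume.doc"), ["Python", "SQL"]) for each file gives you one row of yes/no values per résumé, which is easy to write out with the csv module.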

There are several ways to read a Word file, whether .docx or .doc.
A .docx file is nothing but a fancy container, but a .doc file is a little trickier to extract from.
Here are some ways to extract text from Word:
.doc/docx >> open with the OpenOffice suite >> use PyUNO with Python and get your data.
.doc/docx >> use the python-docx module and Textract to extract the data.
.doc/docx >> use R, which has packages such as officer and ReporteRs >> extract the data.
Then use text mining to convert the text from one form to another.
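To illustrate the "fancy container" point: a .docx file is a zip archive, and the main body text lives in word/document.xml. Here is a minimal sketch using only the Python standard library; the function name docx_paragraphs is just for illustration, and it ignores headers, footers and footnotes (text boxes may also show up more than once).

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a .docx file as a list of strings.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

For 30 résumés you could loop over the files, call docx_paragraphs on each one, and then pick out the lines you care about.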

Related

CSV to Excel Doc?

I was curious what the community thinks is the easiest way to take a CSV file and 'save as' an Excel document with only a couple of formulas pasted in?
I am trying to do this behind the scenes, not by physically navigating (opening, selecting Save As, etc.). Even though that is already VERY simple, I need to do this in code (think automation).
Background: I have a C++ command line program generating the .csv, and a C# GUI starting this process. Either program could hold the code, but I figure this is easiest in C# (InterOp?). The reason I don't write the formulas directly into the CSV is the number of comma characters, which would mess up the CSV, and because other Excel documents need to reference the sheets, so they need to be in .xls format.
=AVERAGE(C2:C999)
=COUNTIFS(C:C,">0",C:C,"<31")
=COUNTIFS(C:C,">31",C:C,"<55")
=COUNTIF(C:C,">55")
Have a look and see whether command-line scripting of OpenOffice will do the job. It can do quite a lot of conversions very easily. Otherwise there are a lot of Excel-producing libraries, for example PHPExcel, but you'd need to wrap some programming around them.
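For the command-line route, here is a small sketch in Python that builds and runs a headless conversion. It assumes LibreOffice (or a compatible OpenOffice build) is installed and that its launcher is called soffice on your system; the function names are made up for illustration. It only covers the CSV to .xls step, so the formulas would still have to be added separately.

```python
import subprocess
from pathlib import Path


def build_convert_command(csv_path, out_dir, target="xls", binary="soffice"):
    """Return the argument list for a headless conversion of one file."""
    # Assumption: `binary` is the name of the office launcher on this machine.
    return [
        binary,
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(csv_path),
    ]


def convert(csv_path, out_dir, target="xls"):
    """Run the conversion and return the path the output is expected at."""
    subprocess.run(build_convert_command(csv_path, out_dir, target), check=True)
    return Path(out_dir) / (Path(csv_path).stem + "." + target)
```

From C# you could start the same command line with Process.Start instead of going through Python.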

Serialized Printing Method

I am looking for a method by which I can print one document, and have a field that is incremented on each copy printed. I currently run Linux, so bash in concert with several programs might be the way to go, but I'm just not sure where to start.
I have a document that is used for our business that is currently hand stamped for serialization. We would like to simply print the copies, but can't find a way to increment a specific field. I would like to use either a PDF or an ODF/ODT for the document.
Thanks for any help you can give!
How is the document produced in the first place?
If you control that process, you can certainly add serialization at that level. For instance, if you are using LibreOffice, you could do it in LibreOffice. If you are using a text formatter (like LaTeX, Lout, ...), just emit the formatting instructions (e.g. the .tex or .lout source file) with a unique counter (perhaps simpler to do in a scripting language like Python or OCaml).
Then run the relevant tool to get a .pdf file.
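A minimal sketch of the "emit the source with a counter" idea in Python, assuming a LaTeX document. The template text, the serial placeholder and the function name are all invented for illustration; a real document would carry its own layout.

```python
from string import Template

# Hypothetical template: $serial is replaced per copy. A literal dollar sign
# in the LaTeX source would have to be written as $$ for string.Template.
TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
Serial number: $serial
\end{document}
""")


def render_copies(template, start, count, width=5):
    """Return (serial, source) pairs for `count` consecutive serial numbers."""
    copies = []
    for number in range(start, start + count):
        serial = f"{number:0{width}d}"  # zero-padded, e.g. 00041
        copies.append((serial, template.substitute(serial=serial)))
    return copies
```

Each source string can then be written to its own .tex file and run through the formatter (pdflatex, say) before printing.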

Using different program office extension

I have a program that can access a database with a whole bunch of articles.
Due to copyright, I can't access the database straight from my program, but I have a different program that can access it, and it's legitimate to copy small bits from the articles.
Because my friends and I quote a lot from these articles, I thought it would be useful if we could find an add-in for Word that will copy the requested part from an article.
Is there any add-in for Word that would let me use the program that I mentioned above so that I can access the database from within Word?
I would like to program this add-in myself, if possible.
Without further information about which operating system, and version of Word you are using, I can offer only a general outline.
1) It seems to me that you want to make a Word macro using Word Basic, or Visual Basic.
2) When you want to call your program which is external to Word, you need to use the shell command as outlined here from Microsoft's webpage.
I hope that helps you get started writing your macro!
CHEERS
Well, it's a workaround, but you can use an automation tool that runs a sequence of actions on a given GUI, like WinRunner or TestQuest, to simulate the usage of the program. I assume these tools can take input from a given XML or text file and log the output to a text file.
If you have the output in a text file, you will be able to parse the file using any programming language, get the information you need, and write it to Word or whatever format using OLE objects.

Efficient Way of Recording Page Numbers from a Search of a PDF

I have a list of ~1200 queries (part numbers) that are specified somewhere inside of a 100 page PDF. Pretty much what I need to do is take record of what pages each of the queries appear on, in the PDF. I can't think of a clever way of doing this. It should take me 5-20 hours to do this search by search, so if someone can give me a good idea before the 5 hour mark that would be great!
Assuming you can determine programmatically what a "query" is in your context from the plain text (for example, by using regular expressions):
You could split your PDF into different files (1 file per page) using pdftk
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Then convert those files to text with a pdf-to-text utility like this one:
http://www.fileguru.com/PDF-To-TXT-Converter/download
or this one
http://www.pdf2text.com/
And finally write yourself a simple script using your favorite programming language to determine which of those files contains a "query" (whatever that looks like).
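The last step could look something like this in Python. It assumes the per-page text files are named so that the page number is the first number in the file name (page_1.txt, page_2.txt, ...), which is my naming assumption and not something the tools above guarantee, and it uses plain substring matching, so a part number that is a prefix of another will match both.

```python
import re
from pathlib import Path


def load_pages(directory, pattern="page_*.txt"):
    """Read the per-page text files into a {page_number: text} dict."""
    pages = {}
    for path in sorted(Path(directory).glob(pattern)):
        match = re.search(r"(\d+)", path.stem)  # first number in the file name
        if match:
            pages[int(match.group(1))] = path.read_text(
                encoding="utf-8", errors="replace"
            )
    return pages


def pages_for_queries(pages, queries):
    """Map each query to the sorted list of page numbers it appears on."""
    return {
        query: sorted(number for number, text in pages.items() if query in text)
        for query in queries
    }
```

With ~1200 part numbers and 100 pages this is only about 120,000 substring checks, so it should finish almost instantly.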

Text indexer search tool which can filter by punctuation?

This is not a programming question per se, but a question about searching source code files, which helps me in programming.
I use a search tool, X1, which quickly tells me which source code files contain the keywords I am looking for. However, it doesn't work well for keywords that have punctuation attached to them. For example, if I search for "show()", X1 shows everything that has "show" in it, including far too many results from "MessageBox.Show(.....)", which I don't want to see.
Another example: I need to filter to show ".parent" (notice the dot) and not show everything that has "parent" (no dot) in it.
Does anyone know a text search tool that can filter by keywords that have punctuation? I really prefer a desktop app to a web-based tool like Google (I find it clunky).
I am looking for a tool which indexes words and not a general file searcher like Windows File Explorer.
If you want to search code files efficiently for keywords and punctuation,
consider the SD Source Code Search Engine. It indexes each source language according
to language-specific rules, so it knows exactly the identifiers, keywords,
strings, comments and operators in that language and indexes the code according to
those elements. It will handle a wide variety of languages (C, C++, Java, VB6, C#, COBOL)
all at once.
Your first query would be posed as:
I=show - I=MessageBox ... '('
(locate identifiers named "show" but eliminate those that are overlapped by
MessageBox leftparen).
Your second query would be posed simply as
'.' I=parent
See http://www.semanticdesigns.com/Products/SearchEngine/index.html
This seems to be a job for tools like ctags and cscope.
Ctags is used to index declarations in source files (many languages are supported), and Cscope is for in-depth analysis of C files.
These tools are better suited to per-project use, in my opinion. Moreover, you may need another tool to make use of these indexes. I use vim myself for this purpose, but many text editors support ctags.
The tool from DTSearch.com.
