Search and replace strings in PDF with Perl, Ruby, or PHP

Is there a way to script replacing strings in PDF documents? I can use Perl, Ruby, or PHP. If possible, a regular expression would be great.

As part of my open-source CAM::PDF Perl library, I include a tiny front-end program called changepagestring.pl which does what you ask.
However, it only replaces text that's contiguous in the PDF syntax. If you switch fonts, size, style, etc. mid-phrase then it won't match. If you do any advanced kerning then it won't match.
Those limitations aside, it's really easy to use, and it's simple enough that you can fork it and hack it to your needs.
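For reference, here is a minimal sketch of the approach changepagestring.pl takes, using CAM::PDF's page-content accessors (the file names and phrases are placeholders):
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('in.pdf') or die "$CAM::PDF::errstr\n";

for my $pagenum (1 .. $pdf->numPages()) {
    # Page content is raw PDF syntax, so this only matches text
    # that is contiguous in the stream (see the caveats above)
    my $content = $pdf->getPageContent($pagenum);
    if ($content =~ s/Old Phrase/New Phrase/g) {
        $pdf->setPageContent($pagenum, $content);
    }
}
$pdf->cleanoutput('out.pdf');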

In Perl, you can parse the contents of your PDF using the PDF::API2 module. You should then be able to search and replace your target strings in the usual way (s///), and write the new document back to disk.

Related

creating tags for a script language for easy browsing in vim

I use ctags+Vim for a lot of my projects and I really like the ability to easily browse through large chunks of code quickly.
I am also using Stata, a statistical package, which has a scripting language. Even though you can have routines in the code, its code tends to be a series of commands that perform data and statistics operations, and the code files can be very long. So I always find myself in need of a way to browse it efficiently.
Since I use Vim, I can use marks. But I was wondering if I could use ctags to do this. That is, I want to create a tag marker which (1) won't cause a problem when I run the script and (2) is easy for ctags to pick up.
Because it must not break the script, it needs to be a comment. In Stata, comment lines start with * and block comments can be made with /* ..... */.
It would be great, for example, to have sections in the code marked by comments:
* Section: Data
And ctags picks up "Data" as the tag, so I can see a list of sections and jump to them easily without needing to create marks.
Is there any way to do this? I'd appreciate any comments.
You need a way to generate a tags database for your Stata files. The format is simple; see :help tags-file-format. The default tags program, Exuberant Ctags, can be extended with regular expressions (--langmap, --regex); that probably only yields an approximate parsing for complex languages, but it should suffice for custom section marks; maybe you could even directly extract interesting language keywords.
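For example, a rough sketch of such a regex extension (untested; it assumes your Stata scripts use the .do extension and the * Section: comment convention from the question):
ctags --langdef=stata --langmap=stata:.do \
      --regex-stata='/^\* Section: (.+)$/\1/s,section/' *.do
This writes a tags file in which every * Section: comment becomes a tag, so Vim's :tag and friends can jump straight to it.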

Text mining MS Word documents?

I have about 30 .docx documents (Résumés) with data about people's names, skills and so forth. I need to populate a spreadsheet with some of this information, and to reduce manual work I thought I could use a text mining approach.
Are there any tools or approaches that would be useful in mining (sort of semi-structured) information from these documents?
The best I can come up with is using Perl, as I know you can pull text from Word documents (though that in itself can be tricky) and populate XML spreadsheets using Perl modules.
I haven't written Perl in anger in a long time, so I can't offer examples of how to do this, but if I were to put something together to do this, I would recommend Perl. I am sure someone will say there are equivalent functions in Python, and maybe even in Ruby, but Perl is what I've used, and I've found it very effective for manipulating/matching/parsing/processing text.
You can try using the catdoc tool (http://www.wagner.pp.ru/~vitus/software/catdoc/), which will extract the text contents from an MS Word file; after that, do whatever text processing you want. I'd probably just grep for the existence of certain words in the resume against the output of catdoc. No point in over-engineering a solution.
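For instance, a quick Perl sketch of that grep-style approach (the skill list is a placeholder; it assumes catdoc is on your PATH and that the document is named on the command line):
#!/usr/bin/perl
use strict;
use warnings;

my $file = shift or die "usage: $0 resume.doc\n";
my $text = `catdoc "$file"`;                # catdoc prints the document text to stdout
die "catdoc failed on $file\n" if $? != 0;
for my $skill (qw(perl sql excel)) {
    print "$file: $skill\n" if $text =~ /\b\Q$skill\E\b/i;
}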
There are multiple ways to read a Word file, whether .doc or .docx.
A .docx file is nothing but a fancy ZIP container, but the binary .doc format is a little trickier to extract (see the sketch after this list).
Some ways to extract text from Word documents:
.doc/.docx >> open with OpenOffice >> use PyUNO from Python to get your data.
.doc/.docx >> use the python-docx module and textract to extract the data.
.doc/.docx >> use R, which has packages such as officer and ReporteRs, to extract the data.
You can also use text-mining tools to convert text from one form to another.
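To illustrate the "fancy container" point above, here is a minimal Perl sketch that pulls the text out of a .docx without any Word-specific library (the file name is a placeholder, and the tag stripping is deliberately crude rather than a real XML parse):
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# A .docx is a ZIP archive; the body text lives in word/document.xml
my $zip = Archive::Zip->new();
$zip->read('resume.docx') == AZ_OK or die "Cannot read resume.docx\n";
my $xml = $zip->contents('word/document.xml')
    or die "No word/document.xml member found\n";

$xml =~ s{</w:p>}{\n}g;    # turn paragraph ends into newlines
$xml =~ s{<[^>]+>}{}g;     # strip the remaining tags, crudely
print $xml;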

Search tool that uses a grammar rather than regular expression?

Are there any search tools that allow you to set up a simple token/grammar parsing system that works similarly to regular expressions?
What we want to do is search our ColdFusion code for queries that do not have cfqueryparams in them.
A regular expression gets a bit tough in this situation because I can't keep track of the start tags while looking for something else before getting an end tag.
It seems like a parsing system would work more accurately.
Seeing as it is XML, I would just use XSLT.
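For comparison, the same idea sketched in Perl with XPath instead of XSLT (it assumes the CFML template happens to parse as well-formed XML, which real-world ColdFusion code often will not, so treat it as illustrative only; template.cfm is a placeholder):
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(location => 'template.cfm', recover => 2);
# XPath can express "a cfquery with no cfqueryparam anywhere inside it",
# which is exactly the nested start/end-tag tracking a regex struggles with
for my $q ($doc->findnodes('//cfquery[not(.//cfqueryparam)]')) {
    print 'Unparameterized query: ', $q->getAttribute('name') // '(unnamed)', "\n";
}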

Intelligent file search for windows that can ignore whitespace and search in code?

Does anybody know a Windows-based search tool that is easy to use and programmer-friendly?
The functions I am looking for:
Ignore whitespace in the search, i.e. capable of finding
myTestFunction ( $parameter, $another_parameter, $yet_another_parameter )
{ doThis();
using the query
myTestFunction($parameter,$another_parameter,$yet_another_parameter){doThis();
without regexes.
Search code "semantically" (for me, it would have to be PHP):
Search in comments only
Search in function names only
Search for parameters that are named $xyz
Search in (insert code construct here) only
If there is none around, it's high time somebody developed it! :)
I have opened a bounty for this.
See our SD Search Engine. This is a language-sensitive search engine designed to search large code bases, with special language classifiers for C, C++, Java, C#, COBOL, JavaScript, Ada, Python, Ruby and lots of other languages, including your specific target language PHP (PHP4 and PHP5).
I think it does everything you requested.
It indexes the language elements, so searches across large code bases are extremely fast (Linux kernel ~~ 7.5 million lines --> 2.5 seconds). (The indexing step runs on Windows, but the display engine is in Java.)
Search hits are shown in a one-line-context hit window showing the file and line number, as well as the line with the hit highlighted. Clicking on a hit brings up the source code, with tabs expanded appropriately and the line count right even for languages that have odd line-counting rules (such as GCC's treatment of form-feed characters), and with the hit line and hit text highlighted. Clicking in the source window will launch your favorite editor on the file.
Because it understands language elements, it ignores language-specific whitespace. It skips over comments unless you insist they be inspected. Searches thus ignore whitespace, comments and line boundaries (if the language treats line boundaries as whitespace, which is why there are language-specific scanners). The query language allows you to specify which language tokens you want (specific tokens in quotes, or generic tokens such as identifiers I, numbers N, strings S, operators O and punctuation P) as a series of tokens, with constraints on the token values.
Your example search:
myTestFunction($parameter,$another_parameter,$yet_another_parameter){doThis();
would be expressed to the search engine precisely as:
I=myTestFunction '(' I ',' I ',' I ')' '{' I=dothis '(' ')' ';'
but it would probably be easier (less typing) to find it as:
I=myTest* ... I=dothis
where I=myTest* means an identifier starting with myTest and ... means "near".
The Search Engine also offers regular expression searches on the text, if you insist. So you still have grep-like searches (a lot slower than indexed searches), but with the hit window and source display windows too.
I use ack really successfully for this kind of thing, particularly when trying to find things in large codebases. I run it on Linux myself, but I don't see any reason why it won't run on Windows, or in Cygwin at the very least. Check it out; I think you'll find it is exactly what you're looking for.
Search code "semantically" (for me, it would have to be PHP):
For this you could (and I think should) use some custom code using token_get_all()
See also the list of available tokens.
Ignore white space in search
A simple regex should be sufficient. It depends on your regex library, but most come with a whitespace modifier/flag.
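As a rough illustration, here is that idea in Perl (the question mentions PHP, but any PCRE-style engine works the same way): build a whitespace-insensitive pattern mechanically from the literal query.
#!/usr/bin/perl
use strict;
use warnings;

# The literal query, exactly as typed in the question
my $query = 'myTestFunction($parameter,$another_parameter,$yet_another_parameter){doThis();';

my $pattern = quotemeta $query;            # escape every regex metacharacter
$pattern =~ s/(\\[(){},;])/\\s*$1\\s*/g;   # tolerate whitespace around punctuation

my $code = do { local $/; <> };            # slurp the file named on the command line
print "match\n" if $code =~ /$pattern/;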
For my Windows desktop search, I use Agent Ransack. I use this as a replacement for the Windows search.
You can use regular expressions, but there is a nice entry screen if you want to avoid entering them directly.
Take a look at the Google Desktop API; it has a very powerful set of methods to do what you're looking for.
Of course, it requires you to have Google Desktop installed.
After reviewing it a little, though, it provides some of this functionality, but nothing as specific as what you require.
I really like Crimson Editor and it allows RegEx searches. It has helped me a bunch over the past six years. I think it will fit your needs. Try it.
I use TextPad for searching code files in Windows. It has a very handy find-in-files function (Search / Find In Files) and you can use regex which should meet any search requirements. In the search results it will list the file location, line number and a snippet from that line.

Text indexer search tool which can filter by punctuation?

This is not a programming question per se, but a question about searching source code files, which helps me in programming.
I use a search tool, X1, which quickly tells me which source code files contain the keywords I am looking for. However, it doesn't work well for keywords that have punctuation attached to them. For example, if I search for "show()", X1 shows everything that has "show" in it, including far too many results from "MessageBox.Show(.....)", which I don't want to see.
Another example: I need to filter to show ".parent" (notice the dot) and not show everything that has "parent" (no dot) in it.
Does anyone know a text search tool which can filter by keywords that have punctuation? I really prefer a desktop app instead of a web-based tool like Google (I find it clunky).
I am looking for a tool which indexes words, not a general file searcher like Windows File Explorer.
If you want to search code files efficiently for keywords and punctuation, consider the SD Source Code Search Engine. It indexes each source language according to language-specific rules, so it knows exactly the identifiers, keywords, strings, comments, and operators in that language and indexes it according to those elements. It will handle a wide variety of languages: C, C++, Java, VB6, C#, COBOL, all at once.
Your first query would be posed as:
I=show - I=MessageBox ... '('
(locate identifiers named "show", but eliminate those overlapped by a "MessageBox ... left parenthesis" sequence).
Your second query would be posed as simply:
'.' I=parent
See http://www.semanticdesigns.com/Products/SearchEngine/index.html
This seems to be a job for tools like ctags and cscope.
Ctags is used to index declarations in source files (many languages are supported), and cscope for in-depth C file analysis.
These tools are better suited for per-project use, in my opinion. Moreover, you may need another tool to make use of the indexes; I use Vim myself for this purpose, but many text editors support ctags.
The tool from DTSearch.com.
