Unable to search PDF files' contents in terminal

I have PDF files whose contents I have not managed to search with any terminal program.
I can only search them with Acrobat Reader and Skim.
How can you search the contents of PDF files in the terminal?
It seems that a better question is:
How is the search done in PDF viewers such as Acrobat Reader and Skim?
Perhaps I need to write such a search tool if none exists.

Try installing xpdf from MacPorts; it comes with a tool called pdftotext, which then allows you to search using grep.

pdftotext is indeed an excellent tool, but it produces very long lines; in order to grep you will want to break them up, e.g.,
pdftotext drscheme.pdf - | fmt | grep -i spidey
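To see why the line-breaking matters, the same wrap-then-match idea can be sketched in stdlib Python (a hypothetical helper: textwrap plays the role of fmt, a case-insensitive substring check plays the role of grep -i):

```python
import textwrap

def grep_wrapped(text, pattern, width=75):
    """Wrap long lines (as fmt does), then keep only the lines
    containing the pattern, case-insensitively (as grep -i does)."""
    lines = []
    for line in text.splitlines():
        lines.extend(textwrap.wrap(line, width=width) or [""])
    return [l for l in lines if pattern.lower() in l.lower()]

# One very long "line", as pdftotext often emits for a whole paragraph:
page = "word " * 50 + "Spidey appears here " + "word " * 50
for hit in grep_wrapped(page, "spidey"):
    print(hit)
```

Without the wrapping step, a match would print the entire paragraph-long line; with it, you get a readable, terminal-width snippet.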

PDF files are usually compressed. PDF viewers such as Acrobat Reader and Skim search the contents by decompressing the PDF text into memory, and then searching that text. If you want to search from the command line, one possible suggestion is to use pdftk to decompress the PDF, and then use grep (or your favorite command line text searching utility) to find the desired text. For example:
# Search for the text "text_to_search_for", and print out 3 lines of context
# above and below each match
pdftk mydoc.pdf output - uncompress | grep -C3 text_to_search_for
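The context behaviour of grep -C3 can be sketched in stdlib Python, in case you want to post-process matches in a script (a hypothetical helper, not part of pdftk or grep):

```python
def grep_context(lines, pattern, context=3):
    """Return each line matching `pattern`, plus `context` lines
    above and below it, like grep -C3."""
    hits = [i for i, line in enumerate(lines) if pattern in line]
    keep = set()
    for i in hits:
        keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]

doc = ["line %d" % i for i in range(10)] \
    + ["here is text_to_search_for"] \
    + ["line %d" % i for i in range(10, 20)]
for line in grep_context(doc, "text_to_search_for"):
    print(line)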


How can I download wiki content as one txt file?

I need a huge natural text file for machine learning, and a Wikipedia dump is great for this purpose. How can I download several GB of text in some language (non-English) without XML tags (just the content)?
You could grab a dump of all content of a Wikipedia of your choice from dumps.wikimedia.org. You will likely want one of the *wiki-20160501-pages-articles.xml files. Then, you could strip all XML tags from the dump using a tool like xmlstarlet:
xml sel -t -c "//text()" fywiki-20160501-pages-articles.xml > articles.txt
However, the text in a Wikipedia dump will be wiki markup, not natural text. You can strip everything that's not alphanumeric with something like sed:
cat dump.txt | sed 's/\W/ /g'
This doesn't give you a clean corpus (for example, wikimarkup keywords and html entities will still be in your dump file) but it may be close enough for your purpose.
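The same stripping step can be done in stdlib Python with re, which behaves like sed's \W substitution (the sample input is a hypothetical fragment of wiki markup):

```python
import re

def strip_non_word(text):
    """Replace every non-word character with a space,
    the equivalent of sed 's/\\W/ /g'."""
    return re.sub(r"\W", " ", text)

print(strip_non_word("[[Category:Taal]] &amp; {{stub}}"))
```

As with the sed version, markup keywords such as "Category", "amp", and "stub" survive as plain words, so the result is a rough corpus rather than a clean one.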
Phase a:
Go to dumps.wikimedia.org and find a dump that fits your needs. For machine learning, the best option is to download the "All pages, current versions only" dump for your language. Download and unzip it.
Phase b:
As the dump is an XML file whose content uses wiki-markup syntax, it has to be converted to plain text. The best solution I've found is this toolkit: https://github.com/yohasebe/wp2txt . It doesn't need much memory and works well.
Phase c:
wp2txt produces hundreds of ~10 MB txt files, so we need to concatenate them. Use
cat * > all.txt
on *nix systems, or
copy *.txt all.txt
on Windows.
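If you'd rather not remember two commands, the concatenation step can be sketched in stdlib Python, which works the same on *nix and Windows (the directory and file names here are hypothetical):

```python
from pathlib import Path

def concat_files(directory, pattern, out_name):
    """Concatenate every file matching `pattern` in `directory`
    into a single file named `out_name`."""
    out = Path(directory) / out_name
    with out.open("w", encoding="utf-8") as dst:
        for part in sorted(Path(directory).glob(pattern)):
            if part.name == out_name:  # don't read the output file itself
                continue
            dst.write(part.read_text(encoding="utf-8"))
    return out

# e.g. concat_files("wp2txt-output", "*.txt", "all.txt")
```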
P.S. I've also found a better, semi-legal solution for the ML case: download some huge txt literature library. Have a nice learning!
For Python, try this after downloading the .xml dump:
pip install wiki-dump-reader
https://pypi.org/project/wiki-dump-reader/

How to print a text/plain document to a CUPS printer without using the raw option

I am using a CUPS command to print pages of documents, but it is printing all the pages, ignoring the pages option. After some investigation I found that the raw option is overriding the pages option. How can I print selected pages without using the raw option? If I don't use this option, I get a "text file not supported" error. Here is my code:
system("lpr -P AFSCMSRPRNT3 -o pages=1,2,6 -o raw -T test_womargin abc.txt");
Plain text files don't really specify how things should be printed, and thus aren't allowed.
Try to convert the text to a usable format first. There's a popular tool, a2ps, which should be available for every Linux distribution in the world. Try that!
EDIT: you seem to be confused by the word "convert":
What I meant is that instead of printing the text file, you print a PostScript file generated from it; something that you can get by doing something like
a2ps -o temporaryoutput.ps input.txt
and then
lpr -P AFSCMSRPRNT3 -o pages=1,2,6 -T test_womargin temporaryoutput.ps

xsel -o equivalent for OS X

Is there an equivalent solution to grab the selected text in OS X, the way 'xsel -o' works on Linux?
I just need the current selection so I can use the text in a shell script.
Cheers,
Erik
You can probably install xsel on MacOS. (UPDATE: According to Arkku's comment, that will only work if you have the X11 server running and synchronized to the OS X pasteboard.)
If not, a quick Google search turns up pbcopy / pbpaste, which apparently is pre-installed.
Link: https://github.com/raymontag/keepassc/issues/59
The Linux tool xsel is not required as pbcopy and pbpaste are Apple command line utilities that provide this functionality and are installed by default on macOS.
From the manual page (man pbcopy):
pbcopy, pbpaste - provide copying and pasting to the pasteboard (the Clipboard) from command line

pbcopy takes the standard input and places it in the specified pasteboard. If no pasteboard is specified, the general pasteboard will be used by default. The input is placed in the pasteboard as plain text data unless it begins with the Encapsulated PostScript (EPS) file header or the Rich Text Format (RTF) file header, in which case it is placed in the pasteboard as one of those data types.

pbpaste removes the data from the pasteboard and writes it to the standard output. It normally looks first for plain text data in the pasteboard and writes that to the standard output; if no plain text data is in the pasteboard it looks for Encapsulated PostScript; if no EPS is present it looks for Rich Text. If none of those types is present in the pasteboard, pbpaste produces no output.
To copy filename.txt to the clipboard, use the following:
pbcopy < filename.txt

How can doc/docx files be converted to markdown or structured text?

Is there a program or workflow to convert .doc or .docx files to Markdown or similar text?
PS: Ideally, I would welcome an option whereby a specific font (e.g. Consolas) in the MS Word document is rendered as code text: ```....```.
Pandoc supports conversion from docx to markdown directly:
pandoc -f docx -t markdown foo.docx -o foo.markdown
Several markdown formats are supported:
-t gfm (GitHub-Flavored Markdown)
-t markdown_mmd (MultiMarkdown)
-t markdown (pandoc’s extended Markdown)
-t markdown_strict (original unextended Markdown)
-t markdown_phpextra (PHP Markdown Extra)
-t commonmark (CommonMark Markdown)
docx -> markdown
Specifically regarding the question (docx --> markdown), use the Writage plugin for Microsoft Word. It also works the other way round, markdown --> docx.
More Options
Use a Conversion Tool for multi-file conversion.
Use a WYSIWYG Editor for single files and superior fonts.
Which Conversion Tools?
I've tested these three: (1) Pandoc (2) Mammoth (3) w2m
Pandoc
By far the superior tool for conversions with support for a multitude of file types (see Pandoc's man page for supported file types):
pandoc -f docx -t gfm somedoc.docx -o somedoc.md
NB
To get pandoc to export markdown tables ('pipe_tables' in pandoc) use multimarkdown or gfm output formats.
If formatting to PDF, pandoc uses LaTeX templates for this so you may need to install the LaTeX package for your OS if that command does not work out of the box. Instructions at LaTeX Installation
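For the multi-file conversion case mentioned above, a small Python wrapper around pandoc might look like this (a sketch only: the "docs" folder name is hypothetical, and it assumes pandoc is on your PATH):

```python
from pathlib import Path
import subprocess

def pandoc_cmd(docx_path, fmt="gfm"):
    """Build the pandoc command line for one docx -> md conversion."""
    out = Path(docx_path).with_suffix(".md")
    return ["pandoc", "-f", "docx", "-t", fmt, str(docx_path), "-o", str(out)]

folder = Path("docs")
if folder.is_dir():
    for doc in folder.glob("*.docx"):
        subprocess.run(pandoc_cmd(doc), check=True)
```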
Which WYSIWYG Editors?
For docx, use Writage.
Maintaining Superior Fonts
If you wish to preserve unicode characters and emojis and maintain superior fonts, you'll get some mileage from the editors below when using copy-and-paste operations between file formats. Note, these do not read or write natively to docx.
Typora
iaWriter
Markdown Viewer for Chrome.
Programmatic Equivalent
For a programmatic equivalent, you might get some results by calling a different pdf-engine and its respective options, but I haven't tested this. Pandoc defaults to 'pdflatex'.
pandoc --pdf-engine=
pandoc --pdf-engine-opt=STRING
Update: A4 vs US Letter
For outside the US, set the geometry variable:
pandoc -s -V geometry:a4paper -o outfile.pdf infile.md
Footnote
It's worth mentioning here: what's not obvious when discovering Markdown is that MultiMarkdown is by far the most feature-rich markdown format.
MultiMarkdown supports amongst other things - metadata, table of contents, footnotes, maths, tables and YAML.
But GitHub's default format is gfm, which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.
Given that you asked this question on stackoverflow you're probably wanting a programmatic or command line solution for which I've included another answer.
However, an alternative solution might be to use the Writage Markdown plugin for Microsoft Word.
Writage turns Word into your Markdown WYSIWYG editor, so you will be able to open a Markdown file and edit it like you normally edit any document in Microsoft Word. Also it will be possible to save your Word document as a Markdown file without any other converters.
Under the covers, Writage uses Pandoc that you'll also need to install for this plugin to work.
It currently supports the following Markdown elements:
Headings
Lists (numbered and bulleted)
Links
Font styles such as bold, italic
Tables
Footnotes
This might be the ideal solution for many end users as they won't need to install or run any command line tools - but rather just stick with what they are most familiar.
Mammoth is best known as a Word to HTML converter but it now supports a Markdown writer module. When I last checked, Mammoth Markdown support was still in its early stages, so you may find some features are unsupported. As usual ... check the website for the latest details.
Install
To use the Javascript version ... install NodeJS and then install Mammoth:
npm install -g mammoth
Command line
Command line to convert a Word document to Markdown ...
mammoth document.docx --output-format=markdown
API
NodeJS API to convert to Markdown ...
var mammoth = require("mammoth");
mammoth.convertToMarkdown({path: "path/to/document.docx"})
    .then(function(result) { console.log(result.value); });
Features:
Mammoth Markdown writer currently supports:
Lists (numbered and bulleted)
Links
Font styles such as bold, italic
Images
The Mammoth command line tools and API have been ported to several languages:
With NO Markdown (May 2016):
.NET
Java/JVM
Wordpress
With Markdown:
Javascript
Python
You can use Word to Markdown (Ruby Gem) to convert it in one step. Conversion can be as simple as:
$ gem install word-to-markdown
$ w2m path/to/document.docx
It routes the document through LibreOffice, but also does its best to infer semantic headings based on their relative font size.
There's also a hosted version which would be as simple as drag-and-drop to convert.
Word to Markdown might be worth a shot, or the procedure described here using Calibre and Pandoc via HTMLZ, here's a bash script they use:
#!/bin/bash
mkdir temp
cp "$1" temp
cd temp
ebook-convert "$(basename "$1")" output.htmlz
unzip output.htmlz
cd ..
pandoc -f html -t markdown -o output.md temp/index.html
rm -R temp
From here:
unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html
You can convert Word documents from within MS Word to Markdown using this Visual Basic Script:
https://gist.github.com/hawkrives/2305254
Follow the instructions under "To use the code" to create a new Macro in Word.
Note: This converts the currently open Word document to Markdown, which removes all the Word formatting (headings, lists, etc.). First save the Word document you plan to convert, and then save the document again as a new document before running the macro. This way you can always go back to the original Word document to make changes.
There are more examples of Word to markdown VB scripts here:
https://www.mediawiki.org/wiki/Microsoft_Word_Macros
Here's an open-source web application built in Ruby to do this exact thing:
https://word2md.com
If you're using Linux, try Pandoc (first convert .doc/.docx into html with LibreOffice or something and then run it).
On Windows (or if Pandoc doesn't work), you can try this website (online demo, you can download it): Markdownify
For bulleted lists you can paste a list into Sublime Text and use multiselect (tested) or find-and-replace (not tested) to replace, e.g., the proprietary MS Word bullet characters with -, --, etc.
This doesn't work with headings but it may be possible to use a similar technique with other elements.
For .doc Word files:
antiword -f some_file.doc
antiword's homepage: http://www.winfield.demon.nl/

Splitting text files column-wise

So I have an invoice that I need to make a report out of. It is on average about 250 pages long. I'm trying to create a script that extracts specific values from the invoice and builds a report. Here's my problem:
The invoice is in PDF format, spanning two columns. On Linux, I want to use the 'pdftotext' command to convert it into multiple text files (with each txt file representing one PDF page). How do I do that?
I recognize that the 'pdftotext' command separates the left part of the page and the right part of the page with 21 spaces in between. How do I move the right side of the data (identified after reading at least 21 spaces in a row) to the end of the file?
Since the file is large and I only need the last few pages, how do I delete all those text files in a script (not manually) until I read a keyword (let's say the keyword is "Start Invoice")?
I know this is a lot of questions, but I'm confused about what Linux commands can do. Can you guys guide me in the right direction? Thanks.
PS: I'm using CentOS 5.2
What about:
pdftotext YOUR.pdf | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT
pdftotext YOUR.pdf | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
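The same left/right split can be sketched in stdlib Python, which avoids running pdftotext twice (a hypothetical helper; real invoices may need the -layout option and a different gap width):

```python
import re

def split_columns(text, gap=21):
    """Split two-column pdftotext output on runs of `gap`+ spaces;
    return all left-column lines followed by all right-column lines."""
    left, right = [], []
    sep = re.compile(r" {%d,}" % gap)
    for line in text.splitlines():
        parts = sep.split(line, maxsplit=1)
        left.append(parts[0].rstrip())
        if len(parts) > 1:
            right.append(parts[1].rstrip())
    return left + right

page = "Item A" + " " * 21 + "Item C\nItem B" + " " * 25 + "Item D"
for line in split_columns(page):
    print(line)
```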
