In AsciiDoc, how do I execute a macro only for PDF generation?

In the asciidoc there is an image that I'd like to include only for PDF output. Is there a kind of attribute I can pass to image:: so that the image is processed for PDF generation, and ignored when generating epubs etc.? Or more probably by using ifdef, but how exactly?

You can use the ifdef directive like this:
Some text for all outputs.
ifdef::backend-pdf[]
This is only displayed in the PDF document, and here you can use image:
image::mypict.png[]
endif::[]
Asciidoctor sets the backend-pdf attribute for you when you convert with the PDF converter (e.g. asciidoctor-pdf), so no extra flags are needed for this to work.

If you need a simple and small solution, I would prefer Jmini's answer.
For cases that need more differentiation, you can define your own filter attributes for different types of output and set them as parameters in the command-line call. For example, if you also want to distinguish between different audiences, you can filter your content as follows.
Some text for all outputs.
ifeval::["{myfilter1}"=="pdf"]
This content is pdf-only!
ifeval::["{myfilter2}"=="adminpdf"]
This content is admin-pdf-only!
endif::[]
endif::[]
You can set the attributes in the command-line call as follows:
--attribute myfilter1='pdf'
The exact form of the command depends on your system. The following syntax could work (I cannot test it at the moment, in the absence of an asciidoc setup).
OS X asciidoc -> docbook:
SCRIPT=${...path to asciidoc installation...}/asciidoc.py
INPUT=myAsciidocInput.ad
OUTPUT=MyDocBookOutput.xml
MYFILTER="--attribute myfilter1=pdf --attribute myfilter2=adminpdf"
# $MYFILTER is deliberately left unquoted below so the shell splits it into separate options
python "$SCRIPT" -o "$OUTPUT" $MYFILTER "$INPUT"
WIN asciidoc -> docbook:
set SCRIPT=%{...path to asciidoc installation...}%\asciidoc.py
set INPUT=myAsciidocInput.ad
set OUTPUT=MyDocBookOutput.xml
set MYFILTER=--attribute myfilter1=pdf --attribute myfilter2=adminpdf
python "%SCRIPT%" -o "%OUTPUT%" %MYFILTER% "%INPUT%"

Related

Is there a way to list all categories in perluniprops?

perluniprops lists the Unicode properties of the version of Unicode it supports. For Perl 5.32.1, that's Unicode 13.0.0.
You can obtain a list of the characters that match a category using Unicode::Tussle's unichars.
unichars '\p{Close_Punctuation}'
And the help:
$ unichars --help
Usage:
    unichars [*options*] *criterion* ...

    Each criterion is either a square-bracketed character class, a regex
    starting with a backslash, or an arbitrary Perl expression. See the
    EXAMPLES section below.

OPTIONS:

  Selection Options:
    --bmp           include the Basic Multilingual Plane (plane 0) [DEFAULT]
    --smp           include the Supplementary Multilingual Plane (plane 1)
    --astral    -a  include planes above the BMP (planes 1-15)
    --unnamed   -u  include various unnamed characters (see DESCRIPTION)
    --locale    -l  specify the locale used for UCA functions

  Display Options:
    --category  -c  include the general category (GC=)
    --script    -s  include the script name (SC=)
    --block     -b  include the block name (BLK=)
    --bidi      -B  include the bidi class (BC=)
    --combining -C  include the canonical combining class (CCC=)
    --numeric   -n  include the numeric value (NV=)
    --casefold  -f  include the casefold status
    --decimal   -d  include the decimal representation of the code point

  Miscellaneous Options:
    --version   -v  print version information and exit
    --help      -h  this message
    --man       -m  full manpage
    --debug     -d  show debugging of criteria and examined code point span

  Special Functions:
    $_ is the current code point
    ord is the current code point's ordinal
    NAME is charname::viacode(ord)
    NUM is Unicode::UCD::num(ord), not code point number
    CF is casefold->{status}
    NFD, NFC, NFKD, NFKC, FCD, FCC (normalization)
    UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)
    Singleton, Exclusion, NonStDecomp, Comp_Ex
    checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
    NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE
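If you don't have the Perl toolchain handy, a rough equivalent of that unichars call can be put together in Python from the standard library alone. This is just a sketch; 'Pe' is the short general-category code corresponding to Close_Punctuation.
import sys
import unicodedata

# List every code point whose general category is Pe (Close_Punctuation),
# i.e. what unichars matches with \p{Close_Punctuation}.
close_punct = [chr(cp) for cp in range(sys.maxunicode + 1)
               if unicodedata.category(chr(cp)) == "Pe"]

for ch in close_punct[:10]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), ch)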
Other than reading the list of categories from the webpage, is there a way to programmatically get all the possible \p{...} categories?
From the comments, I believe you are trying to port a Perl program that uses \p regex properties to Python. You don't need a list of all categories (whatever that means); you just need to know which code points each of the properties used by the program matches.
Now, you could get the list of Code Points from the Unicode database. But a much simpler solution is to use Python's regex module instead of the re module. This will give you access to the same Unicode-defined properties that Perl exposes.
The latest version of the regex module even uses Unicode 13.0.0 just like the latest Perl.
Note that the program uses \p{IsAlnum}, a longer way of writing \p{Alnum}. \p{Alnum} is not a standard Unicode property but a Perl extension; it's the union of the Unicode properties \p{Alpha} and \p{Nd}. I don't know if the regex module defines Alnum identically, but it probably does.
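As a quick illustration, here is a sketch assuming the third-party regex module is installed (pip install regex); if I recall correctly it accepts both long category names and property aliases like Alpha:
import regex  # the third-party module, not the stdlib re

# General categories work with Perl-style \p{...} escapes:
print(regex.findall(r'\p{Close_Punctuation}', 'foo) bar] baz}'))  # [')', ']', '}']

# An approximation of Perl's \p{Alnum} as the union described above:
print(regex.findall(r'[\p{Alpha}\p{Nd}]+', 'abc 123 ΔΕΖ ٤٥٦'))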

Windows CMD expand filename if file has .double.extension

I'm writing a script that converts a Markdown file to a PDF, facilitated by Pandoc.
So if you drag C:\Users\User\Documents\English\PAPER1.md onto the script, it'll create C:\Users\User\Documents\English\PDFs\PAPER1.pdf.
This is achieved in the header of the script via
set INPUT=%1
set PDFDIR=%~dp1\PDFs
set PDF=%PDFDIR%\%~n1.pdf
However, in certain circumstances, the input filename will be Something.md.txt, in which case I still want to output just Something.pdf. (Removing more than three extensions will probably not be necessary or desirable.)
But the current setup only strips one extension, producing Something.md.pdf.
However, %~nn1.pdf does not work, nor does set p=%~n1 followed by set PDF=%~np, which gives:
The following usage of the path operator in batch-parameter
substitution is invalid: %~np
How do I get the bare filename of a file with "two extensions"?
I'd change the header of the script to:
Set "INPUT=%~1"
Set "PDFDIR=%~dp1PDFs"
For %%A In ("%~dpn1") Do Set "PDF=%PDFDIR%\%%~nA.pdf"
Here %~dpn1 expands to the drive, path, and file name with the last extension already removed, and the For loop's %%~nA modifier then strips the remaining one. Because there's no guarantee about the input content, please use best practice and always reference these variables using double quotes wherever possible:
Echo "%INPUT%"
Echo "%PDFDIR%"
Echo "%PDF%"

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job.
What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.
More specifically the data I am working with is from grocery store flyers. As it stands now we have to manually enter every deal in the flyer into a database. A sample of a flyer is http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551
What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific let me know.
After looking at the specific PDF linked to by the OP, I have to say that it does not display a typical table format.
It contains many images inside the "cells", and the cells are not all strictly vertically or horizontally aligned.
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page:
the only distinction that the syntax provides is between vector elements (lines, fills, ...), images, and text.
Whether any given character is part of a table, part of a line, or just a lone character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)
...but doing so with TabulaPDF works very well!
Having said the above, now let me add this:
For an amazing open source family of tools that gets better and better from week to week at extracting tabular data from PDFs (unless they are scanned pages) -- contradicting what I said in my introductory paragraphs! -- check out TabulaPDF. See these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Tabula-Extractor is written in Ruby.
In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs.
To run, Tabula-Extractor requires JRuby 1.7 to be installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor, directly from its GitHub source code repository.
Getting it to work was extremely easy, since JRuby 1.7.4_0 is already present on my system:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
The Git clone already includes the required libraries, so there is no need to install PDFBox separately.
The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs

Usage:
    tabula [options] <pdf_file>

where [options] are:
    --pages, -p <s>:        Comma separated list of ranges, or all. Examples:
                            --pages 1-3,5-7, --pages 3 or --pages all. Default
                            is --pages 1 (default: 1)
    --area, -a <s>:         Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire page
    --columns, -c <s>:      X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
    --password, -s <s>:     Password to decrypt document. Default is empty
                            (default: )
    --guess, -g:            Guess the portion of the page to analyze per page.
    --debug, -d:            Print detected table areas instead of processing.
    --format, -f <s>:       Output format (CSV,TSV,HTML,JSON) (default: CSV)
    --outfile, -o <s>:      Write output to <file> instead of STDOUT (default: -)
    --spreadsheet, -r:      Force PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating
                            each cell, as in a PDF of an Excel spreadsheet)
    --no-spreadsheet, -n:   Force PDF not to be extracted using
                            spreadsheet-style extraction (if there are ruling
                            lines separating each cell, as in a PDF of an Excel
                            spreadsheet)
    --silent, -i:           Suppress all stderr output.
    --use-line-returns, -u: Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
    --version, -v:          Print version and exit
    --help, -h:             Show this message
Extracting the table which the OP wants
I'm not even trying to extract this ugly table from the OP's monster PDF. I'll leave it as an exercise for those readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table, taking pages 651-653 from the official PDF-1.7 specification.
I used this command:
~/svn-stuff/git.tabula-extractor/bin/tabula \
-p 651,652,653 -g -n -u -f CSV \
~/Downloads/pdfs/PDF32000_2008.pdf
After importing the generated CSV into LibreOffice Calc, it looks to me like the perfect extraction of a table that spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
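(Sidenote: if you'd rather drive Tabula from Python than from the JRuby command line tool, my understanding is that the tabula-py wrapper drives the same engine; a minimal sketch, assuming pip install tabula-py and a Java runtime:)
import tabula  # the tabula-py wrapper; needs a Java runtime

# The same extraction as the command line call above, written to CSV.
tabula.convert_into("PDF32000_2008.pdf", "out.csv",
                    output_format="csv", pages="651-653")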
Update
Here is an asciinema screencast (which you can also download and replay locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor.

Passing data into perl script from command line

I have a Perl script that creates a report based on an XML definition. Currently these definitions all exist as .xml files.
So I have the script run-report.pl, which can take a path to a definition file and create the report.
Now I want to create run-reports-from-db.pl, which will generate the report definition based on some database entries. I don't want to create temp files to pass to run-report.pl; I would just like to pass in the definition somehow.
So instead of saying:
run-report.pl -def=./path/to/def.xml
I want to be able to say:
run-report.pl --stream
And have the report definition available in <STDIN>
I am sure there is a pretty trivial way to do this?
If I understand your question correctly, all you need is one | (pipe).
./generate-xml-from-db.pl | ./run-report.pl --stream
Anything the first process in the pipeline prints to stdout will appear in the second process's stdin.
As long as you read from STDIN, you have it available. Notice what happens when you take the code below, name it something like echo.pl, run it at the command line, and paste in reams of text:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Echo back every line of input.
while ( <> ) {
    say;
}
<> is the Perl shorthand for reading from the files named on the command line, or from STDIN when there are none.
As long as the method you're using to launch the process gives you a way to get hold of its standard input and output, you can just write to that handle. You have to use whatever means are available to you: in Java, for example, you'd get the input stream of the process; in a batch command you have to pipe into it; at a GUI terminal you can cut and paste.
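For completeness, here is a minimal sketch of that idea from Python; the script name and the definition string are placeholders for illustration:
import subprocess

# A stand-in report definition; in the real setup this would come
# from the database rather than a temp file.
definition = "<report>...</report>"

# Launch run-report.pl --stream and hand it the definition on stdin.
result = subprocess.run(
    ["./run-report.pl", "--stream"],
    input=definition,
    text=True,
    capture_output=True,
    check=True,
)
print(result.stdout)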

How to concatenate SVG files lengthwise from linux command line?

I have a series of square SVG files that I would like to arrange lengthwise into one super long SVG file.
I attempted to use ImageMagick to combine them, based on this page:
http://linux.about.com/library/cmd/blcmdl1_ImageMagick.htm
and this
http://www.imagemagick.org/Usage/compose/
I tried this command
composite 'file1.svg' 'file2.svg' +adjoin 'outputfile.svg'
However, I received the following error message:
composite: unrecognized option '+adjoin' # error/composite.c/CompositeImageCommand/565.
I tried several other imagemagick commands (convert, display), but had no success. How can I combine these files on the command line? Is there an inkscape command that does this?
There's currently no convenient way to do this with only the command line and no custom scripting.
The closest pre-written thing I could find currently (2012-04-16) is https://github.com/astraw/svg_stack, which lets you write commands of the form:
svg_stack.py --direction=h --margin=100 red_ball.svg blue_triangle.svg > shapes.svg
to concatenate.
It should be pretty easy if you're willing to use a scripting language. For each file, just add a prefix to all id attributes; so in file 1, id="circle" becomes id="file1_circle", and in file 2, id="circle" becomes id="file2_circle".
In most cases you would get away with a trivial search and replace (find id=" and replace it with id="fileX_), although there are cases where this won't work (for example, if that search string appears inside an item of text).
If you want to do this 'the proper way', you'll need an XML parser (such as XMLReader in PHP).
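For instance, here is a rough Python sketch (standard library only) that takes a different route: instead of flattening the files together, it nests each file's root <svg> element inside a parent <svg> at an increasing offset. It assumes same-sized square inputs, and duplicate ids across files can still clash if anything references them, so the prefixing trick above may still be needed.
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # avoid ns0: prefixes in the output

def stack_svgs(paths, size=100, outfile="combined.svg"):
    # Parent document sized to hold all the squares stacked vertically.
    root = ET.Element(f"{{{SVG_NS}}}svg",
                      {"width": str(size), "height": str(size * len(paths))})
    for i, path in enumerate(paths):
        inner = ET.parse(path).getroot()
        # A nested <svg> element accepts x/y offsets, so each file keeps
        # its own internal coordinate system.
        inner.set("y", str(i * size))
        inner.set("width", str(size))
        inner.set("height", str(size))
        root.append(inner)
    ET.ElementTree(root).write(outfile, encoding="utf-8", xml_declaration=True)

stack_svgs(["file1.svg", "file2.svg"])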
