How to do a word break in AsciiDoc?

I am writing documentation in AsciiDoc (using the AsciidocFX GUI, though I suppose that is irrelevant to the question) and I have run into a problem.
When I create a table and put a long word with no spaces into a cell, the word is not wrapped onto multiple lines; it overflows into the cell next to it.
Is there a way to fix this?
Here is the AsciiDoc source:
[[EXAMPLE_TABLE]]
[cols="4,3,5,7",width="100%",options="header","autowidth"]
|============================
s|Function s| Parameter s| Value s| Description
.3+^.^|toLongWorddddddddddd |Command |AnotherLongWord |Some description with spaces
|Parameter |Value |...
|Parameter |Value |...
|============================
Here is the output (image snipped from exported pdf file):
The final goal is simple: a nice-looking table.
Note:
When I export this document to HTML, the table looks fine, but it cannot be printed properly. I need the PDF for management.

Long story short,
I asked someone who has been working with AsciiDoc for a year, and he said that it is simply not possible: AsciiDoc does not support this.
The only solution is to split the long word into several shorter words with spaces. :(
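If the table content is generated or can be preprocessed, the manual splitting described above could be automated. The following is only a minimal Python sketch of that idea; the function name `break_long_words` and the parameters `max_len` and `sep` are hypothetical and not part of AsciiDoc or AsciidocFX.

```python
def break_long_words(text, max_len=10, sep=" "):
    """Insert `sep` into every word longer than `max_len` characters.

    This mimics the manual workaround: a long word is cut into
    chunks of at most `max_len` characters so the renderer has
    places where it may wrap the cell content.
    """
    result = []
    for word in text.split(" "):
        if len(word) > max_len:
            # Cut the word into fixed-size chunks.
            chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
            result.append(sep.join(chunks))
        else:
            result.append(word)
    return " ".join(result)


if __name__ == "__main__":
    print(break_long_words("toLongWorddddddddddd Command AnotherLongWord"))
```

The chunk size is a guess; it would have to be tuned to the column widths of the table.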

Related

How to create multiline QR code using bartender?

I am trying to generate a multiline QR code in BarTender. I am using an Excel file as the data source and took 3 fields for a first test. The QR code is generated successfully, but when I scan it all the text shows up on a single line, whereas I want the 3 fields on 3 separate lines. I have added the carriage return control character <<CR>> after the first data field. Below are the QR code property settings.
When I scan the QR code image, it gives me the following output:
No_LAN_IP90:61:AE:BC:5B:01FAC-Laptop-044
My Expected output is
No_LAN_IP
90:61:AE:BC:5B:01
FAC-Laptop-044
Any help is greatly appreciated.
I have tagged the post as Excel because I am using an Excel file as the data source; perhaps an Excel expert knows the answer.
Using LF (Line Feed) instead of CR should solve the problem.
--edit--
After rereading your question I saw that I had missed something. You are adding the LF to the example data field, which is not used while printing. The properties screen has a "Transforms" tab with an option to add a suffix and/or prefix. If you put the LF in the suffix field for your first and second lines, your problem should be solved.
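To make the intended payload concrete, here is a small Python sketch (not BarTender configuration) of what the encoded text should look like once an LF follows the first and second fields. The field values are taken from the question; the variable names are illustrative only.

```python
# The three data fields from the question.
fields = ["No_LAN_IP", "90:61:AE:BC:5B:01", "FAC-Laptop-044"]

# Joining with LF ("\n") puts an LF after the first and second
# field, which is what the suffix setting is meant to achieve.
payload = "\n".join(fields)

if __name__ == "__main__":
    # repr() makes the control characters visible.
    print(repr(payload))
    print(payload)
```

Whether a given scanner app displays the LF as a line break is up to that app, so it is worth testing with the one actually in use.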

How to translate Unicode to and from matlab?

I have written MATLAB programs that produce plots and tables for chemical substances. I get my input mostly from Excel tables and a local MySQL database. My problem is that quite a few substance names contain Greek letters.
I want to create plots that use exactly the names specified by my colleagues, and also create tables that show the correct symbols.
An example:
If I create an Excel file containing "α-Methylstyrol" in the first cell and read it with [~,~,tmp] = xlsread('test.xlsx'), tmp will contain '(box with question mark)-Methylstyrol'. If I use the string in a plot (title(tmp)), it is shown as '(right arrow)-Methylstyrol'.
So far I have tried the native2unicode and unicode2native commands on the string, but they have no effect. I also tried replacing the characters one by one, but the number of characters I need to replace is growing far too fast, so I am really hoping for a more systematic way.
(We know there are also names that do not contain Greek letters, but we try to adhere to guidelines that prefer these names.)
As far as I understand, MATLAB does not support Unicode nicely. However, it is possible to put Greek letters in plot titles using LaTeX syntax.
title('\alpha-Methanol')
Even though it is not the nicest solution, I think it should be possible to replace Unicode symbols with LaTeX keywords.
I think your problem is that xlsread is not even getting the correct Greek letter out of your sheet.
Just give jexcelapi or poi a try. Both are Java libraries for importing xls files. In MATLAB you only need to add the jar file to your path via javaaddpath; the next steps are like basic Java coding.
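The replacement idea can be made systematic by deriving the keyword from the character's Unicode name instead of maintaining a hand-written list. Below is a minimal sketch in Python (not MATLAB), meant only to illustrate the mapping; `greek_to_tex` is a hypothetical helper name. It assumes the string already contains the correct Greek characters, and not every name it produces is necessarily a command the TeX interpreter knows (capital alpha, for instance, has no `\Alpha` in standard TeX).

```python
import unicodedata


def greek_to_tex(text):
    """Replace plain Greek letters with TeX-style keywords.

    'GREEK SMALL LETTER ALPHA'   -> \\alpha
    'GREEK CAPITAL LETTER DELTA' -> \\Delta
    Accented or variant letters (names with more than four words)
    are left unchanged.
    """
    out = []
    for ch in text:
        parts = unicodedata.name(ch, "").split()
        is_plain_greek = (
            len(parts) == 4
            and parts[0] == "GREEK"
            and parts[1] in ("SMALL", "CAPITAL")
            and parts[2] == "LETTER"
        )
        if not is_plain_greek:
            out.append(ch)
            continue
        letter = parts[3].lower()
        if letter == "lamda":  # Unicode spells lambda as LAMDA
            letter = "lambda"
        if parts[1] == "CAPITAL":
            letter = letter.capitalize()
        # Caveat: if the next character is a letter, the keyword
        # would need a delimiter; that case is not handled here.
        out.append("\\" + letter)
    return "".join(out)


if __name__ == "__main__":
    print(greek_to_tex("α-Methylstyrol"))
```

The same table-free approach would have to be ported to MATLAB (or applied while exporting the data) to be useful in the original workflow.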

Read a text file to a string using fortran77

Is it possible to read a text file into a string using Fortran 77?
I have a text file in the following format:
Some comments
Some comments
n1 m1 comment_with_unknown_number_of_words
..m1 lines of data..
n2 m2 comment_with_unknown_number_of_words
..m2 lines of data..
and so on
where n1, n2, ... are the orders of the objects, and m1, m2, ... are the numbers of lines that contain the data about these objects, respectively. I also want to store the comment of each object for further investigation.
How can I deal with this? Thank you so much in advance!
I can't believe nobody called me on this. My apologies: this in fact only grabs the first word of the comment.
------------original answer----
Not to recommend F77, but this isn't that tough a problem either. Just declare a character variable long enough to hold your longest comment and use a list-directed read.
integer m1,n1
character*80 comment
...
read(unit,*)m1,n1,comment
If you want to write it back out without padding a bunch of extra spaces, that's a bit of effort but hardly the end of the world.
What you cannot do at all in F77 is discern whether your file has trailing blanks at the end of a line, unless you go to direct-access reading.
------------improved answer
What you need to do is read the whole line as a string then read your integers from the string:
read(unit,'(a)')comment
read(comment,*)m1,n1
At this point comment contains the whole line, including your two integers (perhaps that will do the job for you). If you want to pull off the actual comment string, it requires a bit of coding (I have a ~40-line subroutine to split the string into words). I could post it if you are interested, but like others I'm more inclined to encourage you to see whether your code will work with a more modern compiler.
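For comparison, the same two-step idea (read the whole line as a string, then pull the two integers off the front and keep the rest as the comment) can be sketched in Python. This is not Fortran 77 and only illustrates the logic; `parse_objects` and the assumption of exactly two leading comment lines are taken from the file layout in the question, not from any existing code.

```python
def parse_objects(lines, header_lines=2):
    """Parse blocks of the form 'n m comment' followed by m data lines."""
    objects = []
    i = header_lines  # skip the leading comment lines
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between blocks
            continue
        # Split off the two integers; everything after them,
        # however many words, stays together as the comment.
        parts = lines[i].split(None, 2)
        n, m = int(parts[0]), int(parts[1])
        comment = parts[2].rstrip() if len(parts) > 2 else ""
        data = lines[i + 1:i + 1 + m]
        objects.append({"order": n, "comment": comment, "data": data})
        i += 1 + m
    return objects


if __name__ == "__main__":
    sample = """Some comments
Some comments
1 2 first object with several words
a b
c d
2 1 second object
e f"""
    for obj in parse_objects(sample.splitlines()):
        print(obj)
```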

Splitting text files column-wise

So I have an invoice that I need to make a report out of. It is about 250 pages long on average. I'm trying to create a script that extracts specific values from the invoice and builds a report. Here are my problems:
The invoice is a PDF laid out in two columns. On Linux, I want to use the 'pdftotext' command to convert it into multiple text files (one text file per PDF page). How do I do that?
I noticed that 'pdftotext' separates the left part of the page from the right part with 21 spaces. How do I move the right side of the data (identified by reading at least 21 spaces in a row) to the end of the file?
Since the file is large and I only need the last few pages, how do I delete all the text files in a script (not manually) up to the point where a keyword appears (let's say the keyword is "Start Invoice")?
I know this is a lot of questions, but I'm confused about what the Linux commands can do. Can you point me in the right direction? Thanks
PS: I'm using CentOS 5.2
What about:
pdftotext YOUR.pdf - | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT
pdftotext YOUR.pdf - | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
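The column split can also be done in a small script once the text has been extracted. Here is a minimal Python sketch, assuming (as stated in the question) that the two columns are separated by a run of at least 21 spaces; `split_columns` is a hypothetical helper name, and the text is assumed to have been extracted beforehand.

```python
import re

# A run of 21 or more spaces marks the gap between the columns.
COLUMN_GAP = re.compile(r" {21,}")


def split_columns(page_text):
    """Return the left column followed by the right column."""
    left, right = [], []
    for line in page_text.splitlines():
        parts = COLUMN_GAP.split(line, maxsplit=1)
        if parts[0].strip():
            left.append(parts[0].rstrip())
        if len(parts) == 2 and parts[1].strip():
            right.append(parts[1].strip())
    # The right column is moved to the end, after the left column.
    return "\n".join(left + right)


if __name__ == "__main__":
    page = "Item A" + " " * 21 + "Item C\n" + "Item B" + " " * 25 + "Item D"
    print(split_columns(page))
```

A left column that itself contains 21 or more consecutive spaces would be split in the wrong place, so the threshold should be checked against real pages.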

How to know if a PDF contains only images or has been OCR scanned for searching?

I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.
I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?
I'm planning on doing this in C# or VB.NET but I don't think being able to tell the two kinds of files apart is language dependent.
Scanned images converted to PDF that have been OCR'ed afterwards to make the text searchable normally contain the text parts rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when you search successfully, the highlighted hits are on the invisible text.
I'd recommend you to look at the XPDF-derived commandline tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html
Example usage of pdffonts:
C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]
This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).
C:\downloads\> pdffonts examle1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no      14  0
Arial                                TrueType          no  no  no      15  0
This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.
C:\downloads\> pdffonts examle2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
This PDF does not use a single font, and hence does not have any text embedded (so no OCR either).
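For a batch of files, the pdffonts check could be automated by counting the rows that follow the two header lines. The sketch below is Python rather than C# or VB.NET and is only meant to show the logic; `count_fonts`, `looks_image_only` and `run_pdffonts` are hypothetical helper names, and `pdffonts` is assumed to be on the PATH.

```python
import subprocess


def count_fonts(pdffonts_output):
    """Count font rows in pdffonts output (after header and dash line)."""
    lines = pdffonts_output.splitlines()
    return sum(1 for line in lines[2:] if line.strip())


def looks_image_only(pdffonts_output):
    """No font rows at all suggests the PDF has no text layer."""
    return count_fonts(pdffonts_output) == 0


def run_pdffonts(pdf_path):
    """Run pdffonts on one file and return its standard output."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A typical use would be `looks_image_only(run_pdffonts(path))` for each file, queuing for OCR only those where it returns true.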
Example usage of pdftotext:
C:\downloads\> pdftotext ^
-layout ^
cisco-ip-phone-7911-guide6.1.pdf ^
cisco-ip-phone-7911-guide6.1.txt
This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If there is no text in the PDF, you know there was no OCR.
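The pdftotext check can be scripted in the same spirit: extract the text to standard output and see whether anything other than whitespace comes back. This is a minimal Python sketch under the assumption that `pdftotext` is on the PATH; `has_text` and `extract_text` are hypothetical helper names.

```python
import subprocess


def has_text(extracted):
    """True if the extracted text holds anything besides whitespace.

    pdftotext separates pages with form feeds, which strip()
    treats as whitespace, so an image-only PDF yields False.
    """
    return bool(extracted.strip())


def extract_text(pdf_path):
    """Run pdftotext and return the text; '-' sends it to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

Files for which `has_text(extract_text(path))` is false are the candidates for OCR. A page that carries only a stamped header or page number would still count as having text, so a minimum-length threshold may be a better test in practice.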
Various PDF tools can tell you if there's text. Some are available as COM controls, and maybe even native .NET ones.
Open the document in Acrobat. Go to File -> Properties. Look in the "Advanced" section and find the PDF Producer. If it reads something like "Paper Capture..." then it has been OCR'd.
Hope this helps.
Apago's pdfspy extracts information from PDF into an XML file. It includes information about the document including images and text. For your project, the useful information includes image count & size and where there is OCR (hidden) text.
http://www.apagoinc.com/pdfspy
Sorry to dig up an old thread, but if you found this, have a look at my thread:
Batch OCR Program for PDFs
You can get extra information about a PDF by catting it on Unix/Linux/OS X or by opening it in "rb" mode in Python. (Of course that's Python, which you didn't want to use, but maybe your language has something equivalent.)
I use Everything by VoidTools to do a regex content search on the PDFs.
Any pdf with absolutely no text is a good candidate.
e.g.
.pdf regex:content:^$
This searches for all files with .pdf in the name that have empty content (^$ means: the start of a line and the end of a line with nothing in between). Alternatively, use regex:content:^(?![\s\S])
Use "dtsearch" to create an index for all the pdf files... then "view the log file" of the indexing process to check the list of pdf files that were not indexed.
A very low-tech solution: any file that has scanned text will undoubtedly contain the letter "a", so search for all files whose contents don't contain the letter a, i.e. "NOT a". Any file that shows up won't have been OCR'd.

Resources