Pulling elements from an SVG file apart - svg

I am using http://www.jasondavies.com/wordcloud to generate a word cloud in SVG format. Unfortunately the generator deletes all \ characters, which is bad since I want to 'wordcloudize' a list of TeX primitives.
I'd like to increase the spacing between the elements of the SVG automatically to be able to add \ manually in the SVG file. Is there any SVG-transformation which moves the elements away from the center while leaving the size of the elements as they are?

A much simpler solution is to download the page and its scripts to your local machine, open the local copy, and check that it still works.
Then open the cloud.js file and comment out the line:
word = word.replace(punctuation, "");
I think that should work, since that line appears to be what strips punctuation such as \ from each word.
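If you would still rather spread the words apart as originally asked, here is a minimal Python sketch. It assumes the generated SVG positions each word with a transform of the form translate(x,y) measured from the cloud center, which is worth checking in your own file. The function name spread_words and the regular expression are illustrative only; numbers in exponent notation are not handled.

```python
import re

# Matches the first translate(x, y) inside a tag (plain decimal numbers only).
TRANSLATE = re.compile(r'translate\(\s*(-?[\d.]+)[\s,]+(-?[\d.]+)\s*\)')

def spread_words(svg_text, factor):
    """Scale the translate() offset of every <text> element by `factor`.

    Font sizes and rotations are left untouched, so words keep their size
    and only move away from (or toward) the origin of their parent group.
    """
    def scale_tag(tag_match):
        def scale_xy(m):
            x = float(m.group(1)) * factor
            y = float(m.group(2)) * factor
            return f"translate({x:g},{y:g})"
        # Only the first translate() in the opening <text ...> tag is scaled.
        return TRANSLATE.sub(scale_xy, tag_match.group(0), count=1)

    # Only opening <text ...> tags are rewritten; <g> wrappers stay as they are.
    return re.sub(r'<text\b[^>]*>', scale_tag, svg_text)
```

A factor of 1.5 moves each word 50% further from the center, which may leave enough room to type the missing \ by hand.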

Related

How to highlight portions of a PDF file programmatically (eg. using command line)

I am interested in highlighting portions of a PDF programmatically, hopefully through a command line tool of sorts. My particular PDF file is not OCRed so the text is not searchable, but the particular places that I would like to highlight occur on every page in the same position. I was wondering if there is a tool to do this where I can input the rectangle positions in pixels into the command line tool and it would highlight the relevant portions for me.
Previous Findings
I have looked over the internet and found a few sites noting how to do this by searching for the text. Unfortunately that is not possible for me as my PDF does not have OCR.
I have searched stackexchange for similar questions and found
How to Highlight Text in PDF with commandline (windows)? and https://stackoverflow.com/questions/32713633/how-to-highlight-text-in-pdf-using-acrobat-reader-from-command-line but both were unanswered.
Potential Ideas
The first link had a possible lead with a given link to
Add comments to PDF files automagically with regular expressions
which uses Ghostscript to include annotations. Is it possible to use Ghostscript to highlight the pages in a similar fashion, by coordinates?
The second link mentioned using command line options for the adobe acrobat/reader exe file, but searching the relevant manual for the command line switches does not show any highlighting options. It may be possible that Adobe does not support the highlight option through command line anymore, which would be unfortunate.
My last idea would be using AutoHotkey to create a macro that does an actual highlight for me using a GUI program, but that would be the last resort.
What do you all think? Any ideas on what to do, or things to check out? I am willing to program out a solution and can work out the solution on Windows or Linux if necessary. Thanks in advance.
I would have thought a Highlight annotation was what you wanted. Highlight annotations are a type of text markup annotation and as such take a set of QuadPoints which describe the bounding box(es) to apply the annotation type to.
Since you say you know the co-ordinates this would seem appropriate for your use. Of course, you will have to create the Annotation on every page, and you will have to learn how to program this with a pdfmark, but I believe it should work.
Note that the co-ordinates are in user space (generally 72 points to the inch), NOT pixels. PDF is not an image format, so there is no concept of pixels except within included images.
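As a rough illustration of that unit conversion, the Python sketch below turns a rectangle measured in pixels on a rendered page image into user-space points. It assumes the default user space (72 units per inch, origin at the bottom-left of the page) and that the pixel measurements have their origin at the top-left. The function names and the corner order used for the QuadPoints list are assumptions; check the order against your viewer, since viewers are reported to differ on it.

```python
def pixel_rect_to_points(x_px, y_px, w_px, h_px, dpi, page_height_pt):
    """Convert a top-left-origin pixel rectangle to PDF user-space points.

    Returns (left, bottom, right, top) with the origin at the bottom-left
    of the page, assuming the default 72 units per inch.
    """
    left = x_px * 72 / dpi
    right = (x_px + w_px) * 72 / dpi
    # The y axis is flipped: pixel rows grow downward, PDF y grows upward.
    top = page_height_pt - y_px * 72 / dpi
    bottom = page_height_pt - (y_px + h_px) * 72 / dpi
    return left, bottom, right, top


def quad_points(left, bottom, right, top):
    """List the four corners as eight numbers for a QuadPoints array.

    Order used here (an assumption to verify): top-left, top-right,
    bottom-left, bottom-right.
    """
    return [left, top, right, top, left, bottom, right, bottom]
```

For example, a page rendered at 300 DPI is scaled by 72/300, so a 300-pixel-wide box becomes 72 points wide.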
There are quite a few officially unsupported command-line parameters for Acrobat and Acrobat Reader (acrord32.exe on Windows).
See: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_open_parameters.pdf
This includes a highlight parameter that takes four integers (left, right, top, bottom) in some unspecified units, with 0,0 at the top left of the page.
EXCEPT... I have been unable to get this to work.
I can pass in parameters to search and zoom but highlight never shows anything.
For instance:
start acrord32 /n /s /a "search=MS25441&zoom=300&page=1&highlight=0,55,0,65" floorplan1_ABM_cameras.pdf
This opens the file, searches for the string, and zooms to 300%, but no highlight appears no matter what coordinates I specify.

Inkscape doesn't allow editing SVG text lines once the file has been saved as plain SVG or processed with scour

I spent several days trying to find a solution to the following problem:
Create an SVG with text (simply click with the text tool to add text; do not drag to open a frame).
Press Enter to create a multi-line text and add several lines of text.
Save as plain SVG or optimized SVG,
or process the file with scour on the command line.
Reopen it in Inkscape: the text is displayed properly but you cannot edit it. When you try to move to the next line (with the mouse or the down arrow key), the cursor stays on the first line.
This annoying bug has been in Inkscape for some time, and it gets in the way when editing SVG for the web.
But there are solutions... See the following thread to manually (in vim) replace all tspans :
Vim search replace regex + incremental function
And see my answer below to correct the svg code in order to get your Inkscape files back in working order !!!
SVG files do not currently support multi-line text. Inkscape uses custom XML attributes to keep track of which spans of text are part of that block of text.
When you save as Optimized SVG, Inkscape strips out all its custom XML attributes and writes a vanilla SVG file. So the sense of what is a block of text is gone.
Sed is very useful in order to correct your files in a batch:
cd /home/user/my/svg/files
sed -i.bak 's|<svg|<svg\nxmlns:xlink="http://www.w3.org/1999/xlink"\nxmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"|g' *.svg
This runs on all SVG files in the current folder and:
Creates a .bak copy of each file, which you can rename back to .svg to recover the original (I nevertheless strongly recommend duplicating your working folder first, to avoid terrible mistakes when fiddling with sed)
Adds the required namespaces, separated by newlines (\n, as interpreted by GNU sed)
Then:
sed -i.2.bak 's|<tspan|<tspan sodipodi:role="line" |g' *.svg
This adds sodipodi:role="line" to every tspan tag in all SVG files in the current folder and creates .2.bak backups.
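One caveat with the sed commands is that running them twice adds the attributes twice. The Python sketch below applies the same two edits but skips anything that is already present, so it can be re-run safely. It is a text-level approximation under the same assumptions as the sed version (it does not parse the XML), and the function name restore_inkscape_lines is illustrative.

```python
import re

# Namespace declarations the sed command adds to the root <svg> tag.
NAMESPACES = {
    "xmlns:xlink": "http://www.w3.org/1999/xlink",
    "xmlns:sodipodi": "http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd",
}

def restore_inkscape_lines(svg_text):
    """Re-add the sodipodi markup Inkscape needs to edit multi-line text."""
    def fix_svg_tag(m):
        tag = m.group(0)
        for name, uri in NAMESPACES.items():
            if name + "=" not in tag:  # skip declarations already present
                tag = "<svg" + f'\n{name}="{uri}"' + tag[len("<svg"):]
        return tag

    def fix_tspan_tag(m):
        tag = m.group(0)
        if "sodipodi:role=" in tag:  # already marked as a line
            return tag
        return '<tspan sodipodi:role="line"' + tag[len("<tspan"):]

    # Only the first (root) <svg ...> tag gets the namespaces.
    svg_text = re.sub(r"<svg\b[^>]*>", fix_svg_tag, svg_text, count=1)
    return re.sub(r"<tspan\b[^>]*>", fix_tspan_tag, svg_text)
```

To use it on a folder, read each .svg file as text, pass it through the function, and write the result back after making your own backup.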

Hidden/Open words in an Image file such as PNG or JPG

As far as I can tell, my question is not related to steganography, or to the WinRAR solutions I've seen, where you are essentially hiding messages.
I am trying to figure out whether there is a way to insert a simple message into a file such as a JPG or PNG so that a program reading the file could later extract it, without encoding the message into the image itself through slight differences in pixels or whatever else steganography uses.
I basically just want a tag-along message that is part of the file itself, that is not shown by the image viewer but could be seen by a text reader of some kind.
I'm not sure how feasible this is because, for the most part, I don't understand the order/layout of a PNG/JPG/etc. file aside from the RGB pixel data. How does the file start? How does the image display tool know where to stop displaying? And so on.
The way I'm envisioning it would be something like:
pngStartCode -> RGBinfo --> png end code so image reader knows to stop -> start sequence that some kind of reader will recognize (possibly a new text reader) -> written text wanted to be communicated -> endcodeforreader
I may just be rambling about something ridiculous here but please let me know if this is at least possible.
You can use the following command (Windows command prompt).
Create a text file with your message, say "message.txt".
Now choose a target file (it can be any file, such as a.jpg, a.png, a.exe, etc.), say "image.jpg".
Now execute the following command:
copy /b "image.jpg"+"message.txt" "NewImage.jpg"
The above command combines the files (in binary mode) and creates a new file (in this case NewImage.jpg). Anyone who opens the image will just see a normal image. To see the text, open the file with any text editor (such as Notepad) and scroll to the end, where you will find the text.
This does not change any pixels or anything else in the image; it just appends the text to the image file.
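The same append-and-read-back idea can be sketched in Python. The delimiter MARKER is not part of any image format; it is a made-up separator added here so a program can find where the message starts. The sketch relies on image viewers ignoring bytes that follow the end of the image data, which is common but not guaranteed.

```python
# Hypothetical delimiter marking where the appended message begins.
MARKER = b"\n--MESSAGE--\n"

def append_message(image_bytes, message):
    """Return the image bytes with a delimiter and UTF-8 message appended."""
    return image_bytes + MARKER + message.encode("utf-8")

def extract_message(data):
    """Return the text after the last delimiter, or None if there is none."""
    _, sep, tail = data.rpartition(MARKER)
    return tail.decode("utf-8") if sep else None
```

In practice you would read image.jpg with open(path, "rb"), pass the bytes through append_message, and write the result to a new file.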
It sounds like OP is asking about comment tags in the PNG specifications (i.e. adding data but without intent to hide it).
PNG files are broken into "Chunks". The image part is usually divided into several IDAT chunks; the color, size, etc are stored in an IHDR chunk, etc.
The iTXt, tEXt, and zTXt chunks are used for conveying text information associated with the image, so typically you'd look into using a tool to add those types of chunks. tEXt holds plain uncompressed text, zTXt holds compressed text, and iTXt holds UTF-8 text that may optionally be compressed.
More info on the PNG specification, including what kinds of chunks are available, can be found here, and you can find chunk viewers on Google.
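To make the chunk layout concrete, here is a minimal Python sketch that inserts a tEXt chunk just before IEND and reads such chunks back. Each chunk is a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over the type and data. The function names are illustrative, and the sketch handles only tEXt (keyword, a zero byte, then Latin-1 text), not zTXt or iTXt.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type, data):
    """Build one chunk: length, type, data, CRC-32 of type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def iter_chunks(png):
    """Yield (offset, type, data) for each chunk up to and including IEND."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        yield pos, chunk_type, png[pos + 8:pos + 8 + length]
        if chunk_type == b"IEND":
            break
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC

def add_text_chunk(png, keyword, text):
    """Return a copy of the PNG with a tEXt chunk inserted before IEND."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    for pos, chunk_type, _ in iter_chunks(png):
        if chunk_type == b"IEND":
            return png[:pos] + make_chunk(b"tEXt", payload) + png[pos:]
    raise ValueError("no IEND chunk found")

def read_text_chunks(png):
    """Return a dict of keyword -> text for all tEXt chunks."""
    found = {}
    for _, chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
    return found
```

Because the message lives in a proper chunk, the file remains a valid PNG and the pixel data is not touched.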
For convenience, at the present time (January 2021) here are a couple of tools that will let you view, edit, and add chunks:
Windows 10: http://entropymine.com/jason/tweakpng/
Linux: https://www.systutorials.com/docs/linux/man/n-png/
Mac: https://apps.apple.com/us/app/inspectpng/id498851708?mt=12
NOTE: I do not vouch for the safety of any of the above links. Please use standard caution when downloading any file from the internet. If you don't have your own anti-virus, Virustotal has one online you can upload individual files to for free.

Using LocationTextExtractionStrategy in itextSharp for text coordinate

My goal is to extract data from a PDF, which may be in a table structure, into an Excel file.
Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as plain text, read left to right.
How can I go further, so that during
PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())
the text retains its coordinates in the resulting string?
For instance, if the first line in the PDF has text aligned to the right, the resulting string should contain enough padding spaces to keep the content right-aligned.
Please suggest how I might achieve this.
It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. This is very important and you need to keep this in mind as you work on this.
That said, you need to subclass TextExtractionStrategy and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal but it does show some more complex things that you can do.
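Once a custom strategy gives you each text chunk together with its x position, keeping the alignment comes down to mapping x coordinates onto character columns. The sketch below shows only that mapping step, in Python rather than C#, and does not use any iTextSharp API. The (x, text) tuples, the function name layout_line, and the fixed 80-column grid are all assumptions made for illustration.

```python
def layout_line(chunks, page_width, columns=80):
    """Place (x, text) chunks from one line onto a fixed-width character row.

    `x` is the left edge of the chunk in page units and `page_width` is the
    page width in the same units. Later chunks overwrite earlier ones if
    they collide, and text running past the last column is cut off.
    """
    row = [" "] * columns
    for x, text in sorted(chunks):
        start = int(x / page_width * columns)
        for offset, ch in enumerate(text):
            if 0 <= start + offset < columns:
                row[start + offset] = ch
    return "".join(row).rstrip()
```

A chunk that starts three quarters of the way across the page lands at column 60 of 80, so right-aligned text keeps its leading spaces in the output string.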

add a duplicate (hidden) text layer to a pdf for extra searching

My problem:
I have a pdf with lots of roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ, etc.). To make it easier to search within the pdf, I would like to add an additional layer, much as one does with hocr, where the same text is present without the diacritics.
When using full-text search engines I can index multiple terms at the same position (vector) - I would like to achieve the same effect here.
I have read lots about adding a hocr layer to scanned images, but I really just want to duplicate the text layer, pass it through a script that strips the diacritics (straightforward enough) and then adds it back in as a hidden but searchable layer.
Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)
Thanks :)
Edit: please let me know if the question is unclear.
Well I have a (slightly ugly and hackish) solution, so I thought I'd share it.
I'm using PDFMiner to extract the text, along with the co-ordinates. Then I'm using ReportLab to write the normalized versions of the text to a new pdf, in exactly the same position, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I've used a combination of FontForge and MuPDF to extract the required font(s) from the original pdf.
Finally, having created the new pdf, I'm using pdftk to merge it with the original.
It works pretty well, but has the downside that copying text out of the pdf results in the normalized text being copied too. But this is acceptable for my present purposes, and I can't see any way around it. The pdf spec. doesn't really support my objective, and so I don't imagine I can do better than this hackish solution.
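The normalization step itself can be sketched with the Python standard library. This is one common approach rather than necessarily the script used above: decompose each character (NFD) and drop the combining marks. Letters that have no decomposition, such as ø or ł, are left unchanged.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With the characters from the question, ṣ, ś, ṝ and ǎ become s, s, r and a.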
I have written something similar to add searchable text by OCR'ing images and converting them to PDF in C#. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image and this worked reasonably well.
In your case QuickPDF would allow you to extract the text strings along with bounding boxes and font details. You could then normalize your text and create the invisible text objects using the existing font and position information and then save it out to a new file.
This would basically give you the same PDF as you have now and also give you both the original and normalised text as you are getting now.
QuickPDF is a commercial library. If your solution works well for you, then there is no use buying a commercial engine. The nice thing, though, is that it only requires one SDK, and it would be worth a look if you had more than a few PDFs to convert.

Resources