How can I use this UTF-8 SVG string to get <svg>?

I am using domtoimage to try to turn my html div into <svg>. The function from domtoimage returns the string:
data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" width="288" height="1920"> .......... </svg>
I can set this string as the src of an <img>, but the other plugin I'm using (jsPDF) cannot use that; it needs <svg>.
I figured I could strip the beginning part off and add just the svg tag to the document, but this results in a really odd svg with "%0A" everywhere, which I cannot strip from the string.

If this is your code, the problem is:
you are passing text to the append function, which inserts strings as text instead of parsing them as markup.
Assigning to .innerHTML does parse a string as markup.
If you feed the append function a string, it will be displayed as a string.
https://developer.mozilla.org/en-US/docs/Web/API/Element/append
Note the documentation: DOMString objects are inserted as equivalent Text nodes.
The solution is to create an SVG DOM element:
let svgElem = document.createElementNS("http://www.w3.org/2000/svg", "svg");
svgElem.innerHTML = mySvg;
$('body').append(svgElem);
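The "%0A" sequences mentioned in the question are percent-encoded newlines, so decoding the payload of the data URL should remove them. The following is a minimal sketch, assuming the payload after the first comma is percent-encoded text (not base64); svgMarkupFromDataUrl is a hypothetical helper name and the sample string is made up for illustration.

```javascript
// Hypothetical helper: turn an SVG data URL into plain SVG markup.
// Assumes the payload is percent-encoded text, not base64.
function svgMarkupFromDataUrl(dataUrl) {
  const commaIndex = dataUrl.indexOf(",");
  if (!dataUrl.startsWith("data:image/svg+xml") || commaIndex === -1) {
    throw new Error("Not an SVG data URL");
  }
  // Everything after the first comma is the payload; decoding it turns
  // sequences such as %0A back into the characters they stand for.
  return decodeURIComponent(dataUrl.slice(commaIndex + 1));
}

// Made-up sample in the same shape as the string from the question.
const sample =
  'data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" width="288" height="1920">%0A<rect width="10" height="10"/>%0A</svg>';
const markup = svgMarkupFromDataUrl(sample);
console.log(markup);
```

In a browser, the decoded markup could then be turned into an element with new DOMParser().parseFromString(markup, "image/svg+xml").documentElement, which avoids nesting the markup inside a second svg element.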

Related

How to use groovy string replacement or XML Parser to edit SVG string

I have the text of an SVG image file in a string variable in Groovy. I need to modify it to be formed for embedding as a nested SVG. That means:
(1) remove the first line if the first line is an XML declaration that starts as “<?xml”. Or, another way of doing it would be to remove everything up until the start of the SVG tag, i.e., up until “<svg”
(2) within the SVG tag check to see if there is a width=“##” or height=“##” attribute. If so, revise the width and height to be 100%.
How can I do this, e.g, using string replacement or xml parser?
I have tried:
def parsedSVG = new XmlParser().parseText(svg)
if (parsedSVG.name() == "xml") // remove node here
But the problem is that parsedSVG.name() is the svg tag/node, not the xml definition tag. So it still leaves me unable to tell whether the svg starts with the xml tag.
I have also tried the approaches here GPathResult to String without XML declaration
But my execution environment does not support XML Node Printer and the InvokeHelper call is giving me errors.
As far as string replacement this is what I have tried. But the regular expression doesn’t seem to work. The log shows svgEnd is basically at the end of the svg file rather than being the end of the svg tag as intended...
String sanitizeSvg(String svg) {
    String cleanSvg = svg
    def xmlDecStart = svg.indexOf("<?xml")
    if (xmlDecStart > -1) {
        def xmlDecEnd = svg.indexOf("?>")
        cleanSvg = cleanSvg.substring(xmlDecEnd+2)
    }
    def svgStart = cleanSvg.indexOf("<svg")
    logDebug("svgStart is ${svgStart}")
    if (svgStart > -1) {
        def svgEnd = cleanSvg.indexOf('>', svgStart)
        logDebug("svgEnd is ${svgEnd}")
        if (svgEnd > -1) {
            String svgTag = cleanSvg.substring(svgStart, svgEnd-1)
            logDebug("SVG Tag is ${svgTag}")
            svgTag = svgTag.replaceAll('width="[^"]*', 'width="100%')
            svgTag = svgTag.replaceAll('height="[^"]*', 'height="100%')
            logDebug("Changed SVG Tag to ${svgTag}")
            cleanSvg.replaceAll('(<svg)([^>]*)',svgTag)
        }
    }
    return cleanSvg
}
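Two details in the function above look likely to matter: substring(svgStart, svgEnd-1) stops short of the closing ">", and the result of the final replaceAll is never assigned, so cleanSvg is returned unchanged. The string-based approach could be sketched as follows. This is written in JavaScript rather than Groovy, purely to illustrate the logic; it assumes double-quoted attributes and no ">" inside an attribute value, and the sample input is made up.

```javascript
// Hypothetical JavaScript counterpart of the sanitizeSvg function in the question.
function sanitizeSvg(svg) {
  const svgStart = svg.indexOf("<svg");
  if (svgStart === -1) return svg; // no <svg> tag found: leave the input alone

  // Drop everything before "<svg" (XML declaration, doctype, comments).
  const body = svg.slice(svgStart);
  const tagEnd = body.indexOf(">");
  if (tagEnd === -1) return body;

  // Take the opening tag including its closing ">" and rewrite the sizes.
  // The leading whitespace group keeps attributes like stroke-width untouched.
  const openTag = body
    .slice(0, tagEnd + 1)
    .replace(/(\s)width="[^"]*"/, '$1width="100%"')
    .replace(/(\s)height="[^"]*"/, '$1height="100%"');

  // Reassemble: rewritten opening tag plus the untouched remainder.
  return openTag + body.slice(tagEnd + 1);
}

const input =
  '<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg" width="288" height="1920"><rect width="10" height="10"/></svg>';
const cleaned = sanitizeSvg(input);
console.log(cleaned);
```

Only the opening svg tag is rewritten, so width and height attributes on child elements keep their original values.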

Fails to parse Hebrew text from pdf using iText 7 with .net

I am trying to read a PDF file with several pages, using iText 7 on .NET Core 2.1.
The following is my code:
Rectangle rect = new Rectangle(0, 0, 1100, 1100);
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
inputStr = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
inputStr gets the following string:
"\u0011\v\u000e\u0012\u0011\v\f)(*).=*%'\f*).5?5.5*.\a \u0011\u0002\u001b\u0001!\u0016\u0012\u001a!\u0001\u0015\u001a \u0014\n\u0015\u0017\u0001(\u001b)\u0001)\u0016\u001c*\u0012\u0001\u001d\u001a \u0016* \u0015\u0001\u0017\u0016\u001b\u001a(\n,\u0002>&\u00...
and in the Text Visualizer, it looks like this:
)(*).=*%'*).5?5.5*. !!
())* * (
,>&2*06) 2.-=9 )=&,

2..*0.5<.?
.110
)<1,3
  2.3*1>?)10/6
 (& >(*,1=0>>*1?

  2.63)&*,..*0.5
  206)&13'?*9*<
  *-5=0>
?*&..,?)..*0.5
It looks like I am unable to resolve the encoding, or there is a specific, custom encoding at the PDF level that I cannot read/parse.
Looking at the Document Properties, under Fonts it says the following:
Any ideas how can I parse the document correctly?
Thank you
Yaniv
Analysis of the shared files
file1_copyPasteWorks.pdf
The font definitions here have an invalid ToUnicode entry:
/ToUnicode/Identity-H
The ToUnicode value is specified as
A stream containing a CMap file that maps character codes to Unicode values
(ISO 32000-2, Table 119 — Entries in a Type 0 font dictionary)
Identity-H is a name, not a stream.
Nonetheless, Adobe Reader interprets this name, and for apparently any name starting with Identity- assumes the text encoding for the font to be UCS-2 (essentially UTF-16). As this indeed is the case for the character codes used in the document, copy&paste works, even if for the wrong reasons. (Without this ToUnicode value, Adobe Reader also returns nonsense.)
iText 7, on the other hand, first follows the Encoding value when mapping to Unicode, which gives unexpected results here.
Thus, in this case Adobe Reader arrives at a better result by reading meaning into an invalid piece of data.
file2_copyPasteFails.pdf
The font definitions here have valid but incomplete ToUnicode maps which only contain entries for the used Western European characters but not for Hebrew ones. They don't have Encoding entries.
Both Adobe Reader and iText 7 here trust the ToUnicode map and, therefore, cannot map the Hebrew glyphs.
How to parse
file1_copyPasteWorks.pdf
In case of this file the "problem" is that iText 7 applies the Encoding map. Thus, for decoding the text one can temporarily replace the Encoding map with an identity map:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDocument.GetPage(i);
    PdfDictionary fontResources = page.GetResources().GetResource(PdfName.Font);
    foreach (PdfObject font in fontResources.Values(true))
    {
        if (font is PdfDictionary fontDict)
            fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
    }
    string output = PdfTextExtractor.GetTextFromPage(page);
    // ... process output ...
}
This code shows the Hebrew characters for your file 1.
file2_copyPasteFails.pdf
Here I don't have a quick work-around. You may want to analyze multiple PDFs of that kind. If they all encode the Hebrew characters the same way, you can create your own ToUnicode map from that and inject it into the fonts like above.
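To give an idea of what such a hand-made ToUnicode map looks like, here is a rough sketch that generates the CMap text from a table of character codes. It is written in JavaScript only to illustrate the file format, it assumes two-byte character codes, and the codes in sampleMap are invented; the real ones would have to be worked out from the PDFs. buildToUnicodeCMap and hex4 are hypothetical helper names, and injecting the result into the font dictionaries as a stream is not shown.

```javascript
// Format a number as a 4-digit uppercase hex group, e.g. 0x5d0 -> "05D0".
function hex4(value) {
  return value.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: build the text of a ToUnicode CMap from a Map of
// two-byte character codes to Unicode strings.
function buildToUnicodeCMap(codeToUnicode) {
  const entries = [...codeToUnicode.entries()].sort((a, b) => a[0] - b[0]);
  const lines = [
    "/CIDInit /ProcSet findresource begin",
    "12 dict begin",
    "begincmap",
    "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
    "/CMapName /Adobe-Identity-UCS def",
    "/CMapType 2 def",
    "1 begincodespacerange",
    "<0000> <FFFF>",
    "endcodespacerange",
  ];
  // bfchar blocks are conventionally limited to 100 entries, so emit in chunks.
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100);
    lines.push(`${chunk.length} beginbfchar`);
    for (const [code, text] of chunk) {
      // The destination is written as UTF-16BE, one hex group per code unit.
      const utf16 = Array.from({ length: text.length }, (_, k) =>
        hex4(text.charCodeAt(k))
      ).join("");
      lines.push(`<${hex4(code)}> <${utf16}>`);
    }
    lines.push("endbfchar");
  }
  lines.push("endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end");
  return lines.join("\n");
}

// Invented codes for illustration only.
const sampleMap = new Map([
  [0x0194, "\u05D0"], // Hebrew letter alef
  [0x0195, "\u05D1"], // Hebrew letter bet
]);
const cmapText = buildToUnicodeCMap(sampleMap);
console.log(cmapText);
```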

Browser to Display MathML code instead of equation

Does anyone know how to force the browser to display MathML code instead of the equation?
PS: Rendering MathML to view as plain text gives the TeX output.
For example,
The axis on which the point (0,4) lie, is _____
Should be displayed as:
The axis on which the point <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">(</mo><mn>0</mn><mo>,</mo><mn>4</mn><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(0, 4)</annotation></semantics></math> lie, is _____
In most common configurations, if your output is not directly MathML, MathJax stores the MathML in the data-mathml attribute of a span tag which wraps the MathJax element.
This is what is displayed in the popup when you right-click a MathJax element: Show Math As -> MathML Code.
If your goal is to grab equations from HTML in MathML format, you can create a script which parses your document and collects all data-mathml attributes.
There are many ways to achieve that; this is just an example you may have to adapt:
function grabMathMl(){
    var spanMathMl = document.querySelectorAll(".MathJax");
    let results = [];
    let i = 0, ln = spanMathMl.length;
    for ( i; i < ln; ++i){
        if ( spanMathMl[i].hasAttribute("data-mathml") ){
            results.push(spanMathMl[i].dataset.mathml);
            // if you really want to replace content
            spanMathMl[i].innerHTML = "<textarea>"+spanMathMl[i].dataset.mathml+"</textarea>";
        }
    }
    return results;
}
// put this function in the MathJax queue, because you have to wait until MathJax has finished processing
MathJax.Hub.Queue(function(){
    let equations = grabMathMl();
    //console.log (equations.toString());// your equations in mathml
});
<script>
</script>
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<div>$$\left|\int_a^b fg\right| \leq \left(\int_a^b
f^2\right)^{1/2}\left(\int_a^b g^2\right)^{1/2}.$$</div>
<div>
\begin{equation} x+1\over\sqrt{1-x^2} \end{equation}
</div>
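If the MathML source should appear inline as text and not inside a textarea, the markup can be escaped before it is written back with innerHTML, so that the browser shows the tags instead of rendering them. A minimal sketch, with escapeHtml as a hypothetical helper name and a shortened MathML string as sample input:

```javascript
// Hypothetical helper: escape the characters that HTML treats as markup.
// The ampersand must be replaced first so later entities are not re-escaped.
function escapeHtml(markup) {
  return markup
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

const mathml =
  '<math xmlns="http://www.w3.org/1998/Math/MathML"><mn>0</mn></math>';
const visible = escapeHtml(mathml);
console.log(visible);
```

Assigning the unescaped string to an element's textContent instead of innerHTML has the same visible effect without a helper.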
For Word, this link may be of interest:
https://superuser.com/questions/340650/type-math-formulas-in-microsoft-word-the-latex-way#802093

embedded spaces in a computed field don't show

I have some code in a computed field where I want to embed some spaces between two values:
var rtn:String = doc.getItemValue("RefNo")[0] + " " + doc.getItemValue("Company")[0];
The computed field's Display Type is text and the content type is String, but the display strips out all the extra spaces. Is there a function like insertSpaces(5) that would insert 5 hard spaces?
Figured it out: insert "&#160;" + "&#160;" and display as HTML. Fairly simple but really awkward.
Regular spaces are written to the HTML output as whitespace, but the browser collapses them when rendering.
You can add &nbsp; instead (5 times in your case).
An alternative way is to set style="white-space: pre;" (works IE8+ and other browsers)
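The insertSpaces(5) function asked for in the question can be sketched in a few lines. This is plain JavaScript, not verified in the XPages server-side environment; insertSpaces is the hypothetical name from the question, the sample values are made up, and the field still has to be displayed as HTML for the entities to take effect.

```javascript
// Hypothetical helper: return `count` non-breaking-space entities.
function insertSpaces(count) {
  // Joining count + 1 empty slots yields count copies of the separator.
  return new Array(count + 1).join("&#160;");
}

// Made-up values standing in for the RefNo and Company items.
const rtn = "REF-001" + insertSpaces(5) + "Acme";
console.log(rtn);
```

The Array join form is used here instead of String.prototype.repeat because it also works in older JavaScript engines.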

String replacement in latex

I'd like to know how to replace parts of a string in latex. Specifically I'm given a measurement (like 3pt, 10mm, etc) and I'd like to remove the units of that measurement (so 3pt-->3, 10mm-->10, etc).
The reason why I'd like a command to do this is in the following piece of code:
\newsavebox{\mybox}
\sbox{\mybox}{Hello World!}
\newlength{\myboxw}
\newlength{\myboxh}
\settowidth{\myboxw}{\usebox{\mybox}}
\settoheight{\myboxh}{\usebox{\mybox}}
\begin{picture}(\myboxw,\myboxh)
\end{picture}
Basically I create a savebox called mybox. I insert the words "Hello World" into mybox. I create a new length/width, called myboxw/h. I then get the width/height of mybox, and store this in myboxw/h. Then I set up a picture environment whose dimensions correspond to myboxw/h. The trouble is that myboxw is returning something of the form "132.56pt", while the input to the picture environment has to be dimensionless: "\begin{picture}(132.56,132.56)".
So, I need a command which will strip the units of measurement from a string.
Thanks.
Use the following trick:
{
\catcode`p=12 \catcode`t=12
\gdef\removedim#1pt{#1}
}
Then write:
\edef\myboxwnopt{\expandafter\removedim\the\myboxw}
\edef\myboxhnopt{\expandafter\removedim\the\myboxh}
\begin{picture}(\myboxwnopt,\myboxhnopt)
\end{picture}
Consider the xstring package at https://www.ctan.org/pkg/xstring.
The LaTeX kernel - latex.ltx - already provides \strip@pt, which you can use to strip away any reference to a length. Additionally, there's no need to create a length for the width and/or height of a box; \wd<box> returns the width, while \ht<box> returns the height:
\documentclass{article}
\makeatletter
\let\stripdim\strip@pt % User interface for \strip@pt
\makeatother
\begin{document}
\newsavebox{\mybox}
\savebox{\mybox}{Hello World!}
\begin{picture}(\stripdim\wd\mybox,\stripdim\ht\mybox)
\put(0,0){Hello world}
\end{picture}
\end{document}
