Fails to parse Hebrew text from pdf using iText 7 with .net

I am trying to read a PDF file with several pages using iText 7 on .NET Core 2.1.
The following is my code:
Rectangle rect = new Rectangle(0, 0, 1100, 1100);
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
inputStr = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
inputStr gets the following string:
"\u0011\v\u000e\u0012\u0011\v\f)(*).=*%'\f*).5?5.5*.\a \u0011\u0002\u001b\u0001!\u0016\u0012\u001a!\u0001\u0015\u001a \u0014\n\u0015\u0017\u0001(\u001b)\u0001)\u0016\u001c*\u0012\u0001\u001d\u001a \u0016* \u0015\u0001\u0017\u0016\u001b\u001a(\n,\u0002>&\u00...
and in the Text Visualizer, it looks like this:
)(*).=*%'*).5?5.5*. !!
())* * (
,>&2*06) 2.-=9 )=&,

2..*0.5<.?
.110
)<1,3
  2.3*1>?)10/6
 (& >(*,1=0>>*1?

  2.63)&*,..*0.5
  206)&13'?*9*<
  *-5=0>
?*&..,?)..*0.5
It looks like I am unable to resolve the encoding, or there is a specific, custom encoding at the PDF level that I cannot read or parse.
Looking at the Document Properties, under Fonts it says the following:
Any ideas how I can parse the document correctly?
Thank you
Yaniv

Analysis of the shared files
file1_copyPasteWorks.pdf
The font definitions here have an invalid ToUnicode entry:
/ToUnicode/Identity-H
The ToUnicode value is specified as
A stream containing a CMap file that maps character codes to Unicode values
(ISO 32000-2, Table 119 — Entries in a Type 0 font dictionary)
Identity-H is a name, not a stream.
Nonetheless, Adobe Reader interprets this name, and for apparently any name starting with Identity- assumes the text encoding for the font to be UCS-2 (essentially UTF-16). As this indeed is the case for the character codes used in the document, copy&paste works, even if for the wrong reasons. (Without this ToUnicode value, Adobe Reader also returns nonsense.)
iText 7, on the other hand, for mapping to Unicode first follows the Encoding value with unexpected results.
Thus, in this case Adobe Reader arrives at a better result by reading meaning into an invalid piece of data.
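To make the UCS-2 interpretation concrete, here is a minimal Python sketch (it is not iText code). It assumes, as described above for file 1, that every character code is two bytes long and already equals the UTF-16BE code unit of the character. The sample bytes and the function name decode_identity_codes are invented for illustration.

```python
def decode_identity_codes(raw: bytes) -> str:
    """Interpret a string of 2-byte character codes as UTF-16BE text."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes (2 bytes per code)")
    return raw.decode("utf-16-be")


# Hypothetical string operand from a content stream: codes 05E9 05DC 05D5 05DD
sample = bytes.fromhex("05E905DC05D505DD")
text = decode_identity_codes(sample)

# Print the escaped form so the output does not depend on the console encoding.
print(ascii(text))
```

If the codes in a font are not Unicode values to begin with (as in file 2), this shortcut does not apply.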
file2_copyPasteFails.pdf
The font definitions here have valid but incomplete ToUnicode maps which only contain entries for the used Western European characters but not for Hebrew ones. They don't have Encoding entries.
Both Adobe Reader and iText 7 here trust the ToUnicode map and, therefore, cannot map the Hebrew glyphs.
How to parse
file1_copyPasteWorks.pdf
In case of this file the "problem" is that iText 7 applies the Encoding map. Thus, for decoding the text one can temporarily replace the Encoding map with an identity map:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDocument.GetPage(i);
    PdfDictionary fontResources = page.GetResources().GetResource(PdfName.Font);
    if (fontResources != null)
    {
        // Overwrite each font's Encoding in memory so the character codes pass through unchanged.
        foreach (PdfObject font in fontResources.Values(true))
        {
            if (font is PdfDictionary fontDict)
                fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
        }
    }
    string output = PdfTextExtractor.GetTextFromPage(page);
    // ... process output ...
}
This code shows the Hebrew characters for your file 1.
file2_copyPasteFails.pdf
Here I don't have a quick work-around. You may want to analyze multiple PDFs of that kind. If they all encode the Hebrew characters the same way, you can create your own ToUnicode map from that and inject it into the fonts like above.
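As a rough illustration of what "create your own ToUnicode map" could involve, the following Python sketch builds the text of a ToUnicode CMap from a code-to-character table. Everything specific in it is an assumption: the function name build_tounicode_cmap is made up, the two sample codes are placeholders rather than values taken from your file, and the framing lines simply follow the usual layout of such CMaps. Injecting the result into the font dictionaries as a stream is not shown.

```python
def build_tounicode_cmap(mapping: dict) -> str:
    """Return ToUnicode CMap text with one bfchar entry per character code."""
    entries = []
    for code, text in sorted(mapping.items()):
        # Unicode targets are written as UTF-16BE hex digits.
        target = text.encode("utf-16-be").hex().upper()
        entries.append(f"<{code:04X}> <{target}>")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # Sections are kept to at most 100 entries each, the limit commonly
    # cited for bfchar blocks in CMap files.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


# Placeholder table: assume code 0x0003 draws alef and code 0x0004 draws bet.
cmap_text = build_tounicode_cmap({0x0003: "\u05d0", 0x0004: "\u05d1"})
print(cmap_text)
```

The hard part remains finding the table itself, which is why comparing several PDFs of the same kind is suggested above.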

Related

In GeoDMS, how can I transform string coordinates to dpoint?

I have problems converting coordinates in string format to dpoint format in GeoDMS GUI version 7.177.
I'm trying to read the BAG (basisadministratie gemeenten, Dutch municipality administration, a giant geo file) into GeoDMS directly from the Kadaster. It's first been converted from .xml into .csv, then the shapes of the buildings have been transformed in a format seemingly the same as the Vesta format, e.g.:
{5:{249943.307,593511.272}{249948.555,593512.791}{249946.234,593520.809}{249940.987,593519.29}{249943.307,593511.272}}
I am able to read the transformed CSV file into GeoDMS, write it as strings to .dmsdata format for speed, and load it from there into GeoDMS again. However, when I try to transform the strings into coordinates, I get the error
DPoint Error: Cannot find operator for these arguments:
arg1 of type DataItem<String>
Possible cause: argument type mismatch. Check the types of the used arguments.
My GeoDMS code looks like
unit<uint32> altBag:
    storageName = 'c:/zandbak/output/bagPND.fss'
    , storageReadOnly = 'true'
    , dialogType = 'map'
    , dialogData = 'geometry'
{
    attribute <string> pandGeometrie; // works and looks good
    attribute <dpoint> geometry := dpoint(pandGeometrie); // doesn't work, error above
    attribute <rdc> geometry2 := pandGeometrie[rdc]; // doesn't work either
}
Is there a way to do this? Or is string to dpoint (or another type of point) unsupported and should I transform the CSV to shape file first?

You can try this (GeometryStr stands for the string attribute, pandGeometrie in the question):
attribute<dpoint> Geometry(poly) := dpolygon(GeometryStr);
and if a specific projection is required:
attribute<rdc_meter> Geometry2(poly) := value(GeometryStr, rdc_meter);

ReportLab - Metadata, CreationDate and ModificationDate

How can I change the metadata fields CreationDate and ModificationDate when I create a PDF with ReportLab?

Take a look at where modification and creation dates are set:
D['ModDate'] = D["CreationDate"] = \
    Date(ts=document._timeStamp,dateFormatter=self._dateFormatter)
# ...
return PDFDictionary(D).format(document)
Basically, the metadata is a dictionary serialized at the end of the binary string; the start of the string holds the file contents (document).
Inside ReportLab, the workflow you ask about could be:
1. create a canvas
2. draw something on it
3. get the document from the canvas
4. create a PDFDictionary with artificial modification and creation dates
5. format the document with the PDFDictionary
6. save to file
The question "Change metadata of pdf file with pypdf" pursues a similar goal.

The ReportLab (currently 3.5) Canvas provides public methods, like Canvas.setAuthor(), to set the /Author, /Title, and other metadata fields (called "Internal File Annotations" in the docs, section 4.5).
However, there is no method for overriding the /CreationDate or /ModDate.
If you only need to change the formatting of the dates, you can simply use the Canvas.setDateFormatter() method.
The methods described above modify a PDFInfo object, as can be seen in the source, but this is part of a private PDFDocument (as in Canvas._doc.info).
If you really do need to override the dates, you could either hack into the private parts of the canvas, or just search the content of the resulting file object for /CreationDate (...) and /ModDate (...), and replace the value between the parentheses.
Here's a quick-and-dirty example that does just that:
import io
import re
from reportlab.pdfgen import canvas
# write a pdf in a file-like object
file_like_obj = io.BytesIO()
p = canvas.Canvas(file_like_obj)
# set some metadata
p.setAuthor('djvg')
# ... add some content here ...
p.save()
# replace the /CreationDate value (similar for /ModDate);
# the parentheses are escaped so that they match the literal "(...)" around the date
# note: a replacement of a different length shifts the byte offsets recorded in the file's xref table
pdf_bytes = file_like_obj.getvalue()
pdf_bytes = re.sub(rb'/CreationDate \(.*?\)', b'/CreationDate (D:19700101010203+01)', pdf_bytes)
# write to actual file
with open('test.pdf', 'wb') as pdf:
    pdf.write(pdf_bytes)
The example above just illustrates the principle. Obviously one could use fancy regular expressions with lookaround etc.
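A slightly more general variant, as a sketch only: it handles both keys in one pass and uses a replacement function, so the new value is never interpreted as a regex template. It assumes the date strings contain no nested or escaped parentheses. The helper name override_dates and the sample bytes are made up for illustration; the sample is not real ReportLab output.

```python
import re

# Matches "/CreationDate (...)" or "/ModDate (...)".
DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate) \((.*?)\)")


def override_dates(pdf_bytes: bytes, new_date: bytes) -> bytes:
    """Replace the values of /CreationDate and /ModDate with new_date."""
    def repl(match):
        # Keep the key that was matched, swap only the value.
        return b"/" + match.group(1) + b" (" + new_date + b")"
    return DATE_ENTRY.sub(repl, pdf_bytes)


# Hypothetical fragment of an Info dictionary.
sample = (b"<< /Author (djvg) /CreationDate (D:20190101120000+00'00') "
          b"/ModDate (D:20190101120000+00'00') >>")
patched = override_dates(sample, b"D:19700101010203+01'00'")
print(patched)
```

The new value here has the same length as the old one, so the byte offsets stored elsewhere in a real file would not move.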
From the pdf spec:
Date values used in a PDF shall conform to a standard date format, which closely follows that of the international standard ASN.1 (Abstract Syntax Notation One), defined in ISO/IEC 8824. A date shall be a text string of the form
(D:YYYYMMDDHHmmSSOHH'mm)
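For reference, a small Python sketch that produces a string in the quoted form from a timezone-aware datetime. The function name pdf_date is hypothetical. Note that the PDF 1.7 edition of the spec and some writers add a trailing apostrophe after the minutes of the offset, which this sketch does not do.

```python
from datetime import datetime, timedelta, timezone


def pdf_date(dt: datetime) -> str:
    """Format a timezone-aware datetime as D:YYYYMMDDHHmmSSOHH'mm."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("expected a timezone-aware datetime")
    minutes = int(offset.total_seconds()) // 60
    sign = "+" if minutes >= 0 else "-"
    hh, mm = divmod(abs(minutes), 60)
    return f"D:{dt:%Y%m%d%H%M%S}{sign}{hh:02d}'{mm:02d}"


# One hour east of UTC, matching the offset used in the example above.
example = pdf_date(datetime(1970, 1, 1, 1, 2, 3, tzinfo=timezone(timedelta(hours=1))))
print(example)
```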

Extract Arabic Text using iTextsharp get number only?

I am trying to extract Arabic text from a PDF file, but only the numbers are extracted. The result looks like this:
: 7234569 1439/08/07 : : 1 2375173941 14 08 6 39266 1050672243 2280 30 400 24 415 24 15 720 30 402 30 499 14 07 1 610117038085 0 1069508677 0 :
My code:
public static string GetTextFromAllPages(string pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    string result = null;
    //for (int i = 1; i <= reader.NumberOfPages; i++)
    result = PdfTextExtractor.GetTextFromPage(reader, 1, new LocationTextExtractionStrategy());
    return result;
}
Any help, please?

The embedded font for Arabic glyphs in your PDF contains this ToUnicode CMap:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
According to ISO 32000-1, section 9.10.3 ToUnicode CMaps:
It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.
Unfortunately your CMap does not use these operators at all and, therefore, does not define any mappings to Unicode.
Furthermore the font has an Encoding of Identity-H and its descendant CIDFont has a ROS Adobe-Identity-0 which means that character code, CID, and GID values are equal for a character but doesn't imply any mapping of them to Unicode.
Thus, the font is missing the information required for text extraction according to ISO 32000-1 section 9.10.2 Mapping Character Codes to Unicode Values.
(In such a situation text extractors can only guess, and such guesswork usually only works for a special type of documents the extractor is optimized for. You might want to try to enhance iText to be able to guess correctly in your case but that will require you to study the PDF specification, the iText text extraction code, and your sample files in detail.)
By the way, a good first test whether text extraction is feasible is to open the PDF in Adobe Reader and to copy and paste the text in question to an editor or word processor. If this does not work (and in the case at hand it does not work), chances are that the file does have incomplete or misleading information for text extraction (or none at all).
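The observation about the missing operators can be turned into a crude automated check. The Python sketch below assumes you have already obtained the decoded text of the ToUnicode stream (extracting it from the PDF is not shown), and the function name defines_unicode_mappings is invented. It only looks for the operator names, so it cannot tell whether the mappings present are complete.

```python
MAPPING_OPERATORS = ("beginbfchar", "beginbfrange")


def defines_unicode_mappings(cmap_text: str) -> bool:
    """Return True if the CMap text contains a bfchar or bfrange section."""
    tokens = cmap_text.split()
    return any(op in tokens for op in MAPPING_OPERATORS)


# The ToUnicode CMap quoted above.
cmap_from_question = """
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
"""

has_mappings = defines_unicode_mappings(cmap_from_question)
print(has_mappings)
```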

c# uploading data error -> return "�" for space

I am using C# with an HTTP helper and a StreamReader to read text. But when I upload a text file containing this text
"Look  exactly what I found on # eBay! Willy Lee LifeLike  Chatting Butler Prop Motion Sen"
each space is replaced by "�", and that is what the code then works with.
The code for reading the text is:
List<string> list = new List<string>();
StreamReader reader = new StreamReader(filepath);
string text = "";
while ((text = reader.ReadLine()) != null)
{
    if (!string.IsNullOrEmpty(text))
    {
        list.Add(text);
    }
}
reader.Close();
return list;
The list contains this data:
"Look��exactly�what�I�found�on�#�eBay!�Willy�Lee�LifeLike��Chatting�Butler�Prop�Motion�Sen"

This looks like an encoding problem. I have seen similar text problems when multibyte-encoded text is shown on a non-Unicode web page, for example one using Windows-1252 or another CP-125x code page.
The same seems to apply here: the text looks UTF-8 encoded but is displayed in ANSI mode. The spaces are probably "special" spaces, such as the ones MS Word sometimes inserts, while the English characters are single bytes in UTF-8 (as are all characters below code 128), so they are compatible with the ANSI code table and display correctly.
A second possibility: if the text was written to a file and saved without a BOM at the beginning, the text editor may not recognize the content as Unicode and may open it in ANSI (plain ASCII) mode.
If you give more details about where the data is read from and where it is saved and opened, I can give more concrete advice.
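One concrete way such output can arise, shown as a Python sketch. It rests on an assumption that the question does not confirm: that the file stores non-breaking spaces (U+00A0) as single Windows-1252 bytes and is then decoded as UTF-8. The sample sentence is shortened from the question.

```python
original = "Look\u00a0exactly what I found"  # U+00A0 is a non-breaking space

# In Windows-1252 the non-breaking space is the single byte 0xA0.
stored = original.encode("cp1252")

# Decoded as UTF-8, a lone 0xA0 byte is invalid and becomes U+FFFD.
misread = stored.decode("utf-8", errors="replace")

# Decoded with the encoding the file was written in, the text survives.
reread = stored.decode("cp1252")

print(ascii(misread))
print(ascii(reread))
```

If that assumption holds, the analogous step in the C# code would be to pass the matching encoding explicitly when constructing the StreamReader instead of relying on its default.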

String replacement in latex

I'd like to know how to replace parts of a string in latex. Specifically I'm given a measurement (like 3pt, 10mm, etc) and I'd like to remove the units of that measurement (so 3pt-->3, 10mm-->10, etc).
The reason why I'd like a command to do this is in the following piece of code:
\newsavebox{\mybox}
\sbox{\mybox}{Hello World!}
\newlength{\myboxw}
\newlength{\myboxh}
\settowidth{\myboxw}{\usebox{\mybox}}
\settoheight{\myboxh}{\usebox{\mybox}}
\begin{picture}(\myboxw,\myboxh)
\end{picture}
Basically, I create a savebox called mybox and insert the words "Hello World!" into it. I create new lengths myboxw and myboxh, measure the width and height of mybox, and store them there. Then I set up a picture environment whose dimensions correspond to myboxw and myboxh. The trouble is that myboxw returns something of the form "132.56pt", while the arguments of the picture environment have to be dimensionless: "\begin{picture}(132.56,132.56)".
So, I need a command which will strip the units of measurement from a string.
Thanks.

Use the following trick:
{
\catcode`p=12 \catcode`t=12
\gdef\removedim#1pt{#1}
}
Then write:
\edef\myboxwnopt{\expandafter\removedim\the\myboxw}
\edef\myboxhnopt{\expandafter\removedim\the\myboxh}
\begin{picture}(\myboxwnopt,\myboxhnopt)
\end{picture}

Consider the xstring package at https://www.ctan.org/pkg/xstring.

The LaTeX kernel - latex.ltx - already provides \strip@pt, which you can use to strip away any reference to a length. Additionally, there's no need to create a length for the width and/or height of a box; \wd<box> returns the width, while \ht<box> returns the height:
\documentclass{article}
\makeatletter
\let\stripdim\strip@pt % User interface for \strip@pt
\makeatother
\begin{document}
\newsavebox{\mybox}
\savebox{\mybox}{Hello World!}
\begin{picture}(\stripdim\wd\mybox,\stripdim\ht\mybox)
\put(0,0){Hello world}
\end{picture}
\end{document}
