I am trying to extract Arabic text from a PDF file, but only the numbers are extracted. The result looks like this:
: 7234569 1439/08/07 : : 1 2375173941 14 08 6 39266 1050672243 2280 30 400 24 415 24 15 720 30 402 30 499 14 07 1 610117038085 0 1069508677 0 :
My code:
public static string GetTextFromAllPages(string pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    string result = null;
    //for (int i = 1; i <= reader.NumberOfPages; i++)
    result = PdfTextExtractor.GetTextFromPage(reader, 1, new LocationTextExtractionStrategy());
    return result;
}
Any help, please?
The embedded font for Arabic glyphs in your PDF contains this ToUnicode CMap:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
According to ISO 32000-1, section 9.10.3 ToUnicode CMaps:
It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.
Unfortunately your CMap does not use these operators at all and, therefore, does not define any mappings to Unicode.
Furthermore, the font has an Encoding of Identity-H and its descendant CIDFont has a ROS of Adobe-Identity-0. This means that the character code, CID, and GID values are equal for a character, but it does not imply any mapping of them to Unicode.
Thus, the font is missing the information required for text extraction according to ISO 32000-1 section 9.10.2 Mapping Character Codes to Unicode Values.
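To make the point about the missing operators concrete, here is a minimal Python sketch that counts the bfchar and bfrange entries in the text of a ToUnicode CMap. The function name and the second, two-entry CMap are hypothetical illustrations; only the first CMap comes from the question. It works on already decompressed CMap text and does no real PDF parsing.

```python
import re

def count_unicode_mappings(cmap_text: str) -> int:
    """Count bfchar and bfrange entries in the text of a ToUnicode CMap."""
    total = 0
    for kind in ("bfchar", "bfrange"):
        # Each section looks like "<n> beginbfchar ... endbfchar" (or bfrange).
        pattern = rf"begin{kind}(.*?)end{kind}"
        for body in re.findall(pattern, cmap_text, flags=re.DOTALL):
            # Every non-empty line inside a section is one mapping entry.
            total += sum(1 for line in body.splitlines() if line.strip())
    return total

# The CMap from the question: a code space range, but no mappings at all.
QUESTION_CMAP = """\
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
endcmap
"""

# Hypothetical CMap fragment that does define two mappings.
HYPOTHETICAL_CMAP = """\
2 beginbfchar
<0003> <0020>
<0024> <0627>
endbfchar
"""

print(count_unicode_mappings(QUESTION_CMAP))      # 0
print(count_unicode_mappings(HYPOTHETICAL_CMAP))  # 2
```

A count of zero means the CMap carries no code-to-Unicode information, which is the situation described above.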
(In such a situation text extractors can only guess, and such guesswork usually only works for a special type of documents the extractor is optimized for. You might want to try to enhance iText to be able to guess correctly in your case but that will require you to study the PDF specification, the iText text extraction code, and your sample files in detail.)
By the way, a good first test whether text extraction is feasible is to open the PDF in Adobe Reader and to copy and paste the text in question to an editor or word processor. If this does not work (and in the case at hand it does not work), chances are that the file does have incomplete or misleading information for text extraction (or none at all).
I am trying to read a PDF file with several pages using iText 7 on .NET Core 2.1.
The following is my code:
Rectangle rect = new Rectangle(0, 0, 1100, 1100);
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
inputStr = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
inputStr gets the following string:
"\u0011\v\u000e\u0012\u0011\v\f)(*).=*%'\f*).5?5.5*.\a \u0011\u0002\u001b\u0001!\u0016\u0012\u001a!\u0001\u0015\u001a \u0014\n\u0015\u0017\u0001(\u001b)\u0001)\u0016\u001c*\u0012\u0001\u001d\u001a \u0016* \u0015\u0001\u0017\u0016\u001b\u001a(\n,\u0002>&\u00...
and in the Text Visualizer, it looks like that:
)(*).=*%'*).5?5.5*. !!
())* * (
,>&2*06) 2.-=9 )=&,
2..*0.5<.?
.110
)<1,3
2.3*1>?)10/6
(& >(*,1=0>>*1?
2.63)&*,..*0.5
206)&13'?*9*<
*-5=0>
?*&..,?)..*0.5
It looks like I am unable to resolve the encoding, or there is a specific, custom encoding at the PDF level that I cannot read or parse.
Looking at the Document Properties, under Fonts it says the following:
Any ideas how I can parse the document correctly?
Thank you
Yaniv
Analysis of the shared files
file1_copyPasteWorks.pdf
The font definitions here have an invalid ToUnicode entry:
/ToUnicode/Identity-H
The ToUnicode value is specified as
A stream containing a CMap file that maps character codes to Unicode values
(ISO 32000-2, Table 119 — Entries in a Type 0 font dictionary)
Identity-H is a name, not a stream.
Nonetheless, Adobe Reader interprets this name, and for apparently any name starting with Identity- assumes the text encoding for the font to be UCS-2 (essentially UTF-16). As this indeed is the case for the character codes used in the document, copy&paste works, even if for the wrong reasons. (Without this ToUnicode value, Adobe Reader also returns nonsense.)
iText 7, on the other hand, for mapping to Unicode first follows the Encoding value with unexpected results.
Thus, in this case Adobe Reader arrives at a better result by interpreting meaning into an invalid piece of data (and without that also returns nonsense).
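The UCS-2 interpretation described for file 1 can be illustrated with a short Python sketch. The two character codes below are hypothetical sample values (the Hebrew letters shin and lamed), not bytes taken from the shared file; the point is only that, under this interpretation, each 2-byte character code is itself the UTF-16BE code unit.

```python
def decode_as_ucs2(character_codes: bytes) -> str:
    """Interpret 2-byte character codes directly as UTF-16BE code units."""
    return character_codes.decode("utf-16-be")

# Hypothetical string operand from a content stream: codes 05E9 and 05DC.
sample_codes = bytes.fromhex("05E905DC")
print(decode_as_ucs2(sample_codes))
```

If the printed text is readable Hebrew, the character codes were UCS-2 values, which is the assumption Adobe Reader appears to make here.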
file2_copyPasteFails.pdf
The font definitions here have valid but incomplete ToUnicode maps which only contain entries for the used Western European characters but not for Hebrew ones. They don't have Encoding entries.
Both Adobe Reader and iText 7 here trust the ToUnicode map and, therefore, cannot map the Hebrew glyphs.
How to parse
file1_copyPasteWorks.pdf
In case of this file the "problem" is that iText 7 applies the Encoding map. Thus, for decoding the text one can temporarily replace the Encoding map with an identity map:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDocument.GetPage(i);
    PdfDictionary fontResources = page.GetResources().GetResource(PdfName.Font);
    foreach (PdfObject font in fontResources.Values(true))
    {
        if (font is PdfDictionary fontDict)
            fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
    }
    string output = PdfTextExtractor.GetTextFromPage(page);
    // ... process output ...
}
This code shows the Hebrew characters for your file 1.
file2_copyPasteFails.pdf
Here I don't have a quick work-around. You may want to analyze multiple PDFs of that kind. If they all encode the Hebrew characters the same way, you can create your own ToUnicode map from that and inject it into the fonts like above.
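As a rough sketch of the "create your own ToUnicode map" idea, the following Python function generates CMap text from a table of character codes and the text they stand for. The function name and the sample table are hypothetical; the real table has to come from analysing the PDFs. Injecting the result into the font dictionary as a stream is left out.

```python
def build_tounicode_cmap(code_to_text: dict[int, str]) -> str:
    """Return ToUnicode CMap text with one bfchar entry per 2-byte code."""
    entries = []
    for code, text in sorted(code_to_text.items()):
        # Targets are Unicode sequences expressed in UTF-16BE, written as hex.
        utf16_hex = text.encode("utf-16-be").hex().upper()
        entries.append(f"<{code:04X}> <{utf16_hex}>")

    # By convention a single bfchar section holds at most 100 entries.
    sections = []
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        sections.append(
            f"{len(chunk)} beginbfchar\n" + "\n".join(chunk) + "\nendbfchar\n"
        )

    header = (
        "/CIDInit /ProcSet findresource begin\n"
        "12 dict begin\n"
        "begincmap\n"
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
        "/CMapName /Adobe-Identity-UCS def\n"
        "/CMapType 2 def\n"
        "1 begincodespacerange\n"
        "<0000> <FFFF>\n"
        "endcodespacerange\n"
    )
    footer = "endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\n"
    return header + "".join(sections) + footer

# Hypothetical findings: code 0003 is a space, code 0024 is Hebrew alef.
print(build_tounicode_cmap({0x0003: " ", 0x0024: "\u05d0"}))
```

The header and footer mirror the structure of a typical ToUnicode CMap; only the bfchar sections carry the actual mapping.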
I have seen some questions here (like this one) asking whether a cell in Excel can be formatted by NPOI/POI as if it had been formatted by Excel itself. Like most of you, I have to deal with Currency and DateTime issues. So let me ask: how can this formatting be achieved as if Excel had done it? (I will answer this question myself to demonstrate how to do it.)
Setting: Windows 10, English, Region: Taiwan
Excel format: XLSX (version 2007 and later)
(Sorry about the various edits of this question; I pressed the 'Enter' button at unexpected times.)
If you format a cell as Currency, you have 4 choices:
The internal format of each style is as follows:
1. -NT$1,234.10
<numFmt formatCode="&quot;NT$&quot;#,##0.00" numFmtId="164"/>
2. [RED]NT$1,234.10
<numFmt formatCode="&quot;NT$&quot;#,##0.00;[Red]&quot;NT$&quot;#,##0.00" numFmtId="164"/>
3. -NT$1,234.10
<numFmt formatCode="&quot;NT$&quot;#,##0.00_);(&quot;NT$&quot;#,##0.00)" numFmtId="7"/>
4. [RED]-NT$1,234.10
<numFmt formatCode="&quot;NT$&quot;#,##0.00_);[Red](&quot;NT$&quot;#,##0.00)" numFmtId="8"/>
Note: A pair of double quotes (") comes before and after NT$; inside the XML attribute they are escaped as &quot;.
(To get the internal format of an XLSX file, just unzip it. The style information is available in <unzip dir>\xl\Styles.xml. Check out this answer if you need more information.)
(FYI: In formatCode, '0' represents a digit. '#' also represents a digit, but it is not shown if the number is not large enough, so any number less than 1000 has no comma in it. '_' is a space holder: in format 3, '1.75' appears as 'NT$1.75 ', where the last character is a space.)
(FYI: In numFmtId, number 164 in cases 1 and 2 means user-defined. In cases 3 and 4, numbers 7 and 8 are built-in styles.)
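For illustration only, here is a small Python sketch of what the first format code means. `currency_format_code` builds the user-defined code for an arbitrary symbol, and `approximate_display` roughly mimics how that code renders a number (thousands separator, two decimals). Both function names are hypothetical, and this is not how Excel or NPOI evaluate format codes internally.

```python
def currency_format_code(symbol: str) -> str:
    """Build a user-defined format code like the first choice above."""
    # The symbol is a quoted literal; '#,##0.00' is the number pattern.
    return f'"{symbol}"#,##0.00'

def approximate_display(value: float, symbol: str) -> str:
    """Roughly mimic how '"<symbol>"#,##0.00' renders a number."""
    sign = "-" if value < 0 else ""
    # ',.2f' gives a thousands separator and exactly two decimals.
    return f"{sign}{symbol}{abs(value):,.2f}"

print(currency_format_code("NT$"))           # "NT$"#,##0.00
print(approximate_display(-1234.1, "NT$"))   # -NT$1,234.10
print(approximate_display(1.75, "NT$"))      # NT$1.75
```

Numbers below 1000 get no comma, which matches the role of '#' described above.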
Developers using POI/NPOI may find that formatting a currency column with the built-in formats 0x7 or 0x8 yields only the third or fourth choice; the first or second choice cannot be obtained that way.
To get the first choice, build upon style 0x7 "$#,##0.00);($#,##0.00)": put the currency symbol, wrapped in a pair of double quotes, in front of the number pattern.
styleCurrency.DataFormat = workbook.CreateDataFormat().GetFormat("\"NT$\"#,##0.00");
Apply this format to a cell containing a number. When you open the resulting Excel file and right-click the cell to check its formatting, you will see the first choice selected.
Please feel free to comment on this post.
var cell5 = row.CreateCell(5, CellType.Numeric);
cell5.SetCellValue(item.OrderTotal);
var styleCurrency = workbook.CreateCellStyle();
styleCurrency.DataFormat = workbook.CreateDataFormat().GetFormat(string.Format("\"{0}\"#,##0.00", item.CurrencySymbol));
cell5.CellStyle = styleCurrency;
styleCurrency = null;
Repeat this in a loop to handle multiple currencies.
A C# function to get the currency symbol for an ISO currency code (it needs using System.Globalization; and using System.Linq;):
private string GetCurrencySymbol(string isoCurrencyCode)
{
    return CultureInfo.GetCultures(CultureTypes.AllCultures)
        .Where(c => !c.IsNeutralCulture)
        .Select(culture =>
        {
            try
            {
                return new RegionInfo(culture.LCID);
            }
            catch
            {
                // Some cultures have no region information; skip them.
                return null;
            }
        })
        .Where(ri => ri != null && ri.ISOCurrencySymbol == isoCurrencyCode)
        .Select(ri => ri.CurrencySymbol)
        .FirstOrDefault();
}
Be prepared, this is one of those hard questions.
In Farsi (the Persian language), the letter ی, which sounds like y or i, is written in 4 different shapes according to its place in the word. I'll call ی YA from now on for simplicity.
take a look at this image
All YA characters are painted in red. In the first word, YA is attached to its previous character (the one on its right; in Farsi we write from RIGHT to LEFT) and is free at the end, whereas the last YA (3rd word, left-most red character) is free on both the left and the right.
Having told this long story, I want to find out whether a part of a string ends with long YA (YA without points) or short YA (YA with two points beneath it).
i.e. تحصیلداری (the 3rd word) ends with long YA, but تحصیـ, which is a part of the 3rd word, does not end with short YA.
Question: How can I tell which Unicode character تحصیلداری ends with? I just have a simple string, "تحصیلداری"; how can I convert its characters to Unicode code points?
I tried printing the Unicode values:
string unicodes = "";
foreach (char c in "تحصیلداری")
{
unicodes += c+" "+((int)c).ToString() + Environment.NewLine;
}
MessageBox.Show(unicodes);
result :
but at the end of the day, unfortunately, all YAs have the same Unicode value.
Bad news: YA was an example, a real one though. There are also a dozen other characters like YA with different appearances.
Additional info :
Using this useful link about Unicode, I found the code points of the different YAs.
We solved a similar problem in the way described below.
We had a core banking application whose customer sub-system needed full-text search on customers' first names, family names, fathers' names, etc.
Different encodings, legacy migrated data, keyboard layouts and Farsi fonts made the search process inaccurate.
We overcame the problem by replacing problematic characters with a standard one and saving the standardized string for search purposes.
After several iterations, the replacement looks as below, which may come in handy:
Formula="UPPER(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(FirsName || LastName || FatherName,
chr(32),''),
chr(13),''),
chr(9),''),
chr(10),''),
'-',''),
'-',''),
'آ','ا'),
'أ', 'ا'),
'ئ', 'ي'),
'ي', 'ي'),
'ك', 'ک'),
'آإئؤةي','اايوهي'),
'ء',''),
'شأل','شاال'),
'ا.','اله'),
'.',''),
'الله','اله'),
'ؤ','و'),
'إ','ا'),
'ة','ه'),
' ا لله','اله'),
'ا لله','اله'),
' ا لله','اله'))"
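The same normalization idea can be sketched in Python with a translation table. This is a simplified illustration, not a port of the formula above: it covers only the single-character replacements and the whitespace removal, and it assumes that the intended target for YEH is the Farsi code point U+06CC (the formula as shown does not make the exact code points visible). The names `SEARCH_KEY_TABLE` and `search_key` are hypothetical.

```python
# Single-character replacements, written as escapes to keep code points explicit.
SEARCH_KEY_TABLE = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH (assumed target)
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> ALEF
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
    "\u0624": "\u0648",  # WAW WITH HAMZA ABOVE  -> WAW
    "\u0629": "\u0647",  # TEH MARBUTA -> HEH
    " ": None, "\t": None, "\r": None, "\n": None,  # drop whitespace
})

def search_key(*name_parts: str) -> str:
    """Concatenate name parts and normalize them into one search key."""
    return "".join(name_parts).translate(SEARCH_KEY_TABLE).upper()

# A name typed with Arabic YEH/KAF and one typed with the Farsi letters
# should produce the same key.
typed_with_arabic_letters = search_key("\u0639\u0644\u064A", " \u0643")
typed_with_farsi_letters = search_key("\u0639\u0644\u06CC", "\u06A9")
print(typed_with_arabic_letters == typed_with_farsi_letters)  # True
```

Storing such a key next to the original text lets the search compare keys instead of raw strings.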
Although there are several different YEHs in Unicode, note that all presentation forms of this YEH are the same Unicode character, with code 0x06CC. You cannot determine the presentation form from the Unicode code.
But you can reach your goal by checking which characters come before and after the YEH.
You can also use Fardis to see Unicode codes of strings.
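A minimal Python sketch of the "check the neighbouring characters" approach follows. It is a simplification under stated assumptions: the set of letters that never connect to the following letter is hard-coded and incomplete, diacritics are not skipped, and the function names are hypothetical. A renderer applies the full Arabic joining rules; this only approximates them.

```python
import unicodedata

FARSI_YEH = "\u06CC"
TATWEEL = "\u0640"   # the elongation stroke, joins on both sides
HAMZA = "\u0621"     # stand-alone hamza joins on neither side
# Letters that connect to the previous letter but never to the next one
# (alef forms, dal, thal, reh, zain, jeh, waw forms, teh marbuta).
RIGHT_JOINING = set(
    "\u0622\u0623\u0625\u0627\u062F\u0630\u0631\u0632\u0698\u0648\u0624\u0629"
)

def _is_arabic_letter(ch: str) -> bool:
    return (unicodedata.category(ch) == "Lo"
            and unicodedata.name(ch, "").startswith("ARABIC LETTER"))

def joins_to_next(ch: str) -> bool:
    return ch == TATWEEL or (
        _is_arabic_letter(ch) and ch != HAMZA and ch not in RIGHT_JOINING)

def joins_to_previous(ch: str) -> bool:
    return ch == TATWEEL or (_is_arabic_letter(ch) and ch != HAMZA)

def yeh_form(text: str, index: int) -> str:
    """Guess the presentation form of the FARSI YEH at text[index]."""
    if text[index] != FARSI_YEH:
        raise ValueError("character at index is not FARSI YEH")
    joined_before = index > 0 and joins_to_next(text[index - 1])
    joined_after = index + 1 < len(text) and joins_to_previous(text[index + 1])
    forms = {(False, False): "isolated", (True, False): "final",
             (False, True): "initial", (True, True): "medial"}
    return forms[(joined_before, joined_after)]

# The word from the question, spelled out as code points.
word = "\u062A\u062D\u0635\u06CC\u0644\u062F\u0627\u0631\u06CC"
print(yeh_form(word, 3))  # medial   (between SAD and LAM)
print(yeh_form(word, 8))  # isolated (after REH, at the end of the word)
```

The last YEH comes out as isolated because REH never connects to the letter after it, which matches the description of the third word in the question.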
I am using C# with an HTTP helper and a StreamReader to read text. But when I upload a text file containing this text
"Look exactly what I found on # eBay! Willy Lee LifeLike Chatting Butler Prop Motion Sen"
each space is replaced by "�" when the text is used in the code.
The code for reading the text is:
List<string> list = new List<string>();
StreamReader reader = new StreamReader(filepath);
string text = "";
while ((text = reader.ReadLine()) != null)
{
    if (!string.IsNullOrEmpty(text))
    {
        list.Add(text);
    }
}
reader.Close();
return list;
The list then contains this data:
"Look��exactly�what�I�found�on�#�eBay!�Willy�Lee�LifeLike��Chatting�Butler�Prop�Motion�Sen"
This looks like an encoding problem. I have had such text problems when multibyte-encoded text is shown in a non-Unicode web page, such as one using Windows-1252 or another CP-125x code page.
Here it looks the same: the text looks UTF-8 encoded and is displayed in ANSI mode. The spaces are "special" spaces like the ones MS Word sometimes inserts, while the English characters are single bytes in UTF-8 (as are all characters below code 128), which makes them compatible with the ANSI code table, so they display correctly.
Option 2: if the text was written to a file and saved like that, without a BOM at the beginning, the text editor may not recognize the content as Unicode and may open it in ANSI (plain ASCII) mode.
If you give more details about where the data is read from and where it is saved and opened, I can give more concrete advice.
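One way to check in which direction an encoding mismatch goes is to decode the same bytes both ways. The Python sketch below assumes, without confirmation from the question, that the file stores each "special" space as the single byte 0xA0 (a non-breaking space in Windows-1252). Decoding that byte as UTF-8, which is what StreamReader does by default when no encoding is given, cannot succeed and yields the replacement character U+FFFD.

```python
# Hypothetical file content: "Look", a Windows-1252 non-breaking space, "exactly".
raw = b"Look\xa0exactly"

# 0xA0 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# In Windows-1252 the same byte is a valid non-breaking space (U+00A0).
as_cp1252 = raw.decode("cp1252")

print(ascii(as_utf8))    # 'Look\ufffdexactly'
print(ascii(as_cp1252))  # 'Look\xa0exactly'
```

If the real file behaves like this, reading it with the matching single-byte encoding, and then replacing U+00A0 with an ordinary space if desired, would avoid the "�" characters.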
As I asked in my previous question (Link) about concatenating a multipart string of variable lengths, I used the method answered there by rkhayrov, and now my function looks like this:
local sToReturn = string.format( "\t%03s\t%-25s\t%-7s\n\t", "S. No.", "UserName", "Score" )
SQLQuery = assert( Conn:execute( string.format( [[SELECT username, totalcount FROM chatstat ORDER BY totalcount DESC LIMIT %d]], iLimit ) ) )
DataArray = SQLQuery:fetch ({}, "a")
i = 1
while DataArray do
    sTemp = string.format( "%03s\t%025s\t%-7d", tostring(i), DataArray.username, DataArray.totalcount )
    sToReturn = sToReturn..sTemp.."\n\t"
    DataArray = SQLQuery:fetch ({}, "a")
    i = i + 1
end
But even now, the score values are still not aligned as required. The maximum length of a username is 25. I've used %025s inside the while loop because I want the usernames to be right-justified, while the %-25s is there to make the word UserName centre-justified.
EDIT
Current output:
Required output:
Displaying the list of top 5 chit-chatters.
S. No.  UserName  Score
1       Keeda     9440
2       _2.2_™    7675
3       aim       7057
4       KGBRULES  6770
5       Guddu     6322
I think it's because of the difference in fonts, but since most of the clients have the Windows 7 default fonts (Tahoma/Verdana at 11px), I need the optimum result for at least that.
"I think it's because of difference in fonts"
It is. string.format pads by inserting whitespace. That only lines up in a fixed-width font (i.e. one in which all characters, including whitespace, have the same width).
"since most of the clients have Windows 7 default fonts (Tahoma/Verdana at 11px)"
In what? How are they viewing your output? Do you write it to a text file that they then open in the editor of their choice (likely Notepad)? Then this approach will simply not work.
I don't know enough about your output requirements to steer you any further, but it's worth noting that everyone has a browser, so HTML output is very portable.
string.format doesn't truncate: the width of the field is a minimum, not a maximum. You'll have to truncate the strings to 25 characters yourself with something like DataArray.username:sub(1,25).
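The "minimum, not maximum" behaviour is the same in most printf-style formatters. As an illustration in Python (not Lua), the sketch below shows that a width pads short strings but leaves long ones untouched, and that truncating first gives a column of exactly the requested width. The helper name `fixed_width` is hypothetical, and the result still only lines up visually in a fixed-width font.

```python
def fixed_width(text: str, width: int, align: str = ">") -> str:
    """Truncate text to width, then pad it to exactly width characters."""
    return format(text[:width], f"{align}{width}")

long_name = "x" * 30

# A width alone pads but never truncates.
print(len(format("Keeda", ">25")))    # 25
print(len(format(long_name, ">25")))  # 30

# Truncating first keeps every field at exactly 25 characters.
print(len(fixed_width(long_name, 25)))  # 25
print(repr(fixed_width("Keeda", 8)))    # '   Keeda'
```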
I'd remove the tabs from the string.format call and use only the justification provided by %25s. It won't be perfect, but it will probably be closer.
Use a fixed-width font if you can.