A way to translate images? - python-3.x

In Python I am attempting to translate Arabic characters within an image. I can provide the 'source' language (Arabic) and the 'destination' language (English). Is there a free Python library or API I can use for this? That is, one that provides a service like https://translate.google.com and allows for cloud image translation (uploading an image containing untranslated characters and downloading an image with the translated characters rendered in it)? Or a library to do this locally on my system, i.e. detect the Arabic characters in an image containing Arabic text, extract them, translate them with a cloud translation service (e.g. Google Translate), and then modify the image so the original Arabic characters are replaced with the translated English characters?

So my goal is to modify/replace the Arabic characters within an image with the English characters that are the translation of the original/extracted Arabic characters. I know Yandex (https://translate.yandex.com/ocr) allows for this, however you must pay for their translation API. How could I do this?

While I'm not sure if there is support for Arabic, there are libraries like OpenCV (cv2) for Python and pytesseract that can extract text from an image. You can then use another library like translate to finish the process from there. https://pypi.org/project/translate/
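As a rough sketch of that pipeline, assuming the Tesseract OCR engine is installed with its Arabic language data and the pytesseract, Pillow and translate packages are available (the file name sample.png is just a placeholder):

from PIL import Image
import pytesseract
from translate import Translator

# Extract the Arabic text from the image (requires Tesseract's 'ara' language data).
arabic_text = pytesseract.image_to_string(Image.open("sample.png"), lang="ara")

# Translate the extracted text from Arabic to English.
translator = Translator(from_lang="ar", to_lang="en")
english_text = translator.translate(arabic_text)
print(english_text)

Rendering the translated English text back onto the image is a separate step (for example, drawing over the original text regions with Pillow) and is not handled by these libraries.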

Related

The structure of Arabic letters in Unicode

I got two different "versions" of Arabic letters on Wikipedia. The first example seems to be 3 sub-components in one:
"ـمـ".split('').map(x => x.codePointAt(0).toString(16))
[ '640', '645', '640' ]
Finding this "m medial" letter on this page gives me this:
ﻤ
fee4
The code points 640 and 645 are the "Arabic tatweel" ـ and the "Arabic letter meem" م. What the heck? How does this work? I don't see anywhere in the information on Unicode Arabic so far how these glyphs are "composed". Why is it composed from these parts? Is there a pattern to the structure of all glyphs? (All the glyphs on the first Wikipedia page are similar, but on the second page they are single code points.) Where do I find information on how to parse out the characters effectively in Arabic (or any other language, for that matter)?
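For reference, the Unicode names behind the two examples quoted above can be listed directly in Python:

import unicodedata

# The "composed" medial form from the first page: tatweel + meem + tatweel.
for ch in "\u0640\u0645\u0640":
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# The single code point from the second page.
presentation_meem = "\ufee4"
print("U+%04X %s" % (ord(presentation_meem), unicodedata.name(presentation_meem)))

This prints ARABIC TATWEEL, ARABIC LETTER MEEM, ARABIC TATWEEL for the first string, and ARABIC LETTER MEEM MEDIAL FORM for the second.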
Arabic is a script with cursive joining; the shape of the letters changes depending on whether they occur initially, medially, or finally within a word. Sometimes you may want to display these contextual forms in isolation, for example to simply show what they look like.
The recommended way to go about this is by using special join-causing characters for the letters to connect to. One of these is the tatweel (also called kashida), which is essentially a short line segment with “glue” at each end. So if you surround the letter م with a tatweel character on both sides, the text renderer automatically selects its medial form as if it occurred in the middle of a word (ـمـ). The underlying character code of the م doesn’t change, only its visible glyph.
However, for historical reasons Unicode also contains a large set of so-called presentation forms for Arabic. These represent those same contextual letter shapes, but as separate character codes that do not change depending on their surroundings; putting the “isolated” presentation form of م between two tatweels does not affect its appearance, for instance: ـﻡـ
It is not recommended to use these presentation forms for actually writing Arabic. They exist solely for compatibility with old legacy encodings and aren’t needed for correctly typesetting Arabic text. Wikipedia just used them for demonstration purposes and to show off that they exist, I presume. If you encounter presentation forms, you can usually apply Unicode normalisation (NFKD or NFKC) to the string to get the underlying base letters. See the Unicode FAQ on presentation forms for more information.
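As a small Python illustration of that last point (this relies only on the Unicode character database shipped with Python, not on any third-party library):

import unicodedata

# U+FEE4 ARABIC LETTER MEEM MEDIAL FORM is a compatibility presentation form.
presentation_form = "\ufee4"

# NFKC normalisation maps it back to the base letter U+0645 ARABIC LETTER MEEM.
base = unicodedata.normalize("NFKC", presentation_form)
print(hex(ord(base)))  # 0x645

# Surrounding the base letter with tatweels (U+0640) makes a renderer choose the
# medial glyph, while the stored code points remain tatweel + meem + tatweel.
medial_in_context = "\u0640" + base + "\u0640"
print([hex(ord(ch)) for ch in medial_in_context])  # ['0x640', '0x645', '0x640']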

How to enable my python code to read from Arabic content in Excel?

I have two related problems. I'm working on an Arabic dataset in Excel. I think that Excel somehow reads the contents as ؟؟؟؟؟, because when I try to replace the character '؟' with '?' it replaces the whole text in the sheet. But when I replace or search for another letter it works.
Second, I'm trying to edit the sheet using Python, but I'm unable to write Arabic letters (I'm using jGRASP). For example, when I write the letter 'ل' it appears as 0644, and when I run the code this message appears: "Error encoding text. Unable to encode text using charset windows-1252".
0644 is the character code of the character in hex. jGRASP displays that when the font does not contain the character. You can use "Settings" > "Font" in jGRASP to choose a CSD font that contains the characters you need. Finding one that has those characters and also works well as a coding font might not be possible, so you may need to switch between two fonts.
jGRASP uses the system character encoding for loading and saving files by default. Windows-1252 is an 8-bit encoding used on English language Windows systems. You can use "File" > "Save As" to save the file with the same name but a different encoding (charset). Once you do that, jGRASP will remember it (per file) and you can load and save normally. Alternatively, you can use "Settings" > "CSD Windows Settings" > "Workspace" and change the "Default Charset" setting to make the default something other than the system default.
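On the Python side, the encoding issue usually disappears once the Arabic text is read and written explicitly as Unicode/UTF-8. A minimal sketch, assuming the openpyxl package and a placeholder workbook name data.xlsx:

from openpyxl import load_workbook

# Read the Arabic cell contents; openpyxl returns ordinary Python (Unicode) strings.
wb = load_workbook("data.xlsx")
ws = wb.active
values = [cell.value for row in ws.iter_rows() for cell in row if cell.value is not None]

# Write the values out explicitly as UTF-8 so the system charset (windows-1252) is never involved.
with open("output.txt", "w", encoding="utf-8") as f:
    for value in values:
        f.write(str(value) + "\n")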

Azure OCR unable to detect roman character "I" and "II"

I have this image.
I am using the Azure Computer Vision API v2.0, a combination of the Recognize Text API (POST) and the Get Recognize Text Operation Result API (GET), as described in https://westcentralus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/587f2c6a154055056008f200, to detect text characters in the image.
Currently it is able to detect all the characters except the Roman numerals I and II.
Can someone help?
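For reference, the POST-then-poll flow described above looks roughly like the sketch below, assuming the v2.0 recognizeText endpoint, the requests package, and placeholder values for the region, subscription key and image path (check the exact URL and parameters against the linked documentation):

import time
import requests

endpoint = "https://westcentralus.api.cognitive.microsoft.com/vision/v2.0/recognizeText"
headers = {"Ocp-Apim-Subscription-Key": "YOUR_KEY",
           "Content-Type": "application/octet-stream"}

with open("image.png", "rb") as f:
    image_data = f.read()

# Step 1: POST the image; the service answers with an Operation-Location header.
resp = requests.post(endpoint, headers=headers, params={"mode": "Printed"}, data=image_data)
resp.raise_for_status()
operation_url = resp.headers["Operation-Location"]

# Step 2: poll the operation URL with GET until recognition has finished.
while True:
    result = requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": "YOUR_KEY"}).json()
    if result.get("status") in ("Succeeded", "Failed"):
        break
    time.sleep(1)

# The recognized lines (when successful) are under recognitionResult -> lines.
for line in result.get("recognitionResult", {}).get("lines", []):
    print(line["text"])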

How to identify if text encoding issue is my processing error or carried from the source pdf

I have a selection of PDFs that I want to text mine. I use Tika to parse the text out of each PDF and save it to a .txt file with UTF-8 encoding (I'm using Windows).
Most of the PDFs were OCR'd before I got them, but when I view the extracted text I get "pnÁnn¿¡c" where the PDF itself displays "Phádraig".
Is it possible for me to verify the text layer of the PDF (forgive me if that's the incorrect term), ideally without needing the full version of Acrobat?
It sounds like you are dealing with scanned books with "hidden OCR", i.e. the PDF shows an image of the original document, behind which there is a layer of OCRed text.
That allows you to use the search function and to copy-paste text out of the document.
When you highlight the text, the hidden characters become visible (though this behaviour may depend on the viewer you use).
To be sure, you can copy-paste the highlighted text to a text editor.
This will allow you to tell if you are actually dealing with OCR quality this terrible, or if your extraction process caused mojibake.
Since OCR quality heavily depends on language resources (dictionaries, language model), I wouldn't be surprised if the output was actually that bad for a low-resource language like Gaelic (Old Irish?).
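One way to tell the two cases apart is to compare what copy-paste from the viewer gives you with what your own extraction pipeline produces. A minimal sketch of the extraction side, assuming the tika Python package and a placeholder file name sample.pdf:

from tika import parser

# Extract the hidden text layer; Tika returns the content as a Unicode string.
parsed = parser.from_file("sample.pdf")
text = parsed.get("content") or ""

# Save it explicitly as UTF-8 so Windows' default code page cannot mangle the accents.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write(text)

# If "Phádraig" is already garbled here, the damage is in the PDF's text layer;
# if it only becomes garbled later, the problem is in the downstream processing.
print("Phádraig" in text)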

Converting PostScript to Text Using GhostScript

I want to extract Text data out of PostScript documents. The problem is when I use GhostScript to do that, some texts would be extracted normally while others would be converted to weird symbolic characters.
I realized that the texts which had been extracted normally were in fonts that Ghostscript would NOT embed in the PDF because of licensing restrictions. Ironically, the fonts without licensing restrictions, which were embedded normally in the PDF, weren't being converted back correctly.
I tried both the txtwrite device, to convert the PostScript directly to text, and the pdfwrite device, to first convert the PS to PDF and then extract the text out of the PDF document, but neither of them worked.
I thought maybe I could substitute all the fonts with the unsupported ones so that the text data would be extracted correctly, but it turned out there is no simple way to do that.
What do you think I should do?
The cause of this is usually that the characters are encoded in a non-standard fashion. I'm afraid there is not a lot you can do, except possibly compare the readable PostScript with the extracted text to work out which "weird symbolic character" corresponds to which actual character. Then you might be able to reconstruct the original text by replacing the weird characters with the intended ones.
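For reference, the two extraction routes mentioned in the question can be driven from Python roughly as follows (a sketch, assuming Ghostscript is installed as gs on the PATH and using placeholder file names):

import subprocess

# Route 1: extract text directly from the PostScript with the txtwrite device.
subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
                "-sOutputFile=output.txt", "input.ps"], check=True)

# Route 2: convert to PDF first with the pdfwrite device, then extract text from
# the resulting PDF with whatever PDF text extractor you prefer.
subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
                "-sOutputFile=output.pdf", "input.ps"], check=True)

Whichever route is used, the non-standard encoding described above will still yield the same garbled characters; the commands only change where the extraction happens.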
