I got two different "versions" of Arabic letters on Wikipedia. The first example seems to be 3 sub-components in one:
"ـمـ".split('').map(x => x.codePointAt(0).toString(16))
[ '640', '645', '640' ]
Finding this "m medial" letter on this page gives me this:
ﻤ
fee4
The code points 640 and 645 are the "Arabic tatweel" ـ and "Arabic letter meem" م. What the heck? How does this work? Nothing I have found so far about Unicode Arabic explains how these glyphs are "composed". Why is it composed from these parts? Is there a pattern for the structure of all glyphs? (All the glyphs on the first Wikipedia page are like this, but on the second page each is a single code point.) Where do I find information on how to parse out the characters effectively in Arabic (or any other language, for that matter)?
Arabic is a script with cursive joining; the shape of the letters changes depending on whether they occur initially, medially, or finally within a word. Sometimes you may want to display these contextual forms in isolation, for example to simply show what they look like.
The recommended way to go about this is by using special join-causing characters for the letters to connect to. One of these is the tatweel (also called kashida), which is essentially a short line segment with “glue” at each end. So if you surround the letter م with a tatweel character on both sides, the text renderer automatically selects its medial form as if it occurred in the middle of a word (ـمـ). The underlying character code of the م doesn’t change, only its visible glyph.
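To restate the question’s own finding, here is a minimal Python sketch (the variable names are just for illustration): surrounding the ordinary meem with tatweels changes only the rendered glyph, not the stored code points.
meem, tatweel = "\u0645", "\u0640"    # ARABIC LETTER MEEM, ARABIC TATWEEL
medial = tatweel + meem + tatweel     # renders as the medial form: ـمـ
print([hex(ord(c)) for c in medial])  # ['0x640', '0x645', '0x640']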
However, for historical reasons Unicode also contains a large set of so-called presentation forms for Arabic. These represent those same contextual letter shapes, but as separate character codes that do not change depending on their surroundings; putting the “isolated” presentation form of م between two tatweels does not affect its appearance, for instance: ـﻡـ
It is not recommended to use these presentation forms for actually writing Arabic. They exist solely for compatibility with old legacy encodings and aren’t needed for correctly typesetting Arabic text. Wikipedia just used them for demonstration purposes and to show off that they exist, I presume. If you encounter presentation forms, you can usually apply Unicode normalisation (NFKD or NFKC) to the string to get the underlying base letters. See the Unicode FAQ on presentation forms for more information.
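As a minimal sketch using Python’s standard unicodedata module, NFKC folds the presentation form from the second Wikipedia page back to the ordinary letter:
import unicodedata
presentation = "\ufee4"                              # ARABIC LETTER MEEM MEDIAL FORM
base = unicodedata.normalize("NFKC", presentation)   # compatibility decomposition + recomposition
print(hex(ord(base)))                                # 0x645, i.e. ARABIC LETTER MEEM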
Related
I am working on an application that converts text into other characters of the extended ASCII character set, which are then displayed in a custom font.
The program essentially parses the input string using regex, locates the standard characters, and outputs them as converted, before returning a string with the modified text, which displays correctly when viewed with the correct font.
Every now and again, the function returns a string where the characters are displayed in the wrong order, almost as if they are corrupted or some data is missing from the Unicode double-width spacing. I have examined the binary output, the hex data, and inspected the data in the function before I return it, and everything looks OK, but every once in a while something goes wrong and I can't quite put my finger on it.
To see an example of what I mean when I say the order is weird, just take a look at the following piece of converted text output from the program and try to highlight it with your mouse. You will see that it doesn't highlight in the order you expect, despite how it appears.
Has anyone seen anything like this before and have they any ideas as to what is going on?
ך┼♫יἯ╡П♪דἰ
You are mixing various Unicode characters with different LTR/RTL characteristics.
LTR means "left-to-right" and is the direction in which English (and many other Western languages) is written.
RTL is "right-to-left" and is used mostly by Arabic and Hebrew (as well as several other scripts).
By default, when rendering Unicode text, the engine will try to use the directionality of the characters to figure out which direction a given part of the text should go. Normally that works just fine, because Hebrew words will have only Hebrew letters and English words will only use letters from the Latin alphabet, so for each chunk there's an easily guessable direction that makes sense.
But you are mixing letters from different scripts and with different directionality.
For example ך is U+05DA HEBREW LETTER FINAL KAF, but you also use two other Hebrew characters. You can use something like this page to list the Unicode characters you used.
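If you prefer to do this programmatically, here is a small Python sketch (using the standard unicodedata module) that lists each character's name and bidi class, so you can see where LTR and RTL characters are mixed:
import unicodedata
text = "ך┼♫יἯ╡П♪דἰ"
for ch in text:
    bidi = unicodedata.bidirectional(ch)   # e.g. 'L', 'R', 'ON'
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {bidi:>3}  {name}")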
You can either
not use "wrong" directionality letters or
make the direction explicit using a LEFT-TO-RIGHT MARK (U+200E) character.
Edit: Last but not least, I just realized that you said "custom font": if you expect the text to be displayed with a specific custom font, then you should really be using one of the Private Use Areas in Unicode: they are explicitly reserved for private use like this (i.e. for cases where the characters don't match the publicly defined glyphs for the code points). That would also avoid surprises like the ones you are getting, where some of the characters used have different rendering properties.
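A hypothetical sketch of that suggestion: map your custom glyph slots into the BMP Private Use Area (U+E000..U+F8FF). PUA characters default to left-to-right directionality, so they won't be reordered the way the Hebrew letters above are.
PUA_START = 0xE000
def custom_glyph(slot: int) -> str:
    # slot is an index into your font's private glyph table (hypothetical)
    if not 0 <= slot <= 0xF8FF - PUA_START:
        raise ValueError("slot outside the BMP Private Use Area")
    return chr(PUA_START + slot)
print(custom_glyph(0))  # '\ue000'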
Let me start by acknowledging that this is a rather broad question, but I need to start somewhere and reduce the design space a bit.
The problem
Grammarly is an online app that provides grammar and spell-checking as a browser plugin. Currently, there is no support for text editors or LaTeX sources. Grammarly apparently often gets confused when forced to deal with annotated or formatted text (e.g. text containing wrapped lines). I guess many people could use that tool when writing up scientific papers or pretty much any other LaTeX document. I also presume that other solutions exist, or will soon pop up, that work similarly.
The solution
In principle, it is not necessary to support Grammarly directly in, e.g., Emacs. It suffices to provide a convenient interface to check multiple source files at once. To that end, a simple web app could walk through a directory, read all .tex sources, remove all formatting and markup, and expose the files as an HTML document. The user could open that document, run Grammarly, and apply any fixes. The app would then have to take the corrected text and reapply formatting, markup, etc. to save the now-fixed source file.
The question(s)
While it is reasonably simple to create such a web application, there are other requirements to be considered: a library like HaTeX could deal with parsing and interpreting LaTeX (up to "standard" syntax), but the process of editing needs some thought. Presuming that the removal of formatting can be implemented by only deleting content, it should be possible to take a correction as a diff and reapply it to the formatted document.
In Haskell, is there a data structure for text editing that supports this use case? That is, a representation of text that can store deletions, find diffs, undo deletions, and move a diff accordingly? If not in Haskell, does something like this exist somewhere else?
Bonus question 2: What is the simplest (as in lines of code required) web framework in Haskell to set up such a web app? It would serve one HTML document and accept updated versions of the text files. No database is required.
Instead of removing and then adding the text formatting, you could parse the source text into a stream of annotated tokens:
data AnnotatedChar = AC
{ char :: Char
, formatting :: String
}
The following source:
Is \emph{good}.
would be parsed as:
[AC 'I' "", AC 's' "", AC ' ' "\\emph{", AC 'g' "", ...
Then, extract only the chars from this list, send them to Grammarly, and get back the result. Now, diff the list of annotated characters with the list of characters you got from Grammarly. This way, you only have to deal with a list of characters, but keep the annotations.
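The diff-and-reapply step is language-agnostic; here is a rough sketch in Python using difflib, purely to illustrate the idea. The sample strings and the convention that markup is stored on the preceding character follow the example above; a real implementation would also have to decide what to do with markup attached to deleted characters.
import difflib
original    = "Is god."                             # plain text sent to the checker
corrected   = "Is good."                            # what the checker sent back
annotations = ["", "", "\\emph{", "", "", "}", ""]  # markup after each char of `original`
merged = []                                         # (char, markup) pairs for `corrected`
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=original, b=corrected).get_opcodes():
    if tag == "equal":                              # unchanged chars keep their markup
        merged.extend(zip(corrected[j1:j2], annotations[i1:i2]))
    else:                                           # inserted/replaced chars get none
        merged.extend((ch, "") for ch in corrected[j1:j2])
print("".join(ch + ann for ch, ann in merged))      # Is \emph{good}.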
I'm curious how PDF securing works. I can lock a PDF file so that a system can't recognize the text or manipulate the PDF file. Everything I found was about "how to lock/unlock", but nothing about "how it works". Is there anyone who could explain it to me? Thx
The OP clarified in a comment
I mean a lock on text recognition or manipulation of the PDF file. There should be nothing about cryptography imho, just some trick.
There are some options, among them:
You can render the text as a bitmap and include that bitmap in the PDF
-> no text information.
Or you can embed the font in question using a non-standard encoding without using standard glyph names
-> text information in an unknown encoding.
E.g. cf. the PDF analysed in this answer.
A special case: make the encoding wrong for only a few characters, maybe just one, probably a digit. This way an inattentive person thinks everything was extracted OK, and the errors only start screwing things up when the data is actually used, which in the case of wrong digits is especially hard to fix. E.g. cf. the PDF analysed in this answer.
Or you can put text in structures where text extraction software or copy&paste routines usually don't look, like creating a large pattern tile containing the text for some text area and filling the area with the matching pattern color.
-> text information present but not seen by most extractors.
E.g. cf. this answer; the technique here is used to make the text of a watermark non-extractable.
Or you can put extra text all over the page but make it invisible, e.g. under images, drawn in rendering mode 3 (invisible), located in some disabled optional content group (layer), ... Text extractors often do not check whether the text they extract actually is visible.
-> text information present but polluted by garbage text bits (see the sketch after this list).
...
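To illustrate the invisible-text option above, here is a minimal sketch (assuming the reportlab package; the file name and strings are made up) that writes text in rendering mode 3, i.e. invisible, which naive extractors will still happily pick up:
from reportlab.pdfgen import canvas
c = canvas.Canvas("decoy.pdf")
c.drawString(72, 740, "The visible page content")
t = c.beginText(72, 720)
t.setTextRenderMode(3)   # mode 3 = neither fill nor stroke: invisible
t.textLine("garbage text that pollutes naive extraction")
c.drawText(t)
c.showPage()
c.save()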
I am currently coding in Python and managed to use pdftotext to extract the text from a PDF.
That particular text is split up into a list of strings. By using regular expressions I am able to find specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, and by distance I mean the number of words between them.
However, after finding the positions of the words, I would like to be able to refer back to the initial PDF. In detail, I am interested in the page, and maybe even the line (if PDF supports this kind of structure), where these words are located.
One idea I have is to do this process for each page of the PDF, so when I find these words I know what page they were on. But this has the big disadvantage that page breaks are not necessarily natural. Meaning, I would lose the ability to find the words if they are unfortunately separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?
You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. The one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that will allow you to extract the text and examine the content of paragraphs, sentences within those paragraphs, and words within those sentences. You'll not only be able to tell the distance between the words but also whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
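If you would rather stay in Python, here is a rough sketch of the per-page idea that does not lose word pairs split across page breaks (assuming the pdftotext Python package, which yields one text string per page; file name and search terms are made up): build one flat word list, but remember the page and line each word came from.
import pdftotext
with open("paper.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# One flat list of (word, page_number, line_number), so word distances
# can still be measured across page breaks.
words = []
for page_no, page_text in enumerate(pdf, start=1):
    for line_no, line in enumerate(page_text.splitlines(), start=1):
        for word in line.split():
            words.append((word, page_no, line_no))
def first_index(target):
    # index of the first occurrence of `target` in the flat word list
    return next(i for i, (w, _, _) in enumerate(words) if w == target)
i, j = first_index("method"), first_index("results")
print(abs(j - i) - 1, "words apart;", words[i][1:], words[j][1:])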
What kind of sign is "" and what is it used for (note there is an invisible sign there)?
I have searched through all my documents and found a lot of them. They messed up my .htaccess file. I think I got them when I copied web addresses from Google to redirect. So maybe a warning: search through your own documents for this one as well :)
It is U+200E LEFT-TO-RIGHT MARK. (A quick way to check out such things is to copy a string containing the character and paste it in the writing area in my Full Unicode input utility, then click on the “Show U+” button there, and use Fileformat.Info character search to check out the name and other properties of the character, on the basis of its U+... number.)
The LEFT-TO-RIGHT MARK sets the writing direction of directionally neutral characters. It does not affect e.g. English or Arabic words, but it may mess up text that contains parentheses for example – though for text in English, there should be no confusion in this sense.
But, of course, when text is processed programmatically, as when a web server reads a .htaccess file, such invisible characters are still part of the data and can make a big difference.
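For reference, a small Python sketch that reports every line containing the invisible U+200E in .htaccess files under the current directory (the paths are just an example):
from pathlib import Path
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
for path in Path(".").rglob(".htaccess"):
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
        if LRM in line:
            print(f"{path}:{lineno}: {line.replace(LRM, '<U+200E>')}")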