Data structure for a grammar checker for LaTeX sources

Data structure for a grammar checker for LaTeX sources - haskell

Let me start with acknowledging that this is a rather broad question, but I need to start somewhere and reduce the design space a bit.
The problem
Grammarly is an online app that provides grammar and spell-checking as a browser plugin. Currently, there neither exists support for text editors nor latex sources. Grammarly is apparently often confused when forced to deal with annotated text or text that is formatted (e.g. contains wrapped lines). I guess many people could use that tool when writing up scientific papers or pretty much any other LaTeX tool. I also presume that other solutions exist or will soon pop up that work similarly.
The solution
In principle, it is not necessary to support Grammarly directly in, e.g., emacs. It suffices to provide a convenient interface to check multiple source files at once. To that end, a simple web app could walk through a directory, read all .tex sources, remove all formatting and markup, and expose the files as an HTML document. The user could open that document, run Grammarly, and apply any fixes. The app would have to take the corrected text and reapply formatting, markdown, etc. to save the now fixed source file.
The question(s)
While it is reasonably simple to create such a web application, there are other requirements to be considered: LaTeX parsing (up to "standard" syntax) and a library like HaTeX could deal with parsing and interpretation. But the process of editing needs some thought. Presuming that the removal of formatting can be implemented by only deleting content, it should be possible to take a correction as a diff and reapply it to the formatted document.
In Haskell, is there a data structure for text editing that supports this use case. That is, a representation of text that can store deletions, find diffs, undo deletions, and move a diff accordingly? If not in Haskell, does something like this exist somewhere else?
Bonus question 2: What is the simplest (as in loc required) web framework in Haskell to set up such a web app? It would serve one HTML document and accept updated versions of the text files. No database is required.

Instead of removing and then adding the text formatting, you could parse the souce text into a stream of annotated tokens:
data AnnotatedChar = AC
{ char :: Char
, formatting :: String
}
The following source:
Is \emph{good}.
would be parsed as:
[AC 'I' "", AC 's' "", AC ' ' "\emph{", AC 'g' "", ...
Then, extract only the chars from this list, send them to Grammarly, and get back the result. Now, diff the list of annotated characters with the list of characters you got from Grammarly. This way, you only to deal with a list of characters, but keep the annotations.

Related

How to design plain tex data entry helpers such as autocompletion and syntax highlighting

There are some files like GEDCOM and ADIF that are plain text files, but many people tend to work with them through GUIs. Say I wanted to do data entry on these files directly without any GUI.There are a number of things that make this a little dangerous. Things like misspellings of necessary file-grammar, missing a necessary key, incorrect types for values, etc. There is also something to be said for the additional difficulty of having to type additional characters relative to a GUI.
From what I can tell by thinking about this for 15mins ;) is that having the following would make the job of plain text entry much easier.
A formatter. I think of something like Python's Black which is a CLI that can be run on a file. It can let users know of bad formatting and can provide fixes.
A linter. I think of flake8 to ensure the styling matches the standard.
Autocomplete. The file type examples I showed above have a dictionary of key words. To save on typing it would be nice to have autocomplete.
Syntax Highlighting. Having a way to know if my data entry is good or bad in real-time would be helpful.
It seems like requirements 1-2 could be solved by making a file specific CLIs that combs through plain text files.
Requirement 4 seems IDE specific. vim and vscode allow users to make syntax highlighting plugins. The problem is that this is normally solved by connecting to a language server. When you are not looking for a language server, but for key words and proper values in a plain text file how does let their IDE know that to look for? Is this just a regex soup solution or is there a better way?
Requirement 3 may also be IDE specific, but the same question applies as for requirement 4. When there is not a language server how can I let an IDE know what/how to autocomplete?
Any examples of plain text data entry made easier would be appreciated.
Thanks!

How can I edit a DXF in node.js?

I'd like to make a custom lasered label from a user's input on a website. I have a template dxf file and I'd like to replace placeholder text with the user input. My problem is the dxf file format is very unreadable in its text format. Is there any way to make sense of the numeric data? If not are there any other formats (svg, etc) that would be easier to work with?
EDIT: The reason I've found it unreadable in terms of text is that the program (Solidworks) converted the text to curves.) At this point I'm trying to figure out how to prevent that.

AutoDesk was nice enough to document DXF syntax in great detail. Spend a couple hours understanding the documentation from the link below, and I think you will find it quite easy to parse and edit using code.
To just replace some placeholder text, it should be just as simple as reading the DXF file into a string (a dxf file is no different than a txt file), performing a text replace operation and saving it back to file. Just make sure that your placeholder text is very unique and is not contained in any of the key words in the document below (otherwise your DXF file will get corrupted). Something like "PlaceHolderText" will do the trick.
http://images.autodesk.com/adsk/files/autocad_2012_pdf_dxf-reference_enu.pdf
Edit: More Info
I do a lot of work with AutoDesk Inventor which is in direct competition with SolidWorks, so they are effectively the same tool. We were faced with a similar problem of needing to place text onto sheet metal flat pattern DXFs that came out of Inventor in order to identify the part, but Inventor simply could not do it (see, exactly the same!). One of our developers had the idea to place a very precise geometry punch onto the flat pattern. After the DXF was generated he wrote some code that parsed the DXF file and replaced the geometry with a text entity. More specifically we used a triangle with sides having each length defined to something like the 7th decimal place. You can then use one of the vertices of the triangle to position the text, including rotation. This process would be automatic, so once you write the code with the help of the document above (which won't take the long), it will just work. If your engraver can handle text the way you want it, I'd say this is a very good solution. We generate hundreds of parts every day using this code. Hope this helps.

How does PDF securing work?

I'm curious how does PDF securing work? I can lock PDF file so system can't recognize text and manipulate with PDF file. Everything I found was about "how to lock/unlock" however nothing about "how does it work". Is there anyone who could explain it to me? Thx

The OP clarified in a comment
I mean lock on text recognition or manipulation with PDF file. There should be nothing about cryptography imho just some trick.
There are some options, among them:
You can render the text as a bitmap and include that bitmap in the PDF
-> no text information.
Or you can embed the font in question using a non-standard encoding without using standard glyph names
-> text information in an unknown encoding.
E.g. cf. the PDF analysed in this answer.
A special case: make the encoding wrong only for a few characters, maybe just one, probably a digit. This way an unalert person thinks everything was extracted ok, and only when the data is to be used, the errors start screwing things up, something which especially in case of wrong digits is hard to fix. E.g. cf. the PDF analysed in this answer.
Or you can put text in structures where text extraction software or copy&paste routines usually don't look, like creating a large pattern tile containing the text for some text area and filling the area with the matching pattern color.
-> text information present but not seen by most extractors.
E.g. cf. this answer; the technique here is used to make the text of a watermark non-extractable.
Or you can put extra text all over the page but make it invisible, e.g. under images, drawn in rendering mode 3 (invisible), located in some disabled optional content group (layer), ... Text extractors often do not check whether the text they extract actually is visible.
-> text information present but polluted by garbage text bits.
...

creating tags for a script language for easy browsing in vim

I use ctags+Vim for a lot of my projects and I really like the ability to easily browse through large chunks of code quickly.
I am also using Stata, a statistical package, which has a script language. Even though you can have routines in the code, its code tends to be series of commands that perform data and statistics operations. And the code files can be very long. So I always find myself in need of a way to browse it efficiently.
Since I use Vim, I can use marks. But I was wondering if I could use ctags to do this. That is, I want to create a tag marker which (1) won't cause a problem when I run the script (2) easy to introduce to ctags.
Because it is supposed to not break the script, it needs to be a comment. In Stata, comment lines start with * and flow comments can be made by /* ..... */.
It would be great, for example, have sections in the code, marked by comments:
* Section: Data
And ctags picks up "Data Manipulation" as the tag. So I can see a list of sections and jump to them easily without the needs for creating marks.
Is there anyway to do this? I'd appreciate any comments.

You need a way to generate a tags database for your Stata files. The format is simple, see :help tags-file-format. The default tags program, Exuberant Ctags can be extended with regular expressions (--langmap, --regex); that probably only yields an approximate parsing for complex languages, but it should suffice for custom section marks; maybe you could even directly extract interesting language keywords.

I want to change the way text is represented internally in ANY Text Editor

I want to use a algorithm to reduce memory used to save the particular text file.I don't really know how text is stored but i have an idea in mind.
Would it be better to extend a open source text editor (if yes than which one) or write a text editor myself.
It would be nice if someone could also give me a link or tutorial to some basics on how text editors work and the way data is stored.
Edited to add
To clarify, what I wanted to do is instead of saving duplicates of a word make a hash table and store the address where it needs to be placed.
That way I wouldn't be storing the duplicates.
This would have become specific to a particular text editor.
Update
thanks everyone I got what all of you'll are trying to say. Anyways all i wanted to do is instead of saving duplicates of a word make a hash table and store the address where it needs to be placed.
This was i wouldn't be storing the duplicates.
Yes and this would have become specific to a particular text editor. never realized that.

I want to use a algorithm to reduce memory used to save the particular text file
If you did this you would no longer have a text editor, but instead you would have created some sort of binary file editor.
The whole point of the text file format is that it is universal, meaning any text file can be open in any other text editor.

Emacs handles compression transparently. Just create a text file with .gz extension. Emacs will automatically compress contents of the file during save operation, and decompress when you open the file next time.

Text is basically stored as-is. i.e., every character takes up a byte or two (wide chars), and there is no conversion done on it when it's saved. It might add an end-of-file character or something though. Don't try coming up with your own algorithm to compress these files. That's why zip-files and other archives were created. They're really good at compressing text. If you wanted to add these feature to your text-editor, you'd have to add some sort of post-save hook to zip it, and then put a hook on the open command to unzip it. Unless you wanted to do it by hand every time. Don't try writing the text editor yourself from scratch, unless (maybe) you're writing notepad. Text editors with syntax highlighting aren't very easy to make, even with the proper libraries. I'd say write a plugin for something like Visual Studio or what have you. Or find an open-source text editor.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string