I am writing a function (in Python 3) which will allow someone to revise a set text.
Set texts (in this situation) are pieces of text (I am doing the Iliad, for example) which you need to learn for an exam. In this function I am focusing on a user trying to learn the translation of the text off by heart.
In order to do this, I want to store the translation in a text file; the user can then test themselves by typing it out, and the program will check whether each word is correct against the known, correct translation.
I know I could simply use input() for this, but it is inadequate: the user would have to type the entire text, or one small chunk at a time, for it to work, and I want to correct them as they type so that they remember their mistakes more easily.
I have not written any code yet as how I write the rest of the program will depend on how I code this part.
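To make the idea concrete, the kind of word-by-word check I have in mind is roughly this (only a throwaway sketch that still uses input(), which is exactly the part I want to replace; translation.txt is just a placeholder file name):

with open("translation.txt", encoding="utf-8") as f:
    correct_words = f.read().split()

typed_words = input("Type the translation: ").split()

for position, correct in enumerate(correct_words[:len(typed_words)]):
    typed = typed_words[position]
    if typed.strip(".,;:").lower() != correct.strip(".,;:").lower():
        print(f"Word {position + 1}: you typed '{typed}' but the text has '{correct}'")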
I need to validate "text box" which does not allow special character,Number more than 10000, and letters,
So my question is: how do I write this in the Gherkin language?
Gherkin is not a programming language, so it cannot perform validations itself, and you cannot inject variables into it. However, you can perform the validation in the step definition file and tie it to the Gherkin step.
Scenario: I verify that the text box does not contain more than 100 characters
Given I see the text box
And I verify the text box does not contain more than "" characters
The step definition file then checks (in pseudocode):
assert characters.length <= arg
arg is the argument you pass inside the double quotes in Gherkin.
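For example, if the steps were implemented in Python with behave (just a sketch under that assumption; the file name and the context.text_box_value attribute are made up, and a real step would read the value from the application under test):

# features/steps/text_box_steps.py (hypothetical path)
from behave import given, step

@given("I see the text box")
def step_see_text_box(context):
    # A real implementation would locate the element (e.g. with Selenium)
    # and store its current contents for later steps.
    context.text_box_value = "value read from the text box"   # placeholder

@step('I verify the text box does not contain more than "{limit:d}" characters')
def step_verify_length(context, limit):
    # 'limit' is the number passed inside the double quotes in the Gherkin step.
    assert len(context.text_box_value) <= limit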
Your task is to
Find the value in the text box
The way to do this varies from environment to environment; Selenium may be a good way to interact with your system if it is a web application (see the sketch after this list)
Save it in some variable
Validate it against some known value in a step
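For the first two points, assuming a web application, a small Selenium (Python) sketch; the URL and element id are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # or Firefox(), Edge(), ...
driver.get("https://example.com/form")           # placeholder URL

# Find the text box and read its current value.
text_box = driver.find_element(By.ID, "my-text-box")   # placeholder element id
value = text_box.get_attribute("value")

# Validate it against the known limit from the Gherkin step.
assert len(value) <= 100

driver.quit()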
I have written a few blog posts on how to use Cucumber. This blog post from 2015 may be a reasonable start; the Cucumber version it uses is a bit outdated, but the process of implementing steps is still valid.
Executable specifications (whether they are in Gherkin format or not) are meant to describe behaviour in terms the business people care about. I am pretty confident that not a single business person would talk about how a single text box should behave.
My advice is to see what the actual business value is about and write the scenario from that perspective. Then the actual testing on this particular text box might not be described in the scenario, but it can be part of the underlying steps implementation.
In other words, should the text box suddenly allow numbers up to a million, the business value probably doesn't change. Therefore the scenario should not change, but the test code behind it might.
Okay, so I was thinking today about Minecraft (a game which so many of you are familiar with, I'm sure), and while my question isn't directly related to the game, I find it much simpler to describe my question using the game as an example.
My question is: is there any way a kind of "seed", or string of characters, can be used to recreate an instance of a program (not in the literal programming sense) by storing a code which, when re-entered into the program as a string at run-time, could recreate the data it once held, in fields, text boxes, canvases, and so on, exactly as it was?
As I understand it, Minecraft takes the string of ASCII characters you enter, all of which are really just numbers, and performs a series of operations on it which evaluate to some kind of hash, a finite number. This number (again, as I understand it) is the representation of the string you entered. It makes sense that a given string, when parsed by this algorithm, will always evaluate to the same hash: 1 + 1 always equals 2, so a seed's value must always come out the same in the end. In doing so you have the ability to replicate worlds exactly, by entering this sort of key which evaluates the same on every machine.
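If I understand correctly, something like Java's String.hashCode() is used for text seeds; purely to illustrate the "same string always gives the same number" idea, here is that hash reimplemented in Python (not Minecraft's actual code, just my illustration):

def java_string_hash(s):
    # Java's String.hashCode(): h = 31*h + character code, wrapped to a signed 32-bit integer.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hash("Glacier"))   # the same string...
print(java_string_hash("Glacier"))   # ...always produces the identical number, on any machine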
Now, if we can exactly replicate worlds like this, is it possible to take the idea to a more abstract level, like the following?
Say you have an application like Microsoft Word. Word saves the data you have entered as a file on your hard drive: it holds formatting data, the strings you've entered, the format of the file... all of that in a physical file. Now imagine that when you entered your essay into Word, instead of saving it and bringing your laptop to school, you clicked on "parse", and instead of creating a file you were given a hash code. Now you go to school knowing you have to print it, so you log onto a computer and open Word. Instead of "open" there is now an option called "evaluate"; you click it, enter the hash your other computer produced, and it recreates the exact essay you wrote.
Is this possible? And if so, are there obvious implementations of this that I simply am not thinking of, or that are so seemingly part of everyday life that I don't even recognize them? And finally, if it is possible, what methods and algorithms would go into such a thing?
[EDIT]
I had to do some research on the anatomy of a seed, and I think this explains it well:
"The limit is 32 characters or, for a numeric seed, 19 digits plus the minus sign. Numeric seeds can range from -9223372036854775808 to 9223372036854775807, which is a total of 18446744073709551616. Text strings entered will be "hashed" to one of the numeric seeds in the above range. The "Seed for the World Generator" window only allows 32 characters to be entered and will not show or use any more than that."
BUT looking back on it, lossless compression IS EXACTLY what I was describing. After re-reading the wiki page I remembered that (you are very correct) the seed only takes part in world generation; the final data is stored as a "physical" file on the HDD, which (again, you are correct) is raw, uncompressed data.
So in retrospect, I believe I was describing lossless compression, trying in my mind to figure out how the seed was able to replicate exactly the same world, forgetting that the seed is only responsible for the generation, not for the saving or compression of the data.
So thank you for your help, guys! It's really appreciated. I believe we can call this one solved!
There are several possibilities for obtaining this "string" that recovers your data. However, they're not all applicable in every context.
An actual seed, which initializes, for example, a pseudo-random number generator, and then allows you to recreate the same sequence of pseudo-random numbers (see this question).
This is possibly similar to what Minecraft relies on, because the whole process of how to create a world based on some choices (possibly pseudo-random choices) is known in advance. Even if we pretend that we have random numbers, computers are actually deterministic, which makes this possible.
If your document were generated randomly then this would be applicable: with the same seed, the same gibberish comes out.
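A minimal Python sketch of that idea, using the standard library's random module: re-seeding with the same value reproduces the same "random" sequence.

import random

random.seed("my world seed")                       # any string or number can act as the seed
first_run = [random.randint(0, 99) for _ in range(5)]

random.seed("my world seed")                       # seed again with the same value...
second_run = [random.randint(0, 99) for _ in range(5)]

print(first_run == second_run)                     # ...and the same sequence comes out: True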
Some key-value dictionary, or hash map. The values have to be accessible by both sides, and the string is the key that allows you to retrieve the value.
Think, for example, of storing your Word file on an online server; then your key is the URL linking to your file.
Compressing all the information that is in your data into the string. This is much harder, and there are strong limits due to the entropy of the data. See Shannon's source coding theorem for example.
You would be better off (as in, it would be easier) to just compress your file with a usual algorithm (zip or 7z or something else), rather than reimplementing it yourself, especially as soon as your document starts having fancy things (different styles, tables, pictures, unusual characters...)
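A minimal sketch of the usual-compression route in Python, using the standard library's zlib (essay.txt is a placeholder file name; note the result is still far larger than a 32-character "seed" for any real document):

import zlib

text = open("essay.txt", "rb").read()        # placeholder input file
compressed = zlib.compress(text, 9)          # lossless compression
restored = zlib.decompress(compressed)       # byte-for-byte identical to the original

assert restored == text
print(len(text), "->", len(compressed), "bytes")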
With the simple hypothesis of 27 possible characters (26 letters and the space), Shannon himself shows in Prediction and Entropy of Printed English (Bell System Technical Journal, 30:1, January 1951, pp. 50-64, online version) that English carries roughly 2.14 bits of entropy per letter. A 32-character ASCII string holds at most 32 × 8 = 256 bits, which at 2.14 bits per letter encodes only about 120 characters of English text.
While 2.14 bits per letter is significantly better than the 8 bits we use for each ASCII character, it also shows it is very likely impossible to encode an English document in less than roughly a quarter of its size. Then you'd still have to add punctuation, and all the rest of the fuss.
What I'm Doing
I am currently working on creating a SWI-Prolog module that adds tab-completion capability to the swipl-win window. So far I've actually gotten it to where it reads a single character at a time without stopping/returning anything until a tab character is typed. I have also already written a predicate that returns all possible completions of an incompletely typed term by using substring matching on a list of current terms (obtained via current_functor/2, current_arithmetic_function/1, current_predicate/2, etc.) [the predicate used will eventually be based on context].
If you want to see my code, it is here...just keep in mind that I'm not exactly a Prolog master yet (friendly tips are more than welcome).
What I'm Thinking
I realize that when I actually implement my main completion predicate (still unwritten), I'll have to figure out what the last "word" is in the input stream. I'm debating whether I should create a new stream with everything in the input stream so far (so I don't have to change the position in the input stream/go back to the beginning) or write to a string... If I take the second approach, I'll start over on the string whenever a delimiting character is entered (characters that start a new "word", like space, comma, parentheses, operators, etc.), so there won't be any searching through the stream every time tab is pressed.
However, there is another thing: When the user is navigating through and modifying a typed but not-yet-submitted query (via arrow keys and backspace and such), a separate stream is necessary to handle mid-stream completion. A string will do just fine if completion is requested at the end of a stream (handling backspace is as easy as lopping off the last character of the string), but since the string would only contain the current "word", tabber.pl would be at a loss in instances like that. Unless, of course, the current-word string would update and find the current word that the cursor is in as the user navigated and typed mid-stream... (could I use at_end_of_stream(Stream) for that?)
What I'm Asking
How do you think I ought to approach this (string or stream)? The store-to-string method and the make-a-new-stream way both sound like they each have their advantages, so I'm pretty sure the solution will be some sort of combination of both. Any ideas, corrections, or suggestions on accomplishing my goal? (pun intended)
In order to figure that out and really do this correctly, I think I'll also have to know how SWI-Prolog uses the input and output streams in the swipl-win window. (It's obviously accepting input, but does it use the output stream to write to the window as you type [into the input stream]?)
Getting this done without changing the C code underlying the swipl-win.exe console will be hard. This also relates to a thread on the mailing list starting here. The completion caller is in src/pl-ntmain.c, do_complete() for Windows, and in src/os/pl-rl.c, prolog_completion() for the GNU-readline-based completion used on Unix systems.
The first step to make is to lead these two, and the upcoming one described in the referenced thread, back to Prolog using a callback. That requires a small study of the design of the completion interfaces to arrive at a suitable Prolog callback. I guess it should pass in some representation of the entire line and the caret location, and return a list of completions from the caret. With that, anyone can write their own smart completer.
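Purely as an illustration of that interface (in Python rather than Prolog, and with a made-up candidate list standing in for what current_predicate/2 and friends would supply), the callback could amount to something like: take the whole line plus the caret position, extract the word ending at the caret, and return the matching completions.

import re

CANDIDATES = ["member", "memberchk", "msort", "maplist", "must_be"]   # stand-ins for current_predicate/2 results

def complete(line, caret, candidates=CANDIDATES):
    # Grab the trailing run of "word" characters immediately before the caret.
    prefix = re.search(r"[A-Za-z0-9_]*$", line[:caret]).group(0)
    if not prefix:
        return []
    return [c for c in candidates if c.startswith(prefix)]

# Example: the caret sits right after "mem" in the middle of an edited query.
print(complete("foo :- mem, bar.", 10))   # -> ['member', 'memberchk']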
I am working on a website conversion. I have a dump of the database backend as an sql file. I also have a scrape of the website from wget.
What I'm wanting to do is map database tables and columns to directories, pages, and sections of pages in the scrape. I'd like to automate this.
Is there some tool or a script out there that could pull strings from one source and look for them in the other? Ideally, it would return a set of results that would say something like
string "piece of website content here" on line 453 in table.sql matches string in website.com/subdirectory/certain_page.asp on line 56.
I don't want to do line comparisons because lines from the database dump (INSERT INTO table VALUES (...) ) aren't going to match lines in the page where it actually populates (<div id='left_column'><div id='left_content'>...</div></div>).
I realize this is a computationally intensive task, but even letting it run over the weekend is fine.
I've found similar questions, but I don't have enough CS background to know if they are identical to my problem or not. SO kindly suggested this question, but it appears to be dealing with a known set of needles to match against the haystack. In my case, I need to compare haystack to haystack, and see matching straws of hay.
Is there a command-line script or command out there, or is this something I need to build? If I build it, should I use the Aho–Corasick algorithm, as suggested in the other question?
So your two questions are 1) Is there already a solution that will do what you want, and 2) Should you use the Aho-Corasick algorithm.
The first answer is that I doubt you'll find a ready-built tool that will meet your needs.
The second answer is that, since you don't care about performance and have a limited CS background, you should use whatever algorithm you find simplest to implement.
I will go one step further and propose an architecture.
First, you need to be able to parse the .sql files in a meaningful way: something that goes line-by-line and returns table name, column name, and value. A StreamReader is probably best for this.
Second, you need a parser for your webpages that will go element-by-element and return each text node and the name of each parent element, all the way up to the html element and its parent filename. An XmlTextReader or similar streaming XML parser, such as SAXON, is probably best, as long as it will operate on non-valid XML.
You would need to tie these two parsers together with a mutual search algorithm of some sort. You will have to customize it to suit your needs. Aho-Corasick will apparently get you the best performance if you can pull it off. A naive algorithm is easy to implement, though, and here's how:
Assuming you have your two parsers that loop through each field (on the one hand) and each text node (on the other hand), pick one of the two parsers and have it go through each of the strings in its data source, calling the other parser to search the other data source for all possible matches, and logging the ones it finds.
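A minimal Python sketch of that naive approach, using a regular expression on the dump and BeautifulSoup on the scraped pages (the file names, the crude INSERT parsing, and the minimum-length filter are all placeholder assumptions; a real SQL dump deserves a proper parser):

import re
from pathlib import Path
from bs4 import BeautifulSoup

def sql_values(dump_path):
    # Crude extraction of single-quoted string values from INSERT statements.
    for lineno, line in enumerate(Path(dump_path).read_text(errors="ignore").splitlines(), 1):
        if line.lstrip().upper().startswith("INSERT"):
            for value in re.findall(r"'((?:[^'\\]|\\.)+)'", line):
                if len(value) > 20:                    # skip short values that would match everywhere
                    yield lineno, value

def page_texts(html_path):
    # Every text node of a scraped page, whitespace-stripped.
    soup = BeautifulSoup(Path(html_path).read_text(errors="ignore"), "html.parser")
    return list(soup.stripped_strings)

def naive_match(dump_path, html_paths):
    values = list(sql_values(dump_path))               # read the dump once
    for html_path in html_paths:
        texts = page_texts(html_path)
        for lineno, value in values:
            if any(value in text for text in texts):
                print(f'string "{value[:40]}..." on line {lineno} in {dump_path} '
                      f"matches content in {html_path}")

naive_match("table.sql", Path("website.com").rglob("*.asp"))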
This cannot work, at least not reliably. In the best case you would match every piece of data to its counterpart in your HTML files, but you would still have many false positives: for example, user names that are actual words.
Furthermore, text is often manipulated before it is displayed. Sites often capitalize titles or truncate texts for previews, etc.
AFAIK there is no such tool, and in my opinion there cannot exist one that solves your problem adequately.
Your best choice is to get the source code the site uses or used and analyze it. If that fails or is not possible, you will have to analyse the database manually. Get as much content as possible from the URLs and try to fit the puzzle together.
How would you go about parsing a string of free-form text to detect things like locations and names, based on a dictionary of locations and names? In my particular application there will be tens of thousands of entries, if not more, in my dictionaries, so I'm pretty sure simply running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken this falls within the field of natural language processing, and more specifically named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with it, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
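For the dictionary/fuzzy side of the question, a small Python sketch along those lines, using NLTK's POS tagger to pull out proper nouns and difflib to fuzzy-match them against a gazetteer (the dictionary contents are placeholders; difflib uses a similarity ratio rather than a strict edit-distance cutoff, and NLTK's tokenizer and tagger models need to be downloaded first):

import difflib
import nltk   # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

LOCATIONS = {"Springfield", "Portland", "San Francisco"}   # placeholder gazetteer
NAMES = {"Jonathan", "Elisabeth", "McArthur"}              # placeholder name list

def candidate_entities(text):
    # Yield the proper nouns (NNP/NNPS) the tagger finds.
    tokens = nltk.word_tokenize(text)
    for word, tag in nltk.pos_tag(tokens):
        if tag in ("NNP", "NNPS"):
            yield word

def fuzzy_lookup(word, dictionary, cutoff=0.8):
    # Entries roughly similar to 'word'; lower the cutoff for looser matches.
    return difflib.get_close_matches(word, dictionary, n=3, cutoff=cutoff)

text = "Jonathon drove from Porland to San Francisco last night."
for word in candidate_entities(text):
    hits = fuzzy_lookup(word, list(LOCATIONS | NAMES))
    if hits:
        print(word, "->", hits)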