How to remove a word from Aspell's British dictionary

When I check my texts with aspell (with the British dictionary), the word "froward" is accepted (because it is a real English word). However, I never use it, so in my texts "froward" is always a misspelling of "forward". Therefore, I want aspell to reject "froward".
How can I remove a word from Aspell's standard dictionary? Is there a way to create a "blacklist" of words? There is no way to mark it in .aspell.en.pws, because the personal dictionary only contains a "whitelist".

You can't.
Aspell does not support it.
Submit an issue or a pull request on the official repo if you care.
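Since Aspell itself offers no blacklist, one workaround is a separate pass over the text, outside Aspell, that flags words you never want to see. Below is a minimal sketch of that idea; the `BLACKLIST` set and the `find_blacklisted` function are hypothetical names made up for illustration and are not part of Aspell.

```python
import re

# Hypothetical personal blacklist: valid English words that, for this
# author, are always typos. This is not an Aspell feature.
BLACKLIST = {"froward"}


def find_blacklisted(text, blacklist=BLACKLIST):
    """Return (line_number, word) pairs for every blacklisted word in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Treat runs of letters and apostrophes as words.
        for word in re.findall(r"[A-Za-z']+", line):
            if word.lower() in blacklist:
                hits.append((line_number, word))
    return hits
```

You would run this next to the normal aspell check, not in place of it: aspell still catches real misspellings, and the extra pass catches the valid words you want rejected.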

Related

Do I need to add updated phoneme sequence of words to .dict file while adapting AM using cmusphinx?

I am trying to adapt the en-us acoustic model with Indian English accent recordings. Since many words are pronounced differently in this accent, do I need to add the updated phoneme representation of those words? Currently I am following this link: https://cmusphinx.github.io/wiki/tutorialadapt/#accumulating-observation-counts and nothing is mentioned there about updating your .dict file.
PS: Should I add new words directly to the dictionary?
There is an Indian English model in the downloads; you should use it instead. It comes with an Indian English dictionary.

After using pdftotext: find page of string from txt

I am currently coding in Python and managed to use pdftotext to extract the text from a PDF.
That text file is split up into a list of strings. Using regular expressions, I am able to find the specific words I am interested in. The reason I divide the text into a list is that I want to measure the distance between two specific words, where by distance I mean the number of words between them.
However, after finding the position of the words, I would like to be able to refer back to the original PDF. In detail, I am interested in the page and maybe even the line (if PDF supports this kind of structure) where these words are located.
One idea I have is to run this process for each page of the PDF separately, so that when I find these words I know which page they were on. But this has the big disadvantage that page breaks do not necessarily fall at natural boundaries: I would lose the ability to find the words if they happen to be separated by a page break.
Do you have any idea how to do this in a more sophisticated manner?
You'll need a more sophisticated library than the one you're using. The Datalogics PDF Java Toolkit has several classes that can extract text from a PDF file. The one you use depends on what you want to do with the text after extraction. The ReadingOrderTextExtractor will create a list of lists that will allow you to extract the text and examine the content of paragraphs, sentences within those paragraphs, and words within those sentences. You'll not only be able to tell the distance between the words but also whether they are in the same sentence or paragraph. Once you've found a Word object, you can then find both its location on the page, allowing for highlighting, and the page number it's on.
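If you would rather stay with pdftotext, there is a lighter option. Its text output separates pages with a form feed character (`\f`), so you can tag every word with its page number while keeping one continuous word list, and distances can then span page breaks. A minimal sketch, assuming the whole pdftotext output is already in a string; the function names are made up for illustration, and words are split on whitespace only, so punctuation stays attached to them:

```python
import re


def words_with_pages(text):
    """Return (word, page_number) pairs in reading order.

    Assumes pages are separated by form feed characters, as in
    pdftotext output.
    """
    tagged = []
    for page_number, page in enumerate(text.split("\f"), start=1):
        for word in re.findall(r"\S+", page):
            tagged.append((word, page_number))
    return tagged


def distance_and_pages(tagged, first, second):
    """Locate `first`, then the next `second` after it.

    Returns (words_in_between, page_of_first, page_of_second),
    or None if either word is missing.
    """
    words = [word for word, _ in tagged]
    try:
        i = words.index(first)
        j = words.index(second, i + 1)
    except ValueError:
        return None
    return j - i - 1, tagged[i][1], tagged[j][1]
```

Because the word list is continuous, two words on either side of a page break are still found, and each one reports its own page. This gives page numbers only; line positions would need extra bookkeeping.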

Lucene, Change Search on one file

A question about Lucene:
I have a file that I would like to index and search with different analyzers. My goal is to be able to change how I search.
In one case, I would like to search for an exact phrase with punctuation, e.g. "one,two", and only return exact matches including the punctuation.
I would also like to be able to search for the exact phrase without punctuation, e.g. "one two", as with the StandardAnalyzer.
Essentially, I need to change the search functionality on one field.
How can I change the search on the same file? I've tried using two analyzers (standard and whitespace); however, this makes the indexing time very long.
My second thought is to use just a WhitespaceAnalyzer and, when searching, pass a query that further tokenizes each string if needed. However, I am not sure which API offers this, if any does.
Also, is there a good reading on how analyzers and tokens work and are implemented?
Thanks
What do you mean by "tried two analyzers"? Did you duplicate the content into two separate fields with different analyzers? That would be my suggestion.
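To show why two fields help, here is a small pure-Python sketch of the idea. It does not use Lucene: the two tokenizers only roughly approximate what WhitespaceAnalyzer and StandardAnalyzer do, and the field names `content_exact` and `content_std` are hypothetical.

```python
import re


def whitespace_tokens(text):
    # Split on whitespace only, so punctuation stays inside the tokens.
    return text.split()


def standard_like_tokens(text):
    # Rough approximation of a standard analyzer: lowercase the text and
    # keep only runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(text):
    # The same content is stored twice, once per tokenization.
    return {
        "content_exact": whitespace_tokens(text),
        "content_std": standard_like_tokens(text),
    }


def phrase_match(tokens, phrase_tokens):
    # True if phrase_tokens occurs as a contiguous run inside tokens.
    n = len(phrase_tokens)
    return n > 0 and any(
        tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1)
    )


def search(doc, field, query):
    # Analyze the query the same way the chosen field was analyzed.
    if field == "content_exact":
        analyze = whitespace_tokens
    else:
        analyze = standard_like_tokens
    return phrase_match(doc[field], analyze(query))
```

With the text "count one,two three" indexed this way, the query "one,two" matches on `content_exact` while "one two" does not, and both queries match on `content_std`. Picking the field at query time is what switches the search behaviour.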

What kind of sign is "‎" and what is it used for

What kind of sign is "‎" and what is it used for (note there is an invisible sign there)?
I have searched through all my documents and found a lot of them. They messed up my htaccess file. I think I got them when I copied web addresses from Google to set up redirects. So, as a warning: you may want to search through your own documents for this one too :)
It is U+200E LEFT-TO-RIGHT MARK. (A quick way to check out such things is to copy a string containing the character and paste it in the writing area in my Full Unicode input utility, then click on the “Show U+” button there, and use Fileformat.Info character search to check out the name and other properties of the character, on the basis of its U+... number.)
The LEFT-TO-RIGHT MARK sets the writing direction of directionally neutral characters. It does not affect e.g. English or Arabic words, but it may mess up text that contains parentheses for example – though for text in English, there should be no confusion in this sense.
But, of course, when text is processed programmatically, as when a web server processes a .htaccess file, such marks are character data like any other and can make a big difference.
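If you want to hunt for such characters programmatically, Python's standard `unicodedata` module can name them. The sketch below reports every character in the Unicode "Cf" (format) category, which includes U+200E, and can strip them. The function names are made up for illustration, and the sample line is a hypothetical .htaccess entry.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, code point, name) for each invisible format character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text):
    """Return text with all format (Cf) characters removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical .htaccess line with a LEFT-TO-RIGHT MARK pasted after "/old".
sample = "Redirect /old\u200e /new"
print(find_format_chars(sample))
print(strip_format_chars(sample))
```

Be careful when stripping: the Cf category also covers characters such as the soft hyphen and the zero width joiner, which are legitimate in some texts. For a configuration file like .htaccess, removing them is normally what you want.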

Semantic difference between "Find" and "Search"?

When building an application, is there any meaningful difference between the idea of "Find" vs "Search" ? Do you think of them more or less as synonymous?
I'm asking in terms of labeling for application UI as well as API design.
Finding is the completion of searching.
If you might not succeed in finding something, call the feature "Search". For example, text search in an editor can fail due to no matches; calling it "Find" would then be lying.
On the other hand: in an established job searching site, you can say "Find a PHP job" because you know that for (almost) anything your users want, there will be offerings. This also makes it sound confident, positive and energetic.
According to Steve Krug in Don't Make Me Think, when talking about usability for a publicly-facing web site, use the word Search for a search box and nothing else. (He specifically prohibits "Find", "Quick Find", "Quick Search", and all variations.)
The rationale is that "Search" is the most commonly understood term, so it's what people will look for when they aren't thinking, and you don't want your users to have to think (at all).
I would say that "find" is focused on getting a single, exact match. As in the example above, you "find" the perfect PHP job.
OTOH, you "search" for jobs that meet your criteria. Searching is what you do when you want to graze through several results. "Search" returns pages of results. "Find" is closer to "I'm feeling lucky."
Of course, the terms get used interchangeably sometimes. But, I think that's the essence of the difference.
In many applications, find means "find on the current page/screen", while search means "search the entire database/Internet." Web browsers, online help, and other applications seem to make this distinction.
Within most applications...
Find typically refers to locating text within the document at hand and jumps to the next occurrence.
Search typically refers to locating multiple documents (or other objects) and returns a list.
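In API terms, that distinction could look something like the sketch below. The function names and the dict-of-documents shape are assumptions made for illustration, not an established convention.

```python
def find_next(text, needle, start=0):
    """Find: index of the next occurrence at or after `start`, or -1 if none."""
    return text.find(needle, start)


def search(documents, needle):
    """Search: names of all documents that contain `needle`.

    `documents` is assumed to map document names to their text.
    """
    return [name for name, body in documents.items() if needle in body]
```

`find_next` works inside one document and yields a single position you can jump to, then call again from there; `search` works across a collection and yields a list of results.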
I wrote the built-in Find command in Acrobat 1.0 and worked on the full text Search engine for Acrobat 2.0 and 3.0.
Most software at that point that handled large amounts of text had a way to locate an exact match to a single word or phrase and called it Find/Find Next. This is what we called it in Acrobat 1.0. We knew from the start that this wasn't enough to handle entire repositories of documents, so we needed a way to scan across a whole set. We couldn't use Find since that was already in the UI and had established behavior, so we settled on Search. The decision was based on little more than the relatively small set of common words that convey the action.
Even harder is to come up with a reasonable icon for it. Our initial take was to use something similar to the old Yellow Pages logo, but the lawyers shot that down: it was too close. We couldn't use a magnifying glass as we had zoom functions tied to that. We went with binoculars.
I don't think that there is any difference.
But then again, I'm Portuguese. :P
Find = Discover exact
Example: We write "Please find attached" in an email. We don't write "Please search attached".
Search = Discover exact + Related match
Example: Google Search
"Seek and ye shall find"
"Search and you will find"
One angle that (surprisingly) no one has mentioned is that in English, when you say you search something, that something is the thing you're searching within, not the thing you're trying to find. So unless you add the word 'for' (as in, to search for something), the two words are fundamentally different.
It becomes obvious with an example:
Find the room.
Search the room.
Two very different tasks! The first defines the object of your search. The second defines the scope of your search.
That's not completely irrelevant when talking about UIs. If your app has a search feature where the user can specify both the source and the object of their search, you might choose to use the words this way. For example:
Search: Current document
Find: "positive and energetic"
Yes, as some others have pointed out, the word 'Find' does imply a successful search, but let's not start calling app designers liars for using it when success isn't guaranteed. It's become a pretty standard term for searching a document for a particular string.
I think search is more generic and more suitable for text search. Find sounds more like 'find a specific record or a group of records'
After searching, you find something.
Search for an answer on Stack Overflow so that you may find it.
For me Find is the success of a Search, that is to Find is to identify the location of something that's known to exist.
Search should always be used when you have no control on what the user is looking for.
Find talks about a specific one.
Search does not talk about a specific one.
Did you find the picture I requested yet?
No? Please search on the internet. I need to present it in an hour.
Another example:
Please find the attachment in this email.
(or)
You'll find the attachment below.
(or)
Please find attached.
Here, we use "find" because it is a specific document which is attached to the email.
We don't use "search" here, as there is nothing to search for in a larger domain.
Search is the primary interface to the Web for many users. Search should be global (not scoped to a subsite) and available from every page; booleans should be made intimidating, since users usually use them incorrectly.
Read this: https://www.nngroup.com/articles/search-and-you-may-find/

Resources