Open source spell check - nlp

Was evaluating adding spell check to a product I own. As per my research the major decisions that need to be made:
The library to use.
Dictionary( this can be region specific, British english, American etc).
Exclusion lists. Anytime a typo is detected its possible that its not a typo but is
verbiage specific to the user. At this point the users should be given the ability to
add this to his custom exclusion list.
Besides a per user custom list also a list of exclusion based on the user space of the
clients of the tool. That is terms/acronyms in the users work domain. For example FX will not be a typo for currency traders.
The open questions I had are listed below and if I could get input into them that would be very useful.
For 1, I was thinking of hunspell, whcih is the open source library offered under MPL and is used by firefox and OpenOffice family of products. Any horror stories out there using this?
Any grey areas with the licensing? The spell checking will happen on a windows client.
Dictionaries are available from a variety of sources some free under MPL while some are not. Any suggestions on good sources for free dictionaries.
Multi lingual support and what needs to be worked out to support them?
For 4, how are custom dictionaries kept in sync with the server side and the clientside? The spell check needs to happen on the clientside so are they pushed down with the initial launch everytime or are they synced up ever so often?

As already mentioned Hunspell is a state of the art spell checker. It is the Open Office, Thunderbird, Firefox and Google Chrome spell checker. Ports to all major programming languages are available. It works with the Open Office Directories, so a lot of languages are supported.

I've used Hunspell for a few things, and I don't really have any horror stories with it. I've only used it with English (American) though, but it claims to work with other languages.
As for licensing, it offers a choice of GPL, LGPL, and MPL. If you don't like the MPL, you can always choose to use the LGPL.

There are several pupular options that widely used: myspell, aspell. Check them.
http://en.wikipedia.org/wiki/MySpell
http://en.wikipedia.org/wiki/GNU_Aspell

Here is a good demonstration by Peter Norvig: I find this simple explanation much more intuitive. Follow the links in the doc as well for more indepth analysis.
http://norvig.com/spell-correct.html

Related

How to organize libraries and links of programming information

I have an email account whose sole purpose it is to store interesting and useful links to programming articles, code, and blog posts. It has become a little knowledgebase of sorts. I can even do a search on it, which is pretty cool.
However, after using this account for a couple of years, I now have 775 links, and it has become this big blob of amorphous information, most of which I have never looked at again. I take comfort in the fact that, if I really needed to, I could find something in there again, if I even remember putting it in there in the first place. But it has developed a "smell," if you will.
How are you organizing your programming library of cool stuff? Do you have a system or tool, and is it better than the way I'm doing it?
I would use something that is made for storing bookmarks. I use delicious.com for all of my bookmarks. The tagging system works perfectly for technology sites because you can tag each page with a specific language or tech abbreviation. This coupled with the Delicious Bookmarks plugin will make it very easy to tag sites and get back to them.
Use one word or abbreviations for languages: java c# vb.net python
Use acronyms for technologies: wpf wcf
I used to use the standard bookmark system in the browser but since I bounce around through various machines and browsers throughout the day I started to use bookmark synchronizers. Both Foxmarks and the one that google came out with. But neither I was completely satisfied with. Plus delicious has a great web interface to it as a decent api to extend for your own purpose.
IMHO, using Evernote to store this information is great.
1) you can go back and search through it easily
2) organize by tags and "notebooks collections"
3) available on multiple platforms (even mobile)
4) available as browser plugins (for direct archiving in-browser)
The only drawback is it's copy-paste functionality is a little lacking (it sometimes doesn't import/display the CSS styles correctly).
Otherwise, it's a great alternative to store web "bookmarks" (and also archive the content at the same time).

Why are wizard dialogs called "wizards"?

I was talking with my non-techie wife tonight. She was talking about how she was training staff to use some new software. The software made heavy use of wizards to accomplish tasks. Her question to me was "Why are wizards called 'wizards?' Are they made by some nerd with an interest in Dungeons & Dragons?"
I realized that, while the "nerd" and "Dungeons & Dragons" were true in my case, I didn't know the origin of of the term "wizard" as it relates to a part of an application that guides a user through some difficult process.
I'm curious to see what thoughts others here have on this great and weighty question.
My impression is that it's related to the meaning of wizard that's similar to "expert". A UI wizard is like a (very simple) expert system. The wizard/"expert" asks you a series of questions to figure out what you want, and then they use their "expertise" to generate a result.
One of the original Wizard interfaces, was with Microsoft Publisher 2.0.
The wizard part came after the last dialog page, where it would 'magically' perform the actions required to achieve the task requested in the wizard, and actually show you how to do it. For example, running the Greeting Card Wizard, would show you how to set the aspect ratio, paper size, etc.
I guess user interface testing showed that not enough people were following the wizard tutorial, and just skipped through it to get the desired result, because this functionality was dropped in later versions of Publisher.
Because they magically guide the user through the process to achieve the users goal.
I believe Microsoft invented and introduced the term, no doubt for marketing related reasons.
I guess because user interfaces that configured things that were previously done manually must have seemed like magic to users. It's a pretty good analogy if you think about it - this little config app is doing many many things with a single "wave of the wand" as it were.

Is it better to do roll-your-own or ready-built forum software?

As part of a wide ranging job for a cystic fibrosis support organization, they'd also like a web site set up and I've decided on Apache running on Linux (due to its security and low cost mostly). Other than (fairly) static content, they also want a forum where people can discuss issues with the condition - it'll be attached to a hospital chain so there'll be plenty of medical staff there who know little about the web.
I can handle all the specific coding and Apache setup since I've done it before but I'm interested in people's opinions as to whether I should roll my own forum software or get a hold of some ready-built stuff. I've not had any experience with forum software but I could generate my own (initially buggy, I'm sure) in a month or so.
It'll require registration and login to leave comments (but guest access just to read) and I'd like it to be 'pretty' (excuse me while I remember damning customers for providing similarly vague requirements specs :-) but not necessarily infinitely-configurable with skins/themes/etc.
If anyone has some compelling reasons (and experience with specific products that can provide what I need), I'd be interested in hearing about them. Alternatively, does anyone have any 'gotchas' they experienced while coding their own forum software?
Advantages to rolling your own:
a non-standard custom-built system means you'll be less prone to "standard" attacks (e.g.: a vulnerability in PunBB) since bad guys tend to bother with exploit-hunting only on widely-deployed systems (more return on their investment)
absolute control over how your system works and looks
you'll learn a lot
Disadvantages:
you'll repeat mistakes other people have already solved
it'll take you longer to get up and running
long-term it'll be more maintenance (since you have to fix bugs & add features yourself).
you can't "leverage the community" -- if you choose an off-the-shelf forum that has a plugin system then there's a whole bunch of community add-ons that won't be available for your custom forum software.
There's a GIANT list of forum software on wikipedia -- there's most likely something in there that will suit your needs that you can get up and running quickly.
IMHO the old "don't build what you can buy" adage applies to this (well, the web 2.0 version is obviously "don't build what you can download"). Have a look around at the available forum software, pick one that covers 99% of your needs and tweak it to do the rest.
If you still want to build your own forum software that'll probably be a cool side project but if the job is to get a forum up and running, then go and download one - don't try to mix up the desire to do cool stuff and the day job unless the day job is just to do cool stuff only.
One of the best-kept secrets on the internets is a little gem called FUDforum, by Ilia Alshanetsky.
And yes, it's the same Ilia who wrote xDebug's original profiler code, improved the caching in MMcache, fixed several security bugs in libmcrypt, and who was the release manager for the PHP language from 4.3.3 to 4.3.6+. He is, as my friends in Boston would say, wicked smaart.
Because of this, FUDforum is robust, ridiculously fast and more secure than probably any other part of your web application will ever be. It comes with a neat install script and it has all the features you'll need.
Plus, it's not a high-profile target like phpBB or vBulletin, which means you won't have to worry about spambots constantly banging on the gates.
Having written my own forum software before...
It seems like a simple problem, but when you get into it, you find that there's a lot of little things that you'd like to do nicer, and it takes a lot of time. Mine was cool and all, and I did get paid for it, but if I was doing it over again (which has also happened), I'd use a customizable pre-made solution, and spend all my spare time doing something productive. :)
Forum softwares tend to have rather complex minimum requirements. A few things you are very likely to need do matter what you do:
Forum/thread/post hierarchy;
User system;
Security system (eg user/admin classes and all kinds of restrictions for users);
Gathering statistics;
BBCodes or some other minimized markup language (NEVER allow users to do full HTML);
File uploads and avatars;
Bans and other punishments;
CAPTCHAs;
etc.
Ready made forum systems provide this out-of-the-box and lots more. Setup is mostly easy too. Why do it all over again yourself?
My answer would be: don't reinvent the wheel, there are plenty of fora software out there. My preference would go for RForum if you need only that.
I'd say, don't waste your time. phpBB 3 is pretty stable, usable and feature-rich forum. We use it at work (for our internal discussions), and I really don't have anything bad to say about it.
I'd concur with most of the above posters that since you want something which appears fairly standard, why reinvent something that already exists?
Like any development, creating forum software is probably much harder than it looks! There will be problems solved in the existing software which you haven't even considered.
It's worth adding that if you do require any specific additional functionality, you can always build that on top of an existing solution anyway, which is especially easy if you have the source code (whether open source or commercial).
From the sounds of the website that you are building, there is the potential for the forum to be a highly useful and visible resource, it would be good to go with something that already exists, due to the quality of a lot of the products out there and the rich communities that surround them.
I think that vBulletin, although a paid for product, would suit your needs and give you a great base to build a community on.
vanilla is pretty bare bones and easy to configure, perhaps find a system which is easy to extend vs building everything yourself
Ready built until you have some really unique features needed that can be tied to money it will make you.

One man bugtracker?

Recently I've been doing lots of weekend coding, and have began to really need a bugtracker as things are gaining speed. This is probably the worst case scenario because I basically have to let things cool down over the week,so I simply can't remember the bugs in my head. So far I've been using a text file to jot down bugs,but I'd rather use something a bit better.
The biggest points here are ease of use and very little setup time.Don't want to spend more than an hour learning the basics and trying to install something. Also in my case I'm on a Mac so that would help, but solutions for other platforms are welcomed as they will likely help others.
FogBugz has a student/startup edition that's free indefinitely, for 2 or less users.
Personally, I use Excel. (Wait, come back, I'm not crazy!) For a bigger / team project, I've gotten a ton of mileage out of Bugzilla, but that tends to be kind of overkill for a one-person project.
But, a well-organized spreadsheet, with columns for things like "status", "description", "code module", "resolved date," etc, gets you pretty close to what you'd need for a small project. Sorting a spreadsheet by column isn't anywhere near a search, but its a whole lot better than "find in text file."
Heck, if you use Google docs rather than excel, you can even publish the thing as an RSS feed and get it anywhere.
And, the major advantage is that the setup time and learning curve are both effectively nil.
Addendum: And of course, the instant your "One-Person Bug Tracker" becomes a "Two-Person Bug Tracker" you must switch to something better. Bugzilla, FogBugz, anything. Trust me, I've been there.
Trac or Redmine are both pretty good. I don't know how easy they are to set up on a Mac.
It's worth mentioning that FogBugz also has a free version for up to 2 users, which would suit you. It is hosted so there is no installation and you can use something like Fluid to access it in its own window.
I don't think you need a full blown bugtracker for your scenario.
Try tiddly wiki, store each bug in a tiddler and give them tags like 'open' or 'closed'.
There is no installation required (only one html file), and it's very easy to use.
And platform neutral.
If you're working on a LAMPP stack, then for ease of setup and use I would probably recommend Mantis. It's written in PHP / MySQL and the only installation involved was specifying where the database should be created and what credentials should be used.
Oh, and its FOSS.
I would suggest Omnigroup's Omnifocus - it's an excellent task tracker, and if you just make the mental leap from bug to task, I think it works famously for one man projects as well as being an excellent way to organize your no doubt burgeoning task queue.
Eclipse has a really interesting system--I don't know why so few people seem to know about it.
It's tied in with their to-do list. It gives you the ability to enter bugs with as much or as little info as you like. You can tie it to versioning or an external bug tracker if you like. It's a decent bug tracker in itself.
The real trick is how it works with your source code.
Before you begin work you select a bug from the list. All the time you're coding, it tracks what files you are editing. It can close old tabs for you, and will also highlight areas of the source tree that you have modified a lot.
The nice thing is, you can go back to any bug you've edited an you will get your "Environment" back. Not only all your notes and stuff, but the same tabs will open up and the same sections of code in the navigator will be highlighted.
Also eclipse works with virtually any language, it's not just restricted to Java...
let me put in a good word for ditz - it's a bit bare-bones, but it has the invaluable feature that bugs are checked into your repository. it's also very easy to use once you get used to its way of doing things
You can use fogbugz for free if you're a one man team.
It's super easy to use and quick to learn.
They made it so that bugs are really easy to enter, no mandatory fields.
I'm the author of BugTracker.NET mentioned in another post. If I were looking for a tracker for JUST ONE PERSON with MINIMUM hassle, I'd use FogBugz, because it's hosted. No installation, no need to worry about backups.
But, what are you doing about version control? Don't you have to worry about that too, and backing that up? If so, consider something like Unfuddle or CVSDude where you can get BOTH Subversion and Trac, or Subversion and Fogbugz.
I use Mantis at home and I'm happy with it. It can be a pain in the arse to get it working so you can choose to download a free and ready-made VM installation. Cannot be easier than that,
Maybe a spreadsheet would be the next logical step? I know it sounds really un-sexy, but if you're the only user, you don't have to worry much about others mucking it up, and it adds a few basic features over a text file like sorting. Then if you later need to graduate to something RDBMS-backed, you would likely have a feasible import path. I just know that for me, when working by myself, I don't tend to get around to putting bugs in anything that requires more care and feeding than that (of course when working with others the collaborative needs make a more defined repository a requirement, but that's a different story).
EDIT: After noting the availability of free, hosted access to FogBugz, I'm re-thinking the bar for care and feeding...
RT from BestPractical is great.
I also get a lot of mileage out of just keeping a list of items in a text file with vi, if I can express them all in one line. This is usually for many small todo items on a single component or task.
I've tried bugtracker.net and even though it's a little bit rough on the edges, it's free and was built with ASP.NET:
http://sourceforge.net/project/showfiles.php?group_id=66812
Are you using a source control repository as well? If not, you really should, even though you're only a one-man team.
My personal preference is to use a VMWare Virutal Application (free) that offers no-hassle setup gives you access to both Trac and Subversion. You can find many different virual appliances through searching. Here is one example of getting a Trac/SVN virtual appliance up and running:
http://www.rungeek.com/blog/archives/how-to-setup-svn-and-trac-with-a-virtual-appliance/
Trac is an excellent project management tool that sports a bug tracker, wiki, and integrated source control management. It's adaptable to your needs, and fits me very well personally.
I use bugzilla for this purpose. Plus for me was that it has integration with Eclipse (precisely with Mylyn). FogBuzz has it to but AFAIK it is nonfree.
Plus it sits on my laptop so I can code and add/remove bugs when offline (it was biggest disadvantage of hosted solutions for me)
Installation was not a problem in Ubuntu (and any debian-based distro I suppose).
I dig ELOG in those cases, it's more of a personal blog, but it's easy to handle and install, the data is local on your computer and you can search all entries via fulltext. Always sufficed for me.
If you have a Windows box with IIS and MSSQL (including SQL Server Express), you should look at Bugtracker.net. It is free and open source (you get the source code), and it is extensible.
Even if you are a one man shop, having a free bug tracking system with this much power will allow you to grow over time, because it is fairly easy to add future users into the system.
You can also customize it for the look of your organization, business or product.
Ontime 2008 by Axosoft is free for a single user licence. It's industrial strength and will give you alot more that just bug tracking!
http://www.axosoft.com
Jira which now has free personal licenses.
I am using leo for this purpose. To be more specific, its cleo plugin.
Of course you might need to spend some time to get used to leo, but it will pay off.
A flat text file is just a list, an Excel spreadsheet is a two-dimensional list.
leo lets you keep the data in a tree! And it also has clones.

How to manage application resources?

We are developing a web application which is available in 3 languages.
There are these key-value pairs to translate everything. At this moment we use Excel (key, german, french, english) for this. But this does not work well ... if there is more than 1 person editing this file, you have no chance to automatically merge the different files.
Is there a good (and free) tool which can handle this job?
--- additional information ---
(This is a STRUTS application) But the question is how to manage these kinds of information in general (or at least in an conveinient way, which also supports multiple users editing this single file ("mergeable" filetypes))
Why not use gettext and manage separate .po files? See that blog entry.
If you can store this information in plain text then you will be able to use a version control system like subversion to help you with merging changes. Subversion is free.
The free guide (the "Red Book") to subversion gives a fairly good explanation of how this kind of merging works.
http://svnbook.red-bean.com/en/1.5/svn.basic.vsn-models.html#svn.basic.vsn-models.copy-merge
EDIT: Another thought - if you really want to stay using a spreadsheet - Google Docs supports simultaneous editing of a spreadsheet. You could import your existing spreadsheet and get your multi-user merging wishes for free with very little change to how you work.
Good Question.
There are some "Best Practice" depending on what you actually code in (java, ms-windows c#).
I solved this (but I think there must be a better way) by using a SQL db instead of excel file, and a wrote a plug for VS (VB6,........,..., emacs) that was able to insert new keys into the db without going to round trip with version control. The keys are the developers name of what they think is a best guess for a label. (key => save, sv => "spara", no => "", en => "save").
This db can then be generated as a module, class, obj, txt, to appropriate code(platform)
and can be accessed, depending on the ide, so in c#, bt,label = corelang.save;
Someone else can then do all the language stuff, and then we just update the db and rerun the generation to the platform resources.
After years of seeing localization done, including localization at large companies like Sony. I can only say the "standard" is Excel :)
There are tons of good ideas around, and probably many better ways to do it, but in real-life excel seems to be the best/cost effective solution that doesn't require training or making complex new tools to get the job done.
Found out, that Intellij Idea (at leas in version 7 and 8) has an editor for application resources. But it is not free at all. And it does not scale for bigger resource files with more than 1.000 keys.
Another good choice would be to use Google's spreadsheets ... for those who don't know it - it is like an "online Excell web-application". It can handle concurrent access from multiple users. Yay! But sadly, it comes from Google. This makes it impossible to be used in commercial projects.
So,
still searching...
cheers,
mana

Resources