I'd like to learn how to build a website, say using .NET (MonoRail comes to mind). I'd like a pet project, something that:
Will take a fair yet reasonable amount of time
I can build on my own
Will be actually cool or useful
Hasn't been done to death already (e.g. ... writing a blog engine is not what I'd consider as interesting, although it's technically challenging - it's been done to death and there are so many ready blog platforms today)
Any ideas, stackoverflow?
Have you considered offering your time to a local non-profit organization? You might review their existing mission, website, and other materials and approach them with an idea for something helpful that you could develop for free.
I find that if a project is "real" I'll put more effort into it than into a "toy" project on the side.
Hasn't been done to death already (e.g. ... writing a blog engine is not what I'd consider as interesting, although it's technically challenging - it's been done to death and there are so many ready blog platforms today)
If this is just a learning exercise, why do you care if it's been done to death? More than that, it seems like a blog platform involves a lot of the fundamental skills you'd need to learn anyway to get up to speed on ASP.NET.
You could also try writing a:
messageboard
web-based source-control system.
wiki engine
SO clone
Music/movie management system
Input two celebrities A and A'; output a chain of movies where A appears with B, B appears with C, C appears with D, and D appears with A'. See also: Kevin Bacon.
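If you want to prototype the core of that idea, the path-finding part is just a breadth-first search over an actor-movie bipartite graph. A minimal sketch in Python, with toy data standing in for a real movie database:

```python
from collections import deque

def find_chain(appears_in, start, goal):
    """BFS over the actor-movie graph.

    appears_in: dict mapping actor -> set of movies they appear in.
    Returns a list of (movie, co-star) hops from start to goal,
    or None if no chain exists.
    """
    # Invert the mapping: movie -> actors in it.
    cast_of = {}
    for actor, movies in appears_in.items():
        for movie in movies:
            cast_of.setdefault(movie, set()).add(actor)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        actor, path = queue.popleft()
        if actor == goal:
            return path
        for movie in appears_in.get(actor, ()):
            for costar in cast_of[movie]:
                if costar not in seen:
                    seen.add(costar)
                    queue.append((costar, path + [(movie, costar)]))
    return None

# Toy data; a real site would load this from a movie database.
appears_in = {
    "A": {"Movie 1"}, "B": {"Movie 1", "Movie 2"},
    "C": {"Movie 2", "Movie 3"}, "D": {"Movie 3"},
}
print(find_chain(appears_in, "A", "D"))
# [('Movie 1', 'B'), ('Movie 2', 'C'), ('Movie 3', 'D')]
```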
Start your own internet phenomenon. Lolcats, FML, NotAlwaysRight, GraphJam, Passive Aggressive Notes, FSTDT, FailBlog, Sh*t Bricks, Keyboard Cat, and JapanWTF have already been done. Find a meme and run with it.
Searchable online taxonomy of species
Decentralized usernames (OpenID), avatars (Gravatar), status updates (Twitter), and currently playing music (Last.fm) have already been done. I predict the next big social network phenomenon will extend the phenomenon by decentralizing another staple of social-networking sites, probably a "current mood" or "signature" that follows you from site to site.
photo gallery engine
a website where people post great ideas for a website.
I'd say my answer would be the same as the one I gave to this previous SO question (albeit substitute .Net for PHP).
Many years ago I worked at a DEC shop. We used a tool called Document (as far as I remember) to create documentation. It was provided by DEC and produced the same layout as the original DEC documentation, which is, as far as I'm concerned, a milestone in layout and typesetting.
Researching the web, I found a more or less obscure company that sells this tool for OpenVMS. But I would prefer an open-source replacement.
Any help?
Greetings, Till
Touch Technology was, and perhaps still is, an interesting company with interesting folks like 'Mr Dan'.
They picked up a good bit of Digital software in a fire sale and had some good stuff of their own, such as performance-tuning tools and a 4GL (Intouch... available on OpenVMS Freeware).
The company appears to have moved on, judging by their current website front door, which does not dwell on the old stuff, but you could do worse than trying to contact them.
The back door still lists DECdocument: http://www.ttinet.com/documentation.html
Good luck!
Hein
If you're still looking for a solution, have you thought about LaTeX? The markup syntax isn't radically different from VAX DOCUMENT's SDML. They both have the same back-end; the final steps in processing an SDML file involved running it through TeX.
I would think the best solution would be DocBook, since it is also an SGML-ish format. You might be able to translate a substantial portion using XSL.
I have been brainstorming for an undergraduate project in the Question Answering domain: a project that has components of both IR and NLP.
The first thing that popped up was, of course, factoid question answering, but that seemed to be an already-conquered problem (IBM Watson!).
Non-factoid QA seems interesting, so I took it up. Now we are in the scope-it-out phase of the project description. So, from the ambitious goal of answering any question put up by the user, I need to scope out our project.
So I took the following decisions:
It will be closed-domain - C++ Programming
The corpus will consist of just one website. (cplusplus or wikipedia) or just one document (the complete reference)
We will develop only one module of the entire QA architecture - Passage Retrieval or Answer Extraction.
Our mentor insists on implementing an already existing solution, to start with.
I am stuck at this point, searching for existing implementations. Here is one, but when I read through its environment requirements, they were staggering. There are a lot of libraries and toolkits out there, but I didn't find any non-factoid QA system that was good to go, at least on a very small scale.
Suggest a good scope for the project. I wish to continue working on this through my masters, so what would be a good start? We have about four months for the project, and it is important not to end up doing a research project; it should have a tangible output.
For IR you have Lucene/Solr.
For machine learning and NLP, lots of libraries are available, primarily in Python and Java (at least the user-friendly ones).
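For instance, for the passage-retrieval module, a bare-bones baseline is TF-IDF weighting plus cosine similarity. A sketch using scikit-learn (my choice here, not something prescribed; Lucene/Solr would do the same job at scale):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in the real project these would be passages
# split out of the C++ reference document.
passages = [
    "A pointer stores the memory address of another object.",
    "std::vector is a sequence container that encapsulates dynamic arrays.",
    "Virtual functions enable runtime polymorphism through the vtable.",
]

question = "How do I store a dynamic array in C++?"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(passages)        # passage vectors
query = vectorizer.transform([question])           # question vector
scores = cosine_similarity(query, matrix).ravel()  # one score per passage

# The highest-scoring passage is the retrieval candidate.
best = scores.argmax()
print(scores[best], passages[best])
```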
Implementing Hoifung's system is pretty ambitious, I'd go for something simpler. Have you looked at his code at all?
Something you could find lots of material in is the BioNLP challenges from the last few years, but those are also relatively complicated tasks.
How about Twitter movie-review discovery? I.e., based on X tweets, does this movie suck?
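The core of that idea fits in a few lines. A deliberately naive word-counting sketch (the word lists and tweets are placeholders; pulling real tweets from the Twitter API is left out):

```python
# Naive sentiment tally over a batch of tweets about one movie.
POSITIVE = {"great", "loved", "amazing", "brilliant"}
NEGATIVE = {"sucks", "awful", "boring", "terrible"}

def movie_verdict(tweets):
    score = 0
    for tweet in tweets:
        words = set(tweet.lower().split())
        score += len(words & POSITIVE) - len(words & NEGATIVE)
    return "probably good" if score > 0 else "probably sucks"

tweets = ["Loved it, amazing cast", "that movie sucks", "great soundtrack"]
print(movie_verdict(tweets))  # probably good
```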
As part of a wide-ranging job for a cystic fibrosis support organization, they'd also like a web site set up, and I've decided on Apache running on Linux (mostly due to its security and low cost). Other than (fairly) static content, they also want a forum where people can discuss issues with the condition - it'll be attached to a hospital chain, so there'll be plenty of medical staff there who know little about the web.
I can handle all the specific coding and Apache setup since I've done it before but I'm interested in people's opinions as to whether I should roll my own forum software or get a hold of some ready-built stuff. I've not had any experience with forum software but I could generate my own (initially buggy, I'm sure) in a month or so.
It'll require registration and login to leave comments (but guest access just to read) and I'd like it to be 'pretty' (excuse me while I remember damning customers for providing similarly vague requirements specs :-) but not necessarily infinitely-configurable with skins/themes/etc.
If anyone has some compelling reasons (and experience with specific products that can provide what I need), I'd be interested in hearing about them. Alternatively, does anyone have any 'gotchas' they experienced while coding their own forum software?
Advantages to rolling your own:
a non-standard custom-built system means you'll be less prone to "standard" attacks (e.g.: a vulnerability in PunBB) since bad guys tend to bother with exploit-hunting only on widely-deployed systems (more return on their investment)
absolute control over how your system works and looks
you'll learn a lot
Disadvantages:
you'll repeat mistakes other people have already solved
it'll take you longer to get up and running
long-term it'll be more maintenance (since you have to fix bugs & add features yourself).
you can't "leverage the community" -- if you choose an off-the-shelf forum that has a plugin system then there's a whole bunch of community add-ons that won't be available for your custom forum software.
There's a GIANT list of forum software on wikipedia -- there's most likely something in there that will suit your needs that you can get up and running quickly.
IMHO the old "don't build what you can buy" adage applies to this (well, the web 2.0 version is obviously "don't build what you can download"). Have a look around at the available forum software, pick one that covers 99% of your needs and tweak it to do the rest.
If you still want to build your own forum software that'll probably be a cool side project but if the job is to get a forum up and running, then go and download one - don't try to mix up the desire to do cool stuff and the day job unless the day job is just to do cool stuff only.
One of the best-kept secrets on the internets is a little gem called FUDforum, by Ilia Alshanetsky.
And yes, it's the same Ilia who wrote xDebug's original profiler code, improved the caching in MMcache, fixed several security bugs in libmcrypt, and who was the release manager for the PHP language from 4.3.3 to 4.3.6+. He is, as my friends in Boston would say, wicked smaart.
Because of this, FUDforum is robust, ridiculously fast and more secure than probably any other part of your web application will ever be. It comes with a neat install script and it has all the features you'll need.
Plus, it's not a high-profile target like phpBB or vBulletin, which means you won't have to worry about spambots constantly banging on the gates.
Having written my own forum software before...
It seems like a simple problem, but when you get into it, you find that there's a lot of little things that you'd like to do nicer, and it takes a lot of time. Mine was cool and all, and I did get paid for it, but if I was doing it over again (which has also happened), I'd use a customizable pre-made solution, and spend all my spare time doing something productive. :)
Forum software tends to have rather complex minimum requirements. A few things you are very likely to need no matter what you do:
Forum/thread/post hierarchy;
User system;
Security system (e.g. user/admin classes and all kinds of restrictions for users);
Gathering statistics;
BBCode or some other minimal markup language (NEVER allow users to post full HTML; see the sketch after this list);
File uploads and avatars;
Bans and other punishments;
CAPTCHAs;
etc.
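To make the BBCode point concrete, the safe pattern is: escape everything first, then re-introduce only a whitelist of tags. A minimal sketch (the tag table is illustrative, nowhere near a full BBCode implementation):

```python
import html
import re

# Whitelist of BBCode tags and their HTML equivalents.
TAGS = {"b": "strong", "i": "em", "u": "u"}

def bbcode_to_html(text):
    # Step 1: neutralize ALL raw HTML the user typed.
    safe = html.escape(text)
    # Step 2: re-enable only the whitelisted tags.
    for bb, tag in TAGS.items():
        safe = re.sub(rf"\[{bb}\](.*?)\[/{bb}\]",
                      rf"<{tag}>\1</{tag}>", safe,
                      flags=re.DOTALL | re.IGNORECASE)
    return safe

print(bbcode_to_html('[b]hi[/b] <script>alert(1)</script>'))
# <strong>hi</strong> &lt;script&gt;alert(1)&lt;/script&gt;
```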
Ready-made forum systems provide all this out of the box, and lots more. Setup is mostly easy, too. Why do it all over again yourself?
My answer would be: don't reinvent the wheel, there is plenty of forum software out there. My preference would be RForum, if that's all you need.
I'd say, don't waste your time. phpBB 3 is a pretty stable, usable, and feature-rich forum. We use it at work (for our internal discussions), and I really don't have anything bad to say about it.
I'd concur with most of the above posters that since you want something which appears fairly standard, why reinvent something that already exists?
Like any development, creating forum software is probably much harder than it looks! There will be problems solved in the existing software which you haven't even considered.
It's worth adding that if you do require any specific additional functionality, you can always build that on top of an existing solution anyway, which is especially easy if you have the source code (whether open source or commercial).
From the sounds of the website that you are building, there is the potential for the forum to be a highly useful and visible resource, it would be good to go with something that already exists, due to the quality of a lot of the products out there and the rich communities that surround them.
I think that vBulletin, although a paid for product, would suit your needs and give you a great base to build a community on.
Vanilla is pretty bare-bones and easy to configure. Perhaps find a system that is easy to extend, versus building everything yourself.
Go ready-built, until you need some really unique features that can be tied to the money they will make you.
I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?
I understand it's a huge under-taking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a sub-set of sites that I might be interested in.
There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):
The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc. (See the minimal polite-crawler sketch after this list.)
The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
The indexer. Reads the stuff the parser parsed, and creates inverted indexes into the terms found on the webpages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc. (See the bare-bones indexer sketch after this list.)
The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Just the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc).
The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
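To make the crawler's politeness requirements concrete, here is a minimal sketch that checks robots.txt and paces its requests; a real crawler would also extract links from each page and feed them back into the queue (the user-agent string and seed URLs are placeholders):

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

def polite_crawl(seed_urls, user_agent="MyToyCrawler", delay=2.0, limit=10):
    """Fetch pages while obeying robots.txt and pacing requests."""
    robots = {}   # site root -> parsed robots.txt
    fetched = {}  # url -> raw page bytes
    queue = list(seed_urls)
    while queue and len(fetched) < limit:
        url = queue.pop(0)
        if url in fetched:
            continue
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if root not in robots:
            rp = urllib.robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
            rp.read()
            robots[root] = rp
        if not robots[root].can_fetch(user_agent, url):
            continue  # robots.txt forbids this path
        with urllib.request.urlopen(url) as resp:
            fetched[url] = resp.read()
        # A real crawler would parse links out of the page here
        # and append them to the queue.
        time.sleep(delay)  # don't hammer the host
    return fetched

pages = polite_crawl(["http://example.com/"])
print(list(pages))
```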
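And for the indexer, a bare-bones inverted index with term-frequency scoring; real systems add stemming, positional postings, compression, and much more:

```python
from collections import defaultdict, Counter

class TinyIndex:
    """Inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(Counter)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term][doc_id] += 1

    def search(self, query):
        # Score each document by summed term frequency of the query terms.
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self.postings[term].items():
                scores[doc_id] += tf
        return scores.most_common()

index = TinyIndex()
index.add(1, "apple pie recipe")
index.add(2, "apple computers and the apple logo")
print(index.search("apple"))  # [(2, 2), (1, 1)]
```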
My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).
Some links that may prove useful:
"Agile web-crawler", a paper from Estonia (in English)
Sphinx Search engine, an indexing and search api. Designed for large DBs, but modular and open-ended.
"Information Retrieval, a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!
Xapian is another option for you. I've heard it scales better than some implementations of Lucene.
Check out Nutch; it's written by the same guy who created Lucene (Doug Cutting).
It seems to me that the biggest part is the indexing of sites: making bots to scour the internet and parse their contents.
A friend and I were talking about how amazing Google and other search engines have to be under the hood. Millions of results in under half a second? Crazy. I think they might have preset search results for commonly searched items.
edit:
This site looks rather interesting.
I would start with an existing project, such as the open source search engine from Wikia.
[My understanding is that the Wikia Search project has ended. However I think getting involved with an existing open-source project is a good way to ease into an undertaking of this size.]
http://re.search.wikia.com/about/get_involved.html
If you're interested in learning about the theory behind information retrieval and some of the technical details behind implementing search engines, I can recommend the book Managing Gigabytes by Ian Witten, Alistair Moffat and Tim C. Bell. (Disclosure: Alistair Moffat was my university supervisor.) Although it's a bit dated now (the first edition came out in 1994 and the second in 1999 -- what's so hard about managing gigabytes now?), the underlying theory is still sound and it's a great introduction to both indexing and the use of compression in indexing and retrieval systems.
I'm interested in search engines too. I recommend both Apache Hadoop MapReduce and Apache Lucene; speeding things up with a Hadoop cluster is the best way to go.
There are ports of Lucene. Zend have one freely available. Have a look at this quick tutorial: http://devzone.zend.com/node/view/id/91
Here's a slightly different approach, if you are not so much interested in the programming of it but more interested in the results: consider building it using Google Custom Search Engine API.
Advantages:
Google does all the heavy lifting for you
Familiar UI and behavior for your users
Can have something up and running in minutes
Lots of customization capabilities
Disadvantages:
You're not writing code, so no learning opportunity there
Everything you want to search must be public & in the Google index already
Your result is tied to Google
Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been useful to others, would be very helpful. Also, a partial or downloadable research corpus that isn't too marked-up, or some heuristic for finding an appropriate subset of wikipedia articles, or any other idea, is very appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that wikipedia requests no crawling and provides this export tool instead. Project Gutenberg has a policy specified here, bottom line, try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
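For what it's worth, the "good citizen" part is only a few lines. A sketch that fetches a list of URLs with the 2-second spacing Project Gutenberg asks for (the URLs are placeholders):

```python
import time
import urllib.request

def polite_fetch(urls, delay=2.0):
    """Download each URL, waiting at least `delay` seconds between requests."""
    for url in urls:
        with urllib.request.urlopen(url) as response:
            yield url, response.read()
        time.sleep(delay)

# Placeholder URLs; substitute the real archive pages you are mirroring.
for url, body in polite_fetch(["http://example.com/a", "http://example.com/b"]):
    print(url, len(body))
```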
UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They are some work to clean up, but well worth it, and they contain a lot of useful data in the links.
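For anyone following the same route, the cleanup amounts to reducing MediaWiki markup to plain text. A rough regex-based sketch that handles only the most common constructs (a far cry from a complete parser):

```python
import re

def strip_wiki_markup(text):
    """Very rough reduction of MediaWiki markup to plain text."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]]
    text = re.sub(r"'{2,}", "", text)                              # ''italic'', '''bold'''
    text = re.sub(r"={2,}\s*([^=]*?)\s*={2,}", r"\1", text)        # == headings ==
    return text.strip()

print(strip_wiki_markup("'''C++''' is a [[programming language|language]] {{citation needed}}"))
# C++ is a language
```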
Use the Wikipedia dumps
needs lots of cleanup
See if anything in nltk-data helps you (a loading sketch follows this list)
the corpora are usually quite small
the Wacky people have some free corpora
tagged
you can spider your own corpus using their toolkit
Europarl is free and the basis of pretty much every academic MT system
spoken language, translated
The Reuters Corpora are free of charge, but only available on CD
You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
If you do this commercially, the LDC might be a viable alternative.
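To expand on the nltk-data suggestion above: once NLTK is installed, pulling down and inspecting a corpus takes a few lines. A minimal sketch using the Brown corpus:

```python
import nltk

nltk.download("brown")           # one-time fetch of the corpus data
from nltk.corpus import brown

print(len(brown.words()))        # roughly a million tokens
print(brown.words()[:10])        # first few tokens
print(brown.categories())        # news, fiction, romance, ...
```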
Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.
Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
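A sketch of what that harvesting looks like with the feedparser library (the feed URL is a placeholder, and note that some feeds do embed HTML in their summaries, so a cleanup pass may still be needed):

```python
import feedparser  # third-party: pip install feedparser

def harvest(feed_url):
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        # Prefer full content if the feed provides it, else the summary.
        if "content" in entry:
            yield entry.content[0].value
        else:
            yield entry.get("summary", "")

for text in harvest("http://example.com/feed.rss"):
    print(text[:80])
```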
Scraping mailing lists and/or Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.
The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.
I actually have a small project under construction that allows linguistic processing of arbitrary web pages. It should be ready for use within the next few weeks, but so far it's not really meant to be a scraper. I could write a module for it, I guess; the functionality is already there.
If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.
Wikipedia seems to be the best way. Yes, you'd have to parse the output, but thanks to Wikipedia's categories you could easily get different types of articles and words. E.g., by parsing all the science categories, you could get lots of science words. Details about places would be skewed towards geographic names, etc.
You've covered the obvious ones. The only other areas I can think of to supplement:
1) News articles / blogs.
2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.
Looking into the Wikipedia data, I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible -- it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.
You can get quotations content (in limited form) here:
http://quotationsbook.com/services/
This content also happens to be on Freebase.