Related
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I'm currently faced with a very unusual design problem, and hope that a developer wiser than myself might be able to offer some insight.
Background
Without being too specific, I've been hired by a non-profit organisation to assist with the redevelopment of their legacy, but very valuable (in terms of social value) software. The development team is unlike any I've encountered in my time as a software developer, and is comprised of a small number of developers and a larger group of non-programming domain experts. What makes the arrangement unusual is that the domain experts (lets call them content creators), use custom tooling, some of which is based around a prolog expert system engine, to develop web based software components/forms.
The Problem
The system uses a very awkward postback model to perform logical operations server side and return new forms/results. It is slow, and prone to failure. Simple things, like creating html forms using the existing tooling is much more arduous than it should be. As the demand for a more interactive, and performant experience grows, the software developers are finding increasingly that they have to circumvent the expert system/visual tooling used by the content creators, and write new components by hand in javascript. The content creators feel increasingly that their hands are tied, as they are now unable to contribute new components.
Design approach: Traditional/Typical
I have been advocating for the complete abandonment of the previous model and the adoption of a typical software development process. As mentioned earlier, the project has naturally evolved towards this as the non-programmatic development tooling has become incapable of meeting the needs of the business.
The content creators have a very valuable contribution to make however, and I would like to see them focusing on formally specifying the expected behaviour of the software with tools like Cucumber, instead of being involved in implementation.
Design approach: Non-programmatic
My co-worker, who I respect a great deal and suspect is far more knowledgeable than me, feels that the existing process is fine and that we just need to build better tooling. I can't help but feel however that there is something fundamentally flawed with this approach. I have yet to find one instance, either historical or contemporary where this model of software development has been successful. COBOL was developed with the philosophy of allow business people/domain experts to write applications without the need for a programmer, and to my mind all this did was create a new kind of programmer - the COBOL programmer. If it was possible to develop effective systems allowing non-programmers to create non-trivial applications, surely the demand for programmers would be much lower? The only frameworks that I am aware of that roughly fit this model are SAP's Smart Forms and Microsoft's Dynamix AX - both of which are very domain specific ERP systems.
DSLs, Templating Languages
Something of a compromise between the two concepts would be to implement some kind of DSL as a templating language. I'm not even sure that this would be successful however, as all of the content creators, with one exception, are completely non technical.
I've also considered building a custom IDE based on Visual Studio or Net Beans with graphical/toolbox style tooling.
Thoughts?
Is non-programmatic development a fools errand? Will this always result in something unsatisfactory, requiring hands on development from a programmer?
Many thanks if you've taken the time to read this, and I'd certainly appreciate any feedback.
You say:
Something of a compromise between the
two concepts would be to implement
some kind of DSL as a templating
language. I'm not even sure that this
would be successful however, as all of
the content creators, with one
exception, are completely non
technical.
Honestly, this sounds like exactly the approach I would use. Even "non-technical" users can become proficient (enough) in a simple DSL or templating language to get useful work done.
For example, I do a lot of work with scientific modeling software. Many modelers, while being much more at home with the science than with any form of engineering, have been forced to learn one or more programming languages in order to express their ideas in a way they can use. Heck, as far as I know, Fortran is still a required course in order to get a Meteorology degree, since all the major weather models currently in use are written in Fortran.
As a result, there is a certain community of "scientific programming" which is mostly filled with domain experts with relatively little formal software engineering training, expertise, or even interest. These people are more at home with languages/platforms like Matlab, R, and even Visual Basic (since they can use it to script applications like Excel and ESRI ArcMap). Recently, I've seen Python gaining ground in this space as well, mainly I think because it's relatively easy to learn.
I guess my point is that I see strong parallels between this field and your example. If your domain experts are capable of thinking rigorously about their problems (and this may not be the case, but your question is open-ended enough that it might be) then they are definitely capable of expressing their ideas in an appropriate domain-specific language.
I would start by discussing with the content creators some ideas about how they would like to express their decisions and choices. My guess would be that they would be happy to write "code" (even if you don't have to call it code) to describe what they want. Give them a "debugger" (a tool to interactively explore the consequences of their "code" changes) and some nice "IDE" support application, and I think you'll have a very workable solution.
Think of spreadsheets.
Spreadsheets are a simple system that allows non-technical users to make use of a computer's calculation abilities. In doing so, they have opened up computers to solve a great number of tasks which normally would have required custom software developed to solve them. So, yes non-programmatic software development is possible.
On the other hand, look at spreadsheets. Despite their calculational abilities you really would not want as a programmer to have to develop software with them. In the end, many of the techniques that make programming languages better for programmers make them worse for the general population. The ability to define a function, for example, makes a programmer's life much easier, but I think would confuse most others.
Additionally, past a certain point of complexity trying to use a spreadsheet would be a real pain. The spreadsheet works well within the realm for which it was designed. Once you stray too far out that, its just not workable. And again, its the very tools programmers use to deal with complexity which will prevent a system being both widely usable and sufficiently powerful.
I think that for any given problem area, you could develop a system that allows the experts to specify a solution. It will be much harder to develop that system then to solve the problem in the first place. However, if you repeatedly have similiar problems which the experts can develop solutions for, then it might be worthwhile.
I think development by non-developers is doomed to failure. It's difficult enough when developers try it. What's the going failure rate? 50% or higher?
My advice would be to either buy the closest commercial product you can find or hire somebody to help you develop a custom solution with your non-developer maintenance characteristics in mind.
Being a developer means keeping a million details in mind at once and caring about details like version control, deployment, testing, etc. Most people who don't care about those things quickly tire of the complexity.
By all means involve the domain experts. But don't saddle them with development and maintenance as well.
You could be putting your organization at risk with a poorly done solution. If it's important, do it right.
I don't believe any extensive non-programmer solution is going to work. Programming is more than language, it's knowing how to do things reasonably. Something designed to be non-programmer friendly will still almost certainly contain all the pitfalls a programmer knows to avoid even if it's expressed in English or a GUI.
I think what's needed in a case like this is to have the content creators worry about making content and an actual programmer translate that into reasonable computer code.
I have worked with two ERP systems that were meant for non-programmers and in both cases you could make just about every mistake in the book with them.
... Simple things, like creating html forms using the existing tooling is much more arduous than it should be...
More arduous for whom? You're taking a development model that works (however badly) for the non-programming content creators, and because something is arduous for someone you propose to replace that with a model where the content creators are left out in the cold entirely? Sounds crazy to me.
If your content creators can learn custom tooling built around a Prolog rules engine, then they have shown they can learn enough formalism to contribute to the project. If you think other aspects of the development need to be changed, I see only two honorable choices:
Implement the existing formalism ("custom tooling") using the new technology that you think will make things better in other ways. The content creators contribute exactly as they do now.
Design and implement a domain-specific language that handles the impedance mismatch between what your content creators know and can do and the way you and other developers think the work should be done.
Your scenario is a classic case where a domain-specific language is appropriate. But language design is not easy, especially when combined with serious usability questions. If you are lucky you will be able to hire someone to help you who is expert in both language design and usability. But if you are nonprofit, you probably don't have the budget. In this case one possibility is to look for help from another nonprofit—a nearby university, if you have one.
I'd advise you to read this article before attempting to scrap the whole system. I look at it this way. What changed to prompt the redevelopment? Your domain experts haven't forgotten how to use the original system, so you already have some competent "COBOL programmers" for your domain. From your description, it sounds like mostly the performance requirements have changed, and possibly a greater need for web forms.
Therefore, the desired solution isn't to change the responsibilities of the domain experts, it's to increase the performance and make it easier to create web forms. You have the advantage of an existing code base showing exactly what your domain experts are capable of. It would be a real shame not to use it.
I realize Prolog isn't exactly the hottest language around, but there are faster and slower implementations. Some implementations are designed mostly for programmer interactivity and are dynamically interpreted. Some implementations create optimized compiled native code. There are also complex logical programming techniques like memoization that can be used to enhance performance, but probably no one learns them in school anymore. A flow where content creators focus on creating new content and developers focus on optimization could be very workable. Also, Prolog is ideally suited for the model layer, but not so much for the view layer. Moving more of your view layer to a different technology could really pay off.
In general though, 2 thoughts:
You cannot reduce life to algorithms. Everything we know (philosophically, scientifically) and experience demonstrates this. (Sorry, Dr. Minsky).
That said, a Domain-Specific tool that allows non-programmers to express a finite language is definitely doable as several people have given examples. Another example of this type of system is Mathematica and especially Simulink which are used very successfuly over a range of applications. However, the failure of Expert Systems, Fuzzy logic, and Japan's Fifth Generation computer project of the 80's to take-off demonstrate the difficulty in doing this.
Labview is a very successful none programatic programming environment.
What an interesting problem.
I would have to ultimately agree with you, and disagree with your colleague.
The philosophy and approach of Domain Driven Development/Design exists exactly for your purposes, in that it puts paramount importance on the specific knowledge of the experienced domain experts, and on communicating that knowledge to talented software developers.
See, in your issue, there are two distinct things. The domain, and the software. The domain should be understood and specified first and foremost without software development in mind.
And then the transformation to software happens between the communication between domain experts and programmers.
I think trying to build "programming" tools for domain experts is a waste of time.
In Domain Driven Development your domain experts will continue to be important, and you'll end up with better software.
In your colleague's approach you're trying to replace programmers with tool.... maybe in the future, like, start trek future, that will be possible, but today I don't think so.
I am currently struggling with a similar problem in trying to enable healthcare providers to write rules for workflows, which isn't easy because they aren't programmers. You're a programmer not because you went to programming school -- you're a programmer because you think like a programmer. Fortunately, most hospitals have some anesthesiologist or biomedical engineer who thinks like a programmer and can manage to program. The key is to give the non-programmers-who-think-like-programmers a language that they can learn and master.
In my case, I want doctors to be able to formulate simple rules, such as: "If a patient's temperature gets too low, send their doctor a text message". Of course it's more complicated than that because the definition of "too low" depends on the age of the patient, a patient may have many doctors, and so on, but a smart doctor will be able to figure out those rules. The real issue is that the temperature sensor will often fall off the patient and read ambient temperature, meaning that the rule is useless unless you can figure out how to determine that the thermometer is actually reading the patient's temperature.
The big problem I have is that, while it's relatively easy to create a DSL so that doctors can express IF [temperature] [less-than] [lookup-table [age]] THEN [send-text-message], it's much harder to create one that can express all the different heuristics you might need to try before coming up with the right way to make sure the reading is valid.
In your case, you may want to consider how VB became popular: It has a form drawing tool that anybody can easily use to draw forms and set properties on form items. Since not everything can be specified by form properties and data binding, there's a code-behind mode that lets you do complex logic. But to make the tool accessible to beginning programmers, the language is BASIC, so users didn't have to learn about pointers, memory allocation, or data structures.
While you probably wouldn't want to give your users VB, you might consider a hybrid approach. You would have one "language" (it could be graphical, like VB's form designer or Labview) where inexperienced users can easily do the simple stuff, and another language to enable expert users to do the complicated stuff.
I had this as a comment previously but I figured it deserved more merit.
There are definitely a number of successful 'non-programmatic' tools around, off hand I can think of Labview, VPL and graph based (edit: I just noticed this link has more far more than just shaders on it) shaders which are prevalent in 3D applications.
Having said that, I don't know of any which are suited to web based dev (which appears to be your case).
I dout very much the investment on developing such tools would be worth it (unless maybe you could move to sell it as a product as well).
I agree with you - non-technical people will not be able to program anything non-trivial.
Some products try to create what's basically a really simple programming language. The problem is that programming is an aptitude as well as a skill. It takes a certain kind of mindset to think in the sort of logic used by computers, which just can't be abstracted away by any programming language (at least not without without making assumptions that it can't safely make).
I've seen this in action with business people trying to construct workflows in MS Dynamics CRM. Even though the product was clearly intended to allow them to do it without a programmer they just couldn't figure out how to make a loop or an if-else condition work, no matter how "friendly" the UI tried to make it. I watched in amazement as they struggled with something that seemed completely self-obvious to me. After a full day (!) of this they managed to produce a couple of very basic workflows that worked in some cases, but didn't handle edge cases like missing values or invalid data. It was basically a complete waste of time.
Granted, Dynamics CRM isn't exactly the epitome of user-friendliness, but I saw enough to convince me that this is, indeed, a fool's errand.
Now, if your users are not programmers, but still technical people they might be able to learn programming, but that's another story - they've really become "new programmers" rather than "non-programmers" then.
This is a pretty philosophical topic and difficult to answer for every case, but in general...
Is non-programmatic development a fools errand?
Outside of a very narrow scope, yes. Major software vendors have invested billions over the years in creating various packages to try to let non-technical users create & define workflows and processes with limited success. Your best bet is to take advantage of what has been done in that space rather than trying to re-invent it.
Edit:
Sharepoint, InfoPath, some SAP stuff are the examples I'm talking about. As I said above, "a very narrow scope". It's possible to let non-programmers create workflows, complex forms, some domain processes, but that's it. Anything more general-purpose is simply trying to make non-programmers into programmers by giving them very crude tools.
Non programatic software development IS feasable, as long as you are realistic about what non-developers can reasonaby achieve - its all about a compromise between capability and ease of use.
The key is to break the requirements down cleanly into things that the domain experts need to be able to do, and spend time implementing those features in a foolproof way - The classic mistake is to try to let the system to do too much.
For example suppose you want domain experts to be able to create a form with a masked text input:
Most developers will look at that requirement and create a fancy control which accepts some sort of regular expression and lets the domain expert do anything.
This is the classic developer way of looking at things, however it's likely that your domain expert does not understand regular expressions and the developer has missed the point of the reqirement which was for the domain expert to be able to create this form.
A better solution might have been to create a control that can be confgiured to mask either Email addresses or telephone numbers.
Yes this control is far less capable than the first control, and yes the domain experts have to ask developers to extend it when they want to be able to mask to car registration numbers, however the domain experts are able to use it.
It seems that the problem is of organizational nature and cannot be solved by technical means alone.
The root is that content creators are completely non-technical, yet have to perform inherently technical tasks of designing forms and writing Prolog rules. Various designers and DSLs can alleviate their problems, but never solve them.
Either reorganize system and processes so knowledge carriers actually enter knowledge - nothing else; or train content creators to perform necessary development with existing tools or may be DSLs.
Non-programmatic development can save from low-level chores, but striving to set up system once and let users indefinitely and unrestrictedly expand it is certainly a fools errands.
Computer games companies operate like this all the time, so far as I can tell: a few programmers and a lot of content creators who need to be able to control logic as well (like level designers).
It's also probably a healthy discipline to be able to separate your code from the data and rules driving it if you can.
I'm therefore with your colleague, but of course the specifics may not make this general solution appropriate!
Can anyone give me an example of successful non-programmer, 5GL (not that I am sure what they are!), visual, 0 source code or similar tools that business users or analysts can use to create applications?
I don’t believe there are and I would like to be proven wrong.
At the company that I work at, we have developed in-house MVC that we use to develop web applications. It is basically a reduced state-machine written in XML (à la Spring WebFlow) for controller and a simple template based engine for presentation. Some of the benefits:
dynamic nature: no need to recompile to see the changes
reduced “semantic load”: basically, actions in controller know only “If”. Therefore, it is easy to train someone to develop apps.
The current trend in the company (or at least at management level) is to try to produce tools for the platform that require 0 source code, are visual etc. It has a good effect on clients (or at least at management level) since:
they can be convinced that this way they will need no programmers or at least will be able to hire end-of-the-lather programmers that cost much less than typical programmers.
It appears that there is a reduced risk involved, since the tool limits the implementer or user (just don’t use the word programmer!) in what he can do, so there is a less chance that he can introduce error
It appears to simplify the whole problem since there seems to be no programming involved (notoriously complex). Since applications load dynamically, there is less complexity then typically associated with J2EE lifecycle: compile, package, deploy etc.
I am personally skeptic that something like this can be achieved. Solution we have today has a number of problems:
Implementers write JavaScript code to enrich pages (could be solved by developing widgets). Albeit client-side, still a code that can become very complex and result in some difficult bugs.
There is already a visual tool, but implementers prefer editing XML since it is quicker and easier. For comparison, I guess not many use Eclipse Spring WebFlow plug-in to edit flow XML.
There is a very poor reuse in the solution (based on copy-paste of XML). This hampers productivity and some other aspects, like fostering business knowledge.
There have been numerous performance and other issues based on incorrect use of the tools. No matter how reduced the playfield, there is always space for error.
While the platform is probably more productive than Struts, I doubt it is more productive than today’s RAD web frameworks like RoR or Grails.
Verbosity
Historically, there have been numerous failures in this direction. The idea of programs written by non-programmers is old but AFAIK never successful. At certain level, anything but the power of source code becomes irreplaceable.
Today, there is a lot of talk about DSLs, but not as something that non-programmers should write, more like something they could read.
It seems to me that the direction company is taking in this respect is a dead-end. What do you think?
EDIT: It is worth noting (and that's where some of insipiration is coming from) that many big players are experimenting in that direction. See Microsoft Popfly, Google Sites, iRise, many Mashup solutions etc.
Yes, it's a dead end. The problem is simple: no matter how simple you make the expression of a solution, you still have to analyze and understand the problem to be solved. That's about 80-90% of how (most good) programmers spend their time, and it's the part that takes the real skill and thinking. Yes, once you've decided what to do, there's some skill involved in figuring out how to do that (in a programming language of your choice). In most cases, that's a small part of the problem, and the least open to things like schedule slippage, cost overrun or outright failure.
Most serious problems in software projects occur at a much earlier stage, in the part where you're simply trying to figure out what the system should do, what users must/should/may do which things, what problems the system will (and won't) attempt to solve, and so on. Those are the hard problems, and changing the environment to expressing the solution in some way other that source code will do precisely nothing to help any of those difficult problems.
For a more complete treatise on the subject, you might want to read No Silver Bullet - Essence and Accident in Software Engineering, by Frederick Brooks (Included in the 20th Anniversary Edition of The Mythical Man-Month). The entire paper is about essentially this question: how much of the effort involved in software engineering is essential, and how much is an accidental result of the tools, environments, programming languages, etc., that we use. His conclusion was that no technology was available that gave any reasonable hope of improving productivity by as much as one order of magnitude.
Not to question the decision to use 5GLs, etc, but programming is hard.
John Skeet - Programming is Hard
Coding Horror - Programming is Hard
5GLs have been considered a dead-end for a while now.
I'm thinking of the family of products that include Ms Access, Excel, Clarion for DOS, etc. Where you can make applications with 0 source code and no programmers. Not that they are capable of AI quality operations, but they can make very usable applications.
There will always be "real" languages to do the work, but we can drag and drop the workflow.
I'm using Apple's Automator which allows users to chain together "Actions" exposed by the various applications on their systems.
Actions have inputs and/or outputs, some have UI elements and basic logic can be applied to the chain.
The key difference between automator
and other visual environments is that
the actions use existing application
code and don't require any special
installation.
More Info > http://www.macosxautomation.com/automator/
I've used it to "automate" many batch processes and had really great results (surprises me every time). I've got it running builds and backups and whenever i need to process a mess of text files it comes through.
I would love to know whether iHook or Platypus (osx wrapper builders for shell scripts) could let me develop plugins in python ....
There is definitely room for more applications like this and for more support from OSX application developers but the idea is sound.
Until there's major support there aren't many "actions" available, but a quick check on my system just showed me an extra 30 that i didn't know i had.
PS. There was another app for OS-preX called "Filter Tops" which had a much more limited set of plugins.
How about Dabble DB?
Of course, just like MS Access and other non-programmer programming platforms, it has some necessary limitations in order that the user won't get him or herself stuck... as John pointed out programming is hard. But it does give the user a lot of power, and it seems that most applications that non-programmers want to build are database-type applications anyway.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
We are an all Unix shop (Solaris, Linux). This last product cycle I returned to a project lead capacity, and needed to produce a schedule. I asked what tools my managers would accept, and was surprised to hear "text files". My teammate and I gamely tried this, and probably worse, HTML tables, to track the tasks we wanted to size. It was pretty painful.
We then tried a few tools. MrProject is buggy, limited and crashes too frequently. My manager swears that Microsoft Project is inflexible. Whenever they needed to change a task, reassign a resource or rebalance, it generally hosed their plan. So I started looking around on the Internet for a Linux-capable project planning tool. One that sounded interesting is TaskJuggler. It's neat in that the inputs are declarative files. I feel like I'm building a makefile for a project.
However. I have a limited amount of time to devote to evaluating this tool and it seems pretty complex. Before diving into the next product cycle, I'd like to know if TaskJuggler is robust enough, flexible and capable of handling multi-month, multiple resource projects with frequent changes. So I'm calling on all engineers who have had experience with this tool to share their insights. Thanks!
There is nothing free in project management, and managing a complex project with software is inevitably complex. The real question is, does the chosen tool help with this?
Task Juggler has a learning curve, and in the end is suitable for someone who doesn't mind reading the manual (an absolute necessity for this tool) and isn't tied to graphical input. Task Juggler requires that you think about your project and structure it in a meaningful way. It is helpful if you do a diagram in advance (many TJ users make mind maps and there is a tool out there somewhere to generate TJ input statements from a FreeMind mind map). It is also very helpful if you organize your input file in some meaningful way, making things easy to find.
That said, once you get going, creating a project with TJ is super fast. You don't need to bother with a million dialog boxes, you just tell TJ what you want in TJ's text language.
But all of that aside, what I like about TJ (and hated at first, coming from a legacy of other more traditional tools) is that it ensures that your schedule makes sense. OpenProj happily schedules resources at 300% and more. TJ will give you an error and make you fix it. Yes, it's annoying. But the end result is that you have a project schedule that makes sense and can actually be executed. Imagine that!
As I started out by saying, nothing's free. TJ requires study and some effort. The reward is rich and copious reporting, all the information you need to manage your project to cost and schedule, and the enforcement of a logical, reliable approach to scheduling and resource allocation. And it doesn't cost $499 or whatever MSP goes for --- it's free.
I have been using taskjuggler for the past 4/5 years now (4 projects with an average duration of a year or more). I find it very useful to create my initial estimates of
how long the project will take
When will each resource group be freed.
What if we added more resources with varying level of experience and efficiency to different domains of the project.
Typically the kind of stuff that upper management will ask you about your schedule can be generated much faster and at a more accurate granularity compared to doing something similar using MS Project or other GUI based tools.
Till recently I was using taskjuggler to get my initial schedule and using ms excel to track the project.
This is the first time I am using task juggler to actually track the project on a weekly basis. and so far the results look good.
TaskJuggler's syntax is rather easy, but do take your time to read the documentation.
My experience with TJ:
very powerful and expressive syntax
useful for detailed calculation of large projects
However in reality a manual planning takes into account many implicit constraints, which TJ requires to be made explicit in order to obtain realistic scenario's. This is of course true for every planning tool, but I found it rather cumbersome to add and edit manual constraints in large projects in TJ... Therefore I found it less suited for project tracking and rescheduling afterwards.
I now use OmniPlanner, which is a far simpler tool than TJ and MSProject but turns out to suit my needs (especially in tracking, analysis and reporting).
I am using Taskjuggler to develop a very detailed task manager for big movie productions. It's brilliant cause of it's syntax and csv outputs.
I have been using it for 1w and love it.
The acceptance test, so to speak, is if you find text/coding more expressive than UI based input.
If you do feel comfortable expressing your thinking in a structured language but prefer/expect UI then do not spend time with TaskJuggler.
See http://www.pegasoft.ca/coder/coder_july_2008.html for these remarks like
"Don't expect a nice user interface with an "Add Task" button here."
"Even reports must be designed in it's awkward, C-like language"
If that is how you think then do not spend time with TaskJuggler.
TaskJuggler is (almost) a DSL for planning.
If you do not know what a DSL is then do not spend time with TaskJuggler.
Or learn about DSLs. :-)
For the rest, try it because it might just put planning in your own hands and take it away from the hands of people who demand it from you only to ask for status.
I think there is a wealth of natural language data associated with sites like reddit or digg or news.google.com.
I have done a little bit of research with text mining, but can't find how I could use those tools to parse something like reddit.
What kind of applications can you come up with?
I have found in the past that the best way to mine data on sites like Reddit or Digg is to first use the developer API that they provide. Typically you have a focused interest in either a topic or trend, and the only way to get that data is through an established public interface. You can also parse feeds, and combine them both to uncover 90% of what you would want to know. If you want to do deep research on data not available through an API, then you should be prepared to spend a significant amount of time writing custom wrappers around a tool like cURL. If you have the budget you can also call them and ask if they offer paid research data on users.
I'd start on the RSS, and after that I might use Nutch; what to actually do with the data is more your call.
These are good ideas. I can get the data, but what applications can be built around it?
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
I've always been interested in developing a web search engine. What's a good place to start? I've heard of Lucene, but I'm not a big Java guy. Any other good resources or open source projects?
I understand it's a huge under-taking, but that's part of the appeal. I'm not looking to create the next Google, just something I can use to search a sub-set of sites that I might be interested in.
There are several parts to a search engine. Broadly speaking, in a hopelessly general manner (folks, feel free to edit if you feel you can add better descriptions, links, etc):
The crawler. This is the part that goes through the web, grabs the pages, and stores information about them into some central data store. In addition to the text itself, you will want things like the time you accessed it, etc. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, etc.
The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs to, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
The indexer. Reads the stuff the parser parsed, and creates inverted indexes into the terms found on the webpages. It can be as smart as you want it to be -- apply NLP techniques to make indexes of concepts, cross-link things, throw in synonyms, etc.
The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Jut the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other pieces you want to look at, and create some scores. This may be done completely on the fly (that's really hard), or based on some pre-computed notions of "experts" (see PageRank, etc).
The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, etc. It has its own set of problems.
My advice -- choose which of these interests you the most, download Lucene or Xapian or any other open source project out there, pull out the bit that does one of the above tasks, and try to replace it. Hopefully, with something better :-).
Some links that may prove useful:
"Agile web-crawler", a paper from Estonia (in English)
Sphinx Search engine, an indexing and search api. Designed for large DBs, but modular and open-ended.
"Information Retrieval, a textbook about IR from Manning et al. Good overview of how the indexes are built, various issues that come up, as well as some discussion of crawling, etc. Free online version (for now)!
Xapian is another option for you. I've heard it scales better than some implementations of Lucene.
Check out nutch, it's written by the same guy that created Lucene (Doug Cutting).
It seems to me that the biggest part is the indexing of sites. Making bots to scour the internet and parse their contents.
A friend and I were talking about how amazing Google and other search engines have to be under the hood. Millions of results in under half a second? Crazy. I think that they might have preset search results for commonly searched items.
edit:
This site looks rather interesting.
I would start with an existing project, such as the open source search engine from Wikia.
[My understanding is that the Wikia Search project has ended. However I think getting involved with an existing open-source project is a good way to ease into an undertaking of this size.]
http://re.search.wikia.com/about/get_involved.html
If you're interested in learning about the theory behind information retrieval and some of the technical details behind implementing search engines, I can recommend the book Managing Gigabytes by Ian Witten, Alistair Moffat and Tim C. Bell. (Disclosure: Alistair Moffat was my university supervisor.) Although it's a bit dated now (the first edition came out in 1994 and the second in 1999 -- what's so hard about managing gigabytes now?), the underlying theory is still sound and it's a great introduction to both indexing and the use of compression in indexing and retrieval systems.
I'm interested in Search Engine too. I recommended both Apache Hadoop MapReduce and Apache Lucene. Getting faster by Hadoop Cluster is the best way.
There are ports of Lucene. Zend have one freely available. Have a look at this quick tutorial: http://devzone.zend.com/node/view/id/91
Here's a slightly different approach, if you are not so much interested in the programming of it but more interested in the results: consider building it using Google Custom Search Engine API.
Advantages:
Google does all the heavy lifting for you
Familiar UI and behavior for your users
Can have something up and running in minutes
Lots of customization capabilities
Disadvantages:
You're not writing code, so no learning opportunity there
Everything you want to search must be public & in the Google index already
Your result is tied to Google