How to represent a duplicate data store in a data flow diagram (DFD) using Yourdon DeMarco notation?

While searching for the answer, I found that in the Gane & Sarson notation, duplicate data stores are indicated by adding an extra line. Another website suggested adding an asterisk next to the data store name. However, I was unable to find any answers specific to the Yourdon DeMarco notation.
Is there a generally accepted way to represent duplicates in the Yourdon DeMarco notation? Or should I adopt one of the methods stated above?

I asked this question in StackExchange and got an answer.
To summarise the answer, the Yourdon DeMarco notation avoids duplication where possible, because duplicates make the flow harder to follow visually; this is the practice to adopt whenever you can.
However, in cases where duplication cannot be avoided, there are two ways to represent it:
Make no graphical difference between the duplicate stores. This was the approach used in the Yourdon & DeMarco book.
Put a superscript asterisk after the store name, and add a legend explaining the asterisk. This representation can be used if attention needs to be drawn to the duplication.


How to extract categories out of short text documents?

My data contains the answers to the open-ended question: what are the reasons for recommending the organization you work for?
I want to use an algorithm / technique that learns from this data the categories (i.e. the reasons) that occur most frequently, and that can then automatically place a new answer to this question in one of these categories.
I initially thought of topic modeling (for example LDA), but the text documents in this problem are very short (mostly between 1 and 10 words per document). Is topic modeling an appropriate method here? Or are there other models that are suitable for this? Perhaps a clustering method?
Note: the text is in Dutch
No, clustering will work even worse.
It can't do magic.
You'll need to put in additional information, such as labels, to solve this problem: use classification.
Find the most common terms that clearly indicate one reason or another and begin labeling posts.
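As a rough illustration of that last step, here is a minimal sketch in plain Python. The Dutch answers, the keyword map and the function names are all made up for the example; in practice you would read the common terms off your own data and choose the keywords by hand.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def most_common_terms(answers, n=10):
    """Count in how many answers each term occurs and return the top n."""
    counts = Counter()
    for answer in answers:
        counts.update(set(tokenize(answer)))
    return counts.most_common(n)

def label_by_keywords(answer, keyword_to_label):
    """Return every label whose keyword occurs in the answer (possibly none)."""
    tokens = set(tokenize(answer))
    return sorted({label for kw, label in keyword_to_label.items() if kw in tokens})

# Hypothetical answers and a hand-made keyword map (assumptions for this sketch).
answers = ["goede collega's", "leuke collega's en goede sfeer",
           "goed salaris", "salaris en sfeer"]
keyword_to_label = {"collega": "colleagues", "sfeer": "atmosphere", "salaris": "pay"}

print(most_common_terms(answers))
for answer in answers:
    print(answer, "->", label_by_keywords(answer, keyword_to_label))
```

Answers that get no label are the ones to inspect by hand next; the labeled ones can serve as training data for a classifier.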

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines), and the text is usually badly written and contains many spelling mistakes.
So I must go through some text cleaning and vectorizing. To do so, I considered two approaches:
First one:
cleaning text by replacing bad words using the hunspell package, which is a spell checker and morphological analyzer
convert sentences to vectors using tf-idf
The problem here is that Hunspell sometimes fails to provide the correct word and replaces the misspelled word with another word that doesn't have the same meaning. Furthermore, Hunspell does not recognize acronyms or abbreviations (which are very important in my case) and tends to replace them.
Second approach:
using some embedding method (like word2vec) to convert words into vectors without cleaning the text
I need to know if there is some (theoretical or empirical) way to compare these two approaches :)
Please do not hesitate to respond if you have any ideas to share; I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in a longer form and give you a bit more commentary. Not sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point out a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work out, for example, is US vs. UK spelling, or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should do that) is that you are not looking to build a machine learning model here. You could consider the word embeddings you generate a sort of machine learning model, but they are not. They are just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using hunspell introduces new mistakes. That will no doubt also be the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task, as @lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly we use another task which is dependent on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It is also a chance for you to see if they do more harm than good by comparing to the baseline of "dirty" data. Remember that it's about relative differences and the actual performance of the task is of no importance.
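To make the idea concrete, here is a minimal sketch of such an external evaluation in plain Python. Everything in it is an assumption for illustration: the toy documents and labels are invented, the FIXES table merely stands in for a spell checker such as hunspell, and the word-overlap classifier is deliberately crude. What matters is the last line, which compares the two preprocessing variants on the same downstream task.

```python
from collections import Counter, defaultdict

def train(docs, preprocess):
    """Count, per label, how often each preprocessed token occurs."""
    counts = defaultdict(Counter)
    for text, label in docs:
        counts[label].update(preprocess(text))
    return counts

def predict(counts, text, preprocess):
    """Pick the label whose training tokens overlap most with the text.
    Ties go to the alphabetically first label."""
    tokens = preprocess(text)
    return max(sorted(counts), key=lambda label: sum(counts[label][t] for t in tokens))

def accuracy(train_docs, test_docs, preprocess):
    counts = train(train_docs, preprocess)
    hits = sum(predict(counts, text, preprocess) == label for text, label in test_docs)
    return hits / len(test_docs)

def dirty(text):
    """Baseline: no cleaning, just lowercase and split."""
    return text.lower().split()

# Hypothetical correction table standing in for a real spell checker.
FIXES = {"delvery": "delivery", "refnd": "refund"}

def cleaned(text):
    return [FIXES.get(token, token) for token in dirty(text)]

# Invented labeled documents (assumption for this sketch).
train_docs = [("late delivery again", "shipping"), ("delivery never arrived", "shipping"),
              ("please refund me", "billing"), ("refund not received", "billing")]
test_docs = [("delvery was late", "shipping"), ("refnd please", "billing"),
             ("my delvery is missing", "shipping")]

baseline = accuracy(train_docs, test_docs, dirty)
with_cleaning = accuracy(train_docs, test_docs, cleaned)
print(f"dirty={baseline:.2f} cleaned={with_cleaning:.2f} "
      f"difference={with_cleaning - baseline:+.2f}")
```

The same harness works for an embedding-based representation: swap in another preprocess/classifier pair and compare against the same "dirty" baseline.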

How do search engines implement keyword suggestion?

This is a very important programming question (and a popular interview question), yet I wasn't able to find any direct answer on the internet.
All I know is that they are:
Based on real searches
Different for different languages and regions
Based on your search history
Duplicates removed
Sorted by most relevant and popular
So we're looking for a data structure with these properties:
Easily sortable
Elements are unique
The word you type in is somehow connected to the suggestions
My first guess would be: it's a graph.
Another option is an adjacency list. Each element contains a linked list of suggestions etc.
Does anyone know how it's really done?
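For reference, the adjacency-list guess above could be sketched like this in Python: every typed prefix maps to a short list of unique past searches, sorted by popularity. This only illustrates the properties listed in the question; it is not a claim about how real search engines do it, and the sample searches and the build_index name are invented.

```python
from collections import Counter, defaultdict

def build_index(searches, max_suggestions=3):
    """Map every prefix to its most frequent full queries (unique, most popular first)."""
    popularity = Counter(query.strip().lower() for query in searches)
    by_prefix = defaultdict(list)
    for query in popularity:  # Counter keys are already unique
        for end in range(1, len(query) + 1):
            by_prefix[query[:end]].append(query)
    return {
        prefix: sorted(queries, key=lambda q: (-popularity[q], q))[:max_suggestions]
        for prefix, queries in by_prefix.items()
    }

# Invented search log (assumption for this sketch).
searches = ["python", "python list", "pyramid", "python", "Python list", "python"]
index = build_index(searches)
print(index["py"])
print(index["pyr"])
```

Storing every prefix is wasteful for a large log, and this sketch ignores language, region and personal history entirely.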

Cross Referencing Databases on Fuzzy Data

I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
Some combination of the above
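To make the two building blocks concrete, here is a minimal Python sketch of Levenshtein distance and of American Soundex coding, plus a naive closest-match lookup. It is only an illustration written for this post; it says nothing about what Lucene or Sphinx do internally, and the closest helper and the sample names are invented.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(name):
    """American Soundex: first letter plus three digits, e.g. Robert -> R163."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]

def closest(name, canonical):
    """Hypothetical helper: the canonical name with the smallest edit distance."""
    return min(canonical, key=lambda c: levenshtein(name.lower(), c.lower()))

print(levenshtein("kitten", "sitting"))
print(soundex("Robert"), soundex("Rupert"))
print(closest("Jon Smith", ["John Smith", "Mary Jones"]))
```

A linear scan like closest is fine for an offline job where accuracy matters more than speed; Soundex codes can be used to narrow the candidates first.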
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
First, I'd add to your list the techniques discussed at Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at a combination, using testing to establish the coefficients for combining the results of the various techniques.
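A minimal sketch of that third point, assuming just two similarity measures and a single coefficient: a character-level score from Python's difflib and a word-level Jaccard score, blended by a weight that is picked by testing against a small hand-labeled sample. The sample names, the candidate weights and the function names are all invented for illustration.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_similarity(a, b):
    """Jaccard overlap of the words in both names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def combined(a, b, weight):
    """Blend the two scores; weight is the coefficient to be tuned."""
    return weight * char_similarity(a, b) + (1 - weight) * token_similarity(a, b)

def best_match(name, canonical, weight):
    return max(canonical, key=lambda c: combined(name, c, weight))

def tune_weight(labeled_pairs, canonical, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the candidate weight that gets the most hand-labeled pairs right."""
    def hits(weight):
        return sum(best_match(raw, canonical, weight) == truth
                   for raw, truth in labeled_pairs)
    return max(candidates, key=hits)

# Invented canonical list and hand-labeled sample (assumptions for this sketch).
canonical = ["Main Street", "Oak Avenue", "Maple Drive"]
labeled_pairs = [("main st", "Main Street"), ("oak ave.", "Oak Avenue")]

weight = tune_weight(labeled_pairs, canonical)
print(weight, best_match("maple dr", canonical, weight))
```

With more techniques (Soundex, edit distance, an abbreviation table) the single weight becomes a vector of coefficients, and the labeled sample should be split so the weights are tuned on one part and checked on another.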

How to categorize and tabularize free-form answers to a question in a survey?

I want to analyze answers to a web survey (Git User's Survey 2008 if one is interested). Some of the questions were free-form questions, like "How did you hear about Git?". With more than 3,000 replies, analyzing them entirely by hand is out of the question (especially since there are quite a few free-form questions in this survey).
How can I group those replies (probably based on the key words used in response) into categories at least semi-automatically (i.e. program can ask for confirmation), and later how to tabularize (count number of entries in each category) those free-form replies (answers)? One answer can belong to more than one category, although for simplicity one can assume that categories are orthogonal / exclusive.
What I'd like to know is at least a keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).
Possible solution No 1. (partial): Bayesian categorization
(added 2009-05-21)
One solution I thought about would be to use something like the algorithm (and the mathematical method behind it) for Bayesian spam filtering, only instead of one or two categories ("spam" and "ham") there would be more, and the categories themselves would be created adaptively / interactively.
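A minimal sketch of that idea, in Python rather than Perl for brevity: a multinomial naive Bayes classifier with add-one smoothing, where a category comes into existence the first time a reply is filed under it. The class name, the category names and the sample replies are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multi-category naive Bayes; categories are created on first use."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of replies
        self.vocabulary = set()

    def train(self, text, category):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        if not self.doc_counts:
            return None
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[category].values())
            score = math.log(n_docs / total_docs)  # prior
            for word in words:
                # add-one (Laplace) smoothing so unseen words do not zero the score
                count = self.word_counts[category][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            scores[category] = score
        return max(scores, key=scores.get)

# Invented replies and categories (assumptions for this sketch).
nb = NaiveBayes()
nb.train("a friend told me", "word of mouth")
nb.train("coworker recommended it", "word of mouth")
nb.train("read about it on a blog", "web")
nb.train("saw it on slashdot news", "web")
print(nb.classify("my friend recommended it"))
```

For the interactive part, the program could show each guess, ask for confirmation, and call train with either the confirmed category or a brand-new one.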
Text::Ngrams + Algorithm::Cluster
Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
Cluster the vectors using Algorithm::Cluster to determine the groupings and also the keywords which correspond to the groups.
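The same two steps can be sketched in a few lines of Python to show the shape of the approach. This is not the Text::Ngrams or Algorithm::Cluster API: it uses plain word counts as vectors and a simple greedy single-pass clustering with a cosine threshold, swapped in here only because it is short. The threshold value, the function names and the sample replies are assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Word-count vector for one reply."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster_replies(texts, threshold=0.3):
    """Greedy pass: join the most similar existing group, or start a new one."""
    groups = []  # each group: {"centroid": summed word counts, "members": [texts]}
    for text in texts:
        vec = vectorize(text)
        best = max(groups, key=lambda g: cosine(vec, g["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(text)
            best["centroid"].update(vec)
        else:
            groups.append({"centroid": Counter(vec), "members": [text]})
    return groups

def keywords(group, n=3):
    """Most frequent words in a group, as a rough description of it."""
    return [word for word, _ in group["centroid"].most_common(n)]

# Invented replies (assumption for this sketch).
replies = ["a friend told me", "my friend told me about it",
           "linux kernel mailing list", "the kernel mailing list"]
groups = cluster_replies(replies)
for group in groups:
    print(keywords(group), group["members"])
```

The result of a greedy pass depends on the order of the replies and on the threshold, so treat the groups as suggestions to confirm by hand.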
You are not going to like this. But: If you do a survey and you include lots of free-form questions, you better be prepared to categorize them manually. If that is out of the question, why did you have those questions in the first place?
I've brute forced stuff like this in the past with quite large corpuses. Lingua::EN::Tagger, Lingua::Stem::En. Also the Net::Calais API is (unfortunately, as Thomson Reuters are not exactly open source friendly) pretty useful for extracting named entities from text. Of course once you've cleaned up the raw data with this stuff, the actual data munging is up to you. I'd be inclined to suspect that frequency counts and a bit of mechanical turk cross-validation of the output would be sufficient for your needs.
Look for common words as keywords, but throw away meaningless ones like "the", "a", etc. After that you get into natural language stuff that is beyond me.
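That first step is only a few lines in most languages. Here is a minimal Python sketch; the stop-word list is a tiny hand-made one and the replies are invented, so both are assumptions rather than anything from the survey.

```python
import re
from collections import Counter

# Tiny hand-made stop-word list (assumption; a real one would be much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on",
              "it", "i", "my", "me", "from", "about"}

def keyword_counts(replies):
    """Count every word across all replies, skipping the stop words."""
    counts = Counter()
    for reply in replies:
        words = re.findall(r"[a-z']+", reply.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts

# Invented replies (assumption for this sketch).
replies = ["A friend told me about it", "From a friend",
           "The LWN article", "I read an article on LWN"]
counts = keyword_counts(replies)
print(counts.most_common(3))
```

The most frequent keywords are then candidates for category names, and counting the replies that contain each one gives a first rough table.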
It just dawned on me that the perfect solution for this is AAI (Artificial Artificial Intelligence). Use Amazon's Mechanical Turk. The Perl bindings are Net::Amazon::MechanicalTurk. At one penny per reply with a decent overlap (say three humans per reply) that would come to about $90 USD.
