where does the term flat-file come from? - history

I was wondering if anyone knows. Is 'flat' the opposite of hierarchical? Does it come from a phrase like, "this is a flat-out regular file"?

Looking at the wikipedia article (and one of its references), the 'flat' in flat file is as opposed to a structured file, where the structure can be hierarchical, relational, or a few other formats.

A flat file is read into a flat data structure; it is essentially an array, one big flat list of values.
A better way to store data would be, for example, a binary tree. That data is not flat, as it has a root and nodes.
A non-flat data structure makes sorting and searching much more efficient.
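To make the contrast concrete, here is a minimal sketch (the record format and keys are made up for illustration): a flat list forces a linear scan, while even minimal structure, here a sorted order searched by bisection standing in for a tree, makes lookup logarithmic.

```python
import bisect

# A "flat" file read into a flat structure: just a list of records.
records = ["carol:3", "alice:1", "bob:2", "dave:4"]

def flat_lookup(key):
    # Lookup in the flat list is a linear scan: O(n).
    for rec in records:
        if rec.startswith(key + ":"):
            return rec
    return None

# With structure (a sorted list searched by bisection, standing in
# for a tree), lookup drops to O(log n).
sorted_records = sorted(records)

def indexed_lookup(key):
    i = bisect.bisect_left(sorted_records, key + ":")
    if i < len(sorted_records) and sorted_records[i].startswith(key + ":"):
        return sorted_records[i]
    return None
```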

Well, an imaginative way to see it: flat files can be pictured on one plane, a flat surface, while the more complicated DBs have connections between tables that can only be pictured in 3D :P

Your speculation is correct: flat files are "flat" in that they do not have an internal hierarchical structure.

Related

Searching through polymorphic data with Elasticsearch

I am stumped at what seems to be a fundamental problem with Elasticsearch and polymorphic data. I would like to be able to find multiple types of results (e.g. users and videos and playlists) with just one Elasticsearch query. It has to be just one query, since that way Elasticsearch can do all the scoring and I won't have to do any magic to combine multiple query results of different types.
I know that Elasticsearch uses a flat document structure, bringing me to the following problem. If I index polymorphic data, I will have to specify a 'missing' value for each unique attribute that I care about in scoring subtypes of the polymorphic data.
I've looked for examples of others dealing with this problem and couldn't find any. There doesn't seem to be anything in the documentation on this either. Am I overlooking something obvious, or was Elasticsearch just not designed to do something like this?
Kind regards,
Steffan
That's not an issue with Elasticsearch itself; it's a problem (or limitation) of the underlying Lucene indexes. So any db/engine based on Lucene will have the same problems (if not worse :), ES does a ton of work for you). ES will probably ease the pain in future releases, but not dramatically. And IMO, there's hardly any high-performance search engine that can cope with truly polymorphic data.
The answer depends on your data structure, that's for sure. Basically, you have two options:
1. Put all your data in a single index and split it by types. You already know the overhead: Lucene indexes work poorly with sparse data. The more similar your data is, the fewer problems you have. Either way, ES will do all the underlying work for "missing" values; you only have to cope with the memory/disk overhead of storing sparse data.
If your data is organised with a parent-child relation (i.e. video -> playlist), you definitely need a single index for that data, which leaves you with this approach only.
2. Divide your data into multiple indexes. This way you have a slightly higher disk overhead for the Lucene indexes, plus possibly higher CPU usage when aggregating data from multiple shards (so you should tune your sharding accordingly).
You can still query ES for all your documents in a single request, as ES supports multi-index queries.
So this looks like purely a question of your data structure. I'd recommend simply firing up a small cluster to measure memory/disk/CPU usage for the expected data. For more details on "index vs shard", see the great article by Adrien.
Slightly off-topic: if ES doesn't seem to fit your needs, I still suggest you consider merging data on the application side. ES works great with multiple light requests (instead of a few heavier ones), and since your results from ES are already sorted, you only need to merge sorted streams given sorted input. Not so much magic there, tbh.
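Merging already-sorted per-type result streams on the application side really is little magic. A sketch with Python's `heapq.merge` (the streams, scores, and document names here are made-up placeholders for per-index ES results, not real API output):

```python
import heapq

# Hypothetical per-index result streams, each already sorted by
# descending score, as ES would return them.
users     = [(0.9, "user:alice"), (0.4, "user:bob")]
videos    = [(0.8, "video:cats"), (0.7, "video:dogs")]
playlists = [(0.6, "playlist:mix")]

def merged(*streams):
    # heapq.merge combines k sorted inputs lazily in O(n log k).
    # It merges ascending, so negate scores to merge by descending score.
    neg = [((-score, doc) for score, doc in stream) for stream in streams]
    for score, doc in heapq.merge(*neg):
        yield -score, doc

top = list(merged(users, videos, playlists))
```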

How does Spark's sort shuffle work?

From https://0x0fff.com/spark-architecture-shuffle/ I know that the default way of shuffling in Spark is sort shuffle. However, the description was not step-by-step enough to be clear to me. How does it work?
What I understand is that each mapper writes into exactly one AppendOnlyMap (what are the keys?), which is sorted (and spilled, but why spilled?) into potentially multiple... what exactly?... and then somehow written into some indexed file (what exactly is indexed, by what, with what key?). I think the idea in the end is that all those sorted-and-indexed files are brought together with a min-heap merge so that there is only one big file per reducer.
As one can see, there are more holes (things I don't understand) than Swiss cheese (things I do understand)...
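The spill-and-merge part of that picture can be modelled in a few lines. This is only a toy sketch, not Spark's actual implementation (which spills sorted runs to disk and writes a separate index file of per-partition byte offsets); the `spill_threshold` and the hash partitioner are illustrative assumptions:

```python
import heapq

NUM_REDUCERS = 4

def partition(key):
    # Which reducer a key belongs to (Spark uses a Partitioner here).
    return hash(key) % NUM_REDUCERS

def sort_shuffle(records, spill_threshold=3):
    """Toy model of sort shuffle on one mapper: buffer records, sort each
    full buffer by (partition, key) and 'spill' it, then min-heap-merge
    all spills into one output plus an index of partition start offsets."""
    spills, buf = [], []
    for key, value in records:
        buf.append((partition(key), key, value))
        if len(buf) >= spill_threshold:   # memory full -> spill a sorted run
            spills.append(sorted(buf))
            buf = []
    if buf:
        spills.append(sorted(buf))
    merged = list(heapq.merge(*spills))   # one pass over all sorted runs
    offsets = {}                          # where each reducer's data starts
    for i, (pid, _key, _value) in enumerate(merged):
        offsets.setdefault(pid, i)
    return merged, offsets
```

Each reducer then reads only its own slice of the merged output, located via the offset index.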

How to represent spatial data in Cassandra

Can someone tell me how to represent spatial data (coming from postgis) in Cassandra?
This presentation was pretty interesting, on the topic of spatial data in Cassandra, and may help:
http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php
Please provide a bit more detail on what you are trying to achieve.
This is particularly important for Cassandra (as opposed to a relational database), because you need to model the data to support the specific queries you need, rather than modelling the domain in a fairly generic way and using SQL to define queries afterwards.
Are you just trying to look up lat/longs for entities with unique identifiers, or do you have more complex shapes associated with your entities - or what?
Responding to Mr. Roland (and hopefully the OP):
You'd need to come up with your own indexing scheme, and store the indexes in Cassandra.
For example, you could subdivide the space into squares (perhaps using a hierarchical structure such as a quadtree) and store each square in a Cassandra row, with the columns storing the objects that fall within that square. Your client code would need to determine the correct square for each lat/long, look up the objects in the square (or squares) covering the radius you want, and then do a final client-side filter to remove any objects that fall just outside the radius but were picked up because they share a square with objects inside it.
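A rough sketch of that scheme in plain Python (the dict stands in for Cassandra rows keyed by grid square; the cell size and the flat Euclidean distance test are illustrative assumptions, and real geo code should use proper great-circle distances):

```python
import math
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees; tune to the expected query radius

def cell(lat, lon):
    # Row key: the grid square containing this point.
    return (math.floor(lat / CELL), math.floor(lon / CELL))

# Stand-in for Cassandra: one row per square, columns = objects in it.
rows = defaultdict(list)

def insert(obj_id, lat, lon):
    rows[cell(lat, lon)].append((obj_id, lat, lon))

def query_radius(lat, lon, radius_deg):
    # 1) Fetch all squares overlapping the radius (a few row lookups).
    span = int(math.ceil(radius_deg / CELL))
    clat, clon = cell(lat, lon)
    hits = []
    for dy in range(-span, span + 1):
        for dx in range(-span, span + 1):
            hits.extend(rows.get((clat + dy, clon + dx), []))
    # 2) Client-side filter: drop objects outside the true radius.
    return [o for o, la, lo in hits
            if (la - lat) ** 2 + (lo - lon) ** 2 <= radius_deg ** 2]
```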

Data structure for Objects

Which data structures should I store real-life 'objects' in?
I am not looking for a computer representation. I am looking for different data structures for different real-life items: access, storage, etc. Is there any study on this?
Update:
Based upon the comments, I should remove the 'data' from data structures; I am simply looking for structures to store various objects in, based upon usability rules.
Your question is a bit too vague to answer well, but in general you can think about using existing "objects"/models/representations of the abstract things you want to model or manipulate.
If those don't exist then you build your own.
Which data structure to use depends entirely on the type of operations you are going to perform on your data.
Some data structures are useful for random access (arrays), while others are fast for insert/delete operations (linked lists).
Some store key-value pairs (HashMap or TreeMap).
Different operations vary asymptotically from each other in terms of time and space, so use the data structure that properly suits your requirements.
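A minimal illustration of matching the structure to the dominant operation, with Python's built-ins standing in for the generic structures named above:

```python
import bisect
from collections import deque

# Pick the structure for the dominant operation, not for the data itself.

arr = [10, 20, 30]               # array/list: O(1) random access by index
assert arr[1] == 20

queue = deque([1, 2, 3])         # deque: O(1) insert/remove at both ends
queue.appendleft(0)              # cheap here, O(n) on a plain list

ages = {"alice": 30, "bob": 25}  # hash map (HashMap-like): O(1) avg lookup
assert ages["bob"] == 25

keys = ["alice", "bob", "carol"] # sorted sequence (TreeMap-like):
i = bisect.bisect_left(keys, "bob")  # O(log n) lookup, ordered iteration
assert keys[i] == "bob"
```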

Quick Filter List

Everyone is familiar with this functionality. If you open up the Outlook address book and start typing a name, the list below the search box instantly filters to contain only items that match your query. .NET Reflector has a similar feature when you're browsing types... you start typing, and regardless of how large the underlying assembly you're browsing is, it's near-instantaneous.
I've always kind of wondered what the secret sauce is here. How is it so fast? I imagine there are also different algorithms depending on whether the data is present in memory or needs to be fetched from some external source (i.e. a DB, searching some file, etc.).
I'm not sure if this is relevant, but if there are resources out there, I'm particularly interested in how one might do this with WinForms... though if you know of general resources, I'm interested in those as well :-)
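One common trick for the in-memory case is to narrow the previous match set on every keystroke rather than rescanning the whole list, so the work shrinks as the query grows. A minimal sketch (the names and the case-insensitive substring test are illustrative assumptions; real implementations often add a precomputed index or a trie, as discussed below):

```python
names = ["Alice Adams", "Albert Allen", "Bob Brown", "Alina Berg"]

def quick_filter(names):
    """Return a callback that filters incrementally on each keystroke:
    every new character narrows only the *previous* result set."""
    matches = names
    query = ""

    def on_keystroke(ch):
        nonlocal matches, query
        query += ch.lower()
        matches = [n for n in matches if query in n.lower()]
        return matches

    return on_keystroke

type_char = quick_filter(names)
```

(Deleting a character is the one case that forces a wider rescan; UIs typically keep a small stack of previous result sets to undo cheaply.)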
What is the most common use of the trie data structure?
A Trie is basically a tree structure for storing a large list of similar strings; it provides fast lookup of strings (like a hashtable) and also allows you to iterate over them in alphabetical order.
Image from: http://en.wikipedia.org/wiki/Trie:
In this case, the Trie stores the strings:
i
in
inn
to
tea
ten
For any prefix that you enter (for example, 't' or 'te'), you can easily look up all of the words that start with that prefix. More importantly, lookups depend on the length of the string, not on how many strings are stored in the Trie. Read the wikipedia article I referenced to learn more.
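A minimal Python trie along those lines, with a prefix lookup whose walk-down cost depends only on the prefix length (the class and method names are my own, for illustration):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starting_with(self, prefix):
        # Walk down the prefix: O(len(prefix)), independent of word count.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect all words below this node, in alphabetical order.
        out = []
        def walk(n, acc):
            if n.is_word:
                out.append(acc)
            for ch in sorted(n.children):
                walk(n.children[ch], acc + ch)
        walk(node, prefix)
        return out

# The six strings from the wikipedia figure above:
t = Trie(["i", "in", "inn", "to", "tea", "ten"])
```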
The process is called full-text indexing/search.
If you want to play with the algorithms and data structures for this, I would recommend reading Programming Collective Intelligence for a good introduction to the field; if you just want the functionality, I would recommend Lucene.