We are a large company, selling frobnication services to tens of thousands of customers via phone calls. Orders get recorded on punch cards, featuring
a customer ID
a date
the dollar amount of frobnication bought.
In order to process these into monthly bills to our users, we're ready to buy computing equipment modern enough for the '60s. I presume we're going to store our user database on a tape (... since... that's where you can store a lot of data with 60s tech, right?).
Sales record punch cards are coming in unsorted. Even if the records on tape are sorted by e.g. customer ID, doing one "seek" / lookup for each punch card / customer ID coming in (to update e.g. a "sum" amount) would be very slow. Meanwhile, if you have e.g. 256k of RAM (even less?), significant parts of the data set just won't fit.
My question is: how can this database operation be done in practice? Do you sort the punch cards first & then go through the tape linearly? How do you even sort punch cards? Or do you copy all of them to a tape first? Do you need multiple batch jobs to do all of this? How much of this is code we'll have to write vs. something that's coming with the OS?
(... yes I've heard about those fridge-size devices with spinning metal disks that can randomly seek many times a second; I don't think we'll be able to afford those.)
In the '60s you would most likely:
Store your data in a master file sorted in key sequence.
Sort the punch cards to a temporary disk file.
Do a master-file update using the temporary disk file (the transaction file) and the master file.
They might have used an indexed file or some database (e.g. IMS) if online access was required.
Master File Update
For a master-file update, both files need to be sorted into the same sequence and you match on keys; the job writes an updated master file using the details from the two inputs. It is basically like a SQL outer join.
Logic
read master-file
read transaction-file
while not eof-master-file and not eof-transaction-file
    if transaction-file-key < master-file-key
        write transaction-file details as a new record to updated-master-file
        read transaction-file
    else_if transaction-file-key == master-file-key
        update master-file-record with transaction-file-details (the record is written out when the master file advances)
        read transaction-file
    else
        write master-file-record to updated-master-file
        read master-file
    end_if
end_while
process remaining transaction-file records (as new records)
process remaining master-file records (including the current, possibly updated, one)
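For a modern reader, here is a minimal sketch of the same matching logic in Java (my own illustration, not part of the original answer; the file names and the "key,amount" record layout are hypothetical, and keys are assumed to be fixed-width strings so that string comparison matches the sort order):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class MasterFileUpdate {
    public static void main(String[] args) throws IOException {
        try (BufferedReader master = new BufferedReader(new FileReader("master.txt"));
             BufferedReader trans = new BufferedReader(new FileReader("transactions.txt"));
             PrintWriter updated = new PrintWriter("master.new.txt")) {

            String m = master.readLine();
            String t = trans.readLine();

            while (m != null && t != null) {
                int cmp = key(t).compareTo(key(m));
                if (cmp < 0) {                      // transaction with no master record: insert it
                    updated.println(key(t) + "," + amount(t));
                    t = trans.readLine();
                } else if (cmp == 0) {              // match: apply the transaction, keep reading transactions
                    m = key(m) + "," + (amount(m) + amount(t));
                    t = trans.readLine();
                } else {                            // master record with no (more) transactions: copy it out
                    updated.println(m);
                    m = master.readLine();
                }
            }
            // write out the remaining master records (including the current, possibly updated, one)
            while (m != null) { updated.println(m); m = master.readLine(); }
            // remaining transactions have no master record: insert them
            while (t != null) { updated.println(key(t) + "," + amount(t)); t = trans.readLine(); }
        }
    }

    private static String key(String record)  { return record.split(",")[0]; }
    private static long amount(String record) { return Long.parseLong(record.split(",")[1]); }
}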
We are starting to think about how we will utilize locations within warehouses to keep tighter tracking and control over where our items are physically located. In that context, we are trying to figure out what the actual workflow would be for performing physical inventory counts and review. I have read the documentation, but I'm wondering how best to think through the scenario below.
Let's say, to start, that we have 10 serialized items across 5 locations (so let's assume 2 in each location), and assume that all these locations are in the same warehouse.
2 weeks go by, and there is movement between these locations by way of the inventory transfer document process. But for this example, let's say that users didn't perform an inventory transfer every time they physically moved an item between locations.
So at this point, where Acumatica thinks the serial items are doesn't reflect the reality of where they actually are.
So now we do a physical inventory for this warehouse (all 5 locations together).
By the time we complete the inventory count and review, we will see the 10 items in the same warehouse. BUT:
Will we be able to see the variances/problems against the locations? Meaning, will it highlight/catch where the items actually are located vs. where Acumatica thought they were located?
And assuming yes, is there anything in the physical inventory process that will automatically transfer each item to its correct location within the warehouse? Or does this then need to be done manually through an inventory transfer?
Any help would be much appreciated.
Thanks.
I am making a game and will use Hazelcast to save player data. Should I save each player's data as a Map? Because if I save a player's data as a single map entry, then with every small change, like increasing the gold value, I have to put back all of the player's data:
ex:
playerData.gold = newValue;
players.replace(playerID, playerData);
but if I make each player's data its own Map, then I can just put the new gold value, ex:
playerA.replace("gold", newGoldValue)
But I'm afraid that creating many maps is not good (in case I have more than 1 million players). Can I create as many maps as I want? If not, how many maps can I create?
Hoang
I just answered the same question in our community channel:
You can use an EntryProcessor to read or update only a single property (or some of the properties) of the player data. I don't know how many users you'll have, but if you have 100K users, for example, creating 100K maps won't help you; you can keep 100K, tens of millions, or more records in a single map. Please check: http://docs.hazelcast.org/docs/3.10.4/manual/html-single/index.html#entry-processor and https://github.com/hazelcast/hazelcast-code-samples/tree/master/distributed-map/entry-processor
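For illustration, here is a minimal sketch (not from the original answer) of bumping only the gold value with an EntryProcessor, assuming a hypothetical PlayerData value class and the Hazelcast 3.x API (AbstractEntryProcessor, IMap.executeOnKey):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.AbstractEntryProcessor;

import java.io.Serializable;
import java.util.Map;

public class GoldUpdateExample {

    // Hypothetical value object; a real one would use a proper serialization strategy.
    public static class PlayerData implements Serializable {
        public long gold;
    }

    // Runs on the member that owns the key, so only the processor and the result
    // travel over the network instead of the whole player record.
    public static class AddGoldProcessor extends AbstractEntryProcessor<String, PlayerData> {
        private final long delta;

        public AddGoldProcessor(long delta) {
            this.delta = delta;
        }

        @Override
        public Object process(Map.Entry<String, PlayerData> entry) {
            PlayerData data = entry.getValue();
            data.gold += delta;
            entry.setValue(data);   // write the mutated value back into the map
            return data.gold;
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, PlayerData> players = hz.getMap("players");
        players.set("player-42", new PlayerData());

        Object newGold = players.executeOnKey("player-42", new AddGoldProcessor(100));
        System.out.println("gold is now " + newGold);
        hz.shutdown();
    }
}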
On the other hand, if each user's data is larger than 5-10 MB and you need to access all of that data frequently, then splitting it up is a good idea.
Please note that these are just general suggestions. You need to test different setups & find the most suitable & performant one for your use case.
I would like to design a system that:
Will be reading CDR (call data record) files and inserting them into a NoSQL database. To achieve this, Spark Streaming with Cassandra as the NoSQL store looks promising, as the files will keep coming.
Will be able to calculate the real-time price by rating the duration and called number (or just the kilobytes in the case of data) and store the total chargeable amount so far for the current bill cycle. I need a NoSQL store into which I can both insert rated CDRs and update the total chargeable amount so far for the current bill cycle for the MSISDN in that CDR.
In case rate plans are updated for a specific subscription, all the CDRs using that price plan in the current bill cycle need to be re-rated, and the total amount so far needs to be recalculated for all affected customers.
Notes:
MSISDNs are unique for each subscription (a one-to-one relation).
Within a month, one MSISDN can have up to 100,000 CDRs.
I have been going through the NoSQL databases; so far I am thinking of using Cassandra, but I am still not sure how to design the database to optimize for this business case.
Please also consider that while one CDR is being processed on one node, another CDR for the same MSISDN can be processed on another node at the same time, with both nodes doing the above logic.
The question is indeed very broad - Stack Overflow is meant to cover more specific technical questions, not to debate the architectural aspects of an entire system.
Apart from this, let me attempt to address some of the aspects of your questions:
a) Using streaming for CDR processing:
Spark Streaming is indeed a tool of choice for incoming CDRs, typically delivered over a message queueing system such as Kafka. It allows windowed operations, which come in handy when you need to calculate call charges over a set period (hours, days, etc.). You can very easily combine existing static records, such as price plans from other databases, with your incoming CDRs in windowed operations. All of this comes with a robust and extensive API.
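As a rough illustration (my own sketch, not from the original answer), a windowed charge calculation could look like the following, assuming CDRs arrive as simple "msisdn,seconds" text lines over a socket as a stand-in for Kafka, and using a hard-coded rate instead of a real price-plan lookup:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class CdrWindowedCharges {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("CdrWindowedCharges").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        final double ratePerSecond = 0.002;   // stand-in for a real price-plan lookup

        // hypothetical source: one "msisdn,seconds" line per rated call leg
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // (msisdn, charge) pairs for each incoming CDR line
        JavaPairDStream<String, Double> charges = lines.mapToPair(line -> {
            String[] parts = line.split(",");
            double seconds = Double.parseDouble(parts[1]);
            return new Tuple2<>(parts[0], seconds * ratePerSecond);
        });

        // total charge per MSISDN over a sliding one-hour window, recomputed every minute
        JavaPairDStream<String, Double> hourlyCharges =
                charges.reduceByKeyAndWindow(Double::sum, Durations.minutes(60), Durations.minutes(1));

        hourlyCharges.print();

        ssc.start();
        ssc.awaitTermination();
    }
}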
b) Using Cassandra as a store
Cassandra has excellent scaling capabilities with instantaneous row access - for that, it's an absolute killer. However, in a telco industry setting, I would seriously question using it for anything other than MSISDN lookups and credit checks. Cassandra is essentially a columnar key-value store, and trying to store multi-dimensional, essentially relational records such as price plans, contracts and the like will give you lots of headaches. I would suggest storing your data in different stores, depending on the use cases. These could be:
CDR raw records in HDFS -> CDRs can be plentiful, and if you need to reprocess them, collecting them from HDFS will be more efficient
Bill summaries in Cassandra -> the itemized bill summaries are the result of the CDRs as initially processed by Spark Streaming. These are essentially columnar and can be perfectly stored in Cassandra (a sketch follows this list)
MSISDN and Credit information -> as mentioned above, that is also a perfect use case for Cassandra
price plans -> these are multi-dimensional, more document-oriented, and should be stored in databases that support such structures. You can perfectly use Postgres with JSON for that, as you wouldn't expect more than a handful of plans.
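To illustrate the bill-summaries point above (my own sketch, not from the answer): a counter column can hold the running chargeable total per MSISDN and bill cycle, which also tolerates two nodes rating CDRs for the same MSISDN at the same time (keeping in mind that counter updates are not idempotent on retries). Keyspace, table and column names here are hypothetical; the code uses the DataStax Java driver 3.x.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BillSummaryStore {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS billing "
                    + "WITH replication = {'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS billing.charges_so_far ("
                    + "msisdn text, billcycle text, total_millicents counter, "
                    + "PRIMARY KEY (msisdn, billcycle))");

            PreparedStatement addCharge = session.prepare(
                    "UPDATE billing.charges_so_far "
                    + "SET total_millicents = total_millicents + ? "
                    + "WHERE msisdn = ? AND billcycle = ?");

            // add the rated cost of one CDR (values are placeholders) to the running total
            session.execute(addCharge.bind(1250L, "491701234567", "2023-11"));
        }
    }
}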
To conclude this, you're actually looking at a classic lambda use case with Spark Streaming for immediate processing of incoming CDRs, and batch processing with regular Spark on HDFS for post processing, for instance when you're recalculating CDR costs after plan changes.
I am trying to build a real-time stock application.
Every seconds I can get some data from web service like below:
[{"amount":"20","date":1386832664,"price":"183.8","tid":5354831,"type":"sell"},{"amount":"22","date":1386832664,"price":"183.61","tid":5354833,"type":"buy"}]
tid is the ticket ID for stock buying and selling;
date is the number of seconds since 1970-01-01;
price/amount say at what price and how many shares were traded.
Requirement
My requirement is to show the user the highest/lowest price for every minute/5 minutes/hour/day in real time, and to show the user the sum of the amounts for every minute/5 minutes/hour/day in real time.
Question
My question is how to store the data in Redis so that I can easily and quickly get the highest/lowest trade from the DB for different periods.
My design is something like below:
[date]:[tid]:amount
[date]:[tid]:price
[date]:[tid]:type
I am new to Redis. If the design is like this, does that mean I need to use a sorted set, and will there be any performance issues? Or is there another way to get the highest/lowest price for different periods?
Looking forward to your suggestions and design.
My suggestion is to store min/max/total for all intervals you are interested in and update them for the current ones with every arriving data point. To avoid network latency when reading previous data for comparison, you can do it entirely inside the Redis server using Lua scripting.
One key per data point (or, even worse, per data point field) is going to consume too much memory. For the best results, you should group data into small lists/hashes (see http://redis.io/topics/memory-optimization). Redis only allows one level of nesting in its data structures: if your data has multiple fields and you want to store more than one item per key, you need to somehow encode it yourself. Fortunately, the standard Redis Lua environment includes msgpack support, which is a very efficient binary JSON-like format. The JSON entries in your example, encoded with msgpack "as is", will be 52-53 bytes long. I suggest grouping by time so that you have 100-1000 entries per key. Suppose a one-minute interval fits this requirement. Then the keying scheme would be like this:
YYmmddHHMMSS — a hash from tid to msgpack-encoded data points for the given minute.
5m:YYmmddHHMM, 1h:YYmmddHH, 1d:YYmmdd — window data hashes which contain min, max, sum fields.
Let's look at a sample Lua script that will accept one data point and update all keys as necessary. Due to the way Redis scripting works, we need to explicitly pass the names of all keys that will be accessed by the script, i.e. the live data key and all three window keys. Redis Lua also has a JSON parsing library available, so for the sake of simplicity let's assume we just pass it the JSON dictionary. That means we have to parse the data twice, on the application side and on the Redis side, but the performance effect of that is not clear.
local function update_window(winkey, price, amount)
    -- HGETALL returns a flat {field1, value1, field2, value2, ...} array in Lua,
    -- so convert it into a field -> value table before using it
    local flat = redis.call('HGETALL', winkey)
    local windata = {}
    for i = 1, #flat, 2 do
        windata[flat[i]] = flat[i + 1]
    end
    if price > tonumber(windata.max or 0) then
        redis.call('HSET', winkey, 'max', price)
    end
    if price < tonumber(windata.min or 1e12) then
        redis.call('HSET', winkey, 'min', price)
    end
    redis.call('HSET', winkey, 'sum', (tonumber(windata.sum) or 0) + amount)
end
local currkey, fiveminkey, hourkey, daykey = unpack(KEYS)
local data = cjson.decode(ARGV[1])
local packed = cmsgpack.pack(data)
local tid = data.tid
-- store the raw data point in the per-minute hash, keyed by tid
redis.call('HSET', currkey, tid, packed)
local price = tonumber(data.price)
local amount = tonumber(data.amount)
-- update min/max/sum for each window key
update_window(fiveminkey, price, amount)
update_window(hourkey, price, amount)
update_window(daykey, price, amount)
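To call the script from the application side, something like the following could work (a sketch using the Jedis client; the script file name and the key construction are my approximation of the scheme above, and the sample trade is the one from the question):

import redis.clients.jedis.Jedis;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;

public class TradeIngest {
    public static void main(String[] args) throws IOException {
        // the Lua script above, assumed to be saved as update_windows.lua
        String script = new String(Files.readAllBytes(Paths.get("update_windows.lua")));

        // sample trade from the question
        String json = "{\"amount\":\"20\",\"date\":1386832664,\"price\":\"183.8\",\"tid\":5354831,\"type\":\"sell\"}";
        ZonedDateTime ts = Instant.ofEpochSecond(1386832664L).atZone(ZoneOffset.UTC);

        // keys roughly following the scheme above; the 5-minute bucket is floored to a multiple of 5
        String minuteKey  = ts.format(DateTimeFormatter.ofPattern("yyMMddHHmm"));
        String fiveMinKey = "5m:" + ts.withMinute(ts.getMinute() / 5 * 5)
                                      .format(DateTimeFormatter.ofPattern("yyMMddHHmm"));
        String hourKey    = "1h:" + ts.format(DateTimeFormatter.ofPattern("yyMMddHH"));
        String dayKey     = "1d:" + ts.format(DateTimeFormatter.ofPattern("yyMMdd"));

        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.eval(script, Arrays.asList(minuteKey, fiveMinKey, hourKey, dayKey),
                       Arrays.asList(json));
        }
    }
}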
This setup can do thousands of updates per second, is not very hungry on memory, and window data can be retrieved instantly.
UPDATE: On the memory part, 50-60 bytes per point is still a lot if you want to store more than a few million points. With this kind of data I think you can get as low as 2-3 bytes per point using a custom binary format, delta encoding, and subsequent compression of chunks using something like snappy. Whether it's worth doing depends on your requirements.
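As a rough illustration of that idea (my own sketch, not from the answer): store prices as integer ticks, write only the deltas between consecutive points, and compress the chunk. Here java.util.zip.Deflater stands in for snappy, and fixed 4-byte deltas keep the code short; a variable-length (varint) encoding would be needed to actually approach 2-3 bytes per point.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;

public class DeltaChunk {
    // Encode a chunk of prices (in ticks, e.g. cents) as deltas, then compress the result.
    public static byte[] encode(int[] priceTicks) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            int previous = 0;
            for (int tick : priceTicks) {
                out.writeInt(tick - previous);   // small deltas compress much better than raw values
                previous = tick;
            }
        }
        byte[] raw = bytes.toByteArray();

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            compressed.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return compressed.toByteArray();
    }
}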
I currently have a few server reports that return usage statistics whenever run. The data is collected from several different sources (mostly log files), so they're not in a database to begin with.
The returned data are simple lists, for example, detailing how much disk space a user is using (user => space) average percent memory they've used for the month (user => memory), avg CPU time, etc.
Some of the information is a running total (like disk usage) and others are averages of snapshots taken throughout the month.
Running these reports and looking at the results works perfectly, but I'd like to start storing these results to look at long-term trends.
What would be the best way to do this?
Cacti is very useful and highly configurable. It utilizes RRDtool.
RRDtool is great because it stores data in a circular format and summarizes it. When RRDtool creates a data file, it creates it with every data point that it will ever store, so it never gets larger. You don't have to worry about log files getting too large. The key is to configure it to summarize over time periods, e.g. daily, monthly, yearly. The downside is that next year you might not be able to know the CPU usage for a specific five-minute period from January 1 of this year. But who really needs that?
RRDtool seems the obvious solution for this.
Or, for that matter, one of the out-of-the-box monitoring tools, some of which happen to use RRDtool for storing their data, e.g. Munin.