Let's suppose that we have a GeoIP database: IPrangeStart, IPrangeEnd, country.
For example:
1.1.1.1:2.2.2.2:US
3.3.3.3:4.4.4.4:DE
etc.
This database has a lot of rows, but all of the data fits comfortably in memory (about 200-500 MB). Now we need to find the country for a given IP. What data structure fits best for doing this (we'll convert each IP to an int, of course)?
An array that's sorted by the range start value will let you find the proper range with a simple binary search. I don't know how many address ranges you're working with, but even if you had a million ranges the binary search will take at most 20 probes. You could easily do tens of thousands of lookups per second with that.
Another option is a segment tree, although I don't see it being particularly helpful in this situation since you don't have overlapping intervals.
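To make the binary-search approach concrete, here is a minimal Python sketch (the sample data and function names are mine, not from the question); it assumes the ranges have already been converted to integers and sorted by range start:

```python
import bisect
import ipaddress

# Hypothetical sample data: (range_start, range_end, country) with IPs as ints.
ranges = sorted(
    (int(ipaddress.ip_address(s)), int(ipaddress.ip_address(e)), c)
    for s, e, c in [("1.1.1.1", "2.2.2.2", "US"),
                    ("3.3.3.3", "4.4.4.4", "DE")]
)
starts = [r[0] for r in ranges]  # parallel list of range starts for bisect

def country_for_ip(ip_str):
    ip = int(ipaddress.ip_address(ip_str))
    # Rightmost range whose start is <= ip; then verify ip is inside it.
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and ip <= ranges[i][1]:
        return ranges[i][2]
    return None  # ip falls in a gap between ranges

print(country_for_ip("1.2.3.4"))    # US
print(country_for_ip("2.255.0.0"))  # None
```

Each lookup is a single bisect over the sorted starts, so even with millions of ranges you stay around 20 comparisons per query.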
I am trying to implement a moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while each row contains all parameters measured in a second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy; the partition would be on the stock ticker symbol. After all, the calculation of the 50-day moving average for Stock A has nothing to do with that for Stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation--not to mention that it only requires one data point (probably) per day (the closing price of the stock at the end of trading) while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key, even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition, for example on days.
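As a hedged illustration of that last suggestion, here is a PySpark sketch (the column names, input path, and the 60-second window are my assumptions) that derives a day column purely to act as the partition key and then computes the moving average with a window function:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Assumed input: one row per second with a 'timestamp' column and parameter columns.
df = spark.read.parquet("measurements.parquet")  # hypothetical path

# Add a column whose only job is to be the partition key, as suggested above.
df = df.withColumn("day", F.to_date("timestamp"))

# 60-second moving average of parameter1: the frame is the current row
# plus the 59 preceding rows within the same day.
w = (Window.partitionBy("day")
           .orderBy("timestamp")
           .rowsBetween(-59, 0))

result = df.withColumn("parameter1_ma60", F.avg("parameter1").over(w))
result.select("timestamp", "parameter1", "parameter1_ma60").show(5)
```

The trade-off is that a window near midnight only sees rows from its own day; if that matters, a coarser partition (for example by week) would reduce the edge effect at the cost of larger partitions.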
I am storing account information in Cassandra. Each account has lists of data associated with it. For example, an account may have a list of friends and a list of liked books. Queries on accounts will always want all friends or all liked books or all of both. No filtering or searching is needed on either. The list of friends and books can grow and shrink.
Is it better to use a set column type or composite columns for this scenario?
I would suggest you not use sets if:
You are concerned about disk space (each value is allocated its own cell on disk, plus metadata for each cell, which is 15 bytes if I am not wrong; that adds up quickly if your data keeps growing).
The data in that particular row is going to grow a lot, since the cells may have to be fetched from different SSTables each time.
In these kinds of cases, the preferred option would be a JSON array. You can store it as text and read the data back from that.
Sets (and the other collection types) were designed for a different use case. If you need to look up a particular value inside the list, or a value has to be updated frequently inside the same collection, then collections are the right tool.
My take on your query is this:
Store all account-specific info in a single JSON object per account, holding the list of friends and the list of liked books.
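To illustrate the JSON-as-text idea, here is a rough Python sketch using the DataStax driver; the keyspace, table, and column names are hypothetical, and the table simply has one text column holding the serialized object:

```python
import json
import uuid
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")  # hypothetical keyspace

# Hypothetical table:
#   CREATE TABLE accounts (account_id uuid PRIMARY KEY, profile text);
account_id = uuid.uuid4()
profile = {"friends": ["alice", "bob"], "liked_books": ["Dune", "Hyperion"]}

# Write the whole object as one text cell instead of one cell per element.
session.execute(
    "INSERT INTO accounts (account_id, profile) VALUES (%s, %s)",
    (account_id, json.dumps(profile)),
)

# Reading it back is a single-cell fetch followed by a local parse.
row = session.execute(
    "SELECT profile FROM accounts WHERE account_id = %s", (account_id,)
).one()
data = json.loads(row.profile)
```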
Sets are good for smaller collections of data. If you expect your friends / liked-books lists to grow constantly and get large (there isn't a golden number here), it would be better to go with composite columns, as that model scales out better than collections and allows for straight-up querying, whereas collections require secondary indexes.
I have several documents which contain statistical data on the performance of companies. There are about 60 different Excel sheets representing different months, and I want to collect the data into one big table. The original tables look something like this, but are bigger:
Each company takes two rows, which represent its profit from sales of the product and its cost to manufacture the product. I need both of these numbers.
As I said, there are ~60 of these tables and I want to extract information about Product2. I want to put everything into one table where columns would represent months and rows the profit and costs of each company. It could be done easily (I think) with the INDEX function, as all sheets are named similarly. The problem I faced is that at some periods of time other companies enter the market:
Some of them stay, some of them fail. I would like to collect information on all companies that exist today or ever existed, but newly founded companies distort the list (in the second picture we see that company BA is in the 4th row, not BB). As a company's row changes from time to time, using INDEX becomes problematic, because in some cases results from different companies end up in one row. Adjusting them one by one seems very painful.
Maybe there is some quick and efficient method to solve such a problem?
Any help or ideas would be appreciated.
One thing you may want to try is linking the Excel spreadsheets as tables in Access. From there you can create a query that ties the tables together. As data changes in the spreadsheets, the query will reflect those changes.
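If Access is not an option, a scripted approach can sidestep the row-shifting problem entirely by matching on the company name instead of the row position. Here is a hedged pandas sketch; the file layout, sheet structure, and column names are my assumptions:

```python
from pathlib import Path
import pandas as pd

# Assumed layout: one workbook per month in monthly/, each with columns
# Company, Metric ("Profit" or "Cost"), Product1, Product2, ...
frames = []
for path in sorted(Path("monthly").glob("*.xlsx")):
    month = path.stem  # use the file name as the month label
    df = pd.read_excel(path)
    df = df[["Company", "Metric", "Product2"]].rename(columns={"Product2": month})
    frames.append(df.set_index(["Company", "Metric"]))

# Outer-join on (Company, Metric): companies that appear or disappear over
# time simply get blanks for the months in which they are missing.
result = pd.concat(frames, axis=1)
result.to_excel("product2_summary.xlsx")
```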
I have recently been working on a data table in Excel containing measurements of fossil specimens. In addition to containing things like the specimen number, species name, etc., the table also contains measurements from the fossils in question. However, because several specimens have data from both the left and right sides of the specimen, I often end up with situations where a single entry spans multiple rows, which means I cannot sort the data.
I have looked elsewhere on the Internet for a solution, and the only response I have gotten is that Excel doesn't really work well with entries spanning multiple rows, and I should reorganize my data. I understand that, and I have been looking for an alternate way of organizing the data, but I have not been able to find an easy one. I have tried reorganizing the information so each entry spans multiple rows, but when I do this it becomes very easy to make mistakes and to lose track of the data. It also becomes difficult to compare the data, since the measurements on the left and right sides of the specimen are essentially the same thing, and I cannot easily compare them if one specimen has a bone preserved only on the right side and another specimen has the same bone preserved only on the left side.
I have also tried organizing the measurements into a separate sheet which could be accessed by a hyperlink from the main sheet, but this has also posed problems. Because in this case the measurement data still cannot be sorted by specimen number or species name, if a specimen number or species name changes (which it has in the past), I have to reorganize all the hyperlinks by hand.
Finally, I have also tried adding an identifier to the multi-row entries, but this has a tendency to get screwed up if I sort the data, and it also mixes up any equations I use in the sheet. I might be doing it wrong somehow.
The good news is I am not interested in sorting the specimens by measurements, so if there is any way to organize the table so it is sortable but the measurements cannot be sorted, that is fine. At the same time, because all specimens technically have a left and right side (plus the average measurement between them), I could also work with a system wherein each "entry" spanned a set number of rows or subrows.
I was also wondering if it would be possible to write a macro to sort the data (especially since I am just sorting by the first five columns or so), or else do the database in some other program like Microsoft Access. Any help would be greatly appreciated.
Everything you describe really breaks down to: "Excel is great for analysis; but I really should have stored my source data in a database." Accountants I have worked with almost always come to this conclusion eventually, once their data and reporting needs get sufficiently complex.
I suggest you invest the moderate effort to upload your data to a proper database, and learn how to pull it back into Excel as appropriate for specific analyses. The time will be well spent, and it is simpler by far than coercing Excel into tasks for which it is ill-suited.
MS Access, MySQL, and SQL Server Express are all suitable for this type of upgrade. MS Access, if already available in your Office subscription, has the advantage of integrating even more easily with Excel than the other two, and it also uses VBA as its macro language. The other two offer more complete and powerful implementations of the SQL language. All told, use the one most easily available to you.
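To show what "a proper database" might look like for this data, here is a hedged sketch using SQLite from Python (table and column names are invented; the same schema idea carries over to Access, MySQL, or SQL Server): one row per specimen, and one row per individual measurement keyed by specimen and side, so the specimen list stays sortable and left/right comparisons become simple queries.

```python
import sqlite3

con = sqlite3.connect("fossils.db")
cur = con.cursor()

# One row per specimen; measurements get their own table, one row per
# measurement, keyed by specimen and side. Column names are invented.
cur.executescript("""
CREATE TABLE IF NOT EXISTS specimens (
    specimen_id   TEXT PRIMARY KEY,
    species_name  TEXT
);
CREATE TABLE IF NOT EXISTS measurements (
    specimen_id  TEXT REFERENCES specimens(specimen_id),
    element      TEXT,   -- e.g. 'femur'
    side         TEXT,   -- 'left' or 'right'
    length_mm    REAL
);
""")

# Comparing left vs right per specimen and element is now a plain query.
cur.execute("""
    SELECT s.specimen_id, s.species_name, m.element,
           AVG(CASE WHEN m.side = 'left'  THEN m.length_mm END) AS left_mm,
           AVG(CASE WHEN m.side = 'right' THEN m.length_mm END) AS right_mm
    FROM specimens s JOIN measurements m USING (specimen_id)
    GROUP BY s.specimen_id, s.species_name, m.element
    ORDER BY s.species_name, s.specimen_id
""")
print(cur.fetchall())
```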
I'm using Windows Azure and venturing into Azure Table Storage for the first time in order to make my application scalable to high density traffic loads.
My goal is simple, log every incoming request against a set of parameters and for reporting count or sum the data from the log. In this I have come up with 2 options and I'd like to know what more experienced people think is the better option.
Option 1: Use Boolean Values and Count the "True" rows
Because each row is written once and never updated, store each count parameter as a bool and in the summation thread, pull the rows in a query and perform a count against each set of true values to get the totals for each parameter.
This would save space if there are a lot of parameters because I imagine Azure Tables store bool as a single bit value.
Option 2: Use Int Values and Sum the rows
Each row is written as above, but instead each parameter column is added as a value of 0 or 1. Summation would occur by querying all of the rows and using a Sum operation for each column. This would be quicker because Summation could happen in a single query, but am I losing something in storing 32 bit integers for a Boolean value?
I think at this point for query speed, Option 2 is best, but I want to ask out loud to get opinions on the storage and retrieval aspect because I don't know Azure Tables that well (and I'm hoping this helps other people down the road).
Table storage doesn't do aggregation server-side, so for both options, you'd end up pulling all the rows (with all their properties) locally and counting/summing. That makes them both equally terrible for performance. :-)
I think you're better off keeping a running total, instead of re-summing everything every time. We talked about a few patterns for that on Cloud Cover Episode 43: http://channel9.msdn.com/Shows/Cloud+Cover/Cloud-Cover-Episode-43-Scalable-Counters-with-Windows-Azure
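As a rough illustration of the running-total idea (not the exact pattern from the episode), here is a Python sketch assuming the azure-data-tables package; the connection string, table name, partition key, and property names are all placeholders:

```python
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient  # assumed: azure-data-tables package

# Connection string and table name are placeholders.
table = TableClient.from_connection_string(conn_str="...", table_name="counters")

def increment(counter_name, delta=1):
    """Maintain a running total instead of re-summing all log rows."""
    try:
        entity = table.get_entity(partition_key="totals", row_key=counter_name)
    except ResourceNotFoundError:
        entity = {"PartitionKey": "totals", "RowKey": counter_name, "Value": 0}
    entity["Value"] += delta
    # NOTE: production code should use the entity's ETag for optimistic
    # concurrency, or shard the counter across rows as in the episode above.
    table.upsert_entity(entity)

increment("requests_with_param1")
```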