Need Hardware Sizing guidelines for SSAS Tabular server - ssas-2012

Are there any good sources of hardware sizing guidelines for servers running SSAS Tabular? Something along the lines of "if your models are estimated to be this big (in terms of memory) and you're expecting an average x-number of simultaneous connections... here's what you should start out with".
Obviously they need more memory than most other systems to hold the data... but I'd also expect them to be able to take advantage of more cores and higher clock speeds. Once the data is in memory, it's really just a matter of ripping through it and crunching the numbers: more cores to handle more concurrency, higher clock speed for faster "crunching".

A common rule of thumb is to aim for about 40% of the size of the original database: 1 TB of source data ≈ 400 GB of RAM for SSAS Tabular.

FWIW, Microsoft just released a whitepaper (January 2013) that addresses this topic:
Hardware Sizing a Tabular Solution (SQL Server Analysis Services)

Related

Is there a way to count high-performance cores on heterogeneous multicore CPUs?

In a Heterogeneous MultiProcessing (HMP) model, the different cores of a CPU or SoC don't have the same performance profiles. While HMP systems were first deployed a while ago (the wiki mentions Samsung started using this model back in 2013), they are coming to a head with Apple's M-series (technically it was already an issue with HyperThreading, which could be considered a form of HMP, I guess).
Parallelized tools generally try to guess the number of workers they should create by counting the number of cores on the system; however, in an HMP model that can be counter-productive, especially for personal devices: while the "efficient" cores are technically available, they have low performance, so loading them will not yield much of a performance gain, and it can drastically impact the interactivity / pleasantness of interacting with the system. This is usually configurable (so the user can set something more reasonable), but it seems a better default to segregate such tools to "high-performance" cores only (at least assuming they are CPU-bound), leaving users the option to increase residency if they so choose.
And so is there a portable way to list and segregate "real" and "high-performance" cores from "virtual" and "efficient" cores?
Note: I've seen 68444429 "how can I distinguish between high- and low-performance cores in C++"; while the title is similar, the question and goals are rather different, as the goal here is to avoid generating unnecessary and inefficient work by default.
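There's no fully portable API for this as far as I know. On macOS, sysctl's hw.perflevel0.logicalcpu reports the performance-core count directly; on Linux, one heuristic is to bucket logical CPUs by their reported maximum frequency. A minimal sketch of that heuristic, assuming the cpufreq sysfs interface is present:

```cpp
// Heuristic sketch (Linux only): group logical CPUs by the maximum
// frequency reported by cpufreq. On big.LITTLE / P+E-core systems the
// performance cores usually report a higher cpuinfo_max_freq; on a
// homogeneous CPU every core lands in the same bucket.
#include <cctype>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::map<long, std::vector<std::string>> by_max_freq;
    for (const auto& entry :
         std::filesystem::directory_iterator("/sys/devices/system/cpu")) {
        const std::string name = entry.path().filename().string();
        // Only look at cpu0, cpu1, ... (skip cpufreq, cpuidle, etc.)
        if (name.size() < 4 || name.rfind("cpu", 0) != 0 ||
            !std::isdigit(static_cast<unsigned char>(name[3])))
            continue;
        std::ifstream f(entry.path() / "cpufreq/cpuinfo_max_freq");
        long khz = 0;
        if (f >> khz) by_max_freq[khz].push_back(name);
    }
    if (by_max_freq.empty()) {
        std::cerr << "cpufreq not available on this system\n";
        return 1;
    }
    // Guess: the highest-frequency bucket holds the "performance" cores.
    const auto& fastest = *by_max_freq.rbegin();
    std::cout << "assumed performance cores (" << fastest.first / 1000
              << " MHz): " << fastest.second.size() << "\n";
    for (const auto& cpu : fastest.second) std::cout << "  " << cpu << "\n";
}
```

On a homogeneous CPU every core ends up in the same bucket, so the heuristic degrades gracefully to "all cores"; it is only a guess, though, and says nothing about SMT siblings.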

Choices for shared memory system, MPI library, original RDMA or ULP over RDMA?

I am new to High Performance Computing (HPC), but I am going to have an HPC project, so I need some help solving some fundamental problems.
The application scenario is simple: several servers connected by an InfiniBand (IB) network, one master server and the rest slaves. Only the master reads/writes in-memory data (the size ranges from 1 KB to several hundred MB) to/from the slaves, while the slaves just passively store the data in their memory (and dump the in-memory data to disk at the right time). All computation is performed on the master, before writing or after reading the data to/from the slaves. The requirements of the system are low latency (for small regions of data, 1 KB-16 KB) and high throughput (for large regions of data, several hundred MB).
So, my questions are:
1. Which concrete approach is more suitable for us: MPI, a primitive IB/RDMA library, or ULPs over RDMA?
As far as I know, an existing Message Passing Interface (MPI) library, primitive IB/RDMA libraries such as libibverbs and librdmacm, and User-Level Protocols (ULPs) over RDMA might all be feasible choices, but I am not very sure of their applicable scopes. (A minimal sketch of the lowest-level option appears at the end of this question.)
2. Should I make some tunings for the OS or the IB network for better performance?
There is a paper [1] from Microsoft which notes:
"We improved performance by up to a factor of eight with careful tuning and changes to the operating system and the NIC drivers."
For my part, I will try to avoid such performance tuning as much as I can. However, if tuning is unavoidable, I will try my best. The IB network in our environment is Mellanox InfiniBand QDR 40 Gb/s, and I can choose the Linux OS for the servers freely.
Any ideas, comments, and answers are welcome!
Thanks in advance!
[1] FaRM: Fast Remote Memory
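For concreteness, here is a minimal sketch of the lowest-level option mentioned above: using libibverbs just to enumerate the RDMA devices it can see. Everything beyond enumeration (opening a device, protection domains, memory registration, queue pairs via librdmacm or by hand) is considerably more involved.

```cpp
// Minimal libibverbs sketch: list the RDMA-capable devices visible to the
// verbs library. Compile with: g++ list_devices.cpp -libverbs
#include <infiniband/verbs.h>
#include <cstdio>

int main() {
    int num = 0;
    ibv_device** list = ibv_get_device_list(&num);
    if (!list) {
        std::perror("ibv_get_device_list");
        return 1;
    }
    std::printf("found %d RDMA device(s)\n", num);
    for (int i = 0; i < num; ++i)
        std::printf("  %s\n", ibv_get_device_name(list[i]));
    ibv_free_device_list(list);
    return 0;
}
```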
If you use MPI, you will have the benefit of an interconnect-independent solution. It doesn't sound like this is going to be something you are going to keep around for 20 years, but software lasts longer than you ever think it will.
Using MPI also gives you the benefit of being able to debug on your (possibly oversubscribed) laptop or workstation before rolling it out onto the InfiniBand machines.
As to your second question about tuning the network, I am sure there is no end of tuning you can do, but until you have some real workloads and hard numbers, you're wasting your time. Get things working first, then worry about optimizing the network. Maybe you need to tune for many tiny packets. Perhaps you need to worry about a few large transfers. The tuning will be pretty different depending on the case.
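To get those hard numbers, a minimal MPI ping-pong is a reasonable starting point; sweep the message size from ~1 KB up to hundreds of MB to see where latency stops dominating and bandwidth takes over:

```cpp
// Minimal MPI ping-pong sketch: run with two ranks (mpirun -np 2) and
// vary MSG_SIZE to map out latency vs. bandwidth on your fabric.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int MSG_SIZE = 16 * 1024;   // bytes; sweep this in real runs
    const int ITERS = 1000;
    std::vector<char> buf(MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt_us = (t1 - t0) / ITERS * 1e6;
        // bytes per microsecond == (decimal) MB per second
        std::printf("msg %d B: round trip %.2f us, ~%.1f MB/s one-way\n",
                    MSG_SIZE, rtt_us, MSG_SIZE / (rtt_us / 2));
    }
    MPI_Finalize();
    return 0;
}
```

Run it once with both ranks on one node and once across the IB fabric, and compare; the difference tells you how much the network actually costs you before you tune anything.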

Something similar to RAPL for non-Sandy Bridge/Xeon processors

First post ever here.
I wanted to know if there is something similar to the Running Average Power Limit (RAPL) for other processors (e.g., an Intel i7) that aren't Sandy Bridge or Xeon processors, like the machine I'm working on in the lab.
For those who do not know, I pulled this description to bring you up to speed:
"The RAPL (Running Average Power Limit) interface provides platform software with the ability to monitor, control, and get notifications on SoC power consumption."
What I am looking for in particular is to acquire energy consumption measurements for a processor's individual cores after running some code like matrix multiplication or vector addition. Temperature would be excellent too, but that's another question for another day (lm-sensors is a bit puzzling to me).
Thanks and Take Care.
Late answer on this: there's PowerTOP on Linux, but that works for laptops only, as it needs the battery discharge rate. It can display watts per process, but don't ask me how accurate that is (personally I think there might be some problems with it). IIRC it counts the number of CPU wakeups from a CPU sleep state to estimate the energy consumption per process. Also, for AMD processors there's the fam15h_power driver in the lm-sensors software package. For rather new (2011 and newer) Bulldozer AMD CPUs you can get the energy consumption that way.
Note that RAPL does not provide energy consumption per core on a multicore CPU, but only for the whole package. You can get the energy consumption of the core and non-core (e.g., integrated graphics) parts separately, but per-core is not possible.
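On Linux kernels with the powercap framework, the package-level RAPL counter is exposed through sysfs, so you don't need raw MSR access. A minimal sketch, assuming the typical intel-rapl:0 package domain exists and is readable (which may require root):

```cpp
// Sketch: sample the package-level RAPL counter exposed by the Linux
// powercap framework before and after a workload. Note that energy_uj is
// a wrapping counter (see max_energy_range_uj in the same directory), and
// the reading covers the whole package, not a single core.
#include <cstdint>
#include <fstream>
#include <iostream>

static uint64_t read_energy_uj() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    uint64_t uj = 0;
    if (!(f >> uj)) std::cerr << "could not read RAPL counter\n";
    return uj;
}

int main() {
    uint64_t before = read_energy_uj();

    // ... run the workload to be measured here (e.g. your matrix multiply).
    // A dummy busy loop stands in for it in this sketch:
    volatile double sink = 0;
    for (long i = 0; i < 100000000; ++i) sink += i * 0.5;

    uint64_t after = read_energy_uj();
    std::cout << "package energy: " << (after - before) / 1e6 << " J\n";
}
```

The same directory usually contains intel-rapl:0:0-style subdomains for the core and non-core parts, which is as fine-grained as RAPL gets.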

Multicore processor core communication speeds

I would like to find the speed of communication between two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu Linux probably provides some way of seeing this sort of information? I would have thought the speed fluctuates... I just need some average value. I'm basically needing to write something up at the moment, and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is 32 nanoseconds. Network latency is measured in milliseconds, and 1 millisecond is 1,000,000 nanoseconds. "Good" network latency is considered to be around or under 100 ms, so given that, inter-core latency is faster by a factor on the order of a million.
Besides latency, there's also bandwidth to consider. Again based on the linked benchmark, inter-core bandwidth for that particular configuration is about 14 GB/sec, whereas according to this: http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html, a real-world test of a Gigabit Ethernet connection shows about 35.8 MB/sec. So the difference there is smaller: on the order of 400 times faster in terms of bandwidth, as opposed to a million times in latency. Depending on which is more important to your application, that might change your numbers.
Network speeds are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware like Dolphin or Myrinet (~$1k/port). Inter-core speeds are measured in nanoseconds, as the data is copied from one memory area to another and then some signal is sent from one CPU to another (with the data protected from simultaneous access by a mutex or a full-bodied queue).
So, using a back-of-the-napkin calculation, the ratio is about 1:10^6.
Inter-core communication is going to be massively faster. Why?
- The network layer imposes a massive overhead in terms of packets, addressing, handling contention, etc.
- The physical distances impose a sizeable impact.
Measuring inter-core communication speed would be very difficult, but given the above, I think it's a redundant calculation to make.
This is a non-trivial thing to find out. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, etc. In short, run a benchmark or you'll be guessing in the dark.
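If you still want a rough number for the write-up, a minimal core-to-core "ping-pong" sketch, two threads bouncing an atomic flag back and forth, gives an order-of-magnitude latency figure:

```cpp
// Rough core-to-core ping-pong sketch: two threads bounce a flag through a
// shared atomic; total time divided by the number of hops approximates the
// cache-line transfer latency between whichever cores the scheduler picks.
// Pin the threads (e.g. with pthread_setaffinity_np) for repeatable numbers.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    constexpr int ITERS = 1000000;
    std::atomic<int> flag{0};

    auto t0 = std::chrono::steady_clock::now();
    std::thread other([&] {
        for (int i = 0; i < ITERS; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
            flag.store(0, std::memory_order_release);             // pong back
        }
    });
    for (int i = 0; i < ITERS; ++i) {
        flag.store(1, std::memory_order_release);                 // ping
        while (flag.load(std::memory_order_acquire) != 0) {}      // wait for pong
    }
    other.join();
    auto t1 = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    // Each iteration is two one-way hops (ping + pong).
    std::cout << "avg one-way hop: " << ns / (2.0 * ITERS) << " ns\n";
}
```

Compare that number against a network round trip (e.g. ping between two hosts, divided by two) and you have the latency ratio for your write-up.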

Hardware for a highly concurrent multi-threaded application

I am looking for hardware which must run about 256 computationally intensive, real-time concurrent tasks 24 hours a day (one multi-threaded C application). Each task takes about 40-50 MFLOPS, so all tasks together require about 10 GFLOPS. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32-bit, with SMP).
I am looking for a single-mainboard solution with one multi-core CPU (if such a CPU exists). If such a CPU doesn't exist, then I need a multi-socket mainboard solution (with multiple CPUs).
Can you please recommend a professional CPU/mainboard solution which will satisfy such requirements? It is also very important that there are no issues with the Linux kernel (2.6.25). No virtualization, no need for huge RAM or CPU cache. I would also prefer Intel architecture and well-proven stability. I still have doubts that this is feasible at all.
Thank you in advance.
UPDATE:
I think I have found the right answer here and here.
The UltraSPARC T2 has 8 cores with 8 threads each, plus integrated high-bandwidth memory and I/O. The T5140 carries two of them, for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Update: how about PS3s, then? NASA uses them for their simulation engines.
Maybe use CPU+GPUs in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture, but these run Linux and have 64 cores on a single die:
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10 GFLOPS isn't exactly to be sneezed at, so in a single machine it'll be expensive. There's also the problem of what you do when the machine breaks: you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient, and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor: its theoretical peak performance is around 25 GFLOPS, and kernel 2.6.25 already had support for it.
You could try a pre-slim PlayStation 3 for experimenting (that would cost you little) or get yourself a server-based solution at around US$8K. You will have to re-write and fine-tune your threads to take advantage of the SPU co-processors, but you could achieve your computational needs without breaking a sweat with a single Cell (1 PPC core + 8 SPUs).
NB: with a PlayStation 3, you'd have only 6 available co-processors, but you don't seem to be on a budget with this project.
So you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
There are commercially available Cell products, both as stand-alone servers in blade form factor and as PCI Express add-on boards for PC workstations, from Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8,000 for these PCI Express cards.
A PlayStation 3 game console can be purchased for about US$300 and would allow you to prototype your application and check whether it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them, but even so, clocked at 3.2 GHz they performed better than standard, similarly priced PCs, even considering PS3s are priced 5x higher around here.)
