I need to maximize speed while converting videos using FFmpeg to h264
Any input format of source videos
User's machine can have any number of cores
Power and memory consumption are non-issues
Of course, there are a whole bunch of options that can be tweaked but this question is particularly about choosing the best -thread <count> option. I am trying to find an ideal thread count as a function of
no. of cores
input video format
h264-friendly values maybe?
anything else missed above?
I am aware the default -thread 0 follows one-thread-per-core approach which is supposed to be optimal. But I am not sure if this is time or space-optimized. Also, on certain testcases, I've seen more threads (say 4 threads on my dual core test machine) finishes quicker than the default.
Any other direction, say configure options w.r.t. threads, worth pursuing?
I have found that threads do not do a good job of utilizing all the cores, the hyper-threads do not get used at all. One solution I could come up with is to run a 3 to 4 ffmpeg processes in parallel, See: https://superuser.com/questions/538164/how-many-instances-of-ffmpeg-commands-can-i-run-in-parallel/547340#547340 This approach ends up using all the cores fully and is faster than the single input, multiple outputs in a single command option.
If your 'dual-core' has hyperthreading, then 2x cores would probably be correct. There's unlikely to be gain going beyond the number of virtual cores (inc. hyperthreading), but perhaps due to internal issues in FFmpeg it might be true.
I have experimented thoroughly with threads 0, 6, 12, 24 and it doesn't make a difference in frame rate, overall processing time or CPU utilization. Note my system has 12 physical cores too. Generally it seems to do a good job of using your processing power without specifying threads where my 12 cores are basically 98-99% utilized for the duration while watching top/system monitor.
I wish there was a magic bullet but for now there is no other way to speed things up as ffmpeg is currently optimized very well in my opinion. The only alternative is simply to get more computing power or to do distributed processing.
*Note all my tests were using ffmpeg version 3.3.1
Related
For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU # 1.80HGz x 4, 64bit). In an ideal world, I would like to have a benchmark satisfying the following:
The CPU should be an official benchmark rather than created by my own for transparency purposes.
The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
The benchmark should result in different levels of CPU throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100% - so I want a benchmark which will make the CPU utilization vary over time.
Especially points 2 and 3 are really key for my research. However, I couldn't find any suitable benchmarks so far. Benchmarks I found so far include: sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them just need a couple of seconds to complete and the utilization level on my laptop is at 100% constantly.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop. (e.g. from a signal handler). Specifically, alarm(2) may be a good choice. Record how many iterations you did in this burst of work.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. So the performance is defined by the single-thread power of the CPU.
A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores.
So I assume that the thread jumps around between these cores.
So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%?
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. So I'm wondering, does this not support hyper-threading or is the whole program that buggy?
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right?
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second.
If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter?
Thanks for reading so far, I hope you have a moment to help me understand the whole thing.
Atop example screenshots:
http://i.imgur.com/VFEBvLx.png
http://i.imgur.com/cBKOnJM.png
http://i.imgur.com/bgQfwfU.png
I believe a lot of your confusion has to do with the fuzzy mapping of the capabilities of cpufreq to the actual capabilities of the hardware.
Here’s a description of what is taking place on the HW and in the OS.
A processor is a collection of cores on the same silicon substrate. The cores are what we used to call CPUs with some enhancements. CPUs now have the capability of running multiple HW threads (hyperthreading), each hardware thread being equivalent to one of the old type CPUs. Putting this all together, the 1270v3 is a quad core (if I recall correctly), meaning it has 4 cores on the same silicon substrate. Each core can support two HW threads, each HW thread being equivalent to what the OS calls a CPU (and I’ll call a virtual CPU). So from the OS perspective, the 1270v3 has 8 (virtual) CPUs.
The OS doesn’t see packages, cores or HW threads. It sees CPUs, and to it there appear to be 8 of them.
To further complicate the issue, modern processors have various HW supporting power saving states called P-states, C-states and package C-states. Why do I mention these? It’s because they are intimately associated with the frequency of the processor. And cpufreq professes to provide the user with control over the processor’s frequency.
Now, I’m not familiar with cpufreq outside of reading the manpage and other material on the web. From my reading, it has a lot of idiosyncrasies, so I’ll talk about what it is doing from a broad perspective.
In a general sense, cpufreq has a lot of generic capability that may or may not be supported by the HW or the kernel. This is confusing because it looks like the functionality is there but then things don’t happen as you would expect. For example, cpufreq gives the impression that you can set each CPU’s frequency independently. In reality, on a hyperthreaded system, two “CPUs” are associated with each core and must have the same frequency.
A lot of the functionality that cpufreq is supposed to control is associated with the power efficiency characteristics of the processor, but again, its mapping to the processor’s actual hardware capabilities is incomplete and misleading. Though cpufreq seems to allow you to set max and min frequencies, the processor hardware doesn’t work this way. In modern Intel processors, such as the 1270v3, there are something called P-states. These P-states are frequency-voltage pairs that slow down a processor’s frequency (and drop its voltage) to reduce the processor’s power consumption at the cost of the application taking longer to run. These frequency-voltage pairings aren’t arbitrary though cpufreq gives the impression that they are.
What does this all mean? In addition to the thread migration issues that the commenter mentioned, cpufreq isn’t going to behave the way you expect because it appears to have capability that it actually doesn’t, and the capability that it does actually have maps only roughly to the actual capabilities of the HW and OS.
I embedded some further comments in your text.
With a 1270v3 and a single thread app I'm at the end of performance but when I watch monitoring tools like atop I don't understand how this whole stuff works. I tried to find a nice article about this sort of topic but they either have been explained in a language I don't understand or are not about the stuff I would like to know. I hope it is alright to ask this kind of stuff here.
From my understanding a single-thread app does only use one thread for all/most of the work. [Yes, but this doesn’t mean that the thread is locked to a specific virtual CPU or core.] So the performance is defined by the single-thread power of the CPU. [It’s not that simple. The OS migrates threads around, it has its own maintenance processes, etc] A moment before I wrote this question I played around with CPU-frequency and noticed that although there are only two instances of the app running the usage is shared across all cores. So I assume that the thread jumps around between these cores. So I set the CPU scaling to performance with cpufreq-set -g performance. The result was that all CPU cores/threads stayed at about 2GHz like it was before besides one that is permanently on 3.5GHz (100%). As I only changed the scaling for one core, why is the usage still shared across all cores? I mean the app is running at about 300%, why doesn't it stick to the CPU core with the 100%? [Since I can’t see what you are observing, I don’t really understand what you are asking.]
Furthermore as I noticed that only one of the CPU's got scaled up I looked into the help page and found -r which should scale all cores with the performance settings. Unfortunately nothing does change. (Is this a bug in Ubuntu 1404?) So I used -c with the number 8 (8 threads) -> didn't work. 4 -> works but only scales 2 cores out of 8. 7 -> scaled 4 cores. [I haven’t used cpufreq so can’t directly speak to its behavior, but the manpage implies that “-c ” refers to a specific virtual CPU and not the number of virtual CPUs.] So I'm wondering, does this not support hyper-threading or is the whole program that buggy? [Again, I’m not sure from your explanation what you are doing, but the n->n/2 behavior makes sense to me. You can change the frequency of a core but since each core has two hyperthreads/virtual CPUs, two of those virtual CPUs must scale together.]
However as I understand it, the CPU's with the max frequency together with the thread jump around in the monitoring tools as they display the average the usage, which than looks like shared. Did I figure this right? [Again, I’m not sure what you are observing. Both physically and in atop, the CPU designation should not change, meaning CPU001 will always refer to the same virtual CPU. The core with the max frequency shouldn’t physically jump around, though the user thread may. Something to note is that monitoring tools can be pretty heavy users of the CPU. This heavy usage can make figuring out your processor usage difficult if it causes threads to jump around to different virtual CPUs.]
Would forcing one cpu to 3.5GHz and forcing the app to this core improve performance or is all the stuff I'm wondering about only about avg calculation between the data they show each second. [I found a pretty good explanation of atop with a lot of helpful screen shots: http://www.unixmen.com/linux-basics-monitor-system-resources-processes-using-atop/] If so am I right that I should run best with cpufreq-set -c 7 -g performance if power consumption doesn't matter? [It all depends upon what other processes are running on your system. If your system is mostly idle except for your processes, then forcing a core to a certain frequency won’t make a difference. [I’m suspicious of what a “governor” does. The language appears to refer to power-efficiency/performance (“balanced”, “powersave”, “performance”, etc) but the details don’t match the capability of today’s hardware.]
Thanks for reading so far, I hope you have a moment to help me
I have noticed this a number of times while doing computational expensive tasks on my computer, anywhere from computing hashes, to rendering videos.
In this specific situation I was rendering a video using all 4 of my cores under Linux, and when I opened my system monitor once again I noticed it.
2 or more of my cores were under symmetrical usage, when one went up the other went down completely symmetrical and in sync.
I have no idea why this is the case and would love to know!
System monitor picture
I am looking for a hardware, which must run about 256 computationally intensive real-time concurrent tasks in 24 hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all tasks require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux Kernel (32 bit, with SMP).
I am looking for a one-mainboard solution with one multi-core CPU (if such CPU exist). If such CPU doesn't exist, then I need one mulit-socket mainboard solution (with multiple CPUs).
Can you please recommend me any professional CPU/Mainboard solution which will satisfy such requirements? It is also very important that there are no issues with Linux Kernel (2.6.25). No virtualization, no needs in huge RAM or CPU cache. I also would prefer Intel architecture and well-proved stability. I still have doubts that it is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Updated: How about PS3's then? The NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10GFlops isn't exactly to be sneezed at so in a single machine, it'll be expensive. There's also the problem what you do when the machine breaks, you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer intel, but if you need one chip, I will again suggest the cell processor -
its theoretical peak performance is arount 25GFlops - kernel 2.6.25 had support for it already.
You could try a pre-slim playstation 3 for experimenting with (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to re-write and fine tune your threads to take advabtage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single CELL (1 PPC core + 8 SPU's)
NB.: with a playstation 3, you'd have only 6 available co-processors - but you don't seen to be on a budget with this project -
So you could at least try IBM's cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
Thre are commercially available CELL products, both as stand-alone servers in blade form factory, and PCI Express add-on boards for PC workstations from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seens to be around the previoulsy mentioned U$8000.00 for these PCI Express cards.
A playstation 3 videogame can be purchased for about U$300.00 - and would allow you to prototype your application, and check if it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyst and have not, so far, used it for any calculations - I had also put together a Playstation-3 12 machinne cluster for Molecular simulations at the local University. The application they run did not take advantage of the multimedia SPU's, while I was in touch with then. But even so, clocked at 3.5GHz they performed better than standard ,s imlarly priced, PC's, even considering PS3's are priced 5x higher around here)
I have c# Console app, Monte Carlo simulation entirely CPU bound, execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores/threads).
It currently runs daily on:
AMD Opteron 275 # 2.21 GHz (4 core)
The app is multithread using 3 threads, the 4th thread is for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate as best I can how long the same work would take to run on a system configured with the following CPU's:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the cases, I will recode it use the available threads. I want to justify that we need a Server with 2 x x5570 CPUs over the cheaper x5540 (they support 2 cpus on a single motherboard). This should make available 8 cores, 16 threads (that's how the Nehalem chips work I believe) to the operating system. So for my app that's 15 threads to the Monte Carlo Simulation.
Any ideas how to do this? Is there a website I can go and see benchmark data for all 3 CPUS involved for a single threaded benchmark? I can then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.
Note the business are also dictating the workload for this app over the next 3 months will increase about 20 times and needs to complete in a 24 hour clock.
Any help much appreciated.
Have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 hopefully they can better explain their benchmarking so I can effectively get a score per core which would be much more helpful.
have you considered recreating the algorithm in cuda? It uses current day GPU's to increase calculations like these 10-100 fold. This way you just need to buy a fat videocard
Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.
tomshardware.com contains a comprehensive list of CPU benchmarks. However... you can't just divide them, you need to find as close to an apples to apples comparison as you can get and you won't quite get it because the mix of instructions on your workload may or may not depend.
I would guess please don't take this as official, you need to have real data for this that you're probably in the 1.5x - 1.75x single threaded speedup if work is cpu bound and not highly vectorized.
You also need to take into account that you are:
1) using C# and the CLR, unless you've taken steps to prevent it GC may kick in and serialize you.
2) the nehalems have hyperthreads so you won't be seeing perfect 16x speedup, more likely you'll see 8x to 12x speedup depending on how optimized your code is. Be optimistic here though (just don't expect 16x).
3) I don't know how much contention you have, getting good scaling on 3 threads != good scaling on 16 threads, there may be dragons here (and usually is).
I would envelope calc this as:
15 hours * 3 threads / 1.5 x = 30 hours of single threaded work time on a nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
implies a parallel run time if there is truly a 20x increase:
2.5 hours * 20 = 50 hours (best case)
3.74 hours * 20 = 75 hours (worst case)
How much have you profiled, can you squeeze 2x out of app? 1 server may be enough, but likely won't be.
And for gosh sakes try out the task parallel library in .Net 4.0 or the .Net 3.5 CTP it's supposed to help with this sort of thing.
-Rick
I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2GHz Opteron cores to achieve your goal (despite increasing processing time from 15 hours to 20 hours per day), assuming a completely linear scaling of performance. Even if the Nehalem CPUs are 3X faster per-thread you will still be at the outside edge of your performance envelop - with no room to grow. That also assumes that hyper-threading will even work for your application.
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm
It'd be swinging big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be weary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo Simulations, but ramping up one resource might reveal some other bottleneck.