Today, I chrooted into my system from sysresCD and discovered that eix-update (a Gentoo Portage indexing program) ran almost 3x faster under the 3.2.x Linux kernel than under the 3.6.x kernel.
After poking around, I finally discovered why. Check out this CPU usage pattern...
Basically, this is a single-threaded program (it tops out at 100% CPU in total). Because it is probably doing tons of I/O on top of consuming one full thread, it must give up the CPU often, and the scheduler spreads its execution across all the cores & hyperthreads.
# time eix-update
[...]
eix-update 499.42s user 225.91s system 92% cpu 13:03.09 total
This behavior is a rather severe performance killer because the CPU cores continuously switch between sleep states, with a latency penalty each time, the cache lines go cold, etc. [edit] The most important latency factor is CPU frequency scaling under the ondemand governor, see below. [/edit]
Let's try the same operation forcing the process on one hyperthread.
Let's see the result...
# time taskset 1 eix-update
[...]
taskset 1 eix-update 198.51s user 38.26s system 73% cpu 5:21.76 total
The 1 is in fact a bitmask saying "allowed to run on the first CPU only" (bit 0 of the mask, which the kernel numbers as CPU 0).
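To make the bitmask idea concrete, here is a small sketch that decodes a mask into the CPU indices it selects. The function name `mask_to_cpus` is made up for illustration and is not part of taskset; it assumes the mask is given as bare hexadecimal digits without a `0x` prefix, which is how taskset reads a mask like the `1` used above.

```shell
# Hypothetical helper (not part of taskset): print the CPU indices
# selected by a hexadecimal affinity mask. Bit 0 is CPU 0, bit 1 is CPU 1, ...
mask_to_cpus() {
    mask=$(( 0x$1 ))   # assumes bare hex digits, e.g. "1", "3", "f"
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out,}$cpu"   # append index, comma-separated
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus 1    # prints 0        (first CPU only)
mask_to_cpus 3    # prints 0,1      (first two CPUs)
mask_to_cpus f    # prints 0,1,2,3  (first four CPUs)
```

If you prefer to name CPUs directly instead of computing a mask, `taskset -c 0 eix-update` should be equivalent to `taskset 1 eix-update`.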
So why was the 3.2.x kernel faster? I suspect it is because its CPU idle driver was not "as good" as the one in 3.6.x, so the cores did not sleep as much as they do on the 3.6.x kernel.
[edit] In the comments I show that setting the CPU frequency governor to "performance" improves things as much as forcing the affinity. But it basically means forgetting about energy efficiency.
Tweaking the ondemand governor to be less trigger-happy on the frequency is also possible:
# echo -n 24 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
# time eix-update
[...]
eix-update 221.58s user 56.28s system 76% cpu 6:02.94 total
This up_threshold tells the governor to raise the frequency if the load is more than 24% (here I tried a value slightly below 100%/4).
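The "slightly less than 100%/4" rule of thumb can be written down as a tiny calculation. This is only a sketch of my reading of that rule: a single-threaded process bounced across N hardware threads shows up as roughly 100/N percent load on each, so the threshold has to sit just below that. The function name `suggest_up_threshold` is hypothetical, and the floor of 11 is the minimum value mentioned in this post. It only prints a number and does not touch sysfs.

```shell
# Hypothetical helper: suggest an ondemand up_threshold for a
# single-threaded job spread over N hardware threads.
suggest_up_threshold() {
    ncpu=$1
    t=$(( 100 / ncpu - 1 ))      # just under an even 100/N share
    if [ "$t" -lt 11 ]; then     # 11 is the minimum used in this post
        t=11
    fi
    echo "$t"
}

suggest_up_threshold 4    # prints 24, the value used above
```

On a live system you could feed it the thread count with `suggest_up_threshold "$(nproc)"` and write the result to up_threshold yourself.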
It is way better, see the powertop screenshot for the different limits the CPU hits:
Trying to push it a little bit (17%)
# echo -n 17 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
# time eix-update
eix-update 212.72s user 49.50s system 76% cpu 5:42.72 total
Putting it at its minimum value (11%)
# echo -n 11 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
# time eix-update
eix-update 204.94s user 44.77s system 75% cpu 5:28.59 total
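To put the wall-clock times measured so far side by side, a short sketch like the following converts the `M:SS.ss` totals reported by `time` into seconds and prints each run's speedup over the untuned 13:03.09 baseline. The helper name `to_seconds` and the run labels are made up for illustration; the timings are the ones quoted in this post. It relies on awk for the floating-point arithmetic.

```shell
# Hypothetical helper: convert an "M:SS.ss" wall-clock time to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

baseline=$(to_seconds 13:03.09)   # ondemand governor, no affinity

# label|wall-clock time, as measured above
for run in "taskset 1|5:21.76" \
           "up_threshold=24|6:02.94" \
           "up_threshold=17|5:42.72" \
           "up_threshold=11|5:28.59"; do
    label=${run%%|*}
    secs=$(to_seconds "${run##*|}")
    awk -v l="$label" -v b="$baseline" -v s="$secs" \
        'BEGIN { printf "%-16s %8.2fs  %.2fx faster\n", l, s, b / s }'
done
```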
CPU scheduling mixed with power efficiency is clearly a complex problem, and a one-size-fits-all solution is certainly not possible. Nevertheless, it is interesting to see where we can tweak our system for specific workloads. [/edit]
It's not just sleep states: most CPUs also run at different speeds. When idle they slow down, and under heavy load they gradually speed up. Jumping between cores means they all slow down, and each one only speeds up slightly while the process is on it. You can also run the test with frequency scaling turned off (performance mode at top frequency) to get an idea of how much of the time that accounts for.
This is *very* true... If I switch the CPU scaling governor from ondemand to performance, I get this:
# echo "performance" >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo "performance" >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
# echo "performance" >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
# echo "performance" >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
# time eix-update
[...]
eix-update 195.02s user 38.59s system 75% cpu 5:11.30 total
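The four echo commands above can also be written as a loop over every CPU, which saves typing on machines with more hardware threads. This is a sketch: the function name `set_governor` is made up, and its optional second argument (the directory to search) is only there so the loop can be tried on a scratch directory before pointing it at the real sysfs tree as root.

```shell
# Hypothetical helper: write one governor name into every CPU's
# scaling_governor file. The second argument defaults to the real sysfs path.
set_governor() {
    governor=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue   # glob matched nothing, or no cpufreq here
        echo "$governor" > "$f"
    done
}

# Dry run against a scratch tree that mimics the sysfs layout
demo=$(mktemp -d)
for n in 0 1 2 3; do
    mkdir -p "$demo/cpu$n/cpufreq"
    echo ondemand > "$demo/cpu$n/cpufreq/scaling_governor"
done
set_governor performance "$demo"
cat "$demo"/cpu*/cpufreq/scaling_governor
```

On the real system, `set_governor performance` as root would do the same as the four commands, and `set_governor ondemand` would switch back afterwards.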
Now (for the science ;)) with the CPU affinity on top of it :
taskset 1 eix-update 193.87s user 36.06s system 71% cpu 5:19.80 total
So indeed, CPU frequency scaling accounts for most of it.
See my previous comment about switching the CPU scaling governor to performance.