As of 2016-02-26, there will be no more posts for this blog. s/blog/pba/
Showing posts with label performance.

This morning, I noticed the system temperature was oddly low, only just above 40°C. I didn't mind at first, but then the system seemed to run slower than before, which I thought might just be my imagination until I checked the frequencies in /proc/cpuinfo.

I had not looked at that file for years, and I saw the frequencies were fixed at 1833 MHz and 1000 MHz for the two cores, respectively. So, I tried the ultimate fix, turning it off and on again; it didn't work. I began to wonder if anything had been updated recently, but it was not the kernel nor any system/hardware stuff that I could remember.

At this point, I laughed, because I knew there must be a wrong setting that I had not noticed for years. So I went back to the power management section of the kernel configuration and found that I might have been using the wrong governor since 2012-08-27, as kernel 3.4 recommended the ondemand governor according to the ArchWiki.
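Both the reported frequencies and the active governor can be checked from the shell; the sysfs paths below are the standard cpufreq locations, assuming cpufreq support is compiled into the kernel:

```shell
# Per-core frequencies as the kernel reports them; in my case the two cores
# were stuck at 1833 MHz and 1000 MHz.
grep MHz /proc/cpuinfo || echo 'no MHz field in /proc/cpuinfo here'

# Active governor per core, and the governors available; these sysfs files
# only exist when the kernel was built with cpufreq support.
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2> /dev/null || true
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors 2> /dev/null || true
```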

That was 3.5 years ago.

First of all, this isn't meant to be accurate or very reliable; it may even be flawed. I just wanted to see some numbers, because pymux's README mentions performance but gives no actual numbers:

Tmux is written in C, which is obviously faster than Python. This is noticeable when applications generate a lot of output. Where tmux is able to give fast real-time output for, for instance find / or yes, pymux will process the output slightly slower, and in this case render the output only a few times per second to the terminal. Usually, this should not be an issue. If it is, Pypy should provide a significant speedup.


I used my own test script, which is written in Bash. I thought about using find or yes as mentioned in the README, but I was too lazy to write a script for those tests, so I used what I already had in hand.

Since pymux is written in Python, I tested with two implementations: the official CPython and PyPy (RPython translated to C). Both run within environments created by virtualenv, using pip to install pymux 0.5 and pyte 0.4.10 along with their dependencies.

The test script was run with reset && ./ in urxvtc with the font xft:Envy Code R:style=Regular:size=20:antialias, in dwm at a virtually full-screen 1680x1050; the dwm top bar was hidden, the window size was 1669x1027, and the geometry was 111x33.

Both tmux and pymux are run without configuration files.

Six months ago, I wrote Performance in shell and with C about converting a Bash project into C. During that, I got some numbers which showed me the relative performance of shell builtins and C. Almost two years ago, I wrote about the sleep command, also talking about the cost of invoking external commands.

Last night, I suddenly realized that I could have given a very simple example using true, timing the loops with Bash's time builtin:

test                                     time
for i in {1..10000}; do      true; done  00.049s
for i in {1..10000}; do /bin/true; done  11.556s
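These loops can be reproduced directly in Bash; setting TIMEFORMAT trims time's report to the real time only. The timings will of course vary by machine:

```shell
# Compare the builtin true against the external /bin/true; TIMEFORMAT
# limits the time builtin's report to wall-clock seconds.
TIMEFORMAT='%Rs'
time for i in {1..10000}; do true; done
time for i in {1..10000}; do /bin/true; done
```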

The builtin true is about 23,484% faster than the external /bin/true. That percentage was itself computed with the following commands, each looped for 1000 iterations, as another example:

test                               time
e      '(11.556-0.049)/0.049*100'  0.028s
bc <<< '(11.556-0.049)/0.049*100'  2.560s


bc actually returns 23400 without setting scale, because the division gets truncated before the multiplication by 100.

The e above is a shell builtin from the e.bash project, which I forked off e, a tiny expression evaluator. After I learned about the cost, I stopped using bc just for simple floating-point calculations; with e, I can do floating-point calculations in Bash scripts again.
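Incidentally, this particular percentage doesn't even need floating point: scale the seconds to milliseconds and Bash's builtin integer arithmetic gets the same figure, and multiplying before dividing avoids the truncation that makes bc answer 23400 without scale.

```shell
# The same calculation using only Bash's integer arithmetic, with the
# timings scaled from seconds to milliseconds.
builtin_ms=49      # 0.049s
external_ms=11556  # 11.556s
echo $(( (external_ms - builtin_ms) * 100 / builtin_ms ))  # prints 23483
```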

These numbers should make the cost of using external commands very clear. If you have a long loop with lots of external command invocations, then you should really consider rewriting it in another programming language.

This June, I started working on transitioning into a C project, as well as using GNU Autotools for the build. When I decided to do it, it didn't come to my mind that I would have the following statistics to look at:

name       rate    type
Bash implementation
print_td   1856    Bash function calls
            159    Bash script executions
C implementation
td           52    C executions
Bash with C loadable extension
td        21631    Bash loadable executions
Python bindings
             35    Python 2 script executions
             12    Python 3 script executions

As you can see, the C loadable extension for Bash undoubtedly beats everyone else, with 10+ times better performance. Because of this, I decided to bring back vimps1, which was quietly disabled last August when I couldn't get it to work with the latest vcprompt code at the time, which was linked into vimps1.

At the time, I didn't really have a way to measure the performance, but I knew it ran faster. How would I know? Well, that's simple: just hold down Enter and watch the rate at which the prompt shows up. You can feel the difference between the loadable extension and a pure Bash PS1.
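A crude way to put a number on that feeling, assuming the rates in these tables mean executions per second, is to count how many times a prompt command runs in one second of wall time. The my_ps1 below is a hypothetical stub, not the actual vimps1 or bash_ps1 code:

```shell
# Count how many times a stand-in prompt function can run in roughly one
# second of wall time; my_ps1 is a placeholder, not the real prompt code.
my_ps1() { printf '%s' "\u@\h \w \$ "; }

count=0
end=$(( SECONDS + 1 ))
while (( SECONDS < end )); do
  my_ps1 > /dev/null
  count=$(( count + 1 ))
done
echo "rate: roughly $count calls per second"
```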

Because of this, I have now brought back vimps1; although it still has a problem with Git repositories, you can see the benefits:

name       rate
vimps1     1355
bash_ps1    141

It's effectively more than 10 times, because bash_ps1 is a seriously stripped-down version of my normal Bash PS1.

Not many people are aware of how much their shell or prompt wastes, and probably nobody else cares. However, I am beginning to think that maybe I need a shell that has performance on its feature list. Bash is not a bad shell, but from time to time I do wish it could run faster. It might be wrong to think so, since I haven't really seen anyone talk about shell performance.

After I finished's C transition, I thought about making an extension that would do arithmetic with floating-point numbers, but that's just the wrong thing to do: if you want Bash to handle floats, you either need to patch Bash or write the whole thing in another programming language. I have seen a lot of people using bc for this; frankly, they truly are doing it all wrong.


In September 2014, I did make such a project, e.bash, by forking e, a tiny expression evaluator. It is a Bash builtin, and because I can, I did it.

They don't understand how costly an invocation is; just look at the first table, 52 vs 21631 runs. That's the price you have to pay. If you didn't know, here is a tip for you: don't use an external command in a Bash script unless it's something Bash can't do.
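One everyday instance of that tip: basename and dirname are external commands, but Bash's parameter expansion does the same job without a single fork:

```shell
path=/usr/local/bin/bash

# External commands: one process spawned per call.
basename "$path"    # bash
dirname "$path"     # /usr/local/bin

# Builtin parameter expansion: same results, no processes spawned.
echo "${path##*/}"  # bash
echo "${path%/*}"   # /usr/local/bin
```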


In February 2015, I realized there is a very simple example using true to see the performance and the costs.

And if most of the code relies on external commands, maybe you should consider writing it in a language that can do all the tasks itself. Nevertheless, if it's a "because I can" project, then be my guest.

I wanted to see the performance of simplejson with and without its C extension. I downloaded the latest version, 2.1.3, and made three builds using Python 2.6.6:

  1. rm -rf build ; python build && rm build/lib*/simplejson/
  2. rm -rf build ; python build
  3. rm -rf build ; CFLAGS="-march=core2 -O2 -pipe -fomit-frame-pointer" python build

The first one removes the compiled C extension, the second one is a normal build, and the third one uses the CFLAGS I currently use with emerge. The following lines show how the C extension gets compiled:

Without customized CFLAGS
x86_64-pc-linux-gnu-gcc -pthread -fPIC -I/usr/include/python2.6 -c simplejson/_speedups.c -o build/temp.linux-x86_64-2.6/simplejson/_speedups.o
x86_64-pc-linux-gnu-gcc -pthread -shared build/temp.linux-x86_64-2.6/simplejson/_speedups.o -L/usr/lib64 -lpython2.6 -o build/lib.linux-x86_64-2.6/simplejson/

With customized CFLAGS
x86_64-pc-linux-gnu-gcc -pthread -march=core2 -O2 -pipe -fomit-frame-pointer -fPIC -I/usr/include/python2.6 -c simplejson/_speedups.c -o build/temp.linux-x86_64-2.6/simplejson/_speedups.o
x86_64-pc-linux-gnu-gcc -pthread -shared -march=core2 -O2 -pipe -fomit-frame-pointer build/temp.linux-x86_64-2.6/simplejson/_speedups.o -L/usr/lib64 -lpython2.6 -o build/lib.linux-x86_64-2.6/

I used the following code to do the test:

import timeit

# Time 100 runs of json.loads() on the contents of test.json; the setup
# statement reads the file once so only the parsing itself is measured.
t = timeit.Timer('json.loads(json_str)',
                 'import simplejson as json; json_str = open("test.json", "r").read()')
print t.timeit(100)

The test.json file is over 3 MB, with 500 entries.

The results are:

test                        elapsed time for 100 loads()
Without C extension         126.230s
json 1.9 in Python 2.6.6     60.616s
With C extension              9.945s
With C extension (CFLAGS)     7.555s

With the C extension, it is at least 10 times faster. I also put simplejson 1.9, the version bundled with Python 2.6.6 as the json module, in the results. Without the C extension, going from 1.9 to 2.1.3 it became twice as slow. I didn't download simplejson 1.9 to double-check, but I don't think it was modified for being shipped with Python.

I was just curious how fast a terminal window (in X) can redraw, so I wrote a script to test it. It prints characters to fill up the whole window by default, then resets the cursor to home using an ANSI escape code and prints again, repeating 100 times by default. 100 divided by the elapsed time is the FPS.
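The script itself isn't reproduced in the post; a minimal sketch of the idea, with all the details (defaults, frame character) being my assumptions, could look like this:

```shell
# Minimal sketch of a terminal redraw benchmark: fill the screen, jump the
# cursor home with ESC[H, repeat, then derive frames per second. All the
# defaults here are assumptions; the original script is not shown.
frames=${1:-100}
cols=${2:-80}
rows=${3:-25}

# Pre-build one full frame so the loop only measures output speed.
line=$(printf '%*s' "$cols" '' | tr ' ' '#')
frame=
for (( r = 0; r < rows; r++ )); do
  frame+="$line"$'\n'
done

start=$SECONDS
for (( i = 0; i < frames; i++ )); do
  printf '\e[H%s' "$frame"
done
elapsed=$(( SECONDS - start ))

# Integer FPS only, since Bash has no floats; guard against sub-second runs.
echo
echo "Frames per second: $(( frames / (elapsed > 0 ? elapsed : 1) ))"
```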

I ran several tests using ./ 1000 80 25. I used 80 by 25 because that's my VT's terminal size, and I maximized the window before running each test. Here are the results for 80x25 and 1000 frames, sorted by fps:

terminal           elapsed time (s)  fps
urxvtc [1]          8.882            112.580
urxvtc              9.126            109.574
urxvt               9.140            109.401
urxvtc + tmux      10.546             94.813
urxvtc + tmux [2]  10.568             94.616
urxvtc + tmux [3]  11.214             89.173
xterm              16.487             60.653
lxterminal         39.211             25.502
vt1                54.984             18.187
[1] no .Xdefaults
[2] with -2, in the right pane of two
[3] with -2, in the left pane of two

The slowest one is vt1; I didn't test the framebuffer. urxvt is my terminal, but I also have xterm installed, and I installed lxterminal for a VTE-based terminal test. My normal urxvt uses Rxvt.font: xft:Envy Code R:style=Regular:size=9:antialias=false. I ran one more test on urxvtc invoked without .Xdefaults, so I could test without the changes I have made. Since I use tmux, I also tested tmux invoked with and without 256 colors, running in a pane.

This script can't measure the real FPS, since the Bash script itself takes some time to process, but the results shouldn't be much lower than reality, and they do show significant differences between terminals; otherwise, the FPS numbers would all be capped around the same value. As you can see, urxvt runs fastest, then xterm, then lxterminal. There are some configuration differences, such as the fonts, but the numbers look quite conclusive to me.

For the maximized urxvt+tmux terminal window I normally use on one screen, with video playing on another screen, here is the result:

For 239x65 100 frames, elapsed time: 18.567 seconds
Frames per second: 5.385