2011-10-26

Python optimizations exercise.

From there, I wanted to post all my findings cleanly. I include all my sources so you can reproduce the results. Basically, Pavel wanted to optimize a simple use case, and here is my shot at it. First comes test.py, which holds the html7 implementation, a slightly improved one (htmlGB) where I don't force python to go back and forth between bytes and unicode, and the benchmark timer:
import timeit


WHITELIST2 = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
WHITELIST2_UNI = set(u'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
r = u'234782384723!#$#@!%$@#%$@#%#$%%%%%%%%%%%@#$%342583492058934028590342853490285902344#$%%*%****%7jkb6h546777776ynkk4b56byhbh5j'*500

def html7():
    s = r
    lstr = str; lord = ord; lWHITELIST2 = WHITELIST2
    return "".join([
        c if c in lWHITELIST2
        else "&#" + lstr(lord(c)) + ";"
        for c in s])

def htmlGB():
    s = r
    lstr = unicode; lord = ord; lWHITELIST2 = WHITELIST2_UNI
    return u''.join([
        c if c in lWHITELIST2
        else u'&#' + lstr(lord(c)) + u';'
        for c in s])
assert(html7() == htmlGB())

t = timeit.Timer(stmt = html7)
print 'html7:' + str(t.timeit(number=100))

t = timeit.Timer(stmt = htmlGB)
print 'htmlGB:' + str(t.timeit(number=100))
Here are the timings I get under python and pypy 1.5:
⚫ python test.py 
html7:2.48782491684
htmlGB:2.14983606339

⚫ pypy-c1.5 test.py
html7:2.44286298752
htmlGB:1.84629392624
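For readers on Python 3, where str is already unicode and the bytes/unicode round trip disappears, the same benchmark could be sketched roughly as below. This is my own port, not one of the measured implementations: the names html_escape, WHITELIST and SAMPLE are made up for the example, and the timings above do not apply to it.

```python
import string
import timeit

# ASCII letters and digits pass through unchanged; everything else is escaped.
WHITELIST = frozenset(string.ascii_letters + string.digits)

# Same sample text as in test.py.
SAMPLE = ('234782384723!#$#@!%$@#%$@#%#$%%%%%%%%%%%@#$%342583492058934028590'
          '342853490285902344#$%%*%****%7jkb6h546777776ynkk4b56byhbh5j') * 500


def html_escape(s, whitelist=WHITELIST):
    """Replace every non-whitelisted character with a numeric HTML entity."""
    return ''.join([c if c in whitelist else '&#%d;' % ord(c) for c in s])


if __name__ == '__main__':
    t = timeit.Timer(lambda: html_escape(SAMPLE))
    print('html_escape: %s' % t.timeit(number=100))
```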
Note: the other implementations, with caching and statistical tryouts, are too convoluted and lean too far toward speed in the speed-versus-memory tradeoff for this simple task, IMHO. Then I wanted to go further, so I made a simple port to cython and then a "bare metal" version in cython, and I have been really impressed by it! To compile with cython you need a simple makefile-like setup.py file:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("encode", ["encode.pyx"])]

setup(
  name = 'Encoding Test',
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)
Then here are my implementations : cython is the direct port, cython_alt is the C-like bare metal one. In the file encode.pyx put:
WHITELIST = set(u'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')

from libc.stdlib cimport malloc, free
from libc.stdio cimport sprintf

def cython(unicode s):
    return u''.join([
    c if c in WHITELIST
    else u'&#' + unicode(< long > c) + u';'
    for c in s])

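def cython_alt(unicode s):
    # Worst case per character is '&#1114111;' (10 bytes); the extra byte
    # holds the NUL terminator that sprintf writes after its output.
    cdef char * buffer = < char * > malloc(len(s) * 10 + 1)
    cdef int i = 0
    cdef Py_UCS4 c

    if buffer == NULL:
        raise MemoryError()
    try:
        for c in s:
            if (c >= u'a' and c <= u'z') or (c >= u'A' and c <= u'Z') or (c >= u'0' and c <= u'9'):
                buffer[i] = < char > c
                i += 1
            else:
                # sprintf returns the number of bytes written (NUL excluded)
                i += sprintf(buffer + i, '&#%d;', < int > c)
        # Slice by length: the buffer is not NUL-terminated when the input
        # ends with a whitelisted character.
        result = buffer[:i]
    finally:
        free(buffer)
    return result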
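A quick way to see why cython_alt budgets 10 bytes per input character: the longest possible escape is the one for the highest Unicode code point. The check below is only an illustration in plain Python 3 (entity_len is a name I made up). Keep in mind that sprintf also writes a terminating NUL byte after each escape, which this count does not include.

```python
import sys


def entity_len(code_point):
    """Length of the '&#NNN;' escape produced for a given code point."""
    return len('&#%d;' % code_point)


# sys.maxunicode is 0x10FFFF (1114111) on Python 3.
worst = entity_len(sys.maxunicode)
print('longest escape: %d bytes' % worst)
```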
And the same test runner, test_cython.py:
import timeit
from encode import cython
from encode import cython_alt


r = u'234782384723!#$#@!%$@#%$@#%#$%%%%%%%%%%%@#$%342583492058934028590342853490285902344#$%%*%****%7jkb6h546777776ynkk4b56byhbh5j'*500

def cython2():
    cython(r)

def cython_alt2():
    cython_alt(r)

assert(cython(r) == cython_alt(r))

t = timeit.Timer(stmt = cython2)
print 'cython:' + str(t.timeit(number=100))

t = timeit.Timer(stmt = cython_alt2)
print 'cython_alt:' + str(t.timeit(number=100))
To compile it just do :
python setup.py build_ext --inplace
The results :
⚫ python test_cython.py 
cython:1.70311307907
cython_alt:0.348756790161
Booya ! :)

2011-07-07

How to apply User patches in portage

After years of tinkering with gentoo and feeding my custom overlay, today I discovered the "user patches" feature of portage.

This is really awesome !

The Use Case :
You want to just try out a patch for something already in portage.

Instead of creating another ebuild in an overlay you can put your patch in :
/etc/portage/patches/{category}/{package_name}/

For example :
mkdir -p /etc/portage/patches/sys-libs/glibc-2.13-r3/
wget -O /etc/portage/patches/sys-libs/glibc-2.13-r3/fix1.patch "https://bugs.gentoo.org/attachment.cgi?id=279269"
emerge -1 glibc


To check that your patch was applied, look for the following just after the standard batch of patches:
 * Applying user patches from /etc/portage/patches/sys-libs/glibc-2.13-r3 ...
 *   fix1.patch ...                                                                                 [ ok ]
 * Done with patching
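
If you want to check what is sitting in a patch directory before running emerge, a small helper along these lines could do it. This is only a sketch: user_patch_dir and list_user_patches are hypothetical names of mine, not part of portage, and I assume the patches end in .patch as in the example above.

```python
import os

PATCH_ROOT = '/etc/portage/patches'


def user_patch_dir(category, package, root=PATCH_ROOT):
    """Build the directory path, e.g. for ('sys-libs', 'glibc-2.13-r3')."""
    return os.path.join(root, category, package)


def list_user_patches(category, package, root=PATCH_ROOT):
    """Return the sorted *.patch file names found in that directory."""
    directory = user_patch_dir(category, package, root)
    if not os.path.isdir(directory):
        return []
    return sorted(f for f in os.listdir(directory) if f.endswith('.patch'))


if __name__ == '__main__':
    print(user_patch_dir('sys-libs', 'glibc-2.13-r3'))
    print(list_user_patches('sys-libs', 'glibc-2.13-r3'))
```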

2011-06-30

5 reasons why you should use open source tools for computer forensics investigations

In some ways computer forensics is a scientific process: you need to prove facts with a reproducible process.
Scientists use open source software all the time, and their good reasons for doing so also apply to you as a computer forensics investigator.


1. The tool itself is publicly available

If another expert is called in, he can always download the tools you used and reproduce your findings without having to buy an often very expensive license for a third-party tool he may only use once.


2. The code is open.

If there is any doubt about the tool itself, you can show the source code and tell exactly what the software has done. With closed source tools, you have to rely on documentation and explanations from a third party.


3. You (often) can go back in time.

Old versions are often available in public source repositories.
Legal processes are really long. They can easily span several versions of any piece of software you used during your investigations. What if the publisher no longer distributes the old version?


4. Steps to reproduce your findings are easily documented.

Open source tools are often backed by command line tools. So in your report, you can just copy and paste all the commands you used and all their outputs.
Any other expert can pop in, redo the same thing and check that the output is exactly the same.


5. They are customizable and flexible

One of the mantras of linux, for example, is that one tool does one simple thing but does it right, so you can combine tools to fulfill your need.

The golden hammer never exists, and sometimes you miss just one little thing/feature to be able to accomplish a specific task. Open source rocks in this case: you have the source code, so you can hack it to accomplish your task.

And ... why not share your hack back with the community ? :)
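
To illustrate point 4, here is a rough sketch of how a command and its output could be captured for a report, with a hash so another expert can compare results. It is just an illustration of the idea under my own assumptions: run_and_record is a made-up helper, and the command in the usage part is a placeholder, not a forensic tool.

```python
import hashlib
import shlex
import subprocess
import sys


def run_and_record(argv):
    """Run a command and return what a report would need to reproduce it."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT)
    output = proc.stdout
    return {
        'command': ' '.join(shlex.quote(a) for a in argv),
        'returncode': proc.returncode,
        # Hash of the raw output bytes, so two runs are easy to compare.
        'sha256': hashlib.sha256(output).hexdigest(),
        'output': output.decode('utf-8', errors='replace'),
    }


if __name__ == '__main__':
    # Placeholder command: replace with the tool you actually ran.
    record = run_and_record([sys.executable, '-c', 'print("hello")'])
    print('$ ' + record['command'])
    print(record['output'], end='')
    print('sha256: ' + record['sha256'])
```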

2011-06-05

poor man's backup and defragmentation process under linux for my desktops

No need to defragment linux ever ?! Try that and see for yourself.

As the title of this blog says, I'm using gentoo. I just love this system. But compiling often has a drawback: it spreads files around your filesystem all the time.

But there is no defragmenter yet under linux. So here is what I do to backup AND defrag my boxes.

First burn or, even better, usb install the awesome SystemRescueCd.

Boot into SystemRescueCd.

From there, mount back your main partition (here sda1) and do some cleanup prior to backup:
mount /dev/sda1 /mnt/gentoo
rm -Rf /mnt/gentoo/tmp/*
rm -Rf /mnt/gentoo/var/log/* # warning : only if you don't care about your logs !
rm -Rf /mnt/gentoo/usr/portage/distfiles/* # you can always redownload them 
# etc ...

Important : umount the partition after cleanup
cd /
umount /mnt/gentoo

Now mount an external HD or your backup volume (here sdb1) :
Note : don't do that on FAT !! It must be something we can actually call a "filesystem", like ext4, xfs etc., and of course it needs at least as much free space as the size of the partition you want to back up.

mount /dev/sdb1 /mnt/backup

Dump an image of your main partition on it:
pv /dev/sda1 > /mnt/backup/sda1.img

You should see on the right side how many coffees you can drink during the dump.

Mount the backup image read-only and quickly check that the backup was successful :
mount -o loop,ro /mnt/backup/sda1.img /mnt/custom
ll /mnt/custom
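
Listing the files is only a quick sanity check. If you want more confidence before formatting anything, one option is to compare a checksum of the partition with a checksum of the image. The sketch below is my own addition (sha256_of and same_content are made-up names); it assumes /dev/sda1 is unmounted, that you run it as root, and that nothing has written to the image in the meantime. On a big partition it takes about as long as the dump itself.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file or block device in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both paths hold byte-identical data."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For the setup in this post the call would be same_content('/dev/sda1', '/mnt/backup/sda1.img').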

Now be extra careful and format your main partition (here sda1, the same as above). No typos are allowed here :
mkfs.ext4 /dev/sda1

Optional step for those who use labels/UUIDs : restore them !
cat /mnt/custom/etc/fstab # to get whatever you use to mount your partition
tune2fs /dev/sda1 -U 4c7556bf-5c6c-4f85-bc45-d7fa55c8ca1d # for example for UUID

Mount your nice and clean main partition and restore the files from your backup.
mount /dev/sda1 /mnt/gentoo
rsync -av --progress /mnt/custom/ /mnt/gentoo/ #be careful the ending slashes ARE important

You'll notice what a piece of cake it is for your main hard drive to cope with the incoming files, and how painful it is for your backup drive to feed it from files scattered all over the image (even from raid0 external disks on esata to a poorly performing laptop drive !).

If your console slows down the process, you can switch to another one and run df to watch the Use% of /dev/sda1 catch up with the Use% of /mnt/backup/sda1.img.
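
If you prefer a number over eyeballing df, something like the following gives a rough usage percentage for a mount point. It is a sketch of mine, not what df does internally: df computes Use% a bit differently (it accounts for blocks reserved for root), so expect small differences.

```python
import shutil


def use_percent(path):
    """Approximate used space, in percent, of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == '__main__':
    # Use '/mnt/gentoo' and '/mnt/custom' to compare restore and image.
    print('%.1f%%' % use_percent('.'))
```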

Pack up, and enjoy the speed boost.

umount /mnt/gentoo
umount /mnt/custom
umount /mnt/backup
reboot