From
there I wanted to post cleanly all my findings.
I include all my sources, so you can reproduce it.
Basically, Pavel wanted to optimize a simple use case. Here is my shot at it:
First the test.py, the html7 implementation and a slightly improved one where I don't force python to go back and forth between bytes and unicode + the benchmark timer :
import timeit
WHITELIST2 = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
WHITELIST2_UNI = set(u'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
r = u'234782384723!#$#@!%$@#%$@#%#$%%%%%%%%%%%@#$%342583492058934028590342853490285902344#$%%*%****%7jkb6h546777776ynkk4b56byhbh5j'*500
def html7():
s = r
lstr = str; lord = ord; lWHITELIST2 = WHITELIST2
return "".join([
c if c in lWHITELIST2
else "&#" + lstr(lord(c)) + ";"
for c in s])
def htmlGB():
s = r
lstr = unicode; lord = ord; lWHITELIST2 = WHITELIST2_UNI
return u''.join([
c if c in lWHITELIST2
else u'&#' + lstr(lord(c)) + u';'
for c in s])
assert(html7() == htmlGB())
t = timeit.Timer(stmt = html7)
print 'html7:' + str(t.timeit(number=100))
t = timeit.Timer(stmt = htmlGB)
print 'htmlGB:' + str(t.timeit(number=100))
It gives me that as timing under python and pypy 1.5 :
⚫ python test.py
html7:2.48782491684
htmlGB:2.14983606339
⚫ pypy-c1.5 test.py
html7:2.44286298752
htmlGB:1.84629392624
Note: the other implementation with caching & statistical tryouts are too convoluted and too much on the speed over memory tradeoff for this simple task IMHO.
Then I wanted to go further, so I made a simple port on cython then a "bare metal" version on cython, and I have been really impressed by it !
In order to compile in cython you need a simple makefile-like setup.py file:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("encode", ["encode.pyx"])]
setup(
name = 'Encoding Test',
cmdclass = {'build_ext': build_ext},
ext_modules = ext_modules
)
Then here are my implementations : cython is the direct port, cython_alt is the C-like bare metal one.
In the file encode.pyx put:
WHITELIST = set(u'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
from libc.stdlib cimport malloc, free
from libc.stdio cimport sprintf
def cython(unicode s):
return u''.join([
c if c in WHITELIST
else u'&#' + unicode(< long > c) + u';'
for c in s])
def cython_alt(unicode s):
cdef char * buffer = < char * > malloc(len(s) * 10)
cdef int i = 0
cdef Py_UCS4 c
for c in s:
if (c >= u'a' and c <= u'z') or (c >= u'A' and c <= u'Z') or (c >= u'0' and c <= u'9'):
buffer[i] = < char > c
i += 1
else:
sprintf(buffer + i,'&#%d;',c)
while buffer[i]:
i += 1
result = < bytes > buffer
free(buffer)
return result
And the same test runner test_cython :
import timeit
from encode import cython
from encode import cython_alt
r = u'234782384723!#$#@!%$@#%$@#%#$%%%%%%%%%%%@#$%342583492058934028590342853490285902344#$%%*%****%7jkb6h546777776ynkk4b56byhbh5j'*500
def cython2():
cython(r)
def cython_alt2():
cython_alt(r)
assert(cython(r) == cython_alt(r))
t = timeit.Timer(stmt = cython2)
print 'cython:' + str(t.timeit(number=100))
t = timeit.Timer(stmt = cython_alt2)
print 'cython_alt:' + str(t.timeit(number=100))
To compile it just do :
python setup.py build_ext --inplace
The results :
⚫ python test_cython.py
cython:1.70311307907
cython_alt:0.348756790161
Booya ! :)
No comments:
Post a Comment