27 Jun 2008

Dear lazyweb number 3.

So far, I’ve asked:

high latency net simulations – great answers.

python friendly back-end accessible search engines – many answers, none that fit the bill. So I wrote my own :).

Today, I shall ask – is there a python-accessible persistent b+tree(or hashtable, or …) module. Key considerations:

– scaling: millions of nodes are needed with low latency access to a nodes value and to determine a nodes absence

– indices are write once. (e.g. a group of indices are queried, and data is expired altered by some generational tactic such as combining existing indices into one larger one and discarding the old ones)

– reading and writing is suitable for sharply memory constrained environments. ideally only a few 100KB of memory are needed to write a 100K node index, or to read those same 100K nodes back out of a million node index. temporary files during writing are fine.

– backend access must either be via a well defined minimal api (e.g. ‘needs read, readv, write, rename, delete’) or customisable in python

– easy installation – if C libraries etc are needed they must be already pervasively available to windows users and Ubuntu/Suse/Redhat/*BSD systems

– ideally sorted iteration is available as well, though it could be layered on top

– fast, did I mention fast?

– stable formats – these indices may last for years unaltered after being written, so any libraries involved need to ensure that the format will be accessible for a long time. (e.g. python’s dump/marshal facility fails)

sqlite, bdb already fail at this requirements list.

snakesql, gadfly, buzhug and rbtree fail too.