27 Jun 2008
Dear lazyweb number 3.
So far, I’ve asked:
high latency net simulations – great answers.
python friendly back-end accessible search engines – many answers, none that fit the bill. So I wrote my own .
Today, I shall ask – is there a python-accessible persistent b+tree(or hashtable, or …) module. Key considerations:
- scaling: millions of nodes are needed with low latency access to a nodes value and to determine a nodes absence
- indices are write once. (e.g. a group of indices are queried, and data is expired altered by some generational tactic such as combining existing indices into one larger one and discarding the old ones)
- reading and writing is suitable for sharply memory constrained environments. ideally only a few 100KB of memory are needed to write a 100K node index, or to read those same 100K nodes back out of a million node index. temporary files during writing are fine.
- backend access must either be via a well defined minimal api (e.g. ‘needs read, readv, write, rename, delete’) or customisable in python
- easy installation – if C libraries etc are needed they must be already pervasively available to windows users and Ubuntu/Suse/Redhat/*BSD systems
- ideally sorted iteration is available as well, though it could be layered on top
- fast, did I mention fast?
- stable formats – these indices may last for years unaltered after being written, so any libraries involved need to ensure that the format will be accessible for a long time. (e.g. python’s dump/marshal facility fails)
sqlite, bdb already fail at this requirements list.
snakesql, gadfly, buzhug and rbtree fail too.
Filed under: Uncategorized | Leave a Comment