large-memory key-value pair store
large-scale and low-latency in-memory key-value pair store for Python
1) Modern boxes have 10s to 100s of gigabytes of RAM
2) Gigabyte++-size Python dictionaries are slow to fill
3) Gigabyte++-size dictionaries are fun to use
4) atbr is fast, in particular at loading from file (see the sketch below)
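A rough illustration of points 2) and 4): filling a plain Python dict happens line by line in the interpreter, while atbr's load runs in native code. The file name is illustrative, and the format (one key<TAB>value pair per line) is an assumption:

import time
import atbr

# slow path: fill a plain Python dict line by line in the interpreter
# (assumes keyvaluedata.tsv holds one key<TAB>value pair per line)
start = time.time()
mydict = {}
for line in open("keyvaluedata.tsv"):
    key, value = line.rstrip("\n").split("\t", 1)
    mydict[key] = value
print "plain dict fill took %f seconds" % (time.time() - start)

# fast path: atbr loads the same file in native code
# (load() prints its own timing, as in the benchmark below)
mystore = atbr.Atbr()
mystore.load("keyvaluedata.tsv")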
Run the following to install atbr (including its dependencies):
$ cat INSTALL.sh # to see what it does
$ chmod +x ./INSTALL.sh && sudo ./INSTALL.sh
(note: on Mac, run python setup-mac.py install afterwards)
it basically does this:
$ sudo apt-get install libboost-dev python-setuptools swig python-dev -y
$ sudo pip install -r requirements.txt # or under virtualenv
$ wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
$ tar -zxvf sparsehash-2.0.2.tar.gz
$ cd sparsehash-2.0.2
$ ./configure && make && sudo make install
$ sudo python setup.py install # or under virtualenv
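A minimal smoke test to check that the install worked (no output means the import succeeded):

$ python -c "import atbr"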
import atbr

# Create storage
mystore = atbr.Atbr()
# Load data
mystore.load("keyvaluedata.tsv")
# Number of key value pairs
print mystore.size()
# Get value corresponding to key
print mystore.get("key1")
# Return true if a key exists
print mystore.exists("key1")
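load() reads a tab-separated file; assuming one key<TAB>value pair per line (inferred from the .tsv extension, not from a documented spec), a minimal input file can be created like this:

# create a small input file for mystore.load()
# (assumed format: one key<TAB>value pair per line)
f = open("keyvaluedata.tsv", "w")
f.write("key1\tvalue1\n")
f.write("key2\tvalue2\n")
f.close()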
Input for the benchmark was the output of a small Hadoop (MapReduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based Thinkpad X200 with an SSD drive.
$ ls -al medium.tsv
-rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv
$ wc medium.tsv
212969 5835001 117362571 medium.tsv
$ python
>>> import atbr
>>> a = atbr.Atbr()
>>> a.load("medium.tsv")
Inserting took - 1.178468 seconds
Num new key-value pairs = 212969
Speed: 180716.807959 key-value pairs per second
Throughput: 94.803214 MB per second
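Load speed is only half the story; a quick sketch for gauging lookup throughput in the same session (assumes "key1" is a key present in medium.tsv; numbers will vary with hardware):

import time
n = 100000
start = time.time()
for i in xrange(n):
    a.get("key1")  # repeated lookup of a single key
elapsed = time.time() - start
print "%f lookups per second" % (n / elapsed)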
atbr can also run as a server (default port is 8888), supporting both HTTP and WebSocket.
Start server:
$ cd atbserver ; python atbr_server.py
Load tsv-file data over HTTP:
$ curl http://localhost:8888/load/keyvaluedata.tsv
Get the value for key='key1':
$ curl http://localhost:8888/get/key/key1
Add the key-value pair key='foo', value='bar':
$ curl http://localhost:8888/put/key/foo/value/bar
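The same HTTP endpoints can be called from Python; a minimal sketch using urllib2, with the paths copied from the curl examples above:

import urllib2

base = "http://localhost:8888"

# load a tsv file on the server
print urllib2.urlopen(base + "/load/keyvaluedata.tsv").read()

# add key='foo', value='bar'
print urllib2.urlopen(base + "/put/key/foo/value/bar").read()

# get the value for key='foo'
print urllib2.urlopen(base + "/get/key/foo").read()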
Example that loads keyvaluedata.tsv using the WebSocket load API:
$ python websocket_cmdline_client.py keyvaluedata.tsv
import sys
from websocket import create_connection
ws = create_connection("ws://localhost:8888/loadws/")
# send the filename to load, e.g. sys.argv[1] could be 'keyvaluedata.tsv'
ws.send(sys.argv[1])
# wait for the server's load result
result = ws.recv()
ws.close()
print result
atbr also supports sharding data over several server processes, with a shard server routing requests to them.

Start three atbr servers, each with its own shard of the data:
$ python atbr_server.py 8585 shard_data_1.tsv
$ python atbr_server.py 8686 shard_data_2.tsv
$ python atbr_server.py 8787 shard_data_3.tsv

Start the shard server pointing at the three servers:
$ python atbr_shard_server.py localhost:8585 localhost:8686 localhost:8787

Look up key1 through the shard setup:
$ python atbr_websocket_cmdline_client.py key1
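How atbr_shard_server.py assigns keys to shards is not shown here; a common scheme, sketched below as an assumption rather than atbr's actual implementation, is to hash each key modulo the number of shards:

shards = ["localhost:8585", "localhost:8686", "localhost:8787"]

def shard_for(key):
    # deterministically map a key to one of the shard servers
    return shards[hash(key) % len(shards)]

print shard_for("key1")  # always routes key1 to the same shard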
atbr runs in-memory, and the cost of running e.g. an Amazon EC2 instance with 68.4GB of RAM is $1.80/hour. Assuming the node has roughly 65GB available after OS components are loaded, this gives a gigabyte-hour cost of 1.80/65 ≈ $0.028 and a terabyte-hour cost of roughly 1000 * 0.028 ≈ $28. Since atbr is designed to hold json keys and values with metadata, and the metadata can point to larger objects in disk-based storage (e.g. AWS S3), a terabyte in memory goes a long way. Monthly cost in this case would be roughly $20,000.
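A quick back-of-the-envelope check of those figures:

instance_cost = 1.80   # USD per hour, EC2 instance with 68.4GB RAM
usable_gb = 65.0       # RAM assumed available after OS overhead

gb_hour = instance_cost / usable_gb   # ~0.028 USD per gigabyte-hour
tb_hour = gb_hour * 1000              # ~27.7 USD per terabyte-hour
print "USD per terabyte-hour: %.2f" % tb_hour
print "USD per terabyte-month (730h): %.0f" % (tb_hour * 730)  # ~20200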
atbr is currently developed and supported by Amund Tveit (amund (at) atbrox (dot) com), Atbrox.