atbr

large-memory key-value pair store

View the project on GitHub: https://github.com/atbrox/atbr

What is atbr?

A large-scale, low-latency in-memory key-value pair store for Python.

Why atbr?

1) Modern boxes have 10s to 100s of gigabytes of RAM

2) Gigabyte++-size Python dictionaries are slow to fill

3) Gigabyte++-size dictionaries are fun to use

4) atbr is fast (in particular at loading data from file)

What is atbr built with?

C++ (heavy lifting), Python (APIs/WebSocket), SWIG (glue). Python libraries: Tornado (HTTP/WebSocket server), Boto (Amazon Web Services API), zc.zk (ZooKeeper), websocket-client. C++ libraries: Google's sparsehash.

install

Run the following to install atbr (including its dependencies):

$ cat INSTALL.sh # to see what it does
$ chmod +x ./INSTALL.sh && sudo ./INSTALL.sh
(note: on Mac, run python setup-mac.py install afterwards)

INSTALL.sh basically does this:

$ sudo apt-get install libboost-dev python-setuptools swig* python-dev -y
$ sudo pip install -r requirements.txt # or under virtualenv

$ wget http://sparsehash.googlecode.com/files/sparsehash-2.0.2.tar.gz
$ tar -zxvf sparsehash-2.0.2.tar.gz
$ cd sparsehash-2.0.2
$ ./configure && make && sudo make install

$ sudo python setup.py install  # or under virtualenv

python-api example

import atbr.atbr

# Create storage
mystore = atbr.atbr.Atbr()

# Load data
mystore.load("keyvaluedata.tsv")

# Number of key value pairs
print mystore.size()

# Get value corresponding to key
print mystore.get("key1")

# Return true if a key exists
print mystore.exists("key1")
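
load() reads tab-separated files. Here is a minimal sketch for producing one, assuming the format is one key<TAB>value pair per line (an assumption based on the .tsv files used in these examples):

  # write a tiny tab-separated input file for mystore.load()
  # NOTE: the key<TAB>value-per-line layout is an assumption
  f = open("keyvaluedata.tsv", "w")
  f.write("key1\tvalue1\n")
  f.write("key2\tvalue2\n")
  f.close()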

benchmark (loading)

Input for the benchmark was the output of a small Hadoop (MapReduce) job that generated key-value pairs where both the key and the value were JSON. The benchmark was run on an Ubuntu-based ThinkPad X200 with an SSD drive.

 $ ls -al medium.tsv
 -rw-r--r-- 1 amund amund 117362571 2012-04-25 15:36 medium.tsv

 $ wc medium.tsv
 212969   5835001 117362571 medium.tsv

 $ python
 >>> import atbr
 >>> a = atbr.Atbr()
 >>> a.load("medium.tsv")
 Inserting took - 1.178468 seconds
 Num new key-value pairs = 212969
 Speed: 180716.807959 key-value pairs per second
 Throughput: 94.803214 MB per second

atbr http and websocket server

atbr can also run as a server (default port 8888), supporting both HTTP and WebSocket.

Start server:

 $ cd atbserver ; python atbr_server.py

HTTP API

Load tsv-file data over HTTP:

 $ curl http://localhost:8888/load/keyvaluedata.tsv

Get the value for key = 'key1':

 $ curl http://localhost:8888/get/key/key1

Add the key-value pair key='foo', value='bar':

 $ curl http://localhost:8888/put/key/foo/value/bar
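
The same endpoints can be called from Python as well. A minimal sketch using urllib2, with the endpoint paths taken from the curl examples above (that the responses are plain-text bodies is an assumption):

  import urllib2

  base = "http://localhost:8888"

  # load a tsv file readable by the server process
  print urllib2.urlopen(base + "/load/keyvaluedata.tsv").read()

  # add key='foo', value='bar', then read it back
  print urllib2.urlopen(base + "/put/key/foo/value/bar").read()
  print urllib2.urlopen(base + "/get/key/foo").read()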

Websocket API

Example that loads keyvaluedata.tsv using the WebSocket load API:

 $ python websocket_cmdline_client.py keyvaluedata.tsv

WebSocket client code:

  import sys
  from websocket import create_connection

  ws = create_connection("ws://localhost:8888/loadws/")
  # e.g. sys.argv[1] could be 'keyvaluedata.tsv'
  ws.send(sys.argv[1])
  result = ws.recv()
  ws.close()
  print result

Sharded WebSocket mode

Start several atbr servers, and then one (or several) atbr shard servers that talk to them.

Example with 3 shards on localhost:

  $ python atbr_server.py 8585 shard_data_1.tsv
  $ python atbr_server.py 8686 shard_data_2.tsv
  $ python atbr_server.py 8787 shard_data_3.tsv

  $ python atbr_shard_server.py localhost:8585 localhost:8686 localhost:8787

  $ python atbr_websocket_cmdline_client.py key1
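
How keys get routed to shards is not shown above. A common scheme, and a minimal sketch of what a shard server could do, is hashing the key modulo the number of shards (this scheme and the shard_for helper are illustrative assumptions, not necessarily what atbr_shard_server.py implements):

  import zlib

  SHARDS = ["localhost:8585", "localhost:8686", "localhost:8787"]

  def shard_for(key):
      # stable checksum, so the same key always maps to the same shard
      return SHARDS[zlib.crc32(key) % len(SHARDS)]

  print shard_for("key1")  # prints one of the three shard addresses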

Cost of running atbr on EC2

atbr runs in-memory, and an Amazon EC2 instance with 68.4GB of RAM costs e.g. $1.80/hour. Assuming the node has roughly 65GB available (after OS components are loaded), this gives a gigabyte-hour cost of $0.027 and a terabyte-hour cost of 1000 * 0.027 = $27. Since atbr is designed to hold only JSON keys and values plus metadata, and the metadata can hold pointers to larger objects in disk-based storage (e.g. AWS S3), a terabyte in memory brings you very far. The monthly cost in this case would be about $20412.
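
For concreteness, here is the arithmetic spelled out as a small sketch (the ~756-hour month implied by the $20412 figure is inferred from the quoted numbers, not an AWS billing convention):

  hourly_cost = 1.80        # USD/hour for an EC2 instance with 68.4GB RAM
  usable_gb = 65.0          # RAM left after OS components are loaded

  gb_hour = hourly_cost / usable_gb  # ~0.027 USD per gigabyte-hour
  tb_hour = 1000 * 0.027             # = 27 USD per terabyte-hour

  # 27 USD/hour * 756 hours (24 * 31.5, ~31.5 days) gives the quoted figure
  monthly = tb_hour * 24 * 31.5
  print gb_hour, tb_hour, monthly    # 0.0276..., 27.0, 20412.0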

What type of storage datastructure is used in atbr?

Currently:

  1. default: Google's sparse_hash_map (from the sparsehash library)
  2. Google's dense_hash_map (also from the sparsehash library)
  3. C++/STL unordered_map

Google's sparsehash is the default because it is highly memory efficient; see this benchmark for more details. Other efficient C++-based data structures will be supported in later versions.

Roadmap

Who develops and supports atbr?

atbr is developed and supported by Amund Tveit (amund (at) atbrox (dot) com) of Atbrox.

Forking the project on GitHub?

Fork atbr at https://github.com/atbrox/atbr.