This week, Daniel gave a presentation on his Part II project, which is called “DLLAMA: Scaling LLAMA Into a Distributed Graph Database.”
A graph database represents entities and relationships in the form of a graph instead of a table. In this approach, calculating the distance between two entities (see Bacon Number) would be much more efficient. Daniel discussed three naive ways to represent graphs in a computer and pointed out the problems of each of them.
We could use a matrix to represent the connectivity between nodes by recording a “1” in the matrix when there is a connection, and “0” otherwise:
Unfortunately, the matrix representation takes up a lot of space. Alternatively, we could use a linked-list representation to represent the graph:
Unfortunately, a linked list representation has bad read performance and interacts poorly with the cache since data is spread throughout memory.
To solve some of these problems, Dan introduced a data structure named CSR. This data structure is similar to the linked-list representation; the difference is that all the edges are stored in a single array. Doing so solves the high costs associated with reading in linked-list representation, at the expense of significant additional cost when adding edges:
This data structure (with a small improvement called“snapchat”) is used in LLAMA, a graph database engine. Dan modified LLAMA to support distributed processing across multiple computers and scale the data structure in a distributed system. His system works better than a single machine for read-heavy workloads and he provided benchmark results which showed that DLLAMA was significantly better than Neo4j (another frequently-used graph database) and the original version of LLAMA.