Impact of storing RDF triples with Redland
Redland is an RDF application framework which allows to create, manipulate and store RDF triples. It comes with many language bindings, such as bindings for Python.
In my prototyping Archipel, my pet software configuration management system, I started to use Redland to store the version information as RDF triples. I quickly realised that the RDF storage (stored by Redland using Berkeley DB) was using a lot of space, compared to what I was doing. For instance, storing 143 files generated a database of 624Kb, plus a directory containing the actual file content (the RDF storage did only contain versioning information). This is something like 5 to 6 time the size I was expecting.
The problem
So here is the problem in more details: as I said, Archipel is a software configuration management tool, and stores the description of changes made to files (called deltas) as RDF triples. These RDF triples do not contain the actual changes (which are stored outside the RDF storage as plain files, usually in diff format). In this respect, my delta descriptions are just a bunch of strings like the following :
{e444bae023363477290e408bc6b6cb1fbee21c5a, class, "FileCreationDelta"}
{e444bae023363477290e408bc6b6cb1fbee21c5a, module, "typez.archipel.Deltas.FileDeltas"}
{e444bae023363477290e408bc6b6cb1fbee21c5a, parent, "None"}
{e444bae023363477290e408bc6b6cb1fbee21c5a, location, "2-Sources/typez/archipel/Deltas/FileDeltas.py"}
{e444bae023363477290e408bc6b6cb1fbee21c5a, nodeSignature, "45e1bf8fcde901f1bc408bdd6c48f4428a1d98 49"}
{e444bae023363477290e408bc6b6cb1fbee21c5a, product-mime-type, "text/x-java"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, type, "File"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, class, "FileCreationDelta"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, module, "typez.archipel.Deltas.FileDeltas"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, parent, "None"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, location, "2-Sources/typez/archipel/Commands/Create.py"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, nodeSignature, "ffdbdc9c69f1651b3c4c1fa274f857eb0871e3 2a"}
{f02f7f3d3c9d2ba6325436dd2f583f906b0406f1, product-mime-type, "text/x-java"
As you can notice, the subjects are repeated many times, so as the predicates and most of the objects. In this respect, I hoped that Redland would not store literally each node subject, predicate and object, so that the growth of my storage would quickly decrease, as there would be few new subjects or objects inserted.
This could be represented rather trivially as :
a = "f02f7f3d3c9d2ba6325436dd2f583f906b0406f1" (a, parent, "none") (a, product-mime-type, "text/x-Java") ...
so that in this case, this would save from writing the 40 chars string represented by a another time, hence saving a little bit less than 40 bytes !
Importance of triple storage overhead
To sum up things, I noticed that storing my triples took way more space than what I naively thought it would take. In fact I expected that storing the triples in the RDF storage would cost 1 or 2 times the size of the delta description in memory. I also expected this to lower, as many triples in the delta description shared common subjects, and that there are few predicates involved in describing the deltas, as I describe above.
However, it seemed like the impact was 3 to 6 times the size of my deltas. In this respect, I wanted to know if the overhead would increase, decrease or be constant over time. For Redland to be actually usable in my project (and not make Archipel too disk space hungry compared to similar software), the overhead should decrease.
I cannot say that for each modification you make on a file (and then commit to Archipel) you have to pay a something like 3kb. Imagine a project with thousands of files (like GCC), you would have to spend many Mb simply for describing the changes (this does not include the result of diff).
In this respect, even if I did not expect the overhead to be that high, I hoped that it would decrease and eventually go way beyond 100% (like 20%) when the storage starts to be populated.
The measures
I decided to write a simple Python script Δ to populate RDF persistent stores with different triples, and measure after adding n triples the size of the RDF storage files.
This script creates RDF statements of different types:
- Some that all have different subject, predicates and objets (referred to as type N -- N standing for normal).
- Some that all have the same subject, predicates or objects (respectively CS, CP and CO, C standing for constant).
- Some that have constant subject and object (CSO), others constant subject, predicate, and object (CSPO).
This will allow to have an idea of how Redland deals with triples that share common subjects, predicates or objects. As mentioned above, I expect the overhead to decrease (preferably below 1) when information is shared.
In this respect, I took two series of measures with statements of type N, CS, CP, CO, CSO and CSPO:
- The first measure adds 500 triples each times, starting at 0 and stopping at 10000 added triples.
- The second measure adds 50000 triples each time, starting at 0 and stopping at 100000 triples.
The results
During my measures, I ended up with databases of 6Mb for 10000 triples, and 66Mb for 100000 triples. Apparently, the Redland RDF storage scales rather well (in a linear fashion), which means that the overhead is constant, and does not decrease as I hoped.
The following diagram illustrates the overhead of triple storage for our various types of statements, depending on the number of triples of the same type in the database:
| Attach:redland-storage-overhead.png Δ |
From this diagram, we can make the following observations:
- The storage overhead is almost constant when the number of triples exceeds 4000.
- The overhead is for most cases between 4 and 6 times the orginial size.
- Having only one constant subject, object or predicate per triple does not change the storage overhead compared to completely different triples
- Having a constant subject and object (and supposedly subject and predicate or predicate and objects) per triple brings overhead down to 4.
- Having the same triple repeated has an overhead above 1.
So, basically, even if you store the same triple again and again (I mean, that you append it again and again), it will cost you something, and this cost is similar to the size of the data the triple contains (in our case, at least).
In Archipel, most deltas are of type CS, CP or CO. In this respect, the RDF storage will just grow in a constant manner, with an average overhead of 5.5 per delta.
So, if my deltas make about 500 bytes (this is a little bit more than 10 40 bytes strings), each single operation will result in growing the database of 500 * 5.5 = 2 750 bytes.
Imagine a project is composed of 3 000 files, and the 50 developers made 400 commits. This would just cost 2750 * 3000 * 400 / 10 = 330 000 000 (330Mb), simply for describing the changes. Also, note here that I divided the number of commits by 10 because I assumed that 300 files were changed at each commit (which is, I must say, rather improbable). Being more realistic, we could say that each commit involves modifications of 10 files, and then we would have 2750 * 400 * 10 = 11 000 000 (11Mb), which would still be too much (IMHO), as the change data (the result of a diff) is not included in this calculation.
Conclusion
It seems that unless I am not using Redland Python API properly, Redland has an important overhead on storing triples. I hoped to use it as a storage backend for Archipel, because I liked the idea of managing version information in RDF, but the overhead is disappointing, if not discouraging.
However, Redland scales really well, and obvisouly will not grow in an unexpected manner when you reach the million of triples, which makes it really robust.
So my advice is the following: know the size of what you store, and make a brief calculation of what it will cost you to store your data in RDF. To do so, apply the following rules:
- 5.5 times the size of your triple data when it has only one or no predicate, subject or object in common with another triple
- 4 times the size of your triple data when it has two common predicate, subject or object with another triple
- 1 time the size of your triple data when it was already added.
Maybe this should be taken with a grain of salt: for smaller or larger triples, the overhead may differ. I recommend that you tailor the script Δ I mentionned to make the benchmarks more suitable to your needs.
