HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce
1. DataWeb Research, Department of Computer Science, Univ. de Valladolid (Spain),
2. Vienna University of Economics and Business (Austria)
This work has been presented to ESWC 2015 (access paper). In the following, we provide a brief technical overview of the proposal and its main current results. Please see the “RDF/HDT” project website for a further details on HDT.
|1. Introduction |
2. Current Uses and Scalability Problems
2. HDT-MR: Technical Overview
6. Source Code
The current Web of Data is producing increasingly large RDF datasets. The need to exchange large datasets has unveiled the drawbacks of traditional RDF representations: high levels of verbosity/redundancy and weak machine-processable capabilities. Similar problems arise with smaller datasets in smartphones and embedded devices. This scenario calls for efficient formats for publication, exchange and consumption of RDF.
RDF HDT is a data-centric format which reduces verbosity in favor of machine-understandability and data management. The specification defines a compact binary serialization format optimized for storage or transmission over a network.
An HDT-encoded dataset is composed by three logical components (Header, Dictionary, and Triples), carefully designed to address RDF peculiarities. They are shown in Figure 1.
- Header: The Header holds metadata describing an HDT semantic dataset using plain RDF. It acts as an entry point for the consumer, who can have an initial idea of key properties of the content even before retrieving the whole dataset.
- Dictionary: The Dictionary is a catalog comprising all the different terms used in the dataset, such as URIs, literals and blank nodes. A unique identifier (ID) is assigned to each term, enabling triples to be represented as tuples of three IDs, which reference their respective subject/predicate/object term from the dictionary. This is a first step toward compression, since it avoids long terms to be repeated again and again. Moreover, similar strings are now stored together inside the dictionary, fact that can be exploited to improve compression even more.
- Triples: As stated before, the RDF triples can now be seen as tuples of three IDs. Therefore, the Triples section models the graph of relationships among the dataset terms. By understanding the typical properties of RDF graphs, we can come up with more efficient ways of representing this information, both to reduce the overall size, but also to provide efficient search/traversal operations.
Figure 1: HDT Components: Header-Dictionary-Triples
The main strengths of the HDT format are:
- Compactness: Use less space to encode the same information, therefore saving storage, communication bandwidth and transfer times. Designed taking into account the peculiarities of RDF, improves the figures of general compressors, such as GZIP, BZIP2.
- Clean Publication and exchange: Keeps the metadata easily accessible in the Header, with information such as provenance as statistics. The content separates the terms (Dictionary) from the structure of relationships among them (Triples).
- On-demand indexed access: permits fast search operations to parts of the file without having to decompress it as a whole.
In summary, the binary HDT file can be directly loaded into the memory of a computational system and accessed as a data structure, therefore avoiding the expensive indexing operations. Thanks to the compression techniques, more data is readily available at the higher levels of the memory hierarchy. Data in faster memory always means faster retrieval operations.
2. Current HDT Uses and Scalability Problems
Built-in indexes in both HDT components allow RDF triples to be randomly retrieved in compressed space. In other words, HDT outputs more compact files than RDF plain formats and also enables RDF triples to be efficiently accessed without prior decompression. This fact makes HDT an ideal choice to play as storage engine within semantic applications. HDT-FoQ illustrates how HDT can be used for efficient triple pattern and SPARQL join resolution, while WaterFowl goes a step further and provides inference on top of HDT foundations. This notion of HDT-based store is deployed in several domains and applications.
HDT conforms the back-end store of the Linked Data Fragments proposal , which tries to alleviate SPARQL endpoint bottlenecks by disseminating the SPARQL resolution between several “intelligent” clients that store and consume HDT data. HDT also performs as the main engine of the SemStim recommendation system and it is used as “lite” store in HDTourist, an Android app that consumes compressed DBpedia data for tourist assistance.
Nevertheless, these HDT achievements are at the price of moving scalability issues to the publishers, or data providers in general. Serializing RDF into HDT is not as straightforward as with plain serializations, given that the whole dataset must be exhaustively processed to obtain the Dictionary and Triples components. Current HDT implementations demand not negligible amounts of memory, so the HDT serialization lacks of scalability for very large datasets. Although HDT serialization is done only once, while many users may consume the resulting serialized file, this fact is an issue that must be confronted in order to address the ever-growing RDF datasets. This paper improves the HDT workflow by introducing MapReduce as the computation model for large HDT serialization.
3.1 MapReduce Framework
MapReduce is a framework and programming model to process large amounts of data in a distributed way. Its main purpose is to provide efficient parallelization while abstracting the complexity of distributed processing. MapReduce is not schema-dependent: unstructured and semi-structured can be processed, at the price of parsing every item. A MapReduce job comprises two phases: The first phase, map, reads the data as pairs key-value (k1,v1) and outputs another series of pairs key-value of different domain (k2,v2). The second phase, teduce, processes the list of values v2, related to each key k2, to produce a final list of output values v2 pertaining to the same domain. Many tasks are launched on each phase, all of them processing a small piece of the input data. The following scheme illustrates input and output data to be processed in each phase:
- map: (k1,v1) → list(k2,v2)
- reduce: (k2,list(v2)) → list(v2)
MapReduce relies on a master/slave architecture. The master initializes the process, distributes the workload among the cluster and manages all bookkeeping information. The slaves (or workers) run map and reduce tasks. The workers commonly store the data using a distributed filesystem based on the GFS (Google File System) model, where data is split in small pieces and stored in different nodes. This allows workers to leverage data locality as much as possible, reading data from the same machine where the task runs.
3.2 HDT Dictionary
The main goal of the Dictionary is to contribute to compactness by the assignation of a unique ID to each element in the dataset. Thus, the use of the dictionary implies two important and minimum operations:
locate(element): returns a unique identifier for the given element, if it appears in the dictionary.
extract(id): returns the element with identifier id in the dictionary, if it exists.
The dictionary is divided into sections depending on whether the term plays subject, predicate, or object roles. Nevertheless, in semantic data it is quite common that a URI appears both as a subject in one triple and as object on another. To avoid repeating those terms twice in the subjects and in the objects sections, we can gather them into a fourth section called shared Subject-Object.
Figure 2 depicts the 4-section dictionary organization and how IDs are assigned to the corresponding terms. Each section is sorted lexicographically and then correlative IDs are assigned to each term from 1 to n. It is worth noting that, for subjects and objects, the shared Subject-Object section uses the lower range of IDs; e.g. if there are m terms playing interchangeably as subject and object, all IDs x such that x <= m belong to this shared section.
Figure 2: HDT dictionary organization into four sections.
HDT allows to use different techniques of dictionary representation. Each one can manage its catalog of terms in different ways, as long as they implement the mandatory locate/extract operations. In addition they can implement additional enhanced operations, such as prefix, full-text or even regular expression searches.
3.3 HDT Triples
HDT proposes a Triples encoding called BitmapTriples (BT) that organizes the information in a way that exploits graph redundancy to keep the information compact. Moreover, this encoding can be easily mapped into a data structure that allows basic retrieval operations to be performed efficiently.
BT needs the triples to be previously sorted in a specific order, such as subject-predicate-object (SPO). BT is able to handle all possible triple orderings, but for the explanation we will focus on the more intuitive SPO order. Basically, BT transforms the graph into a forest containing as many trees as different subjects are used in the dataset, and these trees are then ordered by subject ID. This way, the first tree represents all triples rooted by the subject identified as 1, the second tree represents all triples rooted by the subject identified as 2, and so on. Each tree comprises three levels: the root represents the subject, the second level lists all predicates related to the subject, and finally the leaves represent all objects for their path (subject, predicate). Predicate and object levels are also sorted, as shown in Figure 3 (middle).
Figure 3: Description of Bitmap Triples.
Then, the BT forest model is physicaly encoded layer by layer. The top layer (subjects) is always a correlative list from 1 to the total number of subjects, due to the correlative assignation of IDs in the dictionary and the SPO order of the triples. Therefore the subjects do not need to be encoded and is left implicit. This is an first obvious spatial saving.
The predicate layer is stored using i) a sequence of predicate IDs (Sp); ii) a bitsequence (Bp) containing one bit per element in Sp: A 1 bit indicates that the predicate is the first child of its parent subject in the tree, the remaining siblings are marked with 0. This allows knowing to which subject is each predicate associated by counting the number of 1′s in the bitmap up to that position, or even locating the range of predicates associated to the n-th subject by locating the n-th and (n+1)-th 1 in the bitmap. The object layer is stored likewise, using a sequence of Object IDs (So) together with a bit sequence (Bo). Each 1 bit indicates the first object of each parent predicate, therefore allowing traversals in the tree upwards and downwards.
3.4 Building HDT
Once Dictionary and Triples internals have been described, we proceed to summarize how current HDT implementations build them. Remind that this process is the main scalability bottleneck addressed by our current proposal. To date, HDT serialization can be seen as a three-stage process:
- Classifying RDF terms. This first stage performs a triple-by-triple parsing (from the input dataset file) to classify each RDF term into the corresponding Dictionary section. To do so, it keeps a temporal data structure, consisting of three hash tables storing subject-to-ID, predicate-to-ID, and object-to- ID mappings. For each parsed triple, its subject, predicate, and object are searched in the appropriate hash, obtaining the associated ID if present. Terms not found are inserted and assigned an auto-incremental ID. These IDs are used to obtain the temporal ID-triples (ids, idp, ido) representation of each parsed triple, storing all them in a temporary ID-triples array. At the end of the file parsing, subject and object hashes are processed to identify terms playing both roles. These are deleted from their original hash tables and inserted into a fourth hash comprising terms in the SO section.
- Building HDT Dictionary. Each dictionary section is sorted lexicograph- ically in this second stage. This decision is made because prefix-based encoding is a well-suited choice for compressing string dictionaries. Finally, an auxiliary array coordinates the previous temporal ID and the definitive ID after the Dictionary sorting.
- Building HDT Triples. This final stage scans the temporary array storing ID-triples. For each triple, its three IDs are replaced by their definitive IDs in the newly created Dictionary. Once updated, ID-triples are sorted by subject, predicate and object IDs to obtain the BitmapTriples streams. In practice, it is a straightforward task which scans the array to sequentially extract the predicates and objects into the Sp and So sequences, and denoting list endings with 1-bits in the bitsequences.
This section briefly describes HDT-MR, our MapReduce-based approach to serialize large RDF datasets in HDT. Further details can be found in the original paper.
Figure 4 illustrates the HDT-MR workflow, consisting in two stages: (1) Dictionary Encoding (top) and (2) Triples Encoding (bottom), described in the following subsections. The whole process assumes the original RDF dataset is encoded in N-Triples format (one statement per line).
Figure 4: HDT-MR workflow.
As can be seen, the first process builds the HDT Dictionary from the original N-Triples dataset. Similar to the traditional HDT construction, it can be seen as a three-task process of (i) identifying the role of each term in the dataset, (ii) obtaining the aforementioned sections (SO, S, O, and P) in lexicographic order, and (iii) effectively encoding the Dictionary component. We design HDT-MR to perform these three tasks as two distributed MapReduce jobs and a subsequent local process (performed by the master node), as shown in the figure.
The second process performs a second pass over the original N-Triples dataset to obtain, in this case, the HDT Triples component. The main tasks for such Triples encoding are (i) replacing RDF terms by their corresponding IDs encoded in the Dictionary, and (ii) getting the ID-triples representation sort by subject, predicate and object IDs. As in the previous process, HDT-MR accomplishes these tasks by two MapReduce jobs and a final local process.
This section evaluates the performance of HDT-MR, the proposed MapReduce-based HDT construction, and compares it against the traditional single-node approach. To do so, we have developed a proof-of-concept HDT-MR prototype, under the Hadoop framework (version 1.2.1). It makes use of the existing HDT-Java library (version RC-2) which it is also used for deploying the baseline HDT running on a single node.
The experimental setup is designed as follows (see Table 1). On the one hand, we use a powerful computational configuration to implement the role of data provider running HDT on a single node. On the other hand, we deploy HDT-MR using a potent master and 10 slave nodes running on a more memory-limited configuration. This infrastructure tries to simulate a computational cluster in which further nodes may be plugged to process huge RDF datasets. For a fair comparison, the amount of main memory in the single node is the same as the total memory available for the full cluster of Hadoop.
|Master||Intel Xeon X5675||4||3.07 GHz||48 GB||Ubuntu 12.04.2|
|Single Node||Xeon E5-2650v2||32||2.60GHz||48 GB||Debian 7.8|
|Slaves||Intel Xeon X5675||4||3.07 GHz||8 GB||Debian 7.7|
Regarding datasets, we consider a varied configurarion comprising real- world and synthetic ones. All of them are statistically described in Table 2.
Table 3 compares serialization times for HDT and HDT-MR. These time are averaged over three independent serialization procesess for each dataset. Figure 5 and Figure 6 illustrate these same compression times for the real-world and LUBM datasets.
Figure 5: Graphical comparison on Compression times (in seconds). Real-world datasets.
Figure 6: Graphical comparison on Compression times (in seconds). LUBM datasets.
Finally, Table 4 compares HDT space savings regarding the original serialization. Although RDF compression is not the main purpose of this paper, it is worth noting that previous literature do not report compression numbers for such large datasets. This table reports the size of the datasets (i) in NTriples format, (ii) compressed with lzo (the configuration used for storage in HDFS), (iii) in HDT, and (iv) in HDT compressed with gzip (as a practical configuration for HDT uses.
6. Source Code
We provide a first alpha release of HDT-MR source code. While this build has been extensively tested, the current alpha state is still subject to bugs and optimizations. You can also checkout the latest source code from the Java HDT-MR library in the GitHub repository.
HDT-MR content is licensed by Lesser General Public License.
[Brisaboa, Canovas, Claude, Martínez-Prieto,
and Navarro 2011]
N. Brisaboa , R. Canovas, F. Claude, M. Martínez-Prieto, and G. Navarro (2011). Compressed String Dictionaries. In Proc. of 10th International Symposium on Experimental Algorithms (SEA), pp. 136-147.
[Curé, Blin, Revuz, and Faye 2014]
O. Curé, G. Blin, D. Revuz, and D.C. Faye. WaterFowl: A Compact, Self-indexed and Inference-Enabled Immutable RDF Store. In Proc. of ESWC, pages 302–316, 2014.
[Dean and Ghemawat 2004]
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Largee Clusters. In Proc. of OSDI, pages 137–150, 2004.
[Heitmann and Haye 2014]
B. Heitmann and C. Haye. Semstim at the lod-recsys 2014 challenge. In Proc. of SemWebEval, pages 170–175, 2014
[Martínez-Prieto, Arias, and Fernández 2012]
M. Martínez-Prieto, M. Arias, and J. Fernández (2012). Exchange and Consumption of Huge RDF Data. In Proc. of the 9th Extended Semantic Web Conference (ESWC), pp. 437-452.