Hi!
I'm Ruben Taelman, a Web postdoctoral researcher at IDLab,
with a focus on decentralization, Linked Data publishing, and querying.

My goal is to make data accessible for everyone by providing
intelligent infrastructure and algorithms for data publication and retrieval.

To support my research, I develop various open source JavaScript libraries such as streaming RDF parsers and the Comunica engine to query Linked Data on the Web.
As this website itself contains Linked Data, you can query it live with Comunica.

Have a look at my publications or projects
and contact me if any of those topics interest you.

Latest blog posts

  • Querying a Decentralized Web
    The road towards effective query execution of Decentralized Knowledge Graphs.

    Most of today’s applications are built based around the assumption that data is centralized. However, with recent decentralization efforts such as Solid quickly gaining popularity, we may be evolving towards a future where data is massively decentralized. In order to enable applications over decentralized data, there is a need for new querying techniques that can effectively execute over it. This post discusses the impact of decentralization on query execution, and the problems that need to be solved before we can use it effectively in a decentralized Web.

  • 5 rules for open source maintenance
    Guidelines for publishing and maintaining open source projects.

    Thanks to continuing innovation of software development tools and services, it has never been easier to start a software project and publish it under an open license. While this has lead to the availability of a huge arsenal of open source software projects, the number of qualitative projects that are worth reusing is of a significantly smaller order of magnitude. Based on personal experience, I provide five guidelines in this post that will help you to publish and maintain highly qualitative open-source software.

More blog posts

Highlighted publications

  1. Conference Link Traversal Query Processing over Decentralized Environments with Structural Assumptions
    1. 0
    2. 1
    In Proceedings of the 22nd International Semantic Web Conference To counter societal and economic problems caused by data silos on the Web, efforts such as Solid strive to reclaim private data by storing it in permissioned documents over a large number of personal vaults across the Web. Building applications on top of such a decentralized Knowledge Graph involves significant technical challenges: centralized aggregation prior to query processing is excluded for legal reasons, and current federated querying techniques cannot handle this large scale of distribution at the expected performance. We propose an extension to Link Traversal Query Processing (LTQP) that incorporates structural properties within decentralized environments to tackle their unprecedented scale. In this article, we analyze the structural properties of the Solid decentralization ecosystem that are relevant for query execution, we introduce novel LTQP algorithms leveraging these structural properties, and evaluate their effectiveness. Our experiments indicate that these new algorithms obtain accurate results in the order of seconds, which existing algorithms cannot achieve. This work reveals that a traversal-based querying method using structural assumptions can be effective for large-scale decentralization, but that advances are needed in the area of query planning for LTQP to handle more complex queries. These insights open the door to query-driven decentralized applications, in which declarative queries shield developers from the inherent complexity of a decentralized landscape. 2023
    More
  2. Conference Comunica: a Modular SPARQL Query Engine for the Web
    1. 0
    2. 1
    3. 2
    4. 3
    In Proceedings of the 17th International Semantic Web Conference Query evaluation over Linked Data sources has become a complex story, given the multitude of algorithms and techniques for single- and multi-source querying, as well as the heterogeneity of Web interfaces through which data is published online. Today’s query processors are insufficiently adaptable to test multiple query engine aspects in combination, such as evaluating the performance of a certain join algorithm over a federation of heterogeneous interfaces. The Semantic Web research community is in need of a flexible query engine that allows plugging in new components such as different algorithms, new or experimental SPARQL features, and support for new Web interfaces. We designed and developed a Web-friendly and modular meta query engine called Comunica that meets these specifications. In this article, we introduce this query engine and explain the architectural choices behind its design. We show how its modular nature makes it an ideal research platform for investigating new kinds of Linked Data interfaces and querying algorithms. Comunica facilitates the development, testing, and evaluation of new query processing capabilities, both in isolation and in combination with others. 2018
    More
  3. Journal Triple Storage for Random-Access Versioned Querying of RDF Archives
    1. 0
    2. 1
    3. 2
    4. 3
    5. 4
    In Journal of Web Semantics When publishing Linked Open Datasets on the Web, most attention is typically directed to their latest version. Nevertheless, useful information is present in or between previous versions. In order to exploit this historical information in dataset analysis, we can maintain history in RDF archives. Existing approaches either require much storage space, or they expose an insufficiently expressive or efficient interface with respect to querying demands. In this article, we introduce an RDF archive indexing technique that is able to store datasets with a low storage overhead, by compressing consecutive versions and adding metadata for reducing lookup times. We introduce algorithms based on this technique for efficiently evaluating queries at a certain version, between any two versions, and for versions. Using the BEAR RDF archiving benchmark, we evaluate our implementation, called OSTRICH. Results show that OSTRICH introduces a new trade-off regarding storage space, ingestion time, and querying efficiency. By processing and storing more metadata during ingestion time, it significantly lowers the average lookup time for versioning queries. OSTRICH performs better for many smaller dataset versions than for few larger dataset versions. Furthermore, it enables efficient offsets in query result streams, which facilitates random access in results. Our storage technique reduces query evaluation time for versioned queries through a preprocessing step during ingestion, which only in some cases increases storage space when compared to other approaches. This allows data owners to store and query multiple versions of their dataset efficiently, lowering the barrier to historical dataset publication and analysis. 2018
    More
More publications

Latest publications

  1. Poster Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach
    1. 0
    2. 1
    3. 2
    4. 3
    In Proceedings of the 23rd International Semantic Web Conference: Posters and Demos Link Traversal queries face challenges in completeness and long execution time due to the size of the web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures exploitable by query engines to prune irrelevant sources. Current criteria rely on using information from the query definition and predefined predicate. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose to use a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary, to prune irrelevant sources. Our preliminary findings show that by using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that the investigation of hypermedia controls in link pruning of traversal queries is a worthy effort for optimizing web queries of unindexed decentralized databases. 2024
    More
  2. Journal Towards Applications on the Decentralized Web using Hypermedia-driven Query Engines
    1. 0
    In ACM SIGWEB Newsletter The Web is facing unprecedented challenges related to the control and ownership of data. Due to recent privacy and manipulation scandals caused by the increasing centralization of data on the Web into increasingly fewer large data silos, there is a growing demand for the re-decentralization of the Web. Enforced by user-empowering legislature such as GDPR and CCPA, decentralization initiatives such as Solid are being designed that aim to break personal data out of these silos and give back control to the users. While the ability to choose where and how data is stored provides significant value to users, it leads to major technical challenges due to the massive distribution of data across the Web. Since we cannot expect application developers to take up this burden of coping with all the complexities of handling this data distribution, there is a need for a software layer that abstracts away these complexities, and makes this decentralized data feel as being centralized. Concretely, we propose personal query engines to play the role of such an abstraction layer. In this article, we discuss what the requirements are for such a query-based abstraction layer, what the state of the art is in this area, and what future research is required. Even though many challenges remain to achieve a developer-friendly abstraction for interacting with decentralized data on the Web, the pursuit of it is highly valuable for both end-users that want to be in control of their data and its processing and businesses wanting to enable this at a low development cost. 2024
    More
  3. Workshop The R3 Metric: Measuring Performance of Link Prioritization during Traversal-based Query Processing
    1. 0
    2. 1
    3. 2
    In Proceedings of the 16th Alberto Mendelzon International Workshop on Foundations of Data Management The decentralization envisioned for the current centralized web requires querying approaches capable of accessing multiple small data sources while complying with legal constraints related to personal data, such as licenses and the GDPR. Link Traversal-based Query Processing (LTQP) is a querying approach designed for highly decentralized environments that satisfies these legal requirements. An important optimization avenue in LTQP is the order in which links are dereferenced, which involves prioritizing links to query-relevant documents. However, assessing and comparing the algorithmic performance of these systems is challenging due to various compounding factors during query execution. Therefore, researchers need an implementation-agnostic and deterministic metric that accurately measures the marginal effectiveness of link prioritization algorithms in LTQP engines. In this paper, we motivate the need for accurately measuring link prioritization performance, define and test such a metric, and outline the challenges and potential extensions of the proposed metric. Our findings show that the proposed metric highlights differences in link prioritization performance depending on the queried data fragmentation strategy. The proposed metric allows for evaluating link prioritization performance and enables easily assessing the effectiveness of future link prioritization algorithms. 2024
    More
  4. Workshop Opportunities for Shape-based Optimization of Link Traversal Queries
    1. 0
    2. 1
    3. 2
    4. 3
    In Proceedings of the 16th Alberto Mendelzon International Workshop on Foundations of Data Management Data on the web is naturally unindexed and decentralized. Centralizing web data, especially personal data, raises ethical and legal concerns. Yet, compared to centralized query approaches, decentralization-friendly alternatives such as Link Traversal Query Processing (LTQP) are significantly less performant and understood. The two main difficulties of LTQP are the lack of apriori information about data sources and the high number of HTTP requests. Exploring decentralized-friendly ways to document unindexed networks of data sources could lead to solutions to alleviate those difficulties. RDF data shapes are widely used to validate linked data documents, therefore, it is worthwhile to investigate their potential for LTQP optimization. In our work, we built an early version of a source selection algorithm for LTQP using RDF data shape mappings with linked data documents and measured its performance in a realistic setup. In this article, we present our algorithm and early results, thus, opening opportunities for further research for shape-based optimization of link traversal queries. Our initial experiments show that with little maintenance and work from the server, our method can reduce up to 80% the execution time and 97% the number of links traversed during realistic queries. Given our early results and the descriptive power of RDF data shapes it would be worthwhile to investigate non-heuristic-based query planning using RDF shapes. 2024
    More
  5. Conference Decentralized Search over Personal Online Datastores: Architecture and Performance Evaluation
    1. 0
    2. 1
    3. 2
    4. 3
    5. 4
    6. 5
    7. 6
    8. 7
    In Proceedings of the 24th International Conference on Web Engineering Data privacy and sovereignty are open challenges in today’s Web, which the Solid ecosystem aims to meet by providing personal online datastores (pods) where individuals can control access to their data. Solid allows developers to deploy applications with access to data stored in pods, subject to users’ permission. For the decentralised Web to succeed, the problem of search over pods with varying access permissions must be solved. The ESPRESSO framework takes the first step in exploring such a search architecture, enabling large-scale keyword search across Solid pods with varying access rights. This paper provides a comprehensive experimental evaluation of the performance and scalability of decentralised keyword search across pods on the current ESPRESSO prototype. The experiments specifically investigate how controllable experimental parameters influence search performance across a range of decentralised settings. This includes examining the impact of different text dataset sizes (0.5MB to 50MB per pod, divided into 1 to 10,000 files), different access control levels (10%, 25%, 50%, or 100% file access), and a range of configurations for Solid servers and pods (from 1 to 100 pods across 1 to 50 servers). The experimental results confirm the feasibility of deploying a decentralised search system to conduct keyword search at scale in a decentralised environment. 2024
    More
More publications