https://www.yangdb.org/
wiki
Version Info
Release.notes
Lior Perry
Roman Margolis
Moti Cohen
Elad Wies
Shimon Benoshti
A post introducing our new open-source initiative for building a scalable distributed graph DB over Opensearch: https://www.linkedin.com/pulse/making-db-lior-perry/
Another usage of Opensearch as a graph DB https://medium.com/@imriqwe/elasticsearch-as-a-graph-database-bc0eee7f7622
Graph databases have had a tremendous impact over the last few years, particularly in relation to social networks and their effect on our everyday activity.
The once mighty (and lonely) RDBMS is now obliged to make room for an emerging and increasingly important partner in the data center: the graph database.
Twitter is using it, Facebook is using it, even online dating sites are using it: they all use relationship graphs. After all, social is social, and ultimately, it’s all about relationships.
There are two main elements that distinguish graph technology: storage and processing.
Graph storage commonly refers to the structure of the database that contains graph data.
Such graph storage is optimized for graphs in many aspects, ensuring that data is stored efficiently, keeping nodes and relationships close to each other in the actual physical layer.
Graph storage is classified as non-native when the storage comes from an outside source, such as a relational, columnar or any other type of database (in most cases a NoSQL store is preferable).
Non-native graph databases usually consist of existing relational, document and key-value stores, adapted to the query scenarios of the graph data model.
Graph Processing includes accessing the graph, traversing the vertices & edges and collecting the results.
A traversal is how you query a graph: navigating from starting nodes to related nodes, following relationships according to some rules, in order to find answers to questions like "what music do my friends like that I don’t yet own?"
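As a minimal sketch of such a traversal, the question about friends' music can be answered over a toy in-memory graph. The names, relationship types (FRIEND, LIKES, OWNS) and data here are hypothetical and only illustrate the idea; this is not how YangDB stores or traverses data.

```python
# Hypothetical toy graph: adjacency lists grouped by relationship type.
FRIEND = {"me": ["ann", "bob"], "ann": ["me"], "bob": ["me"]}
LIKES = {"ann": ["Kind of Blue", "Abbey Road"], "bob": ["Abbey Road", "Thriller"]}
OWNS = {"me": ["Abbey Road"]}


def friends_music_i_dont_own(person):
    """Traverse person -FRIEND-> friend -LIKES-> album, skipping albums the person OWNS."""
    owned = set(OWNS.get(person, []))
    found = set()
    for friend in FRIEND.get(person, []):      # first hop: follow FRIEND edges
        for album in LIKES.get(friend, []):    # second hop: follow LIKES edges
            if album not in owned:             # rule: exclude what is already owned
                found.add(album)
    return sorted(found)


print(friends_music_i_dont_own("me"))  # ['Kind of Blue', 'Thriller']
```

The starting node is "me", the rules are the two relationship types to follow, and the collected results are the albums that survive the ownership filter.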
One of the more popular models for representing a graph is the Property Model.
This model contains connected entities (the nodes) which can hold any number of attributes (key-value-pairs).
Nodes have a unique id and a list of attributes that represent their features and content.
Nodes can be marked with labels representing their different roles in your domain. In addition to relationship properties, labels can also serve as metadata over graph elements.
Nodes are often used to represent entities but depending on the domain relationships may be used for that purpose as well.
A relationship is represented by the source and target nodes it connects and, when there are multiple connections between the same vertices, by an additional label or property that distinguishes them (the type of relationship).
Relationships organize nodes into arbitrary structures, allowing a graph to resemble a list, a tree, a map, or a compound entity — any of which may be combined into yet more complex structures.
Very much like foreign keys between tables in the relational DB model, relationships in the graph model describe the relations between the vertices.
One major difference in this model (compared to the strict relational schema) is that its schema-less structure allows adding and removing relationships between vertices without any constraints.
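The property model described above can be sketched as a small in-memory data structure. The class and field names (`Node`, `Relationship`, `PropertyGraph`, `rel_type`) are illustrative assumptions, not YangDB types, and the endpoint check in `add_relationship` is a choice of this sketch rather than part of the model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str                                          # unique id
    labels: set = field(default_factory=set)         # roles in the domain
    properties: dict = field(default_factory=dict)   # key-value attributes


@dataclass
class Relationship:
    source: str                                      # id of the source node
    target: str                                      # id of the target node
    rel_type: str                                    # distinguishes parallel connections
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_relationship(self, rel):
        # No schema restricts the relationship type; this sketch only
        # requires that both endpoints already exist.
        if rel.source not in self.nodes or rel.target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.relationships.append(rel)

    def remove_relationship(self, source, target, rel_type):
        self.relationships = [
            r for r in self.relationships
            if (r.source, r.target, r.rel_type) != (source, target, rel_type)
        ]

    def types_between(self, source, target):
        return sorted(r.rel_type for r in self.relationships
                      if r.source == source and r.target == target)


g = PropertyGraph()
g.add_node(Node("p1", {"Person"}, {"name": "Alice"}))
g.add_node(Node("p2", {"Person"}, {"name": "Bob"}))
g.add_relationship(Relationship("p1", "p2", "KNOWS", {"since": 2019}))
g.add_relationship(Relationship("p1", "p2", "WORKS_WITH"))
print(g.types_between("p1", "p2"))  # ['KNOWS', 'WORKS_WITH']
g.remove_relationship("p1", "p2", "WORKS_WITH")
print(g.types_between("p1", "p2"))  # ['KNOWS']
```

Two relationships connect the same pair of nodes and are told apart only by their type, and either can be added or removed without touching a schema.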
An additional graph model is the Resource Description Framework (RDF) model.
Our use case is in the domain of social networks: a very large social graph that must be frequently updated and available for both:
simple (mostly textual) search
graph based queries.
All reads and writes are performed concurrently, with reasonable response times and ever-growing throughput.
The first requirement was fulfilled using Opensearch, a well-known and established NoSQL document search and storage engine capable of holding very large volumes of data.
For the second requirement we decided that our best solution would be to use Opensearch as the non-native graph-DB storage layer.
As mentioned before, a graph-DB storage layer can be implemented using non-native storage such as a NoSQL store.
In a future discussion I’ll go into detail on why Neo4j, the most popular community alternative for a graph DB, could not fit our needs.
The first issue on our plate is to design the data model representing the graph, as a set of vertices and edges.
With Opensearch we can utilize its powerful search abilities to efficiently fetch node & relation documents according to the query filters.
In Opensearch each index can be described as a table for a specific schema. The index itself is partitioned into shards, which allow scale and redundancy (with replicas) across the cluster.
A document is routed to a particular shard in an index using the following formula:
shard_num = hash(_routing) % num_primary_shards
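A minimal sketch of the formula above, for illustration only. CRC32 is used here as a stand-in hash; it is not the hash function the engine uses internally, so the shard numbers it yields will not match a real cluster. The function name `shard_for` and the idea of routing edges by their source vertex id are assumptions made for this example.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """shard_num = hash(_routing) % num_primary_shards, with CRC32 as a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Hypothetical choice: route every edge by the id of its source vertex,
# so all outgoing edges of a vertex land on the same shard.
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
for source, target in edges:
    print(source, "->", target, "shard", shard_for(source, 5))
```

Because the result depends only on the routing value and the number of primary shards, documents that share a routing value are always co-located, and the number of primary shards cannot change without moving documents.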
Each index has a schema (called a type in Opensearch) which defines the document structure (called a mapping in Opensearch). Each index can hold only a single mapping type (a restriction introduced in Elasticsearch 6 and inherited by Opensearch).
The vertices index will contain the vertex documents with their properties, and the edges index will contain the edge documents with their properties.
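To make the two-index layout concrete, here is a sketch of what vertex and edge documents might look like, together with a helper that serializes them into the newline-delimited body expected by the bulk API. The index names (`vertices`, `edges`), the field names and the helper `to_bulk_body` are assumptions for illustration; the actual YangDB schema may differ.

```python
import json

# Hypothetical vertex documents: one document per node, properties inline.
vertex_docs = [
    {"id": "p1", "type": "Person", "name": "Alice"},
    {"id": "p2", "type": "Person", "name": "Bob"},
]

# Hypothetical edge documents: each edge names its source and target vertex.
edge_docs = [
    {"id": "e1", "type": "KNOWS", "source": "p1", "target": "p2", "since": 2019},
]


def to_bulk_body(index_name, docs):
    """Build a bulk request body: an action line followed by the document, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"   # the bulk body must end with a newline


print(to_bulk_body("vertices", vertex_docs))
print(to_bulk_body("edges", edge_docs))
```

Keeping `source` and `target` on every edge document lets a single filtered search over the edges index fetch all edges that leave or enter a given set of vertices.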
A query language is the way we describe how to traverse the graph (the data source).
There are a few graph-oriented query languages:
Cypher is a query language for the Neo4j graph database (see the openCypher initiative)
Gremlin is an Apache Software Foundation graph traversal language for OLTP and OLAP graph systems.
SPARQL is a query language for RDF graphs.
Some of the languages are more pattern based and declarative, some are more imperative – they all describe the logical way of traversing the data.
Let’s consider Cypher, a declarative, SQL-inspired language for describing patterns in graphs visually using an ASCII-art syntax.
It allows us to state what we want to select, insert, update or delete from our graph data without requiring us to describe exactly how to do it.
Once a logical query is given, we need to translate it to the physical layer of the data storage, which is Opensearch.
Opensearch has a query DSL which is focused on search and aggregations, not on traversal, so we need an additional translation phase that takes into account the schematic structure of the graph (and the underlying indices).
Logical to physical query translation is a process that involves a few steps:
validating the query against the schema
translating the labels into real schema entities (indices)
creating the physical Opensearch query
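The three steps above can be sketched for a single-node pattern such as the Cypher query `MATCH (p:Person {name: 'Alice'}) RETURN p`. The schema dictionary, the function names and the `type` field are hypothetical; the sketch skips Cypher parsing entirely and starts from an already-parsed label and property filter, so it shows the shape of the translation rather than YangDB's actual implementation.

```python
# Hypothetical graph schema: label -> backing index and allowed properties.
SCHEMA = {
    "Person": {"index": "vertices", "properties": {"name", "age"}},
    "KNOWS": {"index": "edges", "properties": {"since"}},
}


def validate(label, filters):
    """Step 1: validate the logical query against the schema."""
    if label not in SCHEMA:
        raise ValueError(f"unknown label: {label}")
    unknown = set(filters) - SCHEMA[label]["properties"]
    if unknown:
        raise ValueError(f"unknown properties for {label}: {sorted(unknown)}")


def translate(label, filters):
    """Steps 2 and 3: resolve the label to an index and build the query body."""
    validate(label, filters)
    index = SCHEMA[label]["index"]                       # label -> real index
    clauses = [{"term": {"type": label}}]                # restrict to the label
    clauses += [{"term": {k: v}} for k, v in sorted(filters.items())]
    return index, {"query": {"bool": {"filter": clauses}}}


# Logical form of: MATCH (p:Person {name: 'Alice'}) RETURN p
index, body = translate("Person", {"name": "Alice"})
print(index)  # vertices
print(body)
```

Filter clauses are used rather than scoring clauses because a graph pattern either matches or does not; relevance ranking plays no role in this translation.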
This is a high-level view of the process. In practice there will be more stages that optimize the logical query; in some cases it is possible to create multiple physical plans (execution plans) and rank them according to some efficiency (cost) strategy, such as the count of elements that need to be fetched.
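Ranking execution plans by the number of elements they need to fetch can be sketched as follows. The plan names and the per-step estimates are invented for illustration, and summing the step counts is just one simple cost strategy, not necessarily the one YangDB applies.

```python
# Hypothetical estimates of how many elements each step of a plan would fetch
# for a pattern like (a:Person)-[:KNOWS]->(b:Person {name: 'Alice'}).
PLANS = {
    "start_from_all_persons": [1_000_000, 5_000_000, 3],
    "start_from_filtered_person": [3, 450, 450],
}


def plan_cost(step_counts):
    """Cost of a plan: total number of elements fetched across its steps."""
    return sum(step_counts)


def rank_plans(plans):
    """Return plan names ordered from cheapest to most expensive."""
    return sorted(plans, key=lambda name: plan_cost(plans[name]))


print(rank_plans(PLANS)[0])  # start_from_filtered_person
```

Starting from the selective side of the pattern keeps every later step small, which is why the second plan wins under this cost function.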
We started by discussing the purpose of graph DBs in today’s business use cases and reviewed different models for representing a graph. We went over the fundamental logical building blocks that a potential graph DB should consist of, and discussed an existing NoSQL candidate to fulfill the storage layer requirements.
Once we selected Opensearch as the storage layer, we took the LDBC Social Network Benchmark graph model and simplified it so that it is optimized for that specific storage. We discussed the actual storage schema with its redundant properties and reviewed the Cypher language for querying the storage in an SQL-like graph pattern language.
We then looked at the actual transformation of the Cypher query into a physical execution query that will be run by Opensearch.
In the last section we took a simple graph query and drilled down into the details of the execution strategies and the bulking mechanism.
Installation tutorial:
Schema creation tutorial:
Data Loading tutorial:
Query the Graph tutorial:
Projection materialization & count tutorial: