Svarm is a distributed storage solution.
svarm provides an open-source, enterprise-grade, federated, multi-tenant key/value datastore in the same vein as DynamoDB and Cassandra. It scales linearly and requires little maintenance.
svarm is a key/value datastore designed for those familiar with DynamoDB and Cassandra. It is built so that nodes can easily be added to the cluster, with the workload redistributed internally as needed. It was designed from the ground up for multi-tenant usage and minimal maintenance, and security is built in from the start.
It would be great if svarm were API-compatible with DynamoDB, but that is only aspirational.
svarm has four main layers that work together.
- Proxy: Provides for client access. Routes the request internally to the correct data node, handling redundancy and failover.
- Data Node: Nodes that manage the data itself.
- Control: Manages the other nodes in the cluster, replication strategy and joining/merging nodes.
- Reporting: Extracts usage patterns from the nodes and proxies.
There are various ways to design the layout for the system. A Docker swarm configuration will be made available for internal development or small deployments. The minimum environment consists of one instance each of the proxy, data node, and control plane. Data files and configurations can be stored and migrated to larger installations without downtime, provided the proxy starts in the network where it will be used.
The full build-out includes multiple data planes for high availability, with network isolation of the control and reporting servers for proper security.
The proxy handles authentication of the client, finding the correct node to talk to, and making the requests to the data nodes. Since replication requires multiple data nodes, the proxy will be talking to multiple data nodes at once. Those requests may be async or sync, depending on the requirements of the downstream client.
Data nodes are responsible for the storage and retrieval of the data. When a new node is added to the system, it registers with the control plane and makes itself available for data access. Once the control plane uses it for a specific table, the new node has to ask other nodes to stream the replicated data to it. For security reasons, these requests require the node to talk directly to the other nodes and do not go through the proxies. The node can manage this transfer in an async fashion. Data nodes are self-preserving and thus maintain their own system limits. They can reject work if their resources are full, informing the control plane to redistribute the workload. Ideally, nodes monitor their rate of change and, when that rate indicates system resources will be depleted, notify the control plane of the impending exhaustion.
Functionally, the data nodes can operate independently of all other components. But when a data node is added to a control node, it becomes available for a greater part of the system. Each data node can respond to the core API requirements. There will be multiple table models available, each designed to operate with high-cardinality workloads.
V1 tables look like this:
- ID: Indexed, first part of the primary composite key. This is the unique tenantResource identifier.
- C_COL: Indexed, second part of the primary composite key.
- HASH: The hash value of the ID, used for management.
- C_DATA_TYPE: Enum, either String or Integer.
- C_DATA: Nullable String.
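The V1 columns above can be sketched as Java types. This is purely illustrative; the type and field names are assumptions, not the actual node implementation.

```java
// C_DATA_TYPE: the payload is tagged as either a String or an Integer.
enum DataType { STRING, INTEGER }

// One relational row of a V1 table, mirroring the column list above.
record V1Row(
    String id,          // ID: first part of the composite key (the tenantResource)
    String cCol,        // C_COL: second part of the composite key
    int hash,           // HASH: hash of the ID, used for management
    DataType cDataType, // C_DATA_TYPE: String or Integer
    String cData        // C_DATA: nullable payload
) {}
```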
Each data node has a table describing the tables it controls. Example columns:
- RID_TENANT: Indexed, first part of the primary composite key.
- TABLE_NAME: Indexed, second part of the primary composite key.
- HASH: String hash of which keys the table is intended to store.
- QUANTITY_EST: Estimated number of entries in the table.
- TABLE_VERSION: The table version this requires.
Functionally, data requests to the node take in a JSON object. Each JSON object is broken down into multiple rows in the relational table. Each table for each tenant uses a unique database connection. These individual database instances assist with Liquibase patching as well as individual encryption keys per database instance.
Data can be added, updated, or removed from the node by key if needed, so there is no need to fetch all the rows that make up an entry.
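The flattening step described above might look like the following sketch. The JSON parsing itself (e.g. via Jackson) is omitted; we start from an already-parsed map, and the `Row` shape and class names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One relational row per top-level JSON field; columns mirror the V1 layout.
record Row(String id, String col, String dataType, String data) {}

class EntryFlattener {
    // Breaks a parsed JSON object (field -> value) into one row per field.
    static List<Row> flatten(String entryId, Map<String, Object> json) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Object> field : json.entrySet()) {
            Object v = field.getValue();
            String type = (v instanceof Integer) ? "INTEGER" : "STRING";
            rows.add(new Row(entryId, field.getKey(), type,
                             v == null ? null : v.toString()));
        }
        return rows;
    }
}
```

Because each field lands in its own row, a single column can later be updated or tombstoned by key without rewriting the rest of the entry.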
Every component in the cluster has its own 256-bit AES key. This includes:
- Proxy
- Control Plane
- Node
- Node per control plane.
- Proxy per control plane.
- Tenant per control plane
- Tenant per node.
These are used at various points for encryption. All encryption is AES/GCM/SIV. Keys are identified by their UUID, and all are 256 bits in length. The nonces are stored on the service that performs the encryption.
The key-hashing technique is Murmur3, which provides good randomness and executes very fast. Collisions are allowed in the lookup strategy, and the 32-bit variant provides enough of a namespace for us.
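For reference, here is a minimal Murmur3 32-bit (x86 variant) implementation. This is an illustration of the algorithm only; a real deployment would likely use a vetted library implementation such as Guava's `Hashing.murmur3_32_fixed()`.

```java
// Murmur3 32-bit (x86): body blocks, tail bytes, then a finalization mix.
class Murmur3 {
    static int hash32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h = seed;
        int len = data.length;
        int i = 0;
        // Body: process the input in 4-byte little-endian blocks.
        for (; i + 4 <= len; i += 4) {
            int block = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                      | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            block *= c1; block = Integer.rotateLeft(block, 15); block *= c2;
            h ^= block; h = Integer.rotateLeft(h, 13); h = h * 5 + 0xe6546b64;
        }
        // Tail: mix in the remaining 1-3 bytes, if any.
        int k = 0;
        switch (len & 3) {
            case 3: k ^= (data[i + 2] & 0xff) << 16; // fall through
            case 2: k ^= (data[i + 1] & 0xff) << 8;  // fall through
            case 1: k ^= (data[i] & 0xff);
                    k *= c1; k = Integer.rotateLeft(k, 15); k *= c2; h ^= k;
        }
        // Finalization: avalanche the bits.
        h ^= len;
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

Since collisions are tolerated in the lookup strategy, the 32-bit output is used directly as the ring-placement value.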
When a node starts up, a 256-bit key is created and stored locally that is specific to the node. When the node connects to the control plane, a second 256-bit key is retrieved from the control plane that is unique to that node in the control plane. These two keys are XORed together to produce the AES key for the node's internal configuration database. If a node is removed from the control plane, that database cannot be decrypted, as the key from the control plane is never stored on the node itself.
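The key-derivation step above is a simple byte-wise XOR of the two 256-bit keys; a minimal sketch:

```java
// Combines the locally stored node key with the per-node key fetched from the
// control plane. Neither key alone can decrypt the node's configuration
// database; only their XOR is used as the AES key, and it is never persisted.
class KeyDerivation {
    static byte[] xor(byte[] localKey, byte[] controlPlaneKey) {
        if (localKey.length != controlPlaneKey.length) {
            throw new IllegalArgumentException("keys must be the same length");
        }
        byte[] out = new byte[localKey.length]; // 32 bytes for a 256-bit key
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (localKey[i] ^ controlPlaneKey[i]);
        }
        return out;
    }
}
```

Because XOR is self-inverse, losing access to the control-plane half makes the combined key unrecoverable, which is exactly the decommissioning property described above.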
When a client is added to a node, the key for the client is made up of the following keys:
- Node key.
- Tenant key based on the node.
- Tenant key based on the control plane.
This ensures that compromising one component of the network does not let an intruder decrypt all the data. More specifically, access to one node does not reveal keys for the other nodes, which reduces the blast radius of a node-takeover event. If nothing else, it ensures that hard drives taken from a decommissioned node remain secure.
The node implementation uses a separate datasource instance per tenant table. The reasons for this include:
- Easy cleanup by file deletion.
- Liquibase can be used without having to worry about different table names.
- Overhead is minor
- Quick to build.
I have a feeling this will need to change in the future, but I'd like to get to that point first. Is this tech debt? Not necessarily. This seems like a good way to use HSQLDB at this point, and folks with a better idea are welcome to comment. In the case of MySQL, we can encrypt at the tablespace layer; for PostgreSQL, the data partition would need to be encrypted at the OS layer.
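A per-tenant-table HSQLDB datasource might be wired up with a file URL like the one built below. HSQLDB does support `crypt_type`/`crypt_key` connection properties for encrypted databases, but whether a GCM-SIV cipher is usable there depends on the JCE providers installed, so treat the cipher choice and the path layout here as assumptions rather than a tested configuration.

```java
// Builds a hypothetical per-tenant-table HSQLDB file URL. One database file
// per tenant table keeps cleanup to a file delete and lets each database
// carry its own encryption key.
class TenantDbUrl {
    static String build(String dataDir, String tenant, String table, String hexKey) {
        return "jdbc:hsqldb:file:" + dataDir + "/" + tenant + "/" + table
             + ";crypt_type=AES;crypt_key=" + hexKey;
    }
}
```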
We treat deleted columns as a state change that needs propagation, and reuse the same row to manage the deletion request. This is similar to Cassandra's tombstone approach. When a row's column is deleted, the column data field is removed, the column state is set to 'deleted', and an expiry time is set on the row. At some interval, rows whose expiration time has passed are removed automatically. This means that when data migrates to new servers, these deletion tombstones migrate as well. We still keep the updated timestamp on the rows in case there is a collision, so the latest row wins.
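The tombstone lifecycle above can be sketched as follows; the field and state names are assumptions for illustration, not the actual schema.

```java
import java.time.Duration;
import java.time.Instant;

// One column row: deletion clears the data, marks the state 'deleted', and
// stamps an expiry. A periodic sweeper drops rows whose expiry has passed;
// until then the tombstone replicates like any other write.
class ColumnRow {
    String data;
    String state = "active";
    Instant updated;
    Instant expiry; // only set once deleted

    void delete(Instant now, Duration gracePeriod) {
        data = null;          // remove the column data
        state = "deleted";    // tombstone marker, still migrated to new servers
        updated = now;        // latest-write-wins on collision
        expiry = now.plus(gracePeriod);
    }

    boolean purgeable(Instant now) {
        return "deleted".equals(state) && expiry != null && now.isAfter(expiry);
    }
}
```

The grace period must be long enough for the tombstone to reach every replica; otherwise a lagging node could resurrect the deleted column.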
The Proxy and Data nodes are managed by the Control plane. Not pictured here is how new components are added to the control plane. A UI is expected at some point, but at least initially it will only be CLI based.
The Control segment allows new nodes to be added or existing nodes to be discontinued. This also affects the proxies, as they need to know which nodes to access. The configuration for the nodes and proxies lives in etcd.
Control will use a consistent-hashing approach to map entry IDs to the nodes themselves. The familiar 'ring space' will be used, in which entry IDs and node IDs are hashed and placed on a ring based on the hash range. Data replication is done by subdividing the hashed value on the ring so it is equally divided.
Unlike a traditional approach, the control plane will place nodes on the ring manually based on usage requirements. Given enough data, hot spots will happen, so this provides a type of self-healing.
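The ring lookup itself is small enough to sketch. This illustrative version uses a `TreeMap` over signed 32-bit positions and the manual node placement described above; replication (walking further around the ring) is omitted.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: nodes sit at explicit positions, and an entry
// maps to the first node at or after its own hash, wrapping around at the end.
class Ring {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // The control plane places nodes manually based on usage requirements.
    void placeNode(int position, String nodeId) {
        ring.put(position, nodeId);
    }

    String nodeFor(int entryHash) {
        SortedMap<Integer, String> tail = ring.tailMap(entryHash);
        // Wrap around to the first node when nothing sits past this hash.
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```

Moving a node's position only reassigns the entries between it and its neighbors, which is what makes manual placement a workable hot-spot remedy.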
The control plane writes to etcd what data ranges for tenant resources each node should own. The nodes and proxies only read from the etcd data set. Those nodes and proxies that read from etcd will set up watchers to look for data changes.
Nodes themselves communicate status events directly with the control plane. Proxies are fairly divorced from the control plane and, like the nodes, only have read access to the etcd contents. From a security perspective, this limits access patterns from the edge to the configuration layer.
When a new tenant resource needs to be made, the control plane will assign nodes to the tenant resource via the etcd structure. Nodes will have watch ranges on the etcd structure based on their ID. When each node assigned to a resource is ready, it will contact the control plane directly to report that it is ready. When all nodes for a resource are ready, the control plane will update etcd with the range for the tenant resource so proxies can find the nodes that handle the data based on the hash.
When a node range needs to be split, combined or shuffled, the control plane will do the following (simplified):
- Add the new node and tenant resource metaData to the system for the nodes to init themselves.
- When the nodes are ready, add in the new changes to the tenant resource config.
- Remove the node tenant resource from the node details so nodes can start transferring data. (They transfer by the same lookup the proxies do.)
- When a node has transferred everything, it tells the control plane it's done, and the control plane finalizes the resource.
This is the most important feature of the control plane. It will be further documented later.
etcd is the standard with k8s at this point, and has much of the same functionality as ZooKeeper while being more modern and flexible. Nothing is wrong with ZooKeeper, but etcd is standard on a k8s installation and there by default. For execution environments that are not using k8s, etcd is an easy install.
All data in etcd is basically key/value pairs; however, we are using a path-style namespace. The structures follow. Note that each key consists of the namespace and the ID of the thing being named.
| Namespace/Key | Value | Purpose |
|---|---|---|
| node/{uuid}/id/{tenant}/{tenantResource} | {"hash":32767} | Metadata of a table, defined by the controller. Nodes read this data. |
| tenant/{tenant}/{tenantResource} | {"node":"{uuid}", "hash":32767, "uri":"{uri}"} | Lookup for a tenantResource's metaData. Used by proxies to find nodes, and by nodes when transferring data. |
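The path-style keys in the table above could be assembled by small helpers like these. The actual etcd client (e.g. jetcd) is out of scope here; this only shows the key layout, and the class and method names are assumptions.

```java
// Builds the etcd keys from the namespace table above. The node-scoped key is
// watched by nodes; the tenant-scoped key is read by proxies for routing.
class EtcdKeys {
    static String nodeTableKey(String nodeUuid, String tenant, String tenantResource) {
        return "node/" + nodeUuid + "/id/" + tenant + "/" + tenantResource;
    }

    static String tenantResourceKey(String tenant, String tenantResource) {
        return "tenant/" + tenant + "/" + tenantResource;
    }
}
```

Because a node's keys share the `node/{uuid}/` prefix, a single prefix watch covers every table assignment for that node.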
The reporting infrastructure pushes data to an external data store. It does not define the reports required; rather, it provides the mechanism to export the data. Data here includes the utilization of servers from the client side and the storage actually used. The goal is to provide data on client utilization of the svarm service, as well as how well the nodes are keeping up with demand. But at this point in the project, the first goal is to provide the data funnel.
Resource IDs identify structure within svarm and related properties. This format
is a variation of what is found within Amazon's ARN. It represents the unique
tenantResource of any resource within this system.
General format is:
rid:service:tenant:location:resource_type:resource_id
So, for example, a svarm object would generically look like this:
rid:svarm:<tenant>:<location>:table:<table_name>
And given a tenant ID of 1234 and a location of NA for the table named entries, we would have:
rid:svarm:1234:NA:table:entries
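A small helper for building and parsing rids might look like this; the `Rid` type and its validation are illustrative assumptions, not part of svarm itself.

```java
// rid:service:tenant:location:resource_type:resource_id
record Rid(String service, String tenant, String location,
           String resourceType, String resourceId) {

    // Parses a rid string, rejecting anything that is not six colon-separated
    // parts starting with the literal "rid".
    static Rid parse(String rid) {
        String[] parts = rid.split(":");
        if (parts.length != 6 || !parts[0].equals("rid")) {
            throw new IllegalArgumentException("not a rid: " + rid);
        }
        return new Rid(parts[1], parts[2], parts[3], parts[4], parts[5]);
    }

    @Override
    public String toString() {
        return "rid:" + service + ":" + tenant + ":" + location
             + ":" + resourceType + ":" + resourceId;
    }
}
```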
V1 URLs for svarm are effectively CRUD operations. Storage is initially based
on a simple ID for each entry in the table. The ID is either String, Integer, or
Bytes. When you define the table, you have to pick one type for the ID. Tenants
are managed with the following URLs:
- List: HTTP GET /v1/tenant
- Read: HTTP GET /v1/tenant/{tenant}
- Create: HTTP PUT /v1/tenant/{tenant}
- Delete: HTTP DELETE /v1/tenant/{tenant}
You can create, read, delete, or list tables. These URLs are as follows:
- List: HTTP GET /v1/tenant/{tenant}/table
- Create: HTTP PUT /v1/tenant/{tenant}/table/{table}
- Read: HTTP GET /v1/tenant/{tenant}/table/{table}
- Delete: HTTP DELETE /v1/tenant/{tenant}/table/{table}
Entries within a table are managed with the following URLs:
- Create: HTTP PUT /v1/tenant/{tenant}/table/{table}/id/{id}
- Read: HTTP GET /v1/tenant/{tenant}/table/{table}/id/{id}
- Update: HTTP POST /v1/tenant/{tenant}/table/{table}/id/{id}
- Delete: HTTP DELETE /v1/tenant/{tenant}/table/{table}/id/{id}
In the Create/Update requests above, you must supply a message body that includes the data to store.
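For example, a Create request for an entry might carry a body like the following. The field names are purely illustrative; each top-level field becomes a row in the node's relational table, typed as String or Integer per the V1 column layout.

```json
{
  "name": "alice",
  "age": 30
}
```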
- Rust-built
- Web Framework Actix (ref)
- IoC
HSQLDB is used for the database on the nodes because it's fast and we can add AES/GCM/SIV encryption per database. MySQL was considered, but we decided to keep an 'in process' database instead of an external process. (MySQL can encrypt per tablespace, which meets the needs here; PostgreSQL cannot do that in the same way.) Other embedded databases do not support the encryption we need.
Moving to MySQL is a possibility, but it seriously increases the complexity of an installation. This may be warranted if the project's popularity increases and someone can demonstrate it is actually worth it. I would prefer PostgreSQL over MySQL if encryption can be maintained correctly.
The control plane will use PostgreSQL, but non end-to-end testing will all be in HSQLDB.
There are many databases per server, of two types: internal and tenant. Liquibase is used when enabling a node.
We decided to use JDBI for the data-access layer instead of Hibernate. It integrates easily with Dropwizard and is quite simple to use.
Dropwizard, Jackson, Dagger, Immutables, Logback, Micrometer, AssertJ, and Jupiter are standard for me on these types of applications.
