Metadata reference architecture: A quick guide

Metadata management is a critical part of any business application. Let’s take a quick look at what metadata is, why it’s important, and how you can architect your application to ensure highly available, consistent metadata at scale.

What is metadata?

Put simply, metadata is data about other data.

Consider, for example, a cloud photo storage application. When a user uploads a photo, the image file itself would likely be stored in an object storage database, but the application would also need to store metadata – smaller data about the image – in a metadata database. This metadata would include details such as:

The user who uploaded the photo
The date the photo was uploaded
The size and resolution of the photo
People or objects the user tagged in the photo
Any user description or caption for the photo
The photo’s location in the object storage database

…and so on. Metadata like this is highly valuable for businesses because it makes other data easier to find.

For example, having the metadata listed above would make it easy to quickly locate all of a specific user’s photos. Rather than having to search through all of the image files in the object storage database, the application can query the metadata database for all entries with a particular user, and then it will have a list of the locations for each file that specific user has uploaded.
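As a rough illustration of that lookup, here is a minimal sketch. It assumes an in-memory SQLite database standing in for the metadata database, and the table name, column names, and sample rows are all hypothetical, chosen only to mirror the list of details above.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE photo_metadata (
        photo_id    INTEGER PRIMARY KEY,
        uploaded_by TEXT NOT NULL,
        uploaded_at TEXT NOT NULL,
        size_bytes  INTEGER NOT NULL,
        caption     TEXT,
        object_key  TEXT NOT NULL  -- the photo's location in object storage
    )
    """
)
# An index on the uploader lets the lookup avoid scanning every row.
conn.execute("CREATE INDEX photo_metadata_by_user ON photo_metadata (uploaded_by)")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO photo_metadata "
    "(uploaded_by, uploaded_at, size_bytes, caption, object_key) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("alice", "2023-01-05", 2400000, "Beach", "photos/alice/0001.jpg"),
        ("bob", "2023-01-06", 1100000, None, "photos/bob/0001.jpg"),
        ("alice", "2023-02-10", 3050000, "Hike", "photos/alice/0002.jpg"),
    ],
)
conn.commit()


def photo_locations_for_user(conn, user):
    """Return the object storage locations of every photo a user uploaded."""
    cur = conn.execute(
        "SELECT object_key FROM photo_metadata "
        "WHERE uploaded_by = ? ORDER BY uploaded_at",
        (user,),
    )
    return [row[0] for row in cur.fetchall()]


print(photo_locations_for_user(conn, "alice"))
```

The application never touches the image files to answer the question; it only reads small metadata rows and gets back the locations it needs.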

Why metadata matters

In many cases, metadata availability is critical to an application’s functionality. For example, our cloud photo storage application would use metadata to facilitate locating, sorting, and filtering photos. If the metadata database went offline, the photos would still exist in the application’s object storage database, but they would become inaccessible to users because the application would lack the metadata necessary to locate specific photos in that database.

Consistency is another major concern that arises when companies are architecting metadata management systems. Metadata is often duplicated across multiple databases – for example, the same metadata might be stored both on a metadata database that serves the application and on a separate database for analytics, logging, audit compliance, etc. Companies must ensure that the data on these two (or more) databases remains consistent; if inconsistency is introduced, it can become very difficult to determine which database is correct (which can, in turn, have serious implications for audits, regulatory compliance, etc.).

Consistency across regions can also be an important consideration for multi-region applications – if a region goes down, the other regions still need access to correct metadata in order to function properly. Moreover, if the regions are not consistent with each other, disaster recovery becomes very challenging.

What is metadata management?

Metadata management describes all of the tasks associated with collecting, organizing, storing, maintaining, and retrieving metadata.

As we’ve already established, metadata is often critical to the day-to-day functioning of an application. A metadata management strategy ensures that this data is identified, captured, maintained, and retrieved whenever it’s needed. Metadata management is almost always automated, so choosing tools that support common metadata management tasks can help make the process easier.

For example, if you have metadata that needs an expiration date, a database that supports row-level TTL (time to live) saves you from writing bespoke code to find and delete that metadata after a certain period of time.
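To make the trade-off concrete, here is a minimal sketch of the kind of bespoke cleanup code that row-level TTL is meant to replace. It assumes an in-memory SQLite database as a stand-in, and the `share_links` table, its columns, and the `delete_expired` helper are hypothetical names used only for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumption: in-memory SQLite stands in for the metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE share_links ("
    "link_id TEXT PRIMARY KEY, "
    "expires_at TEXT NOT NULL)"  # ISO 8601 timestamp in UTC
)


def delete_expired(conn, now):
    """Delete rows whose expiry has passed; return how many were removed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "DELETE FROM share_links WHERE expires_at <= ?",
            (now.isoformat(),),
        )
    return cur.rowcount


# Hypothetical sample data: one expired row and one still-valid row.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO share_links VALUES (?, ?)",
    [
        ("old", (now - timedelta(days=1)).isoformat()),
        ("fresh", (now + timedelta(days=1)).isoformat()),
    ],
)
conn.commit()

removed = delete_expired(conn, now)
remaining = [r[0] for r in conn.execute("SELECT link_id FROM share_links")]
print(removed, remaining)
```

Even this small version has to be scheduled, monitored, and kept correct as the schema changes, which is the work a database with built-in row-level TTL takes off your hands.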

But having a great metadata management strategy doesn’t necessarily solve the consistency problems we touched on earlier, nor does it ensure your metadata will always remain available.

To solve those problems, let’s zoom out and take a look at the architecture of an application that handles metadata without having to worry about problems with availability or consistency.

An example of metadata reference architecture

In the diagram below, we’ve laid out a simple example of how an application with a microservices architecture might integrate CockroachDB as a metadata store. Note that we’ve chosen to focus on a multi-region architecture here because of the inherent advantages that multi-region setups offer in terms of both user latency and (in some cases) regulatory compliance.

Metadata reference architecture diagram

Note that in the image above, only three services and one database are pictured per cluster for the sake of visual clarity. A real application would likely have many more services, and those services would also be sending data to other databases, not just to the metadata database. In a photo storage application, for example, the image files themselves would likely be sent to a different database that is optimized for large object storage.

Requests and data from the front end (which might be a web or mobile application) are sent to a load balancer that distributes them to the appropriate Kubernetes cluster, where they are processed by the application’s microservices.

CockroachDB can be deployed and managed within Kubernetes (rather than just alongside it), and treated like a single-instance Postgres database. But unlike a single-instance Postgres database, CockroachDB is distributed, so even if a database node goes offline, all metadata remains accessible via other nodes. In fact, depending on how it’s configured, CockroachDB can survive availability zone (AZ) and even cloud region outages.

In the architecture above, we’re solving the potential consistency problems inherent in building a metadata store for a multi-region application in two ways.

First, to solve the potential consistency issues that can arise from the dual-write problem (where an application writes the same data to two systems separately, and one of those writes can fail), we’re using CockroachDB’s Change Data Capture (CDC) feature to copy metadata to Apache Kafka (or any message queuing system) and then into an analytics database. On a database that didn’t include CDC, we could accomplish the same thing using a transactional outbox.
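For readers unfamiliar with the transactional outbox alternative, here is a minimal sketch of the pattern. It assumes an in-memory SQLite database as a stand-in, and the table names, the `save_photo_metadata` and `relay_outbox` helpers, and the `publish` callback (which would wrap a real message queue producer) are all hypothetical.

```python
import json
import sqlite3

# Assumption: in-memory SQLite stands in for the metadata database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE photo_metadata (
        photo_id    INTEGER PRIMARY KEY,
        uploaded_by TEXT NOT NULL,
        object_key  TEXT NOT NULL
    );
    CREATE TABLE outbox (
        event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        payload   TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def save_photo_metadata(conn, uploaded_by, object_key):
    """Write the metadata row and its outbox event in one transaction."""
    with conn:  # both inserts commit together or roll back together
        cur = conn.execute(
            "INSERT INTO photo_metadata (uploaded_by, object_key) VALUES (?, ?)",
            (uploaded_by, object_key),
        )
        payload = json.dumps(
            {"photo_id": cur.lastrowid, "uploaded_by": uploaded_by,
             "object_key": object_key}
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def relay_outbox(conn, publish):
    """Publish pending events in order, marking each one after it is sent.

    A crash between publish and the UPDATE would resend the event on the
    next run, so consumers should tolerate duplicates (at-least-once).
    """
    pending = conn.execute(
        "SELECT event_id, payload FROM outbox "
        "WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in pending:
        publish(payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?",
                (event_id,),
            )
    return len(pending)


sent = []  # stands in for a message queue topic
save_photo_metadata(conn, "alice", "photos/alice/0001.jpg")
relayed = relay_outbox(conn, sent.append)
print(relayed, sent)
```

Because the application only ever writes to one database, there is no second write that can fail independently; a separate relay process is responsible for getting events to the queue. CDC plays the role of that relay without the extra table or polling code.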

Second, to solve the potential consistency problems that can arise in multi-region deployments, we’re taking advantage of CockroachDB’s multi-active availability model, which avoids some of the problems inherent in active-passive and active-active configurations and allows for synchronous and performant writes across regions natively.

The choice of CockroachDB here also enables an easy road to multi-region for developers, since multi-region CockroachDB databases can still be treated as a single logical database by the application. This ensures our metadata will be highly available, and also allows for “data homing” down to the row level, which can be helpful for both latency (locating data in the cloud region physically closest to the user) and regulatory compliance.

Of course, real-world metadata architectures can get significantly more complex. When designing the architecture for your own application, it may be helpful to look at public examples such as Netflix’s device management architecture, which uses CockroachDB to store metadata related to all of the different hardware devices with which Netflix apps are compatible.

About the author

Charlie Custer
