Introducing: Metrics reporting done right

I would like to share a passion project of mine that has been in the works for more than 5 years (has it really been that long???) - a metrics reporting framework for NodeJS:

GitHub - ysa23/metrics-reporter: A reporting framework for data point information (measurements and time series) to aggregators like Graphite & DataDog

Today I am releasing the first official major RTM version of the library - 1.0.0. 🎉

In the past five years, the library has been used by multiple high-scale, high-load production applications, so it's finally time to release it officially 💪

For detailed documentation please refer to the library's README

Core principles

Metrics are one of three important techniques for production visibility (the others are logging and tracing 😉). As such, there are many solutions in every ecosystem for reporting them.

The goal of the metrics-reporter library is to create a common and simple interface for metrics that is open for extension and performs well under high throughput in the NodeJS ecosystem.

Oh, and most importantly, to reduce npm-hell, the library does not have ANY runtime dependencies - and we commit to keeping it that way! 🤝

Abstraction & Extensibility

There's an age-old discussion about abstraction for storage layers. What are the chances that during the lifetime of an application you'll be replacing the underlying database? Come on, it hardly ever happens. Nevertheless, if you look at codebases, you will always see some abstraction of the storage layer in some way. Some will be more coupled than others.

Metrics are not the same. In the lifespan of an application, moving between metrics aggregators is quite common. For example: deciding to move from in-house to a SaaS provider, and vice versa, or moving between different SaaS providers. Metrics aggregators are highly cost driven. There is always a tension between the effort of transitioning to a new provider and the cost savings. As such, keeping loose coupling between the aggregator and the application is a good practice. A good abstraction reduces that transition cost (at least the integration cost), making it easier to move between providers when a better price is available.

Developers usually create their own abstraction of metrics reporting that will allow them to de-couple the aggregator from the codebase, while still using the terminology from said aggregator in the abstraction.

Different aggregators use different terminology for different metric types, usually to reflect advanced usages for the service they provide.

metrics-reporter abstraction is two-fold:

First, the API of metrics-reporter uses simple terminology that is not coupled to a specific aggregator, and is aimed at what you want to report, instead of how to report it or how it will be stored. Instead of finding concepts such as gauge, histogram, rate and distribution, you will find increment, value and meter:

metrics.space('queue', { name: queueName }).space('size').increment();

metrics.space('healthcheck', { status: 'failed' }).increment();

metrics.space('posts').value(posts.length);
Reporting metrics
This doesn't mean that these types are not needed, or that they will not be added in the future in some form. The goal is to look at the terminology from a user's perspective, make it easier to use and make the code more readable.

Second, metrics-reporter completely de-couples the underlying aggregator using plugins. During setup you create a reporter for your aggregator, plug it into the metrics object, and that's it. From that moment, every usage of said metrics object will report to the aggregator(s) of your choice.

const datadogReporter = new DataDogReporter({
  host: agentHost,
});

const consoleReporter = new ConsoleReporter();

const metrics = new Metrics({ reporters: [datadogReporter, consoleReporter] });
Plugin reporters

Different projects and companies use different solutions for metrics: in-house deployments (such as Graphite or Prometheus), or SaaS solutions such as Logz.io and DataDog (and many many others).

As of today, the library comes out of the box with Graphite and DataDog (via StatsD) as aggregators, but the intention is for these to be used as live examples. If you want to use a different time series tech, like Prometheus, InfluxDB, ElasticSearch or a custom in-house solution, it's fairly easy to implement a reporter and plug it into the framework.
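
To give a feel for how small a plugin reporter can be, here is a minimal, self-contained sketch. It does not use the metrics-reporter API: the `LineReporter` class, the `report(key, value, tags)` method and the line format are all assumptions made up for illustration, so please check the README for the real reporter contract.

```javascript
// Hypothetical sketch, NOT the metrics-reporter API.
// Assumption: a reporter receives a metric key, a numeric value and a tags object.
class LineReporter {
  // `write` is an injected function that receives one formatted line.
  // A real reporter might write that line to a socket or an HTTP client.
  constructor({ write }) {
    this.write = write;
  }

  report(key, value, tags = {}) {
    // Made-up line format: "key;tag=value,tag=value value"
    const tagPart = Object.entries(tags)
      .map(([name, tagValue]) => `${name}=${tagValue}`)
      .join(',');
    const line = tagPart ? `${key};${tagPart} ${value}` : `${key} ${value}`;
    this.write(line);
  }
}

// Usage: collect the formatted lines in an array instead of sending them.
const lines = [];
const lineReporter = new LineReporter({ write: (line) => lines.push(line) });
lineReporter.report('queue.size', 1, { name: 'emails' });
lineReporter.report('posts', 3);
```

Because the transport is injected, the formatting logic stays independent of where the lines end up, which is the same kind of separation the plugin approach is after.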

We plan to release additional reporters in the near future (separated from the core library), for the more common aggregators. Prometheus will be next on our list.

Using plugins, you can also set up metrics reporting per environment. On your local machine you may want to show reported metrics on your console, while in the production environment you will be using Graphite.

// Read the environment from process.env rather than an undeclared global
const reporters = process.env.NODE_ENV === "production" ?
  [new GraphiteReporter({ host: '127.0.0.1' })] :
  [new ConsoleReporter()];

const metrics = new Metrics({ reporters });
Change reporters by environment

Multiple reporters are supported, which makes it easier to transition gradually between different providers.

The core library also comes with a MemoryReporter that you can set up for testing! That's right - with metrics-reporter you can write unit tests for metrics reporting without mocks 🏆
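
The idea behind testing without mocks can be sketched as follows. The classes below are hypothetical stand-ins written for this example, not the library's `MemoryReporter` or `Metrics`, and the `reports` array and `report` method are assumptions; the README describes the real ones.

```javascript
// Hypothetical stand-ins, NOT the library's MemoryReporter or Metrics classes.
class InMemoryReporter {
  constructor() {
    this.reports = [];
  }

  report(key, value, tags) {
    this.reports.push({ key, value, tags });
  }
}

class FakeMetrics {
  constructor({ reporters }) {
    this.reporters = reporters;
  }

  value(key, value, tags = {}) {
    this.reporters.forEach((reporter) => reporter.report(key, value, tags));
  }
}

// Application code under test: it only knows about the metrics object.
function publishPosts(metrics, posts) {
  metrics.value('posts', posts.length);
  return posts.length;
}

// The test plugs in the in-memory reporter and inspects what was recorded,
// instead of mocking a network client.
const memoryReporter = new InMemoryReporter();
const testMetrics = new FakeMetrics({ reporters: [memoryReporter] });
publishPosts(testMetrics, ['first', 'second']);
```

After the call, `memoryReporter.reports` holds a single entry for the `posts` key, which a unit test can assert on directly.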

Performance

A key consideration when developing metrics-reporter is performance. Reporting metrics should have minimal impact on the latency and capacity of an application.

Memory & CPU utilization are key factors here, and both are addressed in the library. To be more specific, they are addressed per reporter, with some common similarities.

Buffering

Sending any data to an external source has its own costs and effects. Reducing the number of calls to the external provider is essential, especially in high-load scenarios like sending metrics. To do so, both DataDogReporter and GraphiteReporter use buffering: instead of sending a metric to the respective provider when it's reported, metrics-reporter adds it to an in-memory buffer. The buffer is flushed either at an interval or when it is full.
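
As a rough illustration of that flush policy, here is a self-contained sketch. It is not the library's implementation: `BufferedSender` and the `send`, `maxBufferSize` and `flushIntervalMs` options are names assumed for this example only.

```javascript
// Hypothetical sketch of size- and interval-based buffering.
// NOT the library's implementation; all option names are assumptions.
class BufferedSender {
  constructor({ send, maxBufferSize = 100, flushIntervalMs = 1000 }) {
    this.send = send; // called once per batch, not once per metric
    this.maxBufferSize = maxBufferSize;
    this.buffer = [];
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
    // Do not keep the Node process alive just for the flush timer.
    this.timer.unref();
  }

  add(metric) {
    this.buffer.push(metric);
    // Flush early when the buffer reaches its cap.
    if (this.buffer.length >= this.maxBufferSize) {
      this.flush();
    }
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }

  close() {
    clearInterval(this.timer);
    this.flush(); // send whatever is left
  }
}

// Usage: record each batch that would have been sent to the provider.
const batches = [];
const sender = new BufferedSender({
  send: (batch) => batches.push(batch),
  maxBufferSize: 3,
  flushIntervalMs: 1000,
});

for (let i = 1; i <= 7; i += 1) {
  sender.add(`metric.${i}`);
}
sender.close();
```

In this run, seven reported metrics end up as three calls to `send` (two full batches and one final flush on close) rather than seven separate calls.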

By now, you're probably wondering: "If we buffer stuff in memory, then surely it will have a negative effect on memory consumption, right?" Well, intuitively you are right - we are using more memory. But the real question is: are we using more memory than we would by sending each report to the provider right away? Now here is the caveat: in Node, every async operation consumes memory. When a promise is created, Node needs to save the current position in the stack, the code of the handler, scoped variables and more. So if we are reporting metrics at a rate of 100 per second, we are creating 100 promises per second, and you can see where everything can go wrong.

An additional caveat is the resolving of said promises: resolved promises go into the event loop and can increase latency, especially in high-scale use cases such as metrics.

Buffering gives us better control over memory consumption, as the size of the buffer is capped and the number of open promises is reduced to the minimum necessary. Buffering also has a positive effect on CPU utilization, since the event loop is free to handle the application code instead of handling resolved promises from metrics reports.

Buffering is completely configurable and has sensible defaults, but it is highly recommended to tweak them according to the application's needs. For more information, please refer to the README.

Buffering is specific to each reporter. As mentioned, the reporters currently available within the core library are meant to be used as live examples. As the framework moves forward, some of the common internals, like buffering, will be exposed to allow implementing new reporters on top of these tested building blocks.

What's next?

This is only the beginning of the path. We have a long road ahead of us.

Future plans include:

  1. Additional reporters for common metrics aggregators - Prometheus is first on the list
  2. Performance improvements - to reduce the memory footprint even further

But most important of all are additional enhancements driven by community feedback, and that's where you come in 😉.

We welcome you to try out the library and give feedback as issues, PRs or even comments on this post, about anything you think we can change or improve, or that is missing. Community feedback is very important to us, and will be heavily taken into account as we move forward.


Thanks

So many people helped me with this project, thank you for your support!

I would especially like to thank Igor (GitHub, Twitter, Medium), my partner in the development of this library for the last couple of years, and Johnny (GitHub, Twitter, website), who started the first version of this concept and gave me the inspiration to start working on it as open source.

I would also like to thank JetBrains for providing me with an Open Source WebStorm license used for the development of the library.


Photo by Mitchel Boot on Unsplash