Metadata Project Phase 1 Completed


A couple of weeks ago we hit an exciting milestone: all of the initial work to implement the “metadata system” itself is now complete for the Java agent. I started this work back in Feb/March, and it feels good to look back at what we have accomplished so far. There’s still a lot of work to do, but the foundation is now in place to build on, and I think it’s a good time to call this a “phase 1 complete” moment and reflect a bit.

I’ve been at Grafana for almost 4 months now, and throughout that time I have been consistently supported and encouraged to push this project forward. It’s been interesting to join an Observability company and start to see areas where this metadata can help improve the user experience with OpenTelemetry in Grafana products, and with our various teams who are still getting up to speed with OpenTelemetry.

Most of this blog post was written on a plane while flying back from a fun week in Belgium at an observability team offsite. One of the days we had a hackathon, and my team combined a few tools, including the Ecosystem Explorer (which is built on this metadata) to build a workflow that can analyze a Java application codebase, identify its instrumented libraries using its dependency tree, identify all the metrics emitted by those instrumentations, and then automatically generate an APM-like Grafana dashboard using the new Grafana Assistant. It was pretty cool (in my opinion), and demonstrates the utility of having this structured metadata about instrumentations.

These are some previous posts where I have written about this project, for more context:

OK, back to the topic of the “phase 1” milestone: let’s review what we’ve done so far.

Metadata Files

We defined and implemented a metadata.yaml schema for individual instrumentation modules to track various pieces of information. This involved:

  • Hashing out the various metadata attributes that would be useful, along with their definitions
  • Creating documentation explaining how to write these metadata files
  • Working through the 250+ modules to populate this information (still a work in progress)
  • Defining a set of “classifications” that can be used to group instrumentations by their general purpose
  • Defining a set of “features” that can be used to further classify instrumentations beyond the original 3 classifications
  • Incorporating a way to track the semantic conventions implemented for each instrumentation
  • Coming up with a way to define telemetry manually for instrumentation that might not have an easy way to generate it via tests

Here is what the schema looks like as of now:

description: "This instrumentation enables..."    # Description of the instrumentation module
semantic_conventions:                             # List of semantic conventions the instrumentation adheres to
  - HTTP_CLIENT_SPANS
  - DATABASE_CLIENT_SPANS
  - JVM_RUNTIME_METRICS
features:                                        # List of features this instrumentation provides
  - HTTP_ROUTE
  - CONTEXT_PROPAGATION
disabled_by_default: true                         # Defaults to `false`
classification: internal                          # instrumentation classification: library | internal | custom
library_link: https://...                         # URL to the library or framework's main website or documentation
configurations:
  - name: otel.instrumentation.common.db-statement-sanitizer.enabled
    description: Enables statement sanitization for database queries.
    type: boolean               # boolean | string | list | map
    default: true
override_telemetry: false                         # Set to true to ignore auto-generated .telemetry files
additional_telemetry:                             # Manually document telemetry metadata
  - when: "default"
    metrics:
      - name: "metric.name"
        description: "Metric description"
        type: "COUNTER"
        unit: "1"
        attributes:
          - name: "attribute.name"
            type: "STRING"
    spans:
      - span_kind: "CLIENT"
        attributes:
          - name: "span.attribute"
            type: "STRING"

This “schema” is still evolving, but I feel pretty good about where it is at this point. We are now starting to explore what similar metadata would (or already does) look like for other OpenTelemetry components, like the Collector, JavaScript, and Go.
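To make the schema above a bit more concrete, here is a minimal Python sketch of validating a parsed metadata entry. The field names and allowed values come from the schema shown above; the validation function itself is my own illustration, not the repository’s actual tooling.

```python
# Illustrative validator for a parsed metadata.yaml entry.
# Field names mirror the schema above; the rules are a sketch,
# not the real build-time checks in the Java instrumentation repo.

ALLOWED_CLASSIFICATIONS = {"library", "internal", "custom"}
ALLOWED_CONFIG_TYPES = {"boolean", "string", "list", "map"}

def validate_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems (empty means valid)."""
    problems = []
    if not meta.get("description"):
        problems.append("missing description")
    classification = meta.get("classification")
    if classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unknown classification: {classification!r}")
    for cfg in meta.get("configurations", []):
        if cfg.get("type") not in ALLOWED_CONFIG_TYPES:
            problems.append(f"bad config type for {cfg.get('name')}")
    return problems

example = {
    "description": "This instrumentation enables...",
    "classification": "internal",
    "configurations": [
        {"name": "otel.instrumentation.common.db-statement-sanitizer.enabled",
         "type": "boolean", "default": True},
    ],
}
print(validate_metadata(example))  # []
```

A check like this is what makes a shared schema valuable: once the fields are agreed upon, tooling can enforce them across all 250+ modules instead of relying on review alone.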

Using Test Suites To Identify Telemetry Emitted

We implemented a way to instrument test suites in order to intercept and document the telemetry they emit. Different test suite configurations can be annotated so that telemetry is distinguished by the conditions under which it was produced (configuration options, etc.). This also involved refactoring many of the test suite modules to support the approach.
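As a hypothetical illustration, an auto-generated telemetry entry could take a shape like the following, grouping what was intercepted by the test-suite condition under which it was captured (mirroring the `when`/`metrics`/`spans` structure of the `additional_telemetry` block in the schema above; the metric and attribute names here are invented for the example):

```yaml
# Hypothetical shape of an intercepted-telemetry entry; the actual
# generated .telemetry files in the repo may differ in detail.
telemetry:
  - when: "default"
    metrics:
      - name: "example.client.duration"
        type: "HISTOGRAM"
        unit: "s"
  - when: "otel.instrumentation.common.db-statement-sanitizer.enabled=false"
    spans:
      - span_kind: "CLIENT"
        attributes:
          - name: "db.query.text"
            type: "STRING"
```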

For more information on this, see an earlier post: OpenTelemetry Java Instrumentation Metadata - Telemetry Variations


Documenting Standalone Library Instrumentation

Within the Java instrumentation repo, there are two main types of instrumentation:

  • Java Agent Instrumentations - these are instrumentations that are applied via the java agent at runtime
  • Standalone Library Instrumentations - these are instrumentations that are applied by adding the library dependency to your project and doing some manual setup

There were already READMEs for a handful of the standalone library instrumentations, so I extracted the “template” they used and formalized it, creating a document that explains how to write these READMEs. I then created a README for each instrumentation that was missing one, and incorporated them into the Ecosystem Explorer, where we render them on the instrumentation page, like in this example:

Library readmes

This work was tracked via issue: #6947

Created An Instrumentation Catalog

Using the new metadata and telemetry, we created a YAML catalog of all this information in the docs/instrumentation-list.yaml file. We now have a system that crawls the repo nightly, combines metadata files, generated telemetry files, and some other analysis, and automatically keeps the list up to date.

As we continue populating the metadata and configuring telemetry collection, this catalog becomes more useful.
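The merge step at the heart of that nightly job can be sketched as follows. The real generator lives in the Java repo and reads files from disk; this Python sketch just shows the idea with plain dicts, and the module names are invented for illustration.

```python
# Illustrative sketch of combining per-module metadata and intercepted
# telemetry into a single catalog, in the spirit of
# docs/instrumentation-list.yaml. Not the repository's actual generator.

def build_catalog(modules: dict[str, dict], telemetry: dict[str, list]) -> list[dict]:
    """Merge metadata and telemetry into one sorted catalog, keyed by module name."""
    catalog = []
    for name in sorted(modules):
        entry = {"name": name, **modules[name]}
        # Attach auto-generated telemetry unless the module opts out
        # via the override_telemetry flag from the metadata schema.
        if not modules[name].get("override_telemetry", False):
            entry["telemetry"] = telemetry.get(name, [])
        catalog.append(entry)
    return catalog

modules = {
    "jdbc": {"description": "JDBC client instrumentation", "classification": "library"},
    "internal-reflection": {"description": "...", "classification": "internal"},
}
telemetry = {"jdbc": [{"when": "default", "spans": [{"span_kind": "CLIENT"}]}]}
print([e["name"] for e in build_catalog(modules, telemetry)])
# ['internal-reflection', 'jdbc']
```

Keeping the merge deterministic (sorted by module name) matters in practice: it means the nightly job only opens a PR when the underlying data actually changed, not when iteration order shifted.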

As of today, 130 instrumentations have been populated with descriptions, and 120 are configured to intercept and document their emitted telemetry. We are around 45% through this effort.

Documentation Synchronization Automation (opentelemetry.io)

Now that we have some structured data, we can start to automate some of the documentation maintenance tasks. We started with:

  • Nightly runs that check for changes to any of the instrumentation data and automatically create PRs accordingly
  • Post-release jobs that compare the repo catalog against the opentelemetry.io documentation to ensure no instrumentation is missing, opening an issue if there are differences

I still plan to automate the final step of actually creating the PRs that update the opentelemetry.io documentation, but that is still a work in progress.
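The post-release consistency check boils down to a set difference between the two sources of truth. Here is a minimal sketch under that assumption; the instrumentation names are invented, and the real job works against the actual catalog and docs site rather than hard-coded sets.

```python
# Sketch of the post-release check: compare the repo catalog against the
# instrumentation documented on opentelemetry.io and report anything that
# exists on one side but not the other. Names are illustrative only.

def find_missing(repo_catalog: set[str], docs_site: set[str]) -> dict[str, set[str]]:
    """Return instrumentations present on one side but not the other."""
    return {
        "missing_from_docs": repo_catalog - docs_site,
        "missing_from_repo": docs_site - repo_catalog,
    }

diff = find_missing({"jdbc", "netty", "spring-web"}, {"jdbc", "netty"})
print(sorted(diff["missing_from_docs"]))  # ['spring-web']
```

When either set in the result is non-empty, the job opens an issue listing the differences so a human can decide how to reconcile them.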

The nightly job that synchronizes the instrumentation list catalog has already paid off in unexpected ways. It recently caught a change in how Gradle configurations worked after a version upgrade: a large portion of our tests had started failing silently, and the nightly job alerted us when it opened a PR trying to remove a significant portion of the telemetry emitted by dozens of instrumentations (because no telemetry was being emitted anymore). Since the tests were failing silently, we might not have caught the issue for a while otherwise. A nice win.

Ecosystem Explorer / Instrumentation Explorer

We created a POC website that allows users to explore the instrumentation, including ways to filter and search based on various metadata attributes. This is still a work in progress, but it shows the potential of what can be done with this data. You can check it out here: https://jaydeluca.github.io/instrumentation-explorer/

Some of the capabilities include:

  • Viewing telemetry differences across different agent versions
  • Integration with the otel-checker CLI tool to identify all supported instrumentations and list the available telemetry

See the Leveraging Instrumentation Metadata post for a more detailed breakdown of some of the features.

We have started organizing a working group to further develop this tool and explore how it can be used for other languages and components in the ecosystem. If you are interested, please join us in the #otel-ecosystem-explorer channel in the CNCF Slack.

So what’s next?

OpenTelemetry has a project proposal process, and we have submitted a proposal to continue this work as an official project within the OpenTelemetry community. We are looking for contributors to collaborate with as we build this out further. If you are interested in helping, please reach out on that issue or find me in the #otel-ecosystem-explorer channel in the CNCF Slack.