OpenTelemetry Java Instrumentation Metadata - Telemetry Variations
(This is a continuation of the first post about this project)
One challenge with documenting metadata associated with instrumentations is how to present the variations of the telemetry data emitted when different configuration options are enabled.
Previously we simply listed the modules to run the generation on and iterated over them. Now we need the ability to run each module with different configuration options enabled, and to attribute the resulting data to the configuration values that were set.
Tagging Telemetry
There is already a pattern for running test suites with different variations; a common one is running tests with stable semantic conventions enabled. This is done by registering a new test suite and adding a JVM argument in the Gradle configuration:
tasks {
val testStableSemconv by registering(Test::class) {
jvmArgs("-Dotel.semconv-stability.opt-in=database")
}
check {
dependsOn(testStableSemconv)
}
}
If we want to run this particular Gradle task, instead of targeting the standard test suite (./gradlew :instrumentation:<module>:javaagent:test) we would run ./gradlew :instrumentation:<module>:javaagent:testStableSemconv.
We could augment this pattern by adding a system property named metaDataConfig that describes the configuration options used to generate the data.
val collectMetadata = findProperty("collectMetadata")?.toString() ?: "false"
tasks {
val testStableSemconv by registering(Test::class) {
jvmArgs("-Dotel.semconv-stability.opt-in=database")
systemProperty("collectMetadata", collectMetadata)
systemProperty("metaDataConfig", "otel.semconv-stability.opt-in=database")
}
test {
systemProperty("collectMetadata", collectMetadata)
}
check {
dependsOn(testStableSemconv)
}
}
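Since collectMetadata is read via findProperty, it can be supplied as a Gradle project property on the command line (for example ./gradlew :instrumentation:<module>:javaagent:testStableSemconv -PcollectMetadata=true), and it defaults to "false" for normal test runs.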
And then in our file writer we can add that information to our output, or indicate that the telemetry is emitted by default:
String config = System.getProperty("metaDataConfig");
// fall back to "default" when no configuration description was provided
String when = "default";
if (config != null && !config.isEmpty()) {
  when = config;
}
writer.write("when: " + when + "\n");
writer.write("metrics:\n");
...
For a database client instrumentation that emits different metrics depending on the semantic convention flag, the resulting output might span multiple files, one per configuration. For example, the default configuration might produce:
- when: default
metrics:
- name: db.client.connections.idle.max
description: The maximum number of idle open connections allowed.
type: LONG_SUM
unit: connections
attributes:
- name: pool.name
type: STRING
- name: db.client.connections.idle.min
description: The minimum number of idle open connections allowed.
type: LONG_SUM
unit: connections
attributes:
- name: pool.name
type: STRING
- name: db.client.connections.max
description: The maximum number of open connections allowed.
type: LONG_SUM
unit: connections
attributes:
- name: pool.name
type: STRING
- name: db.client.connections.pending_requests
description: The number of pending requests for an open connection, cumulative
for the entire pool.
type: LONG_SUM
unit: requests
attributes:
- name: pool.name
type: STRING
- name: db.client.connections.usage
description: The number of connections that are currently in state described
by the state attribute.
type: LONG_SUM
unit: connections
attributes:
- name: pool.name
type: STRING
- name: state
type: STRING
And another file for the same instrumentation with the semantic convention flag enabled:
- when: otel.semconv-stability.opt-in=database
metrics:
- name: db.client.connection.count
description: The number of connections that are currently in state described
by the state attribute.
type: LONG_SUM
unit: connection
attributes:
- name: db.client.connection.pool.name
type: STRING
- name: db.client.connection.state
type: STRING
- name: db.client.connection.idle.max
description: The maximum number of idle open connections allowed.
type: LONG_SUM
unit: connection
attributes:
- name: db.client.connection.pool.name
type: STRING
- name: db.client.connection.idle.min
description: The minimum number of idle open connections allowed.
type: LONG_SUM
unit: connection
attributes:
- name: db.client.connection.pool.name
type: STRING
- name: db.client.connection.max
description: The maximum number of open connections allowed.
type: LONG_SUM
unit: connection
attributes:
- name: db.client.connection.pool.name
type: STRING
- name: db.client.connection.pending_requests
description: The number of current pending requests for an open connection.
type: LONG_SUM
unit: request
attributes:
- name: db.client.connection.pool.name
type: STRING
This approach was implemented in this PR.
Span Data
In terms of characterizing the telemetry emitted by an instrumentation, metrics are more straightforward than spans. For one thing, most of the instrumentation in the agent is focused on spans rather than metrics, so there is a lot more data to collect and analyze. Additionally, span names are typically at least partially dynamic, and depending on the span type, there might not be an easy way to enumerate and describe all the variations an instrumentation emits.
Another thing to consider when writing our gatherer is that not all spans emitted by our tests are produced by the instrumentation we are testing. For example, when testing a database client instrumentation, we might see spans emitted by the database server or by an underlying HTTP client library. We need to filter those out and only include the spans emitted by the instrumentation under test.
Due to these considerations, my initial approach is to summarize the span data by the attributes emitted in the context of each span kind. Luckily for us, each span also has an instrumentation scope attached to it, which we can use to filter out spans that are not relevant to the instrumentation we are testing.
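To make this concrete, here is a rough sketch of the kind of grouping such a gatherer could perform, assuming it has access to the finished SpanData from the test's in-memory exporter. The SpanAttributeSummary class and summarize method are illustrative rather than the actual implementation; only the SpanData accessors are real OpenTelemetry SDK calls:

import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.sdk.trace.data.SpanData;
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

class SpanAttributeSummary {

  // scope name -> span kind -> attribute name -> attribute type
  static Map<String, Map<SpanKind, Map<String, String>>> summarize(Collection<SpanData> spans) {
    Map<String, Map<SpanKind, Map<String, String>>> summary = new TreeMap<>();
    for (SpanData span : spans) {
      // the instrumentation scope tells us which instrumentation produced the span
      String scope = span.getInstrumentationScopeInfo().getName();
      Map<String, String> attributes =
          summary
              .computeIfAbsent(scope, s -> new TreeMap<>())
              .computeIfAbsent(span.getKind(), k -> new TreeMap<>());
      // record each attribute name and type; duplicates collapse into a single entry
      span.getAttributes()
          .forEach((key, value) -> attributes.put(key.getKey(), key.getType().name()));
    }
    return summary;
  }
}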
Our resulting raw output for intercepting spans from a test might look like this:
when: default
spans:
- scope: test
spans:
- span_kind: INTERNAL
attributes:
- scope: io.opentelemetry.clickhouse-client-0.5
spans:
- span_kind: CLIENT
attributes:
- name: db.operation
type: STRING
- name: db.name
type: STRING
- name: server.address
type: STRING
- name: server.port
type: LONG
- name: db.statement
type: STRING
- name: db.system
type: STRING
Span Parser
We can generate these span files the same way we did for the metrics, so all of that plumbing is already in place. Next we need to focus on parsing this data and then incorporating it into our instrumentation list output. Since the test runners don’t have a great way to know which scopes we are interested in, we will delegate the responsibility of filtering out that data to our parser.
Similar to what we needed to do with metrics, we also need a way to deduplicate and clean up all the span data, as there will be a lot of duplicate spans and attributes emitted by different tests.
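As a minimal sketch of that cleanup, assuming each test run's raw file has already been parsed into nested maps of scope to span kind to attribute name and type, the parser could keep only the scope of the instrumentation under test and merge everything else, letting duplicate entries collapse. The SpanSummaryMerger class below is illustrative rather than the actual implementation:

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SpanSummaryMerger {

  // Each per-test summary maps scope name -> span kind -> attribute name -> attribute type.
  // Only the scope of the instrumentation under test is kept; merging into a single map
  // deduplicates span kinds and attributes that were emitted by more than one test.
  static Map<String, Map<String, String>> merge(
      List<Map<String, Map<String, Map<String, String>>>> perTestSummaries, String scopeName) {
    Map<String, Map<String, String>> merged = new TreeMap<>();
    for (Map<String, Map<String, Map<String, String>>> summary : perTestSummaries) {
      Map<String, Map<String, String>> byKind = summary.get(scopeName);
      if (byKind == null) {
        continue; // this test run emitted no spans for the scope we care about
      }
      byKind.forEach(
          (spanKind, attributes) ->
              merged.computeIfAbsent(spanKind, k -> new TreeMap<>()).putAll(attributes));
    }
    return merged;
  }
}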
After implementing all of this, we end up with something like:
telemetry:
- when: default
spans:
- span_kind: CLIENT
attributes:
- name: db.operation
type: STRING
- name: server.address
type: STRING
- name: server.port
type: LONG
- name: db.name
type: STRING
- name: db.system
type: STRING
- name: db.statement
type: STRING
- when: otel.semconv-stability.opt-in=database
metrics:
- name: db.client.operation.duration
description: Duration of database client operations.
type: HISTOGRAM
unit: s
attributes:
- name: db.namespace
type: STRING
- name: db.operation.name
type: STRING
- name: db.system.name
type: STRING
- name: server.address
type: STRING
- name: server.port
type: LONG
spans:
- span_kind: CLIENT
attributes:
- name: db.system.name
type: STRING
- name: db.namespace
type: STRING
- name: error.type
type: STRING
- name: db.operation.name
type: STRING
- name: server.port
type: LONG
- name: db.query.text
type: STRING
- name: db.response.status_code
type: STRING
- name: server.address
type: STRING
This approach was implemented in this PR.
Next
Longer term, we want to run this across all modules as part of a nightly run rather than maintaining an explicit list of Gradle tasks to execute. For now, this approach gives us a way to roll things out slowly, in smaller iterations that are easier to review.
One thing I’m now thinking about is whether we should also distinguish telemetry based on whether it is provided by the javaagent or by library instrumentation. To be continued…