Elixir Observability: Using logs, metrics, and traces
Logs are a fundamental component of observability, providing a record of what happened in a system and when. They often carry additional context in the form of metadata, offering insights such as, “This event succeeded” or “This event failed.” Logs help operators and developers understand the sequence of events.
Metrics are logs that include numerical values for monitoring system events over time. They help operators understand the system’s health and trends, allowing them to detect anomalies and performance issues. Metrics provide a quantitative view of the system, such as, “This event is getting slower” or “This event is failing more often.”
Traces capture the journey of an event through various services, helping to identify where things may go wrong. For example, traces can reveal, “This event is propagating across the system and failing because the email service is down.”
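To make this concrete, a single structured log entry can carry all three kinds of data at once. Here is a hypothetical logfmt-style line (illustrative only, not output from the application below):

time=2024-05-01T12:00:05Z level=info msg="Job 1 completed after 4210 ms" job_name="Job 1" status=completed worker=GameApp.Workers.LongJobWorker attempt=1 start_time=1714564800913 end_time=1714564805123 trace_id=42 span_id=42-1 parent_span_id=0d9e3c5a

The timestamp, level, and message are the log; the numeric fields feed metrics; and the trace, span, and parent span IDs tie the entry into a trace.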
Oban long-job
First, we need to update the Oban long-job worker to be more realistic: it now represents an IO operation that can either succeed or fail.
def perform(%Oban.Job{
      id: trace_id,
      args: args,
      attempt: attempt,
      queue: queue,
      max_attempts: max_attempts,
      worker: worker
    }) do
  job_name = Map.get(args, "name", "Default Name")
  parent_span_id = Map.get(args, "parent_span_id", nil)

  # Propagate the caller's span id so logs from this job can be linked
  # back to the LiveView that enqueued it.
  if parent_span_id do
    Logger.metadata(parent_span_id: parent_span_id)
  end

  start_time = System.system_time(:millisecond)
  broadcast_update(:job_started, "#{job_name} job started")

  # Simulate IO that takes between 1 and 10 seconds.
  random_sleep_ms = :rand.uniform(9_000) + 1_000
  Process.sleep(random_sleep_ms)
  end_time = System.system_time(:millisecond)

  # Fails 1 out of 3 times.
  should_fail = :rand.uniform(3) == 1

  status =
    if should_fail do
      "failed"
    else
      "completed"
    end

  # Attach the log, metric, and trace values to every log entry emitted
  # for this attempt.
  Logger.metadata(
    trace_id: trace_id,
    span_id: "#{trace_id}-#{attempt}",
    parent_span_id: parent_span_id,
    start_time: start_time,
    end_time: end_time,
    job_name: job_name,
    worker: worker,
    status: status,
    queue: queue,
    attempt: attempt,
    max_attempts: max_attempts
  )

  if should_fail do
    if attempt < max_attempts do
      warning_message = "#{job_name} job failed on attempt #{attempt}. Retrying..."
      Logger.metadata(status: "retrying")
      Logger.warning(warning_message)
      broadcast_update(:job_warning, warning_message)
    else
      failed_message = "#{job_name} job failed on final attempt #{attempt}"
      Logger.error(failed_message)
      broadcast_update(:job_failed, failed_message)
    end

    # Clear the job-specific metadata so it does not leak into later logs
    # from this process.
    Logger.reset_metadata()
    {:error, "Job failed"}
  else
    complete_message = "#{job_name} completed after #{random_sleep_ms} ms"
    Logger.info(complete_message)
    broadcast_update(:job_completed, complete_message)
    Logger.reset_metadata()
    :ok
  end
end
This code simulates a long-running IO job that relies on Oban’s retry logic: a failed attempt is retried until the maximum number of attempts is exhausted, at which point the job fails permanently. Throughout execution, the job broadcasts status updates corresponding to starts, successes, warnings, and failures. Although not deployed separately, Oban jobs mirror microservices by fostering modularity and resilience, allowing each job to be independently managed and scaled.
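The broadcast_update/2 helper and the topic/0 function (which the LiveView below subscribes to) aren’t shown above. A minimal sketch, assuming Phoenix.PubSub and a topic name of our own choosing; only the {event, message} tuple shape is dictated by the LiveView handlers below:

# Hypothetical PubSub helpers inside the worker module; the real topic
# string may differ.
def topic, do: "long_job:updates"

defp broadcast_update(event, message) do
  Phoenix.PubSub.broadcast(GameApp.PubSub, topic(), {event, message})
end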
Log
- Timestamp: Records when the log entry is created.
- Severity: The level of importance or urgency.
- Message: A human-readable description of the event.
Metadata
- Job Name: Identifies the specific job being executed.
- Status: The current state of the job, such as started, completed, or failed.
- Worker: The worker handling the job.
- Queue: The queue to which the job belongs.
Metrics
- Start and End Timestamps: The timing and duration of each attempt.
- Attempts: For understanding retry logic and failure patterns.
Trace
- Trace ID: A unique id for each job.
- Span ID: A unique id for each job attempt.
- Parent Span ID: Links the job back to the event that triggered it.
The logs enable us to observe when an event occurred and determine whether it succeeded or failed. The metrics facilitate the monitoring of performance bottlenecks, diagnosis of problems, and optimization of efficiency. The trace values help us understand where jobs originate and how the complexities of multiple attempts are managed. This approach offers valuable insights into the operation and health of the system.
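For these keys to appear in the log output that Loki parses with logfmt later in this post, they also need to be allow-listed in the Logger configuration. A minimal sketch, assuming the standard console logging setup (the exact configuration keys vary with Elixir version and formatter):

# config/config.exs — a hypothetical configuration; adjust for your
# Elixir version and log formatter.
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [
    :trace_id,
    :span_id,
    :parent_span_id,
    :start_time,
    :end_time,
    :job_name,
    :worker,
    :status,
    :queue,
    :attempt,
    :max_attempts
  ]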
LiveView
For the LiveView, we’ve updated the system to generate a parent_span_id and to handle the new :job_warning and :job_failed messages from PubSub. This helps track the origin of jobs and manage the life cycle of job execution, including error handling and retries.
defmodule GameAppWeb.LongJobLive.Index do
  use GameAppWeb, :live_view

  alias GameApp.Workers.LongJobWorker
  alias Phoenix.PubSub

  import Ecto.UUID, only: [generate: 0]

  @impl true
  def mount(_params, _session, socket) do
    # Listen for the status updates broadcast by the worker.
    PubSub.subscribe(GameApp.PubSub, LongJobWorker.topic())

    new_socket =
      socket
      |> assign(:job_results, [])
      |> assign(:job_number, 0)
      # One parent span per LiveView session links every job it triggers.
      |> assign(:parent_span_id, generate())

    {:ok, new_socket}
  end

  @impl true
  def handle_event("trigger_job", _params, socket) do
    job_number = socket.assigns.job_number + 1
    parent_span_id = socket.assigns.parent_span_id
    job_name = "Job #{job_number}"

    # Enqueue the Oban job, passing the parent span id along in its args.
    LongJobWorker.new(%{
      name: job_name,
      parent_span_id: parent_span_id
    })
    |> Oban.insert()
    |> case do
      {:ok, _job} ->
        new_socket =
          socket
          |> assign(:job_number, job_number)

        {:noreply, new_socket}

      {:error, _reason} ->
        new_socket =
          socket
          |> put_flash(:error, "Failed to trigger #{job_name}.")

        {:noreply, new_socket}
    end
  end

  @impl true
  def handle_info({:job_started, message}, socket) do
    new_socket =
      socket
      |> put_flash(:info, message)

    {:noreply, new_socket}
  end

  @impl true
  def handle_info({:job_completed, message}, socket) do
    new_socket =
      socket
      |> update(:job_results, fn results ->
        results ++ [message]
      end)
      |> put_flash(:info, message)

    {:noreply, new_socket}
  end

  @impl true
  def handle_info({:job_warning, message}, socket) do
    new_socket =
      socket
      |> update(:job_results, fn results ->
        results ++ [message]
      end)

    {:noreply, new_socket}
  end

  @impl true
  def handle_info({:job_failed, message}, socket) do
    new_socket =
      socket
      |> update(:job_results, fn results ->
        results ++ [message]
      end)
      |> put_flash(:error, message)

    {:noreply, new_socket}
  end
end
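The template for this LiveView isn’t shown. Below is a minimal sketch of the markup, assuming a colocated render/1 and a recent LiveView version; only the "trigger_job" event name and the @job_results assign come from the module above, and the project’s real template is likely richer:

# Hypothetical markup; the button triggers the handle_event/3 clause above,
# and the list renders each broadcast result.
def render(assigns) do
  ~H"""
  <button phx-click="trigger_job">Trigger long job</button>

  <ul>
    <li :for={result <- @job_results}><%= result %></li>
  </ul>
  """
end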
Viewing Logs
This project uses local instances of Loki and Grafana for log aggregation and visualization. For detailed instructions on setting up these services, please refer to this previous blog post.
Running the Application Locally
To start the application, execute the following commands:
make up
make setup.dev
make serve.dev
After initiating a few jobs at http://127.0.0.1:4080/long-job, visit the locally running instance of Grafana at http://localhost:3000. Log in with the username admin and the password admin, as configured in docker-compose. Once logged in, select the local Loki data source at http://loki:3100 to start exploring the logs.
Analyzing Logs with Loki’s Query Language (LogQL)
Loki utilizes a query language called LogQL. Below are some examples.
- Count Completed Jobs:
  sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="completed" [5m]))
  This tallies the number of jobs marked as “completed” in 5-minute intervals.
- Count Retries:
  sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="retrying" [5m]))
  This is the same query, but for retries.
- Count Failures:
  sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="failed" [5m]))
  And this counts the number of jobs that have failed, also in 5-minute increments.
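These counters can also be combined. For example, a rough failure ratio over the same window might look like the following sketch (adjust the selectors to match your labels):

sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="failed" [5m]))
/
sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" [5m]))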
Grafana Dashboards
Grafana dashboards begin with a query from the data source, but they can include data transformations to generate more complex views. For example, the duration of a job attempt can be calculated by taking the difference between start_time and end_time. Alternatively, rather than looking at the duration of each attempt, you can group data by trace_id to calculate the duration of a job across multiple retry attempts.
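As a starting point for such a panel, the underlying query might simply select the worker’s log lines and leave the arithmetic to Grafana’s transformations (a sketch; the exact transformation steps depend on your Grafana version):

{job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker"

From there, an “Add field from calculation”-style transformation can subtract start_time from end_time for each row, or group rows by trace_id before aggregating.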