Elixir Observability: Using logs, metrics, and traces
Logs are a fundamental component of observability, providing a record of what happened in a system and when. They often carry additional context in the form of metadata, offering insights such as, “This event succeeded” or “This event failed.” Logs help operators and developers understand the sequence of events.
Metrics are logs that include numerical values for monitoring system events over time. They help operators understand the system’s health and trends, allowing them to detect anomalies and performance issues. Metrics provide a quantitative view of the system, such as, “This event is getting slower” or “This event is failing more often.”
Traces capture the journey of an event through various services, helping to identify where things may go wrong. For example, traces can reveal, “This event is propagating across the system and failing because the email service is down.”
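To make this concrete, a single structured log entry can carry all three kinds of data at once. Here is a hypothetical logfmt-style line (illustrative only, not output from the application below):

time=2024-05-01T12:00:05Z level=info msg="Job 1 completed after 4210 ms" job_name="Job 1" status=completed worker=GameApp.Workers.LongJobWorker attempt=1 start_time=1714564800913 end_time=1714564805123 trace_id=42 span_id=42-1 parent_span_id=0d9e3c5a

The timestamp, level, and message are the log; the numeric fields feed metrics; and the trace, span, and parent span IDs tie the entry into a trace.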
Oban long-job
First, we need to update the Oban long-job worker to be more realistic: it now represents an IO operation that can either succeed or fail.
def perform(%Oban.Job{
      id: trace_id,
      args: args,
      attempt: attempt,
      queue: queue,
      max_attempts: max_attempts,
      worker: worker
    }) do
  job_name = Map.get(args, "name", "Default Name")
  parent_span_id = Map.get(args, "parent_span_id", nil)

  # Propagate the caller's span id so logs from this job can be linked
  # back to the LiveView that enqueued it.
  if parent_span_id do
    Logger.metadata(parent_span_id: parent_span_id)
  end

  start_time = System.system_time(:millisecond)
  broadcast_update(:job_started, "#{job_name} job started")

  # Simulate IO that takes between 1 and 10 seconds.
  random_sleep_ms = :rand.uniform(9_000) + 1_000
  Process.sleep(random_sleep_ms)
  end_time = System.system_time(:millisecond)

  # Fails 1 out of 3 times.
  should_fail = :rand.uniform(3) == 1

  status =
    if should_fail do
      "failed"
    else
      "completed"
    end

  # Attach the log, metric, and trace values to every log entry emitted
  # for this attempt.
  Logger.metadata(
    trace_id: trace_id,
    span_id: "#{trace_id}-#{attempt}",
    parent_span_id: parent_span_id,
    start_time: start_time,
    end_time: end_time,
    job_name: job_name,
    worker: worker,
    status: status,
    queue: queue,
    attempt: attempt,
    max_attempts: max_attempts
  )

  if should_fail do
    if attempt < max_attempts do
      warning_message = "#{job_name} job failed on attempt #{attempt}. Retrying..."
      Logger.metadata(status: "retrying")
      Logger.warning(warning_message)
      broadcast_update(:job_warning, warning_message)
    else
      failed_message = "#{job_name} job failed on final attempt #{attempt}"
      Logger.error(failed_message)
      broadcast_update(:job_failed, failed_message)
    end

    # Clear the job-specific metadata so it does not leak into later logs
    # from this process.
    Logger.reset_metadata()
    {:error, "Job failed"}
  else
    complete_message = "#{job_name} completed after #{random_sleep_ms} ms"
    Logger.info(complete_message)
    broadcast_update(:job_completed, complete_message)
    Logger.reset_metadata()
    :ok
  end
end
This code simulates a long-running IO job that relies on Oban’s retry logic: a failed attempt is retried until the maximum number of attempts is exhausted, at which point the job fails permanently. Throughout execution, the job broadcasts status updates corresponding to starts, successes, warnings, and failures. Although not deployed separately, Oban jobs mirror microservices by fostering modularity and resilience, allowing each job to be independently managed and scaled.
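The broadcast_update/2 helper and the topic/0 function (which the LiveView below subscribes to) aren’t shown above. A minimal sketch, assuming Phoenix.PubSub and a topic name of our own choosing; only the {event, message} tuple shape is dictated by the LiveView handlers below:

# Hypothetical PubSub helpers inside the worker module; the real topic
# string may differ.
def topic, do: "long_job:updates"

defp broadcast_update(event, message) do
  Phoenix.PubSub.broadcast(GameApp.PubSub, topic(), {event, message})
end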
Log
- Timestamp: Records when the log entry is created.
- Severity: The level of importance or urgency.
- Message: A human-readable description of the event.
Metadata
- Job Name: Identifies the specific job being executed.
- Status: The current state of the job, such as started, completed, or failed.
- Worker: The worker handling the job.
- Queue: The queue to which the job belongs.
Metrics
- Start and End Timestamps: The timing and duration of each attempt.
- Attempts: For understanding retry logic and failure patterns.
Trace
- Trace ID: A unique id for each job.
- Span ID: A unique id for each job attempt.
- Parent Span ID: Links the job back to the event that triggered it.
The logs enable us to observe when an event occurred and determine whether it succeeded or failed. The metrics facilitate the monitoring of performance bottlenecks, diagnosis of problems, and optimization of efficiency. The trace values help us understand where jobs originate and how the complexities of multiple attempts are managed. This approach offers valuable insights into the operation and health of the system.
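For these keys to appear in the log output that Loki parses with logfmt later in this post, they also need to be allow-listed in the Logger configuration. A minimal sketch, assuming the standard console logging setup (the exact configuration keys vary with Elixir version and formatter):

# config/config.exs — a hypothetical configuration; adjust for your
# Elixir version and log formatter.
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [
    :trace_id,
    :span_id,
    :parent_span_id,
    :start_time,
    :end_time,
    :job_name,
    :worker,
    :status,
    :queue,
    :attempt,
    :max_attempts
  ]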
LiveView
For the LiveView, we’ve updated the system to generate a parent_span_id and to handle the new :job_warning and :job_failed messages from PubSub. This helps track the origin of jobs and manage the life cycle of job execution, including error handling and retries.
defmodule GameAppWeb.LongJobLive.Index do
  use GameAppWeb, :live_view

  alias GameApp.Workers.LongJobWorker
  alias Phoenix.PubSub

  import Ecto.UUID, only: [generate: 0]

  @impl true
  def mount(_params, _session, socket) do
    # Listen for the status updates broadcast by the worker.
    PubSub.subscribe(GameApp.PubSub, LongJobWorker.topic())

    new_socket =
      socket
      |> assign(:job_results, [])
      |> assign(:job_number, 0)
      # One parent span per LiveView session links every job it triggers.
      |> assign(:parent_span_id, generate())

    {:ok, new_socket}
  end

  @impl true
  def handle_event("trigger_job", _params, socket) do
    job_number = socket.assigns.job_number + 1
    parent_span_id = socket.assigns.parent_span_id
    job_name = "Job #{job_number}"

    # Enqueue the Oban job, passing the parent span id along in its args.
    LongJobWorker.new(%{
      name: job_name,
      parent_span_id: parent_span_id
    })
    |> Oban.insert()
    |> case do
      {:ok, _job} ->
        new_socket =
          socket
          |> assign(:job_number, job_number)

        {:noreply, new_socket}

      {:error, _reason} ->
        new_socket =
          socket
          |> put_flash(:error, "Failed to trigger #{job_name}.")

        {:noreply, new_socket}
    end
  end

  @impl true
  def handle_info({:job_started, message}, socket) do
    new_socket =
      socket
      |> put_flash(:info, message)

    {:noreply, new_socket}
  end

  @impl true
  def handle_info({:job_completed, message}, socket) do
    new_socket =
      socket
      |> update(:job_results, fn results ->
        results ++ [message]
      end)
      |> put_flash(:info, message)

    {:noreply, new_socket}
  end

  @impl true
  def handle_info({:job_warning, message}, socket) do
    new_socket =
      socket
      |> update(:job_results, fn results ->
        results ++ [message]
      end)

    {:noreply, new_socket}
  end

  @impl true
  def handle_info({:job_failed, message}, socket) do
    new_socket =
      socket
      |> update(:job_results, fn results ->
        results ++ [message]
      end)
      |> put_flash(:error, message)

    {:noreply, new_socket}
  end
end
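The template for this LiveView isn’t shown. Below is a minimal sketch of the markup, assuming a colocated render/1 and a recent LiveView version; only the "trigger_job" event name and the @job_results assign come from the module above, and the project’s real template is likely richer:

# Hypothetical markup; the button triggers the handle_event/3 clause above,
# and the list renders each broadcast result.
def render(assigns) do
  ~H"""
  <button phx-click="trigger_job">Trigger long job</button>

  <ul>
    <li :for={result <- @job_results}><%= result %></li>
  </ul>
  """
end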
Viewing Logs
This project uses local instances of Loki and Grafana for log aggregation and visualization. For detailed instructions on setting up these services, please refer to this previous blog post.
Running the Application Locally
To start the application, execute the following commands:
make up
make setup.dev
make serve.dev
After initiating a few jobs at http://127.0.0.1:4080/long-job, visit the locally running instance of Grafana at http://localhost:3000. Log in with the username admin and the password admin, as configured in docker-compose. Once logged in, select the local Loki data source at http://loki:3100 to start exploring the logs.
Analyzing Logs with Loki’s Query Language (LogQL)
Loki utilizes a query language called LogQL. Below are some examples.
- Count Completed Jobs:
  sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="completed" [5m]))
  This tallies the number of jobs marked as “completed” in 5-minute intervals.
- Count Retries:
  sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="retrying" [5m]))
  This is the same query, but for retries.
- Count Failures:
  sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="failed" [5m]))
  And this counts the number of jobs that have failed, also in 5-minute increments.
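These counters can also be combined. For example, a rough failure ratio over the same window might look like the following sketch (adjust the selectors to match your labels):

sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" | status="failed" [5m]))
/
sum(count_over_time({job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker" [5m]))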
Grafana Dashboards
Grafana dashboards begin with a query from the data source, but they can include data transformations to generate more complex views. For example, the duration of a job attempt can be calculated by taking the difference between start_time and end_time. Alternatively, rather than looking at the duration of each attempt, you can group data by trace_id to calculate the duration of a job across multiple retry attempts.
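As a starting point for such a panel, the underlying query might simply select the worker’s log lines and leave the arithmetic to Grafana’s transformations (a sketch; the exact transformation steps depend on your Grafana version):

{job="game_app"} | logfmt | worker="GameApp.Workers.LongJobWorker"

From there, an “Add field from calculation”-style transformation can subtract start_time from end_time for each row, or group rows by trace_id before aggregating.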