Troubleshooting & Diagnostics

Published: April 2, 2026

Overview

This section outlines common issues encountered during the deployment of Operational Decision Manager (ODM) on Amazon EKS. It focuses specifically on the challenges introduced by the strict security requirements: end-to-end TLS encryption (HTTPS everywhere) and restrictive database network policies.

Logging Strategy

Effective troubleshooting requires inspecting logs at three distinct layers. Use the following commands to isolate errors.

ODM Application Logs

The most critical logs are generated by the WebSphere Liberty server running inside the ODM pods. Look here for JDBC connection errors, rule execution failures, or startup timeouts.

# Stream logs for the Decision Center pod
kubectl logs -f -l app.kubernetes.io/component=decisionCenter -n odm-pilot

# Check for specific "messages.log" errors inside a running container
kubectl exec -it <pod-name> -n odm-pilot -- cat /logs/messages.log
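To narrow a long messages.log down to database-related failures, you can grep for the usual JDBC error signatures. A quick sketch (the pod name is a placeholder, and the pattern list is illustrative rather than exhaustive):

```shell
# Filter the Liberty log for common JDBC failure signatures
# (<pod-name> is a placeholder for your Decision Center pod)
kubectl exec <pod-name> -n odm-pilot -- \
  grep -E "SQLException|Connection refused|SocketTimeout" /logs/messages.log
```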

AWS Load Balancer Controller Logs

If the Ingress resource is created but the AWS ALB or Target Groups are not provisioning, inspect the controller logs in the kube-system namespace.

kubectl logs -f -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller

Kubernetes Events

Use events to diagnose scheduling issues, resource quota limits, or image pull failures.

kubectl get events -n odm-pilot --sort-by='.lastTimestamp'

Common Failure Scenarios

Use this matrix to quickly resolve the most frequent deployment errors.

| Symptom | Probable Cause | Corrective Action |
|---|---|---|
| ALB returns 502 Bad Gateway | Protocol mismatch: the ALB is sending HTTP to a Pod expecting HTTPS, or the Pod is failing its readiness probe. | 1. Verify the Ingress annotation exists: alb.ingress.kubernetes.io/backend-protocol: HTTPS. 2. Check whether the Pod’s readiness probe is failing (see Section 1.3). |
| Pod stuck in CrashLoopBackOff | Database failure: the ODM container cannot reach RDS to initialize the schema. | 1. Check pod logs for Connection refused or SocketTimeout. 2. Verify the RDS Security Group allows inbound traffic on port 5432 from the EKS node Security Group. |
| Deployment fails OPA validation | Policy violation: the deployment spec exposes a non-secure (HTTP) port. | Ensure the container spec exposes the service on port 9443 (HTTPS) rather than 9060; the OPA policy mandates secure listeners. |
| “Datasource not found” error | Configuration error: the PostgreSQL driver is not loaded, or the datasource XML is malformed. | Verify that the PostgreSQL JDBC driver is mounted in the container’s shared/resources volume and referenced in server.xml. |
| ALB Target Group is “Unhealthy” | Health check mismatch: the ALB is health-checking the wrong port or protocol. | Verify the Ingress health check annotations: ensure alb.ingress.kubernetes.io/healthcheck-protocol is set to HTTPS and the health check path points to a valid endpoint (e.g., /decisioncenter/health). |
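Several of the Ingress-related rows in the matrix above come down to a handful of annotations on one resource. A minimal sketch, assuming the AWS Load Balancer Controller is installed; the resource name, host, and Service name are placeholders, and the path should match your deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: odm-ingress                 # placeholder name
  namespace: odm-pilot
  annotations:
    kubernetes.io/ingress.class: alb
    # Send HTTPS (not HTTP) to the pods, matching the secure 9443 listener
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
    # Health-check over HTTPS against a known-good endpoint
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTPS
    alb.ingress.kubernetes.io/healthcheck-path: /decisioncenter/health
spec:
  rules:
    - host: odm.example.com         # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: odm-decisioncenter   # placeholder Service name
                port:
                  number: 9443
```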

Deep Dive: TLS & Ingress

Because of the strict OPA requirement for HTTPS traffic at the pod level, the connection between the AWS ALB and the Pods is the most fragile configuration point.

Infinite Redirect Loops

Context: ODM often attempts to redirect traffic internally, which can conflict with ALB redirects.
Fix: Ensure the annotation alb.ingress.kubernetes.io/ssl-redirect: '443' is set. Additionally, verify that the ODM Liberty server configuration (server.xml) trusts the proxy headers (X-Forwarded-Proto) sent by the ALB, so ODM recognizes that the original request was secure.

Untrusted Certificates

Context: The ALB attempts to connect to the Pod via HTTPS, but the Pod is using a self-signed certificate.
Fix: By default, the AWS ALB accepts self-signed certificates from backends when the backend protocol is HTTPS. However, you must ensure the Pod is actually listening on port 9443.
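To confirm the Pod actually declares a 9443 listener, you can read the container ports straight from its spec. A sketch (<pod-name> is a placeholder):

```shell
# Print each container's name followed by its declared ports
kubectl get pod <pod-name> -n odm-pilot \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.ports[*].containerPort}{"\n"}{end}'
```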

Tip: Testing Internal Connectivity

You can test the internal TLS handshake from within the cluster using a temporary debug pod.

# Run a temporary curl pod
kubectl run curl-debug --image=curlimages/curl -n odm-pilot --rm -it -- sh

# Inside the pod, test the handshake (-k allows self-signed certs)
curl -k -v https://<odm-pod-ip>:9443/decisioncenter

Database Connectivity Checklist

If the Decision Center or Decision Server Console fails to start, the cause is most often database connectivity or permissions.

Important: Critical Checks
  1. VPC Peering/Routing: Ensure the EKS VPC has a valid route table entry to the RDS subnet.
  2. Security Groups:
    • Source: The Security Group attached to the EKS Worker Nodes.
    • Destination: The Security Group attached to RDS (Must allow TCP/5432).
  3. Schema Privileges: If this is a fresh install, ensure the user provided in the JDBC connection string has CREATE TABLE privileges. ODM must create its own schema tables on the very first startup.
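The network-level portion of the checklist can be verified from inside the cluster with a throwaway PostgreSQL client pod. Note that pg_isready only confirms the listener is reachable on TCP/5432; it does not validate credentials or schema privileges (<rds-endpoint> is a placeholder for your RDS hostname):

```shell
# Run a temporary client pod and probe the RDS listener on 5432
kubectl run pg-debug --image=postgres:16 -n odm-pilot --rm -it --restart=Never -- \
  pg_isready -h <rds-endpoint> -p 5432
```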

OPA Gatekeeper Diagnostics

Since the cluster enforces strict security policies (such as requiring HTTPS listeners), OPA Gatekeeper acts as the admission controller. If your deployments fail validation, use the following commands to investigate why.

Identifying Policy Violations

If a kubectl apply or Helm install fails with an error message like Error from server (Forbidden): admission webhook "validation.gatekeeper.sh" denied the request, follow these steps to identify the blocking policy.

1. List Active Constraints. View all constraints currently enforced on the cluster to find the relevant policy (e.g., k8srequiredlabels, k8shttpsonly, etc.).

kubectl get constraints

2. Inspect Specific Violations. If a resource is blocked or audited, the details are stored in the status field of the Constraint object. This tells you exactly which field in your manifest failed validation.

# Syntax: kubectl describe <ConstraintKind> <ConstraintName>
# Example: Checking for ingress security violations
kubectl describe k8shttpsonly ingress-must-be-secure

Look for the Total Violations field and the Violations list in the output.

Debugging Policy Logic

If you suspect a policy is misconfigured or behaving unexpectedly (e.g., blocking valid resources), you can inspect the underlying Rego logic or check the Gatekeeper controller logs.

Inspect the Constraint Template (Rego). The logic resides in the ConstraintTemplate; inspecting it shows the actual Rego code being executed.

# List available templates
kubectl get constrainttemplates

# View the Rego logic for a specific template
kubectl get constrainttemplate k8shttpsonly -o yaml

Check Gatekeeper Controller Logs. The controller logs provide detailed information on webhook admission requests, including the JSON payload sent to OPA.

kubectl logs -l control-plane=controller-manager -n gatekeeper-system

Note: Dry Run Mode

If you are debugging a new policy and want to observe violations without blocking deployments, you can temporarily set the enforcement action to dryrun in the Constraint YAML:

spec:
  enforcementAction: dryrun

Violations will appear in the Constraint status (via kubectl describe) but will not block the creation or update of resources.
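In context, enforcementAction sits at the top level of the Constraint spec. A hypothetical k8shttpsonly Constraint in dry-run mode might look like the following (the kind and match scope are assumptions for illustration; adjust them to your installed ConstraintTemplate):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sHttpsOnly                  # hypothetical constraint kind
metadata:
  name: ingress-must-be-secure
spec:
  enforcementAction: dryrun         # record violations without blocking
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["Ingress"]
    namespaces: ["odm-pilot"]
```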

View All Violations Cluster-Wide

To get a comprehensive list of every active policy violation in the cluster (audited vs. blocked), you can use this formatted command. It iterates through every Constraint and prints the resource name and specific error message for each violation.

kubectl get constraints -o jsonpath='{range .items[*]}{"POLICY: "}{.metadata.name}{"\n"}{range .status.violations[*]}{"  FAIL: "}{.message}{"\n    -> Resource:  "}{.kind}/{.name}{"\n"}{end}{"\n"}{end}'

If you only want violations from the odm-pilot namespace:

kubectl get constraints -o jsonpath='{range .items[*]}{"POLICY: "}{.metadata.name}{"\n"}{range .status.violations[?(@.namespace=="odm-pilot")]}{"  FAIL: "}{.message}{"\n    -> Resource:  "}{.kind}/{.name}{"\n"}{end}{"\n"}{end}'

Output Example:

POLICY: ingress-must-be-secure
  FAIL: Ingress must be https only
    -> Resource:  Ingress/odm-ingress

POLICY: container-must-have-probes
  FAIL: Container <decision-center> is missing a readinessProbe
    -> Resource:  Pod/odm-decision-center-0