Health Checks
The following sections outline health check commands for the database servers and the APIO Core servers. These commands help assess system stability and aid in troubleshooting potential issues.
Database Servers
Processes
Check that the database is running by listing the active PostgreSQL processes. There should be several processes with connections to apio_core, one streaming process on the primary server, and one receiver process on the standby.
Look for:
- Multiple processes with apio_core connections
- A walsender process on the primary server
- A walreceiver process on the standby server
[postgres@core-db1 ~]$ systemctl status postgresql-15
* postgresql-15.service - PostgreSQL 15 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-15.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/postgresql-15.service.d
`-override.conf
Active: active (running) since Tue 2025-03-11 11:51:23 CET; 1 day 21h ago
Docs: https://www.postgresql.org/docs/15/static/
Process: 794 ExecStartPre=/usr/pgsql-15/bin/postgresql-15-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 826 (postmaster)
Tasks: 19 (limit: 23632)
Memory: 411.2M
CGroup: /system.slice/postgresql-15.service
|- 826 /usr/pgsql-15/bin/postmaster -D /data/db
|- 988 postgres: logger
|- 989 postgres: checkpointer
|- 990 postgres: background writer
|- 994 postgres: walwriter
|- 995 postgres: autovacuum launcher
|- 996 postgres: logical replication launcher
|-138171 postgres: apio_core apio_core 172.18.0.4(47384) idle
|-279190 postgres: apio_core apio_core 172.18.0.4(53790) idle
|-279191 postgres: apio_core apio_core 172.18.0.4(53792) idle
|-281136 postgres: apio_core apio_core 172.18.0.4(54344) idle
|-281202 postgres: apio_core apio_core 172.18.0.2(35814) idle
|-281375 postgres: walsender repmgr 10.0.10.73(46010) streaming 0/1400F390
|-281379 postgres: apio_core apio_core 10.0.10.73(57392) idle
|-281380 postgres: apio_core apio_core 10.0.10.73(53608) idle
|-281381 postgres: apio_core apio_core 10.0.10.73(53368) idle
|-281383 postgres: apio_core apio_core 10.0.10.73(57400) idle
|-281441 postgres: apio_core apio_core 172.18.0.3(39620) idle
`-281465 postgres: apio_core apio_core 172.18.0.3(39774) idle
[postgres@core-db2 ~]$ systemctl status postgresql-15
* postgresql-15.service - PostgreSQL 15 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-15.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/postgresql-15.service.d
`-override.conf
Active: active (running) since Thu 2025-03-13 09:21:21 CET; 21min ago
Docs: https://www.postgresql.org/docs/15/static/
Process: 786 ExecStartPre=/usr/pgsql-15/bin/postgresql-15-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 962 (postmaster)
Tasks: 6 (limit: 23632)
Memory: 32.2M
CGroup: /system.slice/postgresql-15.service
|- 962 /usr/pgsql-15/bin/postmaster -D /data/db
|-1047 postgres: logger
|-1062 postgres: checkpointer
|-1063 postgres: background writer
|-1064 postgres: startup recovering 000000010000000000000014
`-1067 postgres: walreceiver streaming 0/1400F448
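The checks above can be scripted. The sketch below classifies postgres process titles like those shown in the systemctl output; the sample list is inlined here as an illustration, but on a live node you could pipe `ps -o args= -C postgres` into the same awk instead.

```shell
# Sketch: count apio_core backends and replication workers from process titles.
# The sample titles below are taken from the output above; pipe live ps output
# into the same awk on a real node.
ps_sample='postgres: walsender repmgr 10.0.10.73(46010) streaming 0/1400F390
postgres: apio_core apio_core 172.18.0.4(47384) idle
postgres: apio_core apio_core 172.18.0.3(39620) idle'
printf '%s\n' "$ps_sample" | awk '
  / apio_core /  { backends++ }
  / walsender /  { senders++ }
  / walreceiver/ { receivers++ }
  END { printf "apio_core backends: %d, walsender: %d, walreceiver: %d\n",
        backends, senders, receivers }'
```

On a healthy primary you would expect several backends and exactly one walsender; on the standby, one walreceiver and no walsender.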
Logs
Ensure that PostgreSQL logs are generated daily and review them for any abnormal statements. Check the logs for errors or warnings that could indicate potential issues, such as performance bottlenecks, replication issues, slow queries, or failed connections.
[postgres@core-db1 ~]$ ls -lrt /data/db/log/
total 44
-rw-------. 1 postgres postgres 16021 Mar 7 16:24 postgresql-Fri.log
-rw-------. 1 postgres postgres 0 Mar 8 00:00 postgresql-Sat.log
-rw-------. 1 postgres postgres 0 Mar 9 00:00 postgresql-Sun.log
-rw-------. 1 postgres postgres 485 Mar 10 16:39 postgresql-Mon.log
-rw-------. 1 postgres postgres 10951 Mar 11 14:44 postgresql-Tue.log
-rw-------. 1 postgres postgres 3275 Mar 12 15:04 postgresql-Wed.log
-rw-------. 1 postgres postgres 7363 Mar 13 10:49 postgresql-Thu.log
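A quick scripted variant of this check: with the weekly `postgresql-%a.log` naming shown above, today's log file should always exist. The path below matches the listing but is an assumption; adjust it to your data directory.

```shell
# Sketch: confirm that today's log file exists, assuming the weekly
# log_filename pattern "postgresql-%a.log" shown in the listing above.
log_dir=/data/db/log
today=$(date +%a)          # e.g. "Thu"
log_file="$log_dir/postgresql-$today.log"
if [ -e "$log_file" ]; then
  echo "OK: $log_file present"
else
  echo "MISSING: $log_file"
fi
```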
Datastore
Check that the WAL files are retained and rotated as expected.
[postgres@core-db1 ~]$ ls -lrt /data/db/pg_wal/
total 327680
drwx------. 2 postgres postgres 6 Mar 7 13:43 archive_status
...
-rw-------. 1 postgres postgres 16777216 Mar 12 14:59 000000010000000000000013
-rw-------. 1 postgres postgres 16777216 Mar 13 09:34 000000010000000000000014
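To catch WAL build-up (for example when archiving or replication stalls), the segment count can be compared against a ceiling. The 64-segment limit below is a hypothetical figure (64 x 16 MiB = 1 GiB); tune it to your max_wal_size / wal_keep_size settings.

```shell
# Sketch: count 16 MiB WAL segments (24 hex characters) and compare against
# a hypothetical ceiling of 64 segments (~1 GiB).
wal_dir=/data/db/pg_wal
count=$(ls "$wal_dir" 2>/dev/null | grep -cE '^[0-9A-F]{24}$')
max_segments=64
if [ "$count" -gt "$max_segments" ]; then
  echo "WARNING: $count WAL segments (expected <= $max_segments)"
else
  echo "OK: $count WAL segments"
fi
```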
Check that the datastore disk usage matches expectations:
[postgres@core-db1 ~]$ du -hs /data/db
31G /data/db
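The same check can be automated against a disk budget. The 50 GiB threshold below is only an example; pick a value that matches your sizing.

```shell
# Sketch: warn when the datastore grows beyond a hypothetical 50 GiB budget.
# du -sk reports KiB, which keeps the comparison integer-only.
data_dir=/data/db
used_kib=$(du -sk "$data_dir" 2>/dev/null | awk '{print $1}')
budget_kib=$(( 50 * 1024 * 1024 ))
if [ "${used_kib:-0}" -gt "$budget_kib" ]; then
  echo "WARNING: $data_dir above budget"
else
  echo "OK: $data_dir within budget"
fi
```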
Check that the number of records in the different tables is within expected limits, considering the configured retention period.
[postgres@core-db1 ~]$ psql -c "SELECT relname AS table_name, n_live_tup AS row_count FROM pg_stat_user_tables ORDER BY n_live_tup DESC;" apio_core
table_name | row_count
-------------------------+-----------
tasks | 17952216
processing_traces | 12222474
requests | 9554392
contexts | 9405087
instances | 1675506
events | 1525213
login_attempts | 22677
ott | 11479
errors | 10900
idp_sessions | 5484
...
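For monitoring, the row counts can be checked against a ceiling automatically. The sketch below parses unaligned psql output (as produced by `psql -At -F'|'`) and flags tables above a hypothetical 10 M row limit; the sample rows reuse the counts shown above.

```shell
# Sketch: flag tables above a hypothetical 10 M row ceiling. Feed it the
# row-count query via `psql -At -F'|'`; sample values are inlined here.
awk -F'|' '$2 + 0 > 10000000 { print "LARGE: " $1 }' <<'EOF'
tasks|17952216
processing_traces|12222474
requests|9554392
contexts|9405087
EOF
```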
Check that the sizes of the different tables are within expected limits.
[postgres@core-db1 ~]$ psql -c "SELECT relname AS table_name, pg_size_pretty(pg_total_relation_size(relid)) AS total_size, pg_size_pretty(pg_relation_size(relid)) AS table_size, pg_size_pretty(pg_indexes_size(relid)) AS indexes_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" apio_core
table_name | total_size | table_size | indexes_size
-------------------------+------------+------------+--------------
requests | 27 GB | 9051 MB | 4580 MB
contexts | 17 GB | 3628 MB | 1442 MB
processing_traces | 15 GB | 6871 MB | 3243 MB
instances | 5409 MB | 2370 MB | 919 MB
tasks | 3730 MB | 1677 MB | 2053 MB
events | 1070 MB | 815 MB | 252 MB
refresh_tokens | 65 MB | 840 kB | 64 MB
errors | 21 MB | 19 MB | 1448 kB
...
Backups
To ensure that backups are created and purged as expected, and that disk space matches expectations, perform the following checks.
[postgres@core-db1 ~]$ ls -lrt /data/backup/
total 0
drwx------. 2 postgres postgres 48 Mar 12 14:51 core-db1-2025-03-12
drwx------. 2 postgres postgres 48 Mar 12 14:59 basebackup-core-db1-2025-03-12
[postgres@core-db1 ~]$ ls -lrt /data/backup/basebackup-core-db1-2025-03-12/
total 5492
-rw-------. 1 postgres postgres 5339616 Mar 12 14:59 base.tar.gz
-rw-------. 1 postgres postgres 280764 Mar 12 14:59 backup_manifest
Verify that disk space usage meets expectations and that the backups are within the expected size and retention policies.
[postgres@core-db1 ~]$ du -h /data/backup/
5.4G /data/backup/basebackup-core-db1-2025-03-12
5.4G /data/backup/
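Backup recency can also be checked mechanically. The directory layout below follows the listing above; the path and the two-day age limit are assumptions to adapt to your backup schedule.

```shell
# Sketch: warn when no basebackup directory younger than two days exists
# (naming convention "basebackup-<host>-<date>" as shown above).
backup_root=/data/backup
recent=$(find "$backup_root" -maxdepth 1 -type d -name 'basebackup-*' -mtime -2 2>/dev/null | wc -l)
if [ "$recent" -ge 1 ]; then
  echo "OK: recent basebackup found"
else
  echo "WARNING: no basebackup in the last 2 days"
fi
```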
Replication
Check that the different nodes are registered as expected by running the following command on any database node.
This should display all registered nodes, their roles (primary or standby), statuses, and connection details. Ensure that:
- The primary node is correctly identified and running.
- Standby nodes are listed with the correct upstream node.
- No nodes are missing or in an unexpected state.
[postgres@core-db1 ~]$ /usr/pgsql-15/bin/repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+-----------+-----------+----------+----------+----------+-------------------------------------------
1 | core-db-1 | primary | * running | | default | 100 | 1 | host=10.0.10.71 dbname=repmgr user=repmgr
2 | core-db-2 | standby | running | core-db-1 | default | 100 | 1 | host=10.0.10.73 dbname=repmgr user=repmgr
INFO: This shows the current topology of registered nodes and is a static view; it does not confirm that replication is actively running.
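To my understanding, repmgr marks problem nodes in the Status column (for example "! running", "- failed", or "? unreachable"), so a simple grep over the cluster show output can catch them; treat the exact markers as an assumption and verify against your repmgr version. The sample below reuses the healthy output shown above.

```shell
# Sketch: grep the `repmgr cluster show` output for problem markers in the
# Status column. Sample (healthy) output is inlined; pipe the live command
# output in practice.
status_output=' 1 | core-db-1 | primary | * running |           | default
 2 | core-db-2 | standby |   running | core-db-1 | default'
if printf '%s\n' "$status_output" | grep -qE '[!?-] (running|failed|unreachable)'; then
  echo "WARNING: node in unexpected state"
else
  echo "cluster OK"
fi
```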
On the primary server, check that replication is up and that the streamed WAL location (sent_lsn) matches the replayed location (replay_lsn). Also ensure that reply_time is current.
[postgres@core-db1 ~]$ psql -x -c "select * from pg_stat_replication"
-[ RECORD 1 ]----+------------------------------
pid | 3721
usesysid | 17432
usename | repmgr
application_name | core-db-2
client_addr | 10.0.10.73
client_hostname |
client_port | 60374
backend_start | 2025-03-11 12:01:28.71278+01
backend_xmin |
state | streaming
sent_lsn | 0/14007730
write_lsn | 0/14007730
flush_lsn | 0/14007730
replay_lsn | 0/14007730
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
reply_time | 2025-03-13 09:18:50.959107+01
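The LSN columns can be turned into a byte lag. An LSN "X/Y" is a 64-bit WAL position: X is the high 32 bits and Y the low 32 bits, both hexadecimal. On a live system the server-side function pg_wal_lsn_diff() does this for you; the sketch below does the same arithmetic in the shell, using the values from the record above.

```shell
# Sketch: compute the byte distance between two WAL LSNs.
# An LSN "X/Y" encodes (X << 32) + Y, with X and Y in hex.
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}
sent="0/14007730"
replay="0/14007730"
echo "replay lag: $(( $(lsn_to_bytes "$sent") - $(lsn_to_bytes "$replay") )) bytes"
```

A healthy, caught-up standby should show a lag of zero (or near-zero) bytes.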
APIO Core Servers
Containers
Ensure that Docker is running and that one docker-proxy instance is spawned per port exposed to the outside.
[root@core-app-1 ~]# systemctl status docker
* docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2025-03-11 11:51:31 CET; 1 day 22h ago
Docs: https://docs.docker.com
Main PID: 1102 (dockerd)
Tasks: 27
Memory: 157.7M
CGroup: /system.slice/docker.service
|- 1102 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
|-138154 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9090 -container-ip 172.18.0.4 -container-port 9090 -use-listen-fd
`-278630 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.18.0.5 -container-port 80 -use-listen-fd
Ensure that the different containers are running by checking their uptime with the following command, run from the /opt/apio_core directory.
[root@core-app-1 apio_core]# docker compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
apio_core-core-1 docker.bxl.netaxis.be/apio_bsft/core:2.15.2 "/usr/local/go/serve…" core 3 days ago Up 22 hours 5000/tcp, 0.0.0.0:9090->9090/tcp
apio_core-nginx-1 nginx:latest "/docker-entrypoint.…" nginx 3 days ago Up 9 minutes 0.0.0.0:80->80/tcp
apio_core-p1-1 docker.bxl.netaxis.be/apio_bsft/core:2.15.2 "/usr/local/go/serve…" p1 3 days ago Restarting (2) 54 seconds ago
apio_core-scheduler-1 docker.bxl.netaxis.be/apio_bsft/core:2.15.2 "/usr/local/go/sched…" scheduler 3 days ago Up 45 hours 5000/tcp, 9090/tcp
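A container that restarts in a loop (like apio_core-p1-1 above) is easy to miss in a long listing. As a sketch, the STATUS column can be screened for anything that is not "Up ..."; the sample input mimics the output above, and in practice you would pipe the live `docker compose ps` output into the same function.

```shell
# Sketch: flag services whose STATUS is not "Up ...". Sample output with a
# restarting container is inlined; pipe `docker compose ps` in practice.
parse_ps() {
  awk 'NR > 1 && $0 !~ / Up / { bad = 1; print "NOT RUNNING: " $1 }
       END { if (!bad) print "all containers up" }'
}
parse_ps <<'EOF'
NAME               IMAGE         COMMAND   SERVICE   CREATED      STATUS                          PORTS
apio_core-core-1   core:2.15.2   "serve"   core      3 days ago   Up 22 hours                     9090/tcp
apio_core-p1-1     core:2.15.2   "serve"   p1        3 days ago   Restarting (2) 54 seconds ago
EOF
```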
Ensure that the running containers match the docker-compose.yml configuration by simulating the up command in dry-run mode. If they match, the output indicates that the containers are already running.
[root@core-app-1 apio_core]# docker compose up --dry-run
[+] Running 3/3
✔ DRY-RUN MODE - Container apio_core-core-1 Running 0.0s
✔ DRY-RUN MODE - Container apio_core-scheduler-1 Running 0.0s
✔ DRY-RUN MODE - Container apio_core-nginx-1 Running 0.0s
end of 'compose up' output, interactive run is not supported in dry-run mode
If they differ, the output indicates the actions required to align the running containers with the docker-compose.yml configuration.
[root@core-app-1 apio_core]# docker compose up --dry-run
[+] Running 3/3
✔ DRY-RUN MODE - p1 Pulled 0.2s
✔ DRY-RUN MODE - core Pulled 0.2s
✔ DRY-RUN MODE - scheduler Pulled 0.2s
[+] Running 4/4
✔ DRY-RUN MODE - Container apio_core-scheduler-1 Recreated 0.0s
✔ DRY-RUN MODE - Container apio_core-nginx-1 Created 0.0s
✔ DRY-RUN MODE - Container apio_core-p1-1 Recreated 0.0s
✔ DRY-RUN MODE - Container apio_core-core-1 Recreated 0.0s
end of 'compose up' output, interactive run is not supported in dry-run mode
Logs
Verify that each service has its own log file in the /var/log/apio_core/ directory. Then confirm that the logrotate configuration rotates these logs daily by checking that old logs are archived and new log files are created.
[root@core-app-1 apio_core]# ls -lrt /var/log/apio_core/
...
-rw-r--r--. 1 root root 20 Mar 12 10:52 nginx_error.log-20250312.gz
-rw-r--r--. 1 root root 20 Mar 12 10:52 nginx_access.log-20250312.gz
-rw-r--r--. 1 root root 20 Mar 12 10:52 access.log-20250312.gz
-rw-r--r--. 1 root root 129 Mar 12 11:05 error.log-20250312.gz
-rw-------. 1 root root 480 Mar 12 11:06 apio_core-p1-1.log-20250312.gz
-rw-------. 1 root root 302 Mar 12 11:06 apio_core-scheduler-1.log-20250312.gz
-rw-------. 1 root root 192 Mar 12 11:06 apio_core-core-1.log-20250312.gz
...
-rw-r--r--. 1 root root 505 Mar 13 09:00 error.log
-rw-r--r--. 1 root root 0 Mar 13 09:00 access.log
-rw-r--r--. 1 root root 771 Mar 13 09:16 nginx_error.log
-rw-------. 1 root root 2877500 Mar 13 09:17 apio_core-p1-1.log.1
-rw-------. 1 root root 3329109 Mar 13 09:23 apio_core-scheduler-1.log.1
-rw-r--r--. 1 root root 112607 Mar 13 09:23 nginx_access.log
-rw-------. 1 root root 2052895 Mar 13 09:23 apio_core-core-1.log.1
NGINX Connectivity
Check that NGINX serves the GUI by requesting the homepage from the host:
[root@core-app-1 apio_core]# curl http://localhost/index.html
...
To check that NGINX proxying to the core main service works as expected, send a POST request (without a body) to the login endpoint. If the configuration is correct, you should receive an error message about the invalid payload; this error is expected and indicates that the proxy is working as intended.
[root@core-app-1 apio_core]# curl -X POST http://localhost/api/v01/auth/login
{"message":"Invalid json body"}
To verify the NGINX proxying functionality to the proxied instance targeting the BWGW, send a POST request to the following URL:
[root@core-app-1 apio_core]# curl -X POST http://localhost/api/v01/p1/login
{"error":"unauthorized"}
Prometheus Metrics
To verify that Prometheus metrics are exposed and retrievable, perform an HTTP GET request to the metrics endpoint of the Prometheus target. The response should return a list of metrics in a text-based format.
[root@core-app-1 apio_core]# curl http://localhost:9090/metrics
# HELP apio_active_instances_total The total number of active instances
# TYPE apio_active_instances_total gauge
apio_active_instances_total 0
# HELP apio_instances_errors The number of instances in error
# TYPE apio_instances_errors gauge
apio_instances_errors 0
# HELP apio_reponse_time_avg The average response time over the last minute
# TYPE apio_reponse_time_avg gauge
apio_reponse_time_avg 0
# HELP apio_reponse_time_max The maximum response time over the last minute
# TYPE apio_reponse_time_max gauge
apio_reponse_time_max 0
# HELP apio_requests_total The number of requests
# TYPE apio_requests_total gauge
apio_requests_total 0
# HELP apio_users_total The total number of users
# TYPE apio_users_total gauge
apio_users_total 1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 2
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
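Individual metrics can be extracted from a scrape for quick threshold checks. The sketch below pulls a single gauge out of the text format; the sample is inlined, and in practice you would pipe `curl -s http://localhost:9090/metrics` into the same function.

```shell
# Sketch: extract one metric value from a Prometheus text-format scrape.
# Comment lines (# HELP / # TYPE) are skipped because their first field is "#".
metric() {
  awk -v name="$1" '$1 == name { print $2 }'
}
metric apio_users_total <<'EOF'
# HELP apio_users_total The total number of users
# TYPE apio_users_total gauge
apio_users_total 1
EOF
```

For example, alerting when apio_instances_errors is non-zero is a natural follow-up to this check.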