Health Checks
The following sections outline health check commands for the database servers and the APIO Core servers. These commands help assess system stability and aid in troubleshooting potential issues.
Database Servers
Processes
Check that the database is running by listing the active PostgreSQL processes. There should be several processes with connections to apio_core, one streaming process on the primary server, and one receiver process on the standby.
Look for:
- Multiple processes with apio_core connections
- A walsender process on the primary server
- A walreceiver process on the standby server
[postgres@core-db1 ~]$ systemctl status postgresql-15
* postgresql-15.service - PostgreSQL 15 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-15.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/postgresql-15.service.d
`-override.conf
Active: active (running) since Tue 2025-03-11 11:51:23 CET; 1 day 21h ago
Docs: https://www.postgresql.org/docs/15/static/
Process: 794 ExecStartPre=/usr/pgsql-15/bin/postgresql-15-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 826 (postmaster)
Tasks: 19 (limit: 23632)
Memory: 411.2M
CGroup: /system.slice/postgresql-15.service
|- 826 /usr/pgsql-15/bin/postmaster -D /data/db
|- 988 postgres: logger
|- 989 postgres: checkpointer
|- 990 postgres: background writer
|- 994 postgres: walwriter
|- 995 postgres: autovacuum launcher
|- 996 postgres: logical replication launcher
|-138171 postgres: apio_core apio_core 172.18.0.4(47384) idle
|-279190 postgres: apio_core apio_core 172.18.0.4(53790) idle
|-279191 postgres: apio_core apio_core 172.18.0.4(53792) idle
|-281136 postgres: apio_core apio_core 172.18.0.4(54344) idle
|-281202 postgres: apio_core apio_core 172.18.0.2(35814) idle
|-281375 postgres: walsender repmgr 10.0.10.73(46010) streaming 0/1400F390
|-281379 postgres: apio_core apio_core 10.0.10.73(57392) idle
|-281380 postgres: apio_core apio_core 10.0.10.73(53608) idle
|-281381 postgres: apio_core apio_core 10.0.10.73(53368) idle
|-281383 postgres: apio_core apio_core 10.0.10.73(57400) idle
|-281441 postgres: apio_core apio_core 172.18.0.3(39620) idle
`-281465 postgres: apio_core apio_core 172.18.0.3(39774) idle
[postgres@core-db2 ~]$ systemctl status postgresql-15
* postgresql-15.service - PostgreSQL 15 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-15.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/postgresql-15.service.d
`-override.conf
Active: active (running) since Thu 2025-03-13 09:21:21 CET; 21min ago
Docs: https://www.postgresql.org/docs/15/static/
Process: 786 ExecStartPre=/usr/pgsql-15/bin/postgresql-15-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 962 (postmaster)
Tasks: 6 (limit: 23632)
Memory: 32.2M
CGroup: /system.slice/postgresql-15.service
|- 962 /usr/pgsql-15/bin/postmaster -D /data/db
|-1047 postgres: logger
|-1062 postgres: checkpointer
|-1063 postgres: background writer
|-1064 postgres: startup recovering 000000010000000000000014
`-1067 postgres: walreceiver streaming 0/1400F448
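The checks above can be scripted. The sketch below classifies postgres process titles like those shown in the systemctl output; the sample list is inlined here as an illustration, but on a live node you could pipe `ps -o args= -C postgres` into the same awk instead.

```shell
# Sketch: count apio_core backends and replication workers from process titles.
# The sample titles below are taken from the output above; pipe live ps output
# into the same awk on a real node.
ps_sample='postgres: walsender repmgr 10.0.10.73(46010) streaming 0/1400F390
postgres: apio_core apio_core 172.18.0.4(47384) idle
postgres: apio_core apio_core 172.18.0.3(39620) idle'
printf '%s\n' "$ps_sample" | awk '
  / apio_core /  { backends++ }
  / walsender /  { senders++ }
  / walreceiver/ { receivers++ }
  END { printf "apio_core backends: %d, walsender: %d, walreceiver: %d\n",
        backends, senders, receivers }'
```

On a healthy primary you would expect several backends and exactly one walsender; on the standby, one walreceiver and no walsender.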
Logs
Ensure that PostgreSQL logs are generated daily and review them for any abnormal statements. Check the logs for errors or warnings that could indicate potential issues, such as performance bottlenecks, replication issues, slow queries, or failed connections.
[postgres@core-db1 ~]$ ls -lrt /data/db/log/
total 44
-rw-------. 1 postgres postgres 16021 Mar 7 16:24 postgresql-Fri.log
-rw-------. 1 postgres postgres 0 Mar 8 00:00 postgresql-Sat.log
-rw-------. 1 postgres postgres 0 Mar 9 00:00 postgresql-Sun.log
-rw-------. 1 postgres postgres 485 Mar 10 16:39 postgresql-Mon.log
-rw-------. 1 postgres postgres 10951 Mar 11 14:44 postgresql-Tue.log
-rw-------. 1 postgres postgres 3275 Mar 12 15:04 postgresql-Wed.log
-rw-------. 1 postgres postgres 7363 Mar 13 10:49 postgresql-Thu.log
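A quick scripted variant of this check: with the weekly `postgresql-%a.log` naming shown above, today's log file should always exist. The path below matches the listing but is an assumption; adjust it to your data directory.

```shell
# Sketch: confirm that today's log file exists, assuming the weekly
# log_filename pattern "postgresql-%a.log" shown in the listing above.
log_dir=/data/db/log
today=$(date +%a)          # e.g. "Thu"
log_file="$log_dir/postgresql-$today.log"
if [ -e "$log_file" ]; then
  echo "OK: $log_file present"
else
  echo "MISSING: $log_file"
fi
```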
Datastore
Check that the WAL files are retained and rotated as expected.
[postgres@core-db1 ~]$ ls -lrt /data/db/pg_wal/
total 327680
drwx------. 2 postgres postgres 6 Mar 7 13:43 archive_status
...
-rw-------. 1 postgres postgres 16777216 Mar 12 14:59 000000010000000000000013
-rw-------. 1 postgres postgres 16777216 Mar 13 09:34 000000010000000000000014
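To catch WAL build-up (for example when archiving or replication stalls), the segment count can be compared against a ceiling. The 64-segment limit below is a hypothetical figure (64 x 16 MiB = 1 GiB); tune it to your max_wal_size / wal_keep_size settings.

```shell
# Sketch: count 16 MiB WAL segments (24 hex characters) and compare against
# a hypothetical ceiling of 64 segments (~1 GiB).
wal_dir=/data/db/pg_wal
count=$(ls "$wal_dir" 2>/dev/null | grep -cE '^[0-9A-F]{24}$')
max_segments=64
if [ "$count" -gt "$max_segments" ]; then
  echo "WARNING: $count WAL segments (expected <= $max_segments)"
else
  echo "OK: $count WAL segments"
fi
```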
Check that the datastore disk usage matches expectations:
[postgres@core-db1 ~]$ du -hs /data/db
31G /data/db
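The same check can be automated against a disk budget. The 50 GiB threshold below is only an example; pick a value that matches your sizing.

```shell
# Sketch: warn when the datastore grows beyond a hypothetical 50 GiB budget.
# du -sk reports KiB, which keeps the comparison integer-only.
data_dir=/data/db
used_kib=$(du -sk "$data_dir" 2>/dev/null | awk '{print $1}')
budget_kib=$(( 50 * 1024 * 1024 ))
if [ "${used_kib:-0}" -gt "$budget_kib" ]; then
  echo "WARNING: $data_dir above budget"
else
  echo "OK: $data_dir within budget"
fi
```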
Check that the number of records in the different tables is within expected limits, considering the configured retention period.
[postgres@core-db1 ~]$ psql -c "SELECT relname AS table_name, n_live_tup AS row_count FROM pg_stat_user_tables ORDER BY n_live_tup DESC;" apio_core
table_name | row_count
-------------------------+-----------
tasks | 17952216
processing_traces | 12222474
requests | 9554392
contexts | 9405087
instances | 1675506
events | 1525213
login_attempts | 22677
ott | 11479
errors | 10900
idp_sessions | 5484
...
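For monitoring, the row counts can be checked against a ceiling automatically. The sketch below parses unaligned psql output (as produced by `psql -At -F'|'`) and flags tables above a hypothetical 10 M row limit; the sample rows reuse the counts shown above.

```shell
# Sketch: flag tables above a hypothetical 10 M row ceiling. Feed it the
# row-count query via `psql -At -F'|'`; sample values are inlined here.
awk -F'|' '$2 + 0 > 10000000 { print "LARGE: " $1 }' <<'EOF'
tasks|17952216
processing_traces|12222474
requests|9554392
contexts|9405087
EOF
```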
Check that the sizes of the different tables are within expected limits.
[postgres@core-db1 ~]$ psql -c "SELECT relname AS table_name, pg_size_pretty(pg_total_relation_size(relid)) AS total_size, pg_size_pretty(pg_relation_size(relid)) AS table_size, pg_size_pretty(pg_indexes_size(relid)) AS indexes_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;" apio_core
table_name | total_size | table_size | indexes_size
-------------------------+------------+------------+--------------
requests | 27 GB | 9051 MB | 4580 MB
contexts | 17 GB | 3628 MB | 1442 MB
processing_traces | 15 GB | 6871 MB | 3243 MB
instances | 5409 MB | 2370 MB | 919 MB
tasks | 3730 MB | 1677 MB | 2053 MB
events | 1070 MB | 815 MB | 252 MB
refresh_tokens | 65 MB | 840 kB | 64 MB
errors | 21 MB | 19 MB | 1448 kB
...
Backups
To ensure that backups are created and purged as expected, and that disk space matches expectations, perform the following checks.
[postgres@core-db1 ~]$ ls -lrt /data/backup/
total 0
drwx------. 2 postgres postgres 48 Mar 12 14:51 core-db1-2025-03-12
drwx------. 2 postgres postgres 48 Mar 12 14:59 basebackup-core-db1-2025-03-12
[postgres@core-db1 ~]$ ls -lrt /data/backup/basebackup-core-db1-2025-03-12/
total 5492
-rw-------. 1 postgres postgres 5339616 Mar 12 14:59 base.tar.gz
-rw-------. 1 postgres postgres 280764 Mar 12 14:59 backup_manifest
Verify that disk space usage meets expectations and that the backups are within the expected size and retention policies.
[postgres@core-db1 ~]$ du -h /data/backup/
5.4G /data/backup/basebackup-core-db1-2025-03-12
5.4G /data/backup/
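Backup recency can also be checked mechanically. The directory layout below follows the listing above; the path and the two-day age limit are assumptions to adapt to your backup schedule.

```shell
# Sketch: warn when no basebackup directory younger than two days exists
# (naming convention "basebackup-<host>-<date>" as shown above).
backup_root=/data/backup
recent=$(find "$backup_root" -maxdepth 1 -type d -name 'basebackup-*' -mtime -2 2>/dev/null | wc -l)
if [ "$recent" -ge 1 ]; then
  echo "OK: recent basebackup found"
else
  echo "WARNING: no basebackup in the last 2 days"
fi
```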
Replication
Check that the different nodes are registered as expected by running the following command on any database node.
This should display all registered nodes, their roles (primary or standby), statuses, and connection details. Ensure that:
- The primary node is correctly identified and running.
- Standby nodes are listed with the correct upstream node.
- No nodes are missing or in an unexpected state.
[postgres@core-db1 ~]$ /usr/pgsql-15/bin/repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+-----------+-----------+----------+----------+----------+-------------------------------------------
1 | core-db-1 | primary | * running | | default | 100 | 1 | host=10.0.10.71 dbname=repmgr user=repmgr
2 | core-db-2 | standby | running | core-db-1 | default | 100 | 1 | host=10.0.10.73 dbname=repmgr user=repmgr
INFO: This shows the current topology of registered nodes and is a static view; it does not confirm that replication is actively running.
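To my understanding, repmgr marks problem nodes in the Status column (for example "! running", "- failed", or "? unreachable"), so a simple grep over the cluster show output can catch them; treat the exact markers as an assumption and verify against your repmgr version. The sample below reuses the healthy output shown above.

```shell
# Sketch: grep the `repmgr cluster show` output for problem markers in the
# Status column. Sample (healthy) output is inlined; pipe the live command
# output in practice.
status_output=' 1 | core-db-1 | primary | * running |           | default
 2 | core-db-2 | standby |   running | core-db-1 | default'
if printf '%s\n' "$status_output" | grep -qE '[!?-] (running|failed|unreachable)'; then
  echo "WARNING: node in unexpected state"
else
  echo "cluster OK"
fi
```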
On the primary server, check that replication is up and that the streamed WAL location (sent_lsn) matches the replayed location (replay_lsn). Also ensure that reply_time is current.
[postgres@core-db1 ~]$ psql -x -c "select * from pg_stat_replication"
-[ RECORD 1 ]----+------------------------------
pid | 3721
usesysid | 17432
usename | repmgr
application_name | core-db-2
client_addr | 10.0.10.73
client_hostname |
client_port | 60374
backend_start | 2025-03-11 12:01:28.71278+01
backend_xmin |
state | streaming
sent_lsn | 0/14007730
write_lsn | 0/14007730
flush_lsn | 0/14007730
replay_lsn | 0/14007730
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
reply_time | 2025-03-13 09:18:50.959107+01
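The LSN columns can be turned into a byte lag. An LSN "X/Y" is a 64-bit WAL position: X is the high 32 bits and Y the low 32 bits, both hexadecimal. On a live system the server-side function pg_wal_lsn_diff() does this for you; the sketch below does the same arithmetic in the shell, using the values from the record above.

```shell
# Sketch: compute the byte distance between two WAL LSNs.
# An LSN "X/Y" encodes (X << 32) + Y, with X and Y in hex.
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}
sent="0/14007730"
replay="0/14007730"
echo "replay lag: $(( $(lsn_to_bytes "$sent") - $(lsn_to_bytes "$replay") )) bytes"
```

A healthy, caught-up standby should show a lag of zero (or near-zero) bytes.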
APIO Core Servers
Containers
Ensure that Docker is running and that one docker-proxy instance is spawned per port exposed to the outside.
[root@core-app-1 ~]# systemctl status docker
* docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2025-03-11 11:51:31 CET; 1 day 22h ago
Docs: https://docs.docker.com
Main PID: 1102 (dockerd)
Tasks: 27
Memory: 157.7M
CGroup: /system.slice/docker.service
|- 1102 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
|-138154 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 9090 -container-ip 172.18.0.4 -container-port 9090 -use-listen-fd
`-278630 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.18.0.5 -container-port 80 -use-listen-fd
Ensure that the different containers are running by checking their uptime with the following command, run from the /opt/apio_core directory.
[root@core-app-1 apio_core]# docker compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
apio_core-core-1 docker.bxl.netaxis.be/apio_bsft/core:2.15.2 "/usr/local/go/serve…" core 3 days ago Up 22 hours 5000/tcp, 0.0.0.0:9090->9090/tcp
apio_core-nginx-1 nginx:latest "/docker-entrypoint.…" nginx 3 days ago Up 9 minutes 0.0.0.0:80->80/tcp
apio_core-p1-1 docker.bxl.netaxis.be/apio_bsft/core:2.15.2 "/usr/local/go/serve…" p1 3 days ago Restarting (2) 54 seconds ago
apio_core-scheduler-1 docker.bxl.netaxis.be/apio_bsft/core:2.15.2 "/usr/local/go/sched…" scheduler 3 days ago Up 45 hours 5000/tcp, 9090/tcp
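A container that restarts in a loop (like apio_core-p1-1 above) is easy to miss in a long listing. As a sketch, the STATUS column can be screened for anything that is not "Up ..."; the sample input mimics the output above, and in practice you would pipe the live `docker compose ps` output into the same function.

```shell
# Sketch: flag services whose STATUS is not "Up ...". Sample output with a
# restarting container is inlined; pipe `docker compose ps` in practice.
parse_ps() {
  awk 'NR > 1 && $0 !~ / Up / { bad = 1; print "NOT RUNNING: " $1 }
       END { if (!bad) print "all containers up" }'
}
parse_ps <<'EOF'
NAME               IMAGE         COMMAND   SERVICE   CREATED      STATUS                          PORTS
apio_core-core-1   core:2.15.2   "serve"   core      3 days ago   Up 22 hours                     9090/tcp
apio_core-p1-1     core:2.15.2   "serve"   p1        3 days ago   Restarting (2) 54 seconds ago
EOF
```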
Ensure that the running containers match the docker-compose.yml configuration by simulating the up command in dry-run mode. If they match, the output indicates that the containers are already running.
[root@core-app-1 apio_core]# docker compose up --dry-run
[+] Running 3/3
✔ DRY-RUN MODE - Container apio_core-core-1 Running 0.0s
✔ DRY-RUN MODE - Container apio_core-scheduler-1 Running 0.0s
✔ DRY-RUN MODE - Container apio_core-nginx-1 Running 0.0s
end of 'compose up' output, interactive run is not supported in dry-run mode
If they differ, the output indicates the actions required to align the running containers with the docker-compose.yml configuration.
[root@core-app-1 apio_core]# docker compose up --dry-run
[+] Running 3/3
✔ DRY-RUN MODE - p1 Pulled 0.2s
✔ DRY-RUN MODE - core Pulled 0.2s
✔ DRY-RUN MODE - scheduler Pulled 0.2s
[+] Running 4/4
✔ DRY-RUN MODE - Container apio_core-scheduler-1 Recreated 0.0s
✔ DRY-RUN MODE - Container apio_core-nginx-1 Created 0.0s
✔ DRY-RUN MODE - Container apio_core-p1-1 Recreated 0.0s
✔ DRY-RUN MODE - Container apio_core-core-1 Recreated 0.0s
end of 'compose up' output, interactive run is not supported in dry-run mode
Logs
Verify that each service has its own log file in the /var/log/apio_core/ directory. Then confirm that the logrotate configuration rotates these logs daily by checking that old logs are archived and new log files are created.
[root@core-app-1 apio_core]# ls -lrt /var/log/apio_core/
...
-rw-r--r--. 1 root root 20 Mar 12 10:52 nginx_error.log-20250312.gz
-rw-r--r--. 1 root root 20 Mar 12 10:52 nginx_access.log-20250312.gz
-rw-r--r--. 1 root root 20 Mar 12 10:52 access.log-20250312.gz
-rw-r--r--. 1 root root 129 Mar 12 11:05 error.log-20250312.gz
-rw-------. 1 root root 480 Mar 12 11:06 apio_core-p1-1.log-20250312.gz
-rw-------. 1 root root 302 Mar 12 11:06 apio_core-scheduler-1.log-20250312.gz
-rw-------. 1 root root 192 Mar 12 11:06 apio_core-core-1.log-20250312.gz
...
-rw-r--r--. 1 root root 505 Mar 13 09:00 error.log
-rw-r--r--. 1 root root 0 Mar 13 09:00 access.log
-rw-r--r--. 1 root root 771 Mar 13 09:16 nginx_error.log
-rw-------. 1 root root 2877500 Mar 13 09:17 apio_core-p1-1.log.1
-rw-------. 1 root root 3329109 Mar 13 09:23 apio_core-scheduler-1.log.1
-rw-r--r--. 1 root root 112607 Mar 13 09:23 nginx_access.log
-rw-------. 1 root root 2052895 Mar 13 09:23 apio_core-core-1.log.1
NGINX Connectivity
Check that NGINX serves the GUI by requesting the homepage from the host:
[root@core-app-1 apio_core]# curl http://localhost/index.html
...
To check that NGINX proxying to the core main service works as expected, send a POST request (without a body) to the login endpoint. If the configuration is correct, you should receive an error message about the invalid payload; this error is expected and indicates that the proxy is working as intended.
[root@core-app-1 apio_core]# curl -X POST http://localhost/api/v01/auth/login
{"message":"Invalid json body"}
To verify the NGINX proxying functionality to the proxied instance targeting the BWGW, send a POST request to the following URL:
[root@core-app-1 apio_core]# curl -X POST http://localhost/api/v01/p1/login
{"error":"unauthorized"}
Prometheus Metrics
To verify that Prometheus metrics are exposed and retrievable, perform an HTTP GET request to the metrics endpoint of the Prometheus target. The response should return a list of metrics in a text-based format.
[root@core-app-1 apio_core]# curl http://localhost:9090/metrics
# HELP apio_active_instances_total The total number of active instances
# TYPE apio_active_instances_total gauge
apio_active_instances_total 0
# HELP apio_instances_errors The number of instances in error
# TYPE apio_instances_errors gauge
apio_instances_errors 0
# HELP apio_reponse_time_avg The average response time over the last minute
# TYPE apio_reponse_time_avg gauge
apio_reponse_time_avg 0
# HELP apio_reponse_time_max The maximum response time over the last minute
# TYPE apio_reponse_time_max gauge
apio_reponse_time_max 0
# HELP apio_requests_total The number of requests
# TYPE apio_requests_total gauge
apio_requests_total 0
# HELP apio_users_total The total number of users
# TYPE apio_users_total gauge
apio_users_total 1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 2
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
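Individual metrics can be extracted from a scrape for quick threshold checks. The sketch below pulls a single gauge out of the text format; the sample is inlined, and in practice you would pipe `curl -s http://localhost:9090/metrics` into the same function.

```shell
# Sketch: extract one metric value from a Prometheus text-format scrape.
# Comment lines (# HELP / # TYPE) are skipped because their first field is "#".
metric() {
  awk -v name="$1" '$1 == name { print $2 }'
}
metric apio_users_total <<'EOF'
# HELP apio_users_total The total number of users
# TYPE apio_users_total gauge
apio_users_total 1
EOF
```

For example, alerting when apio_instances_errors is non-zero is a natural follow-up to this check.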