BP-Node Message Bus and System-wide Tracing and Monitoring

Kyryll Prytula

7.10.2022

BP-Node is a Tier 1 banking switch capable of sustained processing of over 2000 TPS. It has been designed and written as a stateless, message-driven architecture. At the core of that architecture is the Message Bus, which acts as a kind of network hub for messages to and from all BP-Node components. Each configured BP-Node component registers its own subscribers with the Message Bus, and each subscriber listens for a particular message type.

When BP-Node’s Message Bus receives a message of any kind, it offers that message to every subscriber in turn to see whether the subscriber is interested in processing it. Because the message itself is represented by a shared pointer to a nearly static object, only the pointer is handed over and delivery to a subscriber is effectively immediate. The number of circulating messages has no impact on BP-Node’s performance, as each message is processed in its own thread, asynchronously to the processing of the other messages.
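The subscriber matching described above can be illustrated with a minimal C++ sketch. It is not BP-Node's actual implementation: the names MessageBus, Message, MessageType, subscribe and publish are hypothetical, and dispatch here is synchronous on the caller's thread, so the per-message threading is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical subset of message types, for illustration only.
enum class MessageType { Timeout, SystemStatus, Financial };

struct Message {
    MessageType type;
    std::string payload;
};

// Messages are shared and read-only: delivery copies a pointer, not the payload.
using MessagePtr = std::shared_ptr<const Message>;

class MessageBus {
public:
    using Handler = std::function<void(const MessagePtr&)>;

    // Register a handler that is interested in one message type.
    void subscribe(MessageType type, Handler handler) {
        subscribers_.push_back({type, std::move(handler)});
    }

    // Offer the message to every subscriber; only matching ones receive it.
    // Returns the number of subscribers the message was delivered to.
    std::size_t publish(const MessagePtr& msg) const {
        std::size_t delivered = 0;
        for (const auto& sub : subscribers_) {
            if (sub.type == msg->type) {
                sub.handler(msg);
                ++delivered;
            }
        }
        return delivered;
    }

private:
    struct Subscriber {
        MessageType type;
        Handler handler;
    };
    std::vector<Subscriber> subscribers_;
};
```

A component would call subscribe once at start-up and then receive only the message types it asked for; all other messages pass it by at the cost of one comparison.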

There are many message types: Timeout, UserInterface, Security, Persistance, RemoteCommand, SystemStatus, CryptographicRequest, KeyRequest and lastly Financial messages. All of these messages are constantly being triggered by a variety of external events (front-end enquiries, network connections) and also by internal timers (scheduler, cleaners, message time-outs, health system and monitoring). All are handled by the Message Bus. Although this keeps the Message Bus constantly busy (a basic BP-Node configuration continuously processes about 30 TPS), receiving a message adds virtually no processing overhead: message handling on the Message Bus is non-blocking and thus has minimal impact on system performance. Understanding how the Message Bus operates is the key to understanding how system-wide tracing and monitoring are provided in BP-Node.

BP-Node uses several different components to provide system-wide tracing and monitoring: Health, External Monitoring, Events, Tracing and Debug Mode, as described below. These are the base components delivered as part of every BP-Node installation and they are essential for any BP-Node delivery. All these components subscribe automatically to the Message Bus and use it as a central communication node to broadcast their status to other components.

Let’s take a closer look at the options provided:

Health – BP-Node’s Health component consists of a set of in-memory hooks. Any change in the health condition of a particular monitored component (e.g. Connection lost event) triggers a hook, which then runs a number of lambda functions associated with this condition (e.g. Close Listening Socket). These are all in-memory operations with no impact on system performance. There are no writes to persistence or other IO operations.
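As a rough illustration of the hook mechanism, the following C++ sketch keeps a list of lambdas per health condition and runs them when the condition is triggered. The names HealthHooks, HealthCondition, on and trigger are hypothetical and not taken from BP-Node; the sketch only shows the in-memory pattern described above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical health conditions, for illustration only.
enum class HealthCondition { ConnectionLost, ConnectionRestored };

class HealthHooks {
public:
    using Action = std::function<void()>;

    // Associate an action (typically a lambda) with a health condition.
    void on(HealthCondition condition, Action action) {
        actions_[condition].push_back(std::move(action));
    }

    // Run every action registered for the condition, in registration order.
    // Returns the number of actions that were run.
    std::size_t trigger(HealthCondition condition) const {
        auto it = actions_.find(condition);
        if (it == actions_.end()) return 0;
        for (const auto& action : it->second) action();
        return it->second.size();
    }

private:
    std::map<HealthCondition, std::vector<Action>> actions_;
};
```

Everything here lives in process memory, which matches the point made above: triggering a hook involves no persistence writes or other IO unless an action itself performs them.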

External Monitoring – BP-Node’s External Monitoring is currently handled mainly by the Zabbix Endpoint. The External Monitoring component provides a deadline timeout set to one second, which then triggers a message to gather all statistics from all BP-Node components. These statistics are collected as they happen and are aggregated up to the moment the External Monitoring forwards the package to the Zabbix Endpoint. On the Zabbix Endpoint side, the statistics are packed into a single message, which is then relayed to the Zabbix trapper. These are all in-memory operations that have a minimal impact on system performance. The only IO that takes place here is the TCP exchange with Zabbix, which occurs every 5 seconds and can range from 4 KB up to 100 KB of data, depending on the number of monitored components.
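The collect-then-forward pattern can be sketched in C++ as follows. This is an assumption-laden illustration, not BP-Node code: StatsAggregator, add and drain are hypothetical names, and the timer and the Zabbix transport are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

class StatsAggregator {
public:
    // Called by components as events happen; a cheap in-memory update.
    void add(const std::string& key, std::uint64_t delta = 1) {
        std::lock_guard<std::mutex> lock(mutex_);
        counters_[key] += delta;
    }

    // Called when the monitoring timer fires: hand over everything
    // aggregated so far as one package and start again from empty.
    std::map<std::string, std::uint64_t> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::map<std::string, std::uint64_t> snapshot;
        snapshot.swap(counters_);
        return snapshot;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::uint64_t> counters_;
};
```

The package returned by drain is what would be serialised into a single message for the monitoring endpoint, so network IO happens once per interval rather than once per statistic.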

Events – The Events component is the core element of BP-Node. All component and runtime events are sent here. Each event is represented in the system by a static class, so it is resolved at compile time rather than at run time. Each event has its own static counters that tally its number of occurrences in the past minute and the past 10 minutes. During processing, these counters are checked to prevent Event log overloading. Event notification is suppressed when it becomes too frequent, and a new Event is raised to notify the operator that this has occurred.
Processing of a “validated” event includes a database insert, but because of the frequency check there are at most 10 writes to the database per minute for each event type. The performance of the database table is managed through table partitioning (where multiple tables are aggregated under a UNION VIEW), which allows table retention to be handled by dropping whole tables rather than by per-record deletion. Front-end enquiries are hard-coded to return at most 100 records. The impact on database performance is therefore very small, and the impact on any other operation performed by the Events component is smaller still. There is no impact on BP-Node transaction performance at all, as all of these operations are conducted asynchronously to transaction processing.
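The frequency check can be pictured as a per-event-type throttle. The C++ sketch below uses a sliding one-minute window and is only an approximation of the idea: the class name EventThrottle and its methods are hypothetical, and BP-Node's own static counters (past minute and past 10 minutes) may well be implemented differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

class EventThrottle {
public:
    using Clock = std::chrono::steady_clock;

    EventThrottle(std::size_t max_per_window, std::chrono::seconds window)
        : max_per_window_(max_per_window), window_(window) {}

    // Returns true if an event occurring at `now` may be written to the log.
    bool allow(Clock::time_point now) {
        // Forget accepted events that have fallen out of the window.
        while (!accepted_.empty() && now - accepted_.front() >= window_) {
            accepted_.pop_front();
        }
        if (accepted_.size() >= max_per_window_) {
            ++suppressed_;  // too frequent: skip the write
            return false;
        }
        accepted_.push_back(now);
        return true;
    }

    // Number of events suppressed so far; a caller could use this to
    // raise a separate "event suppressed" notification.
    std::size_t suppressed() const { return suppressed_; }

private:
    std::size_t max_per_window_;
    std::chrono::seconds window_;
    std::deque<Clock::time_point> accepted_;
    std::size_t suppressed_ = 0;
};
```

With a limit of 10 and a 60-second window, a burst of identical events results in at most 10 database inserts per minute, regardless of how many times the event fires.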

Tracing – allows all messages that pass through the Message Bus to be logged. Each passing message is matched against the listeners configured in Tracing and, on a match, is written into a local SQLite3 database file. BP-Node uses listener command aggregation to reduce the impact of SQL transactions. However, as the Message Bus processes a large number of messages of all kinds every second (even when not processing any financial transactions), setting a listener to listen to all messages can create a significant disk IO load and will very likely impact the production system’s processing performance. The BP-Node User Guide provides detailed information and recommendations on how to use Tracing safely on production systems.
Because all of the information needed for BP-Node operation and support is available via the tracing and monitoring interfaces provided in Health, External Monitoring, Events, Tracing and Debug Mode, system-wide tracing is not needed in production systems and should only be used as a last resort.
Why did we choose SQLite3? SQLite3 is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. Because it does not depend on any external database engine, it provides reliable access to SQL-organised data even when a persistence engine fails.
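The idea behind aggregating listener commands can be sketched in C++ as a small batching buffer: trace records are collected in memory and handed to a sink in groups, so that one SQL transaction can cover many inserts. This is a hypothetical illustration (TraceBatcher, Sink, add and flush are invented names); the sink stands in for code that would wrap the batch in a single transaction, and no SQLite calls are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class TraceBatcher {
public:
    // The sink receives a whole batch at once; in a real deployment it
    // would write the batch inside one database transaction (assumption).
    using Sink = std::function<void(const std::vector<std::string>&)>;

    TraceBatcher(std::size_t batch_size, Sink sink)
        : batch_size_(batch_size), sink_(std::move(sink)) {}

    // Buffer one trace record; flush automatically when the batch is full.
    void add(std::string record) {
        pending_.push_back(std::move(record));
        if (pending_.size() >= batch_size_) flush();
    }

    // Write out whatever is buffered, e.g. on a timer or at shutdown.
    void flush() {
        if (pending_.empty()) return;
        sink_(pending_);
        pending_.clear();
    }

private:
    std::size_t batch_size_;
    Sink sink_;
    std::vector<std::string> pending_;
};
```

Batching reduces the number of transactions, not the number of records, which is why a listener that matches every message can still generate a heavy write load.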

Debug mode – BP-Node has its own internal DEBUG logging, which is based on the Boost.Log library. This is a very powerful logging library that has a minimal impact on actual system runtime. Debug mode is switched off by default and needs to be switched on at BP-Node startup (see BP-Node User Guide – Chapter 5. Logging). It is highly recommended to turn debugging off once the debugging session is complete.
When BP-Node is run in Debug mode, all information collected is written straight to disk. It is not intended for use in production systems.

The overview above demonstrates some of the basic techniques BP-Node uses to maintain its performance and stability even under high loads. The BP-Node User Guide provides much deeper insight into this, and you can always turn to EFTlab’s support team if you have any questions.