Most teams ship a smart contract and then stare at Etherscan. That is not monitoring. That is hope. And hope does not page you at 3 AM when someone drains your liquidity pool.

We have operated on-chain systems that moved eight figures a month. The single biggest lesson was this: monitoring a smart contract is nothing like monitoring a web server. The failure modes are different. The data model is different. And the cost of missing an alert is wildly higher, because you cannot roll back a blockchain transaction.

This is the operational playbook we wish someone had written for us four years ago.

What to alert on

There are roughly five categories of on-chain events that deserve alerts. Miss any one and you have a blind spot that will eventually cost you money.

Balance thresholds

Every contract that holds funds needs a balance alert. Not just "balance is zero" but meaningful thresholds. If your vault contract normally holds between 500 and 5,000 ETH, you want an alert at 400 and another at 200. The first is a warning. The second means something is very wrong.

We set these as percentage drops too. If the balance drops 20% in an hour, that is an alert regardless of the absolute value. A 20% drawdown on a $50M vault is $10M. You want to know about that in seconds, not hours.

Same goes for gas wallets. If your relayer wallet drops below 0.5 ETH, your keeper bots stop working. We have seen protocols go offline for hours because nobody noticed the operational wallet ran dry.
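The threshold logic above is small enough to sketch directly. A minimal version using the 400/200 ETH floors and 20% hourly drop from the example; the function and parameter names are ours, not from any monitoring product:

```python
# Sketch of the balance-threshold checks described above. The 400/200 ETH
# floors and the 20% hourly drop mirror the vault example in the text.

def check_balance(current_eth: float, previous_eth: float,
                  warn_floor: float = 400.0, crit_floor: float = 200.0,
                  max_drop_pct: float = 20.0) -> list[str]:
    """Return alerts for two balance readings taken one hour apart."""
    alerts = []
    if current_eth <= crit_floor:
        alerts.append("CRITICAL: balance below hard floor")
    elif current_eth <= warn_floor:
        alerts.append("WARNING: balance below soft floor")
    if previous_eth > 0:
        drop_pct = (previous_eth - current_eth) / previous_eth * 100
        if drop_pct >= max_drop_pct:
            alerts.append(f"CRITICAL: balance down {drop_pct:.0f}% in an hour")
    return alerts
```

The same function covers the gas-wallet case by passing lower floors, e.g. warn_floor=0.5 for a relayer wallet.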

Abnormal transaction patterns

Set up anomaly detection on transaction frequency and value. If your contract normally processes 200 transactions per day and suddenly processes 2,000 in an hour, something is happening. Maybe it is legitimate demand. Maybe it is an exploit loop.

We track the ratio of successful to failed transactions too. A spike in reverts often signals that someone is probing your contract for edge cases. Automated exploit scripts generate a lot of failed transactions before they find the one that works.
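Both checks reduce to a few lines. A sketch, with illustrative baselines and thresholds rather than recommendations:

```python
# Sketch of the two anomaly checks above: a transaction-rate spike against a
# daily baseline, and a revert-ratio spike. All thresholds are illustrative.

def detect_anomalies(txs_last_hour: int, baseline_per_day: int,
                     failed: int, succeeded: int,
                     rate_multiplier: float = 5.0,
                     max_revert_ratio: float = 0.3) -> list[str]:
    """Return alerts for the last hour of activity on a contract."""
    alerts = []
    baseline_per_hour = baseline_per_day / 24
    if baseline_per_hour > 0 and txs_last_hour > rate_multiplier * baseline_per_hour:
        alerts.append("tx-rate spike")
    total = failed + succeeded
    if total > 0 and failed / total > max_revert_ratio:
        # many reverts often means someone is probing for edge cases
        alerts.append("revert-ratio spike")
    return alerts
```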

Privileged function calls

Every onlyOwner or onlyAdmin function call should generate an alert. Period. If someone calls pause(), setFee(), transferOwnership(), or upgradeProxy(), an engineer should see that notification within 60 seconds. If you did not expect the call, you need to be investigating immediately.

We had a case where a compromised multisig signer submitted a malicious transaction. The alert on the submitTransaction call gave the other signers 45 minutes to reject it before the timelock expired. That alert saved roughly $3M.

Oracle staleness and deviation

If your system depends on price feeds, monitor them separately from your own contracts. Track the updatedAt timestamp on every Chainlink feed you consume. If that timestamp is older than the feed's stated heartbeat plus a 10% buffer, you need to know.

Also track price deviation. If ETH/USD moves 15% in five minutes according to your oracle, either the market is crashing or the oracle is broken. Both require human attention.
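Both oracle checks can be sketched in a few lines. Timestamps are Unix seconds; the example heartbeat and the 15% deviation threshold are illustrative:

```python
# Sketch of the two oracle checks above: staleness against the feed's stated
# heartbeat plus a 10% buffer, and deviation between consecutive readings.

def oracle_alerts(updated_at: int, now: int, heartbeat_s: int,
                  price: float, prev_price: float,
                  max_deviation_pct: float = 15.0) -> list[str]:
    """Return alerts for one price-feed reading."""
    alerts = []
    if now - updated_at > heartbeat_s * 1.1:  # heartbeat + 10% buffer
        alerts.append("oracle stale")
    if prev_price > 0:
        deviation = abs(price - prev_price) / prev_price * 100
        if deviation >= max_deviation_pct:
            # either the market is crashing or the oracle is broken
            alerts.append("oracle deviation")
    return alerts
```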

Mempool activity

This one is harder but worth it for high-value contracts. Monitor the mempool for pending transactions targeting your contracts. If you see a large withdrawal sitting in the mempool followed by a suspiciously timed swap on the same pair, you are looking at a sandwich attack or worse.

Tools like Blocknative and Forta give you mempool visibility. The latency matters. You need sub-second notification to have any chance of responding.

What to log

On-chain data is public and permanent. But "public" does not mean "queryable in a useful way." You need your own indexed log pipeline.

Event logs with context

Every emit in your contract should include enough context to reconstruct what happened without decoding calldata. We made the mistake early on of emitting minimal events. A Transfer(address, address, uint256) event tells you what moved. It does not tell you why.

Add a reason or context field to your events where possible. "Withdrawal initiated by user" is different from "withdrawal triggered by liquidation bot." When you are debugging an incident at 4 AM, that extra field saves you 30 minutes of calldata decoding.

Off-chain actions tied to on-chain state

Your backend makes decisions based on on-chain data. Log those decisions explicitly. "Saw balance X on block Y, decided to trigger rebalance Z." When something goes wrong, you need to trace the full chain of causation from the on-chain event through your backend logic to the resulting transaction.

We use structured JSON logs with block_number and tx_hash fields on every log line that relates to on-chain state. That makes cross-referencing trivial during incidents.
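A minimal sketch of that convention using only the standard library. The helper name and extra fields are ours; block_number and tx_hash are the fields named above:

```python
# Sketch of the structured-log convention: every line that touches on-chain
# state carries block_number and tx_hash so incidents cross-reference easily.
import json

def log_line(message: str, block_number: int, tx_hash: str, **fields) -> str:
    """Build one JSON log line tied to on-chain state."""
    return json.dumps({"msg": message,
                       "block_number": block_number,
                       "tx_hash": tx_hash,
                       **fields})

# e.g. print(log_line("triggering rebalance", 19000123, "0xabc...",
#                     vault_balance_eth=412.5))
```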

RPC performance

Log your RPC call latency, error rates, and block lag. If your Alchemy node is 3 blocks behind, your monitoring is looking at stale data and your alerts are delayed. We have seen RPC providers silently degrade, returning correct but delayed data. Your monitoring tells you everything is fine while your users see something completely different.

Run at least two RPC providers. Compare their responses. If they disagree, alert on that too.
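The cross-check reduces to comparing block heights. A sketch with placeholder provider names; in production each height would come from that provider's eth_blockNumber call:

```python
# Sketch of the dual-provider cross-check: alert when any provider's latest
# block height lags the best one we can see. Provider names are placeholders.

def rpc_health(heights: dict[str, int], max_lag_blocks: int = 3) -> list[str]:
    """Compare latest block heights reported by each RPC provider."""
    alerts = []
    best = max(heights.values())
    for name, height in heights.items():
        lag = best - height
        if lag > max_lag_blocks:
            alerts.append(f"{name} is {lag} blocks behind")
    return alerts
```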

Building runbooks

A runbook is a document that tells an on-call engineer what to do when an alert fires. In traditional infrastructure, runbooks are common. In crypto, almost nobody writes them. This is insane given the stakes involved.

Structure of a good runbook

Every runbook should answer four questions: What is the alert? Why does it matter? What do I check first? What actions can I take?

That last question is the hard one. In traditional ops, you restart the service. In smart contract ops, your options might be limited to calling pause() on the contract, blacklisting an address, or submitting a governance proposal. Document exactly which multisig needs to sign, which safe has the authority, and what the timelock delay is.

The pause decision tree

We build a specific runbook for "should I pause the contract." This gets its own document because it is the highest-stakes decision an operator can make. Pausing stops an exploit. It also stops all legitimate users. If you pause a DEX during a market crash, your users cannot exit positions.

Our decision tree asks: Are user funds at active risk right now? If yes, pause immediately. If unsure, escalate to the security lead within 5 minutes. If the security lead is unreachable, pause. You can unpause. You cannot un-drain.

Communication templates

Write your incident communication messages before the incident happens. When your protocol is under attack, you do not want to be wordsmithing a tweet. Have templates for "we have paused the protocol while investigating an issue," "we have identified the issue and are working on a fix," and "the issue is resolved, here is what happened."

Include your Discord, Twitter, and Telegram channels in the runbook. Include the login credentials or the name of the person who has them. During an incident, "who has the password to the official Twitter account" should not be a question anyone is asking.

Escalation paths

Define who gets paged at each severity level. For us, it looks like this:

P3 (informational anomaly) goes to Slack.
P2 (potential issue, no funds at risk yet) pages the on-call engineer.
P1 (confirmed issue, funds possibly at risk) pages the on-call engineer and the security lead simultaneously.
P0 (active exploit, funds draining) pages everyone and triggers an automatic contract pause if that capability exists.
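The routing table is worth encoding rather than leaving in prose. A sketch with placeholder channel names; note that an unknown severity fails toward paging a human:

```python
# Sketch of the severity routing table above. Channel names are placeholders
# for your Slack / PagerDuty integrations.

ROUTING = {
    "P3": ["slack:#alerts"],
    "P2": ["page:on-call"],
    "P1": ["page:on-call", "page:security-lead"],
    "P0": ["page:on-call", "page:security-lead", "page:everyone", "auto-pause"],
}

def route(severity: str) -> list[str]:
    # unknown severities default to paging someone, never to silence
    return ROUTING.get(severity, ["page:on-call"])
```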

Test these escalation paths. Actually fire a test P1 at 2 AM on a Tuesday. See if the right people respond within the SLA. If they do not, fix the process before you need it for real.

Tooling we actually use

OpenZeppelin Defender for automated transaction execution and monitoring. Forta for agent-based threat detection. Tenderly for transaction simulation and alerting. Grafana plus Prometheus for custom dashboards that combine on-chain and off-chain metrics on one screen.

No single product covers everything. You will end up with at least three systems feeding into one alerting pipeline. We route everything through PagerDuty so there is one place where alerts land regardless of the source.

Custom indexers are also worth the effort for anything beyond basic monitoring. The Graph works for straightforward event indexing. But if you need to track complex state across multiple contracts, you will end up writing your own indexer with something like Ponder or a plain Node.js process reading from an archive node.

The operational gap nobody talks about

Here is the thing that still surprises me. Teams will spend $200K on a smart contract audit and then spend $0 on operational monitoring. The audit catches bugs in the code. Monitoring catches bugs in the real world: exploits the auditors missed, oracle failures, governance attacks, economic attacks that only manifest under specific market conditions.

An audit is a snapshot. Monitoring is continuous. You need both.

We budget 15 to 20% of the initial contract development cost for monitoring infrastructure. That covers the alerting setup, the indexer, the dashboards, the runbooks, and the first three months of on-call rotation. After that, the ongoing cost is mostly the on-call engineer's time and the RPC provider bills.

The teams that take monitoring seriously are the teams that survive incidents without losing user funds. That is not a coincidence.