Part 3 — Blockchain heuristics through time
By Coinbase Special Investigations Team
In our last post we introduced the cornerstone of scaling up blockchain analysis, commonspend, and its pitfalls. In this blog post we’ll explore more complex and novel blockchain analysis scaling methods, their drawbacks and why time is a critical feature of blockchain analytics.
1. Change prediction
Change prediction is the second most commonly applied UTXO heuristic. It aims to predict which receiving address is controlled by the sender. A hallmark of UTXO blockchains is that an unspent output must be spent in full: when an address transacts, the entire value of each input moves. The surplus amount is normally returned to the sender via a change address.
Consider the transaction below and try spotting the change address that belongs to the sender:
The change address is likely 374jbPUojy5pbmpjLGk8eS413Az4YyzBq6. Why? In this case, prediction logic relies on the fact that the above address is in the same address format as the input addresses (P2SH format, where sender’s addresses start with a “3”).
Among other factors, rounded amounts (i.e. 0.05 or 0.1 BTC) are often recognized as the actual send, with the rest being redirected to the change address. This suggests that change prediction relies not only on technical indicators, but also on elements of human behavior, like our affinity for rounded numbers.
Naturally, a more liberal change prediction logic that takes into account multiple variables in favor of a desired outcome can potentially lead to misattribution and mis-clustering. In particular, blockchain analytics tools can inadvertently fall into the trap of unsupervised change prediction — that’s why it is vital for blockchain investigators to be mindful of the limitations posed by this approach.
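To make the two indicators above concrete, here is a minimal sketch of a change predictor that scores each output on address format and on whether the amount is rounded. The function names, the equal weighting of the two signals, the placeholder input addresses and the amounts are all assumptions made for illustration; they are not how any particular analytics tool works. The sketch deliberately returns no answer when the outputs tie, which is the conservative choice.

```python
from decimal import Decimal


def address_type(address: str) -> str:
    """Classify a Bitcoin address by its prefix (simplified)."""
    if address.startswith("bc1"):
        return "bech32"
    if address.startswith("3"):
        return "p2sh"
    if address.startswith("1"):
        return "legacy"
    return "unknown"


def is_round_amount(amount_btc: Decimal, places: int = 2) -> bool:
    """True if the amount has at most `places` decimal digits, e.g. 0.05 or 0.1."""
    return amount_btc == round(amount_btc, places)


def score_outputs(input_addresses, outputs):
    """Return {address: score}; a higher score means more likely to be change.

    The weights (one point per signal) are an arbitrary assumption.
    """
    input_types = {address_type(a) for a in input_addresses}
    scores = {}
    for address, amount in outputs:
        score = 0
        if address_type(address) in input_types:
            score += 1  # same format as the sender's addresses
        if not is_round_amount(amount):
            score += 1  # leftover amounts are rarely rounded
        scores[address] = score
    return scores


def predict_change(input_addresses, outputs):
    """Return the single best-scoring output, or None if the result is ambiguous."""
    scores = score_outputs(input_addresses, outputs)
    best = max(scores.values())
    candidates = [a for a, s in scores.items() if s == best]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical inputs and amounts; only the 374jb... address comes from the post.
inputs = ["3ExampleInputA", "3ExampleInputB"]
outputs = [
    ("374jbPUojy5pbmpjLGk8eS413Az4YyzBq6", Decimal("0.0731")),
    ("1ExampleRecipient", Decimal("0.05")),
]
print(predict_change(inputs, outputs))
```

Adding more signals, or accepting a winner on a narrow margin, makes the predictor more liberal and raises the risk of mis-clustering described above.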
2. Change prediction, not a fact
Consider a more challenging example:
We have legacy addresses (starting with a “1”) sending on to two other legacy addresses. So which one is the change address?
The best way to figure out which address is the change address is to look at how each address spends BTC onwards. Usually output addresses receiving rounded amounts are not change addresses — but this could be wrong. So let’s just place our bet on the latter output address:
1Hs6XkSpuLguqaiKwYULH4VZ9cEkHMbsRJ. Its next transaction is as follows:
At first glance, this sort of looks like the pattern we saw in a previous transaction. The only aspect that stands out is a significant decrease in fees.
Looking at a second output address — 12Y8szPTeVzupEfe5RXs84fRsJJZBVhTgG — we see that its next transaction is distinct from the transaction it previously made:
The fees also look low compared to our initial transaction. We also notice that the next transactions of both output addresses include the original 1Hs6XkSpuLguqaiKwYULH4VZ9cEkHMbsRJ address among their outputs. Following the address’s next transaction, we arrive at output #1’s next transaction.
To simplify, let’s visualize:
The diamonds in the above graph represent transactions — whereas the circles represent addresses. Notice that input address 15sMm6Rkf9hzz6ZtrrdhxdWZ8jGW12gQ93 commonspends in a transaction with 12Y8szPTeVzupEfe5RXs84fRsJJZBVhTgG. Therefore, output address #2 is in fact our change address!
This example illustrates how complicated change prediction can become, and how easily it can lead to erroneous results.
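The check we just walked through, confirming a change address by a later commonspend with the sender, can be sketched with a small union-find structure. This is an illustrative simplification: the transaction dictionaries below are reduced to lists of addresses, the contents of the later transaction are assumed from the description above rather than copied from the chain, and the helper names are hypothetical.

```python
class UnionFind:
    """Minimal disjoint-set structure for grouping addresses into clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_by_commonspend(transactions):
    """Merge all input addresses of each transaction into one cluster."""
    uf = UnionFind()
    for tx in transactions:
        inputs = tx["inputs"]
        for address in inputs:
            uf.find(address)  # register the address
        for address in inputs[1:]:
            uf.union(inputs[0], address)
    return uf


def confirmed_change(original_tx, later_transactions):
    """Return outputs of original_tx that later commonspend with its sender."""
    uf = cluster_by_commonspend([original_tx] + later_transactions)
    sender_root = uf.find(original_tx["inputs"][0])
    return [o for o in original_tx["outputs"] if uf.find(o) == sender_root]


# Simplified, assumed shape of the transactions discussed above.
original = {
    "inputs": ["15sMm6Rkf9hzz6ZtrrdhxdWZ8jGW12gQ93"],
    "outputs": [
        "1Hs6XkSpuLguqaiKwYULH4VZ9cEkHMbsRJ",
        "12Y8szPTeVzupEfe5RXs84fRsJJZBVhTgG",
    ],
}
later = [
    {
        "inputs": [
            "15sMm6Rkf9hzz6ZtrrdhxdWZ8jGW12gQ93",
            "12Y8szPTeVzupEfe5RXs84fRsJJZBVhTgG",
        ],
        "outputs": ["1Hs6XkSpuLguqaiKwYULH4VZ9cEkHMbsRJ"],
    }
]
print(confirmed_change(original, later))
```

Note that this confirmation only becomes available once the later transaction exists, and it inherits every pitfall of commonspend itself.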
3. Bespoke heuristics are still heuristics
Entities that attempt to preserve privacy in very public blockchains, such as exchanges and dark markets, may go out of their way to create their own wallet infrastructure that makes it difficult for blockchain investigators to identify how they operate. For these cases, blockchain analytics companies will create bespoke heuristics for these particular entities.
Still, no heuristics are foolproof. Parameters and limitations for blockchain analysis depend on how restrictive the scope is — or how much room is left for interpretation. A conservative approach would dictate not attributing anything that cannot be determined with close to 100% certainty; a liberal approach would allow wider attribution, at the cost of expanding the potential margin of error.
This also applies to any bespoke heuristic that is constructed with specific blockchain entities in mind. This is illustrated well by the above-mentioned Wasabi coinjoin example. Although the transaction in question is highly likely to belong to Wasabi wallet, we need to ask ourselves what this transaction is displaying:
Most likely this transaction is displaying Wasabi addresses commonspending with other users’ addresses. As complexity increases, the accuracy of attribution decreases — especially if we consider that a user might own one or more addresses in this transaction.
Every blockchain analytics tool will have a different set of parameters and rely on different heuristics. That is why differences between clusters displayed by various tools are so common. For example, the SilkRoad cluster will look different each time, depending on the blockchain analytics software used to analyze it.
In fact, even with only commonspend applied, we see that the block explorers CryptoID and WalletExplorer show different sizes for the Local Bitcoins cluster.
4. In blockchain analytics the future can impact the past
Einstein would probably admire blockchains, because they are one of the few examples of where the future can change the past — at least from an attribution perspective. For example, 14FUfzAjb91i7HsvuDGwjuStwhoaWLpGbh received various transactions from a P2P service provider between August and mid-September 2021. So we might think that this address could belong to an unhosted wallet.
But if we check on that address a couple days later on September 30, 2021, we suddenly notice that it’s been tagged as Unicc, a carding shop. What happened? This address commonspent 15 days later with an address we already knew belonged to Unicc, making it a part of the Unicc cluster.
This is a simple example, but you can imagine from a Compliance and market intelligence perspective that these after-the-fact attributions can have some ripple effects.
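One way to picture this effect is to rebuild commonspend clusters using only the transactions visible up to a given date, then ask what label an address carries on that date. The sketch below is an assumption-laden illustration: the known Unicc address is a placeholder, the commonspend date is hypothetical (the post only says it happened about 15 days after mid-September 2021), and `label_as_of` is a made-up helper name.

```python
from datetime import date


def label_as_of(address, transactions, known_labels, as_of):
    """Return the label of `address` using only transactions dated <= as_of.

    Addresses are clustered by commonspend (shared inputs); the address
    inherits the label of any known address in its cluster, else None.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for tx in transactions:
        if tx["date"] > as_of:
            continue  # this transaction is still "in the future"
        inputs = tx["inputs"]
        for other in inputs[1:]:
            root_a, root_b = find(inputs[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    root = find(address)
    for labelled_address, label in known_labels.items():
        if find(labelled_address) == root:
            return label
    return None


# Placeholder data: the Unicc address and the exact date are hypothetical.
transactions = [
    {
        "date": date(2021, 9, 29),
        "inputs": ["14FUfzAjb91i7HsvuDGwjuStwhoaWLpGbh", "1KnownUniccExample"],
    }
]
known_labels = {"1KnownUniccExample": "Unicc"}

target = "14FUfzAjb91i7HsvuDGwjuStwhoaWLpGbh"
print(label_as_of(target, transactions, known_labels, date(2021, 9, 15)))
print(label_as_of(target, transactions, known_labels, date(2021, 9, 30)))
```

The same address yields a different answer depending on the `as_of` date, which is why any attribution should be read together with the date on which it was made.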
Blockchain analytics is an increasingly complex field of expertise. It is not as straightforward as it seems and the difficulty is compounded by the fact that conclusions are drawn not only from blockchain, but also from external sources that are often ambiguous.
It is not possible to call blockchain analytics science — after all, scientific experiments can be replicated by unrelated parties who, by following a set scientific methodology, will come to the same conclusions. In blockchain analytics even the ground truth can have multiple facades, meanings and interpretations.
Certainty of attribution is scarce, and because multiple parties rely on different tools to trace transactions on blockchains, the same investigation can sometimes yield dramatically different results. That is why educational efforts in this area should continuously emphasize that even the most robust, tooled-up methodologies are prone to errors.
Nothing is infallible — after all, blockchain analytics is more art than science.
Part 3 — Blockchain heuristics through time was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.