I came across an article whose author described the process of creating a smart contract and deploying it to the Ethereum mainnet. He also published the source code.
I didn’t have experience with smart contract development and blockchain analysis, but after reading the article I began to wonder:
Is it possible to use this public information to find the Ethereum address of this person?
It would be fun to see how much money he holds ;-)
TL;DR
Given a smart contract’s source code, it is possible to find the deployed contract using blockchain analysis. I used Google BigQuery to search for specific function signatures. At the end, I give some hints on how to avoid deanonymization.
Disclaimer
I don’t want to disclose the personal details of the smart contract’s author. Let’s just call him John. I contacted John and he approved this article before publication.
Don’t use the knowledge you got here (or anywhere else) to hurt others.
Challenge
I can define a challenge as entering unknown territory with a clear goal in mind. In the area of computer security and privacy, this feeling boosts my creativity in problem-solving, forces me to learn new things fast, perform reverse engineering and try to get into the state of mind of a given system creator.
It gives me the satisfaction of a deep understanding of the system and a thrill, both at the same time. With that motivation, I started researching the problem.
Examine the code
My first attempt was to check for any hardcoded Ethereum addresses in John’s code. The common practice is to set the contract owner’s address dynamically during deployment instead of hardcoding it, and that was also the case here. Still, it costs nothing to check.
constructor() {
owner = msg.sender; // set owner to address of caller
}
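A check like this is easy to automate. Here is a minimal sketch in Python, assuming an address literal appears in the source as 0x followed by exactly 40 hex digits; the function name and the sample contract are made up for illustration and are not John’s code.

```python
import re

# Assumption: a hardcoded address shows up as 0x + exactly 40 hex digits (20 bytes).
ADDRESS_RE = re.compile(r"\b0x[0-9a-fA-F]{40}\b")


def find_hardcoded_addresses(source: str) -> list[str]:
    """Return the distinct address-like literals, in order of first appearance."""
    found = []
    for literal in ADDRESS_RE.findall(source):
        if literal not in found:
            found.append(literal)
    return found


# Hypothetical contract that, like John's, sets the owner dynamically.
SAMPLE = """
contract Example {
    address owner;
    constructor() {
        owner = msg.sender; // set owner to address of caller
    }
}
"""

print(find_hardcoded_addresses(SAMPLE))
```

An empty list means there is nothing to follow up on, which is what I saw with John’s code.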
Compare bytecode
Every smart contract’s bytecode is available publicly on the blockchain. Therefore my second attempt was to compile John’s code and then use some tool to compare it with every smart contract on Ethereum.
I started by finding a way to query the blockchain for the information I wanted. It appears you can’t search by bytecode in popular blockchain explorers such as Etherscan or Etherchain. Googling bytecode is not a good idea, because Google doesn’t allow such large queries.
BigQuery
However, I found that Google maintains a BigQuery public dataset for Ethereum and updates it daily. From the user’s perspective, BigQuery is just a large SQL database, similar to the PostgreSQL or MySQL databases that developers use routinely. Its usage is free (within limits) and there is a web-based query tool called BigQuery Console.
I found it perfect for the job. Looking at the schema, there is a contracts table and it contains a bytecode column. To see the bytecode of 10 arbitrary contracts deployed during the last day, I could execute the following SQL statement:
SELECT bytecode
FROM
bigquery-public-data.crypto_ethereum.contracts
WHERE
block_timestamp >
CAST(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AS TIMESTAMP)
LIMIT 10
The timestamp constraint is optional but recommended. This table is huge, and restricting the time interval prevents Google from processing a lot of irrelevant data. I skipped this constraint in my first trials, which quickly ate all of my quota.
Knowing the exact bytecode and blog post’s publication date I could find the contract address by executing the following statement:
SELECT address
FROM
bigquery-public-data.crypto_ethereum.contracts
WHERE
block_timestamp > {month before post publication date}
AND
bytecode = {compiled bytecode of smart contract}
LIMIT 10
address is the column containing the contract address.
Now I only needed the bytecode of this smart contract.
Compilation
I decided to use a free web-based IDE called Remix because it doesn’t require setting up the development environment.
I created a .sol file in Remix and pasted the code of the contract. Then I opened the Solidity compiler tab and set the compiler version to match the version specified in the code. Each file has the following line at the top:
pragma solidity SPEC
In place of SPEC there is a version specification.
Using the same compiler version is important because different versions may produce different bytecode, and I was trying to generate an exact copy of the bytecode deployed by John.
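If you have many files to check, the version specification can be pulled out with a few lines of Python. This is only a sketch: the regular expression assumes the pragma sits on its own line, and the function name is my own invention.

```python
import re
from typing import Optional

# Matches a line such as "pragma solidity ^0.8.0;" and captures the specification.
PRAGMA_RE = re.compile(r"^\s*pragma\s+solidity\s+([^;]+);", re.MULTILINE)


def solidity_version_spec(source: str) -> Optional[str]:
    """Return the text between 'pragma solidity' and ';', or None if absent."""
    match = PRAGMA_RE.search(source)
    return match.group(1).strip() if match else None


# Hypothetical source file, used only to show the call.
print(solidity_version_spec("pragma solidity ^0.8.0;\ncontract Example {}"))
```

The returned string is what you would compare against the compiler versions offered in Remix.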
The code was compiled successfully, and I was able to get bytecode.
Blockchain search
To my surprise, the SQL query mentioned above didn’t return any rows. I modified the WHERE clause to use SQL LIKE, which allows finding partial matches:
bytecode LIKE '%{part of compiled bytecode}%'
Then I experimented with passing different parts of the bytecode. However, there were no meaningful results either.
Optimization
Then it hit me. The Solidity compiler optimizes code to reduce code size (contract deployment is cheaper) and execution cost (users have to pay less for gas).
The optimization level is controlled by a parameter specified during compilation.
This parameter is a number from 0 to 4294967295. I suspect each number produces slightly different bytecode.
I didn’t know what optimization parameter had been used. I didn’t want to compile the code and execute SQL queries against the blockchain 4294967296 times. After trying to guess common values a few times, I gave up. There had to be another way.
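To get a feeling for how hopeless brute force would be, here is a back-of-the-envelope calculation in Python. The 10 seconds per attempt (one compilation plus one query) is purely my assumption for illustration, not a measurement.

```python
# Number of possible values of the optimization parameter: 0..4294967295.
PARAMETER_VALUES = 2 ** 32

# Assumption for illustration only: one compile-and-query attempt takes 10 seconds.
SECONDS_PER_ATTEMPT = 10

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

years = PARAMETER_VALUES * SECONDS_PER_ATTEMPT / SECONDS_PER_YEAR
print(f"{PARAMETER_VALUES} attempts would take roughly {years:.0f} years")
```

Even if my guess about the time per attempt is off by a factor of a hundred, this is not a weekend project.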
Function signatures
I dug into the Ethereum dataset. One of the columns in the SQL schema got my attention: function_sighashes.
After some research, I found that, in order for the smart contract’s functions to be called by users or other smart contracts, there needs to be some kind of mapping between the function name and an address in bytecode where the function logic resides.
But what if someone likes to use veryLongFunctionNames? Does it mean the name of the function has to be embedded in smart contract bytecode, increasing blockchain size and eating precious gas?
This may be one of the reasons Ethereum uses a concept called the function signature. The function signature is created by hashing the prototype string and discarding anything after the first 4 bytes. The hash algorithm used is Keccak256, which you may know as the basis of SHA-3. However, the standardized SHA-3 pads its input differently, therefore its output is different.
For example, to get the function signature for a function myFunction which receives 1 argument of type uint256:
1. Prepare the function prototype, let’s say myFunction(uint256).
2. Find an online Keccak256 generator and paste the function prototype there.
3. You will get 50628c969c386d878aac8a993492e42110c19ba346d377fec055d2d56124b695.
4. Remove anything after the first 4 bytes and add the 0x prefix to make it clear we are dealing with a hexadecimal number. The result is 0x50628c96.
Those four bytes, along with the mapping method, will show you where the function resides in a compiled contract (this is a simplification meant to convey the general idea; see the limitations below).
Back to BigQuery
I’ve calculated function signatures for every function of John’s smart contract. Now I was armed with the information needed to find the address of this contract:
SELECT address
FROM
bigquery-public-data.crypto_ethereum.contracts
WHERE
'0xf68deb93' IN UNNEST(function_sighashes)
AND
block_timestamp > {month before post publication date}
LIMIT 10
This simplified example contains only one function signature (0xf68deb93). To find John’s contract I had to add more of them to the WHERE condition.
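Writing those conditions by hand gets tedious, so here is a small sketch of how the query text could be generated in Python. It only builds a string; running it is still up to you. I assume the dataset stores signatures as lowercase 0x-prefixed strings, as in the example above. The function name, the second signature (the well-known one for transfer(address,uint256), not one of John’s) and the date are placeholders.

```python
import re

TABLE = "bigquery-public-data.crypto_ethereum.contracts"


def build_contract_query(selectors, since_date, limit=10):
    """Build a query that requires every given signature to be present."""
    for selector in selectors:
        # Reject anything that is not exactly 0x + 8 lowercase hex digits.
        if not re.fullmatch(r"0x[0-9a-f]{8}", selector):
            raise ValueError(f"not a 4-byte signature: {selector!r}")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", since_date):
        raise ValueError(f"expected a YYYY-MM-DD date, got {since_date!r}")
    conditions = [f"'{s}' IN UNNEST(function_sighashes)" for s in selectors]
    conditions.append(f"block_timestamp > TIMESTAMP('{since_date}')")
    where = "\n  AND ".join(conditions)
    return f"SELECT address\nFROM `{TABLE}`\nWHERE {where}\nLIMIT {int(limit)}"


# Placeholder inputs: one signature from the example above, one unrelated
# well-known signature, and an arbitrary date.
print(build_contract_query(["0xf68deb93", "0xa9059cbb"], "2020-01-01"))
```

Each additional signature narrows the result set, which is why adding all of the contract’s functions is so effective.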
Yay! The query gave exactly one result. Is it the contract address I was looking for?
Verification
I used Etherscan to get more information about this address. Etherscan confirmed this is a contract address and allowed me to decompile it using a built-in online tool. The resulting code looked similar to the original code published by John. I found it!
Limitations
Function signatures are extracted from contract bytecode using heuristics. In case the contract’s bytecode doesn’t follow conventions, it may be hard or even impossible to obtain function signatures. Therefore some function signatures may not be available in the BigQuery dataset.
For more details, check out how the Ethereum dataset was built and how those heuristics work.
This is how I understand why some function signatures are not available in the dataset. If you have a more accurate explanation, please share it below.
Extracting information
Let’s see what information I can get knowing the contract address.
The most important data is the Ethereum account address which interacted with the contract. I was able to find it easily in the list of the contract’s transactions on Etherscan. There was only one, therefore I could safely assume it was the address of John.
The rest is simple: when you have someone’s address you can get his entire transaction history and balance. It’s like a bank statement, but public.
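I looked the balance up on Etherscan, but the same number is available from any Ethereum node through the standard JSON-RPC method eth_getBalance. The sketch below swaps in that approach: it only prepares the request body and converts a node’s answer (a hex number of wei) into Ether, and it does not send anything over the network. The helper names and the all-zero address are placeholders.

```python
import json
from decimal import Decimal

WEI_PER_ETHER = Decimal(10) ** 18


def balance_request(address: str, request_id: int = 1) -> str:
    """JSON-RPC body asking for the balance of an address at the latest block."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "eth_getBalance",
        "params": [address, "latest"],
        "id": request_id,
    })


def wei_hex_to_ether(result: str) -> Decimal:
    """Convert the hex quantity returned by eth_getBalance into Ether."""
    return Decimal(int(result, 16)) / WEI_PER_ETHER


# Placeholder address (all zeros), not John's.
print(balance_request("0x" + "00" * 20))
# Example answer: 0xde0b6b3a7640000 wei is exactly one Ether.
print(wei_hex_to_ether("0xde0b6b3a7640000"))
```

You would POST the request body to a node endpoint you have access to and feed the result field of the response into wei_hex_to_ether.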
Now I could contact John and tell him the exact amount of Ether he holds. His reaction was worth the time I’ve put into this task :)
How to avoid being doxxed on Ethereum?
Ethereum transactions are pseudonymous, which means the user performing them is private as long as his address (pseudonym) can’t be linked to his real identity. John runs a blog, therefore his identity is public. Smart contract source code is also public and linked to the blog. Therefore anyone who is able to perform the steps I described above could deanonymize him.
If you want to deploy a contract whose source code is public and associated with you, here are some ideas to make connecting your real identity with your main Ethereum address harder:
obfuscate source code before deployment - this makes finding the smart contract more difficult,
separate contract deployment from contract usage: deploy the contract from an address generated specially for this purpose and don’t use it for anything else - others can see who created the contract, but won’t be able to easily connect this information with your transactions (which in most cases reveal your financial situation), especially if you follow the next piece of advice,
if the contract is meant to be used by other people - don’t be the first to use it, wait for others to transact - hide in a crowd; break up a transaction into many smaller ones and execute them at irregular intervals from many different, unrelated accounts, preferably with different transaction histories; this method is similar to techniques employed by anonymity mixers such as Tornado.cash,
make sure all addresses you use (for deployment and transactions) are anonymously funded and not linked to your main account.
I plan to write a dedicated article about maintaining anonymity on the Ethereum blockchain. I will link it here. Subscribe to get it right after publication.
Final word
Thank you for reading this little deanonymization story, I hope you enjoyed it :)
In case you would like to add something, have a question, or found an error, please comment down below.
Privacy is important. If you think this article will be useful to others, please spread the word.