As a backend developer in the blockchain space, building a crawler is one of the most exciting tasks. In this article, we will explore how to build an EVM crawler to synchronize data from the blockchain (onchain) with a database. This tool is essential for decentralized applications (dApps), especially when handling transaction data or events from smart contracts.
1. General Principles of a Crawler
An EVM crawler acts as a "continuous listener" on the blockchain, collecting data from blocks and smart contract events, then storing or processing that data in a local system (typically a database). The general principles include:
Connecting to the blockchain: Using a node provider (such as Infura or Alchemy) to access blockchain data via JSON-RPC (see the sketch after this list).
Determining the starting point: Deciding the starting block (typically the contract deployment block) and the ending block (usually the latest block).
Batch data collection: To avoid overload, data is collected in groups of blocks (batch size).
Processing events: Crawling event logs from smart contracts (e.g., Deposit, Withdrawal) and extracting necessary information.
Syncing with the database: Storing data in databases like MongoDB or PostgreSQL while updating processing status (e.g., the last crawled block).
Handling reorgs: Since the blockchain can experience forks or reorgs, the crawler must detect and handle them to ensure data accuracy.
Confirming transactions: Ensuring transactions reach the required number of confirmations before marking them as complete.
Continuous execution: The crawler runs continuously or on a schedule to pick up the latest data, while making sure multiple crawl jobs never run in parallel.
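To make the first two principles concrete, here is a minimal sketch of connecting to a provider and determining the block range. It assumes ethers v6 and an ESM setup with top-level await; RPC_URL and DEPLOY_BLOCK are placeholders.

import { ethers } from "ethers";

// Connect through a node provider (Infura, Alchemy, or any JSON-RPC endpoint)
const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);

// Starting point: the block the contract was deployed in (placeholder value)
const DEPLOY_BLOCK = 12_000_000;

// Ending point: the latest block the node currently knows about
const latestBlock = await provider.getBlockNumber();

console.log(`Crawling from block ${DEPLOY_BLOCK} to ${latestBlock}`);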
2. Common Mistakes When Building a Crawler
When developing an EVM crawler, here are some common mistakes to avoid:
Not handling reorgs:
Blockchain history can change due to forks. If reorgs are not checked, the database may become out of sync with the actual blockchain state.
What is a chain reorg?: A reorganization (reorg) occurs when a temporarily accepted chain of blocks (usually the result of a fork) is replaced by a longer chain, or by the chain the network settles on as canonical. For example, one miner might produce block A at height 100 while another produces block B at the same height, and the chain containing B grows faster. Block A and its transactions are then discarded in favor of block B. If the crawler had already stored data from block A without detecting the reorg, the database would no longer reflect the correct onchain state.
Solution: Store hashes of recent blocks and compare them with the blockchain to detect changes. If a block hash changes, roll back data from that block and re-crawl.
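As an illustration, a reorg check could look like the sketch below. It assumes ethers v6 plus two hypothetical mongoose models, BlockModel (stores the hash of every crawled block) and EventModel (stores decoded events); neither is part of the original code.

async function detectReorg(provider: ethers.JsonRpcProvider, lastCrawledBlock: number): Promise<boolean> {
  const stored = await BlockModel.findOne({ number: lastCrawledBlock });
  if (!stored) return false; // nothing crawled yet

  const onchain = await provider.getBlock(lastCrawledBlock);
  if (onchain && onchain.hash === stored.hash) return false; // hashes match, no reorg

  // The block was replaced: roll back everything from this height and re-crawl it
  await EventModel.deleteMany({ blockNumber: { $gte: lastCrawledBlock } });
  await BlockModel.deleteMany({ number: { $gte: lastCrawledBlock } });
  return true;
}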
Fetching too much data at once:
Retrieving all data from block 0 to the latest block in one go can overload the node or exceed API limits.
Solution: Use a reasonable batch size (e.g., 1000 blocks per fetch) and crawl progressively. Only crawl from the contract deployment block.
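A sketch of that batching loop, still assuming ethers v6; BATCH_SIZE, CONTRACT_ADDRESS, saveLogs, and updateLastCrawledBlock are illustrative names rather than the repository's actual ones.

const BATCH_SIZE = 1000;

async function crawlRange(provider: ethers.JsonRpcProvider, fromBlock: number, latestBlock: number) {
  for (let start = fromBlock; start <= latestBlock; start += BATCH_SIZE) {
    const end = Math.min(start + BATCH_SIZE - 1, latestBlock);

    // Fetch only this contract's logs for the current batch of blocks
    const logs = await provider.getLogs({
      address: CONTRACT_ADDRESS,
      fromBlock: start,
      toBlock: end,
    });

    await saveLogs(logs);              // persist events (see the bulk write example below)
    await updateLastCrawledBlock(end); // checkpoint so a restart resumes from here
  }
}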
Not checking confirmations:
A transaction that has not yet reached enough confirmations can still be dropped or replaced by a reorg. Updating the database immediately can therefore lead to incorrect data, especially for balance-related updates.
Solution: Wait for sufficient confirmations (usually between 6 and 12) before confirming a transaction and updating balances.
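For example, pending transactions can be finalized only once they are deep enough in the chain. PendingTxModel and REQUIRED_CONFIRMATIONS below are assumptions for the sketch, not names from the repository.

const REQUIRED_CONFIRMATIONS = 12;

async function confirmPendingTransactions(provider: ethers.JsonRpcProvider) {
  const latestBlock = await provider.getBlockNumber();
  const pending = await PendingTxModel.find({ status: "pending" });

  for (const tx of pending) {
    const confirmations = latestBlock - tx.blockNumber + 1;
    if (confirmations >= REQUIRED_CONFIRMATIONS) {
      // Only now is it safe to mark the transaction complete and update balances
      await PendingTxModel.updateOne({ _id: tx._id }, { status: "confirmed" });
    }
  }
}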
Not handling errors and restarts:
Network failures, node rate limits, or database errors can abruptly stop the crawler.
Solution: Implement retry mechanisms with delays (e.g., 5 seconds) when encountering errors, but avoid recursion or intervals that create excessive processes.
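One simple way to do this is a promise-based sleep plus a bounded retry helper, as sketched below (the 5-second delay mirrors the value above; withRetry is a hypothetical helper).

const RETRY_DELAY_MS = 5000;

// Promise-based sleep so a retry can simply be awaited inside the crawl loop
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(task: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await task();
    } catch (error) {
      if (i === attempts) throw error; // out of attempts, bubble the error up
      console.error(`Attempt ${i} failed, retrying in ${RETRY_DELAY_MS} ms`, error);
      await sleep(RETRY_DELAY_MS);
    }
  }
  throw new Error("unreachable"); // keeps TypeScript's return-type check happy
}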
Not optimizing database performance:
Executing too many individual queries instead of bulk operations slows down synchronization.
Solution: Use bulk write/update to process multiple transactions at once.
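Since the crawler already uses mongoose, a whole batch of decoded events can be upserted in one round trip. EventModel and the event field names here are illustrative.

// One bulkWrite instead of N individual updates for a batch of decoded events
await EventModel.bulkWrite(
  events.map((event) => ({
    updateOne: {
      filter: { txHash: event.txHash, logIndex: event.logIndex }, // unique per log
      update: { $set: event },
      upsert: true, // idempotent: re-crawling the same range creates no duplicates
    },
  }))
);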
Lack of flexible configuration:
Hardcoding parameters like batch size, RPC URL, or confirmations makes it difficult to extend the crawler to other chains.
Solution: Create separate configuration files for each chain.
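For example, each supported chain can be described by a small config object; the values below are illustrative defaults, not the repository's actual configuration.

interface ChainConfig {
  chainId: number;
  name: string;
  rpcUrl: string;
  contractAddress: string;
  deployBlock: number;   // where crawling starts
  batchSize: number;     // blocks per getLogs call
  confirmations: number; // confirmations required before finalizing
}

export const CHAINS: Record<number, ChainConfig> = {
  1: {
    chainId: 1,
    name: "Ethereum Mainnet",
    rpcUrl: process.env.ETH_RPC_URL ?? "",
    contractAddress: "0x0000000000000000000000000000000000000000", // your contract
    deployBlock: 12_000_000,
    batchSize: 1000,
    confirmations: 12,
  },
};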
Using recursion or setInterval for polling:
Unbounded recursion (the crawl function calling itself directly with no delay) can overload the system and degrade performance.
Using setInterval for polling can create multiple parallel processes, leading to system overload and inefficiency.
Solution: Schedule the next run with setTimeout only after the current run has finished:

setTimeout(() => this.crawl(), this.RESTART_DELAY);
3. Understanding the Crawler
I will break down the key parts of the Crawler class that I built and explain how it works.
Self-schedule with setTimeout so the next crawl starts only after the current one finishes.
Ensure proper control to prevent overlapping processes.
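Putting those two points together, here is a simplified sketch of what the crawl loop could look like. The real implementation is in the repository; this.provider, this.getLastCrawledBlock, and the helpers detectReorg, crawlRange, and confirmPendingTransactions from the earlier examples are assumptions made for the sketch.

private isCrawling = false; // guard so two runs never overlap

private async crawl(): Promise<void> {
  if (this.isCrawling) return;
  this.isCrawling = true;
  try {
    const latestBlock = await this.provider.getBlockNumber();
    const lastCrawled = await this.getLastCrawledBlock(); // checkpoint stored in MongoDB

    if (await detectReorg(this.provider, lastCrawled)) {
      console.log(`Reorg detected on ${this.chain.name}, rolled back and will re-crawl`);
    } else if (lastCrawled < latestBlock) {
      await crawlRange(this.provider, lastCrawled + 1, latestBlock);
      await confirmPendingTransactions(this.provider);
    }
  } catch (error) {
    console.error(`❌ Crawl error for ${this.chain.name}:`, error);
  } finally {
    this.isCrawling = false;
    // Schedule the next run only after this one has completely finished
    setTimeout(() => this.crawl(), this.RESTART_DELAY);
  }
}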
Start the Crawler
public async start(): Promise<void> {
  try {
    await mongoose.connect(String(process.env.DATABASE_URI));
    console.log(`✅ Connected to MongoDB for ${this.chain.name}`);
    await this.crawl();
  } catch (error) {
    console.error(`❌ Failed to start crawler for ${this.chain.name}:`, error);
  }
}
To run the crawler, import the class, create an instance with a chain ID, and call start():

import { Crawler } from "./crawler.js";

const crawler = new Crawler(1); // Chain ID 1 for Ethereum Mainnet
crawler.start();
Conclusion
This article has provided a detailed explanation of how the crawler works, along with general principles and common pitfalls to avoid. You can extend this crawler by adding multi-chain support and optimizing batch size.
To view the full source code and the latest updates, visit the GitHub repository here: https://github.com/chinhvuong/evm-crawler. If you have any improvement ideas or need further assistance, feel free to contribute or reach out!