For a backend developer in the blockchain space, building a crawler is one of the most exciting tasks. In this article, we will explore how to build an EVM crawler that synchronizes on-chain data from the blockchain with a database. This tool is essential for decentralized applications (dApps), especially when handling transaction data or events from smart contracts.

1. General Principles of a Crawler

An EVM crawler acts as a "continuous listener" on the blockchain, collecting data from blocks and smart contract events, then storing or processing that data in a local system (typically a database). The general principles include:

  1. Connecting to the blockchain: Using a node provider (such as Infura or Alchemy) to access blockchain data via JSON-RPC (see the sketch after this list).
  2. Determining the starting point: Deciding the starting block (typically the contract deployment block) and the ending block (usually the latest block).
  3. Batch data collection: To avoid overload, data is collected in groups of blocks (batch size).
  4. Processing events: Crawling event logs from smart contracts (e.g., Deposit, Withdrawal) and extracting necessary information.
  5. Syncing with the database: Storing data in databases like MongoDB or PostgreSQL while updating processing status (e.g., the last crawled block).
  6. Handling reorgs: Since blockchain can experience forks or reorgs, the crawler must detect and handle them to ensure data accuracy.
  7. Confirming transactions: Ensuring transactions reach the required number of confirmations before marking them as complete.
  8. Continuous execution: The crawler runs continuously or periodically to update the latest data while avoiding parallel execution of multiple processes.
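
To make the first few principles concrete, here is a minimal, self-contained sketch of connecting through a JSON-RPC provider and walking the chain in batches. The RPC URL, start block, and batch size are placeholder values; the full crawler in section 3 replaces the console.log with real event processing.

// Minimal sketch of principles 1-3: connect, choose a range, crawl in batches.
// The RPC URL, start block, and batch size below are placeholders.
import { ethers } from "ethers";

const provider = new ethers.JsonRpcProvider(
  "https://mainnet.infura.io/v3/YOUR_INFURA_KEY"
);

async function crawlInBatches(startBlock: number, batchSize: number) {
  const latestBlock = await provider.getBlockNumber();

  for (
    let fromBlock = startBlock;
    fromBlock <= latestBlock;
    fromBlock += batchSize + 1
  ) {
    const toBlock = Math.min(fromBlock + batchSize, latestBlock);
    // Principle 4 would go here: query contract event logs for this range
    // and persist them before moving on to the next batch.
    console.log(`Processing blocks ${fromBlock}-${toBlock}`);
  }
}

crawlInBatches(12345678, 1000).catch(console.error);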

2. Common Mistakes When Building a Crawler

When developing an EVM crawler, here are some common mistakes to avoid:

  1. Not handling reorgs:

    • Blockchain history can change due to forks. If reorgs are not checked, the database may become out of sync with the actual blockchain state.
    • What is a chain reorg?: A reorganization (reorg) occurs when a temporary branch of the chain (usually the result of a fork) is replaced by a longer branch that the network accepts as canonical. For example, one miner might produce block A at height 100 while another produces block B at the same height, and B's branch gets extended faster. In that case, block A and its transactions are discarded in favor of block B. If the crawler had stored data from block A without detecting the reorg, the database would no longer reflect the correct on-chain state.
    • Solution: Store hashes of recent blocks and compare them with the blockchain to detect changes. If a block hash changes, roll back data from that block and re-crawl.
  2. Fetching too much data at once:

    • Retrieving all data from block 0 to the latest block in one go can overload the node or exceed API limits.
    • Solution: Use a reasonable batch size (e.g., 1000 blocks per fetch) and crawl progressively. Only crawl from the contract deployment block.
  3. Not checking confirmations:

    • A transaction can be dropped or reverted if its block is reorged out before it reaches enough confirmations. Updating the database immediately can lead to incorrect data, especially for balance-related updates.
    • Solution: Wait for sufficient confirmations (usually between 6 and 12) before confirming a transaction and updating balances.
  4. Not handling errors and restarts:

    • Network failures, node rate limits, or database errors can abruptly stop the crawler.
    • Solution: Implement retry mechanisms with delays (e.g., 5 seconds) when encountering errors, but avoid recursion or intervals that create excessive processes.
  5. Not optimizing database performance:

    • Executing too many individual queries instead of bulk operations slows down synchronization.
    • Solution: Use bulk write/update to process multiple transactions at once.
  6. Lack of flexible configuration:

    • Hardcoding parameters like batch size, RPC URL, or confirmations makes it difficult to extend the crawler to other chains.
    • Solution: Create separate configuration files for each chain.
  7. Using recursion or setInterval for polling:

    • Naive recursion (calling crawl() again before the previous call has finished) can pile up calls and degrade performance.
    • setInterval fires on a fixed schedule regardless of whether the previous run has finished, so slow runs overlap, creating multiple parallel processes and overloading the system.
    • Solution: Schedule the next run only after the current one completes:
      setTimeout(() => this.crawl(), this.RESTART_DELAY);

3. Understanding the Crawler

I will break down the key parts of the Crawler class that I built and explain how it works.

3.1 Detailed Analysis

Constructor

export class Crawler {
  private chain: Chain;
  private provider: ethers.JsonRpcProvider;
  private contract: ethers.Contract;
  private lastProcessedBlock: number;
  private blockHashes: Map<number, string>;
  private readonly BATCH_SIZE: number;
  private readonly REORG_DEPTH: number;
  private readonly MIN_CONFIRMATIONS: number;
  private readonly RESTART_DELAY: number;

  constructor(chainId: ChainId) {
    if (!chains[chainId]) {
      throw new Error(`Chain ${chainId} not configured`);
    }

    this.chain = chains[chainId];
    this.provider = new ethers.JsonRpcProvider(this.chain.rpcUrl);
    this.contract = new ethers.Contract(
      this.chain.contracts.dex.address,
      this.chain.contracts.dex.abi,
      this.provider
    );
    this.lastProcessedBlock = 0;
    this.blockHashes = new Map();

    // Chain-specific configs
    this.BATCH_SIZE = this.chain.config.batchSize;
    this.REORG_DEPTH = this.chain.config.reorgDepth;
    this.MIN_CONFIRMATIONS = this.chain.config.minConfirmations;
    this.RESTART_DELAY = this.chain.config.restartDelay;
  }
  • Functionality: Initialize the crawler with specific chain information (RPC URL, DEX contract).

  • Key Parameters:

    • BATCH_SIZE: Number of blocks processed per batch.
    • REORG_DEPTH: Depth for reorg detection.
    • MIN_CONFIRMATIONS: Minimum confirmations required for transactions.
    • RESTART_DELAY: Delay before restarting the crawl.
  • Note: blockHashes (Map) stores recent block hashes to detect reorgs.

Load and Update Last Processed Block

private async loadLastProcessedBlock(): Promise<number> {
  const blockInfo = await Block.findOne({ chain: this.chain.name });
  if (blockInfo) {
    return blockInfo.lastProcessedBlock;
  }

  // If no block info exists, create initial entry
  const startBlock = this.chain.contracts.dex.startBlock;
  await Block.create({
    chain: this.chain.name,
    lastProcessedBlock: startBlock,
    updatedAt: new Date(),
  });
  return startBlock;
}

private async updateLastProcessedBlock(blockNumber: number): Promise<void> {
  await Block.findOneAndUpdate(
    { chain: this.chain.name },
    {
      lastProcessedBlock: blockNumber,
      updatedAt: new Date(),
    },
    { upsert: true }
  );
  this.lastProcessedBlock = blockNumber;
}

Functions

  • loadLastProcessedBlock: Retrieve the last processed block from the database or initialize with the contract's starting block if none exists.
  • updateLastProcessedBlock: Update the last processed block in the database.

Benefits

Ensures the crawler continues from the last stopping point, avoiding unnecessary duplicate work.

Checking for Reorg

private async getBlockHash(blockNumber: number): Promise<string | null> {
  try {
    const block = await this.provider.getBlock(blockNumber);
    return block?.hash ?? null;
  } catch (err) {
    console.error(`❌ Block ${blockNumber} fetch failed:`, err);
    return null;
  }
}

private async checkForReorg(latestBlock: number): Promise<number | null> {
  if (this.lastProcessedBlock < latestBlock - this.REORG_DEPTH) {
    return null;
  }

  const checks: Promise<number | null>[] = [];
  for (let i = 1; i <= this.REORG_DEPTH; i++) {
    checks.push(
      this.getBlockHash(this.lastProcessedBlock - i).then((currentHash) => {
        const storedHash = this.blockHashes.get(this.lastProcessedBlock - i);
        if (storedHash && currentHash && storedHash !== currentHash) {
          console.warn(
            `⚠️ Reorg detected at block ${this.lastProcessedBlock - i}`
          );
          return this.lastProcessedBlock - i;
        }
        return null;
      })
    );
  }

  const results = await Promise.all(checks);
  return results.find((block) => block !== null) ?? null;
}

Functionality

  • Detect Reorg: Check if a reorg has occurred by comparing the stored block hash with the current hash on the chain.

How It Works

  • If any of the last REORG_DEPTH stored hashes differs from the current on-chain hash, return that block number as the point to roll back to.

Handling Chain Reorg

private async handleReorg(reorgBlock: number): Promise<void> {
  await Promise.all([
    Transaction.deleteMany({
      chain: this.chain.name,
      blockNumber: { $gte: reorgBlock },
    }),
    Block.findOneAndUpdate(
      { chain: this.chain.name },
      { lastProcessedBlock: reorgBlock }
    ),
  ]);
}

Functionality

  • Handle Reorg: Delete transaction data from the block where the reorg occurred onward and update lastProcessedBlock.

Note

  • Ensure database data matches the blockchain after a reorg.

Event Collection and Storage

private async fetchAndStoreEvents(
  fromBlock: number,
  toBlock: number,
  latestBlock: number
): Promise<void> {
  try {
    console.log(`📡 Fetching events: blocks ${fromBlock}-${toBlock}`);
    const reorgThreshold = latestBlock - this.REORG_DEPTH;

    const [depositEvents, withdrawEvents] = await Promise.all([
      this.contract.queryFilter("Deposit", fromBlock, toBlock) as Promise<
        ethers.EventLog[]
      >,
      this.contract.queryFilter("Withdrawal", fromBlock, toBlock) as Promise<
        ethers.EventLog[]
      >,
    ]);

    // Get block hashes for all blocks in the batch to detect reorgs later
    const blockHashesArray = await Promise.all(
      Array.from({ length: toBlock - fromBlock + 1 }, (_, i) => {
        const blockNumber = fromBlock + i;
        return blockNumber > reorgThreshold
          ? this.getBlockHash(blockNumber)
          : null;
      })
    );

    const transactions: ITransaction[] = [];
    for (const log of [...depositEvents, ...withdrawEvents]) {
      const { transactionHash, blockNumber } = log;
      const timestamp = new Date();

      if (
        log.fragment.name === "Deposit" ||
        log.fragment.name === "Withdrawal"
      ) {
        const user = log.args[0];
        const amount = ethers.formatUnits(log.args[1], 18);
        const isDeposit = log.fragment.name === "Deposit";

        transactions.push({
          chain: this.chain.name,
          transactionHash,
          blockNumber,
          user,
          amount,
          type: isDeposit ? "deposit" : "withdraw",
          minConfirmationsRequired: this.MIN_CONFIRMATIONS,
          currentConfirmations: 0,
          isConfirmed: false,
          timestamp,
        } as ITransaction);
      }
    }

    if (transactions.length > 0) {
      await Transaction.insertMany(transactions, { ordered: false });
    }
    console.log(`✅ Saved ${transactions.length} transactions`);

    // Store only relevant block hashes
    blockHashesArray.forEach((hash, i) => {
      const blockNumber = fromBlock + i;
      if (hash) this.blockHashes.set(blockNumber, hash);
    });

    // Trim stored block hashes
    [...this.blockHashes.keys()].forEach((block) => {
      if (block < reorgThreshold) this.blockHashes.delete(block);
    });
  } catch (error) {
    console.error(`❌ Fetch error [${fromBlock}-${toBlock}]:`, error);
    throw error;
  }
}
  • Functionality: Retrieve Deposit and Withdrawal events from the contract within a specified block range, then store them in the database.
  • Note: Use insertMany to optimize performance when saving multiple transactions.

Transaction Confirmation

private async processConfirmations(latestBlock: number): Promise<void> {
  const pendingTransactions = await Transaction.find({
    chain: this.chain.name,
    isConfirmed: false,
  });

  // Keep transaction-status updates and user-balance updates in separate
  // batches so each is written to its own collection.
  const transactionOps: any[] = [];
  const balanceOps: any[] = [];

  for (const tx of pendingTransactions) {
    const confirmations = latestBlock - tx.blockNumber;
    if (confirmations >= tx.minConfirmationsRequired) {
      transactionOps.push({
        updateOne: {
          filter: { _id: tx._id, transactionHash: tx.transactionHash },
          update: {
            $set: {
              isConfirmed: true,
              currentConfirmations: tx.minConfirmationsRequired,
            },
          },
        },
      });

      if (tx.type === "deposit") {
        balanceOps.push({
          updateOne: {
            filter: { chain: this.chain.name, user: tx.user },
            update: { $inc: { balance: parseFloat(tx.amount) } },
            upsert: true,
          },
        });
      } else if (tx.type === "withdraw") {
        balanceOps.push({
          updateOne: {
            filter: { chain: this.chain.name, user: tx.user },
            update: {
              $pull: { pendingWithdraws: { hash: tx.transactionHash } },
            },
          },
        });
      }
    }
  }

  if (transactionOps.length > 0) {
    await Promise.all([
      Transaction.bulkWrite(transactionOps),
      balanceOps.length > 0 ? UserBalance.bulkWrite(balanceOps) : Promise.resolve(),
    ]);
    console.log(`✅ Confirmed ${transactionOps.length} transactions`);
  }
}
  • Functionality: Check the number of confirmations for pending transactions and, once the requirement is met, update their status and the user's balance.
  • Note: Use bulkWrite to optimize database performance.

Main function crawl()

private async crawl(): Promise<void> {
  try {
    const latestBlock = await this.provider.getBlockNumber();
    let fromBlock = await this.loadLastProcessedBlock();

    const reorgBlock = await this.checkForReorg(latestBlock);
    if (reorgBlock) {
      await this.handleReorg(reorgBlock);
      fromBlock = reorgBlock;
    }

    // Compute the first batch boundary after any reorg rollback
    let toBlock = Math.min(fromBlock + this.BATCH_SIZE, latestBlock);
    while (fromBlock < latestBlock) {
      await this.fetchAndStoreEvents(fromBlock, toBlock, latestBlock);
      await this.updateLastProcessedBlock(toBlock);
      fromBlock = toBlock + 1;
      toBlock = Math.min(fromBlock + this.BATCH_SIZE, latestBlock);
    }

    await this.processConfirmations(latestBlock);
    setTimeout(() => this.crawl(), this.RESTART_DELAY);
  } catch (error) {
    console.error("❌ Critical crawler error:", error);
    setTimeout(() => this.crawl(), 5000);
  }
}

Functionality

  • Coordinate the entire crawling process:
    • Fetch the latest block.
    • Check and handle reorgs if any.
    • Collect data in batches.
    • Confirm transactions.
    • Repeat after RESTART_DELAY using setTimeout.

Note

  • Uses recursion with setTimeout for continuous crawling: the next pass is scheduled only after the current one finishes.
  • Ensure proper control so that passes never overlap; a self-contained sketch of this pattern follows.
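
To illustrate why this pattern avoids overlap, here is a minimal sketch in which doOneCrawlPass and the delay values are placeholders (not part of the Crawler class): the next pass is scheduled only after the current one has finished, whereas setInterval would fire on a fixed schedule regardless of whether the previous pass is still running.

// Sketch of a non-overlapping poll loop. doOneCrawlPass stands in for the
// real crawl logic (fetch latest block, handle reorgs, store events, confirm).
const RESTART_DELAY = 5_000;
const ERROR_DELAY = 5_000;

async function doOneCrawlPass(): Promise<void> {
  // placeholder for the real work
}

async function pollLoop(): Promise<void> {
  try {
    await doOneCrawlPass();
    // Reschedule only after the pass above has completely finished,
    // so two passes can never run at the same time.
    setTimeout(() => void pollLoop(), RESTART_DELAY);
  } catch (error) {
    console.error("❌ Crawl pass failed:", error);
    setTimeout(() => void pollLoop(), ERROR_DELAY);
  }
}

void pollLoop();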

Start the Crawler

public async start(): Promise<void> {
  try {
    await mongoose.connect(String(process.env.DATABASE_URI));
    console.log(`✅ Connected to MongoDB for ${this.chain.name}`);
    await this.crawl();
  } catch (error) {
    console.error(
      `❌ Failed to start crawler for ${this.chain.name}:`,
      error
    );
  }
}

Functionality

  • Connect to MongoDB and start crawling.

How to Use the Crawler

Configure the Chain

Create the file config/chains.ts:

export const chains = {
  // Ethereum Mainnet
  1: {
    name: "ethereum",
    rpcUrl: "https://mainnet.infura.io/v3/YOUR_INFURA_KEY",
    contracts: {
      dex: {
        address: "0xYourContractAddress",
        abi: [
          "event Deposit(address indexed user, uint256 amount)",
          "event Withdrawal(address indexed user, uint256 amount)",
        ],
        startBlock: 12345678,
      },
    },
    config: {
      batchSize: 1000,
      reorgDepth: 10,
      minConfirmations: 6,
      restartDelay: 5000,
    },
  },
};

// Assumed helper type (not shown in the original snippet): chain IDs are the keys of the config above.
export type ChainId = keyof typeof chains;

MongoDB Schema

Create the file models/index.ts:

import mongoose, { Document, Schema } from "mongoose";
import { ChainId } from "../config/chains";

export interface ITransaction extends Document {
  chain: ChainId;
  transactionHash: string;
  blockNumber: number;
  user: string;
  amount: string;
  type: "deposit" | "withdraw";
  minConfirmationsRequired: number;
  currentConfirmations: number;
  isConfirmed: boolean;
  timestamp: Date;
}

export interface IPendingWithdraw {
  hash: string;
  amount: number;
}

export interface IUserBalance extends Document {
  chain: ChainId;
  user: string;
  balance: number;
  pendingWithdraws: IPendingWithdraw[];
}

export interface IBlock extends Document {
  chain: ChainId;
  lastProcessedBlock: number;
  updatedAt: Date;
}

const transactionSchema = new Schema<ITransaction>({
  chain: { type: String, required: true, index: true },
  transactionHash: { type: String, required: true },
  blockNumber: { type: Number, required: true, index: true },
  user: { type: String, required: true },
  amount: { type: String, required: true },
  type: { type: String, enum: ["deposit", "withdraw"], required: true },
  minConfirmationsRequired: { type: Number, required: true },
  currentConfirmations: { type: Number, required: true },
  isConfirmed: { type: Boolean, required: true },
  timestamp: { type: Date, required: true },
});

const userBalanceSchema = new Schema<IUserBalance>({
  chain: { type: String, required: true, index: true },
  user: { type: String, required: true, index: true },
  balance: { type: Number, required: true, default: 0 },
  pendingWithdraws: [
    {
      hash: { type: String, required: true },
      amount: { type: Number, required: true },
    },
  ],
});

const blockSchema = new Schema<IBlock>({
  chain: { type: String, required: true, unique: true },
  lastProcessedBlock: { type: Number, required: true },
  updatedAt: { type: Date, required: true, default: Date.now },
});

export const Transaction = mongoose.model<ITransaction>(
  "Transaction",
  transactionSchema
);
export const UserBalance = mongoose.model<IUserBalance>(
  "UserBalance",
  userBalanceSchema
);
export const Block = mongoose.model<IBlock>("Block", blockSchema);

Run crawler

import { Crawler } from "./crawler.js";

const crawler = new Crawler(1); // Chain ID 1 for Ethereum Mainnet
crawler.start();

Conclusion

This article has provided a detailed explanation of how the crawler works, along with its general principles and the common pitfalls to avoid. You can extend this crawler by adding multi-chain support and tuning the batch size; a small multi-chain launcher is sketched below.
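
As an example of the multi-chain extension, a hypothetical launcher could create one Crawler instance per configured chain. The chain IDs below are assumptions for illustration; each must have an entry in config/chains.ts.

// Hypothetical multi-chain launcher: one Crawler instance per configured chain.
// The chain IDs are examples; each must exist in config/chains.ts.
import { Crawler } from "./crawler.js";

const chainIds = [1, 56] as const; // e.g. Ethereum Mainnet and BNB Smart Chain

for (const chainId of chainIds) {
  new Crawler(chainId)
    .start()
    .catch((error) =>
      console.error(`Crawler for chain ${chainId} failed to start:`, error)
    );
}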

To view the full source code and the latest updates, visit the GitHub repository here: https://github.com/chinhvuong/evm-crawler. If you have any improvement ideas or need further assistance, feel free to contribute or reach out!