As a backend developer in the blockchain space, building a crawler is one of the most exciting tasks. In this article, we will explore how to build an EVM crawler to synchronize data from the blockchain (onchain) with a database. This tool is essential for decentralized applications (dApps), especially when handling transaction data or events from smart contracts.
1. General Principles of a Crawler
An EVM crawler acts as a "continuous listener" on the blockchain, collecting data from blocks and smart contract events, then storing or processing that data in a local system (typically a database). The general principles include:
Connecting to the blockchain: Using a node provider (such as Infura or Alchemy) to access blockchain data via JSON-RPC (see the sketch after this list).
Determining the starting point: Deciding the starting block (typically the contract deployment block) and the ending block (usually the latest block).
Batch data collection: To avoid overload, data is collected in groups of blocks (batch size).
Processing events: Crawling event logs from smart contracts (e.g., Deposit, Withdrawal) and extracting necessary information.
Syncing with the database: Storing data in databases like MongoDB or PostgreSQL while updating processing status (e.g., the last crawled block).
Handling reorgs: Since the blockchain can experience forks or reorgs, the crawler must detect and handle them to ensure data accuracy.
Confirming transactions: Ensuring transactions reach the required number of confirmations before marking them as complete.
Continuous execution: The crawler runs continuously or on a schedule to pick up the latest data, while making sure multiple crawl jobs never run in parallel.
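To make the first two principles concrete, here is a minimal sketch of connecting to a provider and determining the block range. It assumes ethers v6 and an ESM setup with top-level await; RPC_URL and DEPLOY_BLOCK are placeholders.

import { ethers } from "ethers";

// Connect through a node provider (Infura, Alchemy, or any JSON-RPC endpoint)
const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);

// Starting point: the block the contract was deployed in (placeholder value)
const DEPLOY_BLOCK = 12_000_000;

// Ending point: the latest block the node currently knows about
const latestBlock = await provider.getBlockNumber();

console.log(`Crawling from block ${DEPLOY_BLOCK} to ${latestBlock}`);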
2. Common Mistakes When Building a Crawler
When developing an EVM crawler, here are some common mistakes to avoid:
Not handling reorgs:
Blockchain history can change due to forks. If reorgs are not checked, the database may become out of sync with the actual blockchain state.
What is a chain reorg?: A reorganization (reorg) occurs when a temporarily accepted chain of blocks (usually the result of a fork) is replaced by a longer chain, or by the chain the network settles on as canonical. For example, one miner might produce block A at height 100 while another produces block B at the same height, and the chain containing B grows faster. Block A and its transactions are then discarded in favor of block B. If the crawler had already stored data from block A without detecting the reorg, the database would no longer reflect the correct onchain state.
Solution: Store hashes of recent blocks and compare them with the blockchain to detect changes. If a block hash changes, roll back data from that block and re-crawl.
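As an illustration, a reorg check could look like the sketch below. It assumes ethers v6 plus two hypothetical mongoose models, BlockModel (stores the hash of every crawled block) and EventModel (stores decoded events); neither is part of the original code.

async function detectReorg(provider: ethers.JsonRpcProvider, lastCrawledBlock: number): Promise<boolean> {
  const stored = await BlockModel.findOne({ number: lastCrawledBlock });
  if (!stored) return false; // nothing crawled yet

  const onchain = await provider.getBlock(lastCrawledBlock);
  if (onchain && onchain.hash === stored.hash) return false; // hashes match, no reorg

  // The block was replaced: roll back everything from this height and re-crawl it
  await EventModel.deleteMany({ blockNumber: { $gte: lastCrawledBlock } });
  await BlockModel.deleteMany({ number: { $gte: lastCrawledBlock } });
  return true;
}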
Fetching too much data at once:
Retrieving all data from block 0 to the latest block in one go can overload the node or exceed API limits.
Solution: Use a reasonable batch size (e.g., 1000 blocks per fetch) and crawl progressively. Only crawl from the contract deployment block.
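A sketch of that batching loop, still assuming ethers v6; BATCH_SIZE, CONTRACT_ADDRESS, saveLogs, and updateLastCrawledBlock are illustrative names rather than the repository's actual ones.

const BATCH_SIZE = 1000;

async function crawlRange(provider: ethers.JsonRpcProvider, fromBlock: number, latestBlock: number) {
  for (let start = fromBlock; start <= latestBlock; start += BATCH_SIZE) {
    const end = Math.min(start + BATCH_SIZE - 1, latestBlock);

    // Fetch only this contract's logs for the current batch of blocks
    const logs = await provider.getLogs({
      address: CONTRACT_ADDRESS,
      fromBlock: start,
      toBlock: end,
    });

    await saveLogs(logs);              // persist events (see the bulk write example below)
    await updateLastCrawledBlock(end); // checkpoint so a restart resumes from here
  }
}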
Not checking confirmations:
A transaction that has not yet reached enough confirmations can still be dropped or replaced by a reorg. Updating the database immediately can therefore lead to incorrect data, especially for balance-related updates.
Solution: Wait for sufficient confirmations (usually between 6 and 12) before confirming a transaction and updating balances.
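For example, pending transactions can be finalized only once they are deep enough in the chain. PendingTxModel and REQUIRED_CONFIRMATIONS below are assumptions for the sketch, not names from the repository.

const REQUIRED_CONFIRMATIONS = 12;

async function confirmPendingTransactions(provider: ethers.JsonRpcProvider) {
  const latestBlock = await provider.getBlockNumber();
  const pending = await PendingTxModel.find({ status: "pending" });

  for (const tx of pending) {
    const confirmations = latestBlock - tx.blockNumber + 1;
    if (confirmations >= REQUIRED_CONFIRMATIONS) {
      // Only now is it safe to mark the transaction complete and update balances
      await PendingTxModel.updateOne({ _id: tx._id }, { status: "confirmed" });
    }
  }
}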
Not handling errors and restarts:
Network failures, node rate limits, or database errors can abruptly stop the crawler.
Solution: Implement retry mechanisms with delays (e.g., 5 seconds) when encountering errors, but avoid recursion or intervals that create excessive processes.
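One simple way to do this is a promise-based sleep plus a bounded retry helper, as sketched below (the 5-second delay mirrors the value above; withRetry is a hypothetical helper).

const RETRY_DELAY_MS = 5000;

// Promise-based sleep so a retry can simply be awaited inside the crawl loop
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(task: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await task();
    } catch (error) {
      if (i === attempts) throw error; // out of attempts, bubble the error up
      console.error(`Attempt ${i} failed, retrying in ${RETRY_DELAY_MS} ms`, error);
      await sleep(RETRY_DELAY_MS);
    }
  }
  throw new Error("unreachable"); // keeps TypeScript's return-type check happy
}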
Not optimizing database performance:
Executing too many individual queries instead of bulk operations slows down synchronization.
Solution: Use bulk write/update to process multiple transactions at once.
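Since the crawler already uses mongoose, a whole batch of decoded events can be upserted in one round trip. EventModel and the event field names here are illustrative.

// One bulkWrite instead of N individual updates for a batch of decoded events
await EventModel.bulkWrite(
  events.map((event) => ({
    updateOne: {
      filter: { txHash: event.txHash, logIndex: event.logIndex }, // unique per log
      update: { $set: event },
      upsert: true, // idempotent: re-crawling the same range creates no duplicates
    },
  }))
);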
Lack of flexible configuration:
Hardcoding parameters like batch size, RPC URL, or confirmations makes it difficult to extend the crawler to other chains.
Solution: Create separate configuration files for each chain.
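For example, each supported chain can be described by a small config object; the values below are illustrative defaults, not the repository's actual configuration.

interface ChainConfig {
  chainId: number;
  name: string;
  rpcUrl: string;
  contractAddress: string;
  deployBlock: number;   // where crawling starts
  batchSize: number;     // blocks per getLogs call
  confirmations: number; // confirmations required before finalizing
}

export const CHAINS: Record<number, ChainConfig> = {
  1: {
    chainId: 1,
    name: "Ethereum Mainnet",
    rpcUrl: process.env.ETH_RPC_URL ?? "",
    contractAddress: "0x0000000000000000000000000000000000000000", // your contract
    deployBlock: 12_000_000,
    batchSize: 1000,
    confirmations: 12,
  },
};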
Using recursion or setInterval for polling:
Unbounded recursion (the crawl function calling itself directly with no delay) can overload the system and degrade performance.
Using setInterval for polling can create multiple parallel processes, leading to system overload and inefficiency.
Solution: Schedule the next run with setTimeout only after the current run has finished:

setTimeout(() => this.crawl(), this.RESTART_DELAY);
3. Understanding the Crawler
I will break down the key parts of the Crawler class that I built and explain how it works.
Self-schedule with setTimeout so the next crawl starts only after the current one finishes.
Ensure proper control to prevent overlapping processes.
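Putting those two points together, here is a simplified sketch of what the crawl loop could look like. The real implementation is in the repository; this.provider, this.getLastCrawledBlock, and the helpers detectReorg, crawlRange, and confirmPendingTransactions from the earlier examples are assumptions made for the sketch.

private isCrawling = false; // guard so two runs never overlap

private async crawl(): Promise<void> {
  if (this.isCrawling) return;
  this.isCrawling = true;
  try {
    const latestBlock = await this.provider.getBlockNumber();
    const lastCrawled = await this.getLastCrawledBlock(); // checkpoint stored in MongoDB

    if (await detectReorg(this.provider, lastCrawled)) {
      console.log(`Reorg detected on ${this.chain.name}, rolled back and will re-crawl`);
    } else if (lastCrawled < latestBlock) {
      await crawlRange(this.provider, lastCrawled + 1, latestBlock);
      await confirmPendingTransactions(this.provider);
    }
  } catch (error) {
    console.error(`❌ Crawl error for ${this.chain.name}:`, error);
  } finally {
    this.isCrawling = false;
    // Schedule the next run only after this one has completely finished
    setTimeout(() => this.crawl(), this.RESTART_DELAY);
  }
}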
Start the Crawler
public async start(): Promise<void> {
  try {
    await mongoose.connect(String(process.env.DATABASE_URI));
    console.log(`✅ Connected to MongoDB for ${this.chain.name}`);
    await this.crawl();
  } catch (error) {
    console.error(`❌ Failed to start crawler for ${this.chain.name}:`, error);
  }
}
To run the crawler, import the class, create an instance with a chain ID, and call start():

import { Crawler } from "./crawler.js";

const crawler = new Crawler(1); // Chain ID 1 for Ethereum Mainnet
crawler.start();
Conclusion
This article has provided a detailed explanation of how the crawler works, along with general principles and common pitfalls to avoid. You can extend this crawler by adding multi-chain support and optimizing batch size.
To view the full source code and the latest updates, visit the GitHub repository here: https://github.com/chinhvuong/evm-crawler. If you have any improvement ideas or need further assistance, feel free to contribute or reach out!