For a backend developer working in blockchain, building a crawler is probably one of the most interesting tasks. In this article, we will look at how to build an EVM crawler that synchronizes on-chain data with a database. A crawler like this is an important component of decentralized applications (dApps), especially when they need to process transaction data or events emitted by smart contracts.

1. General Principles of a Crawler

An EVM crawler acts as a "continuous listener" on the blockchain: it collects data from blocks and smart contract events, then stores or processes that data in a local system (usually a database). The general approach consists of the following steps:

Connect to the blockchain: use a node provider (such as Infura or Alchemy) to access blockchain data over JSON-RPC.
Determine the starting point: decide the start block (usually the block in which the contract was deployed) and the end block (usually the latest block).
Collect data in batches: to avoid overloading the node, data is fetched in groups of blocks (the batch size).
Process events: crawl the event logs of the smart contract (for example Deposit, Withdrawal) and extract the required information.
Sync with the database: store the data in a database such as MongoDB or PostgreSQL, and update the processing state (for example the last block crawled).
Check for reorgs: a blockchain can fork or reorganize, so the crawler must detect and handle this to keep the data correct.
Confirm transactions: make sure a transaction has enough confirmations before marking it as final.
Repeat continuously: the crawler runs continuously or periodically to pick up new data, but must avoid running several crawl runs in parallel.
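
The batching step above can be sketched as a small pure function. This is a minimal sketch rather than the crawler's actual code: `planBatches` and `BlockRange` are hypothetical names, and ranges are treated as inclusive on both ends, which is how `eth_getLogs` interprets `fromBlock` and `toBlock`.

```typescript
// Hypothetical helper (not part of the original crawler): split the span
// [startBlock, latestBlock] into inclusive, non-overlapping batches.
interface BlockRange {
  fromBlock: number;
  toBlock: number;
}

function planBatches(
  startBlock: number,
  latestBlock: number,
  batchSize: number
): BlockRange[] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const ranges: BlockRange[] = [];
  for (let from = startBlock; from <= latestBlock; from += batchSize) {
    // Each batch covers at most batchSize blocks and never passes latestBlock
    ranges.push({
      fromBlock: from,
      toBlock: Math.min(from + batchSize - 1, latestBlock),
    });
  }
  return ranges;
}
```

With a deployment block of 100, a latest block of 2350, and a batch size of 1000, this yields the ranges 100-1099, 1100-2099, and 2100-2350.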

2. Common Mistakes When Building a Crawler

When developing an EVM crawler, there are several common mistakes we should avoid:

Not handling reorgs:
- A blockchain can rewrite its recent block history when a fork occurs. Without a reorg check, the data in the database can drift out of sync with the actual chain.
- What is a chain reorg? A reorg (reorganization) happens when a temporary chain of blocks (usually the result of a fork) is replaced by another chain that is longer or that the network accepts as the canonical chain. For example, one miner may produce block A at height 100 while another miner produces block B at the same height, and the chain built on B is extended faster. Block A and the transactions in it are then dropped from the canonical chain and replaced by block B. If the crawler has already stored data from block A and does not detect the reorg, the database no longer reflects the on-chain state.
- Fix: store the hashes of recent blocks and compare them with the chain to detect changes. If the hash of a block has changed, roll back the data from that block onward and crawl it again.
Fetching too much data at once:
- Fetching everything from block 0 to the latest block in a single request can overload the node or exceed API limits.
- Fix: use a reasonable batch size (for example 1000 blocks per request), crawl incrementally, and start from the block in which the contract was deployed rather than from block 0.
Not checking confirmations:
- A transaction can still be undone by a reorg before it has enough confirmations. If the database is updated immediately, the data can end up wrong, which matters most for balance updates.
- Fix: wait for enough confirmations (commonly 6 to 12, depending on the chain) before marking a transaction as confirmed and updating balances.
Not handling errors and restarts:
- Network errors, node rate limits, or database errors can stop the crawler abruptly.
- Fix: add a retry mechanism with a delay (for example 5 seconds) when an error occurs, but avoid recursion or intervals so that multiple crawl runs are not started.
Not optimizing database performance:
- Issuing many single queries instead of bulk operations slows down synchronization.
- Fix: use bulk write/update operations to process many transactions at once.
Inflexible configuration:
- Hardcoding parameters such as the batch size, RPC URL, or confirmations makes the crawler hard to extend to other chains.
- Fix: create a separate configuration for each chain.
Using recursion or setInterval to schedule the next run:
- Calling the crawl function recursively keeps extending the chain of pending calls, which can overload the system and degrade performance.
- setInterval fires on a fixed schedule whether or not the previous run has finished, so runs can overlap, overload the system, and degrade performance.
- Fix: schedule the next run only after the current one has finished:
```
setTimeout(() => this.crawl(), this.RESTART_DELAY);
```
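
The retry advice above can be sketched with a plain loop instead of recursion. This is a minimal sketch under assumptions: `withRetry`, `Sleep`, and the attempt limit are hypothetical and not part of the original crawler, and the `sleep` parameter is injectable only so the helper can be exercised without real waiting.

```typescript
// Hypothetical retry helper: retries a failing async operation a bounded
// number of times, waiting delayMs between attempts.
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
  sleep: Sleep = realSleep
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final failure
      if (attempt < maxAttempts) {
        await sleep(delayMs);
      }
    }
  }
  throw lastError;
}
```

A crawler could wrap a single RPC call this way, for example `withRetry(() => provider.getBlockNumber(), 3, 5000)`, so that a transient rate-limit error does not abort the whole run.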

3. Explaining the `Crawler` Class

I will break down the main parts of the Crawler class I built and explain how each one works.

3.1 Detailed Analysis

Constructor

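import { ethers } from "ethers";
import mongoose from "mongoose";
// Import paths assume the config/ and models/ layout described later in this article
import { chains, Chain, ChainId } from "./config/chains";
import { Block, ITransaction, Transaction, UserBalance } from "./models";

export class Crawler {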
  private chain: Chain;
  private provider: ethers.JsonRpcProvider;
  private contract: ethers.Contract;
  private lastProcessedBlock: number;
  private blockHashes: Map<number, string>;
  private readonly BATCH_SIZE: number;
  private readonly REORG_DEPTH: number;
  private readonly MIN_CONFIRMATIONS: number;
  private readonly RESTART_DELAY: number;

  constructor(chainId: ChainId) {
    if (!chains[chainId]) {
      throw new Error(`Chain ${chainId} not configured`);
    }

    this.chain = chains[chainId];
    this.provider = new ethers.JsonRpcProvider(this.chain.rpcUrl);
    this.contract = new ethers.Contract(
      this.chain.contracts.dex.address,
      this.chain.contracts.dex.abi,
      this.provider
    );

    this.lastProcessedBlock = 0;
    this.blockHashes = new Map();

    // Chain-specific configs
    this.BATCH_SIZE = this.chain.config.batchSize;
    this.REORG_DEPTH = this.chain.config.reorgDepth;
    this.MIN_CONFIRMATIONS = this.chain.config.minConfirmations;
    this.RESTART_DELAY = this.chain.config.restartDelay;
  }

Purpose: initializes the crawler with the settings of a specific chain (RPC URL, DEX contract).
Key parameters:
- BATCH_SIZE: number of blocks processed per batch.
- REORG_DEPTH: how many recent blocks are checked for reorgs.
- MIN_CONFIRMATIONS: minimum number of confirmations required for a transaction.
- RESTART_DELAY: delay before the next crawl run, in milliseconds.
Note: blockHashes (a Map) stores the hashes of recent blocks so that reorgs can be detected.
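
Because every parameter above comes from configuration, it can help to validate the values before the crawler starts. The following is a minimal sketch, not the original implementation: `ChainSettings`, `CrawlerConfig`, and `resolveChain` are hypothetical names, and only the field names are taken from the constructor.

```typescript
// Hypothetical types mirroring the per-chain settings the constructor reads.
interface CrawlerConfig {
  batchSize: number;
  reorgDepth: number;
  minConfirmations: number;
  restartDelay: number;
}

interface ChainSettings {
  name: string;
  rpcUrl: string;
  config: CrawlerConfig;
}

// Look up a chain and fail fast on missing or invalid settings.
function resolveChain(
  chains: Record<number, ChainSettings | undefined>,
  chainId: number
): ChainSettings {
  const chain = chains[chainId];
  if (!chain) {
    throw new Error(`Chain ${chainId} not configured`);
  }
  const checks: Array<[string, number]> = [
    ["batchSize", chain.config.batchSize],
    ["reorgDepth", chain.config.reorgDepth],
    ["minConfirmations", chain.config.minConfirmations],
    ["restartDelay", chain.config.restartDelay],
  ];
  for (const [key, value] of checks) {
    // Every tuning parameter is expected to be a positive integer
    if (!Number.isInteger(value) || value <= 0) {
      throw new Error(`Invalid ${key} for chain ${chainId}: ${value}`);
    }
  }
  return chain;
}
```

Failing at startup is a design choice: a batch size of 0, for instance, would otherwise only show up later as a crawler that never makes progress.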

Loading and Updating the Last Processed Block

  private async loadLastProcessedBlock(): Promise<number> {
    const blockInfo = await Block.findOne({ chain: this.chain.name });
    if (blockInfo) {
      return blockInfo.lastProcessedBlock;
    }

    // If no block info exists, create initial entry
    const startBlock = this.chain.contracts.dex.startBlock;
    await Block.create({
      chain: this.chain.name,
      lastProcessedBlock: startBlock,
      updatedAt: new Date(),
    });
    return startBlock;
  }

  private async updateLastProcessedBlock(blockNumber: number): Promise<void> {
    await Block.findOneAndUpdate(
      { chain: this.chain.name },
      {
        lastProcessedBlock: blockNumber,
        updatedAt: new Date(),
      },
      { upsert: true }
    );
    this.lastProcessedBlock = blockNumber;
  }

Purpose:
- loadLastProcessedBlock: reads the last processed block from the database, or initializes it with the contract's start block if no entry exists yet.
- updateLastProcessedBlock: writes the last processed block to the database.
Benefit: the crawler resumes from where it stopped, which avoids repeating work unnecessarily.
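
The resume logic can be sketched without a database. This is a minimal sketch under one assumption that differs from the code above: a missing checkpoint means "nothing processed yet", whereas the original stores the deployment block itself as the first checkpoint. `InMemoryCheckpoints` and `nextBlockToCrawl` are hypothetical names.

```typescript
// Hypothetical in-memory stand-in for the Block collection.
class InMemoryCheckpoints {
  private readonly lastProcessed = new Map<string, number>();

  load(chain: string): number | undefined {
    return this.lastProcessed.get(chain);
  }

  save(chain: string, blockNumber: number): void {
    this.lastProcessed.set(chain, blockNumber);
  }
}

// Next block to crawl: the deployment block when nothing has been processed
// yet, otherwise one past the last fully processed block.
function nextBlockToCrawl(
  checkpoints: InMemoryCheckpoints,
  chain: string,
  startBlock: number
): number {
  const last = checkpoints.load(chain);
  return last === undefined ? startBlock : last + 1;
}
```

With this convention the deployment block is crawled exactly once, and a restart never fetches a block that was already stored.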

Checking for Reorgs

  private async getBlockHash(blockNumber: number): Promise<string | null> {
    try {
      const block = await this.provider.getBlock(blockNumber);
      return block?.hash ?? null;
    } catch (err) {
      console.error(`❌ Block ${blockNumber} fetch failed:`, err);
      return null;
    }
  }

  private async checkForReorg(latestBlock: number): Promise<number | null> {
    if (this.lastProcessedBlock < latestBlock - this.REORG_DEPTH) {
      return null;
    }

    const checks: Promise<number | null>[] = [];
    for (let i = 1; i <= this.REORG_DEPTH; i++) {
      checks.push(
        this.getBlockHash(this.lastProcessedBlock - i).then((currentHash) => {
          const storedHash = this.blockHashes.get(this.lastProcessedBlock - i);
          if (storedHash && currentHash && storedHash !== currentHash) {
            console.warn(
              `⚠️ Reorg detected at block ${this.lastProcessedBlock - i}`
            );
            return this.lastProcessedBlock - i;
          }
          return null;
        })
      );
    }

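    const results = await Promise.all(checks);
    // Roll back to the lowest block whose hash changed, not just the first match
    const changed = results.filter((block): block is number => block !== null);
    return changed.length > 0 ? Math.min(...changed) : null;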
  }

Purpose: checks whether a reorg has occurred by comparing the stored block hashes with the current hashes on the chain.
How it works: if a reorg is detected, the function returns the block to roll back to; otherwise it returns null.
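
The hash comparison itself can be isolated into a pure function, which makes it easy to reason about. This is a minimal sketch with a hypothetical name (`findReorgStart`); it assumes both maps are keyed by block number and returns the lowest height whose hash changed, since everything from that height upward needs to be crawled again.

```typescript
// Hypothetical pure helper: compare hashes recorded at crawl time with the
// hashes the node currently reports for the same heights.
function findReorgStart(
  storedHashes: ReadonlyMap<number, string>,
  currentHashes: ReadonlyMap<number, string>
): number | null {
  const changed: number[] = [];
  storedHashes.forEach((storedHash, blockNumber) => {
    const currentHash = currentHashes.get(blockNumber);
    // A height the node did not return is skipped, not treated as a reorg
    if (currentHash !== undefined && currentHash !== storedHash) {
      changed.push(blockNumber);
    }
  });
  // Everything from the lowest changed height upward must be re-crawled
  return changed.length > 0 ? Math.min(...changed) : null;
}
```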

Handling a Chain Reorg

  private async handleReorg(reorgBlock: number): Promise<void> {
    await Promise.all([
      Transaction.deleteMany({
        chain: this.chain.name,
        blockNumber: { $gte: reorgBlock },
      }),
      Block.findOneAndUpdate(
        { chain: this.chain.name },
        { lastProcessedBlock: reorgBlock }
      ),
    ]);
  }

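Purpose: deletes transaction data from the reorged block onward and resets lastProcessedBlock.
Note: this keeps the data in the database consistent with the blockchain after a reorg. The function also deletes transactions that were already confirmed, but it does not revert balances credited from them, so it is only safe when reorgs never reach deeper than the number of confirmations required.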
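
The rollback can be illustrated on in-memory state. This is a minimal sketch, not the original implementation: `CrawlerState`, `StoredTx`, and `rollbackTo` are hypothetical names, and the checkpoint here means "last fully processed block", so it is moved to `reorgBlock - 1` rather than to `reorgBlock` as in the code above.

```typescript
// Hypothetical in-memory model of what the crawler persists.
interface StoredTx {
  transactionHash: string;
  blockNumber: number;
}

interface CrawlerState {
  transactions: StoredTx[];
  lastProcessedBlock: number;
  blockHashes: Map<number, string>;
}

// Return a new state with everything at or above reorgBlock discarded.
function rollbackTo(state: CrawlerState, reorgBlock: number): CrawlerState {
  const blockHashes = new Map<number, string>();
  state.blockHashes.forEach((hash, blockNumber) => {
    // Hashes at or above the reorged height are stale and must be dropped
    if (blockNumber < reorgBlock) {
      blockHashes.set(blockNumber, hash);
    }
  });
  return {
    transactions: state.transactions.filter(
      (tx) => tx.blockNumber < reorgBlock
    ),
    // Never move the checkpoint forward during a rollback
    lastProcessedBlock: Math.min(state.lastProcessedBlock, reorgBlock - 1),
    blockHashes,
  };
}
```

Returning a new state instead of mutating the old one keeps the sketch easy to test; a database-backed version would do the deletion and the checkpoint update together, ideally in one transaction.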

Fetching and Storing Events

  private async fetchAndStoreEvents(
    fromBlock: number,
    toBlock: number,
    latestBlock: number
  ): Promise<void> {
    try {
      console.log(`📡 Fetching events: blocks ${fromBlock} → ${toBlock}`);

      const reorgThreshold = latestBlock - this.REORG_DEPTH;

      const [depositEvents, withdrawEvents] = await Promise.all([
        this.contract.queryFilter("Deposit", fromBlock, toBlock) as Promise<
          ethers.EventLog[]
        >,
        this.contract.queryFilter("Withdrawal", fromBlock, toBlock) as Promise<
          ethers.EventLog[]
        >,
      ]);

      // Get block hashes for all blocks in the batch to avoid reorgs
      const blockHashesArray = await Promise.all(
        Array.from({ length: toBlock - fromBlock + 1 }, (_, i) => {
          const blockNumber = fromBlock + i;
          return blockNumber > reorgThreshold
            ? this.getBlockHash(blockNumber)
            : null;
        })
      );
      const transactions: ITransaction[] = [];

      for (const log of [...depositEvents, ...withdrawEvents]) {
        const { transactionHash, blockNumber } = log;
        const timestamp = new Date();

        if (
          log.fragment.name === "Deposit" ||
          // The event is named "Withdrawal" in the ABI and in queryFilter above
          log.fragment.name === "Withdrawal"
        ) {
          const user = log.args[0];
          const amount = ethers.formatUnits(log.args[1], 18);
          const isDeposit = log.fragment.name === "Deposit";

          transactions.push({
            chain: this.chain.name,
            transactionHash,
            blockNumber,
            user,
            amount,
            type: isDeposit ? "deposit" : "withdraw",
            minConfirmationsRequired: this.MIN_CONFIRMATIONS,
            currentConfirmations: 0,
            isConfirmed: false,
            timestamp,
          } as ITransaction);
        }
      }

      if (transactions.length > 0) {
        await Transaction.insertMany(transactions, { ordered: false });
      }

      console.log(`✅ Saved ${transactions.length} transactions`);

      // Store only relevant block hashes
      blockHashesArray.forEach((hash, i) => {
        const blockNumber = fromBlock + i;
        if (hash) this.blockHashes.set(blockNumber, hash);
      });

      // Trim stored block hashes
      [...this.blockHashes.keys()].forEach((block) => {
        if (block < reorgThreshold) this.blockHashes.delete(block);
      });
    } catch (error) {
      console.error(`❌ Fetch error [${fromBlock} → ${toBlock}]:`, error);
      throw error;
    }
  }

Purpose: fetches the Deposit and Withdrawal events of the contract within the given block range and stores them in the database.
Note: insertMany is used to improve performance when storing many transactions.
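
If a block range is ever fetched twice (for example after a restart or a reorg rollback), the same event can be inserted again, because the schema in this article defines no unique index on transactions. One possible guard is to derive a key per event and skip keys that were already seen. This is a minimal sketch: `EventRecord`, `eventKey`, and `dedupeEvents` are hypothetical, and `logIndex` (the position of the log within its block) is an assumed field that the original schema does not store.

```typescript
// Hypothetical event shape; logIndex is an assumption, not an original field.
interface EventRecord {
  transactionHash: string;
  logIndex: number;
  blockNumber: number;
  type: "deposit" | "withdraw";
}

// One transaction can emit several events, so the hash alone is not unique.
function eventKey(event: EventRecord): string {
  return `${event.transactionHash}:${event.logIndex}`;
}

// Keep only events whose key is neither stored already nor repeated in the batch.
function dedupeEvents(
  existingKeys: ReadonlySet<string>,
  incoming: EventRecord[]
): EventRecord[] {
  const seen = new Set<string>();
  existingKeys.forEach((key) => seen.add(key));
  const fresh: EventRecord[] = [];
  for (const event of incoming) {
    const key = eventKey(event);
    if (!seen.has(key)) {
      seen.add(key);
      fresh.push(event);
    }
  }
  return fresh;
}
```

In a real database the same effect is usually achieved with a unique compound index on the key fields, so that the database rejects duplicates itself.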

Confirming Transactions

  private async processConfirmations(latestBlock: number): Promise<void> {
    const pendingTransactions = await Transaction.find({
      chain: this.chain.name,
      isConfirmed: false,
    });

    // Transaction updates and balance updates target different collections
    const txOps: any[] = [];
    const balanceOps: any[] = [];

    for (const tx of pendingTransactions) {
      const confirmations = latestBlock - tx.blockNumber;

      if (confirmations >= tx.minConfirmationsRequired) {
        txOps.push({
          updateOne: {
            filter: { _id: tx._id, transactionHash: tx.transactionHash },
            update: {
              isConfirmed: true,
              currentConfirmations: tx.minConfirmationsRequired,
            },
          },
        });

        if (tx.type === "deposit") {
          balanceOps.push({
            updateOne: {
              filter: { chain: this.chain.name, user: tx.user },
              // Note: parseFloat can lose precision for 18-decimal amounts
              update: { $inc: { balance: parseFloat(tx.amount) } },
              upsert: true,
            },
          });
        } else if (tx.type === "withdraw") {
          balanceOps.push({
            updateOne: {
              filter: { chain: this.chain.name, user: tx.user },
              update: {
                $pull: { pendingWithdraws: { hash: tx.transactionHash } },
              },
            },
          });
        }
      }
    }

    if (txOps.length > 0) {
      await Transaction.bulkWrite(txOps);
      console.log(`✅ Confirmed ${txOps.length} transactions`);
    }
    if (balanceOps.length > 0) {
      // Balances live in the UserBalance collection, not in Transaction
      await UserBalance.bulkWrite(balanceOps);
    }
  }

Purpose: checks the number of confirmations of all unconfirmed transactions, then updates their status and the user balances once the requirement is met.
Note: bulkWrite is used to optimize database performance.
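
The confirmation rule can be sketched on plain data. This is a minimal sketch with hypothetical names (`PendingTx`, `confirmationsFor`, `settleConfirmed`): confirmations are counted as in the code above (latest block minus the transaction's block), and amounts are kept as `bigint` values in the token's smallest unit, an assumption made here because a JavaScript number cannot represent every 18-decimal amount exactly.

```typescript
// Hypothetical shape of an unconfirmed transaction.
interface PendingTx {
  user: string;
  amountWei: bigint; // amount in the token's smallest unit
  blockNumber: number;
  type: "deposit" | "withdraw";
  minConfirmationsRequired: number;
}

// Same definition as the crawler: blocks mined on top of the transaction's block.
function confirmationsFor(latestBlock: number, txBlock: number): number {
  return Math.max(0, latestBlock - txBlock);
}

// Credit confirmed deposits and return the transactions that must keep waiting.
function settleConfirmed(
  pending: PendingTx[],
  latestBlock: number,
  balances: Map<string, bigint>
): PendingTx[] {
  const stillPending: PendingTx[] = [];
  for (const tx of pending) {
    const confirmations = confirmationsFor(latestBlock, tx.blockNumber);
    if (confirmations < tx.minConfirmationsRequired) {
      stillPending.push(tx);
      continue;
    }
    if (tx.type === "deposit") {
      const current = balances.get(tx.user) ?? BigInt(0);
      balances.set(tx.user, current + tx.amountWei);
    }
  }
  return stillPending;
}
```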

The Main `crawl()` Function

  private async crawl(): Promise<void> {
    try {
      const latestBlock = await this.provider.getBlockNumber();
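      const startBlock = this.chain.contracts.dex.startBlock;
      const lastProcessed = await this.loadLastProcessedBlock();
      // The initial checkpoint is startBlock itself (not crawled yet); after
      // that, resume one block past the checkpoint so no block is stored twice
      let fromBlock =
        lastProcessed > startBlock ? lastProcessed + 1 : startBlock;

      const reorgBlock = await this.checkForReorg(latestBlock);
      if (reorgBlock) {
        await this.handleReorg(reorgBlock);
        fromBlock = reorgBlock;
      }
      // Compute the end of the first batch once fromBlock is final
      let toBlock = Math.min(fromBlock + this.BATCH_SIZE, latestBlock);

      while (fromBlock <= latestBlock) {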
        await this.fetchAndStoreEvents(fromBlock, toBlock, latestBlock);
        await this.updateLastProcessedBlock(toBlock);

        fromBlock = toBlock + 1;
        toBlock = Math.min(fromBlock + this.BATCH_SIZE, latestBlock);
      }

      await this.processConfirmations(latestBlock);
      setTimeout(() => this.crawl(), this.RESTART_DELAY);
    } catch (error) {
      console.error("❌ Critical crawler error:", error);
      setTimeout(() => this.crawl(), 5000);
    }
  }

Purpose: orchestrates the whole crawl process:
- Get the latest block.
- Check for a reorg and handle it if one occurred.
- Fetch data in batches.
- Confirm transactions.
- Schedule the next run after RESTART_DELAY with setTimeout.
Note: crawl() re-schedules itself with setTimeout only after the current run has finished, so runs do not overlap as long as crawl() is started exactly once. Starting it from several places would create overlapping runs.
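
One way to make the "no overlapping runs" property explicit is a small guard around the scheduling. This is a minimal sketch, not the original implementation: `CrawlLoop` and `Timer` are hypothetical names, and the timer is injectable only so the loop can be exercised without real delays.

```typescript
// Hypothetical scheduler: runs a task, then schedules the next run only
// after the current one has finished, and ignores overlapping calls.
type Timer = (callback: () => void, delayMs: number) => unknown;

class CrawlLoop {
  private running = false;
  private stopped = false;
  private readonly task: () => Promise<void>;
  private readonly delayMs: number;
  private readonly errorDelayMs: number;
  private readonly timer: Timer;

  constructor(
    task: () => Promise<void>,
    delayMs: number,
    errorDelayMs: number,
    timer: Timer = (callback, ms) => setTimeout(callback, ms)
  ) {
    this.task = task;
    this.delayMs = delayMs;
    this.errorDelayMs = errorDelayMs;
    this.timer = timer;
  }

  async tick(): Promise<void> {
    // Guard: a second call while a run is in progress does nothing
    if (this.running || this.stopped) return;
    this.running = true;
    let nextDelay = this.delayMs;
    try {
      await this.task();
    } catch (error) {
      console.error("crawl run failed:", error);
      nextDelay = this.errorDelayMs;
    } finally {
      this.running = false;
    }
    if (!this.stopped) {
      this.timer(() => {
        void this.tick();
      }, nextDelay);
    }
  }

  stop(): void {
    this.stopped = true;
  }
}
```

Compared with calling setTimeout directly inside crawl(), the guard also covers the case where the loop is accidentally started twice, and stop() gives the process a way to shut down cleanly.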

Starting the Crawler

  public async start(): Promise<void> {
    try {
      await mongoose.connect(String(process.env.DATABASE_URI));
      console.log(`✅ Connected to MongoDB for ${this.chain.name}`);
      await this.crawl();
    } catch (error) {
      console.error(
        `❌ Failed to start crawler for ${this.chain.name}:`,
        error
      );
    }
  }
}

Purpose: connects to MongoDB and starts crawling.

How to Use the Crawler

Chain configuration

Create the file config/chains.ts:

export const chains = {
  1: {
    // Ethereum Mainnet
    name: "ethereum",
    rpcUrl: "https://mainnet.infura.io/v3/YOUR_INFURA_KEY",
    contracts: {
      dex: {
        address: "0xYourContractAddress",
        abi: [
          "event Deposit(address indexed user, uint256 amount)",
          "event Withdrawal(address indexed user, uint256 amount)",
        ],
        startBlock: 12345678,
      },
    },
    config: {
      batchSize: 1000,
      reorgDepth: 10,
      minConfirmations: 6,
      restartDelay: 5000,
    },
  },
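};

// Types used by the Crawler class, derived from the config above
export type ChainId = keyof typeof chains;
export type Chain = (typeof chains)[ChainId];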

MongoDB Schema

Create the file models/index.ts:

import mongoose, { Document, Schema } from "mongoose";
import { ChainId } from "../config/chains";

export interface ITransaction extends Document {
  chain: string; // chain name, e.g. "ethereum" (the schema stores a String)
  transactionHash: string;
  blockNumber: number;
  user: string;
  amount: string;
  type: "deposit" | "withdraw";
  minConfirmationsRequired: number;
  currentConfirmations: number;
  isConfirmed: boolean;
  timestamp: Date;
}

export interface IPendingWithdraw {
  hash: string;
  amount: number;
}

export interface IUserBalance extends Document {
  chain: string; // chain name, e.g. "ethereum" (the schema stores a String)
  user: string;
  balance: number;
  pendingWithdraws: IPendingWithdraw[];
}

export interface IBlock extends Document {
  chain: string; // chain name, e.g. "ethereum" (the schema stores a String)
  lastProcessedBlock: number;
  updatedAt: Date;
}

const transactionSchema = new Schema<ITransaction>({
  chain: { type: String, required: true, index: true },
  transactionHash: { type: String, required: true },
  blockNumber: { type: Number, required: true, index: true },
  user: { type: String, required: true },
  amount: { type: String, required: true },
  type: { type: String, enum: ["deposit", "withdraw"], required: true },
  minConfirmationsRequired: { type: Number, required: true },
  currentConfirmations: { type: Number, required: true },
  isConfirmed: { type: Boolean, required: true },
  timestamp: { type: Date, required: true },
});

const userBalanceSchema = new Schema<IUserBalance>({
  chain: { type: String, required: true, index: true },
  user: { type: String, required: true, index: true },
  balance: { type: Number, required: true, default: 0 },
  pendingWithdraws: [
    {
      hash: { type: String, required: true },
      amount: { type: Number, required: true },
    },
  ],
});

const blockSchema = new Schema<IBlock>({
  chain: { type: String, required: true, unique: true },
  lastProcessedBlock: { type: Number, required: true },
  updatedAt: { type: Date, required: true, default: Date.now },
});

export const Transaction = mongoose.model<ITransaction>(
  "Transaction",
  transactionSchema
);
export const UserBalance = mongoose.model<IUserBalance>(
  "UserBalance",
  userBalanceSchema
);
export const Block = mongoose.model<IBlock>("Block", blockSchema);

Running the Crawler

import { Crawler } from "./crawler.js";

const crawler = new Crawler(1); // Chain ID 1 is Ethereum Mainnet
crawler.start();

Conclusion

This article explained in detail how the crawler works, along with the general principles behind it and the mistakes to avoid. You can extend this crawler by adding multi-chain support or tuning the batch size. To see the full source code and the latest updates, visit the GitHub repository: https://github.com/chinhvuong/evm-crawler. If you have ideas for improvements or need further help, feel free to contribute or get in touch!