Monitoring Real-Time IO Activity of Linux Processes

As a system administrator and VPS provider, we often run into periods when a server's or a VPS's disk IO gets busy. VPSee usually reaches for a few tools to investigate; one of the most frequently used is a self-written iotop script, which makes it easy to see which process is doing IO heavily. Last Friday a reader emailed and left a comment asking about the earlier article "How to check the IO reads and writes of a process": why does WRITE sometimes show up as 0? That is a good question, and VPSee will explain it properly here. First, let's look at how we can monitor the IO activity of individual processes in real time.

block_dump

The Linux kernel provides a block_dump parameter that dumps block read/write (WRITE/READ) activity into the kernel log, where it can be viewed with the dmesg command. To turn it on:

# sysctl vm.block_dump=1
or
# echo 1 > /proc/sys/vm/block_dump

Then the IO activity of each process can be observed through dmesg:

# dmesg -c
kjournald(542): WRITE block 222528 on dm-0
kjournald(542): WRITE block 222552 on dm-0
bash(18498): dirtied inode 5892488 (ld-linux-x86-64.so.2) on dm-0
bash(18498): dirtied inode 5892482 (ld-2.5.so) on dm-0
dmesg(18498): dirtied inode 11262038 (ld.so.cache) on dm-0
dmesg(18498): dirtied inode 5892496 (libc.so.6) on dm-0
dmesg(18498): dirtied inode 5892489 (libc-2.5.so) on dm-0
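
As a side note, here is a minimal sketch (illustrative only, not the iotop script mentioned above; the program name blockdump-tally is made up) of how this kind of block_dump output could be tallied per process. It simply counts READ/WRITE lines for each process name, which, as a commenter notes below, tells you how often a process hits the block layer rather than how much data it moves:

/* Illustrative sketch only: tally block_dump lines per process.
 * Pipe kernel log lines into it, e.g.
 *     dmesg -c | ./blockdump-tally
 * It counts READ/WRITE lines per process name; "dirtied inode" lines
 * are ignored. */
#include <stdio.h>
#include <string.h>

#define MAX_PROCS 1024

struct tally { char comm[64]; long reads, writes; };
static struct tally procs[MAX_PROCS];
static int nprocs;

static struct tally *lookup(const char *comm)
{
        for (int i = 0; i < nprocs; i++)
                if (strcmp(procs[i].comm, comm) == 0)
                        return &procs[i];
        if (nprocs == MAX_PROCS)
                return NULL;
        strncpy(procs[nprocs].comm, comm, sizeof(procs[nprocs].comm) - 1);
        return &procs[nprocs++];
}

int main(void)
{
        char line[512], comm[64], op[8];

        /* Lines look like: "kjournald(542): WRITE block 222528 on dm-0" */
        while (fgets(line, sizeof(line), stdin)) {
                if (sscanf(line, "%63[^(](%*d): %7s", comm, op) != 2)
                        continue;
                if (strcmp(op, "WRITE") != 0 && strcmp(op, "READ") != 0)
                        continue;          /* e.g. "dirtied inode" lines */
                struct tally *t = lookup(comm);
                if (!t)
                        continue;
                if (op[0] == 'W')
                        t->writes++;
                else
                        t->reads++;
        }
        for (int i = 0; i < nprocs; i++)
                printf("%-16s READ: %ld  WRITE: %ld\n",
                       procs[i].comm, procs[i].reads, procs[i].writes);
        return 0;
}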

The problem

A careful reader raised this question: why do entries like WRITE block 0 appear at all? VPSee watched the output for a while and confirmed that WRITE block 0 does indeed show up, for example:

# dmesg -c
...
pdflush(23123): WRITE block 0 on sdb1
pdflush(23123): WRITE block 16 on sdb1
pdflush(23123): WRITE block 104 on sdb1
pdflush(23123): WRITE block 40884480 on sdb1
...

The answer

It turns out we had been misreading the numbers in WRITE block 0, WRITE block 16, WRITE block 104: they do not indicate how many blocks were written, but which block the write went to. To get to the bottom of it, VPSee dug into the Linux 2.6.18 kernel source and found the answer in ll_rw_blk.c:

$ vi linux-2.6.18/block/ll_rw_blk.c

void submit_bio(int rw, struct bio *bio)
{
        int count = bio_sectors(bio);

        BIO_BUG_ON(!bio->bi_size);
        BIO_BUG_ON(!bio->bi_io_vec);
        bio->bi_rw |= rw;
        if (rw & WRITE)
                count_vm_events(PGPGOUT, count);
        else
                count_vm_events(PGPGIN, count);

        if (unlikely(block_dump)) {
                char b[BDEVNAME_SIZE];
                printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
                        current->comm, current->pid,
                        (rw & WRITE) ? "WRITE" : "READ",
                        (unsigned long long)bio->bi_sector,
                        bdevname(bio->bi_bdev,b));
        }

        generic_make_request(bio);
}

It is clear from the code above that in WRITE block 0 on sdb1 the 0 is bio->bi_sector, the sector the write starts at, not the number of blocks written. Also, if the block device is split into several partitions, this bi_sector (sector number) is counted from the start of that partition; block 0 on sdb1, for example, means the write starts at sector 0 of the sdb1 partition.
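
To make the numbers concrete, here is a tiny illustrative snippet (not part of the original post) that turns one of the sector numbers above into a byte offset within the partition; the kernel's bi_sector is counted in fixed 512-byte sector units:

/* Illustrative only: convert a bi_sector value from the block_dump output
 * above into a byte offset within the partition. */
#include <stdio.h>

int main(void)
{
        unsigned long long sector = 40884480ULL;  /* from "WRITE block 40884480 on sdb1" */
        unsigned long long offset = sector * 512; /* byte offset within sdb1 */

        printf("sector %llu starts %llu bytes (about %.1f GiB) into sdb1\n",
               sector, offset, offset / (1024.0 * 1024 * 1024));
        return 0;
}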

Comments (9)

  1. In other words, with the iotop script you can only see which processes are reading and writing within each second, i.e. how frequently a process does IO, but not how much data each read or write involves.

  2. Yes, that's right.

  3. I searched around online again today: once Linux is upgraded to 2.6.20 or later, there is a way to monitor a process's IO. Newer kernels expose /proc/PID/io, so given a PID you can run cat /proc/<pid>/io to display the read/write counters; sample the values two seconds apart and the difference is the amount read and written during that interval.

  4. Yes, that's right. For 2.6.20 and later the kernel needs to be built with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING options enabled (a small sampling sketch is included after the comments below).

  5. Could you take a look at the fields in /proc/PID/io for me? I can only roughly follow what they mean. Could you explain, in particular, the difference between rchar, wchar and the later read_bytes, write_bytes?
    test:/tmp # cat /proc/3828/io
    rchar: 323934931
    wchar: 323929600
    syscr: 632687
    syscw: 632675
    read_bytes: 0
    write_bytes: 323932160
    cancelled_write_bytes: 0

    Description
    -----------

    rchar
    -----
    I/O counter: chars read
    The number of bytes which this task has caused to be read from storage. This
    is simply the sum of bytes which this process passed to read() and pread().
    It includes things like tty IO and it is unaffected by whether or not actual
    physical disk IO was required (the read might have been satisfied from
    pagecache)

    wchar
    -----
    I/O counter: chars written
    The number of bytes which this task has caused, or shall cause to be written
    to disk. Similar caveats apply here as with rchar.

    syscr
    -----
    I/O counter: read syscalls
    Attempt to count the number of read I/O operations, i.e. syscalls like read()
    and pread().

    syscw
    -----
    I/O counter: write syscalls
    Attempt to count the number of write I/O operations, i.e. syscalls like
    write() and pwrite().

    read_bytes
    ----------
    I/O counter: bytes read
    Attempt to count the number of bytes which this process really did cause to
    be fetched from the storage layer. Done at the submit_bio() level, so it is
    accurate for block-backed filesystems.

    write_bytes
    -----------
    I/O counter: bytes written
    Attempt to count the number of bytes which this process caused to be sent to
    the storage layer. This is done at page-dirtying time.

    cancelled_write_bytes
    ---------------------
    The big inaccuracy here is truncate. If a process writes 1MB to a file and
    then deletes the file, it will in fact perform no writeout. But it will have
    been accounted as having caused 1MB of write.
    In other words: the number of bytes which this process caused to not happen,
    by truncating pagecache. A task can cause "negative" IO too. If this task
    truncates some dirty pagecache, some IO which another task has been accounted
    for (in its write_bytes) will not be happening. We _could_ just subtract that
    from the truncating task's write_bytes, but there is information loss in doing
    that.
  6. "at page-dirtying time" is hard for me to understand, and so are read(), pread() and tty IO; they seem like specialist terms.

  7. The English explanation you quoted above already says it quite clearly. Page-dirtying time should be the period when a page is being modified, before it gets written back to disk; once a page is dirtied (modified) it has to be written back to disk. These are basic operating-system concepts, and any OS textbook should make them clear. read() and pread() are common system calls, and tty IO is terminal IO; these are all quite basic things, not specialist jargon ^^

  8. Pardon my ignorance... I came to this field midway, without formal training.

  9. Impressive, you traced it all the way into the Linux kernel source.
    I just use Linux; once I see code like this I can't follow it.
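
Following up on comments 3 and 4, here is a minimal sketch of the /proc/<pid>/io sampling idea. It is illustrative only: it assumes a 2.6.20 or later kernel built with TASK_DELAY_ACCT and TASK_IO_ACCOUNTING, and the program name procio is just a placeholder. It takes a PID and a sampling interval in seconds and prints the read_bytes/write_bytes delta over that interval:

/* Illustrative sketch of the sampling approach from comment 3: read
 * /proc/<pid>/io twice, <seconds> apart, and print the delta of
 * read_bytes and write_bytes.
 * Usage: ./procio <pid> <seconds> */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int read_io(const char *pid, long long *rd, long long *wr)
{
        char path[64], line[128];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/io", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                /* only the matching lines update rd/wr; others are ignored */
                sscanf(line, "read_bytes: %lld", rd);
                sscanf(line, "write_bytes: %lld", wr);
        }
        fclose(f);
        return 0;
}

int main(int argc, char *argv[])
{
        long long r1 = 0, w1 = 0, r2 = 0, w2 = 0;
        int interval;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <seconds>\n", argv[0]);
                return 1;
        }
        interval = atoi(argv[2]);
        if (interval <= 0)
                interval = 1;

        if (read_io(argv[1], &r1, &w1) < 0) { perror("first sample");  return 1; }
        sleep(interval);
        if (read_io(argv[1], &r2, &w2) < 0) { perror("second sample"); return 1; }

        printf("read_bytes:  %lld bytes in %d s (%.1f KB/s)\n",
               r2 - r1, interval, (r2 - r1) / 1024.0 / interval);
        printf("write_bytes: %lld bytes in %d s (%.1f KB/s)\n",
               w2 - w1, interval, (w2 - w1) / 1024.0 / interval);
        return 0;
}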
