Monitoring Real-Time IO Activity of Linux Processes

As a system administrator and VPS provider, we often run into periods when a server's or a VPS's disk IO gets busy. VPSee usually reaches for a few tools to investigate; one of the most frequently used is a self-written iotop script, which makes it easy to see which process is doing IO heavily. Last Friday a reader emailed and left a comment asking about the earlier article "How to check the IO reads and writes of a process": why does WRITE sometimes show up as 0? That is a good question, and VPSee will explain it properly here. First, let's look at how we can monitor the IO activity of individual processes in real time.

block_dump

The Linux kernel provides a block_dump parameter that dumps block read/write (WRITE/READ) activity into the kernel log, where it can be viewed with the dmesg command. To turn it on:

# sysctl vm.block_dump=1
or
# echo 1 > /proc/sys/vm/block_dump

Then the IO activity of each process can be observed through dmesg:

# dmesg -c
kjournald(542): WRITE block 222528 on dm-0
kjournald(542): WRITE block 222552 on dm-0
bash(18498): dirtied inode 5892488 (ld-linux-x86-64.so.2) on dm-0
bash(18498): dirtied inode 5892482 (ld-2.5.so) on dm-0
dmesg(18498): dirtied inode 11262038 (ld.so.cache) on dm-0
dmesg(18498): dirtied inode 5892496 (libc.so.6) on dm-0
dmesg(18498): dirtied inode 5892489 (libc-2.5.so) on dm-0
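
As a side note, here is a minimal sketch (illustrative only, not the iotop script mentioned above; the program name blockdump-tally is made up) of how this kind of block_dump output could be tallied per process. It simply counts READ/WRITE lines for each process name, which, as a commenter notes below, tells you how often a process hits the block layer rather than how much data it moves:

/* Illustrative sketch only: tally block_dump lines per process.
 * Pipe kernel log lines into it, e.g.
 *     dmesg -c | ./blockdump-tally
 * It counts READ/WRITE lines per process name; "dirtied inode" lines
 * are ignored. */
#include <stdio.h>
#include <string.h>

#define MAX_PROCS 1024

struct tally { char comm[64]; long reads, writes; };
static struct tally procs[MAX_PROCS];
static int nprocs;

static struct tally *lookup(const char *comm)
{
        for (int i = 0; i < nprocs; i++)
                if (strcmp(procs[i].comm, comm) == 0)
                        return &procs[i];
        if (nprocs == MAX_PROCS)
                return NULL;
        strncpy(procs[nprocs].comm, comm, sizeof(procs[nprocs].comm) - 1);
        return &procs[nprocs++];
}

int main(void)
{
        char line[512], comm[64], op[8];

        /* Lines look like: "kjournald(542): WRITE block 222528 on dm-0" */
        while (fgets(line, sizeof(line), stdin)) {
                if (sscanf(line, "%63[^(](%*d): %7s", comm, op) != 2)
                        continue;
                if (strcmp(op, "WRITE") != 0 && strcmp(op, "READ") != 0)
                        continue;          /* e.g. "dirtied inode" lines */
                struct tally *t = lookup(comm);
                if (!t)
                        continue;
                if (op[0] == 'W')
                        t->writes++;
                else
                        t->reads++;
        }
        for (int i = 0; i < nprocs; i++)
                printf("%-16s READ: %ld  WRITE: %ld\n",
                       procs[i].comm, procs[i].reads, procs[i].writes);
        return 0;
}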

The problem

A careful reader raised this question: why do entries like WRITE block 0 appear at all? VPSee watched the output for a while and confirmed that WRITE block 0 does indeed show up, for example:

# dmesg -c
...
pdflush(23123): WRITE block 0 on sdb1
pdflush(23123): WRITE block 16 on sdb1
pdflush(23123): WRITE block 104 on sdb1
pdflush(23123): WRITE block 40884480 on sdb1
...

The answer

It turns out we had been misreading the numbers in WRITE block 0, WRITE block 16, WRITE block 104: they do not indicate how many blocks were written, but which block the write went to. To get to the bottom of it, VPSee dug into the Linux 2.6.18 kernel source and found the answer in ll_rw_blk.c:

$ vi linux-2.6.18/block/ll_rw_blk.c

void submit_bio(int rw, struct bio *bio)
{
        int count = bio_sectors(bio);

        BIO_BUG_ON(!bio->bi_size);
        BIO_BUG_ON(!bio->bi_io_vec);
        bio->bi_rw |= rw;
        if (rw & WRITE)
                count_vm_events(PGPGOUT, count);
        else
                count_vm_events(PGPGIN, count);

        if (unlikely(block_dump)) {
                char b[BDEVNAME_SIZE];
                printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
                        current->comm, current->pid,
                        (rw & WRITE) ? "WRITE" : "READ",
                        (unsigned long long)bio->bi_sector,
                        bdevname(bio->bi_bdev,b));
        }

        generic_make_request(bio);
}

It is clear from the code above that in WRITE block 0 on sdb1 the 0 is bio->bi_sector, the sector the write starts at, not the number of blocks written. Also, if the block device is split into several partitions, this bi_sector (sector number) is counted from the start of that partition; block 0 on sdb1, for example, means the write starts at sector 0 of the sdb1 partition.
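
To make the numbers concrete, here is a tiny illustrative snippet (not part of the original post) that turns one of the sector numbers above into a byte offset within the partition; the kernel's bi_sector is counted in fixed 512-byte sector units:

/* Illustrative only: convert a bi_sector value from the block_dump output
 * above into a byte offset within the partition. */
#include <stdio.h>

int main(void)
{
        unsigned long long sector = 40884480ULL;  /* from "WRITE block 40884480 on sdb1" */
        unsigned long long offset = sector * 512; /* byte offset within sdb1 */

        printf("sector %llu starts %llu bytes (about %.1f GiB) into sdb1\n",
               sector, offset, offset / (1024.0 * 1024 * 1024));
        return 0;
}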

Comments (9)

  1. In other words, with the iotop script you can only see which processes are reading and writing within each second, i.e. how frequently a process does IO, but not how much data each read or write involves.

  2. Yes, that's right.

  3. I searched around online again today: once Linux is upgraded to 2.6.20 or later, there is a way to monitor a process's IO. Newer kernels expose /proc/PID/io, so given a PID you can run cat /proc/<pid>/io to display the read/write counters; sample the values two seconds apart and the difference is the amount read and written during that interval.

  4. Yes, that's right. For 2.6.20 and later the kernel needs to be built with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING options enabled (a small sampling sketch is included after the comments below).

  5. Could you take a look at the fields in /proc/PID/io for me? I can only roughly follow what they mean. Could you explain, in particular, the difference between rchar, wchar and the later read_bytes, write_bytes?
    test:/tmp # cat /proc/3828/io
    rchar: 323934931
    wchar: 323929600
    syscr: 632687
    syscw: 632675
    read_bytes: 0
    write_bytes: 323932160
    cancelled_write_bytes: 0

    Description
    -----------

    rchar
    -----
    I/O counter: chars read
    The number of bytes which this task has caused to be read from storage. This
    is simply the sum of bytes which this process passed to read() and pread().
    It includes things like tty IO and it is unaffected by whether or not actual
    physical disk IO was required (the read might have been satisfied from
    pagecache)

    wchar
    -----
    I/O counter: chars written
    The number of bytes which this task has caused, or shall cause to be written
    to disk. Similar caveats apply here as with rchar.

    syscr
    -----
    I/O counter: read syscalls
    Attempt to count the number of read I/O operations, i.e. syscalls like read()
    and pread().

    syscw
    -----
    I/O counter: write syscalls
    Attempt to count the number of write I/O operations, i.e. syscalls like
    write() and pwrite().

    read_bytes
    ----------
    I/O counter: bytes read
    Attempt to count the number of bytes which this process really did cause to
    be fetched from the storage layer. Done at the submit_bio() level, so it is
    accurate for block-backed filesystems.

    write_bytes
    -----------
    I/O counter: bytes written
    Attempt to count the number of bytes which this process caused to be sent to
    the storage layer. This is done at page-dirtying time.

    cancelled_write_bytes
    ---------------------
    The big inaccuracy here is truncate. If a process writes 1MB to a file and
    then deletes the file, it will in fact perform no writeout. But it will have
    been accounted as having caused 1MB of write.
    In other words: the number of bytes which this process caused to not happen,
    by truncating pagecache. A task can cause "negative" IO too. If this task
    truncates some dirty pagecache, some IO which another task has been accounted
    for (in its write_bytes) will not be happening. We _could_ just subtract that
    from the truncating task's write_bytes, but there is information loss in doing
    that.
  6. "at page-dirtying time" is hard for me to understand, and so are read(), pread() and tty IO; they seem like specialist terms.

  7. The English explanation you quoted above already says it quite clearly. Page-dirtying time should be the period when a page is being modified, before it gets written back to disk; once a page is dirtied (modified) it has to be written back to disk. These are basic operating-system concepts, and any OS textbook should make them clear. read() and pread() are common system calls, and tty IO is terminal IO; these are all quite basic things, not specialist jargon ^^

  8. Pardon my ignorance... I came to this field midway, without formal training.

  9. Impressive, you traced it all the way into the Linux kernel source.
    I just use Linux; once I see code like this I can't follow it.
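
Following up on comments 3 and 4, here is a minimal sketch of the /proc/<pid>/io sampling idea. It is illustrative only: it assumes a 2.6.20 or later kernel built with TASK_DELAY_ACCT and TASK_IO_ACCOUNTING, and the program name procio is just a placeholder. It takes a PID and a sampling interval in seconds and prints the read_bytes/write_bytes delta over that interval:

/* Illustrative sketch of the sampling approach from comment 3: read
 * /proc/<pid>/io twice, <seconds> apart, and print the delta of
 * read_bytes and write_bytes.
 * Usage: ./procio <pid> <seconds> */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int read_io(const char *pid, long long *rd, long long *wr)
{
        char path[64], line[128];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/io", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                /* only the matching lines update rd/wr; others are ignored */
                sscanf(line, "read_bytes: %lld", rd);
                sscanf(line, "write_bytes: %lld", wr);
        }
        fclose(f);
        return 0;
}

int main(int argc, char *argv[])
{
        long long r1 = 0, w1 = 0, r2 = 0, w2 = 0;
        int interval;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <seconds>\n", argv[0]);
                return 1;
        }
        interval = atoi(argv[2]);
        if (interval <= 0)
                interval = 1;

        if (read_io(argv[1], &r1, &w1) < 0) { perror("first sample");  return 1; }
        sleep(interval);
        if (read_io(argv[1], &r2, &w2) < 0) { perror("second sample"); return 1; }

        printf("read_bytes:  %lld bytes in %d s (%.1f KB/s)\n",
               r2 - r1, interval, (r2 - r1) / 1024.0 / interval);
        printf("write_bytes: %lld bytes in %d s (%.1f KB/s)\n",
               w2 - w1, interval, (w2 - w1) / 1024.0 / interval);
        return 0;
}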
