Monitoring Real-Time IO Activity of Linux Processes
July 26, 2010 | Tags: io activity, kernel, monitoring | Author: vpsee
As system administrators and a VPS provider, we often run into servers or VPSes whose disk IO is busy. VPSee usually investigates with a few tools, one of which is a self-written iotop script that makes it easy to see which process is doing IO frequently. Last Friday a reader emailed and left a comment on the article "How to check per-process IO reads and writes?", asking why WRITE sometimes shows up as 0. That is a good question, and VPSee will explain it properly here. First, let's look at how to monitor the IO activity of individual processes in real time.
block_dump
The Linux kernel provides a block_dump parameter that dumps block read/write (WRITE/READ) activity into the kernel log, where it can be viewed with dmesg. To enable it:
# sysctl vm.block_dump=1

or

# echo 1 > /proc/sys/vm/block_dump
Then the IO activity of each process can be observed with dmesg:
# dmesg -c
kjournald(542): WRITE block 222528 on dm-0
kjournald(542): WRITE block 222552 on dm-0
bash(18498): dirtied inode 5892488 (ld-linux-x86-64.so.2) on dm-0
bash(18498): dirtied inode 5892482 (ld-2.5.so) on dm-0
dmesg(18498): dirtied inode 11262038 (ld.so.cache) on dm-0
dmesg(18498): dirtied inode 5892496 (libc.so.6) on dm-0
dmesg(18498): dirtied inode 5892489 (libc-2.5.so) on dm-0
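An iotop-style script essentially aggregates these block_dump lines per process. As a minimal sketch (the regex, helper name, and sample lines here are illustrative, not the actual script; "dirtied inode" lines are simply skipped):

```python
import re
from collections import Counter

# Matches block_dump lines like "kjournald(542): WRITE block 222528 on dm-0".
LINE_RE = re.compile(
    r"^(?P<comm>.+?)\((?P<pid>\d+)\): (?P<op>READ|WRITE) block (?P<sector>\d+) on (?P<dev>\S+)$"
)

def count_io(dmesg_lines):
    """Count READ/WRITE events per (process, pid, operation, device)."""
    counts = Counter()
    for line in dmesg_lines:
        m = LINE_RE.match(line.strip())
        if m:  # "dirtied inode ..." lines do not match and are ignored
            counts[(m.group("comm"), int(m.group("pid")),
                    m.group("op"), m.group("dev"))] += 1
    return counts

sample = [
    "kjournald(542): WRITE block 222528 on dm-0",
    "kjournald(542): WRITE block 222552 on dm-0",
    "pdflush(23123): WRITE block 0 on sdb1",
]
print(count_io(sample))
```

Feeding it the output of dmesg once a second gives a per-process IO frequency view, which is exactly the limitation discussed below: it counts events, not bytes.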
The Question
An observant reader raised this question: why do lines like WRITE block 0 appear? VPSee watched for a while, and such WRITE 0 lines do indeed show up, for example:
# dmesg -c
...
pdflush(23123): WRITE block 0 on sdb1
pdflush(23123): WRITE block 16 on sdb1
pdflush(23123): WRITE block 104 on sdb1
pdflush(23123): WRITE block 40884480 on sdb1
...
The Answer
It turns out we misread the numbers in WRITE block 0, WRITE block 16, WRITE block 104: they are not how many blocks were written, but which block was written to. To get to the truth, VPSee dug into the Linux 2.6.18 kernel source and found the answer in ll_rw_blk.c:
$ vi linux-2.6.18/block/ll_rw_blk.c

void submit_bio(int rw, struct bio *bio)
{
	int count = bio_sectors(bio);

	BIO_BUG_ON(!bio->bi_size);
	BIO_BUG_ON(!bio->bi_io_vec);
	bio->bi_rw |= rw;
	if (rw & WRITE)
		count_vm_events(PGPGOUT, count);
	else
		count_vm_events(PGPGIN, count);

	if (unlikely(block_dump)) {
		char b[BDEVNAME_SIZE];
		printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
			current->comm, current->pid,
			(rw & WRITE) ? "WRITE" : "READ",
			(unsigned long long)bio->bi_sector,
			bdevname(bio->bi_bdev, b));
	}

	generic_make_request(bio);
}
The code above makes it clear that in "WRITE block 0 on sdb1" the 0 is bio->bi_sector, i.e. which sector is being written to, not how many blocks were written. Also, if the block device is split into multiple partitions, this bi_sector (sector number) is counted from the start of the partition: "block 0 on sdb1" means the write starts at sector 0 of the sdb1 partition.
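Since bi_sector is a position rather than a size, turning it into a byte offset within the partition is just a multiplication by the sector size (assuming the conventional 512-byte sectors the kernel uses for bi_sector):

```python
SECTOR_SIZE = 512  # bytes; the kernel's bi_sector is expressed in 512-byte units

def sector_to_offset(sector):
    """Byte offset of a sector number within its partition."""
    return sector * SECTOR_SIZE

# "WRITE block 40884480 on sdb1" is a write starting ~19.5 GiB into sdb1:
offset = sector_to_offset(40884480)
print(offset, offset / 2**30)  # byte offset and the same value in GiB
```

This is a quick way to sanity-check where on a partition pdflush is writing.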
In other words, the iotop script can only show which processes are reading and writing each second, i.e. how frequently a process does IO; it cannot tell you the size of each read or write.
Yes, that's right.
I searched around again today: once Linux is upgraded to 2.6.20 or later there is a way to monitor per-process IO. Newer kernels support /proc/PID/io, so given a PID you can run cat /proc/<pid>/io to see the read/write figures; sample the values twice, 2 seconds apart, and the difference gives the data read and written over that interval.
Yes, that's right. For 2.6.20 and later, the kernel must be compiled with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING options enabled.
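The two-sample trick described above can be sketched like this. The parsing and rate logic are shown on canned snapshots so the sketch runs anywhere; on a real 2.6.20+ kernel you would read /proc/<pid>/io twice instead of using these hypothetical strings:

```python
def parse_proc_io(text):
    """Parse the content of /proc/<pid>/io into a dict of integer counters."""
    fields = {}
    for line in text.strip().splitlines():
        key, value = line.split(":")
        fields[key.strip()] = int(value)
    return fields

def io_rate(before, after, interval=2.0):
    """Per-second deltas between two /proc/<pid>/io snapshots."""
    return {k: (after[k] - before[k]) / interval for k in before}

# Two hypothetical snapshots taken 2 seconds apart:
t0 = "rchar: 1000\nwchar: 4096\nread_bytes: 0\nwrite_bytes: 8192"
t1 = "rchar: 3000\nwchar: 8192\nread_bytes: 0\nwrite_bytes: 24576"
print(io_rate(parse_proc_io(t0), parse_proc_io(t1)))
```

On Linux the snapshots would come from open("/proc/%d/io" % pid).read() with a sleep in between; the field names match the cat output discussed next.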
Could you help me with the fields in /proc/PID/io? I can only roughly make them out. Could you explain them, especially the difference between rchar/wchar and the later read_bytes/write_bytes?
test:/tmp # cat /proc/3828/io
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0

Description
-----------

rchar
-----
I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
pagecache).

wchar
-----
I/O counter: chars written
The number of bytes which this task has caused, or shall cause to be written
to disk. Similar caveats apply here as with rchar.

syscr
-----
I/O counter: read syscalls
Attempt to count the number of read I/O operations, i.e. syscalls like read()
and pread().

syscw
-----
I/O counter: write syscalls
Attempt to count the number of write I/O operations, i.e. syscalls like
write() and pwrite().

read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems.

write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.

cancelled_write_bytes
---------------------
The big inaccuracy here is truncate. If a process writes 1MB to a file and
then deletes the file, it will in fact perform no writeout. But it will have
been accounted as having caused 1MB of write.
In other words: the number of bytes which this process caused to not happen,
by truncating pagecache. A task can cause "negative" IO too. If this task
truncates some dirty pagecache, some IO which another task has been accounted
for (in its write_bytes) will not be happening. We _could_ just subtract that
from the truncating task's write_bytes, but there is information loss in doing
that.
"at page-dirtying time" is hard for me to understand, and read(), pread(), tty io also look like jargon to me.
The English descriptions you quoted above already explain it well. "Page-dirtying time" is the moment a page is modified, before it is written back to disk; once a page is dirtied (modified), it needs to be written back to disk. These are basic operating-system concepts; any OS textbook will cover them. read() and pread() are common system calls, and tty io is terminal io. They're all pretty basic things, not really jargon ^^
Pardon my ignorance... I came to this field late.
Impressive, you traced it all the way into the Linux kernel source.
I just use Linux; this code is over my head.