Tracing file access processing in the Linux kernel (12) block: submit_bio

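Overview
Introduction
Overview of BIO
The submit_bio function called from the writepages function
The submit_bio function called from the write_inode function
Conclusion
Change history
References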

Overview

We boot Linux 5.15 on QEMU's vexpress-a9 (arm) machine and trace the kernel processing behind a file write.

This chapter examines the submit_bio function.

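Introduction

Through the file system, a user process can write and read data on a storage device in the form of files.
This investigation traces what processing the Linux kernel performs when a user process issues a write request to a file.

For the investigation target and environment, see Part 1: Environment setup.

This article follows the path from the point where the writeback kernel thread calls the submit_bio function to the point where the blk_mq_submit_bio function is called.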

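Overview of BIO

Linux uses a data structure called BIO to exchange I/O with block devices.
When the file system writes to a file, the writepages function and the write_inode function are called, and each of them generates a bio.

(The bio for the actual data and the bio for the inode are generated by separate writeback kthreads.)

The bio structure also records which operation to perform in its bi_opf field.
Linux v5.15 defines the following operations.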

Operation	Value	Description
REQ_OP_READ	0	read sectors from the device
REQ_OP_WRITE	1	write sectors to the device
REQ_OP_FLUSH	2	flush the volatile write cache
REQ_OP_DISCARD	3	discard sectors
REQ_OP_SECURE_ERASE	5	securely erase sectors
REQ_OP_WRITE_SAME	7	write the same sector many times
REQ_OP_WRITE_ZEROES	9	write the zero filled sector many times
REQ_OP_ZONE_OPEN	10	Open a zone
REQ_OP_ZONE_CLOSE	11	Close a zone
REQ_OP_ZONE_FINISH	12	Transition a zone to full
REQ_OP_ZONE_APPEND	13	write data at the current zone write pointer
REQ_OP_ZONE_RESET	15	reset a zone write pointer
REQ_OP_ZONE_RESET_ALL	17	reset all the zone present on the device
REQ_OP_DRV_IN	34	Driver private requests
REQ_OP_DRV_OUT	35	Driver private requests

In this case the workload only appends to a file, so the operation stored in bi_opf is 1 (REQ_OP_WRITE).
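As a rough illustration of how an operation number is read back out of bi_opf, the sketch below masks off everything except the low bits. The 8-bit width mirrors the REQ_OP_BITS definition in v5.15 as far as I know, but treat it as an assumption; all `sketch_` and `SKETCH_` names, including the example flag bit, are hypothetical and exist only for this example.

```c
/*
 * Assumed layout, modelled on v5.15: the low 8 bits of bi_opf hold the
 * operation and the remaining bits hold request flags.
 */
#define SKETCH_REQ_OP_BITS 8
#define SKETCH_REQ_OP_MASK ((1u << SKETCH_REQ_OP_BITS) - 1)

enum sketch_req_op {
    SKETCH_REQ_OP_READ  = 0,
    SKETCH_REQ_OP_WRITE = 1,
    SKETCH_REQ_OP_FLUSH = 2,
};

/* Hypothetical flag bit placed just above the operation bits. */
#define SKETCH_REQ_FLAG_EXAMPLE (1u << SKETCH_REQ_OP_BITS)

/* Return only the operation part of bi_opf, discarding the flag bits. */
unsigned int sketch_bio_op(unsigned int bi_opf)
{
    return bi_opf & SKETCH_REQ_OP_MASK;
}
```

With this layout, a bi_opf value that combines SKETCH_REQ_OP_WRITE with a flag still yields 1 when passed through sketch_bio_op.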

Because both the writepages function and the write_inode function call submit_bio, we follow the two paths in turn.

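The submit_bio function called from the writepages function

The submit_bio function

The submit_bio function takes a bio and, depending on the operation in bi_opf, passes it on so that it is queued to the I/O scheduler as a request.
It is defined as follows.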

// 1057:
blk_qc_t submit_bio(struct bio *bio)
{
    if (blkcg_punt_bio_submit(bio))
        return BLK_QC_T_NONE;

    /*
    * If it's a regular read/write or a barrier with data attached,
    * go through the normal accounting stuff before submission.
    */
    if (bio_has_data(bio)) {
        unsigned int count;

        if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
            count = queue_logical_block_size(
                    bio->bi_bdev->bd_disk->queue) >> 9;
        else
            count = bio_sectors(bio);

        if (op_is_write(bio_op(bio))) {
            count_vm_events(PGPGOUT, count);
        } else {
            task_io_account_read(bio->bi_iter.bi_size);
            count_vm_events(PGPGIN, count);
        }
    }

    /*
    * If we're reading data that is part of the userspace workingset, count
    * submission time as memory stall.  When the device is congested, or
    * the submitting cgroup IO-throttled, submission can be a significant
    * part of overall IO time.
    */
    if (unlikely(bio_op(bio) == REQ_OP_READ &&
        bio_flagged(bio, BIO_WORKINGSET))) {
        unsigned long pflags;
        blk_qc_t ret;

        psi_memstall_enter(&pflags);
        ret = submit_bio_noacct(bio);
        psi_memstall_leave(&pflags);

        return ret;
    }

    return submit_bio_noacct(bio);
}

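The submit_bio function first lets blkcg_punt_bio_submit handle the cgroup case, then updates the I/O statistics, and finally calls the submit_bio_noacct function.

In this environment CONFIG_BLK_CGROUP=n, so blkcg_punt_bio_submit is an inline function that simply returns false.

When the bio carries data, the statistics are updated according to the operation:

REQ_OP_WRITE: increases vm_event_states.event[PGPGOUT] by count.
REQ_OP_WRITE_SAME: adjusts count, then proceeds in the same way as REQ_OP_WRITE.
REQ_OP_READ: increases read_bytes in the per-process I/O statistics /proc/<pid>/io and increases vm_event_states.event[PGPGIN] by count.

The bio_has_data function checks that the bio about to be submitted carries data.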

// 61:
static inline bool bio_has_data(struct bio *bio)
{
    if (bio &&
        bio->bi_iter.bi_size &&
        bio_op(bio) != REQ_OP_DISCARD &&
        bio_op(bio) != REQ_OP_SECURE_ERASE &&
        bio_op(bio) != REQ_OP_WRITE_ZEROES)
        return true;

    return false;
}

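In principle the size of the bio tells whether it has data, but operations such as DISCARD, SECURE_ERASE and WRITE_ZEROES carry no data payload, so the function returns false for them.

Immediately afterwards the operation is compared with REQ_OP_WRITE_SAME. In Linux this operation was only ever used for zero filling, and it was removed in Linux 5.18.

listman.redhat.com

We therefore look at bio_sectors, which is executed in the else branch.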

// 49:
#define bio_sectors(bio)    bvec_iter_sectors((bio)->bi_iter)

bvec_iter_sectors returns the bi_size of the bio structure (the data size in bytes) divided by 512, that is, the number of sectors.

// 46:
#define bvec_iter_sectors(iter) ((iter).bi_size >> 9)
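The shift by 9 is simply a division by 512. A minimal user-space sketch of the same arithmetic follows; the function and macro names are hypothetical and not kernel identifiers.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9   /* 2^9 = 512 bytes per sector */

/* Convert a byte count to a count of whole 512-byte sectors. */
uint32_t sketch_bytes_to_sectors(uint32_t bi_size)
{
    return bi_size >> SKETCH_SECTOR_SHIFT;
}
```

For example, a bio covering one 4096-byte page gives 8 sectors, and any remainder smaller than 512 bytes is truncated.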

The op_is_write function then determines whether the operation is a write.

// 444:
static inline bool op_is_write(unsigned int op)
{
    return (op & 1);
}

For a write the PGPGOUT counter is updated, and for a read the PGPGIN counter.
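The accounting step can be modelled in user space as follows. This is only a sketch: `struct sketch_vm_events` is a hypothetical stand-in for the kernel's VM event counters, and the per-process read accounting done by task_io_account_read is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the PGPGIN / PGPGOUT VM event counters. */
struct sketch_vm_events {
    uint64_t pgpgin;
    uint64_t pgpgout;
};

/* Same test as op_is_write(): write-type operations have an odd number. */
bool sketch_op_is_write(unsigned int op)
{
    return (op & 1) != 0;
}

/* Account one data-carrying bio of `bytes` bytes with operation `op`. */
void sketch_account_bio(struct sketch_vm_events *ev, unsigned int op,
                        uint32_t bytes)
{
    uint32_t count = bytes >> 9;    /* number of 512-byte sectors */

    if (sketch_op_is_write(op))
        ev->pgpgout += count;
    else
        ev->pgpgin += count;
}
```

Because the check only looks at the lowest bit, REQ_OP_WRITE (1) and REQ_OP_DRV_OUT (35) count as writes, while REQ_OP_READ (0) and REQ_OP_FLUSH (2) do not.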

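Note: implementation before Linux kernel v5.14

Kernels before v5.14 provided `block_dump` as a sysctl interface (/proc/sys/vm/block_dump); it was removed in Linux v5.14. patchwork.kernel.org `block_dump` is a debugging interface for BIO, and when it is enabled, information like the following is printed to the kernel log.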


     root@10:~# echo 1 > /proc/sys/vm/block_dump 
     root@10:~# write.sh 
     [   73.033618] write.sh(93): READ block 4632 on mmcblk0 (8 sectors)
     [   73.038820] write.sh(93): dirtied inode 12 (FILE) on mmcblk0
     [   73.039578] write.sh(93): READ block 270336 on mmcblk0 (8 sectors)
     [   73.040878] write.sh(93): dirtied inode 12 (FILE) on mmcblk0
     [   73.041152] write.sh(93): dirtied inode 12 (FILE) on mmcblk0
     root@10:~# [  102.109639] kworker/u8:2(37): WRITE block 270336 on mmcblk0 (8 sectors)
     [  131.170521] kworker/u8:2(37): WRITE block 536 on mmcblk0 (8 sectors)

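For a write operation, control reaches the final return statement, which calls the submit_bio_noacct function.

The submit_bio_noacct function

The submit_bio_noacct function is a wrapper that dispatches to __submit_bio_noacct_mq or __submit_bio_noacct.

It is defined as follows.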

// 1025:
blk_qc_t submit_bio_noacct(struct bio *bio)
{
    /*
    * We only want one ->submit_bio to be active at a time, else stack
    * usage with stacked devices could be a problem.  Use current->bio_list
    * to collect a list of requests submited by a ->submit_bio method while
    * it is active, and then process them after it returned.
    */
    if (current->bio_list) {
        bio_list_add(&current->bio_list[0], bio);
        return BLK_QC_T_NONE;
    }

    if (!bio->bi_bdev->bd_disk->fops->submit_bio)
        return __submit_bio_noacct_mq(bio);
    return __submit_bio_noacct(bio);
}

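If a ->submit_bio method is already active in the current task, as happens with stacked block devices such as MD (Multi-Device) and DM (Device Mapper), bio_list_add appends the bio to the bio_list of the current process and the function returns.
If the block device is blk-mq based (its block_device_operations has no submit_bio), the __submit_bio_noacct_mq function is called.
Otherwise the __submit_bio_noacct function is called.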
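The three-way decision described above can be written as a small pure function. This is a sketch for illustration only: the enum and the two boolean parameters are hypothetical stand-ins for `current->bio_list != NULL` and `fops->submit_bio != NULL`.

```c
#include <stdbool.h>

enum sketch_submit_path {
    SKETCH_PATH_DEFERRED,   /* bio is appended to current->bio_list */
    SKETCH_PATH_BLK_MQ,     /* __submit_bio_noacct_mq() is called */
    SKETCH_PATH_BIO_BASED,  /* __submit_bio_noacct() is called */
};

/* Mirror the order of the checks in submit_bio_noacct(). */
enum sketch_submit_path sketch_choose_path(bool submit_in_progress,
                                           bool driver_has_submit_bio)
{
    if (submit_in_progress)
        return SKETCH_PATH_DEFERRED;
    if (!driver_has_submit_bio)
        return SKETCH_PATH_BLK_MQ;
    return SKETCH_PATH_BIO_BASED;
}
```

Note that the "already active" check comes first, so a nested submission is deferred regardless of the driver type.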

This environment has a simple configuration (no MD or DM), and the driver is blk-mq based: the block_device_operations below does not set submit_bio.

// 823:
static const struct block_device_operations mmc_bdops = {
    .open           = mmc_blk_open,
    .release        = mmc_blk_release,
    .getgeo         = mmc_blk_getgeo,
    .owner          = THIS_MODULE,
    .ioctl          = mmc_blk_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl       = mmc_blk_compat_ioctl,
#endif
    .alternative_gpt_sector = mmc_blk_alternative_gpt_sector,
};

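The __submit_bio_noacct_mq function

The __submit_bio_noacct_mq function wraps the path to the blk_mq_submit_bio function, which creates a request for the block device and issues it.

Here we look at how the data structures involved relate to one another.

The bio structure and the request_queue structure are tied together through the gendisk structure, which represents a whole disk (bio->bi_bdev->bd_disk->queue).

In addition, to support stacked block devices, current->bio_list holds the bios that are submitted while this processing is in progress.

The __submit_bio_noacct_mq function is defined as follows.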

// 1001:
static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
{
    struct bio_list bio_list[2] = { };
    blk_qc_t ret;

    current->bio_list = bio_list;

    do {
        ret = __submit_bio(bio);
    } while ((bio = bio_list_pop(&bio_list[0])));

    current->bio_list = NULL;
    return ret;
}

The __submit_bio_noacct_mq function calls __submit_bio for the bio it was given and then for every bio that was added to bio_list in the meantime.
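The point of this loop is that nested submissions are queued and processed iteratively instead of recursing. The user-space sketch below models that idea under simplifying assumptions: a single list instead of the kernel's two-element bio_list array, a global pointer in place of current->bio_list, and a hypothetical `child` field that makes a bio submit another bio while it is being processed. All `sketch_` names are invented for this example.

```c
#include <stddef.h>

struct sketch_bio {
    int id;
    struct sketch_bio *next;
    struct sketch_bio *child;   /* hypothetical: submitted during processing */
};

struct sketch_bio_list {
    struct sketch_bio *head, *tail;
};

/* Plays the role of current->bio_list: non-NULL while a submission is active. */
static struct sketch_bio_list *sketch_current_list;

int sketch_order[8];        /* ids in the order they were processed */
int sketch_order_len;
int sketch_max_depth;       /* deepest nesting of the processing step */
static int sketch_depth;

static void sketch_list_add(struct sketch_bio_list *l, struct sketch_bio *b)
{
    b->next = NULL;
    if (l->tail)
        l->tail->next = b;
    else
        l->head = b;
    l->tail = b;
}

static struct sketch_bio *sketch_list_pop(struct sketch_bio_list *l)
{
    struct sketch_bio *b = l->head;

    if (b) {
        l->head = b->next;
        if (!l->head)
            l->tail = NULL;
        b->next = NULL;
    }
    return b;
}

void sketch_submit(struct sketch_bio *bio);

/* Stand-in for __submit_bio(): may submit another bio while running. */
static void sketch_process_one(struct sketch_bio *bio)
{
    if (++sketch_depth > sketch_max_depth)
        sketch_max_depth = sketch_depth;
    if (sketch_order_len < 8)
        sketch_order[sketch_order_len++] = bio->id;
    if (bio->child)
        sketch_submit(bio->child);  /* nested call only queues the bio */
    sketch_depth--;
}

void sketch_submit(struct sketch_bio *bio)
{
    struct sketch_bio_list list = { NULL, NULL };

    if (sketch_current_list) {      /* a submission is already active */
        sketch_list_add(sketch_current_list, bio);
        return;
    }

    sketch_current_list = &list;
    do {
        sketch_process_one(bio);
    } while ((bio = sketch_list_pop(&list)));
    sketch_current_list = NULL;
}
```

With a chain of three bios where each one submits the next, the bios are processed in submission order and the processing step never nests deeper than one level, which is the stack-usage property the kernel comment in submit_bio_noacct refers to.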

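The __submit_bio function

The __submit_bio function prepares the bio to be issued and then runs the disk's submit_bio operation, or blk_mq_submit_bio when the disk defines none.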

// 915:
static blk_qc_t __submit_bio(struct bio *bio)
{
    struct gendisk *disk = bio->bi_bdev->bd_disk;
    blk_qc_t ret = BLK_QC_T_NONE;

    if (unlikely(bio_queue_enter(bio) != 0))
        return BLK_QC_T_NONE;

    if (!submit_bio_checks(bio) || !blk_crypto_bio_prep(&bio))
        goto queue_exit;
    if (disk->fops->submit_bio) {
        ret = disk->fops->submit_bio(bio);
        goto queue_exit;
    }
    return blk_mq_submit_bio(bio);

queue_exit:
    blk_queue_exit(disk->queue);
    return ret;
}

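Before handing the request to the block device, the __submit_bio function performs the following steps.

Increment the reference count of the request queue
Check whether the block device supports the operation
Prepare inline encryption in the block layer

First we look at the bio_queue_enter function, which increments the reference count.

Incrementing the reference count

The bio_queue_enter function is defined as follows.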

// 471:
static inline int bio_queue_enter(struct bio *bio)
{
    struct gendisk *disk = bio->bi_bdev->bd_disk;
    struct request_queue *q = disk->queue;

    while (!blk_try_enter_queue(q, false)) {
        if (bio->bi_opf & REQ_NOWAIT) {
            if (test_bit(GD_DEAD, &disk->state))
                goto dead;
            bio_wouldblock_error(bio);
            return -EBUSY;
        }

        /*
        * read pair of barrier in blk_freeze_queue_start(), we need to
        * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
        * reading .mq_freeze_depth or queue dying flag, otherwise the
        * following wait may never return if the two reads are
        * reordered.
        */
        smp_rmb();
        wait_event(q->mq_freeze_wq,
               (!q->mq_freeze_depth &&
                blk_pm_resume_queue(false, q)) ||
               test_bit(GD_DEAD, &disk->state));
        if (test_bit(GD_DEAD, &disk->state))
            goto dead;
    }

    return 0;
dead:
    bio_io_error(bio);
    return -ENODEV;
}

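The bio_queue_enter function waits until blk_try_enter_queue reports that the queue can be entered.
If REQ_NOWAIT is set, it does not wait and returns -EBUSY instead. If the disk is marked GD_DEAD, it fails the bio and returns -ENODEV.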
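The return values can be summarised as a decision over three conditions. The sketch below models a single pass through the loop body; the struct and function are hypothetical, and the value 1 is an invented marker meaning "the real function would sleep in wait_event and retry".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state bio_queue_enter() looks at. */
struct sketch_queue_state {
    bool can_enter;   /* blk_try_enter_queue() would succeed */
    bool nowait;      /* the bio has REQ_NOWAIT set */
    bool disk_dead;   /* GD_DEAD is set on the disk */
};

/* Returns 0, -EBUSY or -ENODEV; 1 means "would wait and try again". */
int sketch_queue_enter_once(const struct sketch_queue_state *s)
{
    if (s->can_enter)
        return 0;
    if (s->nowait)
        return s->disk_dead ? -ENODEV : -EBUSY;
    if (s->disk_dead)
        return -ENODEV;
    return 1;
}
```

In other words, -EBUSY is specific to REQ_NOWAIT bios on a live disk, while -ENODEV always indicates a dead disk.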

The blk_try_enter_queue function is defined as follows.

// 415:
static bool blk_try_enter_queue(struct request_queue *q, bool pm)
{
    rcu_read_lock();
    if (!percpu_ref_tryget_live(&q->q_usage_counter))
        goto fail;

    /*
    * The code that increments the pm_only counter must ensure that the
    * counter is globally visible before the queue is unfrozen.
    */
    if (blk_queue_pm_only(q) &&
        (!pm || queue_rpm_status(q) == RPM_SUSPENDED))
        goto fail_put;

    rcu_read_unlock();
    return true;

fail_put:
    percpu_ref_put(&q->q_usage_counter);
fail:
    rcu_read_unlock();
    return false;
}

// 284:
static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
{
    unsigned long __percpu *percpu_count;
    bool ret = false;

    rcu_read_lock();

    if (__ref_is_percpu(ref, &percpu_count)) {
        this_cpu_inc(*percpu_count);
        ret = true;
    } else if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)) {
        ret = atomic_long_inc_not_zero(&ref->data->count);
    }

    rcu_read_unlock();

    return ret;
}

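We omit the details, but blk_try_enter_queue checks whether the queue is usable and, if it is, increments q_usage_counter.
When the reference is obtained, bio_queue_enter leaves the while loop; when it cannot be obtained (for example because the queue is frozen), blk_try_enter_queue returns false and bio_queue_enter waits inside the loop.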
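The "increment only if still alive" behaviour of the atomic fallback path (atomic_long_inc_not_zero in the listing above) can be approximated with C11 atomics. This is a sketch of the general technique, not the kernel implementation, and it does not model the per-CPU fast path; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only while the counter is non-zero. */
bool sketch_inc_not_zero(atomic_long *count)
{
    long old = atomic_load(count);

    while (old != 0) {
        /* On failure `old` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;   /* the counter already dropped to zero */
}
```

A counter that has reached zero can never be revived through this function, which is what lets a caller treat a failed attempt as "the queue is going away or frozen".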

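Checking whether the block device supports the operation

Next, the submit_bio_checks function validates the bio to be issued (for example, it rejects an operation that the storage device does not support).

The submit_bio_checks function is defined as follows.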

// 797:
static noinline_for_stack bool submit_bio_checks(struct bio *bio)
{
    struct block_device *bdev = bio->bi_bdev;
    struct request_queue *q = bdev->bd_disk->queue;
    blk_status_t status = BLK_STS_IOERR;
    struct blk_plug *plug;

    might_sleep();

    plug = blk_mq_plug(q, bio);
    if (plug && plug->nowait)
        bio->bi_opf |= REQ_NOWAIT;

    /*
    * For a REQ_NOWAIT based request, return -EOPNOTSUPP
    * if queue does not support NOWAIT.
    */
    if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q))
        goto not_supported;

    if (should_fail_bio(bio))
        goto end_io;
    if (unlikely(bio_check_ro(bio)))
        goto end_io;
    if (!bio_flagged(bio, BIO_REMAPPED)) {
        if (unlikely(bio_check_eod(bio)))
            goto end_io;
        if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
            goto end_io;
    }

    /*
    * Filter flush bio's early so that bio based drivers without flush
    * support don't have to worry about them.
    */
    if (op_is_flush(bio->bi_opf) &&
        !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
        if (!bio_sectors(bio)) {
            status = BLK_STS_OK;
            goto end_io;
        }
    }

    if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
        bio_clear_hipri(bio);

    switch (bio_op(bio)) {
    case REQ_OP_DISCARD:
        if (!blk_queue_discard(q))
            goto not_supported;
        break;
    case REQ_OP_SECURE_ERASE:
        if (!blk_queue_secure_erase(q))
            goto not_supported;
        break;
    case REQ_OP_WRITE_SAME:
        if (!q->limits.max_write_same_sectors)
            goto not_supported;
        break;
    case REQ_OP_ZONE_APPEND:
        status = blk_check_zone_append(q, bio);
        if (status != BLK_STS_OK)
            goto end_io;
        break;
    case REQ_OP_ZONE_RESET:
    case REQ_OP_ZONE_OPEN:
    case REQ_OP_ZONE_CLOSE:
    case REQ_OP_ZONE_FINISH:
        if (!blk_queue_is_zoned(q))
            goto not_supported;
        break;
    case REQ_OP_ZONE_RESET_ALL:
        if (!blk_queue_is_zoned(q) || !blk_queue_zone_resetall(q))
            goto not_supported;
        break;
    case REQ_OP_WRITE_ZEROES:
        if (!q->limits.max_write_zeroes_sectors)
            goto not_supported;
        break;
    default:
        break;
    }

    /*
    * Various block parts want %current->io_context, so allocate it up
    * front rather than dealing with lots of pain to allocate it only
    * where needed. This may fail and the block layer knows how to live
    * with it.
    */
    if (unlikely(!current->io_context))
        create_task_io_context(current, GFP_ATOMIC, q->node);

    if (blk_throtl_bio(bio)) {
        blkcg_bio_issue_init(bio);
        return false;
    }

    blk_cgroup_bio_start(bio);
    blkcg_bio_issue_init(bio);

    if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
        trace_block_bio_queue(bio);
        /* Now that enqueuing has been traced, we need to trace
        * completion as well.
        */
        bio_set_flag(bio, BIO_TRACE_COMPLETION);
    }
    return true;

not_supported:
    status = BLK_STS_NOTSUPP;
end_io:
    bio->bi_status = status;
    bio_endio(bio);
    return false;
}

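The submit_bio_checks function checks for conditions such as the following.

The request queue does not support QUEUE_FLAG_NOWAIT
The partition remap fails
The block device is read-only
The bio extends past the end of the device or partition
The device has no writeback cache (the flush flags are dropped)
The device does not support I/O polling (the high-priority flag is cleared)
The device does not support SECURE_ERASE
The device does not support WRITE_SAME (max_write_same_sectors is 0)
A zone operation is requested on a device that is not a zoned block device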
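The per-operation part of these checks boils down to a switch over the operation that consults the queue's capabilities. The sketch below models a few of the cases; `struct sketch_queue_caps`, the status enum and the function are hypothetical, and only the operation numbers are taken from the table of operations earlier in this article.

```c
#include <stdbool.h>

enum sketch_status { SKETCH_STS_OK, SKETCH_STS_NOTSUPP };

enum sketch_op {
    SKETCH_OP_WRITE        = 1,
    SKETCH_OP_DISCARD      = 3,
    SKETCH_OP_SECURE_ERASE = 5,
    SKETCH_OP_WRITE_ZEROES = 9,
    SKETCH_OP_ZONE_RESET   = 15,
};

/* Hypothetical summary of what the request queue supports. */
struct sketch_queue_caps {
    bool discard;
    bool secure_erase;
    unsigned int max_write_zeroes_sectors;
    bool zoned;
};

/* Reject operations that the queue does not advertise support for. */
enum sketch_status sketch_check_op(const struct sketch_queue_caps *q,
                                   enum sketch_op op)
{
    switch (op) {
    case SKETCH_OP_DISCARD:
        return q->discard ? SKETCH_STS_OK : SKETCH_STS_NOTSUPP;
    case SKETCH_OP_SECURE_ERASE:
        return q->secure_erase ? SKETCH_STS_OK : SKETCH_STS_NOTSUPP;
    case SKETCH_OP_WRITE_ZEROES:
        return q->max_write_zeroes_sectors ? SKETCH_STS_OK
                                           : SKETCH_STS_NOTSUPP;
    case SKETCH_OP_ZONE_RESET:
        return q->zoned ? SKETCH_STS_OK : SKETCH_STS_NOTSUPP;
    default:
        return SKETCH_STS_OK;   /* plain reads and writes need no capability */
    }
}
```

A plain REQ_OP_WRITE, as in this investigation, falls through to the default case and passes this part of the check.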

Inline encryption in the block layer

Inline encryption in the block layer is a kernel feature that was introduced in Linux v5.8.

docs.kernel.org

// 176:
config BLK_INLINE_ENCRYPTION
    bool "Enable inline encryption support in block layer"
    help
      Build the blk-crypto subsystem. Enabling this lets the
      block layer handle encryption, so users can take
      advantage of inline encryption hardware if present.

When CONFIG_BLK_INLINE_ENCRYPTION=y and the bio has an encryption context, the __blk_crypto_bio_prep function is called to prepare for inline encryption.

// 122:
bool __blk_crypto_bio_prep(struct bio **bio_ptr);
static inline bool blk_crypto_bio_prep(struct bio **bio_ptr)
{
    if (bio_has_crypt_ctx(*bio_ptr))
        return __blk_crypto_bio_prep(bio_ptr);
    return true;
}
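The wrapper above follows a common pattern: a cheap inline check keeps the common case (no encryption context) away from the out-of-line slow path. A user-space sketch of that pattern follows. It assumes the context is represented by a nullable pointer on the bio; all types and names are hypothetical, and the slow path here only counts how often it is reached.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_crypt_ctx { int key_slot; };              /* hypothetical */
struct sketch_crypt_bio { struct sketch_crypt_ctx *crypt_ctx; };

int sketch_slow_path_calls;     /* how often the slow path was entered */

/* Stand-in for the out-of-line preparation; always succeeds here. */
static bool sketch_crypto_prep_slow(struct sketch_crypt_bio **bio_ptr)
{
    (void)bio_ptr;
    sketch_slow_path_calls++;
    return true;
}

/* Fast-path wrapper: bios without a context skip the slow path. */
bool sketch_crypto_bio_prep(struct sketch_crypt_bio **bio_ptr)
{
    if ((*bio_ptr)->crypt_ctx != NULL)
        return sketch_crypto_prep_slow(bio_ptr);
    return true;
}
```

A bio without a context returns true without touching the slow path, which corresponds to the case traced in this article, where an ordinary file write carries no encryption context.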