How to make ENOSPC handling not be such a performance drag?
The problem is we have a single resource, the free space available on the disk, which needs to be accessed by many people at the same time, people writing to the disk. This only refers to the actual reservation system, since the actual low level allocator is quite good at having multiple allocators at the same time.
- Reservation: This is based on the number of items you will insert/modify in a leaf. So for example if you chmod an inode that will result in the modification of the inode item, which will be 1 item, which results in a 96 kilobyte reservation on a 4 kilobyte leafsize file system.
- Space Info: This is just the internal structure that keeps track of all space for a given block group type, either data or metadata, but for the purposes of this article we're going to be talking about the metadata one.
Types of Reservations
- Normal operation reservations, things create/link/unlink etc. These will reserve some space in the transaction, do their operation, free up their reservation. Nice and simple and doesn't put too much pressure on the reservation system.
- Delayed inode operations. These are a little trickier. We will allocate a delayed inode operation structure and hold the reservation we need to make the update for the life of the delayed operation. Once we actually update the tree we will drop this reservation. So this reservation can last well past the actual operation that created the delayed operation, but they are pretty easy to evict, just update the tree and drop the reservation and we are good to go.
- Delayed allocation, the biggest problem we have. Since we only update the metadata once our data IO completes we will hold a metadata reservation for every extent we have from the moment we do a write() call until the data is on disk and we can add the file extent item to the tree, at which point we drop our reservation. That means if we have all of our reservations tied up in delalloc reservations we have to flush that data and wait for it to complete before we can reclaim the space to make new reservations.
What we currently do
- See if we can overcommit. Since we way over estimate for our reservations we assume that we will never actually use all of the space for a reservation, and since we only allocate chunks of space on the disk on demand there could be a whole lot of free space that hasn't been allocated for either data or metadata, which means we have even more wiggle room. So depending on the type of allocation this is we will allow overcommits of up to 1/2 of the unallocated chunks on the disk.
- Flush. When we flush we block everybody else from being able to make a reservation until we have completed our reservation. This is to keep other people from coming in and stealing our reservation while we're doing the flushing work, this is how we keep early ENOSPC errors from happening. We flush 3 things in a couple of different stages
- Delalloc flushing. The first pass through we just start writeout and sleep for a little bit hoping that we managed to free up some space so we can continue. If this doesn't work we will explicitly wait on ordered extents to complete, since that is the only way we can know for sure the reservations have been freed up.
- Delayed inode operations. We will run through an arbitrary number of delayed inode operations and run them so that their reservations are freed up. If this doesn't work we will flush all existing delayed inode operations. We don't have to worry about this looping forever since nobody can add a delayed inode operation without making a reservation anyway, and those are all locked out by the flusher.
- Committing the transaction. Since we pin space we have freed in the current transaction we can actually reclaim space by committing the transaction. We try to avoid this if at all possible, but in some cases, like rm -rf /some/big/directory, this actually ends up being quite useful. Once you get extremely full however this starts to hurt, since we will fill up the last 0.01% by constantly committing the transaction to reclaim the small amount of pinned space we may have.
So this all works pretty well as far as correctness is concerned, if there are still issues it is mostly because of corner cases where we don't do the right thing, compression being a recent example.
We get by pretty well with performance as well, except in a few cases, such as random writes or many writers. Once you fill up the free space in the chunks you have and you've filled up the overcommit area you hit this constant give and take with the flushers and reservers. So here are the problems as I see them, and below each a list of things I think we can do to fix them. Ideas are welcome, simply add a new bullet point below the problem with your solution and your irc nick so we know who's adding the suggestions so we can argue about it in IRC.
- Over reservation. This sucks but there is only so much we can do, since if we under reserve we are screwed.
- One way to fix this was the overcommit work, which helped quite a bit, and I think that is about all we can do for this case. josef
- One thought I had was to pre-reserve more space in the global block reserve and then only reserve 1 leafsize worth of space per item, and so once we hit the actual leaf we are going to modify we use our reservation, and we lean on the global block reserve for the rest of the allocations that happen on the way down to our leaf. The drawback is when our metadata size is larger than the free space in our metadata chunks, we'll have to switch back to the original reservation system, or maybe trust the logic to commit the transaction when we've used too much of the global block rsv, which I'm not too keen to do. josef
- (not working) Attempt to address how to flush less stated below. Ther over-reservation of a 4k block can go up to 96k as the worst case calculation (see above). This accounts for splitting the full tree path from 8th level root down to the leaf plus the node splits. My question: how often do we need to go up to the level N+1 from current level N? for levels 0 and 1 it may happen within one transaction, maybe not so often for level 2 and with exponentially decreasing frequency for the higher levels. Therefore, is it possible to check the tree level first and adapt the calculation according to that? Let's say we can reduce the 4k reservation size from 96k to 32k on average (for a many-gigabyte filesystem), thus increasing the space available for reservations by some factor. The expected gain is less pressure to the flusher because more reservations will succeed immediately.
The idea behind is to make the initial reservation more accurate to current state than blindly overcommitting by some random factor (1/2). Another hint to the tree root level may be the usage of the root node: eg. if the root is less than half full, splitting will not happen unless there are K concurrent reservations running where K is proportional to overwriting the whole subtree (same exponential decrease with increasing level) and this will not be possible within one transaction or there will not be enough space to satisfy all reservations. (This attempts to fine-tune the currently hardcoded level 8 up to the best value). The safe value for the level in the calculations would be like N+1, ie. as if all the possible splits happen with respect to current tree height. kdave
- One flusher at a time. Only one guy trying to make a reservation can flush delalloc/delayed nodes at a time, and everybody else is blocked up behind him. The flusher also has to wait for all of the work that he started to be done, rather than just being woken up when enough space has been freed up
- First thing I want to do is make it so that we run delayed inode operations first instead of flushing delalloc. Flushing delalloc means we have to wait for the IO to complete before we get our reservation back, but running delayed inode operations all happens right then and so is likely to induce less latency. This won't be the case for a streaming write or something like that, but I'm not convinced it will cause a degredation in the streaming write workload, it will be something that I check however. josef
- Move the flushing to a thread. I know we used to do this and I think we stopped because we were having other problems with early enospc. Now that we have the early enospc problems all nailed down I think we can go back to a threaded approach. This will be tricky in the actual out of space case, since we will need some way of telling the ones making the reservations that we truly are out of space. We can add a waitqueue that gets kicked anytime space_info->reservation_progress get's kicked, and that way waiters just wait until their space is free'ed. We just have a list that the waiters add themselves to a global list and they only get to take their reservation if they're the first one in the list, everybody else just goes to sleep. We can have the flusher set some bit saying it did everything it could so the first guy will know it is screwed and exit with ENOSPC. josef
- We have to wait for IO to complete in order to reclaim delalloc reservations.
- I'm not sure there is anything we can do about this, since it's how we maintain data ordered. I had thoughts of creating some sort of crazy tree that holds the metadata that gets spliced in when the IO completes so we can maintain data ordered, but I don't think it's possible without having to have another class of reservations. josef
- Basically I'm looking forward to elaborating on global_rsv's calculation, a preciser one, a more reasonable one. liubo
Those are the big problems, no matter what we do above the flushing level we will always hit a point where we have to flush, so figuring out how to flush less and flush more intelligently will give us the biggest performance return.
Nice to have
Here are some other ideas I've had that will likely give us some sort of boost, but likely just in the best case scenarios, feel free to add ideas here for the higher level issues
- Per-cpu space pools. We figure out how much free space we have every transaction and we divide it up into per-cpu pools. This will reduce lock contention on space_info->lock since we can just drain space from our space pools. Then when we flush we evenly distribute the space among the pools. The tricky part is when we just have one guy doing stuff, we will have a whole bunch of space spread across pools that aren't being used, so we'd have to come up with some sort of weighting algorithm. The other thing we could do is make it so you can steal reservations from other cpu pools. This would run into problems once you actually fill up the disk since all the CPU's would then just be contending on each others per-cpu pool locks to try and steal reservations. So it could be nice, but it could be awful, and again would really only be helpful in the best-case scenarios, if we had to flush we'd all be pretty screwed. josef
- Make requests to helper threads well dispatched. Make full use of multiple async helper threads to gain performance, and this may depends on how and when we start another new one. My prediction is more threads we have, faster we can be. Still I'll check if it is true however. liubo