Adding a Timeout feature to Rustgem

Date: 22nd March 2023

After my last blogpost, I kept developing the Rust version of the VGEM driver, also known as rustgem for now. Previously, I had developed two important features of the driver: the ability to attach a fence and the ability to signal a fence. Still one important feature is still missing: the ability to prevent hangs. Currently, if the fence is not signaled, the driver will simply hang. So, we can create a callback that signals the fence when the fence is not signaled by the user for more than 10 seconds.

In order to create this callback, we need to have a Timer that will trigger it after the specified amount of time. Gladly, the Linux kernel provides us with a Timer that can be set with a callback and a timeout. But, to use it in the Rust code, we need to have a safe abstraction, that will ensure that the code is safe under some assumptions.

First Attempt: writing a Timer abstraction

Initially, I was developing an abstraction on my own as I checked the RfL tree and there were no Timer abstractions available.

The most important question here is “how can we guarantee access to other objects inside the callback?”. The callback only has receives a pointer to the struct timer_list as its single argument. Naturally, we can think about using a container_of macro. In order to make the compatibility layer between Rust and the C callback, I decided to store the object inside the Timer. Yep, I didn’t like that a lot, but it was the solution I came up with at the time. The struct looked something like this:

/// A driver-specific Timer Object
//
// # Invariants
// timer is a valid pointer to a struct timer_list and we own a reference to it.
[repr(C)]
pub struct UniqueTimer<T: TimerOps<Inner = D>, D> {
   timer: *bindings::timer_list,
   inner: D,
   _p: PhantomData<T>,
}

Moreover, the second important question I had was “how can the user pass a callback function to the timer?”. There were two possibilities: using a closure and using a Trait. I decided to go through the trait path. Things would be kind of similar if I decided to go into the closure path.

/// Trait which must be implemented by driver-specific timer objects.
pub trait TimerOps: Sized {
    /// Type of the Inner data inside the Timer
    type Inner;

    /// Timer callback
    fn timer_callback(timer: &UniqueTimer<Self, Self::Inner>);
}

With those two questions solved, it seems that we are all set and good to go. So, we can create methods to initialize the timer and modify the timer’s timeout, implement the Drop trait, and use the following callback by default:

unsafe extern "C" fn timer_callback<T: TimerOps<Inner = D>, D: Sized>(
    timer: *mut bindings::timer_list,
) {
    let timer = crate::container_of!(timer, UniqueTimer<T, D>, timer)
			    as *mut UniqueTimer<T, D>;

    // SAFETY: The caller is responsible for passing a valid timer_list subtype
    T::timer_callback(unsafe { &mut *timer });
}

All should work, right? Well… No, I didn’t really mention how I was allocating memory. And let’s say I was initially allocating it wrongly and therefore, the container_of macro was pointing to the wrong memory space.

Initially, I was allocating only timer with the kernel memory allocator krealloc and allocating the rest of the struct with Rust’s memory allocator. By making such a mess, container_of wasn’t able to point to the right memory address.

I had to change things a bit to allocate the whole struct UniqueTimer with the kernel’s memory allocator. However, krealloc returns a raw pointer and it would be nice for the final user to get a raw pointer to the object. I wrapped up inside another struct that could be dereferenced into the UniqueTimer object.

/// A generic Timer Object
///
/// This object should be instantiated by the end user, as it holds
/// a unique reference to the UniqueTimer struct. The UniqueTimer
/// methods can be used through it.
pub struct Timer<T: TimerOps<Inner = D>, D>(*mut UniqueTimer<T, D>);

impl<T: TimerOps<Inner = D>, D> Timer<T, D> {
    /// Create a timer for its first use
    pub fn setup(inner: D) -> Self {
        let t = unsafe {
            bindings::krealloc(
                core::ptr::null_mut(),
                core::mem::size_of::<UniqueTimer<T, D>>(),
                bindings::GFP_KERNEL | bindings::__GFP_ZERO,
            ) as *mut UniqueTimer<T, D>
        };

        // SAFETY: The pointer is valid, so pointers to members are too.
        // After this, all fields are initialized.
        unsafe {
            addr_of_mut!((*t).inner).write(inner);
            bindings::timer_setup(addr_of_mut!((*t).timer), Some(timer_callback::<T, D>), 0)
        };

        Self(t)
    }
}

And then the container_of macro started working! Now, I could setup a Timer for each fence and keep the fence inside the timer. Finally, I could use the fence inside the timer to signal it when it was not signaled by the user for more than 10 seconds.

impl TimerOps for VgemFenceOps {
    type Inner = UniqueFence<Self>;

    fn timer_callback(timer: &UniqueTimer<Self, UniqueFence<Self>>) {
        let _ = timer.inner().signal();
    }
}

So, I tested the driver with IGT using the vgem_slow test and it was now passing! All IGT tests were passing and it looked like the driver was practically completed (some FIXME problems notwithstanding). But, let’s see if this abstraction is really safe…

Second Attempt: using a Timer abstraction

First, let’s inspect the struct timer_list in the C code.

struct timer_list {
    struct hlist_node   entry;
    unsigned long       expires;
    void            (*function)(struct timer_list *);
    u32         flags;
};

By looking at this struct, we can see a problem in my abstraction: a timer can point to a timer through a list. If you are not familiar with Rust, this can seem normal, but self-referential types can lead to undefined behavior (UB).

Let’s say we have an example type with two fields: u32 and a pointer to this u32 value. Initially, everything looks fine, the pointer field points to the value field in memory address A, which contains a valid u32, and all pointers are valid. But Rust has the freedom to move values around memory. For example, if we pass this struct into another function, it might get moved to a different memory address. So, the once valid pointer is no longer valid, because when we move the struct, the struct’s fields change their address, but not their value. Now, the pointer fields still point to the memory address A, although the value field is located at the memory address B now. This is really bad and can lead to UB.

The solution is to make timer_list implement the !Unpin trait. This means that to use this type safely, we can’t use regular pointers for self-reference. Instead, we use special pointers that “pin” their values into place, ensuring they can’t be moved.

Still looking at the struct timer_list, it is possible to notice that a timer can queue itself in the timer function. This functionality is not covered by my current abstraction.

Moreover, I was using jiffies to modify the timeout duration and I was adding a Duration to the jiffies. This is problematic, because it can cause a data races. Reading jiffies and adding a duration to them should be an atomic operation.

Huge thanks to the RfL folks that pointed the errors in my implementation!

With all these problems pointed out, it is time to fix them! I could have reimplemented my safe abstraction, but the RfL folks pointed me to a Timer abstraction that they are developing in a downstream tree. Therefore, I decided to use their Timer abstraction.

There were two options to implement a Timer abstraction:

To implement the Timeout trait to the VgemFence struct
To use the FnTimer abstraction

In the end, I decided to go with the second approach. The FnTimer receives a closure that will be executed at the timeout. The closure can return an enum that indicated if the timer is done or if it should be rescheduled.

When implementing the timer, I had a lot of borrow checker problems. See… I need to use the Fence object inside the callback and also move the Fence object at the end of the function. So, I got plenty of “cannot move out of fence because it is borrowed” errors. Also, I needed the Timer to be dropped at the same time as the fence, so I needed to store the Timer inside the VgemFence struct.

The solution to the problems: smart pointers! I boxed the FnTimer and the closure inside the FnTimer so that I could store it inside the VgemFence struct. Then, the second problem got fixed. But, I still cannot use the fence inside the closure, because it wasn’t encapsulated inside a smart pointer. So, I used an Arc to box Fence, clone it, and move it to the scope of the closure.

pub(crate) struct VgemFence {
	fence: Arc<UniqueFence<Fence>>,
	_timer: Box<FnTimer<Box<dyn FnMut() -> Result<Next> + Sync>>>,
}

impl VgemFence {
	pub(crate) fn create() -> Result<Self> {
		let fence_ctx = FenceContexts::new(1, QUEUE_NAME, &QUEUE_CLASS_KEY)?;
		let fence = Arc::try_new(fence_ctx.new_fence(0, Fence {})?)?;

		// SAFETY: The caller calls [`FnTimer::init_timer`] before using the timer.
		let t = Box::try_new(unsafe {
			FnTimer::new(Box::try_new({
				let fence = fence.clone();
				move || {
					let _ = fence.signal();
					Ok(Next::Done)
				}
			})? as Box<_>)
		})?;

		// SAFETY: As FnTimer is inside a Box, it won't be moved.
		let ptr = unsafe { core::pin::Pin::new_unchecked(&*t) };

		timer_init!(ptr, 0, "vgem_timer");

		// SAFETY: Duration.as_millis() returns a valid total number of whole milliseconds.
		let timeout =
			unsafe { bindings::msecs_to_jiffies(Duration::from_secs(10).as_millis().try_into()?) };

		// We force the fence to expire within 10s to prevent driver hangs
		ptr.raw_timer().schedule_at(jiffies_later(timeout));

		Ok(Self { fence, _timer: t })
	}
}

You can observe in this code that the initialization of the FnTimer uses an unsafe operation. This happens because we still don’t have Safe Pinned Initialization. But the RfL folks are working hard to land this feature and improve ergonomics when using Pin.

Now, running again the vgem_slow IGT test, you can see that all IGT tests are now passing!

Next Steps

During this time, many improvements landed in the driver: all the objects are being properly dropped, including the DRM device; all error cases are returning the correct error; the SAFETY comments are properly written and most importantly, the timeout feature was introduced. With that, all IGT tests are passing and the driver is functional!

Now, the driver is in a good shape, apart from one FIXME problem: currently, the IOCTL abstraction doesn’t support any drivers that the IOCTLs don’t start in 0x00 and the VGEM driver starts its IOCTLs with 0x01. I don’t know yet how to bypass this problem without adding a dummy IOCTL as 0x00, but I hope to get a solution to it soon.

The progress of this project can be followed in this PR and I hope to see this project being integrated upstream in the future.