I had some time to continue hacking on my CUDA deinterlacer today. I finally got all the pieces together into a simple bob deinterlacer: the decoded frame is uploaded to the GPU, the fields are split apart, and each field is rendered with my RGB-conversion pixel shader.
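The field split itself is simple: the even field is every even scanline of a plane, the odd field every odd scanline, and bob then displays each field (scaled back to full height) as its own frame. A minimal CPU sketch of that split, ignoring all the CUDA/GL plumbing (the function name and plane layout here are illustrative, not from my actual code):

```cpp
#include <cstddef>
#include <vector>

// Extract one field (every second scanline) from an interleaved plane.
// Call once with odd=false and once with odd=true per plane to get
// both fields; bob deinterlacing then shows each field on its own.
std::vector<unsigned char> extract_field(
    const std::vector<unsigned char>& plane,
    std::size_t width, std::size_t height, bool odd)
{
    std::vector<unsigned char> field;
    field.reserve((height / 2) * width);
    for (std::size_t row = odd ? 1 : 0; row < height; row += 2)
        field.insert(field.end(),
                     plane.begin() + row * width,
                     plane.begin() + (row + 1) * width);
    return field;
}
```

The same per-row copy maps directly onto a trivial CUDA kernel, which is why the split itself is so cheap.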
Unfortunately, I've run into performance problems already. Mapping and unmapping the graphics buffers between CUDA and OpenGL takes 8.6 ms for standard definition and 35.4 ms for high def, yet there are only about 16 ms between frames. The actual field splitting takes under 1 ms.
Now, this is a naive implementation that uses 9 pixel buffer objects: Y, Cr, and Cb planes for the original frame, the even field, and the odd field. But given how the times scaled with resolution, the cost looks proportional to the amount of data rather than the number of buffers, so I'm not sure cramming everything into one large buffer would help. So I'm either doing something wrong or it's a driver problem.
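One thing I still want to try is batching the maps: the CUDA graphics-interop API can map an array of registered resources in a single call, so the nine buffers would cost one synchronization point instead of nine. A sketch of what that would look like (this is an assumption about where the time goes, not a measured fix, and the function below is hypothetical):

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Hypothetical sketch: map all nine PBOs in one call rather than
// nine separate map/unmap pairs. Whether this actually amortizes
// the overhead depends on the driver; it needs measuring.
void map_all_planes(cudaGraphicsResource_t resources[9],
                    void* dev_ptrs[9], cudaStream_t stream)
{
    cudaGraphicsMapResources(9, resources, stream);
    for (int i = 0; i < 9; ++i) {
        std::size_t bytes = 0;
        cudaGraphicsResourceGetMappedPointer(&dev_ptrs[i], &bytes,
                                             resources[i]);
    }
    // ... launch the field-split kernel on dev_ptrs ...
    cudaGraphicsUnmapResources(9, resources, stream);
}
```

If the driver still stalls per byte rather than per map, though, this won't buy anything, which would point back at the driver theory.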