Start with a three-layer sandwich in the timeline: your original footage on the bottom track, the caption in the middle, and another (synchronised) copy of the bottom layer on top.
This top copy is the one you’re going to mask so that it covers your text. There are many ways to do this, but the simplest is to use that clip’s ‘event pan/crop’ to draw, feather and animate the mask as required.
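If it helps to see why the sandwich works, here is a rough per-pixel sketch of the layering. It is only an illustration of ordinary alpha blending, not how the editor is implemented internally; the function names and the 0.0 to 1.0 value ranges are my own assumptions.

```python
def over(top, bottom, alpha):
    """Blend one channel value: alpha=1.0 shows only top, alpha=0.0 only bottom."""
    return alpha * top + (1.0 - alpha) * bottom


def composite_pixel(footage, caption, caption_alpha, mask_alpha):
    """Hypothetical model of the three-layer sandwich for a single pixel.

    footage       -- brightness of the original footage (bottom and top tracks)
    caption       -- brightness of the caption (middle track)
    caption_alpha -- 1.0 where the caption has text, 0.0 where it is transparent
    mask_alpha    -- 1.0 inside the mask on the top copy, 0.0 outside,
                     in-between values along the feathered edge
    """
    # Middle over bottom: the caption drawn on the original footage.
    with_caption = over(caption, footage, caption_alpha)
    # Top over that: the masked copy hides the caption wherever the mask is opaque.
    return over(footage, with_caption, mask_alpha)
```

Inside the mask you get the footage back, so the text looks as if it sits behind whatever you masked; outside the mask the caption shows as normal, and feathering blends between the two.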
Open the top clip’s ‘event pan/crop’ control by clicking the upper icon at the right-hand end of the clip.
At the bottom of the dialogue, click the ‘mask’ track to activate it, then use the tools on the left-hand side to draw and feather a mask for the frame you are looking at. Move to another frame, reposition the mask vertices, and carry on animating like this until you get the required result. The online help has all the details about the relevant tools.
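As a rough picture of what happens between the frames where you set the mask, here is a small sketch that moves each vertex in a straight line from one keyframe to the next. Straight-line (linear) blending is an assumption on my part; the editor may well smooth the motion differently. The function name and data layout are made up for illustration.

```python
def interpolate_mask(keyframes, frame):
    """Return the mask vertices at `frame`.

    keyframes -- dict mapping a frame number to a list of (x, y) vertices;
                 every keyframe must have the same number of vertices.
    Assumes plain linear interpolation between neighbouring keyframes.
    """
    frames = sorted(keyframes)
    # Before the first or after the last keyframe, hold the nearest shape.
    if frame <= frames[0]:
        return list(keyframes[frames[0]])
    if frame >= frames[-1]:
        return list(keyframes[frames[-1]])
    # Find the pair of keyframes that surrounds the requested frame.
    for start, end in zip(frames, frames[1:]):
        if start <= frame <= end:
            t = (frame - start) / (end - start)
            return [
                (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
                for (x0, y0), (x1, y1) in zip(keyframes[start], keyframes[end])
            ]


# Two keyframes ten frames apart; ask for the shape halfway between them.
shape = interpolate_mask({0: [(0, 0), (10, 0)], 10: [(10, 10), (20, 10)]}, 5)
```

The practical upshot is that you only need a new keyframe where the in-between shape drifts off the object you are masking.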
Good luck
Alan