My advice is to take a few large photos of the individual polaroids. Bring them into Photoshop or AE and mask out the background. It’ll be easy, since they’re rectangles. Then, in AE, you can build your scene however you see fit. You can also precomp each polaroid and place the moving video inside the precomp. Turn those precomps into 3D layers and you can manipulate them however you want without struggling to match perspective.
I think you may have a harder time pulling a key than just creating your own masks.