Ok, I'm not sure I'm fully understanding the problem.
The normal reason for missing masks is that you have manually adjusted the landmarks, in which case any NN-generated masks will be deleted (they have to be, as the aligned face that generated them is no longer valid). These missing masks can be regenerated with the mask tool.
"There is a mismatch between the number of frames found in the video file (6143) and the number of frames found in the alignments file (12286)."
This is odd to me, but it may be down to how the video file has stored 3D frames. I don't work much with 3D, but I would assume that a 3D frame would have the left and right image in a single frame. However, the alignments file holds exactly double the number of frames that the manual tool finds in the video (12,286 vs 6,143).
This would suggest to me that the video file has somehow packed the left and right images into separate frames.
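As a quick sanity check on that reading, the two counts from the error message can be divided out. This is just arithmetic on the reported numbers, and the function name is made up for illustration:

```python
def frame_multiple(alignments_frames: int, video_frames: int) -> tuple[int, int]:
    """Return (whole multiple, remainder) of alignments frames over video frames."""
    return divmod(alignments_frames, video_frames)


# Counts taken from the error message quoted above.
multiple, remainder = frame_multiple(12286, 6143)
print(multiple, remainder)  # 2 0
```

A remainder of 0 with a multiple of 2 is consistent with every frame being counted twice during extraction, though it does not prove that the cause is the 3D packing.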
The manual tool is a tricky beast. It must be frame accurate; being out by even 1 frame will destroy any work you do. I have worked long and hard to get it as accurate as possible, and I would say it works 99% of the time. Unfortunately videos come in all sorts of codecs and formats, so it is not possible to cover 100% of eventualities.
When extracting, the video is simply iterated through from start to end, generating faces for the alignments file. This iteration generated 12,286 frames' worth of data.
When using the manual tool, the video must be parsed end to end to get the frame count. We do this using multiple techniques, including evaluating for missing keyframes, handling duped frames for 3:2 pulldown, and trying to handle variable framerates, so that we can get an accurate frame count for the video, plus the correct Presentation Timestamp to be able to jump accurately to any requested frame. If any one of these (or other) variables is incorrect, then the process will fail.
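If you want to see for yourself whether your file disagrees with itself, something like the following rough sketch may help. It is not what the manual tool does internally (that is far more involved); it only assumes you have OpenCV (`cv2`) installed, and compares the frame count the container reports against the number of frames actually decoded when reading from start to end. The function names and the `video.mp4` path are placeholders I have made up for illustration.

```python
def count_frames(path: str) -> tuple[int, int]:
    """Return (frame count reported by the container, frames actually decoded)."""
    import cv2  # imported here so the helper below works without OpenCV

    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {path}")
    # Count taken from the file's metadata; may be an estimate for some codecs.
    reported = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    decoded = 0
    while True:
        ok, _ = cap.read()  # ok is False once no more frames can be read
        if not ok:
            break
        decoded += 1
    cap.release()
    return reported, decoded


def summarise(reported: int, decoded: int, alignments: int) -> dict[str, bool]:
    """Show which of the three frame counts agree with each other."""
    return {
        "metadata_matches_decode": reported == decoded,
        "alignments_match_decode": alignments == decoded,
        "alignments_match_metadata": alignments == reported,
    }


# Example usage (hypothetical path, uncomment to run on your own file):
# reported, decoded = count_frames("video.mp4")
# print(summarise(reported, decoded, alignments=12286))
```

If the reported and decoded counts differ, that is a fair hint that the file's metadata or timestamps are the issue rather than your alignments. OpenCV may well decode the file differently from the tools here, so treat the numbers as indicative only.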
Ultimately, some variable in what your video is reporting is not playing nice with what the video actually generates when you iterate from start to end.
Re-encoding the video would be my first port of call. I don't know in what manner, as I do not work with 3D video.