The high-resolution videos play back at much lower volume than the
original ones. This adds hard-coded values for how much to amplify
each cutscene. It's all done by ear, and it does introduce some
clipping, but I think it should be acceptable.
Of course, it could also be a problem with the audio decoder, so
this may be the wrong approach entirely.