Abstract: Visual Speech Recognition (VSR), also known as lip-reading, requires accurate mouth tracking and alignment to achieve high performance. Conventional approaches rely on complex pipelines ...