Voice-activated intelligent entertainment systems are prevalent in modern TVs. These systems require accurate automatic speech recognition (ASR) models to transcribe voice queries for downstream language understanding tasks. Currently, labeling audio data for training is the main bottleneck in deploying accurate machine learning ASR models, especially when these models require up-to-date training data to adapt to shifting customer needs. We present an auto-annotation system that provides high-quality training data without any hand-labeled audio by detecting speech recognition errors and suggesting possible fixes. With our algorithm, the auto-annotated training data reaches an overall word error rate (WER) of 0.002; furthermore, applying the auto-suggested fixes reduces WER by 0.907.
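
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words, so a value of 0.002 is a fraction rather than a percentage. The sketch below assumes that conventional definition with whitespace tokenization; it is not necessarily the exact scoring setup behind the figures above, and the function name and example queries are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or text normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative voice queries (made up for this sketch).
exact = word_error_rate("play the next episode", "play the next episode")
one_deletion = word_error_rate("play the next episode", "play next episode")
```

Here `exact` is 0.0 and `one_deletion` is 0.25, since one of the four reference words is missing from the hypothesis. Because insertions count as errors, WER under this definition can exceed 1.0 when the hypothesis is much longer than the reference.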