Grounding DINO: Open-Vocabulary Object Detection on Videos