Publisher's Synopsis
Video captioning, the task of describing the content of a video in natural language, is a popular task in both computer vision and natural language processing. Early work focused on generating sentence-level captions for short video clips (Venugopalan et al., 2015). Krishna et al. (2017) propose the task of dense video captioning, in which the system must first detect event segments and then generate a caption for each. Park et al. (2019) propose the task of video paragraph captioning: they use ground-truth event segments and focus on generating coherent paragraphs. Lei et al. (2020) follow this task setting and propose a recurrent transformer model that generates more coherent and less repetitive paragraphs. Since ground-truth event segments are often unavailable in practice, our goal is to generate paragraph captions without them.