Show, Adapt and Tell:
Adversarial Training of Cross-domain Image Captioner

Tseng-Hung Chen¹, Yuan-Hong Liao¹, Ching-Yao Chuang¹, Wan-Ting Hsu¹, Jianlong Fu², Min Sun¹

¹National Tsing Hua University, ²Microsoft Research

IEEE International Conference on Computer Vision (ICCV) 2017


Abstract: Impressive image captioning results are achieved in domains with plenty of training image and sentence pairs (e.g., MSCOCO). However, transferring to a target domain with significant domain shifts but no paired training data (referred to as cross-domain image captioning) remains largely unexplored. We propose a novel adversarial training procedure to leverage unpaired data in the target domain. Two critic networks are introduced to guide the captioner, namely a domain critic and a multi-modal critic. The domain critic assesses whether the generated sentences are indistinguishable from sentences in the target domain. The multi-modal critic assesses whether an image and its generated sentence are a valid pair. During training, the critics and the captioner act as adversaries: the captioner aims to generate indistinguishable sentences, whereas the critics aim to distinguish them. The assessment improves the captioner through policy gradient updates. During inference, we further propose a novel critic-based planning method to select high-quality sentences without additional supervision (e.g., tags). To evaluate, we use MSCOCO as the source domain and four other datasets (CUB-200-2011, Oxford-102, TGIF, and Flickr30k) as the target domains. Our method consistently performs well on all datasets. In particular, on CUB-200-2011, we achieve a 21.8% CIDEr-D improvement after adaptation. Utilizing critics during inference gives a further 4.5% boost.
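The training idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. The "captioner" is just a softmax over three fixed candidate captions, both critics are hypothetical lookup tables with made-up scores that stay fixed (in the paper they are networks trained adversarially against the captioner), and multiplying the two critic scores into a single reward is an assumption made here for illustration. It only shows how a REINFORCE-style policy gradient update could move probability toward captions that both critics rate highly.

```python
import math
import random

# Hypothetical critic outputs in [0, 1] for three candidate captions.
# Real critics would be trained networks; these numbers are made up.
DOMAIN_CRITIC = {"source-style": 0.2, "target-style": 0.9, "mismatched": 0.8}
MULTIMODAL_CRITIC = {"source-style": 0.9, "target-style": 0.9, "mismatched": 0.1}
CAPTIONS = list(DOMAIN_CRITIC)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(caption):
    # Assumption for this sketch: the reward is the product of both critic scores.
    return DOMAIN_CRITIC[caption] * MULTIMODAL_CRITIC[caption]


def train_captioner(steps=3000, lr=0.2, seed=0):
    """Run REINFORCE on the toy captioner and return its final distribution."""
    rng = random.Random(seed)
    logits = [0.0] * len(CAPTIONS)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one caption from the current policy.
        sampled = rng.choices(range(len(CAPTIONS)), weights=probs)[0]
        r = reward(CAPTIONS[sampled])
        for j in range(len(logits)):
            # Gradient of log softmax w.r.t. logit j: one_hot(sampled) - probs.
            grad_log_prob = (1.0 if j == sampled else 0.0) - probs[j]
            logits[j] += lr * r * grad_log_prob
    return dict(zip(CAPTIONS, softmax(logits)))


policy = train_captioner()
best = max(policy, key=policy.get)
print(best, round(policy[best], 3))
```

In this toy setup only the "target-style" caption scores well under both critics, so it should end up with most of the probability mass. A caption that fools only one critic earns a low product reward.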

  title={Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner}, 
  author={Chen, Tseng-Hung and Liao, Yuan-Hong and Chuang, Ching-Yao and Hsu, Wan-Ting and Fu, Jianlong and Sun, Min}, 
  journal={arXiv preprint arXiv:1705.00930}, 



Critic-based Planning
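The abstract describes critic-based planning as using the critics at inference time to select high-quality sentences without additional supervision. A much simplified way to picture this is reranking: score each candidate caption with both critics and keep the best one. The sketch below assumes that simplification and is not the paper's exact procedure. The function name `select_caption`, the score tables, and the product used to combine the two scores are all hypothetical. The multi-modal critic would normally take the image as input as well; here the image is treated as fixed.

```python
from typing import Callable, Sequence


def select_caption(candidates: Sequence[str],
                   domain_critic: Callable[[str], float],
                   multimodal_critic: Callable[[str], float]) -> str:
    """Return the candidate with the highest combined critic score."""
    return max(candidates,
               key=lambda s: domain_critic(s) * multimodal_critic(s))


# Two candidate captions for one bird image (taken from the examples on this page).
BEFORE = "A yellow and yellow bird is sitting on a branch."
AFTER = "This is a yellow bird with a black head and a small beak."

# Made-up critic scores for illustration only.
HYPOTHETICAL_DOMAIN_SCORES = {BEFORE: 0.3, AFTER: 0.9}
HYPOTHETICAL_PAIR_SCORES = {BEFORE: 0.7, AFTER: 0.8}

chosen = select_caption(
    [BEFORE, AFTER],
    domain_critic=HYPOTHETICAL_DOMAIN_SCORES.get,
    multimodal_critic=HYPOTHETICAL_PAIR_SCORES.get,
)
print(chosen)
```

With these made-up scores the second caption wins (0.9 × 0.8 against 0.3 × 0.7), because it both reads like a target-domain sentence and is judged a valid match for the image.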


Examples

Our method adapts the sentence style from the source domain to the target domain without needing paired image-sentence training data in the target domain. Here we show the captions before and after domain adaptation for CUB, TGIF and Flickr30k. For each image, the first caption is generated before adaptation and the second after adaptation.

MSCOCO → CUB-200-2011

A yellow and yellow bird is sitting on a branch.
This is a yellow bird with a black head and a small beak.

A bird flying through the air with a sky background.
A large bird with a long tail and a long beak.

A small bird sitting on a branch of a tree.
This is a black bird with a white belly and a small beak.

A bird is standing on a table with flowers.
A small bird with a white belly and a black head.


MSCOCO → TGIF

A cat is standing in a room with a cat.
A cat is playing with a toy in a room.

A baseball player is a ball on a field.
A group of men are playing soccer on a field.

A man in a black shirt and a tie.
A man in a suit is singing into a microphone.

A woman in a black shirt and a white shirt and a blue tie.
A woman is dancing with a crowd of people.

MSCOCO → Flickr30k

A man rowing a boat with a dog on it.
A man in a canoe in the water.

A person riding a horse on a dirt road.
A woman is riding a horse in a rodeo.

A little boy holding a snowboard in the snow.
A child in a red jacket is standing in the snow.

A woman in a tennis court holding a tennis racket.
A woman in a white dress is playing tennis.

Sentence Style Transfer

Here we show the sentences generated by four different models for each image, one caption per model.

A large air plane on a run way.
A large white and black airplane with a large beak.
A plane is flying over a field.
A large airplane is sitting on a runway.

A jet airplane flying through the sky with a cloud of smoke.
A small white and black plane flying through the air.
A jet is flying through the air on a clear day.
A jet airplane is flying in the air.

A man riding a skateboard up the side of a ramp.
A man riding a skateboard on a white ramp.
A man is doing a trick on a skateboard.
A man in a blue shirt is doing a trick on a skateboard.

A man is typing on a laptop computer.
A person with a black and white laptop and a black computer.
A man is sitting at a desk with a laptop.
A man is working on a computer.

A traffic light is seen in front of a large building.
A yellow traffic light with a yellow light.
A traffic light is hanging on a pole.
A street sign is lit up in the dark.

A black dog sitting on the ground next to a window.
A black and white dog with a black head.
A dog is looking at something in the mirror.
A black dog is looking out of the window.

A group of cupcakes with a blue and white frosting.
A couple of small pieces of cake.
A group of three cakes sitting on top of a table.
A group of cupcakes are sitting on a table.

A bird sitting on a rock in the water.
This is a white bird with a black head and a black beak.
A bird is standing on a rock in the water.
A black bird standing on a rock in the water.

Related Papers

  • Towards Diverse and Natural Image Descriptions via a Conditional GAN, by Bo Dai et al.

  • Recurrent Topic-Transition GAN for Visual Paragraph Generation, by Xiaodan Liang et al.

Contact: Tseng-Hung Chen