Semi-automated Annotation Models: Polygon RNN and Polygon RNN++

In this blog post, I introduce Polygon RNN and Polygon RNN++, which automate or semi-automate annotation for semantic segmentation.

To use machine learning, you need both data and annotations, unless you are taking an unsupervised approach. The annotations you need depend on your task, such as “classification”, “object detection”, or “semantic segmentation”.

In a classification task, the annotation is a name that indicates the image’s category (in this case “dog”) or a label, which is an integer value corresponding to the category. In an object detection task, an object (in this case, a dog) in the image is enclosed by a rectangle, and the (x, y) coordinates of its corners become the annotation. In a semantic segmentation task, an object (in this case, a dog) is enclosed by a polygon, and the annotation is the (x, y) coordinates of each vertex of the polygon surrounding the object. Of the three, annotations for semantic segmentation are the most expensive to create.
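To make the difference between the three annotation types concrete, here is a minimal sketch of how each could be stored. The class names, the helper `points_to_place`, and all coordinate values are made up for illustration; the rectangle is stored as two opposite corners, which is enough to recover all four.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class ClassificationAnnotation:
    label_name: str  # category name, e.g. "dog"
    label_id: int    # integer corresponding to the category


@dataclass
class DetectionAnnotation:
    label_name: str
    corners: Tuple[Point, Point]  # two opposite corners of the rectangle


@dataclass
class SegmentationAnnotation:
    label_name: str
    polygon: List[Point]  # one (x, y) per polygon vertex, in order


Annotation = Union[ClassificationAnnotation, DetectionAnnotation, SegmentationAnnotation]


def points_to_place(ann: Annotation) -> int:
    """Rough proxy for annotation cost: how many points must be clicked."""
    if isinstance(ann, SegmentationAnnotation):
        return len(ann.polygon)
    if isinstance(ann, DetectionAnnotation):
        return len(ann.corners)
    return 0  # a class label needs no clicks on the image


# Made-up example values for a single image of a dog.
cls_ann = ClassificationAnnotation("dog", 3)
det_ann = DetectionAnnotation("dog", ((40.0, 30.0), (180.0, 200.0)))
seg_ann = SegmentationAnnotation(
    "dog", [(60.0, 30.0), (150.0, 45.0), (180.0, 140.0), (120.0, 200.0), (40.0, 150.0)]
)
```

Real polygons usually have far more than five vertices, which is why the segmentation case dominates the annotation cost.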

In this article, I introduce Polygon RNN and Polygon RNN++, which support automatic / semi-automatic annotation for semantic segmentation.

Polygon RNN

Polygon RNN was proposed by Lluis Castrejon et al. in 2017. Polygon RNN …

  1. helps with annotation by semi-automating region segmentation annotation
  2. treats semantic segmentation as a problem of “searching for object vertices”
  3. extracts features with a CNN and predicts new vertices with an RNN, treating the vertices already found as a time series
  4. can incorporate human corrections

Architecture of Polygon RNN

Overview of Polygon RNN

The first key point of Polygon RNN is that image features are extracted with VGG. It uses not only the feature map of the final layer but also feature maps at other resolution levels (red frame in the figure below).

Feature extractor based on VGG
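The idea of combining feature maps from several resolution levels can be sketched in a few lines. This is only an illustration under my own assumptions: single-channel maps stored as nested lists, average pooling and nearest-neighbour upsampling standing in for the learned layers of the real network, and the function names `avg_pool_2x2`, `upsample_2x`, and `fuse` invented for this example.

```python
from typing import List

Grid = List[List[float]]  # one channel of a feature map, indexed [row][col]


def avg_pool_2x2(fmap: Grid) -> Grid:
    """Halve the resolution by averaging each 2x2 block (even sizes assumed)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def upsample_2x(fmap: Grid) -> Grid:
    """Double the resolution by repeating each value (nearest neighbour)."""
    out: Grid = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(fine: Grid, mid: Grid, coarse: Grid) -> List[Grid]:
    """Bring three levels to the resolution of `mid` and stack them as channels."""
    return [avg_pool_2x2(fine), mid, upsample_2x(coarse)]


# Toy maps: 8x8 (fine), 4x4 (mid), 2x2 (coarse).
fine = [[1.0] * 8 for _ in range(8)]
mid = [[2.0] * 4 for _ in range(4)]
coarse = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse(fine, mid, coarse)  # three channels, each 4x4
```

After fusion, every spatial location carries both fine detail (edges, corners) and coarse context (the object as a whole), which is what the vertex predictor needs.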

The second is that new vertices are predicted with an RNN, treating the vertices already found as a time series. For example, to predict the rightmost vertex of the dog in the lower row of the figure below, the vertices at t-2 and t-1 are fed into the RNN, which outputs the vertex at t.
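The sequential prediction can be pictured as a simple decoding loop. The sketch below is not the actual model: `step` is a hypothetical stand-in for the CNN + RNN that takes the two previous vertices and returns the next one, or `None` when the polygon is closed, and `toy_step` just walks around a fixed square so that the loop can be run.

```python
from typing import Callable, List, Optional, Tuple

Vertex = Tuple[int, int]  # (row, col) on the output grid

# Hypothetical interface: (vertex at t-2, vertex at t-1) -> vertex at t,
# or None to signal that the polygon is complete.
StepFn = Callable[[Vertex, Vertex], Optional[Vertex]]


def decode_polygon(first: Vertex, second: Vertex, step: StepFn,
                   max_len: int = 50) -> List[Vertex]:
    """Predict vertices one at a time, feeding back the last two predictions."""
    polygon = [first, second]
    while len(polygon) < max_len:
        nxt = step(polygon[-2], polygon[-1])
        if nxt is None:  # the model says the polygon is closed
            break
        polygon.append(nxt)
    return polygon


# Toy stand-in for the network: walk around the corners of a square.
SQUARE = [(5, 5), (5, 20), (20, 20), (20, 5)]


def toy_step(prev2: Vertex, prev1: Vertex) -> Optional[Vertex]:
    nxt = SQUARE[(SQUARE.index(prev1) + 1) % len(SQUARE)]
    return None if nxt == SQUARE[0] else nxt


polygon = decode_polygon(SQUARE[0], SQUARE[1], toy_step)
```

A human correction fits naturally into this loop: replacing one entry of `polygon` changes the inputs to every later call of `step`.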

Results of Polygon RNN

The left side of the figure below is the ground truth, and the right side is the prediction. Although it is a little rough, the region is predicted well.

More precise annotations can be made with human corrections. The right side of the figure below shows the result of a person correcting some of the predicted vertices. You can see that the side mirror (yellow circle) is a little more precise.

The quantitative evaluation results are shown in the table below. T = 1 to 4 at the bottom of the table indicates the level of human correction: when the distance between the ground truth and a predicted vertex is greater than the distance threshold T, a human corrects the prediction. With “Ours (Automatic)”, which involves no human intervention, IoU (the degree of overlap between the ground truth and the predicted region) is 73.3%. At T = 4, which involves the least human intervention (the distance threshold is large, so there are few corrections), IoU improves to 82.2%, and annotation is 7.31 times faster than fully manual annotation. Even at T = 1, which involves the most frequent human intervention, annotation is 3.61 times faster, and IoU reaches its highest value of 87.7%.
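For readers who have not met IoU before, here is a minimal sketch of the computation. To keep it short, regions are represented as sets of pixels and the shapes are rectangles rather than polygons; `rectangle_pixels` and the example sizes are my own choices for illustration.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def rectangle_pixels(top: int, left: int, bottom: int, right: int) -> Set[Pixel]:
    """Pixels covered by a rectangle; `bottom` and `right` are exclusive."""
    return {(r, c) for r in range(top, bottom) for c in range(left, right)}


def iou(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Intersection over Union: shared area divided by combined area."""
    union = pred | truth
    if not union:
        return 0.0
    return len(pred & truth) / len(union)


truth = rectangle_pixels(0, 0, 10, 10)  # 100 pixels
pred = rectangle_pixels(0, 2, 10, 12)   # same size, shifted 2 pixels to the right
score = iou(pred, truth)                # 80 shared pixels / 120 pixels in total
```

A shift of only two pixels on a 10x10 object already drops IoU to about 0.67, which gives a feel for how demanding the values in the table are.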

Problems of Polygon RNN

Polygon RNN looks pretty good at first glance, but has some problems.

First, Polygon RNN treats predicting the position of a new vertex as a classification problem. With this setup, even if a predicted vertex lies on the outline of the object, it is penalized whenever it is not the annotated vertex. In other words, the training objective is not directly correlated with IoU.
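A small numerical example shows why the classification loss is a poor proxy for IoU. The sketch assumes a plain one-hot target over a 28x28 grid and a prediction that puts most of its probability on a single cell; the helper names and the probability value 0.9 are made up for illustration.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col) on the output grid


def cross_entropy(pred: Dict[Cell, float], target: Cell) -> float:
    """Cross-entropy against a one-hot target: -log p(target)."""
    return -math.log(pred[target])


def peaked_distribution(peak: Cell, grid: int = 28,
                        peak_prob: float = 0.9) -> Dict[Cell, float]:
    """Put `peak_prob` on one cell and spread the rest uniformly."""
    rest = (1.0 - peak_prob) / (grid * grid - 1)
    dist = {(r, c): rest for r in range(grid) for c in range(grid)}
    dist[peak] = peak_prob
    return dist


target = (10, 10)                          # the annotated vertex
near_miss = peaked_distribution((10, 11))  # one cell away from the annotation
far_miss = peaked_distribution((25, 3))    # nowhere near the annotation

loss_near = cross_entropy(near_miss, target)
loss_far = cross_entropy(far_miss, target)
loss_hit = cross_entropy(peaked_distribution(target), target)
```

Here `loss_near` and `loss_far` are identical, even though the near miss would barely change the resulting region and the far miss would ruin it.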

Second, the resolution of the vertices is coarse because it depends on the feature map resolution (in this case, 28x28).
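To get a feel for how coarse a 28x28 grid is, the following sketch snaps a vertex to the grid and maps it back to pixels. The 224x224 input size is an assumption for this example, and `to_grid` / `from_grid` are hypothetical helper names.

```python
from typing import Tuple


def to_grid(x: float, y: float, image_size: int = 224,
            grid: int = 28) -> Tuple[int, int]:
    """Snap a pixel coordinate to the index of the grid cell that contains it."""
    cell = image_size / grid
    return (min(int(x // cell), grid - 1), min(int(y // cell), grid - 1))


def from_grid(gx: int, gy: int, image_size: int = 224,
              grid: int = 28) -> Tuple[float, float]:
    """Map a grid cell back to the pixel coordinate of its centre."""
    cell = image_size / grid
    return ((gx + 0.5) * cell, (gy + 0.5) * cell)


true_vertex = (101.0, 57.0)
cell_index = to_grid(*true_vertex)      # (12, 7)
recovered = from_grid(*cell_index)      # (100.0, 60.0)
error = (abs(recovered[0] - true_vertex[0]), abs(recovered[1] - true_vertex[1]))
```

Under these assumptions each cell covers 8x8 pixels, so a vertex can be off by up to 4 pixels along each axis no matter how good the model is.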

Polygon RNN++, which is introduced below, addresses these issues.

Polygon RNN++

Polygon RNN++ is an improved version of Polygon RNN, proposed by David Acuna et al. in 2018. Polygon RNN++ …

  • maximizes IoU as the reward by using reinforcement learning
  • predicts vertices at high resolution by using Graph Neural Networks

Maximizing IOU by using reinforcement learning

After the network has been trained on the vertex classification problem, as in Polygon RNN, reinforcement learning starts from those trained weights as the initial values.

They treat the network, with its parameters, as the policy and maximize IoU through reinforcement learning. To maximize IoU, they consider a loss function that is the IoU with its sign flipped.

However, since IoU is not differentiable, the expected value of the gradient is computed using the REINFORCE trick (Williams (1992)). Here r is the reward (IoU), and p_θ is the policy (parameterized by the network parameters θ).
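Written out, the loss and its REINFORCE gradient take the following standard form. This is my reconstruction from the description in the text, not a copy of the paper's equations; the symbol v^s for a polygon sampled from the policy is an assumption of this sketch.

```latex
% Loss: the expected reward (IoU) with its sign flipped
L(\theta) = -\,\mathbb{E}_{v^{s} \sim p_{\theta}}\left[\, r(v^{s}) \,\right]

% REINFORCE: the gradient becomes an expectation that can be estimated by sampling
\nabla_{\theta} L(\theta)
  = -\,\mathbb{E}_{v^{s} \sim p_{\theta}}\left[\, r(v^{s})\, \nabla_{\theta} \log p_{\theta}(v^{s}) \,\right]
```

The key point is that r only appears as a scalar weight, so it never has to be differentiated.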

This works in principle, but training is known to be unstable. Therefore, they use the self-critical method (Rennie et al. (2017)), which subtracts a baseline from the reward. The baseline is the reward obtained by the model's own greedy (test-time) decoding: if the sampled polygon scores higher than the baseline, the difference is positive, and if there is no improvement, it is zero or negative. This stabilizes training.
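Here is a toy sketch of one self-critical update, under strong simplifying assumptions: the policy is a single softmax over a handful of whole candidate polygons (the real model predicts vertex by vertex), the rewards are made-up IoU values, and the baseline is taken to be the reward of the greedy (most probable) candidate, as in self-critical sequence training. The function names are my own.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_critical_step(logits: List[float], rewards: List[float],
                       sampled: int, lr: float = 1.0) -> List[float]:
    """One REINFORCE update with the greedy action's reward as the baseline."""
    probs = softmax(logits)
    greedy = max(range(len(probs)), key=lambda i: probs[i])  # baseline action
    advantage = rewards[sampled] - rewards[greedy]           # r(sample) - r(greedy)
    # For a softmax policy, d log p(sampled) / d logit_i = 1[i == sampled] - p_i.
    grad_log_p = [(1.0 if i == sampled else 0.0) - p for i, p in enumerate(probs)]
    # Ascending the reward is the same as descending the sign-flipped loss.
    return [x + lr * advantage * g for x, g in zip(logits, grad_log_p)]


logits = [0.0, 0.0, 0.0]   # three candidate polygons, equally likely at first
rewards = [0.5, 0.8, 0.6]  # made-up IoU of each candidate

# Sampling candidate 1 beats the greedy candidate 0, so its logit goes up.
improved = self_critical_step(logits, rewards, sampled=1)
# Sampling the greedy candidate itself gives zero advantage: nothing changes.
unchanged = self_critical_step(logits, rewards, sampled=0)
```

Because the update is scaled by the difference from the baseline rather than by the raw IoU, samples that are merely average produce almost no gradient, which reduces the variance of training.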

Evaluator Network

The Evaluator Network predicts the IoU from three inputs: the output of the CNN, the hidden state of the RNN, and the predicted polygon (object region).

At inference, a predicted object region is computed for each of several initial vertex candidates, and this network estimates the IoU of each. The best polygon (and initial vertex) is chosen by selecting the candidate with the largest estimated IoU.

The Evaluator Network is trained after RL training has converged, and it is used to select among the initial vertex candidates during inference. Note that this network is not used when training the Encoder / Decoder with RL.
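The selection step at inference can be sketched as follows. The real Evaluator Network is a learned model that also looks at the CNN features and the RNN state; here it is replaced by `stub_evaluator`, a hypothetical function that looks up made-up scores by the polygon's first vertex.

```python
from typing import Callable, List, Sequence, Tuple

Polygon = List[Tuple[int, int]]


def select_best(candidates: Sequence[Polygon],
                predicted_iou: Callable[[Polygon], float]) -> Tuple[Polygon, float]:
    """Score every candidate polygon and keep the one with the highest score."""
    scores = [predicted_iou(p) for p in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Three candidate polygons decoded from three different initial vertices.
candidates = [
    [(2, 3), (2, 20), (18, 20), (18, 3)],
    [(2, 4), (2, 21), (19, 21), (19, 4)],
    [(3, 3), (3, 19), (17, 19), (17, 3)],
]

# Made-up scores standing in for the Evaluator Network's IoU predictions.
FAKE_SCORES = {(2, 3): 0.71, (2, 4): 0.84, (3, 3): 0.78}


def stub_evaluator(polygon: Polygon) -> float:
    return FAKE_SCORES[polygon[0]]


best_polygon, best_score = select_best(candidates, stub_evaluator)
```

The ground truth is not available at inference, so a predicted IoU is the only way to rank the candidates.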

Gated Graph Neural Networks

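The polygon predicted by the RNN, with a midpoint added between each pair of adjacent vertices, is fed into a Gated Graph Neural Network (Li et al. (2015)), a kind of Graph Neural Network. The refinement is solved as a classification problem: for each vertex, in which direction it should move.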
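The graph that goes into the Gated Graph Neural Network can be built in a few lines. This sketch covers only the input preparation (adding midpoints and connecting neighbouring nodes), not the network itself; connecting only direct neighbours is a simplifying assumption, and the function names are my own.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def add_midpoints(polygon: List[Point]) -> List[Point]:
    """Insert the midpoint of every edge, including the closing edge."""
    out: List[Point] = []
    n = len(polygon)
    for i, (x0, y0) in enumerate(polygon):
        x1, y1 = polygon[(i + 1) % n]
        out.append((x0, y0))
        out.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
    return out


def cycle_edges(num_nodes: int) -> List[Tuple[int, int]]:
    """Connect each node to its successor around the polygon."""
    return [(i, (i + 1) % num_nodes) for i in range(num_nodes)]


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
nodes = add_midpoints(square)     # 8 nodes: 4 corners + 4 midpoints
edges = cycle_edges(len(nodes))   # 8 edges forming a closed loop
```

The extra midpoints give the network more nodes to move, so a straight edge predicted by the RNN can bend to follow a curved outline.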

ResNet Encoder

The Encoder has been changed from VGG to ResNet to get better quality features.

Results of Polygon RNN++

Compared with the Polygon RNN, the resolution of the output vertices is improved.

The effectiveness of RL, the Evaluator Network, and the Gated Graph Neural Network can also be seen.


In this blog post, I introduced Polygon RNN and Polygon RNN++, which automate / semi-automate annotation for semantic segmentation. In both cases, annotation is treated as a problem of outputting vertices, and the vertices are output sequentially as a time series.
Since creating annotations is very important for expanding the use of machine learning, I think such research has great significance.

Twitter account : @AkiraTOSEI


  • L. Castrejon et al. Annotating Object Instances with a Polygon-RNN. 2017.
  • D. Acuna et al. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. 2018.
  • R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. 1992.
  • S. J. Rennie et al. Self-Critical Sequence Training for Image Captioning. 2017.