OCHuman Dataset: A dataset focused on heavily occluded humans

This dataset focuses on heavily occluded humans and provides comprehensive annotations, including bounding boxes, human poses, and instance masks. It contains 13,360 elaborately annotated human instances within 5,081 images. With an average MaxIoU of 0.573 per person (a person's MaxIoU being the largest IoU between that person's bounding box and any other person's bounding box in the same image), OCHuman is the most complex and challenging human-related dataset. Through this dataset, we want to emphasize occlusion as a challenging problem for researchers to study.
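To make the MaxIoU statistic concrete, here is a minimal sketch that computes each person's MaxIoU and the dataset average from bounding boxes alone. It assumes boxes are given as (x1, y1, x2, y2) pixel coordinates and that a person alone in an image gets a MaxIoU of 0. Both conventions are assumptions for illustration, not the official OCHuman evaluation code.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def max_iou_per_person(boxes: List[Box]) -> List[float]:
    """For each person, the largest IoU with any *other* person in the same image.

    A person with no other people in the image gets 0.0 (an assumed convention).
    """
    return [
        max((iou(b, other) for j, other in enumerate(boxes) if j != i), default=0.0)
        for i, b in enumerate(boxes)
    ]


def average_max_iou(images: List[List[Box]]) -> float:
    """Mean MaxIoU over every person instance across all images."""
    scores = [s for boxes in images for s in max_iou_per_person(boxes)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, two people whose boxes (0, 0, 10, 10) and (5, 0, 15, 10) overlap by half a width each get a MaxIoU of 1/3, while a third person far away gets 0, so the image averages 2/9.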

@InProceedings {zhang2018pose2seg,
  title={Pose2Seg: Detection Free Human Instance Segmentation},
  author={Zhang, Song-Hai and Li, Ruilong and Dong, Xin and Rosin, Paul and Cai, Zixi and Han, Xi and Yang, Dingcheng and Huang, Hao-Zhi and Hu, Shi-Min},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},

Tsinghua-Tencent Traffic Light

Tsinghua-Tencent Traffic Light: Traffic signal detection and classification in street views using an attention model

Detecting small objects is a challenging task. We focus on a special case: the detection and classification of traffic signals in street views. We present a novel framework that uses a visual attention model to make detection more efficient without loss of accuracy, and that generalizes to other datasets. The attention model is designed to generate a small set of candidate regions at a suitable scale, so that small targets can be better located and classified. To evaluate our method in the context of traffic signal detection, we have built a traffic light benchmark with over 15,000 traffic light instances, based on Tencent street view panoramas. We have tested our method both on this new dataset and on the Tsinghua-Tencent 100K (TT100K) traffic sign benchmark. Experiments show that, on both datasets, our method achieves better detection performance than the general-purpose Faster R-CNN object detection framework while also running faster. It is competitive with state-of-the-art specialist traffic sign detectors on TT100K, while being an order of magnitude faster. To show generality, we tested it on the LISA dataset without tuning and obtained an average precision in excess of 90%.
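Results above are reported as average precision (AP). The exact evaluation protocol is not specified here, so the following is only a generic sketch of all-point interpolated AP (the PASCAL VOC 2010+ style). It assumes each detection has already been matched to ground truth, for example by an IoU threshold, and reduced to a (confidence, is_true_positive) pair.

```python
from typing import List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """All-point interpolated AP from scored, pre-matched detections.

    detections: (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects (missed objects lower recall).
    """
    if num_gt == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

With two ground-truth objects and detections ranked TP, FP, TP, the interpolated precision is 1 up to recall 0.5 and 2/3 up to recall 1, giving AP = 0.5 + 1/3 = 5/6.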

  author="Lu, Yifan and Lu, Jiaming and Zhang, Songhai and Hall, Peter",
  title="Traffic signal detection and classification in street views using an attention model",
  journal="Computational Visual Media", 


Human10: A real-world human body dataset

Human10 is a real-world human body dataset containing 10 4D sequences of human actions, with 10,054 frames in total. Each frame provides depth maps and masks from 4 views, along with the corresponding triangle meshes. In addition, calibration parameters for the 4 fixed cameras are provided with each sequence.
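Because each frame comes with depth maps, masks, and camera calibration, a typical first step is to lift one view's masked depth pixels into a shared 3D frame. The sketch below assumes a standard pinhole model; the parameter names (fx, fy, cx, cy, R, t), the world-to-camera convention X_cam = R X_world + t, and the rule that a depth of 0 means "no measurement" are all assumptions, not the documented Human10 file format.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> List[Point]:
    """Lift foreground depth pixels of one view to world-space points.

    Assumes depth[v][u] is z-depth in camera coordinates (0 = missing),
    mask[v][u] is nonzero on the person, and X_cam = R @ X_world + t.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0 or not mask[v][u]:
                continue  # skip background pixels and missing depth
            # Pinhole back-projection of pixel (u, v) at depth z.
            xc = (u - cx) * z / fx
            yc = (v - cy) * z / fy
            d = (xc - t[0], yc - t[1], z - t[2])
            # X_world = R^T (X_cam - t); row i of R^T is column i of R.
            w = [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]
            points.append((w[0], w[1], w[2]))
    return points
```

Running this for each of the 4 views with its own calibration yields point sets in the same world frame, which can then be compared against the provided triangle meshes.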

  author = {Cao, Yan-Pei and Liu, Zheng-Ning and Kuang, Zheng-Fei and Kobbelt, Leif and Hu, Shi-Min},
  title = {Learning to Reconstruct High-quality 3D Shapes with Cascaded Fully Convolutional Networks},
  booktitle = {The European Conference on Computer Vision (ECCV)},
  month = {September},
  year = {2018}


CTW Dataset: Chinese Text in the Wild

We provide a newly created dataset of Chinese text with about 1 million Chinese characters, annotated by experts, in over 30,000 street view images. This is a challenging dataset with good diversity: it contains planar text, raised text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, and more. For each character in the dataset, the annotation includes the underlying character, its bounding box, and 6 attributes. The attributes indicate, for example, whether the character has a complex background, whether it is raised, and whether it is handwritten or printed.
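The per-character annotation described above (character, bounding box, attributes) maps naturally onto a small record type, which makes attribute-based subsets easy to build. This is a hypothetical schema for illustration only: the class, its field names, the (x, y, width, height) box layout, and tag strings such as "complex_background" are assumptions and do not reproduce the official CTW release format. Only three of the six attributes are named above, so the example uses just those.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class CharAnnotation:
    """One annotated character (hypothetical schema, not the official CTW format)."""
    text: str                                 # the underlying character
    bbox: Tuple[float, float, float, float]   # assumed (x, y, width, height) in pixels
    attributes: FrozenSet[str] = frozenset()  # hypothetical attribute tags that apply


def with_attributes(chars: List[CharAnnotation], *required: str) -> List[CharAnnotation]:
    """Keep only characters that carry every requested attribute tag."""
    need = set(required)
    return [c for c in chars if need <= c.attributes]
```

For instance, `with_attributes(chars, "raised", "complex_background")` would select raised characters on cluttered backgrounds, one of the harder slices of a diverse dataset like this.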

Tsinghua-Tencent 100k

Tsinghua-Tencent 100k: Traffic-Sign Detection and Classification in the Wild

Although promising results have been achieved in the areas of traffic-sign detection and classification, few works have provided simultaneous solutions to these two tasks for real-world images. We make two contributions to this problem. First, we have created a large traffic-sign benchmark from 100,000 Tencent Street View panoramas, going beyond previous benchmarks. We call this benchmark Tsinghua-Tencent 100K. It provides 100,000 images containing 30,000 traffic-sign instances. These images cover large variations in illuminance and weather conditions. Each traffic sign in the benchmark is annotated with a class label, its bounding box, and a pixel mask. Second, we demonstrate how a robust end-to-end convolutional neural network (CNN) can simultaneously detect and classify traffic signs. Most previous CNN image processing solutions target objects that occupy a large proportion of an image, and such networks do not work well for target objects that occupy only a small fraction of an image, like the traffic signs here. Experimental results show the robustness of our network and its superiority to alternatives. The benchmark, source code, and CNN model introduced in this paper are publicly available.
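How small "a small fraction of an image" is can be shown with simple arithmetic. The sketch below assumes, for illustration only, a 2048×2048 image and roughly COCO-style area thresholds (32² and 96² pixels) for small, medium, and large objects; neither the resolution nor the thresholds are stated above.

```python
def area_fraction(box_w: float, box_h: float, img_w: int, img_h: int) -> float:
    """Fraction of the image area covered by an axis-aligned bounding box."""
    return (box_w * box_h) / (img_w * img_h)


def size_bucket(box_w: float, box_h: float) -> str:
    """Roughly COCO-style size bucket by box area (assumed thresholds)."""
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Under these assumptions, a 32×32 sign covers 1024 / 4,194,304 = 2^-12 of the image, about 0.024% of its pixels, which is why detectors tuned for large objects struggle here.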

  author = {Zhu, Zhe and Liang, Dun and Zhang, Songhai and Huang, Xiaolei and Li, Baoli and Hu, Shimin}, 
  title = {Traffic-Sign Detection and Classification in the Wild}, 
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2016}