We placed 5th in our first-ever BigData Cup Challenge. Here’s what we learned.
As a fun exercise, the CurioCat team competed in our first-ever IEEE BigData Cup Challenge, entitled “Building Extraction”. The premise of this competition was to design a neural network to extract building footprints. Specifically, given a satellite image, the neural network must do three things: (a) detect individual buildings, (b) fit each building’s footprint to a polygon, and (c) list the vertices of every polygon. Below is a preview of a test image and our answer. Looks good so far!
What is Instance Segmentation?
In Computer Vision (CV), this task is known as instance segmentation. “Instance” indicates that each building is to be segmented individually. Semantic segmentation, on the other hand, treats objects of the same class as one entity, meaning that a cluster of buildings is segmented as a giant blob.
The best way to understand the task is to look at the training data. Our training data consisted of 9000 image pairs: the input is a satellite image, and the output is the corresponding mask of building footprints. Below, we show one example pair. On the right is a satellite image of a Chicago neighborhood. On the left is the mask, showing three classes: background (purple), building interior (green), and building outline (yellow). Eagle-eyed readers will note that the mask has some noticeable errors! This is because each mask was segmented by hand!
Creating a Neural Network for Image Mapping
So, our task boils down to creating a neural network that can “map” the input image to the output image. This task is challenging because the test images consist of buildings in a variety of settings: (1) rural, (2) suburban, (3) urban, (4) commercial, (5) cityscape, (6) industrial, and (7) seaport. To be effective, a neural network must “generalize” well over all potential settings.
To do this, we created a U-net, an encoder-decoder architecture specifically designed for image-to-image mapping, with a Resnet50 backbone serving as its encoder. After our neural network predicts the output mask, we can easily detect and extract building footprints using OpenCV’s cv2.findContours function. Below are some more of our test results for different settings.
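To make the extraction step concrete, here is a minimal sketch of the predict-then-vectorize pipeline. A few assumptions: the model construction uses the segmentation_models Keras library as a stand-in (this post doesn’t show our exact training code), class 1 is assumed to be the building-interior channel, and the polygon-simplification tolerance eps_frac is an illustrative default.

```python
import cv2
import numpy as np
import segmentation_models as sm

# U-net decoder on a Resnet50 encoder, predicting 3 classes per pixel:
# background, building interior, and building outline.
model = sm.Unet("resnet50", classes=3, activation="softmax")

def extract_footprints(image, eps_frac=0.01):
    """Predict a mask, then return one (N, 2) vertex array per building.

    `image` is assumed to be already preprocessed for the backbone.
    """
    probs = model.predict(image[np.newaxis, ...])[0]   # (H, W, 3) softmax
    labels = probs.argmax(axis=-1)                     # per-pixel class id
    interior = (labels == 1).astype(np.uint8) * 255    # class 1 = interior

    # OpenCV 4.x returns (contours, hierarchy); RETR_EXTERNAL keeps only
    # the outer boundary of each connected building.
    contours, _ = cv2.findContours(interior, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    polygons = []
    for contour in contours:
        # Simplify each contour to a polygon; the tolerance is a small
        # fraction of the contour's perimeter (an illustrative default).
        eps = eps_frac * cv2.arcLength(contour, True)
        poly = cv2.approxPolyDP(contour, eps, True)
        polygons.append(poly.reshape(-1, 2))           # list of vertices
    return polygons
```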
Why Did Our AI Hit a Ceiling?
So, how did our neural network perform overall? It excelled in rural, suburban, and urban settings, but it was relatively weak in commercial, cityscape, industrial, and seaport areas. Why is this the case? Where did we go wrong?
Luckily, the answer is simple. Remember, a neural network can only go as far as its training data allows! So, if you want your neural network to learn seaports, you need to show it many seaports. Alas, due to time constraints, the training set we built was limited to certain types of areas. Specifically, we trained our neural network on the Tyrol (Austria), Kitsap (WA), and Chicago (IL) regions of the INRIA dataset, which consist almost exclusively of rural, suburban, and urban areas. Therefore, it makes sense that our neural network would perform best in those settings.
Lessons Learned
If we could do this BigData competition over, here is what we would change:
- We would have added more commercial, cityscape, industrial, and seaport areas to our training set.
- We would have trained our neural network for more than 3 epochs; comparable segmentation models are typically trained for tens to hundreds of epochs.
- Our neural network was fooled by objects that look like buildings, including shipping containers, satellite dishes, grain silos, storage tanks, swimming pools, large air conditioning units, and even tennis courts. With this knowledge, we would have added images of these objects as negative examples, teaching our neural network what not to segment.
- Our neural network was fooled by shadows. In images where the sun was low, pitched rooftops showed two different shades: the side facing the sun was lighter, and the side facing away was darker. Instead of segmenting one building, the neural network mistakenly “saw” two. So, our training set should have included satellite images captured at different times of day.
- Our training dataset consisted of moderate-resolution images, but the testing set contained some grainy images. To strengthen our neural network’s ability to generalize, we could have injected Gaussian noise into our training set (see the augmentation sketch after this list).
- Our training images were more zoomed-in than our testing images, so downsampling the training images to a comparable ground resolution may have helped (also covered in the sketch below).
- For our multiclass classification, we gave our three classes, i.e., background, interior, and outline, weights of 0.25, 0.25, and 0.50, respectively. However, given that there are far fewer outline pixels, perhaps we should have given the outline an even greater weight (a weighted-loss sketch also follows this list).
- We would have experimented with different architectures. Instead of a U-net with a Resnet50 backbone, we could have tried DeepLab, Mask R-CNN, or YOLO; Mask R-CNN in particular is purpose-built for instance segmentation.
- Finally, we could have tried a new technology like vision transformers.
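Here is the augmentation sketch referenced in the list above. It is a minimal NumPy/OpenCV illustration; noise_sigma and downscale are made-up defaults that, in practice, would be randomized per sample and tuned on a validation set.

```python
import cv2
import numpy as np

def degrade(image, noise_sigma=10.0, downscale=0.5):
    """Simulate grainy, coarser-resolution test imagery."""
    # Inject Gaussian noise to mimic sensor grain.
    noisy = image.astype(np.float32) + np.random.normal(0.0, noise_sigma,
                                                        image.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)

    # Downsample, then upsample back, to mimic a coarser (more
    # zoomed-out) ground resolution at the original image size.
    h, w = noisy.shape[:2]
    small = cv2.resize(noisy, (int(w * downscale), int(h * downscale)),
                       interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
```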
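And here is the weighted-loss sketch. This post doesn’t show our actual loss function, so the following is one plausible Keras/TensorFlow implementation of per-pixel weighted categorical cross-entropy, using the 0.25/0.25/0.50 weights mentioned above:

```python
import tensorflow as tf

def weighted_categorical_crossentropy(class_weights):
    """Cross-entropy where each class contributes per its weight.

    class_weights: e.g. [0.25, 0.25, 0.50] for background, interior,
    and outline. Raising the outline weight penalizes missed outline
    pixels more heavily, countering their scarcity.
    """
    w = tf.constant(class_weights, dtype=tf.float32)

    def loss(y_true, y_pred):
        # y_true: one-hot masks (batch, H, W, 3); y_pred: softmax output.
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
        per_pixel = -tf.reduce_sum(y_true * tf.math.log(y_pred) * w, axis=-1)
        return tf.reduce_mean(per_pixel)

    return loss

# Usage (illustrative):
# model.compile(optimizer="adam",
#               loss=weighted_categorical_crossentropy([0.25, 0.25, 0.50]))
```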