Visual Discovery at Pinterest

Presented on May 27, 2017
Presenter: Davina

Preview

Davina will give us a look into Pinterest's Visual Discovery system, how it compares to their previous work, and how it has improved search and recommendation performance for their service.

Summary

Key Points From the Paper:

- Pinterest Flashlight: object recognition in photos as a means to browse.
- Pinterest Lens: catalogue items within images as well as the images themselves.
- Large amounts of data and a lot of metadata due to the large user base.
- Note the related work on domain-specific image retrieval; the focus here is on interactive retrieval.
- Representation learning: convnet architectures pave the way for large-scale image classification and can be applied to related tasks such as object detection and semantic segmentation.
- Deep learning is the hot way to do detection. Pinterest claims to be the first to build an end-to-end detection system for a large-scale visual discovery service.
- Use binarized representations of features, compared with L2 and L1 distances. Not only do they have a small memory footprint, they are also effective at separating noise from images while maintaining label clusters, which makes retrieval better. (See the retrieval sketch after this list.)
- Note that these are tested on Caffe and multi-GPU machines, not on actual mobile devices (which will use TensorFlow).
- Modify the classifier to gear it toward Pinterest images rather than ImageNet images (high-quality stock vs. personal photos), but start from similar weights.
- Training: millions of images, 20k classes. Testing: 100k images, same classes. Randomly sample the 20k classes from the top 100k text search queries (where the query becomes the label), filter all images for these 20k classes, and then randomly sample from them for training and testing.
- Faster R-CNN is fast and precise, but it relies on aggressive caching and is costly on GPU.
- Speedup is important, so they also try SSD, which is much faster than Faster R-CNN. The VGG architecture was similar, so they mainly changed the threshold (to a tighter bound), among other things. The OHEM method was good for large and difficult datasets but led to overfitting on smaller ones.
- In comparison, SSD's overall precision was better and its recall slightly worse, but its latency was much lower. Since SSD is smaller, simpler, and has much lower latency, it is preferred.
- To increase engagement: re-rank using VGG fc8 (visual similarity) on top of the features used in the control (a linear model with existing features). They saw a considerable bump in engagement, especially in the visual categories.
- Increase engagement even more by capturing the dependency between category and the usefulness of visual features in the model: create 32 new features (one for each category). (See the category-feature sketch after this list.)
- Users can be interested in certain objects within an image. If we can detect objects, we can compute more targeted visual similarity features. They describe what a dominant visual object is and try a few experiments: if a dominant object is detected in the query, compute visual similarity (VGG fc6) on just the object; additionally change the ranking model by increasing the weight on visual similarity 5x (if a dominant object is present, visual similarity is more important); or keep the same features but still increase the weight on visual similarity 5x.
- The first two aren't so great, probably because the tight bounding boxes don't provide enough context for the image representation models. The last one was the best: just the presence of a dominant visual object means that visual similarity should be weighted more heavily. (See the weight-boost sketch after this list.)
- Flashlight is given a proposal: either a detected object or a user crop (the crop makes up for poor object coverage, so any object can be searched). The search then uses the retrieval features. They knew that finding similar objects would increase engagement.
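Below is a minimal retrieval sketch (not code from the paper or the talk) illustrating the binarized-feature idea: threshold a float embedding into a binary code, then rank candidates by Hamming distance, which coincides with L1 distance on binary codes. The zero threshold, the 4096-d fc-style size, and the random data are assumptions for illustration only.

```python
import numpy as np

def binarize(embedding, threshold=0.0):
    """Binarize a float feature vector by thresholding (assumed scheme)."""
    return (embedding > threshold).astype(np.uint8)

def hamming_distance(a, b):
    """L1 distance between binary codes equals the Hamming distance."""
    return int(np.sum(a != b))

def rank_by_binary_similarity(query_emb, candidate_embs):
    """Return candidate indices sorted by binary distance to the query."""
    q = binarize(query_emb)
    dists = [hamming_distance(q, binarize(c)) for c in candidate_embs]
    return np.argsort(dists)

# Toy usage: 4096-d fc6-style embeddings, purely illustrative.
rng = np.random.default_rng(0)
query = rng.standard_normal(4096)
candidates = rng.standard_normal((1000, 4096))
print(rank_by_binary_similarity(query, candidates)[:10])
```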
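One plausible reading of the "32 new features, one per category" point is a category crossing: the single visual-similarity score is placed into the slot for the pin's category, so the linear ranker can learn a per-category weight for it. The feature layout below is an assumption; only the count of 32 categories comes from the notes.

```python
import numpy as np

NUM_CATEGORIES = 32  # from the notes: one new feature per category

def category_crossed_features(visual_similarity, category_id):
    """Put the visual-similarity score into the slot for the pin's category,
    leaving all other category slots at zero (assumed feature layout)."""
    feats = np.zeros(NUM_CATEGORIES)
    feats[category_id] = visual_similarity
    return feats

def score(existing_features, visual_similarity, category_id, weights):
    """Linear ranking model over existing features plus the 32 crossed features."""
    x = np.concatenate([existing_features,
                        category_crossed_features(visual_similarity, category_id)])
    return float(np.dot(weights, x))

# Toy usage: 8 existing features + 32 crossed features, illustrative only.
rng = np.random.default_rng(1)
weights = rng.standard_normal(8 + NUM_CATEGORIES)
print(score(rng.standard_normal(8), visual_similarity=0.7, category_id=5, weights=weights))
```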
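The best-performing variant keeps the features unchanged and simply boosts the visual-similarity weight 5x whenever a dominant object is detected in the query. A sketch under assumed names (the `visual_similarity` feature key and the detector flag are mine, not the paper's) could look like this:

```python
def rerank_score(features, weights, dominant_object_detected,
                 visual_sim_key="visual_similarity", boost=5.0):
    """Linear score; the visual-similarity weight is boosted 5x when the
    query contains a dominant visual object (assumed interface)."""
    total = 0.0
    for name, value in features.items():
        w = weights[name]
        if dominant_object_detected and name == visual_sim_key:
            w *= boost
        total += w * value
    return total

# Toy usage with purely illustrative values.
features = {"visual_similarity": 0.8, "text_match": 0.4, "freshness": 0.1}
weights = {"visual_similarity": 1.0, "text_match": 2.0, "freshness": 0.5}
print(rerank_score(features, weights, dominant_object_detected=True))
```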

Attachments