In this blog, I will go through the Microsoft Vision model, show how to use it for transfer learning, and finally implement it to classify the Fruits-360 dataset with 98% test accuracy.

Microsoft Vision Model

Microsoft Vision Model is a large pre-trained vision model with a ResNet-50 architecture, created by the Multimedia Group at Microsoft Bing. The model is built using the search engine’s web-scale image data to power its Image Search and Visual Search. It was pre-trained on the following datasets:

  1. ImageNet-22k,
  2. Microsoft COCO, and
  3. Two Web-supervised datasets (containing 40 million image-label pairs collected from image search engines)

Using Microsoft Vision

Installation of Microsoft Vision

pip install microsoftvision
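
Once the package is installed, loading the pre-trained backbone might look like the sketch below. The `microsoftvision.models.resnet50(pretrained=True)` call is the entry point as I recall it from the package README, so please check it against the installed version. The `load_backbone` helper name is my own and is not part of the library.

```python
def load_backbone():
    """Return the pre-trained Microsoft Vision ResNet-50 in inference mode.

    Hypothetical helper: only the resnet50(pretrained=True) call comes
    from the microsoftvision package, the wrapper itself is illustrative.
    """
    # Imported lazily so this file can be imported without the package installed.
    import microsoftvision

    model = microsoftvision.models.resnet50(pretrained=True)
    model.eval()  # inference mode: batch norm uses its stored running statistics
    return model
```

Calling `model = load_backbone()` then gives a regular PyTorch module that can be used as a frozen feature extractor.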

Use of Microsoft Vision for transfer learning

Flow diagram for Transfer Learning with Microsoft Vision


The model expects its input in the following form:

  1. The input images must be in BGR format with shape (3 x H x W), where H is the height and W is the width; 224 x 224 is the recommended size.
  2. The pixel values must be scaled to the range 0 to 1 and then normalized per channel using
    a. mean = [0.485, 0.456, 0.406]
    b. std = [0.229, 0.224, 0.225]
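
The input requirements above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's own preprocessing. It assumes the image is already resized to 224 x 224 and loaded as an RGB `uint8` array. It also assumes the mean and std are the standard ImageNet statistics ([0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]) given in RGB order, so the normalization is applied before the channels are flipped to BGR. The `preprocess` function name is hypothetical.

```python
import numpy as np

# Standard ImageNet statistics, assumed here to be in RGB order.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_rgb):
    """Turn an (H, W, 3) uint8 RGB image into a (3, H, W) float32 BGR array."""
    x = image_rgb.astype(np.float32) / 255.0   # scale pixels to [0, 1]
    x = (x - MEAN) / STD                       # per-channel normalization
    x = x[:, :, ::-1]                          # RGB -> BGR
    x = np.transpose(x, (2, 0, 1))             # (H, W, 3) -> (3, H, W)
    return np.ascontiguousarray(x)

# Dummy pure-red image, only to show the resulting shape and layout.
dummy = np.zeros((224, 224, 3), dtype=np.uint8)
dummy[..., 0] = 255
out = preprocess(dummy)
```

After the flip, the red channel ends up at index 2 of the first axis, which is what the BGR layout requires.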

Transfer learning

For transfer learning, we followed a simple method: pass an image through the Vision Model to get its features, then package those features with the image's label. We repeated this for every image, so each feature vector produced by the Vision Model is stored together with its corresponding label.
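
A rough sketch of this packaging step is shown below. It is written against a generic `extract_features` callable so that it runs without the real model. The `build_feature_dataset` and `fake_extractor` names are hypothetical, and the stand-in extractor only averages each channel; it is not the Vision Model.

```python
import numpy as np

def build_feature_dataset(extract_features, batches):
    """Pass every batch through a frozen feature extractor and keep
    each feature vector together with its label."""
    all_features, all_labels = [], []
    for images, labels in batches:
        feats = np.asarray(extract_features(images))   # shape (N, D)
        all_features.append(feats)
        all_labels.append(np.asarray(labels))          # shape (N,)
    return np.concatenate(all_features), np.concatenate(all_labels)

# Stand-in extractor for illustration only: averages each channel.
# With the real backbone this would wrap the model's forward pass instead.
def fake_extractor(images):
    return images.mean(axis=(2, 3))                    # (N, 3, H, W) -> (N, 3)

rng = np.random.default_rng(0)
batches = [
    (rng.random((4, 3, 224, 224), dtype=np.float32), [0, 1, 2, 3]),
    (rng.random((2, 3, 224, 224), dtype=np.float32), [4, 5]),
]
features, labels = build_feature_dataset(fake_extractor, batches)
```

The resulting `features` and `labels` arrays stay aligned row by row, so any small classifier can then be trained on them without running the backbone again.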

Transfer Learning

Vision Model as Feature Extractor

The Vision Model can also be used as a feature extractor. Its output is a 2048-dimensional feature vector, which can be stored in a database and reused in different applications.
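
As one illustrative application (my own example, not something from the model's documentation), stored feature vectors can be searched by cosine similarity to find visually similar images. The sketch below uses random vectors in place of real 2048-dimensional features, and the `most_similar` helper is hypothetical.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def most_similar(stored, query, k=3):
    """Return indices and cosine scores of the k stored vectors closest to query."""
    stored_n = normalize_rows(np.asarray(stored, dtype=np.float64))
    query_n = normalize_rows(np.asarray(query, dtype=np.float64).reshape(1, -1))
    scores = stored_n @ query_n[0]          # cosine similarity per stored row
    order = np.argsort(-scores)[:k]         # highest similarity first
    return order, scores[order]

# Random stand-ins for features that would normally come from the Vision Model.
rng = np.random.default_rng(0)
stored_features = rng.normal(size=(100, 2048))
top_idx, top_scores = most_similar(stored_features, stored_features[42], k=3)
```

Querying with a vector that is already in the store returns that same row first, which is a quick sanity check for the lookup.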


I have created a Google Colaboratory notebook to play with the Microsoft Vision model for classifying the Fruits-360 dataset with 81 classes.


