Microsoft Vision Transfer Learning

Abhishek Maheshwarappa
4 min readAug 1, 2021

Microsoft Vision ResNet 50, transfer learning to classify fruit-360 data with 98% Accuracy


In this blog, I will be going through the Microsoft Vision model, how to achieve the transfer learning using that, and finally will move into implementation of the Microsoft Vision model for classifying fruit-360 with 98% test accuracy.

If you are interested in the script directly go to my Google Colab and you can play with it here.

For others who are interested in the full story, read the entire blog:

In practice, very few train the entire network from the scratch i.e., with random initial weights, because of less availability of data, training time, and availability of the state-of-the-art pre-trained model. The widespread practice nowadays is to use the pre-trained model and fine-tune them to the specific task required which often leads to better performance than training the network from the scratch, i.e using Transfer Learning.

There are many pre-trained models available that are trained on a large dataset like ImageNet and Microsoft COCO. A pre-trained model may not be 100% accurate in your application, but it saves huge efforts required to re-invent the wheel.

In this blog, I will be using the Microsoft Vision model which provides us an exceptionally good accuracy compared to other pre-trained models.

Microsoft Vision Model

Microsoft Vision Model is a large pre-trained Vision Model that uses ResNet-50 architecture design created by the Multimedia Group at Microsoft Bing. The model is built using the search engine’s web-scale image data to power its Image Search and Visual Search.

Credits — Microsoft Research Blog

To achieve state-of-the-art performance in a cost-sensitive production setting, the researchers at Microsoft used multi-task learning with hard parameter sharing and optimized it separately for four datasets:

  1. ImageNet-22k,
  2. Microsoft COCO, and
  3. Two Web-supervised datasets (containing 40 million image-label pairs collected from image search engines)

Using Microsoft vision

Installation of Microsoft vision

pip install microsoftvision

Use of Microsoft vision for transfer learning

Flow diagram for Transfer Learning with Microsoft Vision

Let us take the example of building a classifier for classifying Apple and Orange. The above flow diagram shows the flow for transfer learning using the Microsoft Vision model. The images are first passed to the preprocessor which converts the image into the format required for the Vision Model consumption. The preprocessed images are then passed into the Vision Model which produces 2048 features or embeddings for each image that goes through the model. Then these features can be packaged to apply a custom fully connected neural network or simple logistic regression-based application and needs.


  1. The input images have to be in the BGR format which has the shape of (3 X H X W), where the H — height and W — Width is recommended to be 224 X 224.
  2. The images have to be normalized to have a value between 0 and 1 using the
    a. mean = [0.485, 0.456, 0.406]
    b. Std = [0.299, 0.224, 0.255]

Transfer learning

For transfer learning, we followed the simple method of passing an image and getting the features of the image, then packaging it with the respective label. This was followed for all the images we pass through the Vision Model and slice them with a corresponding label.

Transfer Learning

These packaged or sliced datasets can be further used for training the fully connected layers based on the needs of the problem we are having. These features can be stored in the database, which in turn can be useful in many cases. One such application is when we want to add class to the current setup, we use this.

For example, if we want to add Banana as a class, then you need to pass the images of Banana into the Vision Model and then preprocess them by packaging them with the label and to them add the packaged features of dogs and cats stored previously in the database, give this combined dataset to the fully connected layers to train for 3 classes, in this way we can make use of the class features previously saved.

Vision Model as Feature Extractor

The Vision Model can be used as a feature extractor also, the output of the Vision Model is the 2048 feature which can be stored in the database which in turn can be used in different applications.


I have created Google collaboratory to play with the Microsoft vision model for the classification of the Fruits-360 dataset with 81 classes.

Results look amazing with a test accuracy of 98% for 81 different classes of fruits.

Time is taken for the Training — 25.94 seconds for 10 Epochs with 41322 images and 81 classes.

Google Colab Link here and Github link here.

Find my other blogs here.


  1. Microsoft Research Blog
  2. Github
  4. Data — 10.17632/rp73yg93n8.1#file-56487963–3773–495e-a4fc-c1862b6daf91