Microsoft Vision Transfer Learning

Microsoft Vision ResNet-50: transfer learning to classify the Fruits-360 dataset with 98% accuracy


If you are interested in the script directly, go to my Google Colab and play with it here.

For others who are interested in the full story, read the entire blog:

In practice, very few people train an entire network from scratch (i.e., with randomly initialized weights), because of limited data, long training times, and the ready availability of state-of-the-art pre-trained models. The widespread practice nowadays is to take a pre-trained model and fine-tune it for the specific task at hand, i.e., transfer learning, which often leads to better performance than training the network from scratch.

There are many pre-trained models available that were trained on large datasets such as ImageNet and Microsoft COCO. A pre-trained model may not be 100% accurate for your application, but it saves the huge effort required to reinvent the wheel.

In this blog, I will be using the Microsoft Vision model, which gives exceptionally good accuracy compared to other pre-trained models.

Microsoft Vision Model

Credits — Microsoft Research Blog

To achieve state-of-the-art performance in a cost-sensitive production setting, the researchers at Microsoft used multi-task learning with hard parameter sharing, optimizing a separate objective for each of four datasets:

  1. ImageNet-22k,
  2. Microsoft COCO, and
  3. Two Web-supervised datasets (containing 40 million image-label pairs collected from image search engines)

Using Microsoft Vision

pip install microsoftvision

Use of Microsoft Vision for transfer learning

Flow diagram for Transfer Learning with Microsoft Vision

Let us take the example of building a classifier to distinguish Apple from Orange. The above flow diagram shows the flow for transfer learning using the Microsoft Vision model. The images are first passed to the preprocessor, which converts each image into the format required by the Vision Model. The preprocessed images are then passed through the Vision Model, which produces 2048 features, or embeddings, for each image. These features can then be packaged with their labels and fed to a custom fully connected neural network or a simple logistic regression classifier, depending on the application's needs.
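The classifier-head part of this flow can be sketched as follows. Downloading the actual Microsoft Vision weights is outside the scope of this snippet, so the 2048-dimensional embeddings are stood in for by random arrays, and the head is a minimal logistic-regression-style layer trained with plain NumPy; all names here are illustrative, not part of the `microsoftvision` API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the embeddings the Vision Model would produce:
# one 2048-dimensional feature vector per image.
n_apple, n_orange, dim = 50, 50, 2048
apple_feats = rng.normal(loc=+0.1, scale=1.0, size=(n_apple, dim))
orange_feats = rng.normal(loc=-0.1, scale=1.0, size=(n_orange, dim))

X = np.vstack([apple_feats, orange_feats])                   # (100, 2048)
y = np.concatenate([np.ones(n_apple), np.zeros(n_orange)])   # 1 = Apple, 0 = Orange

# Minimal logistic-regression head trained by gradient descent.
w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of log loss w.r.t. w
    b -= lr * np.mean(p - y)                 # gradient w.r.t. bias

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = np.mean(pred == y)
```

In practice `X` would come from passing the real Apple and Orange images through the Vision Model instead of the random generator.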


  1. The images have to be scaled to values between 0 and 1 and then standardized per channel with the ImageNet statistics:
    a. mean = [0.485, 0.456, 0.406]
    b. std = [0.229, 0.224, 0.225]
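A minimal sketch of that preprocessing step in NumPy (the function name is illustrative; the mean and std are the standard ImageNet per-channel statistics):

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB order).
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_uint8: np.ndarray) -> np.ndarray:
    """Scale an H x W x 3 uint8 image to [0, 1], then standardize per channel."""
    img = image_uint8.astype(np.float32) / 255.0   # pixel values now in [0, 1]
    return (img - MEAN) / STD                      # broadcast over the channel axis

# Example: a dummy 4 x 4 mid-gray RGB image.
dummy = np.full((4, 4, 3), 128, dtype=np.uint8)
out = preprocess(dummy)
```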

Transfer Learning

These packaged datasets can then be used to train the fully connected layers according to the needs of the problem at hand. The extracted features can also be stored in a database, which is useful in many situations; one such case is when we want to add a new class to the current setup.

For example, if we want to add Banana as a class, we pass the Banana images through the preprocessor and the Vision Model, package the resulting features with their labels, and combine them with the Apple and Orange features previously stored in the database. This combined dataset is then given to the fully connected layers to train for 3 classes. In this way, we reuse the class features saved earlier without re-extracting them.
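The steps above can be sketched with NumPy. Here the stored Apple and Orange features and the freshly extracted Banana features are stand-in random arrays; in practice they would come from the Vision Model and the feature database, and the array names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 2048  # embedding size produced by the Vision Model

# Features previously extracted and stored in the database.
apple_feats = rng.normal(size=(30, DIM))
orange_feats = rng.normal(size=(30, DIM))

# Newly extracted features for the class we are adding.
banana_feats = rng.normal(size=(30, DIM))

# Package features with integer labels: 0 = Apple, 1 = Orange, 2 = Banana.
X = np.vstack([apple_feats, orange_feats, banana_feats])
y = np.concatenate([np.full(30, 0), np.full(30, 1), np.full(30, 2)])

# X and y can now be fed to the fully connected layers for 3-class training;
# only the Banana images ever had to go through the Vision Model.
```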

Vision Model as Feature Extractor


The results look amazing, with a test accuracy of 98% across 81 different classes of fruits.

Time taken for training: 25.94 seconds for 10 epochs with 41,322 images and 81 classes.

Google Colab Link here and Github link here.

Find my other blogs here.


  1. Github
  2. Data — 10.17632/rp73yg93n8.1#file-56487963–3773–495e-a4fc-c1862b6daf91



Data Scientist | ML-Ops
