Defining “Vision” in “Computer Vision”

Updated: Sep 20, 2022

Computer Vision, also referred to simply as Vision, is a recent, cutting-edge field within computer science that deals with enabling computers, devices, or machines in general to see, understand, interpret, or manipulate what is being seen.

Computer Vision technology relies on deep learning techniques and, in a few cases, also employs Natural Language Processing techniques as a natural next step, for example to analyze text extracted from images.

With the advances in deep learning, building functions such as image classification, object detection, tracking, and image manipulation has become simpler and more accurate, paving the way for more complex autonomous applications such as self-driving cars, humanoid robots, and drones.

With deep learning, we can now manipulate images, for example by superimposing Tom Cruise’s features onto another face, or by converting a picture into a sketch or a watercolor painting. We can remove background noise from a picture and highlight the subject in focus, and even with the shakiest hands a stable photograph can be taken. We can estimate the proximity, structure, and shape of objects, as well as the texture of a surface. Under different lighting or camera exposure, we can still identify objects and recognize an object that we have seen before.

In Computer Vision, “enabling computers to see” means enabling machines or devices to process digital visual data, which can range from images taken with traditional cameras to graphical representations of a location, videos, heat intensity maps of any data, and beyond.

With this broader definition, we can see Computer Vision applications becoming ubiquitous in our day-to-day lives. We can now find an object or a face in a video, even in a live video feed; understand motion and patterns within a video; and increase or decrease the size, brightness, or sharpness of an image.
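To make the idea of processing digital visual data a little more concrete, here is a minimal sketch in plain Python of two of the operations mentioned above: changing the brightness and reducing the size of an image. It assumes a grayscale image stored as rows of pixel intensities from 0 to 255; the pixel values and the function names `adjust_brightness` and `downscale_by_two` are made up for illustration, and real applications would normally use an image library instead.

```python
# A tiny grayscale "image": rows of pixel intensities from 0 (black) to 255 (white).
# The values are invented for illustration only.
image = [
    [10, 50, 90, 130],
    [20, 60, 100, 140],
    [30, 70, 110, 150],
    [40, 80, 120, 250],
]


def adjust_brightness(img, delta):
    """Add delta to every pixel, clamping each result to the 0-255 range."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in img]


def downscale_by_two(img):
    """Halve width and height by keeping every second pixel in each direction."""
    return [row[::2] for row in img[::2]]


brighter = adjust_brightness(image, 40)   # positive delta brightens
darker = adjust_brightness(image, -30)    # negative delta darkens
smaller = downscale_by_two(image)         # 4x4 image becomes 2x2
```

Clamping matters here: without it, a bright pixel pushed past 255 would no longer be a valid intensity. Keeping every second pixel is the crudest possible way to resize, and smoother methods average neighboring pixels instead.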

To understand what constitutes computer vision, let’s look at the image below:

Though we are looking at this image for the first time, we can tell that it shows the sport of cricket being played on a bright day, that it is a match between Australia and South Africa, and that Australia won the match. The overall mood is one of celebration, and a few players can be named either by recognizing their facial features or by reading the names printed on their shirts.

The information we observed is complex; for a Computer Vision application, it would be a set of multiple inferences. Let us now map this human-driven interpretation to a machine’s vision processes.

1. First, we observe objects such as grass/ground, people, cricket equipment, advertisements, and sports uniforms. These objects are then grouped into categories. This process of extracting information is referred to as “image detection and classification”.

2. At a high level, there is a ground and there is a pitch. While it is difficult to point out their exact boundaries, marking regions based on the area each object occupies within the image is possible, and this process is referred to as “Image segmentation”.

3. Taking this to the next level, we can draw finer, tighter boundaries that help identify specific people and objects in the image. These can be seen as small boxes marked around each potential unique object, as shown in the image below.

4. Now, within each box, there could be people or different cricket-related objects. At the next level, we can detect and tag what each box contains. This process is called “Object detection”.

5. Extending this further, we can look closely at the people, specifically detect their faces, and through the “Face Recognition” process determine exactly who each player is. We can also observe that each person has a different height and build.

6. Names printed on the back of the players’ shirts can be another source for determining who each player is. An Optical Character Recognition (OCR) process, of the kind also used for handwriting recognition, can recognize shapes and lines and infer letters or characters.

7. Depending on the color of the uniforms, we can infer what type of match it is and which teams are playing. Identifying the colors of the pixels is again part of the “Image detection and manipulation” process.

8. As the game is played, the movement of the ball can be tracked, and the speed at which the ball strikes the bat can be computed. The path that the ball is likely to take can be determined as well. A few important statistics, such as how many deliveries have hit which position on the pitch, can also be computed. This is possible using a process called “Motion Tracking”.

9. There are cases where whether a player is “in” or “out” is decided by his or her leg position while striking the ball. To determine this accurately, images taken at that moment by different cameras from different angles need to be analyzed to identify the exact position of the player’s leg. This process is called “Image reconstruction”, where an object is reconstructed from different tomographic projections of the same object at different angles.
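Two of the steps above, inferring the team from the uniform color (step 7) and computing ball speed from tracked positions (step 8), can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the reference team colors, the sampled shirt pixels, the ball positions, the frame rate, and the meters-per-pixel scale are all invented values, and the function names are hypothetical. A real system would first need detection and tracking models to produce these inputs.

```python
import math

# Hypothetical reference uniform colors as (R, G, B); not official team values.
TEAM_COLORS = {
    "Australia": (255, 205, 0),    # yellow
    "South Africa": (0, 122, 77),  # green
}


def mean_color(pixels):
    """Average a list of (R, G, B) pixels sampled from a player's shirt."""
    count = len(pixels)
    return tuple(sum(pixel[i] for pixel in pixels) / count for i in range(3))


def closest_team(color, team_colors=TEAM_COLORS):
    """Return the team whose reference color is nearest in RGB space."""
    return min(team_colors, key=lambda team: math.dist(color, team_colors[team]))


def ball_speeds(positions, fps, meters_per_pixel):
    """Speed in m/s between consecutive frames, given ball centers in pixels."""
    speeds = []
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        pixels_moved = math.hypot(x1 - x0, y1 - y0)
        # Distance per frame times frames per second gives distance per second.
        speeds.append(pixels_moved * meters_per_pixel * fps)
    return speeds


# Step 7: a few made-up pixels sampled from a yellowish shirt.
shirt_pixels = [(250, 200, 10), (240, 210, 5), (255, 195, 0)]
team = closest_team(mean_color(shirt_pixels))

# Step 8: made-up ball centers detected in three consecutive frames.
track = [(100, 200), (130, 240), (160, 280)]
speeds = ball_speeds(track, fps=50, meters_per_pixel=0.01)
```

Comparing colors by straight-line distance in RGB space is a deliberate simplification; it is sensitive to lighting, which is one reason practical systems often work in other color spaces. Likewise, a single meters-per-pixel scale assumes the ball moves parallel to the camera plane, which a real broadcast setup would not guarantee.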
