Computer Vision API: What does it see?

August 29, 2016

Topics: Accessibility, WordPress.

Microsoft has a service called the Computer Vision API. It’s a simple system: feed it an image, and it analyzes the image and returns text feedback. This is not a simple analysis task, but the actual process for a user is amazingly streamlined.

Naturally, very little time passed before this service was called up to provide automatic alternative text. And, honestly, this seems like an awesome way to do this! Why take the time and effort to manually type in a description of your images if a machine can do it for you?

Holy crap! Does this really work?

Can a machine really do this for you? Is the content produced by a machine actually a text equivalent for the image?

To give away the punchline – no, it really can’t. I’m not talking about Microsoft’s Computer Vision API and whether it’s doing the job it claims – I don’t believe that’s actually relevant. I’m arguing that no matter how effectively automation can describe an image, there is always information missing.

Oh, darn. It doesn’t work.

Wait one second! I didn’t say that. It does work. It can work. But it can’t work for all images and all usages.

Ask yourself if the data produced from the literal content of an image suffices as a text alternative to the image. Sometimes, it does. For a picture of a white teapot, with no other content, there’s a good chance that the automatic data is sufficient. If the processing has enough data behind it, it might even be able to identify the model and brand. If your image is stock photography where the only relevant information is “a man in a blue business suit”, then this also might be enough.

(Just for clarity: you will need to double check the descriptions. They aren’t going to be 100% accurate.)

What about significant images?

Significant images may be art, memory, or information, but the image is important on its own, without any surrounding content.

Let’s consider an example where automatic processing falls short. Imagine a photo of your grandmother at a picnic. We can start with a basic idea of what an automated description might produce. As a first pass, I’m imagining something like “Old woman in a park.”

We’ll take our example further, and assume that the machine recognizes this is a picnic. Let’s also assume that it’s able to harvest personal data and identify the specific person in the image. Let’s assume that it looks at the image metadata and knows when and where the photo was taken. Now the automatic alternate text could be “Jane Doe, my grandmother, at a picnic in Grand Forks, North Dakota on May 15th, 2016”.

Is that good alternate text? Probably, yes. But I have two concerns: first, you’ve given up your privacy to allow software to know that much about you. Second, this alternate text is not aware of what this picture means.

Perhaps the alt attribute you want for this image is “The last time I saw my grandmother, at our annual family reunion at the family farm.” That emotional context may not be in any way present in the photo itself.

I’m not saying that it’s impossible for automatic processing to come up with valid, effective alternative text. There are some definite use cases where this is valid. I am saying, however, that automated image descriptions are fundamentally limited to the objects in view. When you’re describing an image, sometimes the most important part of the image is what’s missing.

All right. I’m confused. Tell me what to do.

Sorry, but I can’t do that. What I can tell you is that you will always need to verify automated alternative text. It won’t always be right, no matter how basic your images are. Is inaccurate alternative text better than no alternative text? No, I’m afraid it isn’t. There’s no benefit to a description of an image that’s just wrong.
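If it helps to see that rule as a workflow, here is a minimal sketch in Python. It is not tied to the Computer Vision API or to WordPress: the `AltTextDraft` class, the `needs_review` helper, and the file names are all hypothetical, made up for illustration. The only idea it encodes is that a machine suggestion stays a draft until a person accepts or rewrites it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AltTextDraft:
    """Hypothetical record pairing an image with a machine-suggested description."""
    image: str                      # file name or URL of the image
    suggested: str                  # machine-generated text, not yet verified
    approved: Optional[str] = None  # text a human has signed off on, if any

    def approve(self, text: Optional[str] = None) -> None:
        """Record a human decision: accept the suggestion, or replace it."""
        self.approved = (text or self.suggested).strip()

    def alt_attribute(self) -> Optional[str]:
        """Return publishable alt text, or None if nobody has reviewed it yet."""
        return self.approved


def needs_review(drafts: List[AltTextDraft]) -> List[AltTextDraft]:
    """List the drafts that still hold only an unverified machine suggestion."""
    return [d for d in drafts if d.approved is None]


drafts = [
    AltTextDraft("teapot.jpg", "A white teapot"),
    AltTextDraft("picnic.jpg", "Old woman in a park"),
    AltTextDraft("stock.jpg", "A man in a blue business suit"),
]

# The literal description is enough for the teapot, so accept it as-is.
drafts[0].approve()

# The picnic photo needs meaning the machine cannot see, so rewrite it.
drafts[1].approve(
    "The last time I saw my grandmother, at our annual family reunion "
    "at the family farm."
)

for draft in needs_review(drafts):
    print("Still waiting on a human:", draft.image)
```

Returning nothing for an unreviewed image is a deliberate choice here: it follows the argument above that a wrong description is not better than a missing one.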

That’s the story of accessibility. Automation can do a lot of grunt work for you – but in the end, you’re going to need to check your work.


2 Comments to “Computer Vision API: What does it see?”

  1. Thanks for chiming in, Christopher! I agree that there’s a lot of potential in this, especially for back-filling large volumes of images with no alt attributes – providing *something* is generally going to be better than nothing, as long as the site isn’t littered with a lot of noisy non-content images!

  2. Great comments on the promise and peril of automating alt text. Your point about descriptions being limited to the objects in view is a good example of the limitations that the technology will always face. Over time I think the automated technology will get better and possibly even learn to use the content, but a scenario where the technology could derive the intent or emotion of the author through some kind of neural interface is probably still down the road.

    That said, I could see a day in the not too distant future when this type of technology is able to provide a better description than what many humans would provide. I think your argument still stands in that scenario in that a review of any automated alt text by someone trained in writing effective alt text will be needed for a long time, but I’m less confident that will always be the case.