Microsoft has a service called the Computer Vision API (Application Programming Interface). It’s a simple system: feed it an image, and it analyzes the image and returns a text description. The analysis itself is anything but simple, but the process for a user is amazingly streamlined.
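To make that concrete, here’s a minimal sketch of what a call might look like in Python, assuming the REST “Analyze Image” operation with the Description feature. The endpoint, key, and exact response fields below are placeholders and may differ by API version; treat this as an illustration, not a reference.

```python
import requests

# Placeholders: substitute your own Azure resource endpoint and key.
ENDPOINT = "https://YOUR-RESOURCE.cognitiveservices.azure.com"
KEY = "YOUR-SUBSCRIPTION-KEY"

def describe_image(image_url):
    """Ask the Computer Vision API for a one-sentence description of an image."""
    response = requests.post(
        f"{ENDPOINT}/vision/v3.2/analyze",
        params={"visualFeatures": "Description"},
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={"url": image_url},
        timeout=10,
    )
    response.raise_for_status()
    captions = response.json()["description"]["captions"]
    # Each caption comes with a machine confidence score between 0 and 1.
    return captions[0]["text"], captions[0]["confidence"]
```

The detail worth noticing for this discussion is that every caption comes back with a confidence score; keep that in mind for later.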

Naturally, very little time passed before this service was put to work providing automatic alternative text. And, honestly, this seems like an awesome way to do it! Who needs to take the time and effort to type descriptions of their images manually if a machine can do it for them?

Holy crap! Does this really work?

Can a machine really do this for you? Is the content produced by a machine actually a text equivalent for the image?

To give away the punchline: no, it really can’t. I’m not talking about Microsoft’s Computer Vision API specifically, or whether it does the job it claims to; I don’t believe that’s actually relevant. I’m arguing that no matter how effectively automation can describe an image, there is always information missing.

Oh, darn. It doesn’t work.

Wait one second! I didn’t say that. It does work. It can work. But it can’t work for all images and all uses.

Ask yourself whether the data produced from the literal content of an image suffices as a text alternative for the image. Sometimes, it does. For a picture of a white teapot, with no other content, there’s a good chance that the automatic description is sufficient. With enough data and processing power, you might even be able to identify the model and brand. If your image is stock photography where the only relevant information is “a man in a blue business suit”, then this also might be enough.

(Just for clarity: you will need to double-check the descriptions. They aren’t going to be 100% accurate.)
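One way to act on that caveat is to treat the machine’s caption as a draft rather than a final answer. Here’s a hypothetical sketch, reusing the describe_image function above; the confidence threshold and the review queue are invented for illustration, not a recommendation.

```python
REVIEW_QUEUE = []

def draft_alt_text(image_url, threshold=0.8):
    """Use the machine caption as a draft, but route uncertain ones to a human."""
    caption, confidence = describe_image(image_url)
    if confidence < threshold:
        # Low confidence: don't publish; ask a person to write or confirm the text.
        REVIEW_QUEUE.append((image_url, caption))
        return None
    return caption
```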

What about significant images?

A significant image may be art, a memory, or information; what makes it significant is that the image matters on its own, without any surrounding content.

Let’s consider an example where automatic processing falls short. Imagine a photo of your grandmother at a picnic. We can start with a basic idea of what an automated description might produce. As a first pass, I’m imagining something like “Old woman in a park.”

We’ll take our example further, and assume that the machine recognizes this is a picnic. Let’s also assume that it’s able to farm personal data and identify the specific person in the image, and that it reads the image metadata and knows when and where the photo was taken. Now the automatic alternative text could be “Jane Doe, my grandmother, at a picnic in Grand Forks, North Dakota on May 15th, 2016”.
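Even in that best case, the software is only slotting recognized facts into a template. A hypothetical sketch of that composition step, assuming the person recognition has already happened elsewhere; the function, its parameters, and the Pillow EXIF lookup are purely illustrative.

```python
from PIL import Image

def composed_alt_text(path, person, relationship, place):
    """Slot already-extracted facts (a recognized person, a place, the EXIF
    capture date) into a template-style piece of alternative text."""
    exif = Image.open(path).getexif()
    # Tag 306 is the EXIF "DateTime" field, e.g. "2016:05:15 13:02:44".
    taken = exif.get(306, "an unknown date")
    return f"{person}, {relationship}, at a picnic in {place} on {taken}"

# e.g. composed_alt_text("picnic.jpg", "Jane Doe", "my grandmother",
#                        "Grand Forks, North Dakota")
```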

Is that good alternative text? Probably, yes. But I have two concerns: first, you’ve given up your privacy to allow software to know that much about you. Second, this alternative text says nothing about what this picture means.

Perhaps the alt text you want for this image is “The last time I saw my grandmother, at our annual family reunion at the family farm.” That emotional context may not be present in the photo itself in any way.

I’m not saying that it’s impossible for automatic processing to come up with valid, effective alternative text. There are definite use cases where it can. I am saying, however, that automated image descriptions are fundamentally limited to the objects in view. When you’re describing an image, sometimes the most important part is what’s missing.

All right. I’m confused. Tell me what to do.

Sorry, but I can’t do that. What I can tell you is that you will always need to verify automated alternative text. No matter how basic your images are, it won’t always be right. Is inaccurate alternative text better than no alternative text? No, I’m afraid it isn’t. There’s no benefit to an image description that’s just wrong.

That’s the story of accessibility. Automation can do a lot of grunt work for you – but in the end, you’re going to need to check your work.