Thoughts about Content Labeling and Data

December 5, 2007

Topics: Accessibility, Semantics, Usability.

As of September 2009, it appears that Google is actively testing some of the ideas discussed in this article and putting them into practice.

An interesting idea in indexing and handling page structure is that different areas of a single page can be identified and considered independently of the surrounding content. This applies particularly to specific and readily identifiable data types, such as phone numbers, postal codes, or abbreviations, but it can also be extended to broader content labeling.

A well-structured XML document has an absolutely clear labeling system for data built into the structure. If you take any RSS feed, for example, the elements which identify <title>, <link> or <managingEditor> can’t readily be mistaken.
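To make the point about built-in labels concrete, here is a minimal sketch in Python that reads those three elements out of a feed by name. The feed content and the `channel_fields` helper are made up for illustration; only the element names come from RSS itself.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 fragment; the channel values are illustrative only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <managingEditor>editor@example.com (Example Editor)</managingEditor>
  </channel>
</rss>"""


def channel_fields(feed_xml):
    """Return the labeled channel values as a dict keyed by element name."""
    channel = ET.fromstring(feed_xml).find("channel")
    wanted = ("title", "link", "managingEditor")
    # findtext returns None when an element is absent, so missing labels
    # show up explicitly instead of raising an error.
    return {name: channel.findtext(name) for name in wanted}


fields = channel_fields(SAMPLE_FEED)
print(fields["title"])
```

Because each value is addressed by its element name, no guessing about position or formatting is needed to tell the title from the link.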

A well-structured, semantically sensible XHTML or HTML document doesn’t offer nearly the same degree of data granularity. Some higher-level elements are fairly clear, as is the case with <address> or <cite>, but other potentially valuable elements, such as <h2> or <div>, say little about the data they contain.

For users, you can readily provide some of this structure by using in-page references to build a table of contents when a document requires significant structure. (Screen readers normally provide this kind of navigation, but most standard browsing methods offer no equivalent.) Giving a content heading or content section a unique id and linking to it lets your users quickly reach the areas of your document that match their needs more specifically than the document as a whole.
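As a rough sketch of that idea, the Python below collects every <h2> that carries an id and turns the results into a linked list of in-page references. The sample markup, the `HeadingCollector` class, and the `build_toc` function are all hypothetical, and the sketch only looks at <h2> elements.

```python
import html
from html.parser import HTMLParser

# Made-up page fragment; the last heading has no id and cannot be linked to.
SAMPLE_PAGE = """
<h2 id="labeling">Content Labeling</h2>
<p>Some text.</p>
<h2 id="microformats">Microformats</h2>
<p>More text.</p>
<h2>No id here</h2>
"""


class HeadingCollector(HTMLParser):
    """Collect (id, text) pairs for every <h2> that carries an id attribute."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._current_id = dict(attrs).get("id")
            self._text = []

    def handle_data(self, data):
        # Only gather text while inside an <h2> that has an id.
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._current_id is not None:
            text = "".join(self._text).strip()
            self.headings.append((self._current_id, text))
            self._current_id = None


def build_toc(page_html):
    """Return an HTML list linking to each identified heading."""
    collector = HeadingCollector()
    collector.feed(page_html)
    collector.close()
    items = [
        f'<li><a href="#{hid}">{html.escape(text)}</a></li>'
        for hid, text in collector.headings
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


toc = build_toc(SAMPLE_PAGE)
print(toc)
```

Headings without an id are skipped, which is the practical argument for giving every significant section one.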

This kind of fine-grained content labeling could be leveraged by a search engine to provide immediate access to a specific point within a resource.

Additional specificity can also be achieved through the use of Microformats. These externally defined mini-standards for content identification can enable various tools to better understand and interact with your site content.
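Here is a small sketch of how a tool might read microformat-labeled data, using the hCard class names `fn`, `tel`, and `postal-code`. The sample card, the `HCardReader` class, and the `read_hcard` function are invented for illustration, and the parser assumes simple markup in which every tag is explicitly closed. A real microformats parser handles far more cases.

```python
from html.parser import HTMLParser

# Made-up contact markup using hCard class names.
SAMPLE_CARD = """
<div class="vcard">
  <span class="fn">Jane Example</span>
  <span class="tel">555-0100</span>
  <span class="postal-code">00000</span>
</div>
"""

# The subset of hCard properties this sketch looks for.
HCARD_PROPERTIES = {"fn", "tel", "postal-code"}


class HCardReader(HTMLParser):
    """Map each recognized hCard class name to the text it labels."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        # One entry per open tag: the matched property name, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matched = next((c for c in classes if c in HCARD_PROPERTIES), None)
        self._stack.append(matched)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text counts only when its immediate parent is a labeled element.
        if self._stack and self._stack[-1]:
            name = self._stack[-1]
            self.properties[name] = self.properties.get(name, "") + data.strip()


def read_hcard(card_html):
    reader = HCardReader()
    reader.feed(card_html)
    reader.close()
    return reader.properties


card = read_hcard(SAMPLE_CARD)
print(card)
```

The class names do the work here: the reader never has to guess whether a string of digits is a phone number or a postal code.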

From an accessibility perspective, this is not currently receiving any kind of meaningful application that I’m aware of — but it’s an interesting concept. I like the idea that the use of microformats (or any method of explicitly labeling and defining data) could allow any user agent to make immediate use of that information.

The whole idea of dividing your semantic content into labeled and defined sections provides a great deal of value for any user scanning over your document. Dividing your content into areas with relevant headings makes that document more scannable, and also allows search engines to better understand the overall relevance of your document at a more specific level.

I’m not sure I have a real “conclusion” to pull from these thoughts. Basically, the underlying concept is “identifying your information is a good thing.” That’s a well-known issue in accessibility when it comes to visual, audio, or video content, with audio description, transcriptions, and captioning filling in those spaces. However, with more finely particularized elements of data, it seems that the same principle has the potential to provide an overall better user experience simply by enabling the user agent to quickly determine the best method to deal with the information.

Thoughts for this blog post inspired by Bill Slawski during a PubCon session…
