Metadata | CMS Blog Watch

metadata

Taxonomy vs. Metadata

Posted in DPCI, Taxonomy, digital asset management, metadata on March 15th, 2010 by Techini10 – Comments Off

An important step in defining a Digital Asset Management system or a Web Content Management System strategy is to define the metadata and taxonomy needs that will support an organization’s goals for categorizing and relating both assets and content.

What makes different WCM different?

Posted in WCM, metadata, twitter, wordpress on March 4th, 2010 by Philippe Parker – Comments Off

NMNH beetle specimens by Mr T in DC

I’ve recently been working on a number of web content management system selections. My preference is to carry these out in a two-stage process (see the one-sheet guide to selecting a WCM). The first stage pre-qualifies suppliers according to client attitudes to cost, risk and technological preferences. The second stage then gets into the real tasks that you want to perform, discovering how the WCM enforces and informs processes.

Like most other people in this business, I approach this from the point of view that there is no best WCM, just different products that may be viable for different kinds of tasks. It’s about finding a product that will allow you to get started as quickly as possible without precluding later ambitions. I try to show clients what a WCM could do for them, and in turn client aspirations suggest product features. These usually centre around a number of core areas:

Editorial interface

How is content updated? Is it through a browser, a document template, or some other application? If it is through a browser, which browsers does it work in? Does it require a plug-in? How viable are those constraints within the organisation? If the organisation is planning to devolve editing, how appropriate are WYSIWYG and in situ editors? If content entry needs to be more controlled via forms, how will users preview their work? Can the WCM offer different editorial interfaces for different types of users? And hand in hand with the interfaces, if you have lots of devolved editors, how does the WCM assure concurrent contribution and secure access for different kinds of users?

Pages vs. elements

Some WCM only really have the concept of pages and associated assets, making it hard to re-use fragments of content across the site. This simple model is generally appropriate for two scenarios. The first is where there are many devolved, occasional contributors who would be confused by having to perform multiple tasks to update a piece of content on one part of the site, and who wouldn’t immediately understand the implications of a more complex editorial change. The second is sites which have quite simple user journeys, with little information appearing in more than one place.
For sites which need to re-use content a lot, where there’s a central editorial team assuring that changes are propagated correctly, more advanced systems that use “fragments” of content in multiple locations across the site in an “edit once, publish many” model can bring significant business benefit. These content management models usually bring more flexible templates but they can also make it more difficult to audit content: what did a given page look like on a specific day and who made the content changes? They are also reliant on robust link cohesion, so that if you move a piece of content, the WCM continues to link to its new location.
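
To make the “edit once, publish many” idea concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory repository in which pages refer to fragments by a stable ID rather than by path. The class names, IDs and paths are all invented for illustration and are not taken from any particular WCM.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """A reusable piece of content, addressed by a stable ID rather than a path."""
    fragment_id: str
    path: str
    body: str


@dataclass
class Page:
    """A page is assembled from fragment IDs, not from copies of the text."""
    title: str
    fragment_ids: list[str] = field(default_factory=list)


class Repository:
    """Hypothetical content store keyed by fragment ID."""

    def __init__(self) -> None:
        self._fragments: dict[str, Fragment] = {}

    def add(self, fragment: Fragment) -> None:
        self._fragments[fragment.fragment_id] = fragment

    def move(self, fragment_id: str, new_path: str) -> None:
        # Only the path changes; the ID that pages refer to stays the same.
        self._fragments[fragment_id].path = new_path

    def edit(self, fragment_id: str, new_body: str) -> None:
        self._fragments[fragment_id].body = new_body

    def render(self, page: Page) -> str:
        # Fragments are resolved at render time, so one edit reaches every page.
        return "\n".join(self._fragments[fid].body for fid in page.fragment_ids)


repo = Repository()
repo.add(Fragment("opening-hours", "/shared/hours", "Open 9am to 5pm"))
contact = Page("Contact us", ["opening-hours"])
visit = Page("Plan your visit", ["opening-hours"])

repo.edit("opening-hours", "Open 10am to 6pm")  # edit once
repo.move("opening-hours", "/archive/hours")    # move without breaking pages
print(repo.render(contact))
print(repo.render(visit))
```

Both pages pick up the edit, and the move doesn’t break either of them, because nothing refers to the fragment by its location. The audit problem is visible here too: the sketch keeps no history, so it can’t say what a page looked like last week.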

Content structures

Absolutely central to most WCM is the concept of a content type. This is the model that allows you to define which fields editors need to complete to publish a page and the constraints on those: e.g. title (no more than 200 characters), summary (plain text), main body text (rich text), location (postal code), category (list of valid values), etc. These structures are important for a number of reasons. They allow you to create business rules for linking content, such as “get me the three latest news items about Germany”. They allow you to create different presentations for different types of content, so an event looks completely different from an FAQ. And they allow you to control which information must be completed before content can go live and how it will be presented on different platforms once it’s been published.
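
As a rough illustration of a content type, the Python sketch below models a news item with a few of the constraints mentioned above and a “three latest items about Germany” rule. The field names, the list of valid categories and the sample items are all assumptions made for the example; a real WCM would define these through its own configuration.

```python
from dataclasses import dataclass
from datetime import date

VALID_CATEGORIES = {"news", "event", "faq"}  # hypothetical list of valid values


@dataclass
class NewsItem:
    """A hypothetical content type: the fields an editor must complete."""
    title: str
    summary: str
    body_html: str
    country: str
    category: str
    published: date

    def validate(self) -> list[str]:
        """Return the problems that should block publication (empty if none)."""
        problems = []
        if not self.title or len(self.title) > 200:
            problems.append("title must be 1 to 200 characters")
        if "<" in self.summary:
            problems.append("summary must be plain text")
        if self.category not in VALID_CATEGORIES:
            problems.append("category is not in the list of valid values")
        return problems


def latest_about(items, country, limit=3):
    """Business rule: the `limit` most recent valid items about `country`."""
    matches = [i for i in items if i.country == country and not i.validate()]
    return sorted(matches, key=lambda i: i.published, reverse=True)[:limit]


items = [
    NewsItem("Berlin office opens", "New office.", "<p>Text</p>", "Germany", "news", date(2010, 1, 5)),
    NewsItem("Munich trade fair", "Fair dates.", "<p>Text</p>", "Germany", "news", date(2010, 2, 10)),
    NewsItem("Paris roadshow", "On tour.", "<p>Text</p>", "France", "news", date(2010, 2, 20)),
    NewsItem("Hamburg partnership", "New partner.", "<p>Text</p>", "Germany", "news", date(2010, 3, 1)),
    NewsItem("Frankfurt results", "Year end.", "<p>Text</p>", "Germany", "news", date(2009, 12, 1)),
    NewsItem("Cologne draft", "Unfinished.", "<p>Text</p>", "Germany", "rumour", date(2010, 3, 2)),
]

for item in latest_about(items, "Germany"):
    print(item.published, item.title)
```

The last item is the most recent, but it fails validation on its category and so never reaches the listing. That is the point of putting the constraints in the content type rather than in the template.
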
There are other metaphors that WCM use to relate complex content: hierarchical metadata structures such as folders, categories or channels enable you to group content together in more complex ways. Flatter metadata structures also allow you to “traverse” across website structures and relate content in different parts of the information architecture that don’t fit into this hierarchy. It’s often useful to have multiple kinds of metadata, particularly faceted taxonomy, if your content is particularly complicated and needs a lot of content relationships to achieve the desired user journeys.

Technology

Where the WCM isn’t a standalone application but needs to integrate with other systems in a web platform – user directories, CRM, eCommerce, transactional tools – you need to validate how it will communicate with other systems. Is it through the Application Programming Interface (API), web services, or some other method?
The maintenance and extensibility of the system can also be important requirements. If I need to change a content type, what does that involve? If I need to get data from another application, can I do this in a de-coupled way?

Some other factors may come into play, such as workflow, internationalisation and personalisation. If one product is particularly strong in one of these areas and it’s a key requirement, then it may get into a shortlist even if it’s weaker in some of the other areas identified above.

This all brings me to the recent debate about whether WordPress is a CMS, with numerous contributions on Twitter as well as from:

My experience of WordPress is that it’s really good at two key features where some established content management systems are relatively poor: search engine optimisation and comments. On SEO, it ties your blog post title to a friendly URL, enables good internal linking (as long as you don’t move any pages), allows tagging and categorisation and offers some great SEO tools. Comments meanwhile can be quite tricky for some WCM that operate separate content contribution and consumption environments, but WordPress does this easily, with useful anti-spamming tools and the ability to follow the comment conversation by RSS or email.

When it comes to the question of whether WordPress is or isn’t a WCM, the best analogy I could come up with was a camera phone. A camera phone does take pictures, it is convenient, some phones even have a flash and autofocus. But would you get a camera phone specifically to use as a camera? I think not, if you’re serious about photography. It is a camera, but a very limited one.

WordPress is a blogging tool with some shared characteristics of a WCM. If you apply some of the many available modules to it you can come up with a really nice proposition, up to a point. But you’re effectively hacking the software to get it to behave as many WCM already do. You can get any software to do pretty much anything in the end, but that still doesn’t make it a WCM.

WordPress is widely used by many organisations as a web content management system and there are a lot of photos taken on camera phones. But you need to understand the product’s limitations and if these don’t affect you and you’re achieving what you want, then no one should criticise you for your choice. But let’s be sensible about it and say that even if there’s no such thing as the best WCM, you know that it wouldn’t be WordPress.

Who Owns Information Architecture? All Of Us.

Posted in IA, Information Architecture, Information Management, Information governance, Leslie Owens, Taxonomy, information architect, metadata on February 2nd, 2010 by Leslie Owens – Comments Off

By Leslie Owens

Fellow analyst Gene Leganza wrote an excellent overview of Information Architecture, available for free via this link.

Gene briefly explores the misunderstanding between “Enterprise IA” and “User Experience IA.” This tension was well characterized by Peter Morville almost 10 years ago (see “Big Architect, Little Architect”). Personally I think it’s clear that content is always in motion, and unsupported efforts to dominate and control it are doomed. People are a critical element of a successful IA project, since those who create and use information are in the best position to judge and improve its quality. Many hands make light work, as the saying goes.

For example, if you want a rich interactive search results page, you need to add some structure to your content. This can happen anytime from before the content is created (using pre-defined templates) to when it is presented to a user on the search results page. Content is different than data, a theme Rob Karel and I explored in our research on Data and Content Classification. For this reason, IA is both a “Back end” and a “Front end” initiative.

When clients can’t find information no matter how expensive their search engine, they should investigate how their approach to information governance and metadata management might contribute to the problem. Our research indicates that improved collaboration and search are the primary drivers for a metadata management initiative.

I encourage you to read the January 22 report called "Enterprise Architecture Must Lead Enterprise Metadata Management Initiatives." At last, Information and Knowledge Management (IKM) professionals have a partner in defining a metadata strategy: the Enterprise Architect. I think of the EA as the big brother showing up at a schoolyard fight just in time. IKM pros need them for their vision, their clout and their sign-off powers. But EAs need IKM pros too. Information and Knowledge Management professionals work in the intersection between Business and IT. We facilitate dialogue between business SMEs and IT SMEs. If Enterprise Architects are the “city planners,” the IKM pros are the “community organizers.”

When appropriate, IKM pros need to adopt the EA language of metadata, structure, alignment, integration, and framework, instead of taxonomy, content types, vocabularies and navigation. But we should never lose our user-centered orientation as we collaborate across the enterprise.

My colleague Stephen Powers and I think the Information and Knowledge Management professional plays a key role in IA projects. To that end, we are researching the responsibilities of the Content Architect in the enterprise. If you spend your day defining, enriching and governing content, please add a comment here and we will follow up to interview you about your organizational model. All feedback is welcome as we explore how IA contributes to Smart Computing and Knowledge Management.

Who Owns Information Architecture? All Of Us.

Posted in IA, Information Architecture, Information Management, Information governance, Taxonomy, information architect, metadata on February 1st, 2010 by Leslie Owens – Comments Off

Fellow analyst Gene Leganza wrote an excellent overview of Information Architecture, available for free via this link: http://www.forrester.com/rb/Research/topic_overview_information_architecture/q/id/55951/t/2

Gene briefly explores the misunderstanding between “Enterprise IA” and “User Experience IA.” This tension was well characterized by Peter Morville almost 10 years ago (see “Big Architect, Little Architect”). Personally I think it’s clear that content is always in motion, and unsupported efforts to dominate and control it are doomed. People are a critical element of a successful IA project, since those who create and use information are in the best position to judge and improve its quality. Many hands make light work, as the saying goes.

For example, if you want a rich interactive search results page, you need to add some structure to your content. This can happen anytime from before the content is created (using pre-defined templates) to when it is presented to a user on the search results page. Content is different than data, a theme Rob Karel and I explored in our research on Data and Content Classification. For this reason, IA is both a “Back end” and a “Front end” initiative.

DITA, Metadata Maturity and the Case for Taxonomy

Posted in DITA, Dynamic Content, Findability, Personalization, Taxonomy, XML, conditional publishing, main blog, metadata on January 7th, 2010 by scottabel – Comments Off

By Paul Wlodarczyk, Director, Solutions Consulting and Stephanie Lemieux, Taxonomy Practice Lead, Earley and Associates, Inc.

Many organizations have turned to component-oriented content creation to create more sophisticated knowledge products, in more languages, and at lower cost. Our research shows that organizations that use XML authoring are more mature than their peers with respect to the adoption of best practices for search and metadata. However, the use of native DITA metadata capabilities is rare, and many are also missing out on opportunities to use taxonomy for content reuse and improved content findability. This article examines the metadata capabilities within DITA (and content management systems), discusses two major benefits that can be achieved by using descriptive metadata and taxonomy, and recommends some best practices for getting started with metadata for component-oriented content.

The Haystack Problem

Finding content in your file system or content repository is hard enough when you’ve got simple text documents to deal with. When you’re using DITA — the Darwin Information Typing Architecture — and other component-oriented XML standards, you increase the difficulty by two or three orders of magnitude, because you’re looking for smaller needles in bigger haystacks. Having thousands of media-independent content objects that can be shared and reused across multiple deliverables allows you to create more sophisticated knowledge products, but it definitely poses a challenge in findability for content authors.

Sounds like a job for metadata and taxonomy.

Among its many features for content reuse, DITA provides content creators with a facility for tagging content objects with metadata. Metadata—literally “data about the data”—lets content authors and others who manage content describe what the content is about (descriptive metadata), as well as assign properties like who created the content, when, in what language, and for which audience (administrative metadata). A taxonomy is a hierarchical structure that organizes concepts and controls vocabulary. Taxonomies allow organizations to create and centrally manage important terms that can be applied to content as metadata. For example, a telecommunications manufacturer might have a taxonomy that includes concepts such as product categories (Mobile Phones, Wireless Routers, and so on), industries (Healthcare, Utilities, Transportation, and so on), or product models.
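
A taxonomy of this kind can be sketched in a few lines of Python. The example below is only an illustration: it stores the telecommunications terms mentioned above as a child-to-parent mapping (the structure and function names are assumptions) and uses it to separate controlled terms from free-text tags.

```python
# Hypothetical taxonomy: each term maps to its parent (None for a top-level facet).
TAXONOMY = {
    "Product Categories": None,
    "Mobile Phones": "Product Categories",
    "Wireless Routers": "Product Categories",
    "Industries": None,
    "Healthcare": "Industries",
    "Utilities": "Industries",
    "Transportation": "Industries",
}


def ancestors(term):
    """Walk up the hierarchy from a term to its top-level facet."""
    path = []
    parent = TAXONOMY[term]
    while parent is not None:
        path.append(parent)
        parent = TAXONOMY[parent]
    return path


def check_tags(tags):
    """Split proposed tags into controlled terms and rejected free text."""
    accepted = [t for t in tags if t in TAXONOMY]
    rejected = [t for t in tags if t not in TAXONOMY]
    return accepted, rejected


accepted, rejected = check_tags(["Mobile Phones", "Healthcare", "cell phones"])
print("accepted:", accepted)
print("rejected:", rejected)
print("Mobile Phones sits under:", ancestors("Mobile Phones"))
```

Rejecting “cell phones” is the vocabulary control at work: the author is steered to the managed term “Mobile Phones”, so every topic about phones carries the same tag.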

Once applied, this metadata and taxonomy can be leveraged by a search application to help users—internal or external—find and use content. Search engines can use taxonomy to organize search results in meaningful ways, such as refining search based upon certain properties (faceted search) and suggesting related searches based upon relationships between search terms and other concepts in the taxonomy.
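
Faceted search can be illustrated with a small Python sketch. It assumes each content object carries one value per facet; the documents, facet names and helper functions are invented for the example and do not represent any search engine’s API.

```python
from collections import Counter

# Hypothetical content objects, each tagged with one value per facet.
DOCS = [
    {"title": "Router setup guide", "category": "Wireless Routers", "industry": "Healthcare"},
    {"title": "Phone quick start", "category": "Mobile Phones", "industry": "Healthcare"},
    {"title": "Fleet phone rollout", "category": "Mobile Phones", "industry": "Transportation"},
    {"title": "Substation router notes", "category": "Wireless Routers", "industry": "Utilities"},
]


def refine(docs, **selected):
    """Keep only documents that match every selected facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in selected.items())]


def facet_counts(docs, facet):
    """Count how many documents fall under each value of a facet."""
    return Counter(d[facet] for d in docs)


phones = refine(DOCS, category="Mobile Phones")
print([d["title"] for d in phones])
print(facet_counts(phones, "industry"))
```

After the user picks “Mobile Phones”, the remaining industry counts tell the interface which further refinements are worth offering.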

Seems a natural fit—DITA and taxonomy, like peanut butter and chocolate: DITA creates a multitude of reusable components, and taxonomy helps describe and organize the components so that they may be readily found and reused by content authors and users. One would think that organizations getting involved in DITA and metadata structures would quickly see the benefits of taxonomy. We recently tested this hypothesis in a benchmarking study into the adoption of metadata and taxonomy.

XML, DITA, and Metadata Maturity

In June 2009, Earley & Associates and Taxonomy Strategies surveyed 270 organizations to discover how mature they were on the scale of best practices for enterprise search, taxonomy, and metadata. We found that organizations that use XML for structured authoring showed more widespread adoption of metadata best practices than average.

Figure 1: Metadata Best Practices

Figure 2: Taxonomy Best Practices

Part of our hypothesis was confirmed. It would seem to follow that adoption of taxonomies and descriptive metadata would be widespread among organizations using DITA, since metadata-based search would improve findability of content objects. However, in follow-up interviews within the DITA community, in which we searched for metadata success stories, we discovered that in practice few organizations use DITA’s embedded metadata capabilities, and that fewer still do anything with taxonomies to organize descriptive metadata for DITA content, even for use by content management systems (CMS).

DITA Support for Metadata

Is it a matter of inadequate metadata and taxonomy support within DITA? Compared to other XML standards, DITA provides a relatively rich and extensible framework for embedding metadata directly within the XML objects themselves. The embedded metadata can be used by processing tools—like the publishing tools in the DITA Open Toolkit (DOTK)—to conditionally publish content or to create metadata in the final outputs, like HTML.

DITA objects—both topics and maps—have a prolog section in which metadata can be specified. Within the prolog, the metadata section can define metadata about the topic itself—such as the intended audience, the platform (for defining the applicability of the topic to specific hardware or operating systems), and so on. This metadata can be used for conditional publishing. For example, you can automate the production of a Linux version of your documentation by only outputting topics and maps that set platform to “Linux” in the metadata.
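
The Linux example can be sketched in Python using only the standard library. This is a simplified illustration, not the DITA Open Toolkit’s own mechanism: the three topics are invented, and the script reads the platform element inside prodinfo in each prolog to decide what to output.

```python
import xml.etree.ElementTree as ET

# Three invented, heavily simplified topics; real DITA files carry much more.
TOPICS = {
    "install_linux.dita": """<topic id="install_linux"><title>Installing on Linux</title>
      <prolog><metadata><audience type="administrator"/>
        <prodinfo><prodname>Example Product</prodname><platform>Linux</platform></prodinfo>
      </metadata></prolog><body><p>Use the package manager.</p></body></topic>""",
    "install_windows.dita": """<topic id="install_windows"><title>Installing on Windows</title>
      <prolog><metadata>
        <prodinfo><prodname>Example Product</prodname><platform>Windows</platform></prodinfo>
      </metadata></prolog><body><p>Run the installer.</p></body></topic>""",
    "overview.dita": """<topic id="overview"><title>Overview</title>
      <body><p>What the product does.</p></body></topic>""",
}


def platforms(topic_xml):
    """Return the set of platforms named in a topic's prolog metadata."""
    root = ET.fromstring(topic_xml)
    found = root.findall("./prolog/metadata/prodinfo/platform")
    return {p.text.strip() for p in found if p.text}


def select_for(platform, topics, include_untagged=False):
    """Pick the topics to publish for one platform.

    By default only topics that explicitly name the platform are kept.
    Set include_untagged=True to also keep topics that name no platform.
    """
    selected = []
    for name, source in topics.items():
        named = platforms(source)
        if platform in named or (include_untagged and not named):
            selected.append(name)
    return selected


print(select_for("Linux", TOPICS))
print(select_for("Linux", TOPICS, include_untagged=True))
```

The include_untagged switch is an assumption added for the sketch. A strict reading of “only outputting topics that set platform to Linux” would drop the overview topic, which is rarely what a documentation team wants.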

DITA objects can also embed administrative metadata about the author, copyright holder, source, publisher, and so on. Metadata can also contain descriptive keywords for the topic or map. Keywords or index terms are output to HTML or XHTML as metadata keywords to support search engines. Authors can also define index terms for the automatic generation of back-of-book indices.

DITA also enables users to define custom metadata fields within the othermeta element. Like keywords, metadata defined as othermeta are output as HTML metadata elements but ignored for other types of output like PDF. Clearly, metadata is a powerful tool in helping to manage and publish DITA content. But in practice, embedded DITA metadata is used largely to drive conditional publishing, which is fairly ubiquitous. Beyond this, most organizations generally don’t use DITA metadata, relying instead upon content management system (CMS) metadata to manage workflow and search of content objects. The use of taxonomies to manage vocabularies and organize the concepts for descriptive metadata is even rarer.
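
To show roughly what “output as HTML metadata elements” means for keywords and othermeta, here is a small Python sketch. It is an approximation written for illustration, not the DITA Open Toolkit’s actual transform; the prolog content and the exact shape of the generated tags are assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# An invented prolog with two keywords and one custom othermeta field.
PROLOG = """<prolog><metadata>
  <keywords><keyword>router</keyword><keyword>firmware</keyword></keywords>
  <othermeta name="product-line" content="Wireless Routers"/>
</metadata></prolog>"""


def html_meta(prolog_xml):
    """Turn keywords and othermeta entries into HTML meta tag strings."""
    root = ET.fromstring(prolog_xml)
    tags = []
    keywords = [k.text.strip() for k in root.iter("keyword") if k.text]
    if keywords:
        joined = escape(", ".join(keywords))
        tags.append(f'<meta name="keywords" content="{joined}">')
    for other in root.iter("othermeta"):
        name = escape(other.get("name", ""))
        content = escape(other.get("content", ""))
        tags.append(f'<meta name="{name}" content="{content}">')
    return tags


for tag in html_meta(PROLOG):
    print(tag)
```

A PDF pipeline would simply never call a function like this, which is why the same metadata disappears from print output.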

CMS Metadata

While we had great difficulty finding any organizations using DITA metadata for purposes beyond conditional publishing, we did confirm that organizations that use DITA often make extensive use of CMS metadata. CMS metadata can be very rich, especially in component content management systems designed to handle DITA content. Content management systems often provide mechanisms for automating metadata, assuring that it is applied to content more consistently, in turn making it more valuable for search. For example, information about the author, publisher, or copyright holder can be set automatically by the CMS without author intervention. As a result, CMS metadata is used instead of DITA metadata for common applications such as:

  • Search – Administrative metadata, such as author or version, is typically automated and presented as search options or facets in the CMS search interface. However, just using administrative metadata misses much of the power of faceted search.
  • Workflow automation – Metadata about the content lifecycle state (for instance, “draft,” “ready for review,” “ready for localization”) can be set by authors to trigger production and editorial workflows in the CMS.
  • Publishing automation – A DITA-aware CMS can automatically set embedded DITA metadata to provide the DITA Open Toolkit with metadata for publishing automation, such as the metadata for conditional publishing. CMS metadata can also be exported with DITA content, as a separate XML file or “sidecar” to be used by other tools to process the DITA content.
  • Dynamic publishing – Descriptive metadata can be used to present content objects dynamically to end users.

So we did find that organizations using DITA do take advantage of CMS metadata for finding objects and producing deliverables, but mostly on the administrative side or for conditional publishing.

When we asked about taxonomy, even the most mature organization was just getting started with it.

Why don’t we see more DITA users adopting taxonomy to improve findability and reuse of DITA objects? Well, some are but aren’t aware of it, per se. Many organizations are creating product documentation, and they achieve high levels of structural reuse—reuse that flows naturally from the structure of the product line. For example, we interviewed one large information-technology hardware, software, and service firm with millions of DITA topics in use across the company. They report 80 percent reuse, but cite that this is largely due to the modular nature of their products; content reuse follows the Bill of Materials. Reusable content is easy to find by browsing the folder structure of the CMS, which is organized based upon the product lines of the company. Authors who created content are the ones organizing the CMS folder structure and are the ones filing and reusing content. Authors wrote it, they need to reuse it, and they know where it is. Search is helpful, but not critical to reuse.

This scenario does, in fact, describe the simplest and most limited use of taxonomy: the product taxonomy, reflected in the folder structure. However, in this case, taxonomy is used only for browsing, not for controlling metadata vocabularies. Putting a piece of content in the “Product X” folder is not the same as assigning metadata to that content so that a search engine can index it as “about Product X.” In practice, folders may be all they need. A company with 80 percent structural reuse can only expect marginal improvement in reuse rates from using metadata-based search (since you can’t double 80 percent).

Software companies, the most common DITA users, typically report 40 percent reuse rates, with fewer opportunities for structural reuse, because there is less common content among products. Here the opportunities to increase reuse are in other departments, such as training and marketing. Improved metadata-based search becomes the key for reuse to happen across departments: unlike structural reuse inside the technical publications department, the people reusing the content didn’t write it, and they don’t know where it is (or even if it exists).
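
The “can’t double 80 percent” point is simple arithmetic, and a short Python sketch makes the contrast between the two reuse rates explicit. The function name is made up for the example, and it assumes the theoretical ceiling is 100 percent reuse.

```python
def reuse_headroom(current_rate):
    """Largest possible relative gain if reuse rose from current_rate to 100%."""
    return (1.0 - current_rate) / current_rate


for rate in (0.80, 0.40):
    gain = reuse_headroom(rate)
    print(f"{rate:.0%} reuse today: at most a {gain:.0%} relative improvement")
```

At 80 percent reuse the most that better search could ever add is a quarter more, while at 40 percent there is room to more than double. That is why metadata-based search matters far more to the second group.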

So when content can be successfully discovered by browsing, descriptive metadata is useful but not critical to reuse. But when reuse depends on search, metadata and a usable taxonomy that spans departments become essential to achieving higher reuse rates.

Metadata Enables Dynamic Publishing of Content

During our interviews with DITA thought leaders, one emerging opportunity for using metadata and taxonomy came to our attention: dynamic publishing. A major benefit of DITA is creating content that is media-independent. It also enables content objects to be organized by DITA maps, so that content can be recombined and re-sequenced into different deliverables. DITA maps provide flexibility, but at the end of the day, they are still as static as the Table of Contents of a printed book. Many organizations are beginning to experiment with dynamic publishing, in which the selection and sequencing of content is done at run-time, dynamically, and independent of a map. Dynamic publishing breaks the book metaphor.

Dynamic publishing lets content be chosen and presented to meet the unique needs of a user or situation. To best illustrate dynamic publishing, let’s contrast it with static publishing of a help system. In a statically published help system, the hierarchy of topics is fixed by the author and the selection of content is limited to what is in the DITA map at publish time. All of the related topics are manually linked. If an author wants to add a related topic, she needs to manually add the link (or update the related-links table) and republish. The publishing process creates a deliverable that—while interactive—is static with respect to its contents and the relationships among them.

To create the same help system with dynamic publishing, the author would publish her content to a server, but she would not create the structure and relationships between topics at publish-time. Instead, a taxonomy would specify the relationship between concepts and properties that are defined in metadata. The relationships among topics are generated at run-time, based upon metadata on the topics. The richer the metadata and the more complete the taxonomy, the more sophisticated the user experience.
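
As a rough illustration of relationships being generated at run-time, here is a minimal Python sketch. The taxonomy, the topic IDs, and the function names are all hypothetical, and the ranking rule (count of shared concepts) is an assumption made for the example, not something a DITA toolchain prescribes.

```python
from dataclasses import dataclass, field

# Hypothetical taxonomy: each concept points to its broader (parent) concept.
TAXONOMY = {
    "inkjet-cartridge": "printing-supplies",
    "toner": "printing-supplies",
    "printing-supplies": None,
    "wifi-setup": "networking",
    "networking": None,
}


@dataclass
class Topic:
    topic_id: str
    title: str
    subjects: set = field(default_factory=set)  # descriptive metadata on the topic


def expand(concept):
    """Return the concept together with all of its broader concepts."""
    found = set()
    while concept is not None and concept not in found:
        found.add(concept)
        concept = TAXONOMY.get(concept)
    return found


def concepts_of(topic):
    """All concepts a topic is about, including broader ones from the taxonomy."""
    return set().union(*(expand(s) for s in topic.subjects))


def related_topics(current, published):
    """Rank the other topics by how many concepts they share with the current one."""
    wanted = concepts_of(current)
    scored = []
    for topic in published:
        if topic.topic_id == current.topic_id:
            continue
        overlap = len(wanted & concepts_of(topic))
        if overlap:
            scored.append((-overlap, topic.title, topic))
    # Sort on (score, title) only, so Topic objects are never compared.
    return [topic for _, _, topic in sorted(scored, key=lambda item: item[:2])]


# Example data (assumed): four published topics with subject metadata.
published = [
    Topic("t1", "Replacing an inkjet cartridge", {"inkjet-cartridge"}),
    Topic("t2", "Replacing toner", {"toner"}),
    Topic("t3", "Connecting to Wi-Fi", {"wifi-setup"}),
    Topic("t4", "Ordering supplies", {"printing-supplies"}),
]
```

Calling `related_topics(published[0], published)` computes the related list when it is requested. A topic appended to `published` later shows up in that list purely because of its metadata, with no links to update and nothing to republish.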

We all have experienced faceted search on consumer web sites, where we can refine search results by selecting specific values for different attributes, such as the number of megapixels for a camera. This experience is driven by metadata. With rich metadata on DITA content, we can create very sophisticated electronic content browsers, where metadata-based search creates browser-like user experiences. In the past, IETMs—interactive electronic technical manuals—required manually creating links and weaving together content. Dynamically published IETMs enable users to navigate through content with dynamic, task-focused pathways based upon metadata. When new content is published to the server, it can find its place within the IETM based upon its metadata.
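
The faceted refinement described above can be sketched in a few lines of Python. This is only an illustration: the topic records are invented, and the helper names `refine` and `facet_counts` are hypothetical rather than part of any DITA or CMS product.

```python
from collections import Counter

# Hypothetical topic metadata; audience, platform and product are used as facets.
TOPICS = [
    {"id": "t1", "audience": "administrator", "platform": "linux", "product": "server"},
    {"id": "t2", "audience": "end-user", "platform": "windows", "product": "client"},
    {"id": "t3", "audience": "administrator", "platform": "windows", "product": "server"},
    {"id": "t4", "audience": "end-user", "platform": "linux", "product": "client"},
]


def refine(topics, **selected):
    """Keep only the topics whose metadata matches every selected facet value."""
    return [
        t for t in topics
        if all(t.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(topics, facet):
    """Count how many of the given topics carry each value of one facet."""
    return Counter(t[facet] for t in topics if facet in t)
```

For example, `refine(TOPICS, audience="administrator")` narrows the result set, and `facet_counts` on that result gives the numbers a faceted interface would show next to each remaining platform value.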

“There’s a really good business case around dynamic publishing,” says Chip Gettinger, Vice President of XML Solutions at SDL XySoft. “When you create flat output—like a PDF or a help system—metadata is useful, but it is absolutely critical for dynamic content. Dynamic content can justify DITA adoption by enabling a broader range of uses for DITA content.”

Best Practices

So while we found some basic best practices in use during our research, there is a case to be made for more extensive use of taxonomy and metadata by organizations that use DITA. If you want to increase reuse across departments or enable dynamic publishing, you probably have some work ahead of you. Here are some best practices to get you heading in the right direction:

  • Start by identifying all your taxonomy use cases – You will be using taxonomy not only for authors to search content objects for reuse but also potentially for serving up content to users dynamically or in a faceted interface. These perspectives will provide you with the framework for your taxonomy.
  • Reuse existing vocabulary – Many organizations already use controlled vocabularies for some metadata fields such as organization, audience, platform, and product but still rely on keywords supplied by authors for other descriptive metadata, such as subject. Look to existing sources for tagging your content, such as fault classification schemes (from the service hotline), hierarchical product or system models (from engineering), or hierarchical task models (from instructional/task analysis from the training organization) as places to start building hierarchical descriptive taxonomies.
  • Authors are the best people to apply descriptive metadata – After all, they do the analysis to determine what content was required in the first place, so they have the best context for classifying it. However, they aren’t librarians. Don’t expect authors to tag a lot: automate tagging when possible—especially for administrative metadata (author, organization, creation date, language).
  • Leverage the technology – Many content management systems can integrate third-party classification servers for automating descriptive metadata. These servers can automatically apply metadata from a taxonomy or controlled vocabulary when content topics are checked in, then automatically populate subject metadata fields in the CMS. The metadata can in turn be reviewed and manually adjusted by authors. This metadata can be embedded into your DITA content for use in conditional publishing or to generate HTML tags in the final output to support search or dynamic publishing.
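
To make the automated-tagging practice concrete, here is a deliberately simple Python sketch of tagging at check-in. It is a stand-in for a classification server, not a description of how any real one works: the vocabulary, the synonym lists, and the `suggest_subjects` and `check_in` functions are all assumptions made for the example.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> phrases that indicate it.
VOCABULARY = {
    "paper jam": ["paper jam", "jammed paper", "misfeed"],
    "connectivity": ["wi-fi", "wifi", "network connection"],
}


def suggest_subjects(text, vocabulary=VOCABULARY):
    """Suggest preferred terms whose indicator phrases appear in the text."""
    lowered = text.lower()
    suggestions = []
    for preferred, variants in vocabulary.items():
        # Whole-phrase match, so "wifi" does not match inside a longer word.
        if any(re.search(r"\b" + re.escape(v) + r"\b", lowered) for v in variants):
            suggestions.append(preferred)
    return sorted(suggestions)


def check_in(topic_text, author, language="en"):
    """Build the metadata record at check-in: administrative fields are filled
    automatically, subject tags are suggested and left for the author to review."""
    return {
        "author": author,
        "language": language,
        "subjects": suggest_subjects(topic_text),
        "subjects_reviewed": False,  # the author confirms or adjusts the tags later
    }
```

The point of the design is the division of labour: the system proposes tags from the controlled vocabulary, and the author only reviews them.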

About the authors

Paul Wlodarczyk, Director, Solutions Consulting – Paul helps clients compete by improving their content lifecycles – business processes and workflows that span the collection, collaboration, authoring, assembly, styling, review, localization, publishing, reuse, management, and search of unstructured content.

Paul brings over 25 years’ experience in content lifecycle operations, consulting, and software development, with expertise in the areas of enterprise content management, knowledge management, technical publishing, localization, collaboration, user interface design, learning technologies, and information worker productivity.

Stephanie Lemieux, Taxonomy Practice Lead – Stephanie has a Master’s in Library and Information Studies (MLIS) from McGill University, specializing in knowledge and content management, taxonomy, and information architecture. For the past several years, she has been working on taxonomy & knowledge management contracts and research projects for a variety of clients.

Recent projects include the development of a global corporate taxonomy and its implementation in a content management system, the creation of faceted search taxonomies for large e-commerce websites, and a digital asset management taxonomy.

Looking at CMIS 1.0, Thinking of 2.0

Posted in CMIS, ECM, Enterprise 2.0, Nuxeo, metadata on December 7th, 2009 by Pie – Comments Off

As you hopefully know by now, CMIS 1.0 was released for public comment thru December 22nd, and the excitement is building in the community and the press.  If you haven’t looked at it, and want to provide feedback on the CMIS standard, the time is now.

For more information on what CMIS can do for you, here are some useful references, including three of my posts on the topic (newest first):

I’ve been looking at CMIS 1.0 from a functional perspective, along with some others in the community.  If I want to solve business problem X using a CMIS-based application, what do I need?  What would make things easier?  Using my experience from building the AIIM iECM demo and many discussions with others in the industry, I’ve come up with a few things that I’d like to see CMIS support in 2.0.

Note the assumed nature of the next version of the spec.  If there isn’t another version coming, then what is the point?

Advanced Metadata Support

There are a few gaps in the metadata model support.  This realization came about while trying to solve some business problems with CMIS with some others in the ECM space.  We were looking at how well CMIS might support the modeling of different knowledge domains.  We were also trying to think of some of the folksonomy tagging approaches in the world of Enterprise/Web 2.0.

Two areas for enhancement were readily identified:

  • Hierarchical Metadata: Roughly, this is where the value of one metadata field drives the values of one or more other metadata fields.  I personally dealt with this most often in the legal industry with the Client/Matter fields.  To support this now, you have to set and save the parent field and then run the lookup on the child field.
  • Tagging: Right now, CMIS does not support tagging.  You can have a repeating field, but it is assigned per object and doesn’t handle different people wanting to apply different sets of tags.  If I tag it one way and Sarah tags it another way, the resulting collection of tags should be aggregated and displayed as a small tag cloud.  The underlying ECM system would determine the weights for each value (Authority, number of tags…), but CMIS needs to support individual tagging and the retrieval of tags that includes weighting and the set of tags the current user has set.  This support would need to include strong querying capabilities.
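
To make the tagging request concrete, here is a small Python sketch of the behaviour I have in mind. It is not CMIS: the `TagStore` class and its methods are hypothetical, and weighting a tag by the number of people who applied it is just one assumption about how a repository might do it.

```python
from collections import Counter, defaultdict


class TagStore:
    """Hypothetical per-user tag storage with an aggregated, weighted view."""

    def __init__(self):
        # object id -> user -> set of tags that user applied
        self._tags = defaultdict(lambda: defaultdict(set))

    def tag(self, object_id, user, *tags):
        """Record tags for one user; tags are lower-cased so variants merge."""
        self._tags[object_id][user].update(t.lower() for t in tags)

    def cloud(self, object_id):
        """Aggregate everyone's tags; the weight is how many users applied each."""
        counts = Counter()
        for user_tags in self._tags.get(object_id, {}).values():
            counts.update(user_tags)
        return dict(counts)

    def mine(self, object_id, user):
        """Return only the tags the given user has set on the object."""
        return set(self._tags.get(object_id, {}).get(user, set()))


store = TagStore()
store.tag("doc1", "me", "contract", "2009")
store.tag("doc1", "sarah", "Contract", "legal")
```

Here `store.cloud("doc1")` gives the weighted set for the tag cloud, while `store.mine("doc1", "sarah")` gives just Sarah's tags, which are the two retrievals described above.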

Okay, we have some areas for improvement, but now we need to look at the gaps in managing these domain models.

Managing Metadata Models

One thing I learned from the AIIM demo was the need to be able to define and query a logical domain model that was distinct from the implemented physical model on the repository side.  There is some initial support there with the localName representing the repository’s name for a field and the queryName representing the logical name, but no way to collectively manage them.

The issue is that while there is some support for a localNamespace, there is no corresponding queryNamespace.  I cannot create and plug in a custom domain model.

Let’s think about this for a second.  Let’s take the AIIM demo domain model, AIIMContent.  Let’s say I more thoroughly defined it and published it, creating some URI.  Vendors, or integrators, should be able to create a back-end implementation for the model for any desired vendor.

Then, using CMIS, I could query to see if the AIIMContent domain model is supported by the current repository.  If it is, then I can create the supported object type(s) and query the objects using the universal names (queryName) of the metadata attributes without knowing any of the implementation details in the underlying repository.

This feature is not there.  When the AIIM demo was created, I had to map, in my code, each attribute name.  All three vendors implemented the model with different naming conventions within their repository.  This support for common domain models would solve that problem and allow the developers of Content Applications to do that much less grunt work in their code.
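
The grunt work looks roughly like the Python sketch below. It is not the demo code: the vendor keys and every type and attribute name in `MAPPINGS` are invented for illustration, and the query string is only a simplified SQL-like shape.

```python
# Hypothetical per-repository mappings from logical names to the names each
# vendor actually used. None of these identifiers come from a real product.
MAPPINGS = {
    "vendor_a": {"type": "aiim_content", "title": "aiim_title", "author": "aiim_author"},
    "vendor_b": {"type": "AIIMContent", "title": "Title", "author": "AuthorName"},
}


def build_query(repository, select, where):
    """Translate logical attribute names into one repository's names and
    assemble a SQL-like query string from them."""
    names = MAPPINGS[repository]
    columns = ", ".join(names[attr] for attr in select)
    # Values are inserted verbatim; a real application would need to escape them.
    conditions = " AND ".join(
        f"{names[attr]} = '{value}'" for attr, value in where.items()
    )
    return f"SELECT {columns} FROM {names['type']} WHERE {conditions}"
```

The same logical request, `build_query(repo, ["title"], {"author": "Pie"})`, produces a different query for each repository. With a shared, published domain model, the mapping table would not have to live in application code at all.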

Oh, and the support for an “Aspect” type of domain model does not exist either, but I consider that a 3.0 request.

Supporting Semantic Models Through Relationships

One advanced feature that Florent didn’t discuss in detail was the concept of Relationships.  This is basically relating two content objects to each other with a defined relationship.

Here is where the need for relationships hits the modeling world: Semantic Modeling.  Looking at the different scenarios, you can actually create and navigate a nice semantic model using relationships.  The only point of failure is the lack of query support.

That’s right. I can create all the relationships in the world, but I can’t include them in a query. I can create a concept as a content-less object and create a relationship from content to that concept, but I can’t create a solid way to retrieve that information without browsing.

Problem.
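
To show what "browsing" means in practice, here is a tiny Python sketch. The relationship records and the `content_about` function are hypothetical; in a real application the records would first have to be fetched from the repository, object by object.

```python
# Hypothetical relationship records: (source object, relationship type, target object).
RELATIONSHIPS = [
    ("doc-1", "is_about", "concept-contracts"),
    ("doc-2", "is_about", "concept-litigation"),
    ("doc-3", "is_about", "concept-contracts"),
    ("doc-3", "supersedes", "doc-1"),
]


def content_about(concept_id, relationships):
    """Find the content linked to a concept by walking every relationship on
    the client side, because the link cannot be expressed in a query."""
    return sorted(
        {source for source, kind, target in relationships
         if kind == "is_about" and target == concept_id}
    )
```

`content_about("concept-contracts", RELATIONSHIPS)` gets the answer, but only after pulling all the relationships over to the client. With query support, that filter could run inside the repository instead.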

Is That All?

I haven’t even touched on Transactions or Identity Management. Those are whole other balls of wax.  Look around and throw business solutions out there and figure out how to implement them using only CMIS as an interface. Come up with your gaps.

Keep thinking on it. CMIS 1.0 is an important first step, but we need to make sure that 2.0 continues the progress by solving a broader set of problems.

The importance of good metadata

Posted in Information Architecture, metadata, usability on September 18th, 2009 by Philippe Parker – Comments Off

There’s been some debate on the role of metadata in content management: is metadata the future of content management, an integral part of the content, or are we making an artificial distinction?

Let’s start by setting aside the technological issues, because these are largely irrelevant. Metadata may be stored in a database separate from a file, or in a distinct table, or marked up differently, but this isn’t the determining thing that makes it different from other data or content held in a system. It can even be an inherent part of a document. What makes metadata different is how it’s used.

Metadata is used for classification. It’s used to relate one piece of content to another and to help people and systems find relevant information. If it’s not serving that purpose then it’s not metadata.

This may lead you to the conclusion that I’m saying everything is metadata. But it’s not. Some content that is marked up as metadata isn’t really metadata at all, or is at best poor metadata. Take a look at the UK’s National DNA Database, for example. This database records ethnicity and skin colour as a way to search for people, but one person’s view of their ethnicity may not be shared by another person. This disparity has effectively rendered this metadata set useless. The records on ethnicity and skin colour are potentially useful as content, but unreliable as metadata.

So if you’re looking to define metadata types and corresponding taxonomies for your content, you have to consider how those doing the classification will apply the metadata and how other people are going to use that classification. If it’s not useful, it’s not metadata. If it is useful, you’ll be on your way to managing your content.