As we at Silverchair and Semedica see more and more interest in automated tagging solutions (such as our Tagmaster system), we are more frequently encountering questions about how to evaluate their results. Here are a few ideas on the subject:
Evaluation: Humans Required!
It is hard to get around the fact that you will need human editors (or professional indexers) and your human technology team (who will use the tags to create interesting new features) to verify that an automated system is working correctly and that the tagging is accurate and useful.
Recently, someone asked our CEO Thane Kerner if we had an automated system to verify the accuracy of our automated tagging. Thane replied (rather cheekily, I must say): “If we had an automated review system that could measure tagging accuracy more precisely than the current tagging system, we wouldn’t use it to verify tags, we’d use it to tag the content to begin with!” The lesson: Once you’ve deployed your best automated system to do the tagging, humans are the next logical reviewers.
Here are four factors your humans should consider in their review:
1. Expert/Editorial Accuracy Confidence
One key target for evaluation is to assess how much confidence your key stakeholders (journal boards, editors, etc.) express in the output of the system. But confidence is not a linear equation. I posit the following values:
- Impeccable tag placement: +1
- Debatable tag placement: −1
- Debatable tag omission: −1
- Obvious tag omission: −10
- Obvious irrelevant tag placement: −50
The first thing you’ll notice is the imbalance between the positive and negative weights. In high-stakes fields (including science and medicine), humans naturally weigh negative experiences more heavily than positive ones. (Of course, this has served us well in survival: “Don’t eat that type of berry again, it made you sick last time!”) What that means in terms of confidence is that stakeholders will need a disproportionate amount of positive reassurance to get over negative outcomes. And the impact of a particularly egregious negative outcome (resulting from a particularly poorly placed tag) can be devastating to your stakeholders’ impression of a tagging system. (This is why Silverchair’s system defaults to using conservative methods with very little “guessing” to avoid obvious irrelevant tag placement.)
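To make the arithmetic concrete, here is a minimal Python sketch that totals a review using the weights posited above. The category keys, the `confidence_score` function, and the sample review counts are illustrative assumptions, not part of any Silverchair tool.

```python
# Weights come from the list above; the key names are illustrative.
WEIGHTS = {
    "impeccable_placement": 1,
    "debatable_placement": -1,
    "debatable_omission": -1,
    "obvious_omission": -10,
    "obvious_irrelevant_placement": -50,
}


def confidence_score(review_counts):
    """Sum the weighted outcomes recorded by human reviewers."""
    unknown = set(review_counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown review categories: {sorted(unknown)}")
    return sum(WEIGHTS[category] * count for category, count in review_counts.items())


# Hypothetical review of 100 tagging decisions.
sample_review = {
    "impeccable_placement": 95,
    "debatable_placement": 3,
    "debatable_omission": 1,
    "obvious_irrelevant_placement": 1,
}

print(confidence_score(sample_review))  # 95 - 3 - 1 - 50 = 41
```

Under these weights, a single obviously irrelevant tag cancels out fifty impeccable ones, which is the asymmetry the list is meant to capture.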
2. Usefulness!
The next key target for evaluation, for both editorial and technical stakeholders, is the usefulness of the tagging applied. Tags should be highly relevant in a domain-specific context, and they should drive better discoverability and linking. Primary care, genetics, surgery, and emergency care all take very different approaches to the same topics, and their tagging should reflect their uses.
The tagging system you are evaluating may have added tagged concepts that are tangential or irrelevant to the use model of the content, and such tags would not be capable of driving innovative site features (in many cases, tangential tagging actually inhibits the ability of new systems to work effectively). For example, it is a nice-to-have if your tagging system can recognize place names and person names, but if it misses or miscategorizes important topics like clinical trial names, it doesn’t matter how many people or places it can tag. (Clinical trial acronyms can be particularly tricky to tag; see our post about them.)
3. Granularity
Does the system still work with “documents” or can it identify topics down to the section/paragraph/figure/table/equation level? At Silverchair we work with many dense medical chapters that may cover more than 200 distinct topics, so we see it as a necessity for our tagging system to break those documents down into smaller parts in order to deliver precise packets of highly relevant information to our users.
4. Control and Ongoing Improvement
Any system selected is not going to be extremely accurate “out-of-the-box.” (I write that as a realist, not as a pessimist!) So during evaluation you must ask, “How easy is it to make impactful positive changes to the system?” This can take a variety of methods—some systems suggest manually selecting training documents for each topic or category (which can get onerous when you have 20,000 topics), some systems allow your software developers to go in and tinker with the code (you have data classification expert software developers, right?!?), and some systems allow you to load and use a taxonomy or thesaurus to aid in topic identification and tagging (assumes a taxonomy/thesaurus exists or can be created for your domain).
At Silverchair, we work primarily in medicine, which is a taxonomy-rich domain with an ever-growing list of topics. For that reason, we’ve chosen the last method as our control and improvement strategy. Our editors update our Cortex medical taxonomy and its related thesaurus every day to keep pace with the topics being written about and searched for.
Summary
If you choose a system that 1) is accurate enough to instill confidence in your editorial team, 2) is useful enough to drive meaningful new features and improvements, 3) classifies your data at a granular level, and 4) is flexible enough to allow explicit control and ongoing improvements, then you’ve made a wise purchase!
Clinical trials are popular targets of searches in medical journals. To deliver accurate search and browse results for them, semantic tagging and a semantic search engine are essential.
The names of clinical trials are often long and unwieldy, as they try to describe the focus and mission of the trial in their name—for example, a clinical trial studying drug treatment of high cholesterol is “Arterial Biology for the Investigation of the Treatment Effects of Reducing Cholesterol 6–HDL and LDL Treatment Strategies.” Because of these long names, trials are more commonly known by their acronyms—in this case, “ARBITER 6–HALTS” trial—and no doubt their full names are being crafted to result in a catchy or apropos—or hopeful—acronym. For example, the acronym for the trial studying the effect of the drug Vytorin on cholesterol levels is “IMPROVE-IT.” (See this blogpost for some humorous trial names and acronyms.)
One of my pet peeves is the incorrect use of the word “acronym” to mean any abbreviation for a term. Strictly speaking, an abbreviation is an acronym only when it spells a word or forms a combination of letters that people can pronounce as a word. So yes, abbreviations of clinical trials are acronyms, and ah, there’s the rub for commonly used full-text nonsemantic search engines: a full-text search engine treats them like any other word.
So yikes: a PubMed search for “JUPITER” (the acronym for the trial “Justification for the Use of Statins in Prevention: an Intervention Trial Evaluating Rosuvastatin”) delivers the first two results correctly, but the third result appears because the institution that issued the paper is in Jupiter, Florida! OK, so yes, the PubMed search box tries to help you by suggesting “Jupiter trial” (98 results) … but it also suggests “Jupiter study” (257 results). People, the JUPITER trial and the JUPITER study are exactly the same thing to any searcher wanting to know about JUPITER. The number of results should be the same for both searches. And nobody searching PubMed for JUPITER wants to know more about Jupiter, Florida. Trust me.
We can do better. At Silverchair, our Cortex taxonomy contains a list of clinical trials and the accompanying thesaurus includes their acronyms, so when our tagging and retrieval systems encounter those concepts, we’re able to separate them from their normal English language counterparts and tag them correctly. Yet another benefit of an automated tagging system supported by a robust and up-to-date medical thesaurus. It understands medical information and the health care professionals who depend on it so that we can give them results, not guesses.
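For the curious, here is a rough Python sketch of the general idea of thesaurus-backed tagging. It is not how Cortex works internally: the `TRIAL_THESAURUS` entries, the `tag_trials` function, and the choice of case-sensitive matching as the disambiguation signal are all assumptions made for illustration.

```python
import re

# Hypothetical thesaurus entries: preferred concept -> surface forms.
TRIAL_THESAURUS = {
    "JUPITER trial": [
        "JUPITER",
        "Justification for the Use of Statins in Prevention: "
        "an Intervention Trial Evaluating Rosuvastatin",
    ],
    "IMPROVE-IT trial": ["IMPROVE-IT"],
}


def _compile(surface_form):
    # Case-sensitive on purpose (an assumption of this sketch):
    # "JUPITER" should not match "Jupiter, Florida".
    return re.compile(r"(?<![\w-])" + re.escape(surface_form) + r"(?![\w-])")


PATTERNS = [
    (concept, _compile(form))
    for concept, forms in TRIAL_THESAURUS.items()
    for form in forms
]


def tag_trials(text):
    """Return the sorted trial concepts whose surface forms appear in text."""
    return sorted({concept for concept, pattern in PATTERNS if pattern.search(text)})


print(tag_trials("Rosuvastatin lowered event rates in the JUPITER study."))
print(tag_trials("The institute is located in Jupiter, Florida."))
```

Because “JUPITER trial” and “JUPITER study” both resolve to the same concept, both queries would return the same result set. Case alone is a weak signal (an all-caps title would defeat it), so a production system would need more context than this sketch uses.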
As we were setting up a new external SAN (storage area network) on the Silverchair production web farm recently, the network engineer said something that caught my attention: “The web servers will be able to use the external SAN drives faster than their own internal memory.” At first that defied my expectations of “internal vs. external,” but when I thought about it more, it made perfect sense.
The web servers are designed to execute application logic, store session tracking data, handle user interaction input, and synthesize, parse, and display data from a variety of sources—they are logic processing engines that handle data storage only when necessary. On the other hand, the SAN has one purpose—to store a large amount of data and enable a super-efficient data delivery channel that rapidly responds to content requests from the web servers.
The more I thought about it, the more I realized it was a fitting metaphor for how humans work. We are fantastic logic processing engines. We parse, synthesize, analyze, and use data input from a variety of sources to perform creative problem solving. And most importantly to this metaphor, we only store data internally when absolutely necessary. In the present day, the comprehensiveness and ubiquity of the Internet have allowed us to store an unprecedented amount of collective memory in external sources and access it from wherever we may be.
To be clear, human use of external memory did not arrive with the Internet—it has been around since the beginning of civilization. We are used to storing memory in external sources and freeing up our internal resources. Papyrus eliminated the need to memorize long epic poems. Abaci eliminated the need to memorize multiplication tables. (NB: Don’t try telling that to a 2nd grade teacher.) In modern medicine, drug handbooks store dosage and safety information that is too complex for doctors to memorize in toto. Phone numbers stored in our mobile phones eliminate the need to memorize the phone numbers of friends. We even store memories in our friends and family—I recently asked my wife, “What was the name of that hotel we liked in Chicago?” She knew, and voila, I had accessed my external memory successfully.
Alas, my comparison of human activity to Silverchair’s web farm breaks down at a key point. In many cases, accessing our external memory is not fast and efficient. Currently the external memory sources of humans are not deployed as efficiently as a SAN. Internet content sources can be hard to access, store content in highly variable forms, require a special vocabulary or technique to query, and return data in a way that does not suit our purpose.
This is the fundamental problem that Silverchair’s Semedica division addresses with semantic enrichment of data sources. We’re organizing a specific external memory category (in our case, online medical and health care information) in a way that allows it to be accessed more quickly and to return data in the right form for efficient use by clinicians and researchers. The less data that health care workers need to store internally, the more of their “processing time” can be used toward envisioning creative solutions for preventing and curing diseases. That is something that the Internet cannot do. (Yet.)
Many thanks to colleague Jake Zarnegar for pointing me toward Slate columnist Michael Agger’s Google Suggest contest.
I’m sure you’ve experienced Google Suggest in action: as you type into the search box, Google offers suggestions that change dynamically as you type each letter of your query. The suggestions are sometimes spookily on target but many times flat-out inappropriate.
You’ll find many examples of Google Suggest inappropriateness documented online. But the Slate contest took a different angle, challenging readers to explore the different suggestions made in response to a “less intelligent” Google query versus a “more intelligent” one. The winners are in:
The winning entry … follows Google Suggest into the realm of moral inquiry. It doesn’t neatly divide into “less intelligent” and “more intelligent,” but it’s the best example I received of how one word can make all the difference. [Is it wrong to…] involves love affairs, God, and younger men. [Is it ethical to…] puts us on the plane of animal research, privacy concerns, and cooking the books.
Putting aside the entertainment and cultural value of Google Suggest, how does it work? Like most things Google, those details are vague:
Our algorithms use a wide range of information to predict the queries users are most likely to want to see. For example, Google Suggest uses data about the overall popularity of various searches to help rank the refinements it offers.
On the Silverchair SCM web content management platform, we also use autosuggest to aid searchers. But there’s no mystery about how it works. Once three characters have been typed into the search box, our search engine starts matching the query against the index of semantic tags that have been applied to that specific content set from our Cortex biomedical taxonomy. Suggestions become more precise with each query character typed, and because we are matching against only those semantic tags applied to the content, the search results set is always targeted and relevant. Our search engine also checks each query against a database of taxonomy equivalents—synonyms, abbreviations, jargon—to normalize the search query and expand it to cover all possible matches.

Silverchair search autosuggest
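The autosuggest behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the SCM implementation: the tag list, the equivalents table, the `suggest` function, and the use of prefix matching are assumptions. Only the three-character threshold and the idea of matching against applied tags and taxonomy equivalents come from the description.

```python
MIN_QUERY_LENGTH = 3  # suggestions start once three characters are typed

# Hypothetical semantic tags applied to one content set.
APPLIED_TAGS = ["Appendectomy", "Appendicitis", "Hypertension", "Myocardial infarction"]

# Hypothetical taxonomy equivalents: synonym or jargon -> preferred tag.
EQUIVALENTS = {
    "heart attack": "Myocardial infarction",
    "high blood pressure": "Hypertension",
}


def suggest(query, limit=5):
    """Suggest applied tags for a partially typed query."""
    q = query.strip().lower()
    if len(q) < MIN_QUERY_LENGTH:
        return []
    # Match only against tags actually applied to this content set.
    matches = {tag for tag in APPLIED_TAGS if tag.lower().startswith(q)}
    # Normalize synonyms and jargon to their preferred tag.
    matches |= {
        preferred
        for variant, preferred in EQUIVALENTS.items()
        if variant.startswith(q) and preferred in APPLIED_TAGS
    }
    return sorted(matches)[:limit]


print(suggest("ap"))        # too short, no suggestions yet
print(suggest("app"))       # both appendix-related tags
print(suggest("appendic"))  # narrows as more characters are typed
print(suggest("heart a"))   # synonym resolves to the preferred tag
```

Since every suggestion is a tag that exists in the content set, choosing one cannot lead to an empty results page.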
Because the content in the products Silverchair builds is tagged so granularly, we can often suggest a more precise term than many searchers start with. Our goals for autosuggest are to save time for users, speed them to the most relevant possible query, and return the most precise answer to their question. Try autosuggest for yourself on the AccessSurgery site we built for McGraw-Hill. Search autosuggest is just one of the many ways a robust taxonomy can promote content discovery.
(I wonder how many of you are running off to play with Google Suggest—a perfect Friday afternoon time-waster…)

Google Suggest
The NIH has rolled out their new RePORT (Research Portfolio Online Reporting Tool) web site for information on funding, grants, and NIH research. As someone who works on government grants and contracts, I’m happy with this new level of transparency and clarity as to which topics (and who!) are being funded. It is a big upgrade from the incumbent system, which was hard to navigate and understand.
The most useful area of the site to me is the categorical spending section. It really gives you an idea of NIH’s funding priorities—it offers over 200 categories of funding.
However, it still has ample room for improvement. Currently it is an alphabetical list that contains items that are hard to compare. Here are some example categories that are not equivalent in scope:
- Allergic Rhinitis (Hay Fever)
- American Indians / Alaska Natives
- Burden of Illness
- Cancer
- Cardiovascular
- Clinical Trials
- Conditions Affecting Unborn Children
- Gene Therapy
- Gene Therapy Clinical Trials
- Genetic Testing
- Genetics
Some are very specific (hay fever), some are broad (cancer), some are ambiguous (cardiovascular), some take a completely different approach than the dominant disease/condition approach (American Indians/Alaska Natives), and some seem to be repetitive.
With a bit of work, this information could be turned from its current flat list expression into a multilevel taxonomy that allows users to slice it up in the ways that appeal to them (conditions or target populations, for example). Silverchair does this for the Agency for Healthcare Research and Quality on their PSNet patient safety clearinghouse. A small amount of classification work can go a long way in creating valuable new features—NIH has proven that with their RePORT upgrade, but I’d like to see them go farther.
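As a rough illustration of the flat-list-to-taxonomy idea, here is a small Python sketch. The facet names and parent assignments are my own guesses for a handful of the categories listed earlier. They are not NIH's classification or the PSNet taxonomy, and `build_taxonomy` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical facet and parent assignments for a few RePORT categories.
CATEGORIES = [
    {"name": "Allergic Rhinitis (Hay Fever)", "facet": "Conditions", "parent": None},
    {"name": "Cancer", "facet": "Conditions", "parent": None},
    {"name": "American Indians / Alaska Natives", "facet": "Populations", "parent": None},
    {"name": "Clinical Trials", "facet": "Research methods", "parent": None},
    {"name": "Gene Therapy", "facet": "Interventions", "parent": None},
    {"name": "Gene Therapy Clinical Trials", "facet": "Interventions", "parent": "Gene Therapy"},
]


def build_taxonomy(categories):
    """Group a flat category list into facet -> top-level category -> children."""
    tree = defaultdict(dict)
    for cat in categories:
        if cat["parent"] is None:
            tree[cat["facet"]].setdefault(cat["name"], [])
    for cat in categories:
        if cat["parent"] is not None:
            tree[cat["facet"]].setdefault(cat["parent"], []).append(cat["name"])
    return dict(tree)


taxonomy = build_taxonomy(CATEGORIES)
for facet in sorted(taxonomy):
    print(facet, "->", taxonomy[facet])
```

Once categories carry a facet, a user can browse by condition or by population instead of scanning one alphabetical list, and the narrower categories nest under their broader parents.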
I’d be happy to help out with the NIH site, but I’m not sure what category that would be funded under…
I still encounter, with alarming frequency, STM publishing executives who have not yet grasped the implications of the digital revolution. There’s a sensibility that pigeonholes the web as yet another ancillary delivery tool—kinda like CD-ROMs. I am amazed to hear people in positions of serious authority (for now) declare their intentions to move incrementally to develop online businesses. “Let’s take it step by step and see what evolves,” I was recently told by a senior figure at a major scientific society.
Sorry, but it already has evolved, at least from a macro revenue perspective. If you are not persuaded merely by walking into a library at an institution of higher learning that education and research have gone digital, allow me to finish the job by providing some definitive evidence. As these data show, this is not in any way a peripheral phenomenon. It is a fundamental transformation of our industry. These data were culled from statistical reports created by the Association of Research Libraries (ARL) and the Association of College and Research Libraries (ACRL), and they vividly illustrate the trends that are sweeping the institutional customer base.
The first three tables show the trajectory of expenditures made for print materials and electronic materials in these libraries. (Obviously, in categories of this breadth, these proportions have an inverse relationship.)

Among all libraries surveyed, the proportions equalized in 2008; almost certainly, spending for electronic resources will outstrip print in 2009.
When the list is narrowed to institutions that offer doctoral programs, there is a significant shift toward electronic resources.

And finally, when we narrow it further, to focus exclusively on health science libraries, the contrast is even more stark: by 2007 (the most recent figures available), collection spending was already 2/3 electronic. (What must it be as I write, 2 years on?)

It is common knowledge that serials have been trending electronic since the turn of the last decade. But now, materials of ALL kinds are moving to electronic delivery. Take a look, for example, at the trends specifically for electronic books.

For many of us, this is not news. But the appropriate response is not to treat electronic platforms as peripheral or ancillary. This is obviously the future of the STM business.
Yesterday at Silverchair we put on our first-ever webinar, with the irreverent title “Does Your Search Suck? Transform It From Frustrating to Fantastic With Semantic Search & Browse.” We had quite a crowd, which tells you something about how people feel about search status quo. The webinar was as entertaining as it was informative—you’ll want to watch the recorded version. (Go to the Silverchair home page under THE FORUM and find the Click here to download a recording of the webinar link.)
So—why’d we tackle search first? In a nutshell, search equals money. Search that works makes your website more useful. More useful sites get more usage. And more usage translates to more revenue, or whatever metric you use to measure success.
Jake Zarnegar, President and CTO at Silverchair and a blogger here at It’s All Semantics, kicked things off by highlighting the causes and symptoms of bad search and how to treat them. He also alerted us to some things that complicate search but can’t be controlled:
- Search strings are short and primitive.
- No one reads search help (“learns” your search).
- STM user groups are small and highly specialized.
- Professional terminology is rapidly expanding and less standard at the margins.
- Professional users demand high accuracy and will abandon your search if they lose confidence in it.

Search—The Good, The Bad & The Ugly (slide courtesy of Jake Zarnegar)
Jake left us with some great advice:
- Recognize that ambiguity is the mortal enemy of computer logic.
- Don’t keep secrets from your search engine.
- Create a distilled logical layer of your content’s meaning (semantic tagging guided by a taxonomy).
- Improve search every day.
To put Jake’s presentation in real-world perspective, Matthew O’Rourke, Editorial Director of Journal Watch, followed up with a talk about why Silverchair’s semantic search solution was implemented at Journal Watch.
Matthew taught us that the relatively short length of journal articles (compared to War and Peace, for example) offers fewer clues for search engines; add the complexity of STM content into the mix and searching gets especially tricky. Silverchair’s semantic search solves the problem by using semantic tagging to precisely mark and normalize equivalent concepts and make them findable no matter what term the author chose to describe them and searchers used to find them.
For example, in “old” Journal Watch search, a query for the very rare maple syrup urine disease returned 14,341 results, misleading users to assume that there is “an absolute epidemic of maple syrup urine disease” (to quote Matthew from one of the funniest moments in the webinar). Problem? The old search engine searched for “disease” separately from the complete phrase. With Silverchair’s semantic search in place, the results set includes 1 article, the only article covering this rare condition in Journal Watch.
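The failure mode is easy to reproduce in miniature. The Python sketch below compares a naive any-word search with a search that treats the query as one unit. The three documents and both functions are invented for illustration, and plain phrase matching stands in here for the concept-level semantic tagging that Silverchair's search actually relies on.

```python
import re

# Hypothetical mini-corpus.
DOCS = {
    1: "Newborn screening detects maple syrup urine disease early.",
    2: "Statins reduce the risk of cardiovascular disease.",
    3: "Chronic kidney disease and urine protein levels.",
}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def any_term_search(query, docs):
    """Naive search: a document matches if it contains ANY query word."""
    terms = set(tokens(query))
    return sorted(doc_id for doc_id, text in docs.items() if terms & set(tokens(text)))


def phrase_search(query, docs):
    """Phrase search: the query words must appear together and in order."""
    q = tokens(query)
    if not q:
        return []
    n = len(q)
    hits = []
    for doc_id, text in docs.items():
        t = tokens(text)
        if any(t[i:i + n] == q for i in range(len(t) - n + 1)):
            hits.append(doc_id)
    return sorted(hits)


query = "maple syrup urine disease"
print(any_term_search(query, DOCS))  # every document mentioning "disease" or "urine"
print(phrase_search(query, DOCS))    # only the document about the condition itself
```

Scale the first behavior up to a full archive of medical summaries, where “disease” appears almost everywhere, and a five-figure result count for a rare condition is no surprise.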
Matthew also reminded us that the customer is always right: Users don’t make search mistakes, publishers just give bad results.
_________
That little empty search box sits between your users and your content. You can have the best content in the world behind the search box, but if people can’t find your content with search, they won’t give you their attention—or their business.
[You can view the recorded webinar by going to the Silverchair home page under “The Forum”; see the Click here to download a recording of the webinar link.]

Tech visionary Esther Dyson recently gave some advice to Yahoo that has relevant nuggets of gold for STM content providers striving to deliver accurate, complete, and relevant search results to their customers. She recommended that Yahoo go back to its past to secure its future by offering a hybrid machine-human indexing and categorization solution to the “TMI” (too much information) problem:
You can’t rely on human editors to structure information anymore; you need automated tools, augmented by human expertise and specific domain knowledge. But search alone doesn’t work, either: Search is like a flashlight in a dark room; it pinpoints one or two things but leaves the surrounding space murky. What people really want is a lighted room, with things organized and displayed neatly on labeled shelves.
We couldn’t have said it better ourselves!
I just left the SSP IN Conference in Providence, RI, a new meeting with a very different format (it has replaced their former September meeting, Top Management Roundtable). At SSP-IN, participants broke into 8 teams representing various types of STM publishing organizations (e.g., Society, Social Media Start-up, Foundation, University, Search Engine, etc.). I was afforded the privilege of serving as Leader for the “Large Commercial Publisher” team. We each assumed our personae upon arrival Wednesday and stayed in character for the entire conference. Each of the groups was assigned a set of organizational characteristics and assets, which we used as the foundation for exercises that led us through a strategy review, a product development effort, and a go-to-market plan. We were provided with a set of tools (a guide to our deliverables, a conference wiki, etc.) and then sent into small group sessions to work through these business plans. Results from each group for each stage of exercise were shared with the entire conference audience.
My Silverchair colleagues who attended (Elizabeth Willingham and Pam Harley) agreed with the sentiment, expressed by virtually every participant we spoke with, that this novel conference was fun, energizing, and useful. No sneaking off to skip sessions—one’s active engagement was required throughout. The opportunity to work through a series of ideas by means of small group discussions among fellow STM experts was quite effective, and while some of the resulting product strategies may be overly ambitious, ideas developed by each of the teams are worthy of serious evaluation and development in our real-world organizations.
Unless we all fell victim to mass hypnosis (or acute groupthink), it was instructive to observe several definitive trends that emerged contemporaneously from the various teams. First off, taxonomy strategies featured heavily in many of the proposed solutions, either as a key enabling infrastructure or an end-product (or both, as in the case of our faux-publisher, Van der Prophett NV). Second, the development of robust social communication tools (defined well beyond Web 2.0 buzzwords) was essential to most of the groups. “Content-enabled Social Media Network” (I think that coinage belongs to Mike Beveridge at AACR, who was on our team) seems to capture the concepts best, and each team had a dimension of this notion in its offering.
It struck me as fascinating that there was such a consensus of vision, and a set of conceptually compelling products, very little of which is likely to be executed (well, and seriously) anytime soon by the major publishing organizations. It’s more probable that startups from outside the traditional industry will make some of these concepts into successful, market-redirecting services. Why is the deep thinking being done in the absence of “low technology/high authority” (thanks Kent Anderson, NEJM) executives who could marshal the necessary resources (and make the attendant difficult structural decisions) to execute these plans?
Top-selling books for the search phrase “medical terminology” on Amazon:
- Medical Terminology: A Short Course
- Quick Medical Terminology
- Medical Terminology: The Basics
- Medical Terminology Simplified
- Medical Terminology for Dummies
Anyone else sensing a theme? Considering that the Unified Medical Language System (UMLS) has more than 2,000,000 terms, I’m not surprised simplicity is in demand.
For publishers, taking measures to make medical terminology (and life) easier for health care professionals has a direct payback [re-read list, above]. Consider it mission critical!
