• Emerging Tech SIG: UIMA - Wednesday, March 8, 2006, 7:00 p.m.
    Cubberley Community Center
    4000 Middlefield Road, Room H-1
    Palo Alto, CA


  • The Monthly Meeting of the Emerging Technology SIG



    Dr. Daniel Gruhl, WebFountain Chief Scientist, IBM Almaden Research Center


    At IBM, we see the enterprise search and analysis market as a huge new area of growth. Our focus is not on the front-end search engine but on the back-end technology that, yes, does the search, but more importantly provides a mechanism for advanced analysis tools to access and process the results of that search, so meaningful information can be gleaned from it. This is where UIMA - our Unstructured Information Management Architecture - comes in.

    The "unstructured" part of the UIMA name refers to the fact that much valuable data lies outside of the traditional, nicely formatted, easy-to-comb-through database; instead, it sits in all kinds of text documents like e-mails, in still photos, in video and audio recordings, etc. And a great deal of that resides throughout the enterprise, not just out on the Web. UIMA allows all this material to be searched, but it does something else: it provides a standardized set of hooks that application developers can then write to allow their code to access those search results for analysis. That's why we refer to it as an architecture or "framework." And the applications of multiple developers can be built to complement and work with one another. So, UIMA is really intended to be a foundation for driving an ecosystem of analytical tool-makers for all kinds of businesses, industries, etc.

    But this ecosystem will only develop if the architecture or framework is widely adopted. We already have early customers and developers working with UIMA, but we think its reach needs to extend much further. We recognize that adoption will be limited if something is viewed as closed or proprietary. That's why we've published the source code on SourceForge.net, the largest open source development site.

    UIMA's power comes from its ability to string together a set of content analysis processes. By using a UIMA-defined common analysis structure (CAS) to both read content and write findings, different analysis engines can generate their own characterizations of unstructured data, whether that data is a document, image or video.
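
    As a rough illustration of the idea above, the sketch below models a shared analysis structure that several engines read from and write to in sequence. It is a toy model in Python, not the UIMA API: the names ToyCAS, Annotation, tokenize, tag_capitalized and run_pipeline are hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A finding attached to a span of the content (hypothetical type)."""
    kind: str
    begin: int
    end: int


@dataclass
class ToyCAS:
    """Toy stand-in for a common analysis structure: content plus findings."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]

    def select(self, kind: str) -> List[Annotation]:
        # Returns a new list, so engines can append while looping over it.
        return [a for a in self.annotations if a.kind == kind]


def tokenize(cas: ToyCAS) -> None:
    """First engine: reads the raw text and writes Token annotations."""
    for m in re.finditer(r"\w+", cas.text):
        cas.annotations.append(Annotation("Token", m.start(), m.end()))


def tag_capitalized(cas: ToyCAS) -> None:
    """Second engine: reads Token findings and writes Capitalized annotations."""
    for tok in cas.select("Token"):
        if cas.covered_text(tok)[0].isupper():
            cas.annotations.append(Annotation("Capitalized", tok.begin, tok.end))


def run_pipeline(cas: ToyCAS, engines: List[Callable[[ToyCAS], None]]) -> ToyCAS:
    """Run each engine in order against the same shared structure."""
    for engine in engines:
        engine(cas)
    return cas


cas = run_pipeline(ToyCAS("Daniel Gruhl works at IBM Almaden"),
                   [tokenize, tag_capitalized])
capitalized = [cas.covered_text(a) for a in cas.select("Capitalized")]
```

    The second engine never touches the raw text directly; it builds on what the first engine wrote. In this toy model, that is the benefit of having every engine read and write one common structure.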

    This is important because no single characterization technology - whether machine learning, statistical or rule-based natural language processing (NLP), or ontologies, to name a few - works well in all situations. Instead, different search and categorization technologies are typically good at different tasks - one might be perfect at extracting entities, such as personal names or physical locations, while another is better at recognizing text in video content. This technological specialization has limited search and categorization for years.

    UIMA finally lets these disparate technologies work together. For example, by using UIMA, three very different analysis engines can now all analyze the same content their own way and pass their combined findings to a common database.
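
    The example in the paragraph above, three different engines analyzing the same content and writing to one database, might look roughly like the following. This is a simplified sketch under stated assumptions, not UIMA code: the three engine functions, the findings table and the sample sentence are all invented for illustration, and an in-memory SQLite database stands in for the "common database."

```python
import re
import sqlite3
from typing import Callable, Dict, List, Tuple

Finding = Tuple[str, str]  # (kind, value); a hypothetical, simplified record


def rule_based_years(text: str) -> List[Finding]:
    """Rule-based engine: four-digit numbers starting 19 or 20 count as years."""
    return [("year", y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]


def dictionary_orgs(text: str) -> List[Finding]:
    """Dictionary engine: looks up names from a small hand-made list."""
    known = ["IBM", "MIT"]
    return [("organization", n) for n in known if re.search(rf"\b{n}\b", text)]


def statistical_top_word(text: str) -> List[Finding]:
    """Crude statistical engine: most frequent word of five or more letters."""
    counts: Dict[str, int] = {}
    for word in re.findall(r"[a-z]{5,}", text.lower()):
        counts[word] = counts.get(word, 0) + 1
    if not counts:
        return []
    # Sorting first makes ties resolve alphabetically.
    best = max(sorted(counts), key=lambda w: counts[w])
    return [("top_word", best)]


def analyze_and_store(text: str,
                      engines: List[Callable[[str], List[Finding]]],
                      db: sqlite3.Connection) -> None:
    """Run every engine on the same text and pool the findings in one table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS findings (engine TEXT, kind TEXT, value TEXT)"
    )
    for engine in engines:
        rows = [(engine.__name__, kind, value) for kind, value in engine(text)]
        db.executemany("INSERT INTO findings VALUES (?, ?, ?)", rows)
    db.commit()


db = sqlite3.connect(":memory:")
text = "In 2006 an IBM researcher described analysis engines that share analysis results."
analyze_and_store(text, [rule_based_years, dictionary_orgs, statistical_top_word], db)
stored = db.execute("SELECT kind, value FROM findings ORDER BY kind").fetchall()
```

    Each engine works its own way and knows nothing about the others; only the record format and the destination table are shared.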



    Dr. Daniel Gruhl, WebFountain Chief Scientist, IBM Almaden Research Center

    Dr. Daniel Gruhl is a researcher at IBM's Almaden Research Center. He earned his Ph.D. in Electrical Engineering from the Massachusetts Institute of Technology in 2000, with thesis work on distributed text analytics systems. His interests include steganography (visual, audio, text and database), machine understanding, user modeling and very large scale text analytics.

    Dr. Gruhl is the chief scientist for WebFountain, a web-scale text analytics research technology, with responsibility for overall hardware, software and systems design. He is also co-architect of IBM's Unstructured Information Management Architecture.


    Event Logistics


    Cubberley Community Center
    4000 Middlefield Road, Room H-1
    Palo Alto, CA


    7:00 - 7:20 p.m. Registration / Networking / Refreshments / Pizza
    7:20 - 9:00 p.m. Presentation


    $15 at the door for non-SDForum members
    No charge for SDForum members
    No registration required
