Tag-Based Metadata Management for Information Integration
Miika Nurminen, 8.11.2010

During the 2000s, different kinds of "Web 2.0" or social web applications have become increasingly popular. Despite the growing publicity, there is nothing essentially new here: various community-driven services, such as message forums, newsgroups, and even social networking, have been available and used for years, even before the WWW. A common theme for social applications is a form of public, distributed metadata that is created and maintained by the user community and accessible through a public API. Metadata enables the social aspect regardless of the particular application or content. It may include e.g. time, location, or other structured metadata; tags or other semi-structured or unstructured metadata; discussions or ratings related to the actual content; or social relations between users.

Tags are of specific interest in this thesis because they allow users to annotate content items with multiple keywords in an ad hoc way, in contrast to traditional classification (cf. an e-mail client or a directory-based filesystem), where items are "filed" under a single category or into a predefined structure. When tags from different users are aggregated, a folksonomy emerges, reflecting a collective view of the most relevant concepts related to the resource. Tags can be used for various purposes, such as topic information, content type, owner information, category refinement, characteristics, or task organization in a business process.

While tags provide a flexible search and content aggregation interface, there are numerous issues when they are used in a systematic way: even though it is relatively easy to aggregate content from different users (e.g. search by tag) or even across different services (e.g. Technorati blog search), there is no guarantee that the tags are used in a consistent manner.
Furthermore, different persons may apply different classification schemes (or even use tags in unforeseeable ways not related to classification), in essence creating their own personal ontology with a specific vocabulary. The number of tags used per person also differs dramatically: depending on the service, most persons use very few tags (or none at all), while a minority might use them excessively. If not properly assisted by the software application, tagging might be perceived as tedious, time-consuming, and error-prone. Especially with tags that use words found directly in the actual content, the advantage might be small, since web search engines would find the document with relatively little effort regardless of the tags used. However, contextual tags that support the working process of the user (e.g. bookmarking documents related to a specific project instead of a topic), as well as the use of tags as an integrated search interface across different kinds of services, might prove more useful. In addition, if tags were allowed to have some explicit structure (e.g. BibSonomy), they could eventually be used as an easily accessible bridge to traditional, structured search engines, or to represent ontological data in a lightweight manner.

The goals of this thesis are as follows:

- Defining a generalized information retrieval model for tagging systems that allows hierarchical tags (in a similar way to faceted classification systems, producing a directed acyclic graph or a lattice), optional simple datatypes for tagging (e.g. an explicit search mechanism for tags like "publicationYear:2010", or other spatiotemporal information), and explicit contextualization (e.g. definition of synonyms and matching between different "tag schemas" based on tags used in different services or by different persons).
- Developing techniques to integrate tags and other metadata from different kinds of data sources (web services, databases, documents in local filesystems) such that a single search interface can be used to retrieve data, manage tags (e.g. one interface to define tag hierarchies, rename tags, find duplicates, suggest new tags, etc.), and present the data in different contexts (faceted search and profiled multichannel publishing).
- Utilizing computational methods (e.g. conceptual clustering, object consolidation, retrieval fusion) to merge closely similar tags (based on the metadata or content) from different contexts (e.g. services or users) and to match tags with external sources, such as RDF metadata, to enable semantic search.

The research will utilize the following methods:

- Systematic literature review based on current research on tagging systems
- Critical evaluation of essential tagging-like services and software, including examination of the tags and content used in these services
- Construction and evaluation of a new prototype service that utilizes the proposed tagging model and integration capabilities, with a focus on personal information management.
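
To make the first goal more concrete, the following is a minimal Python sketch of how hierarchical tags, typed tags such as "publicationYear:2010", and synonyms might be combined in one retrieval model. It is an illustration under assumptions, not the model proposed in the thesis: the names TagIndex, parse_tag, add_parent, add_synonym, search, and search_typed are hypothetical, the hierarchy is assumed to be a directed acyclic graph given by explicit child-to-parent links, and everything is held in memory.

```python
from collections import defaultdict


def parse_tag(raw):
    """Split a raw tag such as 'publicationYear:2010' into (name, value).

    Plain tags get value None; all-digit values are converted to int.
    (Hypothetical syntax, modelled on the example tag in the text.)
    """
    name, sep, value = raw.partition(":")
    if not sep:
        return raw, None
    return name, int(value) if value.isdigit() else value


class TagIndex:
    """Toy in-memory index with a tag hierarchy, synonyms, and typed tags."""

    def __init__(self):
        self.parents = defaultdict(set)  # tag -> set of broader tags
        self.synonyms = {}               # alias -> canonical tag
        self.items = defaultdict(set)    # canonical tag -> item ids
        self.typed = defaultdict(dict)   # tag name -> {item id: value}

    def canonical(self, tag):
        return self.synonyms.get(tag, tag)

    def add_parent(self, child, parent):
        # A tag may have several parents, so the hierarchy is a DAG.
        self.parents[self.canonical(child)].add(self.canonical(parent))

    def add_synonym(self, alias, tag):
        canon = self.canonical(tag)
        self.synonyms[alias] = canon
        # Move items that were already filed under the alias.
        self.items[canon] |= self.items.pop(alias, set())

    def tag_item(self, item, raw_tag):
        name, value = parse_tag(raw_tag)
        if value is None:
            self.items[self.canonical(name)].add(item)
        else:
            self.typed[name][item] = value

    def descendants(self, tag):
        """Return the canonical tag plus every narrower tag below it."""
        root = self.canonical(tag)
        found, stack = {root}, [root]
        while stack:
            current = stack.pop()
            for child, parents in self.parents.items():
                if current in parents and child not in found:
                    found.add(child)
                    stack.append(child)
        return found

    def search(self, tag):
        """Items tagged with the tag, a synonym, or any narrower tag."""
        result = set()
        for t in self.descendants(tag):
            result |= self.items.get(t, set())
        return result

    def search_typed(self, name, predicate):
        """Items whose typed tag `name` has a value accepted by predicate."""
        values = self.typed.get(name, {})
        return {item for item, v in values.items() if predicate(v)}


index = TagIndex()
index.add_parent("folksonomy", "tagging")
index.add_synonym("folksonomies", "folksonomy")
index.tag_item("doc1", "tagging")
index.tag_item("doc2", "folksonomies")
index.tag_item("doc2", "publicationYear:2010")
index.tag_item("doc3", "publicationYear:2008")

print(sorted(index.search("tagging")))
print(index.search_typed("publicationYear", lambda year: year >= 2010))
```

In this sketch, a search for the broader tag also returns items filed under narrower tags or their synonyms, while typed tags are queried by value rather than by exact string match. A real system would additionally need cycle checks on the hierarchy and per-user or per-service tag schemas, which are left out here.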