Tag-Based Metadata Management for Information Integration 
Miika Nurminen, 8.11.2010

During the 2000s, different kinds of "web 2.0" or social web
applications have become increasingly popular. Despite the growing
publicity, there is nothing essentially new here: various
community-driven services, such as message forums, newsgroups, and even
social networking, have been available and used for years – even before
the WWW. A common theme for social applications is a form of public,
distributed metadata that is created and maintained by the user
community and accessible through a public API. Metadata enables the
social aspect regardless of the particular
application or content. This may include e.g. time-, location-, or other 
structured metadata, tags or other semi/unstructured metadata, 
discussions or ratings related to the actual content, or social 
relations between users. Tags are of specific interest in this thesis,
because they allow users to annotate content items with multiple
keywords in an ad hoc way, in contrast to traditional classification
(cf. an e-mail client or a directory-based filesystem), where items are
categorized or "filed" into a single category or a readily enforced
structure. When tags from different users are aggregated, a
folksonomy emerges, reflecting a collective view of the most relevant
concepts related to the resource. Tags can be used for various purposes,
such as topic information, content type, owner information, category 
refinement, characteristics, or task organization in a business process. 
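
The aggregation step can be illustrated with a minimal Python sketch. It
assumes that tag assignments are available as (user, resource, tag)
triples; the sample data and the function name build_folksonomy are
hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

# Hypothetical tag assignments as (user, resource, tag) triples.
assignments = [
    ("alice", "doc1", "metadata"),
    ("bob", "doc1", "Metadata"),
    ("bob", "doc1", "thesis"),
    ("carol", "doc1", "tagging"),
    ("alice", "doc2", "tagging"),
]

def build_folksonomy(triples):
    """Count, per resource, how many distinct users applied each tag."""
    seen = set()
    counts = defaultdict(Counter)
    for user, resource, tag in triples:
        key = (user, resource, tag.lower())  # case-insensitive comparison
        if key in seen:
            continue  # ignore a repeated assignment by the same user
        seen.add(key)
        counts[resource][tag.lower()] += 1
    return counts

folksonomy = build_folksonomy(assignments)
top_tags = folksonomy["doc1"].most_common(1)  # [('metadata', 2)]
```

Counting distinct users per tag, rather than raw assignments, is one
possible design choice; it keeps a single heavy tagger from dominating
the collective view.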

While tags provide a flexible search and content aggregation interface,
they raise numerous issues when used in a systematic way. Even
though it is relatively easy to aggregate content from different users
(e.g. search by tag) or even across different services (e.g. Technorati
blog search), there is no guarantee that the tags are used in a
consistent manner. Furthermore, different persons may apply different
classification schemes (or even use tags in unforeseeable ways not 
related to classification), in essence creating their own, personal 
ontology with a specific vocabulary. The number of tags used per person
also differs dramatically: depending on the service, most persons use
very few tags (or none at all), while a minority might use them
excessively. Without proper support from the software application,
tagging might be perceived as tedious, time-consuming, and error-prone.
Especially with tags that use words directly found in the actual content
data, the advantage might be small, since web search engines would find
the document anyway with relatively little effort, regardless of the
tags used. However, contextual tags that support the working process of
the user (e.g. bookmarking the documents related to a specific project
instead of a topic), as well as using the tags as an integrated search
interface across different kinds of services, might prove to be more
useful. In addition, if tags were allowed to have some explicit
structure (e.g. BibSonomy), they could eventually be used as an easily
accessible bridge to traditional, structured search engines, or to
represent ontological data in a lightweight manner.

The goals of this thesis are as follows: 
- Defining a generalized information retrieval model for tagging systems 
that allows hierarchical tags (in a similar way to faceted 
classification systems, producing a directed acyclic graph or a 
lattice), optional simple datatypes for tagging (e.g. explicit search 
mechanism for tags like "publicationYear:2010", or other spatiotemporal 
information), and explicit contextualization (e.g. definition of 
synonyms and matching between different "tag schemas" based on tags used 
in different services or by different persons) 
- Developing techniques to integrate tags and other metadata from
different kinds of data sources (web services, databases, documents in
local filesystems) such that a single search interface can be used to
retrieve data, manage tags (e.g. one interface to define tag
hierarchies, rename tags, find duplicates, suggest new tags, etc.), and
present the data in different contexts (faceted search and profiled
multichannel publishing).
- Utilizing computational methods (e.g. conceptual clustering, object
consolidation, retrieval fusion) to merge closely similar tags
(based on the metadata or content) from different contexts (e.g.
services or users) and to match tags with external sources, such as RDF
metadata, to enable semantic search.
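
As an illustration of the first goal, the following Python sketch shows
one way a tag model with hierarchical tags, simple typed tags, and
synonyms could be represented. It is a minimal sketch, not the proposed
model itself: the class TagModel, the function parse_tag, and all tag
names except "publicationYear:2010" are hypothetical.

```python
from dataclasses import dataclass, field

def parse_tag(raw):
    """Split 'name:value' into (name, value); plain tags get value None."""
    name, sep, value = raw.partition(":")
    if not sep:
        return raw, None
    # Assumption: all-digit values are treated as integers.
    return name, int(value) if value.isdigit() else value

@dataclass
class TagModel:
    parents: dict = field(default_factory=dict)   # tag -> set of broader tags
    synonyms: dict = field(default_factory=dict)  # alias -> canonical tag

    def add_parent(self, tag, parent):
        # A tag may have several parents, so the hierarchy is a DAG.
        self.parents.setdefault(tag, set()).add(parent)

    def canonical(self, tag):
        return self.synonyms.get(tag, tag)

    def ancestors(self, tag):
        """Return all broader tags reachable from the given tag."""
        result, stack = set(), [self.canonical(tag)]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in result:
                    result.add(parent)
                    stack.append(parent)
        return result

    def matches(self, item_tags, query):
        """True if an item tag equals the query tag or is narrower than it."""
        query = self.canonical(query)
        for raw in item_tags:
            name = self.canonical(parse_tag(raw)[0])
            if name == query or query in self.ancestors(name):
                return True
        return False

model = TagModel()
model.add_parent("folksonomy", "tagging")
model.add_parent("tagging", "metadata")
model.add_parent("folksonomy", "social-web")
model.synonyms["tags"] = "tagging"

item = ["folksonomy", "publicationYear:2010"]
found_broader = model.matches(item, "metadata")  # True, via the hierarchy
found_synonym = model.matches(item, "tags")      # True, via the synonym
year = parse_tag("publicationYear:2010")         # ('publicationYear', 2010)
```

In this sketch a query for a broad tag also retrieves items tagged with
any narrower tag, which is the behavior a faceted search interface would
typically rely on.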

The research will utilize the following methods: 
- Systematic literature review of current research on tagging
systems
- Critical evaluation of essential tagging-like services and software, 
including examination of tags and content used in these services 
- Construction and evaluation of a new prototype service that utilizes
the proposed tagging model and integration capabilities, with a focus on
personal information management.