Query Log Anonymization by Differential Privacy
Yang, Grace Hui
Web search query logs, which record the interactions between a search engine and its users, are valuable resources for Information Retrieval (IR) research. For years, such query logs have supported multiple IR applications and have significantly advanced IR research. However, releasing query logs without proper anonymization may lead to serious violations of user privacy. As a result, concerns about user privacy have become major obstacles preventing these resources from being made available for research use. This dissertation addresses the challenge of query log anonymization, in order to keep advancing IR research.

In particular, this dissertation presents my research on query log anonymization by differential privacy. Anonymization of query logs differs from that of structured data because query logs are generated from natural language, whose vocabulary is infinite. To mitigate the challenges in query log anonymization, I propose to use a differentially private mechanism to generate anonymized query logs that contain sufficient contextual information for existing web search algorithms to use and attain meaningful results. I empirically validate the effectiveness of my framework for generating usable and privacy-preserving logs for web search. Experiments show that it is possible to maintain high utility for this task while guaranteeing sufficient privacy.

In addition, this dissertation extends my research on query log anonymization to session data. My previous work on session search has shown that search sessions are essential resources for supporting complex IR tasks. Although researchers have recently proposed approaches to histogram-based data release of query logs, it remains unclear how session data in query logs can be released in a differentially private manner with meaningful utility. By proposing a differentially private query log anonymization algorithm to release session data, my research resolves this significant concern about how to properly release and use the session information of query logs. Moreover, I use two typical IR applications, query suggestion and session search, to examine the utility of the anonymized logs and the privacy-utility tradeoff of the session-based query log anonymization work.

In summary, by resolving concerns in both the privacy and utility aspects, this dissertation provides theoretical frameworks and practical implementations of query log anonymization by differential privacy. It serves as an important step towards an ultimate solution to the general challenge of data anonymization in real-world IR applications. I hope this work can not only benefit research on this particular task of query log anonymization but also inspire more research in privacy-preserving Information Retrieval (PPIR) and other data-driven research domains.
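As a rough illustration of the histogram-based style of release mentioned in the abstract, the sketch below adds Laplace noise to per-query user counts and keeps only queries whose noisy count clears a threshold. It is not the dissertation's algorithm: the function names, the parameters (`epsilon`, `max_per_user`, `threshold`), and the toy log are all assumptions made for this example. Because only queries present in the log can be released, thresholding over an open vocabulary generally yields an approximate (ε, δ) guarantee rather than pure ε-differential privacy, and choosing a threshold that achieves a target δ is not shown here.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def release_query_counts(user_queries, epsilon, max_per_user, threshold, seed=None):
    """Return noisy user counts for queries whose noisy count exceeds threshold.

    user_queries maps a (hypothetical) user id to that user's list of queries.
    Each user contributes at most max_per_user distinct queries, so adding or
    removing one user changes the count vector by at most max_per_user in L1.
    """
    rng = random.Random(seed)
    counts = Counter()
    for queries in user_queries.values():
        # Bound each user's contribution (assumed truncation rule: first
        # max_per_user distinct queries in sorted order).
        for query in sorted(set(queries))[:max_per_user]:
            counts[query] += 1

    scale = max_per_user / epsilon  # Laplace scale = sensitivity / epsilon
    released = {}
    for query, count in counts.items():
        noisy = count + laplace_noise(scale, rng)
        if noisy > threshold:  # suppress rare, potentially identifying queries
            released[query] = noisy
    return released


if __name__ == "__main__":
    toy_log = {
        "u1": ["weather", "flu symptoms"],
        "u2": ["weather"],
        "u3": ["weather", "my street address"],
    }
    print(release_query_counts(toy_log, epsilon=1.0, max_per_user=2,
                               threshold=2.0, seed=0))
```

A smaller `epsilon` or a larger `max_per_user` increases the noise scale, which is one simple way to see the privacy-utility tradeoff the abstract refers to.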