Keyword selection is critical to text analysis research yet often overlooked, and people do it poorly because of inherent bias and suboptimal methods. This paper presents a novel computer-assisted framework designed to address these limitations.
### The Challenge: Subpar Human Keyword Selection
Researchers frequently underestimate the complexity of choosing effective keywords from large unstructured text datasets. General-purpose tools such as Google searches are a poor fit for political science applications, where retrieving a precise set of relevant documents is essential.
### Our Solution: Leveraging Classifier Errors
We introduce a statistical method that trains classifiers on available text and then systematically analyzes their misclassifications to identify meaningful search terms. The approach needs no structured data as input, and it extracts information from the classifiers' mistakes rather than trying to correct them.
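
As a rough illustration of the idea, the sketch below assumes a reference set of documents already known to be on topic (here, toy headlines containing the seed keyword "bombing") and a search set of other documents. It trains a small naive Bayes classifier to tell the two apart, treats search-set documents that the classifier "mistakes" for reference documents as likely on-topic, and ranks words by how well they separate those documents from the rest. The toy data, the naive Bayes model, the scoring rule, and the function names (`train_nb`, `predict`, `suggest_keywords`) are all assumptions made for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_nb(docs_by_label):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    vocab = {w for docs in docs_by_label.values() for d in docs for w in tokenize(d)}
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter(w for d in docs for w in tokenize(d))
        denom = sum(counts.values()) + len(vocab)
        log_prior = math.log(len(docs) / total_docs)
        log_lik = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, doc):
    """Return the label with the highest log posterior for one document."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in tokenize(doc) if w in log_lik)
    return max(model, key=score)


def suggest_keywords(reference, search, top_n=3):
    model = train_nb({"reference": reference, "search": search})
    labels = [predict(model, d) for d in search]
    # The classifier's "mistakes": search-set documents labelled as reference.
    target = [d for d, y in zip(search, labels) if y == "reference"]
    rest = [d for d, y in zip(search, labels) if y == "search"]

    def doc_share(word, docs):
        return sum(word in tokenize(d) for d in docs) / len(docs) if docs else 0.0

    # Rank words by (share of target docs) minus (share of remaining docs);
    # ties are broken alphabetically so the output is deterministic.
    words = {w for d in search for w in tokenize(d)}
    ranked = sorted(words, key=lambda w: (doc_share(w, rest) - doc_share(w, target), w))
    return target, ranked[:top_n]


# Hypothetical toy data: reference documents all contain the seed keyword "bombing".
reference = [
    "boston marathon bombing suspect manhunt",
    "bombing suspect manhunt in watertown",
    "marathon bombing manhunt watertown lockdown",
    "boston bombing suspect captured after lockdown",
]
search = [
    "watertown lockdown during suspect manhunt",
    "manhunt ends suspect captured in watertown",
    "red sox win at fenway",
    "new pizza place opens downtown",
    "rain forecast for this weekend",
    "sox trade rumors this weekend",
]

target, keywords = suggest_keywords(reference, search)
print(target)
print(keywords)
```

On this toy data the classifier flags the two Watertown manhunt headlines, and the top-ranked words are `manhunt`, `suspect`, and `watertown`, none of which is the seed keyword. A real application would need far more text and a more careful choice of classifier and ranking statistic.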
### How It Works
The method:
* Generates Boolean search strings that researchers can easily understand
* Suggests keywords based on statistical patterns in the unstructured text itself
* Creates document sets optimized for discovery and retrieval tasks, rather than relying solely on predefined labels
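
To make the link between Boolean search strings and document sets concrete, here is a small sketch of how such a string can select a set of documents. The nested-tuple query representation, the helper names `matches` and `render`, and the sample headlines are hypothetical choices for this example; the source does not specify a query format.

```python
def matches(query, doc):
    """Evaluate a nested Boolean query against one document.

    A query is either a keyword string or a tuple of the form
    ("AND" | "OR" | "NOT", operand, ...).
    """
    words = set(doc.lower().split())

    def evaluate(q):
        if isinstance(q, str):
            return q in words
        op, *args = q
        if op == "AND":
            return all(evaluate(a) for a in args)
        if op == "OR":
            return any(evaluate(a) for a in args)
        if op == "NOT":
            return not evaluate(args[0])
        raise ValueError(f"unknown operator: {op}")

    return evaluate(query)


def render(q):
    """Format a query as a human-readable Boolean search string."""
    if isinstance(q, str):
        return q
    op, *args = q
    if op == "NOT":
        return f"NOT {render(args[0])}"
    return "(" + f" {op} ".join(render(a) for a in args) + ")"


# Hypothetical headlines and query, for illustration only.
docs = [
    "watertown lockdown during suspect manhunt",
    "manhunt ends suspect captured in watertown",
    "red sox win at fenway",
    "sox game postponed during lockdown",
]
query = ("AND", ("OR", "manhunt", "lockdown"), ("NOT", "sox"))

document_set = [d for d in docs if matches(query, d)]
print(render(query))
print(document_set)
```

Because the query is a plain Boolean expression, a researcher can read it, see why each document was included or excluded, and adjust individual terms by hand.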
### Applications Demonstrated
The technique proves valuable across various domains:
* Social media analysis where users rapidly innovate language to evade authorities (e.g., Chinese social media posts designed to circumvent censorship)
* General web searches requiring nuanced topic identification
* eDiscovery processes in legal contexts
* Industry and intelligence analyses needing comprehensive document coverage
### Results
Illustrative applications, such as an analysis of English-language texts about the Boston Marathon bombings, show how the method captures relevant documents by identifying terms that other approaches miss. In these applications, the computer-assisted approach delivers better keyword suggestions than human intuition or standard automated methods.