Exploring the Benefits of Using Redundant Responses in Crowdsourced Evaluations

Crowdsourcing can be an efficient and cost-effective way to evaluate software engineering research, particularly when the evaluation can be broken down into small, independent tasks. In prior work, we crowdsourced evaluations of a refactoring technique for web mashups and of a source code search engine, both using Amazon’s Mechanical Turk. In the refactoring study, participants compared a refactored pipe with its unrefactored counterpart, indicating a preference and providing a free-text justification. In the code search study, participants indicated whether a code snippet was relevant to a programming task and explained why. In both studies, we used redundant metrics and gathered both quantitative and qualitative data in an effort to control response quality, but our prior work analyzed only the quantitative results. In this work, we explore the value of using such redundant metrics in crowdsourced evaluations. We code the free-text responses to uncover common themes and then compare those themes with the quantitative results. Our findings indicate high similarity between the quantitative and free-text responses, that the quantitative results are sometimes more positive than the free-text responses, and that some of the qualitative responses point to potential inadequacies in the quantitative questions used in the studies.