TY - GEN
T1 - Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant
AU - Wiese, Igor Scaliante
AU - Da Silva, José Teodoro
AU - Steinmacher, Igor
AU - Treude, Christoph
AU - Gerosa, Marco Aurélio
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/12
Y1 - 2017/1/12
N2 - Many software projects adopt mailing lists for communication among developers and users. Researchers have been mining the history of such lists to study communities' behavior, organization, and evolution. A potential threat to this kind of study is that users often use multiple email addresses to interact in a single mailing list. This can affect results and tools, for example when extracting social networks. This issue is particularly relevant for popular and long-term Open Source Software (OSS) projects, which attract the participation of thousands of people. Researchers have proposed heuristics to identify multiple email addresses from the same participant; however, few studies analyze the effectiveness of these heuristics. In addition, many studies still do not use any heuristic for author disambiguation, which can compromise their results. In this paper, we compare six heuristics from the literature using data from 150 mailing lists from Apache Software Foundation projects. We found that the heuristic proposed by Oliva et al. and a Naïve heuristic outperformed the others in most cases when considering the F-measure metric. We also found that the time window and the size of the dataset influence the effectiveness of each heuristic. These results may help researchers and tool developers choose the most appropriate heuristic, and they highlight the need to deal with identity disambiguation, mainly in open source software communities with a large number of participants.
KW - Apache software foundation
KW - Email address disambiguation
KW - Mailing lists
KW - Mining software repositories
UR - http://www.scopus.com/inward/record.url?scp=85013078254&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85013078254&partnerID=8YFLogxK
U2 - 10.1109/ICSME.2016.13
DO - 10.1109/ICSME.2016.13
M3 - Conference contribution
AN - SCOPUS:85013078254
T3 - Proceedings - 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016
SP - 345
EP - 355
BT - Proceedings - 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Conference on Software Maintenance and Evolution, ICSME 2016
Y2 - 2 October 2016 through 10 October 2016
ER -