TY - GEN
T1 - "Looks Good To Me ;-)"
T2 - 28th International Conference on Evaluation and Assessment in Software Engineering, EASE 2024
AU - Coutinho, Daniel
AU - Cito, Luisa
AU - Lima, Maria Vitória
AU - Arantes, Beatriz
AU - Alves Pereira, Juliana
AU - Arriel, Johny
AU - Godinho, João
AU - Martins, Vinicius
AU - Libório, Paulo Vítor C.F.
AU - Leite, Leonardo
AU - Garcia, Alessandro
AU - Assunção, Wesley K.G.
AU - Steinmacher, Igor
AU - Baffa, Augusto
AU - Fonseca, Baldoino
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/6/18
Y1 - 2024/6/18
N2 - Modern software development relies on cloud-based collaborative platforms (e.g., GitHub and GitLab). In these platforms, developers often employ a pull-based development approach, proposing changes via pull requests and engaging in communication via asynchronous message exchanges. Since communication is key for software development, studies have linked different types of sentiments embedded in the communication to their effects on software projects, such as bug-inducing commits or the non-acceptance of pull requests. In this context, sentiment analysis tools are paramount to detect the sentiment of developers' messages and prevent potentially harmful impacts. Unfortunately, existing state-of-the-art tools vary in terms of the nature of their data collection and labeling processes. Yet, there is no comprehensive study comparing the performance and generalizability of existing tools utilizing a dataset that was designed and systematically curated to this end, and in this specific context. Therefore, in this study, we design a methodology to assess the effectiveness of existing sentiment analysis tools in the context of pull request discussions. For that, we created a dataset that contains ≈ 1.8K manually labeled messages from 36 software projects. The messages were labeled by 19 experts (neuroscientists and software engineers), using a novel and systematic manual classification process designed to reduce subjectivity. By applying these existing tools to the dataset, we observed that while some tools perform acceptably, their performance is far from ideal, especially when classifying negative messages. This is interesting since negative sentiment is often related to a critical or unfavorable opinion. We also observed that some messages have characteristics that can make them harder to classify, causing disagreements between the experts and possible misclassifications by the tools, requiring more attention from researchers. Our contributions include valuable resources to pave the way to develop robust and mature sentiment analysis tools that capture/anticipate potential problems during software development.
AB - Modern software development relies on cloud-based collaborative platforms (e.g., GitHub and GitLab). In these platforms, developers often employ a pull-based development approach, proposing changes via pull requests and engaging in communication via asynchronous message exchanges. Since communication is key for software development, studies have linked different types of sentiments embedded in the communication to their effects on software projects, such as bug-inducing commits or the non-acceptance of pull requests. In this context, sentiment analysis tools are paramount to detect the sentiment of developers' messages and prevent potentially harmful impacts. Unfortunately, existing state-of-the-art tools vary in terms of the nature of their data collection and labeling processes. Yet, there is no comprehensive study comparing the performance and generalizability of existing tools utilizing a dataset that was designed and systematically curated to this end, and in this specific context. Therefore, in this study, we design a methodology to assess the effectiveness of existing sentiment analysis tools in the context of pull request discussions. For that, we created a dataset that contains ≈ 1.8K manually labeled messages from 36 software projects. The messages were labeled by 19 experts (neuroscientists and software engineers), using a novel and systematic manual classification process designed to reduce subjectivity. By applying these existing tools to the dataset, we observed that while some tools perform acceptably, their performance is far from ideal, especially when classifying negative messages. This is interesting since negative sentiment is often related to a critical or unfavorable opinion. We also observed that some messages have characteristics that can make them harder to classify, causing disagreements between the experts and possible misclassifications by the tools, requiring more attention from researchers. Our contributions include valuable resources to pave the way to develop robust and mature sentiment analysis tools that capture/anticipate potential problems during software development.
KW - human aspects
KW - repository mining
KW - sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85197447140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197447140&partnerID=8YFLogxK
U2 - 10.1145/3661167.3661189
DO - 10.1145/3661167.3661189
M3 - Conference contribution
AN - SCOPUS:85197447140
T3 - ACM International Conference Proceeding Series
SP - 211
EP - 221
BT - Proceedings of the 2024 28th International Conference on Evaluation and Assessment in Software Engineering, EASE 2024
PB - Association for Computing Machinery
Y2 - 18 June 2024 through 21 June 2024
ER -