Qiangqiang Mao @ UBC

Mao et al., 2025. Stop using CNN: knowledge distillation for an interpretable and lightweight decision tree in rod pump working condition diagnosis. https://doi.org/10.2118/228194-MS .

This paper discusses two representative models for dynamometer card image classification: the Convolutional Neural Network (CNN) and the Vision Transformer (ViT). Although the transformer-based ViT differs fundamentally from the CNN, which relies on convolution operations, we did not explicitly distinguish between the two in the paper, which may cause some confusion.

In the paper, the term “CNN” is used broadly to denote heavy, black-box models for image classification, covering both CNN and ViT. I hope this clarification helps readers better understand the context. Importantly, it does not affect the findings of the paper: the main comparison remains between heavy, black-box models (CNN and ViT) and lightweight, interpretable models (decision trees).

In addition, I noticed that the equations in the methodology section may appear unclear because of inconsistencies between LaTeX and XML formatting in the conference paper. I suggest that readers refer to our underlying algorithm paper for clearer mathematical notation and formulations (Mao, Q., & Cao, Y., 2024. Can a Single Tree Outperform an Entire Forest? https://doi.org/10.48550/arXiv.2411.17003).