Testing Deep Learning-based Systems
Deep learning is being deployed across an ever-growing range of domains: imaging, language models, autonomous vehicles, and more. After development, a well-studied area, a model typically needs to be deployed into a production environment. Deployment is less studied and has many moving pieces, so developers often see issues crop up at this stage. That is understandable, since DL model experts are not necessarily experts in other facets of development, but by analyzing and understanding the causes of these faults we as a community can hopefully address them better. This week’s three papers analyzed the development and deployment pipeline of deep learning models in order to define taxonomies and suggest directions for process improvement.
Each of the three papers sourced, sliced, and organized its data on DL development and deployment issues differently. Each approach had its upsides and downsides, but tackling the problem from so many perspectives leaves us with a much more comprehensive view of the landscape. As Chen et al pointed out, the popularity of deploying deep learning models keeps increasing, along with the number of related questions asked on StackOverflow and GitHub, so it’s time to focus a little more on applying these models rather than just developing them in a vacuum.
Chen et al
Chen et al and Humbatova et al sourced their issues by scraping StackOverflow and GitHub, while Zhang et al pulled their data from an internal analysis of Microsoft’s deep learning platform, Philly. Philly exists only at Microsoft, certainly, and web-sourced data is more democratic and sure to cut across a wider swath of developers, but an internal system has the benefit of letting researchers peek “behind the curtain” at the issues programmers encounter but don’t share. As a result, simple general programming issues like key errors, undefined variables, and type mismatches showed up in force in the Philly study, at over one-third of total issues, illustrating particular room for improvement in the debugging systems used for DL.
Humbatova and Chen mostly avoided these general issues by focusing their searches on web-based DL-specific questions about development and deployment, respectively. While both searched for terms like “keras,” “tensorflow” and “pytorch,” Chen went the extra step of filtering further for cloud-, mobile-, and browser-specific questions. Humbatova also fleshed out the sourcing with interviews, capturing not only the questions developers asked publicly but also the questions developers felt they had.
We’ll begin with Humbatova’s findings, as they cover the beginning of the deep learning development process. After a manual analysis of 331 commits and 277 issues from GitHub, 477 Q&As from StackOverflow, and 20 semi-structured interviews with both practitioners and researchers, the paper arrived at five main categories of deep learning issues:
Model issues: relates to the structure and properties of a DL model
Tensors and inputs: wrong shape, type, or format of data
Training: all facets of the training process (the largest and most confirmed category)
GPU Usage: all faults related to the GPU, generally very specific problems so no subcategorization
APIs: incorrect usage of the framework’s API
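To make the taxonomy concrete, here is a toy sketch (my own illustration, not an example from the paper) of a fault in the “Tensors and inputs” category: a batch whose feature dimension doesn’t match what the layer expects.

```python
import numpy as np

# A "Tensors and inputs" fault in miniature: the weight matrix expects
# 4 input features, but the batch only carries 3.
weights = np.ones((4, 2))      # (in_features=4, out_features=2)
bad_batch = np.ones((8, 3))    # 8 samples with only 3 features each

try:
    bad_batch @ weights        # matmul with mismatched inner dimensions
except ValueError as e:
    print("shape fault caught:", e)

good_batch = np.ones((8, 4))   # correct feature dimension
out = good_batch @ weights
print(out.shape)               # (8, 2)
```

In a full framework the same mistake often surfaces far from its cause, deep inside a training loop, which is part of why this category earns its own branch in the taxonomy.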
The final tree contained 92 leaves, each representing a type of fault encountered in deep learning systems. But they didn’t stop there! They then put the taxonomy to use, checking whether mutation testing adequately covers these issues.
Mutations are artificial faults seeded into a program during the testing process to assess how reliably the test suite detects real faults. It turns out that only 6 of the 92 issue categories were covered by existing mutants, so there’s a lot of room for improvement and for developing new mutation operators.
The final class vote was to accept; everyone was impressed by the thorough data collection and the detailed, step-by-step explanation of the filtering process. The authors also cast a wide net, both collecting examples from the web through automated systems and employing interviews at multiple steps of the process.
The two techniques of collection wound up having a large effect on the final result: the GitHub/StackOverflow data highlighted specific technical issues, while the interviews highlighted process-related issues. As a result, this is an excellent foundation for further research into faults and organization of faults for deep learning systems.
After development, deploying the model to create a usable system becomes the next problem to tackle. This is where the paper by Chen et al comes in.
A developer can’t be an expert in everything, and this paper illustrates the gaps that appear when a modeling practitioner jumps into a new field such as web, cloud, or mobile development. The analysis again drew on data gathered from StackOverflow and GitHub, but focused specifically on issues in the deployment process, classified into one of three platform categories: web, mobile, or cloud.
Questions involving deployment of deep learning systems wait longer for accepted answers and are much more likely to go unanswered as compared to traditional DL questions or questions regarding the platforms in a non-DL sense.
While web, mobile, and cloud platforms are undeniably different, they’re similar in concept: models need to be exported from the development environment, converted to the target platform’s preferred format, integrated into the platform or application, and then made to cope with real users and “real life” input. Mobile and browser deployments shared the challenges of data extraction and inference speed, whereas server/cloud and browser deployments shared the challenge of environment setup. Environment setup was not much of a problem for mobile developers, but made up a fifth of all questions for server/cloud and browser! Issues specific to the server/cloud include how to handle requests (inputs, response time, etc.) and how to serve results properly, e.g. while managing memory usage appropriately.
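The export-and-convert step above is where many of those questions arise. As a deliberately simplified stand-in (real pipelines use formats like ONNX or TF Lite), the sketch below serializes a tiny model to a platform-neutral format and verifies that the converted copy reproduces the original’s behavior — the core property a conversion must preserve.

```python
import json

# A toy "model": one linear layer's weights and biases (values invented).
model = {"w": [[0.5, -1.0], [2.0, 0.25]], "b": [0.1, -0.2]}

def predict(m, x):
    """Tiny linear layer: y_j = sum_i x_i * w[i][j] + b[j]."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*m["w"]), m["b"])]

exported = json.dumps(model)     # "export": serialize to a neutral format
reloaded = json.loads(exported)  # "import" on the target platform

x = [1.0, 2.0]
assert predict(model, x) == predict(reloaded, x)  # behavior preserved
print(predict(reloaded, x))
```

Real converters must additionally handle operator support gaps, quantization, and layout differences between frameworks, which is exactly where the StackOverflow questions pile up.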
In class we discussed how many of these uncovered problems also appear in traditional software development, e.g. poor API documentation. However, we did identify a few DL-specific issues. First, questions centered on model export/conversion are DL-specific, though similar serialization concerns apply to data in general, not just deep learning; and loading memory- and GPU-intensive processes on mobile and server platforms is also less common in traditional software development. Second, to make a model perform across different platforms, you do your best to shrink it with pruning and similar techniques. This changes the model’s semantics, creating a new variant of the software — how do we make sure both models are equally valid? It can be a problem!
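One pragmatic answer to that equivalence question is to bound how much the two variants disagree on held-out inputs. The sketch below is an illustration under assumed models and an assumed tolerance, not a method from the paper:

```python
# Two hypothetical model variants: pruning zeroed out a small weight,
# so the pruned model is close to, but not identical to, the original.
def original_model(x):
    return 0.8 * x + 0.31 * x ** 2   # stand-in for the full model

def pruned_model(x):
    return 0.8 * x + 0.3 * x ** 2    # small coefficient pruned away

validation_inputs = [i / 10.0 for i in range(-10, 11)]
max_gap = max(abs(original_model(x) - pruned_model(x))
              for x in validation_inputs)
print("max disagreement:", max_gap)  # ~0.01, at x = +/-1.0

tolerance = 0.05                     # assumed acceptance threshold
assert max_gap <= tolerance          # accept the pruned variant
```

A disagreement bound like this only certifies behavior on the inputs checked; deciding what tolerance is acceptable, and on which input distribution, is precisely the open validation problem the class raised.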
Overall, our class agreed that this paper makes an important contribution by highlighting challenges across the DL deployment community and opens the door for us to develop helpful tools to combat these challenges.
Zhang et al
Zhang et al’s use of Microsoft’s Philly was a double-edged sword. The bird’s-eye analysis let them examine issues across the full spectrum of development and some modes of deployment, but our class did question how well issues encountered on a single platform generalize.
While we already discussed the “general code error” category of issues this omniscient view allowed them to discover, the largest category by far actually concerned the execution environment, at 48% of total errors. Path-not-found, library-not-found, and permission-denied errors were rampant. While this is not specifically a paper on deployment, the finding goes hand-in-hand with Chen’s observation that developers don’t necessarily think through or understand their relationship to the platform they’re sending their code to, instead treating it as just an extension of their local environment.
The authors strongly encourage building environments more conducive to developers working on deep learning systems, so that they encounter fewer errors and bounce back more quickly when errors do occur. Tooling on the local machine to both emulate and estimate the deployment environment could save countless developer hours, and improving the frameworks themselves with more usable APIs and replay abilities could help programmers avoid simplistic errors and debug more effectively.
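A minimal sketch in the spirit of that tooling (my own illustration, not a tool from the paper): a local pre-flight check that catches path-not-found and library-not-found errors before a job ever reaches the cluster. The required paths and module names here are invented for the example.

```python
import importlib.util
import os

def preflight(required_paths, required_modules):
    """Check that paths and importable modules exist locally,
    returning a list of human-readable problems (empty if clean)."""
    problems = []
    for p in required_paths:
        if not os.path.exists(p):
            problems.append(f"path not found: {p}")
    for m in required_modules:
        if importlib.util.find_spec(m) is None:
            problems.append(f"library not found: {m}")
    return problems

# Hypothetical job requirements: one missing path, one real and one
# missing module, to show both failure modes being caught early.
issues = preflight(required_paths=["/definitely/not/a/real/dir"],
                   required_modules=["json", "not_a_real_dl_framework"])
for issue in issues:
    print(issue)
```

Catching these two error classes locally would address a large share of the 48% “execution environment” failures the study reports, at the cost of a few seconds before submission.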
As a class we voted to accept: despite its faults, the paper provides an excellent list of things to keep our eyes out for. It documents bugs that we often face, and that documentation lets us build on top of their research.
Each paper had its own approach and strengths. Humbatova et al focused on the development environment, outlining and categorizing specific issues that developers face and noting that these are barely addressed by current testing methods (specifically mutants). Chen et al focused instead on the deployment platforms, illustrating how troublesome it can be for a developer to move beyond model-building, and how improving usability, configuration tools, and targeted skill building could provide a smoother transition. Zhang et al examined faults “in the wild,” suggesting how debugging tools could be improved, and made us feel confident that next time a KeyError interrupts a long build process, we’re no different from a deep learning expert at Microsoft!