
Software Engineering 4 AI: Practices and Challenges

Updated: Feb 3, 2021

Paper 1: A comprehensive study on challenges in deploying deep learning based software

Zhenpeng Chen et al.

Deep learning has been deployed more and more across imaging, language models, autonomous vehicles, and other areas. After development (a well-studied area), the models typically need to be deployed into a production environment, and that is where a lot of problems show up!

This paper separated the issues into server/cloud, mobile, and browser. Each category has its own set of priorities and challenges.

What they have in common, though, is that the model needs to be exported from the development environment, converted to the platform’s preferred format, integrated into the platform/application, and then made to cope with interactions from users and “real life” input. Each stage also has its own set of challenges.

The authors’ main research questions to develop actionable suggestions were:

  • RQ1: Gauging popularity of software deployment

  • RQ2: Quantifying difficulty of deployment

  • RQ3: Identifying challenges in DL software deployment

The authors searched StackOverflow by keyword, mainly using platform tags, and wound up with 3023 questions:

  • 1325 server/cloud

  • 1533 mobile

  • 165 browser
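
As a quick sanity check on these numbers, the short sketch below sums the three per-platform counts and prints each platform's share of the dataset. Only the three counts come from the notes above; the variable names are made up for illustration.

```python
# Per-platform question counts as listed in the notes above.
counts = {"server/cloud": 1325, "mobile": 1533, "browser": 165}

total = sum(counts.values())
# Share of the whole dataset contributed by each platform.
shares = {platform: n / total for platform, n in counts.items()}

for platform, share in shares.items():
    print(f"{platform}: {counts[platform]} questions ({share:.1%})")
print(f"total: {total}")
```

The counts sum to 3023, with mobile contributing the most questions and browser by far the fewest.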

RQ1: Gauging popularity of software deployment

Both the number of users asking DL deployment questions and the total number of questions posed each year showed an upward trend.

RQ2: Quantifying difficulty of deployment

Questions regarding deep learning deployment had the largest percentage of unanswered questions as well as the longest time-to-accepted-answer (by far!). This implies that DL deployment is an area where many improvements could be made to create an easier process.
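
To make the two difficulty measures concrete, here is a minimal sketch of how they could be computed. The records are invented, "unanswered" is assumed to mean "no accepted answer", and the median is an assumed choice of summary statistic; none of this is taken from the paper's own scripts.

```python
from statistics import median

# Hypothetical records: hours until an answer was accepted, or None if no
# answer was ever accepted. These values are invented for illustration.
hours_to_accepted = [4.0, None, 30.5, None, 12.0, 2.5, None, 48.0]

def difficulty_metrics(hours):
    """Return (share of questions without an accepted answer, median hours to acceptance)."""
    answered = [h for h in hours if h is not None]
    unanswered_share = 1 - len(answered) / len(hours)
    median_hours = median(answered) if answered else None
    return unanswered_share, median_hours

unanswered_share, median_hours = difficulty_metrics(hours_to_accepted)
print(f"no accepted answer: {unanswered_share:.1%}")
print(f"median time to accepted answer: {median_hours} h")
```

Comparing these two numbers across topics is what lets a study say that one topic is "harder" than another.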

RQ3: Identifying challenges in DL software deployment

From the data, the authors developed a taxonomy of challenges. They randomly sampled 769 questions, which two people then inspected manually.
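
When two people label the same sample, a common way to quantify how much they agree beyond chance is Cohen's kappa. These notes do not record whether or how the authors measured agreement, so the sketch below is only an illustration of the general technique, with invented labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented category labels for six sampled questions.
rater_1 = ["export", "env", "env", "serving", "export", "env"]
rater_2 = ["export", "env", "serving", "serving", "export", "export"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is one reason more than two labellers (or a reported agreement score) makes a manual taxonomy easier to trust.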

Their final taxonomy had 3 root categories (one per platform), 25 inner categories, and 72 leaf categories. The challenges common to all three platforms were how/why to use DL models on the platform, exporting/converting the model to the appropriate format, and pre- and post-processing data for the model.

Mobile and browser shared the challenges of data extraction and inference speed. Server/cloud and browser shared the challenges of environment setup. This was not as much of a problem for mobile developers, but composed a fifth of all questions for server/cloud and browser!

Issues specific to the server/cloud include how to handle requests (inputs, response time, etc.) and how to serve results properly, e.g. while keeping memory usage under control.

Over a fifth of all questions on mobile involved compiling the deep learning library, while a category specific to the browser included how to load models from different locations.

What can be learned

Three subsets of people can benefit from this research: researchers, developers, and DL framework vendors.

Researchers could create automated fault localization or configuration tools, and spend more time talking with others to see the implications for different communities of developers.

Developers need to target their learning at both development and deployment, e.g. by learning JavaScript before attempting to run TensorFlow.js, as knowing TensorFlow is just not enough for that task! They’ll also want to pay attention to the problems other developers have had before and to project management skills, so as not to fall into simple, well-trod paths of struggle.

Instead of focusing only on performance, DL framework vendors can work on the UX of their products by improving documentation, usability, and APIs.

It will be a tough road, but these issues are definitely fixable!

Discussion topics

Some problems are common for traditional software development as well, e.g. API documentation. Is there anything special in this paper that's only in deep learning? Model export/conversion is DL-specific, but it's kind of related to all data, which is not necessarily just deep learning. Certain aspects of loading memory-and-GPU-intensive processes on mobile and server platforms are also less common in software development. Overall, understanding more about the software development process would be helpful.

To allow your model to perform across different platforms, you do your best to size it appropriately with pruning etc. This changes the model semantics, creating a new variant of the software – how do we make sure both models are equally valid? It can be a problem!
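
One simple way to start answering that question is to run both variants on the same inputs and measure how often their predictions agree. The sketch below does this for a toy linear classifier with magnitude pruning; the model, the threshold, and the inputs are all invented, and a real check would use a held-out dataset and the actual framework's models.

```python
def predict(weights, x):
    """Binary prediction of a toy linear model: 1 if the weighted sum is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def agreement(weights_a, weights_b, inputs):
    """Fraction of inputs on which the two models make the same prediction."""
    same = sum(predict(weights_a, x) == predict(weights_b, x) for x in inputs)
    return same / len(inputs)

original = [0.9, -0.05, 0.4, 0.02]
pruned = prune(original, threshold=0.1)

# Invented probe inputs; the last one depends only on a pruned weight.
inputs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

rate = agreement(original, pruned, inputs)
print(f"pruned weights: {pruned}")
print(f"prediction agreement: {rate:.0%}")
```

Here the two models disagree only on the input that exercises a pruned weight. Agreement on a test set is a weak notion of "equally valid", but it makes the semantic change visible and measurable.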


A big limitation of this study is that they only took data from StackOverflow. We’ll see the benefit of interviews in a later paper! Their taxonomy was also constructed through manual sorting by only two people, while three or more would have been preferred.

Did you learn something from this paper? One takeaway is that "it's trickier than just an export button," and that these problems might not otherwise occur to you. But how useful, really, is all of that large-scale data analysis if it just highlights the outline? Is there something you can learn from an outline?

Perhaps! If you aren't thinking about problems in this format, the errors might seem obscure and you won't understand where they're coming from. This outline might give structure to your steps.

Possible class project: take a bunch of NN models and see whether you can automatically export them to e.g. TensorFlow Lite.

Is there a way to standardize the hardware part? You can standardize the software part using a container, but the hardware industry is moving so quickly, and you have to be flexible for your customers.

Verdict: accepted. This paper is good because you can find the errors and develop a tool as a result. It's okay that it's just data analytics! It’s borderline, but the paper is accepted.

Paper 2: An Empirical Study on Program Failures of Deep Learning Jobs

Ru Zhang et al.

This paper classified 400 randomly selected failed jobs from Microsoft Philly.

Exceptional Data (ED)

An ML system is expected to perform on unseen data, but maybe it won't! Maybe the data has the wrong number of labels, for example. Handle exceptions carefully so that cases like this get noticed.

The data may be completely different from anything you have seen before. Maybe it's invalid data!

Compare this with out-of-distribution data: the self-driving system has never seen a rainy image, so we synthesize rainy images. Exceptional data is even worse, since it can involve entirely new classes!

In short, exceptional data is data that will... cause an exception.
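
A cheap defence is to validate each example before it reaches the training loop, so that bad records are reported up front rather than crashing a long-running job. The checks below (feature count, NaN values, label range) and the tiny dataset are assumptions chosen for illustration, not the paper's failure list.

```python
def validate_example(features, label, num_features, num_classes):
    """Return a list of problems found in one training example (empty if clean)."""
    problems = []
    if len(features) != num_features:
        problems.append(f"expected {num_features} features, got {len(features)}")
    if any(f != f for f in features):  # NaN is the only float not equal to itself
        problems.append("feature vector contains NaN")
    if not (isinstance(label, int) and 0 <= label < num_classes):
        problems.append(f"label {label!r} outside 0..{num_classes - 1}")
    return problems

# Invented dataset: one clean example followed by three exceptional ones.
dataset = [
    ([0.1, 0.2, 0.3], 1),
    ([0.5, float("nan"), 0.1], 0),
    ([0.4, 0.4], 2),
    ([0.9, 0.8, 0.7], 7),
]

bad = {}
for index, (features, label) in enumerate(dataset):
    problems = validate_example(features, label, num_features=3, num_classes=3)
    if problems:
        bad[index] = problems

for index, problems in bad.items():
    print(f"example {index}: {'; '.join(problems)}")
```

Collecting the problems instead of raising on the first one gives a full report in a single pass over the data.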

Common Programming Error (CPE)

Illegal argument! Type mismatch! Key not found! All of the normal common programming issues.

Difference in Environment

Most failures in the execution environment are caused by discrepancies between the local environment and the platform. The many discrepancies make deep learning programs error-prone.

Developers are encouraged to use custom Docker images with all desired software pre-installed, modify the code to be more environment-adaptive, and verify paths/permissions as early as possible.
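
The "verify paths/permissions as early as possible" advice can be sketched as a preflight check that runs before any expensive work starts. The function name and messages are hypothetical, and the demo uses a temporary directory so that it runs anywhere.

```python
import os
import tempfile

def preflight(input_paths, output_dir):
    """Return a list of path/permission problems; an empty list means safe to start."""
    problems = []
    for path in input_paths:
        if not os.path.exists(path):
            problems.append(f"missing input: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"input not readable: {path}")
    if not os.path.isdir(output_dir):
        problems.append(f"output directory does not exist: {output_dir}")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"output directory not writable: {output_dir}")
    return problems

# Demo in a throwaway directory: one input exists, one does not.
with tempfile.TemporaryDirectory() as workdir:
    present = os.path.join(workdir, "train.csv")
    absent = os.path.join(workdir, "labels.csv")
    open(present, "w").close()
    ok_report = preflight([present], workdir)
    bad_report = preflight([present, absent], os.path.join(workdir, "checkpoints"))

print(ok_report)
print(bad_report)
```

Failing in the first seconds of a job is much cheaper than discovering an unwritable checkpoint directory after hours of training.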

Inappropriate Model Parameters / Structures

Developers should proactively choose the optimal model parameters and structures, taking into consideration both available GPU memory and expected learning performance.
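
A rough back-of-the-envelope estimate can help with this choice before a job is submitted. The sketch below assumes training memory is dominated by weights, gradients, optimizer state, and stored activations, and ignores framework overhead, workspace buffers, and fragmentation, so treat it as a crude lower bound. The formula, default values, and function names are assumptions for illustration, not the paper's method.

```python
def estimate_training_memory_mib(num_params, batch_size, activations_per_example,
                                 bytes_per_value=4, optimizer_states_per_param=2):
    """Crude estimate of training memory in MiB (an assumed, simplified model)."""
    # One copy each for weights and gradients, plus optimizer state buffers
    # (two per parameter is typical of Adam-style optimizers).
    param_values = num_params * (2 + optimizer_states_per_param)
    # Activations kept for the backward pass grow linearly with batch size.
    activation_values = batch_size * activations_per_example
    return (param_values + activation_values) * bytes_per_value / 2**20

def largest_fitting_batch(num_params, activations_per_example, budget_mib,
                          candidates=(256, 128, 64, 32, 16, 8)):
    """Return the largest candidate batch size whose estimate fits the budget, or None."""
    for batch_size in candidates:
        needed = estimate_training_memory_mib(num_params, batch_size, activations_per_example)
        if needed <= budget_mib:
            return batch_size
    return None

# Invented model size: about 1M parameters and 256K activation values per example.
num_params = 2**20
activations_per_example = 2**18

estimate_64 = estimate_training_memory_mib(num_params, 64, activations_per_example)
best = largest_fitting_batch(num_params, activations_per_example, budget_mib=100)
print(f"batch 64 needs about {estimate_64:.0f} MiB; largest batch within 100 MiB: {best}")
```

Even an estimate this coarse makes the trade-off explicit: parameter memory is a fixed cost, while activation memory scales with the batch size.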

API Misunderstanding

Developers may not fully understand the complex assumptions made by framework/library APIs, partly because deep learning software evolves so quickly. This results in failures related to framework API misuse.

Current testing practices

The current DL testing practices are often insufficient due to the characteristics of deep learning. There are three major challenges:

1) incomparable testing environment

2) large test space

3) necessity of testing at different DL phases

The incomparable testing environment problem can be mitigated by using a local simulator of the platform. The different DL phases include training, validation, deployment, etc.

Developers are encouraged to test more cases across all the DL phases. A local simulator of the platform, an estimate of GPU memory consumption, and a test data generator could all be useful for DL testing.

Current debugging practices

It’s difficult to match between development/test and production environments for several reasons, including incomparable test environments, large test space, and the necessity of performing these tests at different deep learning phases (reading in data, cleaning data, training, testing, etc).

This necessitates writing more test cases and better simulating or estimating the environment that one is deploying to (e.g. GPU memory). But even if you do this, the DL debugging tools aren’t very good!

As a result, developers need more DL-specific debugging tools. These would provide a way to, for example, see GPU memory usage and save intermediate results for examining errors.

Future Research Directions

  • Platform improvement

    • Avoiding unnecessary retries

    • Local simulators

  • Tool support

    • GPU memory consumption estimation

    • Static program analysis

    • Data synthesis (unknown data, impossible data, etc)

  • Framework improvement

    • Automatic GPU memory management (if a batch fails, retry or reduce the batch size)

    • Record and replay
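
The "retry or reduce" idea under framework improvement can be sketched as a wrapper that halves the batch size whenever a step runs out of memory. Python's built-in MemoryError stands in here for whatever framework-specific out-of-memory error a real GPU step would raise, and the step function is a fake with an invented capacity limit.

```python
def run_with_adaptive_batch(step, batch, min_batch_size=1):
    """Run step over batch, halving the chunk size after each out-of-memory failure.

    Returns (list of per-chunk results, chunk size that finally worked).
    For simplicity, chunks that succeeded before a failure are recomputed.
    """
    size = len(batch)
    while True:
        try:
            results = [step(batch[i:i + size]) for i in range(0, len(batch), size)]
            return results, size
        except MemoryError:
            if size <= min_batch_size:
                raise  # cannot shrink further; surface the failure
            size = max(min_batch_size, size // 2)

def fake_step(chunk):
    """Stand-in for a training step: pretends the device fits at most 4 examples."""
    if len(chunk) > 4:
        raise MemoryError("simulated out-of-memory")
    return sum(chunk)

results, used_size = run_with_adaptive_batch(fake_step, list(range(16)))
print(f"succeeded with batch size {used_size}: {results}")
```

Changing the batch size can affect training dynamics, so a framework doing this automatically would probably want to log it or compensate, for example with gradient accumulation.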

Thoughts from Critics

All of the data came from Microsoft Philly - other platforms may have other bugs or general issues that were missed by this paper.


We vote to accept: despite its faults, the paper gives a good overview of things to keep an eye out for. The bugs they discuss are ones we often face, and because they documented them, we can build on top of their research.

In terms of their process, it's difficult to do interviews and get feedback from actual users, which is why large-scale user-study evaluation is rare. Mining failed jobs is a good way to get actual errors at scale. "I'm not saying that we should not do that, but it's a very difficult thing to do."

Paper 3: Taxonomy of Real Faults in Deep Learning Systems

Nargiz Humbatova et al.

The previous paper focused on faults in deployment, while this one focuses on the faults when building the DL system. Specifically, it’s focused on developing a taxonomy of real faults using a bottom-up approach.

They pulled large amounts of data from GitHub and Stack Overflow by searching for issues relating to TensorFlow, PyTorch and Keras, then filtered for meaningful and appropriate issues/questions on non-personal projects. Additionally, they conducted semi-structured interviews with 20 people involved with deep learning, half practitioners and half researchers. In contrast with the previous papers, this is the first one to gather information through interviews.

Their manual analysis consisted of 331 commits and 277 issues from GitHub, along with 477 Q&A on StackOverflow. You can find the artefacts from their research on GitHub at

After the first round of analysis and interviews, they conducted a further round of interviews with 21 DL practitioners who had not participated in the first round. This allowed them to validate and extend their initial categorization.

In the end, the issues encountered fell into five main categories:

  • Model issues: faults related to the structure and properties of a DL model

  • Tensors and inputs: wrong shape, type, or format of data

  • Training: all facets of the training process (the largest and most confirmed category)

  • GPU usage: all faults related to the GPU; these are generally very specific problems, so there is no subcategorization

  • APIs: faults in how the framework’s API is used

The final tree involved 92 leaves, each representing an issue encountered with deep learning systems. But they didn’t stop there! They then put the issue taxonomy to use, seeing whether mutation testing had adequate coverage of these issues.

Mutations are artificial faults seeded into a program during testing to check whether the tests are good enough to detect them. It turns out that only 6 of the 92 issue categories were covered by existing mutants, so there’s a lot of room for improvement and for developing new mutation operators.
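
To illustrate the general idea of mutation testing (not the DL-specific mutation tools the paper examined), the sketch below seeds one fault into an accuracy function and checks which of two tests "kills" the mutant. All names and the seeded fault are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy_mutant(y_true, y_pred):
    """Mutant of accuracy: the seeded fault replaces == with !=."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def weak_test(fn):
    """Only checks that the result is a valid fraction."""
    return 0.0 <= fn([1, 0, 1], [1, 0, 0]) <= 1.0

def strong_test(fn):
    """Checks the exact expected value: two of three predictions are correct."""
    return fn([1, 0, 1], [1, 0, 0]) == 2 / 3

def kills(test, mutant):
    """A test kills a mutant if it fails when run against the mutant."""
    return not test(mutant)

weak_kills = kills(weak_test, accuracy_mutant)
strong_kills = kills(strong_test, accuracy_mutant)
print(f"weak test kills mutant: {weak_kills}")
print(f"strong test kills mutant: {strong_kills}")
```

A mutant that survives, as it does against the weak test here, points at a gap in the tests. The taxonomy matters because mutation operators are only useful if the faults they seed resemble the faults developers actually make.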

The final vote was to accept, as everyone was impressed by the thorough data collection and the detailed, step-by-step explanation of the filtering process. The authors also cast a wide net, both collecting examples from the web through automated systems and employing interviews at multiple steps of the process.

The two techniques of collection wound up having a large effect on the final result: the GitHub/StackOverflow data highlighted specific technical issues, while the interviews highlighted process-related issues. As a result, this is an excellent foundation for further research into faults and organization of faults for deep learning systems.
