Not all core concepts from software engineering translate into the machine learning universe. Here are some differences I've noticed.
Divide and Conquer A key technique in software engineering is to break a problem down into simpler subproblems, solve those subproblems, and then compose them into a solution to the original problem. Arguably, this is the entire job, recursively applied until the solution can be expressed in a single line in whatever programming language is being used. The canonical pedagogical example is the Tower of Hanoi.
Unfortunately, in machine learning we never exactly solve a problem. At best, we approximately solve a problem. This is where the technique needs modification: in software engineering the subproblem solutions are exact, but in machine learning errors compound and the aggregate result can be complete rubbish. In addition apparently paradoxical situations can arise where a component is “improved” in isolation yet aggregate system performance degrades when this “improvement” is deployed (e.g., due to the pattern of errors now being unexpected by downstream components, even if they are less frequent).
Does this mean we are doomed to think holistically (which doesn't sound scalable to large problems)? No, but it means you have to be defensive about subproblem decomposition. The best strategy, when feasible, is to train the system end-to-end, i.e., optimize all components (and the composition strategy) together rather than in isolation. Often this is not feasible, so another alternative (inspired by Bayesian ideas) is to have each component report some kind of confidence or variance along with the output in order to facilitate downstream processing and integration.
In practice, when systems get to a particular scope, there needs to be decomposition in order to divide the work up amongst many people. The fact that this doesn't work right now in machine learning is a problem, as elegantly described by Leon Bottou in his ICML 2015 invited talk.
Speaking of another concept that Leon discussed $\ldots$
Correctness In software engineering, an algorithm can be proven correct, in the sense that given particular assumptions about the input, certain properties will be true when the algorithm terminates. In (supervised) machine learning, the only guarantee we really have is that if the training set is an iid sample from a particular distribution, then performance on another iid sample from the same distribution will be close to that on the training set and not too far from optimal.
Consequently anyone who practice machine learning for a living has an experimental mindset. Often times I am asked whether option A or option B is better, and most of the time my answer is “I don't know, let's try both and see what happens.” Maybe the most important thing that people in machine learning know is how to assess a model in such a way that is predictive of generalization. Even that is a “feel” thing: identifying and preventing leakage between training and validation sets (e.g., by stratified and temporal sampling) is something you learn by screwing up a few times; ditto for counterfactual loops. Kaggle is great for learning about the former, but the latter seems to require making mistakes on a closed-loop system to really appreciate.
Experimental “correctness” is much weaker than the guarantees from other software, and there are many ways for things to go badly. For example in my experience it is always temporary: models go stale, it just always seems to happen. Ergo, you need to plan to be continually (hence, automatically) retraining models.
Reuse This one is interesting. Reuse is the key to leverage in traditional software engineering: it's not just more productive to reuse other code, but every line of code you write yourself is an opportunity to inject defects. Thus, reuse not only allows you to move faster but also make less mistakes: in return, you must pay the price of learning how to operate a piece of software written by others (when done well, this price has been lowered through good organization, documentation, and community support).
Some aspects of machine learning exhibit exactly the same tradeoff. For instance, if you are writing your own deep learning toolkit, recognize that you are having fun. There's nothing wrong with having fun, and pedagogical activities are arguably better than playing video games all day. However, if you are trying to get something done, you should absolutely attempt to reuse as much technology as you can, which means you should be using a standard toolkit. You will move faster and make less mistakes, once you learn how to operate the standard toolkit.
Machine learning toolkits are “traditional software”, however, and are designed to be reused. What about model reuse? That can be good as well, but the caveats about decomposition above still apply. So maybe you use a model which produces features from a user profile as inputs to your model. Fine, but you should version the model you depend upon and not blindly upgrade without assessment or retraining. Reusing the internals of another model is especially dangerous as most machine learning models are not identifiable, i.e., have various internal symmetries which are not determined by the training procedure. Couple an embedding to a tree, for instance, and when the next version of the embedding is a rotation of the previous one, you can watch your performance go to crap immediately.
Basically, model reuse creates strong coupling between components which can be problematic if one component is changed.
Testing I find the role of software testing in machine learning to be the trickiest issue of all. Without a doubt testing is necessary, but the challenge in using something like property-based testing is that the concept that is being captured by the machine learning component is not easily characterized by properties (otherwise, you would write it using non-ml software techniques). To the extent there are some properties that the ml component should exhibit, you can test for these, but unless you incorporate these into the learning procedure itself (e.g., via parameter tying or data augmentation) you are likely to have some violations of the property that are not necessarily indicative of defects.
Having a “extra-test” data set of with minimal acceptable quality is a good idea: this could be easy examples that “any reasonable model” should get correct. There's also self-consistency: at Yahoo they used to ship models with a set of input-output pairs that were computed with the model when it was put together, and if the loaded model didn't reproduce the pairs, the model load was cancelled. (That should never happen, right? Surprise! Maybe you are featurizing the inputs using a library with a different version or something.)
Monitoring the metrics (proxy and true) of deployed models is also good for detecting problems. If the proxy metric (i.e., the thing on which you actually trained your model and estimated generalization performance) is going south, the inputs to your model are changing somehow (e.g., nonstationary environment, change in feature extraction pipeline); but if the proxy metric is stable while the true metric is going south, the problem might be in how the outputs of your model are being leveraged.
Unfortunately what I find is many software systems with machine learning components are tested in a way that would make traditional software engineers cringe: we look at the output to see if it is reasonable. Crazy! As machine learning becomes a more pervasive part of software engineering, this state of affairs must change.