Machine learning (ML) can enhance many business activities and address numerous organizational goals. ML capabilities can be used, for example, to recommend items to consumers based on previous purchases, provide image recognition for video surveillance, detect spam email, and predict courses of action, routes, or illnesses, among other things. However, with the exception of large high-tech companies such as Google and Microsoft, development of ML capabilities is still primarily a research activity or a standalone project in most organizations today, and there is a scarcity of existing guidance to help organizations develop these capabilities.
The fragility of ML components and their algorithms limits their integration into applications. They are vulnerable to changes in data, which can cause their predictions to shift over time, and they are also hampered by mismatches between system components.
For example, if an ML model is trained on data that differs from data in the operational environment, the ML component’s field performance will be drastically diminished. In this blog post, which is adapted from a paper that my SEI colleagues and I presented at the 2019 Association for the Advancement of Artificial Intelligence (AAAI) fall symposium (Artificial Intelligence in Government and Public Sector), I describe how the SEI is developing new methods to detect and prevent mismatches in ML-enabled systems so that ML can be used to drive organizational improvement with greater success.
Machine Learning Mismatches and Their Causes
One of the most difficult aspects of building complex systems is integrating all the components and resolving any mismatches between them. In systems that include ML components, the origins and impact of these mismatches can differ from those encountered in traditional software integration.
Beyond software engineering concerns, the knowledge required to deploy an ML component inside a system typically comes from fields other than software engineering. As a consequence, the assumptions and even the descriptive language used by practitioners from these different fields can exacerbate the issues involved in integrating ML components into larger systems. We are examining several types of mismatch in ML systems integration in order to surface the implicit assumptions made by practitioners in different disciplines (data scientists, software engineers, and operations personnel) and to develop practices that communicate the required information explicitly. To field ML components effectively, we need to identify the mismatches that exist and establish procedures that reduce their consequences.
Misaligned Points of View in Machine Learning Enabled Systems
As depicted in Figure 1, an ML-enabled system is a software system that depends on one or more ML software components to offer needed capabilities.
In Figure 1, the ML component receives processed operational data from one software component (i.e., the upstream component) and produces an insight that is consumed by another software component (i.e., the downstream component). One challenge with these systems is that their development and operation involve three perspectives, with three distinct and often completely separate workflows and people.
1. The model is created by the data scientist: Figure 2 depicts the data scientist’s workflow, which consists of taking an untrained model and raw data, using feature engineering to generate a collection of training data, using that data to train the model, and repeating these stages until a set of suitable models is produced. The candidate models are then evaluated against a set of test data, and the one that performs best on a set of defined evaluation metrics is selected. The output of this workflow is a trained model.
2. The trained model is integrated into a larger system by the software engineer: As depicted in Figure 3, the software engineer’s workflow is to take the trained model, integrate it into the ML-enabled system, and test the system until it passes all tests. The ML-enabled system is then handed off to operations staff for deployment.
3. The system is deployed, operated, and monitored by operations personnel: As shown in Figure 1, operations staff are responsible not only for operating and monitoring the ML-enabled system but also for operating and monitoring the operational data sources (e.g., databases, data streams, data feeds).
These three perspectives operate independently and often use distinct vocabularies. As a consequence, there is a risk of misalignment between each perspective’s expectations about the elements of the ML-enabled system—the non-human entities participating in the training, integration, and operation of ML-enabled systems—and the actual guarantees offered by each element. This problem is worsened by the fact that system elements evolve separately and at different rates, which can lead to inadvertent mismatch over time. Furthermore, these three perspectives often reside in three separate organizations.
Misbehavior and expensive rework are caused by the late discovery of implicit and inconsistent assumptions about the ML and non-ML elements of ML-enabled systems. Some examples follow:
- The system performs poorly because the computing resources used during model testing differ from the computing resources used during operations. This is an example of a computing-resource mismatch.
- Model accuracy is low because the model training data differs from the operational data. This is a data-distribution mismatch.
- Because the trained model’s input/output is incompatible with operational data types, a substantial amount of glue code must be written. This is caused by an application programming interface (API) mismatch.
- The system fails in operation because it was not adequately tested: developers were unable to replicate the testing performed during model development. This is a test-environment mismatch.
- Existing monitoring tools cannot detect diminished model accuracy, the performance metric defined for the trained model. This is a metric mismatch.
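The data-distribution mismatch above can be illustrated with a minimal drift check. This is a sketch, not part of the SEI work: the two-standard-deviation threshold, the use of a simple mean comparison, and all variable names are illustrative assumptions.

```python
from statistics import mean, stdev

def distribution_shift(train_values, ops_values, threshold=2.0):
    """Flag a possible data-distribution mismatch when the operational
    mean drifts more than `threshold` training standard deviations
    from the training mean. A deliberately crude heuristic."""
    mu, sigma = mean(train_values), stdev(train_values)
    drift = abs(mean(ops_values) - mu) / sigma
    return drift > threshold

# Toy feature values for a single input feature.
train = [10.0, 11.0, 9.5, 10.5, 10.2]   # data the model was trained on
ops_ok = [10.1, 10.4, 9.8]              # operational data that matches
ops_shifted = [25.0, 26.5, 24.8]        # operational data that drifted

print(distribution_shift(train, ops_ok))       # False: no mismatch
print(distribution_shift(train, ops_shifted))  # True: mismatch flagged
```

In practice a proper statistical test (e.g., a two-sample Kolmogorov–Smirnov test) would be used per feature, but the principle is the same: compare what the model was trained on against what it sees in the field.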
Based on the above, we define an ML mismatch as a problem that emerges during the development, deployment, and operation of an ML-enabled system as a result of incorrect assumptions made by various stakeholders about system elements, resulting in a negative consequence. In our scenario, the stakeholders are a data scientist, a software engineer, and a release (operations) engineer. The root cause of an ML mismatch can be traced to information that, had it been shared among stakeholders, could have prevented the problem.
Large high-tech businesses, such as Google and Amazon, have tackled the ML mismatch problem by assigning all three roles for a given AI/ML microservice to a single team. Although this method does not make the assumptions more transparent, it does address some of the communication issues. In essence, constructing an AI/ML model is a statistical problem that is relatively quick and inexpensive; deploying, updating, and sustaining the models and the systems that incorporate them is a difficult and costly engineering task.
The SEI is developing machine-readable descriptors for ML-enabled system elements to detect and prevent mismatches. The purpose of these descriptors is to ease ML system adoption by codifying system-element attributes and thereby making all assumptions explicit. The descriptors would then be used manually by system stakeholders for information, awareness, and evaluation, as well as by automated mismatch detectors at design time and runtime in cases where system attributes lend themselves to automation.
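To make the idea of machine-readable descriptors concrete, the following sketch pairs two hypothetical descriptors with a simple automated check. All field names (`output_type`, `accepted_types`, and so on) and component names are illustrative assumptions, not part of the SEI descriptor specification:

```python
# Hypothetical descriptors for two elements of an ML-enabled system:
# a trained model and the downstream component that consumes its output.
trained_model = {
    "name": "image-classifier-v3",
    "output_type": "float32[10]",
    "test_accuracy_metric": "top1_accuracy",
}
downstream_component = {
    "name": "alert-service",
    "accepted_types": ["float32[10]", "int64"],
    "monitored_metrics": ["latency_ms"],
}

def detect_mismatches(model, consumer):
    """Compare descriptor fields and report any mismatches found."""
    problems = []
    # API mismatch: the model's output type must be consumable downstream.
    if model["output_type"] not in consumer["accepted_types"]:
        problems.append("API mismatch: output type not accepted downstream")
    # Metric mismatch: the metric used to evaluate the model must be
    # among the metrics that operations actually monitors.
    if model["test_accuracy_metric"] not in consumer["monitored_metrics"]:
        problems.append("metric mismatch: model metric is not monitored")
    return problems

for problem in detect_mismatches(trained_model, downstream_component):
    print(problem)
```

Here the output types align, so no API mismatch is reported, but the model’s evaluation metric is not monitored in operations, so a metric mismatch is flagged; once assumptions are codified this way, such checks become mechanical.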
The following are some of the advantages we foresee from this work:
- Definitions of mismatch serve as checklists as ML-enabled systems are built.
- Recommended descriptors give stakeholders (e.g., government program offices) examples of information to obtain or requirements to impose.
- Methods for validating ML-enabled system-element attributes provide ideas for verifying information supplied by third parties.
- Identifying which attributes lend themselves to automated detection defines new software components for ML-enabled systems.
Through this work, we also intend to answer questions such as the following:
- What are the most common forms of mismatch that arise in the end-to-end development of ML-enabled systems?
- How can we adequately document data, models, and other ML system parts to discover mismatches?
- What are some instances of mismatches that may be found automatically using machine-readable descriptors for ML system elements?
Approach and Validation
Our technical approach for developing and validating ML-enabled system-element descriptors is divided into three phases.
Phase 1: Information Gathering: As shown in Figure 4, this phase involves two concurrent activities. In one activity, we use practitioner interviews to elicit examples of mismatches and their consequences. In the second activity, we identify attributes currently used to describe elements of ML-enabled systems by mining project descriptions from GitHub repositories containing trained and untrained models, along with a review of the white and gray literature. This multi-modal approach captures how both practitioners and researchers describe elements of ML-enabled systems.
Phase 2: Analysis: Figure 5 depicts the tasks in this phase. After eliciting the mismatches and the attributes of ML-enabled system elements, we map them to each other, creating a first version of the spreadsheet shown in Figure 6. We identify the set of attributes that could be used to detect each mismatch and formalize the mismatch as a predicate over those attributes, as shown in the Formalization column of Figure 6. For example, Mismatch 1 occurs when the value of Attribute 1 plus the value of Attribute 2 is greater than the value of Attribute 5.
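A predicate like the Mismatch 1 example can be expressed directly in code once attribute values are available from descriptors. This is a minimal sketch; the attribute names and values are placeholders standing in for the generic Attributes 1, 2, and 5 of Figure 6:

```python
def mismatch_1(attributes):
    """Mismatch 1 holds when attribute_1 + attribute_2 exceeds attribute_5,
    mirroring the formalization 'Attr1 + Attr2 > Attr5'."""
    return (attributes["attribute_1"] + attributes["attribute_2"]
            > attributes["attribute_5"])

# Attribute values as they might be elicited from system-element descriptors.
elicited = {"attribute_1": 4, "attribute_2": 3, "attribute_5": 6}
print(mismatch_1(elicited))  # True: 4 + 3 > 6, so the mismatch is flagged
```

Formalizing each mismatch as such a predicate is what makes automated detection possible: a tool only needs the attribute values, not human judgment, to evaluate it.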
The next step is to perform a gap analysis to identify mismatches that do not map to any attributes and attributes that do not map to any mismatch. We then extend the mapping based on our domain knowledge, potentially adding mismatches that could be detected from the available attributes. Finally, in a data-source and feasibility-analysis task, we determine for each attribute
- the data source (who provides the value),
- the feasibility of collecting the value (is it reasonable to expect someone to provide it, or is there a way to automate its collection?),
- how it can be validated (if validation is needed to ensure that the provided value is correct), and
- the potential for automation (can the set of detected attributes be used in scripts or tools to detect the mismatch?).
At the end of the analysis phase, we have an initial version of the spreadsheet and an initial set of descriptors derived from it.
Phase 3: Evaluation: As shown in Figure 7, in this phase we re-engage the practitioners interviewed during the information-gathering phase to validate the mapping, data sources, and feasibility assessments. The evaluation goal is to reach 90 percent agreement on the artifacts produced during the analysis phase. We will also produce a small-scale demonstration of automated mismatch detection by identifying mismatches in a project that can be detected through automation and writing scripts to detect them. The expected outputs of this effort are the validated mapping between mismatches and attributes, a set of descriptors derived from that mapping, and instances of those descriptors.
Looking Ahead to Next Steps
As a consequence of this work, we anticipate that communities will start building tools for automatically detecting ML mismatches and that organizations will begin integrating mismatch detection into their toolchains for building ML-enabled systems. As a step in this direction, we are working on the following artifacts:
- A collection of attributes for ML-enabled system elements
- A mapping of mismatches to attributes (spreadsheet)
- An XML schema for each descriptor (one per system element), including XML samples of descriptors
- A small-scale demonstration (scripts) of automatic mismatch detection
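To suggest what an XML descriptor instance might look like, the sketch below parses a hypothetical descriptor for a trained-model element. The element and attribute names are illustrative assumptions, not the SEI schema:

```python
# Parse a hypothetical XML descriptor for a trained-model system element.
# Element and attribute names here are invented for illustration only.
import xml.etree.ElementTree as ET

descriptor_xml = """\
<trained-model name="image-classifier-v3">
  <training-data distribution="imagenet-subset"/>
  <evaluation metric="top1_accuracy" value="0.91"/>
  <api input-type="float32[224,224,3]" output-type="float32[10]"/>
</trained-model>
"""

root = ET.fromstring(descriptor_xml)
print(root.tag, root.get("name"))                # trained-model image-classifier-v3
print(root.find("evaluation").get("metric"))     # top1_accuracy
```

An XML schema would constrain such instances so that mismatch-detection scripts can rely on every descriptor exposing the same attributes in the same places.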
Through the interviews I described in the information-gathering phase, we plan to socialize the concept of mismatch and convey its importance for the deployment of ML-enabled systems into production; elicit and confirm mismatches in ML-enabled systems and their consequences from people in the field; and obtain early feedback on the study protocol and resulting artifacts. If you are interested in sharing your cases of mismatch or in creating tools for mismatch detection, we welcome you to contact us at firstname.lastname@example.org.