
Review of ML attacks

ML models are exposed to a number of unique attack vectors that are uncommon or low-risk in other types of software. Although the full scope of these attack vectors is still unknown, taxonomies of machine learning failure modes now exist that highlight their commonalities and show how many of them overlap with traditional software failure modes. The following is a brief review of the attacks relevant to Knox ML Model Protection:

  • Adversarial samples — The purpose of an adversarial sample is to force an AI/ML model to produce incorrect outputs. Since AI/ML models are often trained on data labeled by people, adversarial samples typically aim to generate inputs that the AI/ML model classifies differently than a person would. These attacks are commonly categorized by the level of access the attacker has to the model.

    Blackbox adversarial samples require only API-level access to the AI/ML model. Such attacks typically need thousands of queries to succeed, but they become more efficient over time: attackers can use query results to build local models that determine offline whether a new attack is likely to succeed, and the attack becomes easier still when the API reveals additional information, such as confidence scores. A minimal sketch of such a query-based attack appears after this list.

    Whitebox attacks require direct access to the AI model, including its architecture and parameters. Although there is significant work on defending AI models against whitebox attacks, even the best-studied defenses are, at best, of questionable effectiveness.

  • Model inversion — The purpose of a model inversion attack is to extract sensitive information about the training data by querying a model. An ML model encodes information about its training data in its trained parameters and output behavior. In the whitebox setting, attackers reconstruct training data from the model's parameters; in the blackbox setting, they query the model and reconstruct training data from its outputs.

    Both whitebox and blackbox attacks can be mitigated using differentially private training. Differentially private training, however, is difficult to implement in practice because it requires additional training data, reduces model accuracy, and may lead to unstable training.

    Membership inference attacks are closely related to model inversion attacks. In a membership inference attack, the attacker aims to determine whether a specific data record was included in the model's training data; a simple confidence-threshold version of this attack is sketched after this list.

  • Model extraction — In a model extraction attack, an attacker creates their own copy of a machine learning model by exploiting access to the victim's model. Since a whitebox attacker can trivially clone a model by inspecting its architecture and parameters, model extraction attacks focus on blackbox scenarios. Blackbox attackers use queries against the victim's machine learning model to generate a labeled dataset, infer the model's architecture, and approximate its parameters. An attacker can often clone a victim's model using far less data than was required to train the original; the model-stealing sketch after this list illustrates the basic query-and-train loop.

    Defenders try to detect the theft of machine learning models by embedding a watermark into the model's parameters, which can be revealed either by querying the model or by inspecting its parameters. Researchers have demonstrated, however, that whitebox attackers may remove these watermarks or prevent the model's owner from verifying them.
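
To make the blackbox adversarial-sample setting concrete, the following is a minimal sketch of a query-based attack loop. It assumes a hypothetical query_model function that stands in for API-level access and returns only a predicted label; the random-perturbation search shown here is illustrative only and far less query-efficient than published attacks.

```python
# Minimal sketch of a blackbox adversarial-sample search (illustrative only).
# `query_model` is a hypothetical stand-in for API-level access to a deployed
# classifier: it accepts an input array and returns only a predicted label.
import numpy as np

def find_adversarial_sample(query_model, x, epsilon=0.05, max_queries=5000, seed=0):
    """Randomly perturb `x` within an L-infinity ball of radius `epsilon`
    until the model's predicted label changes, or the query budget runs out."""
    rng = np.random.default_rng(seed)
    original_label = query_model(x)
    for _ in range(max_queries):
        perturbation = rng.uniform(-epsilon, epsilon, size=x.shape)
        candidate = np.clip(x + perturbation, 0.0, 1.0)  # keep inputs in a valid range
        if query_model(candidate) != original_label:
            return candidate  # small perturbation, but the model now misclassifies it
    return None  # no adversarial sample found within the query budget
```

If the API also returned confidence scores, the loop above could keep the perturbation that most reduces confidence in the original label at each step, which is why exposing scores makes such attacks more efficient.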
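
The following is a minimal sketch of a blackbox membership inference test based on confidence thresholding. The predict_proba function and the fixed threshold are assumptions for illustration; practical attacks usually calibrate the threshold with shadow models trained on similar data.

```python
# Minimal sketch of a confidence-threshold membership-inference test.
# `predict_proba` is a hypothetical blackbox interface that returns the
# model's class probabilities for a single input.

def membership_score(predict_proba, x, true_label):
    """Return the model's confidence in the record's true label.
    Overfitted models tend to be more confident on training members."""
    probabilities = predict_proba(x)
    return probabilities[true_label]

def infer_membership(predict_proba, x, true_label, threshold=0.9):
    """Guess 'member' when confidence in the true label exceeds the threshold."""
    return membership_score(predict_proba, x, true_label) >= threshold
```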
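
Finally, the following is a minimal sketch of blackbox model extraction, assuming a hypothetical query_victim function that labels a batch of inputs. The surrogate architecture here is a simple logistic regression chosen for brevity; real attacks also infer the victim's architecture and use far more query-efficient sampling strategies.

```python
# Minimal sketch of a blackbox model-extraction (model-stealing) attack.
# `query_victim` is a hypothetical API that returns predicted labels for a
# batch of inputs; the attacker trains a local surrogate on those labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_model(query_victim, n_queries=2000, n_features=20, seed=0):
    """Build a surrogate classifier from the victim model's predicted labels."""
    rng = np.random.default_rng(seed)
    queries = rng.uniform(0.0, 1.0, size=(n_queries, n_features))  # synthetic probe inputs
    stolen_labels = query_victim(queries)  # the victim answers every query
    surrogate = LogisticRegression(max_iter=1000)
    surrogate.fit(queries, stolen_labels)  # distill the victim's behavior into a local copy
    return surrogate
```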

The consequences of adversarial samples, model inversion, and model extraction attacks can be severe for both the developers of AI models and their users. They include the inability to trust AI outputs, large-scale privacy violations, and the difficulty of protecting intellectual property.
