Jessie G Taft
DL Seminar | Evaluating Privacy-Preserving Techniques in Machine Learning
By Sining Zheng and Qi Long
At the Digital Life seminar on February 20, 2020, Eugene Bagdasaryan, a CS PhD student at Cornell University, presented his work on privacy problems and solutions in machine learning. He studies machine learning privacy and security, and his research focuses on federated learning, backdoor attacks, and differential privacy. His presentation focused on three major areas: federated learning vs security, differential privacy vs fairness, and mitigating problems.
He began by presenting the categories of sensitive data at different stages of society: before the 1900s, sensitive data could be personal addresses, photos, or private diaries kept in a notebook; before the 2000s, sensitive data expanded to include passport numbers, social security numbers, and paychecks; now, every internet user’s online behavior, location data, video streams, and even biometrics are regarded as sensitive data whose use might intrude on personal privacy. Machine learning algorithms and models emerged as a way to make use of data on a large scale. Research institutes take advantage of machine learning by collecting large amounts of sensitive data from users to optimize their models and serve people better, but the cost to users’ privacy is also large. After introducing these current struggles, he turned to the first solution.
Federated Learning vs Security
In a typical machine learning scenario, researchers first collect the data they are interested in. Each data point consists of an input x, the features of that data point, and a label y, the target for that data point. The model then uses the features and labels to learn patterns. The goal of a machine learning model is to predict the label based solely on the input features. Once the model is trained, it can serve people by predicting labels from their features. Applications range from a phone’s predictive keyboard to a wide range of industrial sectors, including healthcare, finance, and personal assistance.
Training an accurate model requires access to sensitive data. If the data is not accurate and comprehensive enough, the model loses accuracy. Google’s predictive keyboard might offer personalized service to each user, but in a conventional setup every word users type is sent to Google’s data center to optimize the prediction model. The problem is that some messages are extremely private, and people do not want them to be shared or studied. Federated learning is a new technique that allows training to be done locally on the device, so the data is never sent to the data center. Currently, Google Pixel uses this technique for its predictive keyboard service.
The basic concept of the algorithm is as follows. First, the global model is distributed to a subset of users. Each of these users then trains the model locally on their own device. The resulting model, rather than the data, is sent to the data center. The data center only has to aggregate the submitted models and add them to the global model. This process is repeated many times, so the global model is continually updated through the contributions of individual models from users. The resulting model turned out to be sufficiently accurate.
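The round structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the production algorithm: the one-feature linear model, the helper names `local_train` and `federated_round`, the learning rate, and the synthetic users are all assumptions made for the example.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """Train a tiny linear model y = w*x + b on one user's private data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on one example
            w -= lr * err * x       # plain SGD step
            b -= lr * err
    return (w, b)

def federated_round(global_weights, user_datasets, sample_size, rng):
    """One round: sample users, train locally, average their updates."""
    selected = rng.sample(user_datasets, sample_size)
    # Each user starts from the global model; only the trained model leaves the device.
    local_models = [local_train(global_weights, data) for data in selected]
    # The server averages the differences from the global model...
    avg_update = [
        sum(m[i] - global_weights[i] for m in local_models) / len(local_models)
        for i in range(len(global_weights))
    ]
    # ...and adds the averaged update to the global model.
    return tuple(g + u for g, u in zip(global_weights, avg_update))

rng = random.Random(0)
# Hypothetical users: each holds 20 private samples of y = 2x + 1.
users = [
    [(x, 2 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(20))]
    for _ in range(10)
]
global_model = (0.0, 0.0)
for _ in range(30):
    global_model = federated_round(global_model, users, sample_size=4, rng=rng)
print(global_model)  # should be close to (2, 1)
```

Note that the server function never touches the raw `(x, y)` pairs; it only sees the models returned by the sampled users.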
One kind of backdoor attack involves modifying the training data. For instance, in the MNIST dataset, images with certain features are given different labels. To do this, however, the attacker must have access to the training data. Bagdasaryan then introduced a new type of attack in a self-driving car scenario. Under this attack, a self-driving car functions normally in most circumstances, but when it sees a specific semantic feature, it does not react correctly. Such features could include green cars or backgrounds with stripes. The potential value of such an attack is also large and attractive: some companies may want to inject a marketing campaign.
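The data-modification form of the attack mentioned above (relabeling images that carry a certain feature) can be illustrated with a short sketch. The corner patch used as the trigger, the function names `add_trigger` and `poison_dataset`, and the list-of-lists image format are assumptions for illustration only; they are not taken from the talk.

```python
import random

def add_trigger(image, value=1.0, size=2):
    """Return a copy of the image with a bright patch in the bottom-right corner."""
    poisoned = [row[:] for row in image]  # copy so the clean image is untouched
    for r in range(-size, 0):
        for c in range(-size, 0):
            poisoned[r][c] = value
    return poisoned

def poison_dataset(dataset, target_label, fraction, rng):
    """Stamp the trigger on a fraction of examples and relabel them."""
    result = []
    for image, label in dataset:
        if rng.random() < fraction:
            # Backdoored example: trigger present, label forced to the target.
            result.append((add_trigger(image), target_label))
        else:
            # Clean example: kept as-is so normal accuracy stays high.
            result.append((image, label))
    return result

# Hypothetical 4x4 all-black "images" with digit labels 0-9.
clean = [([[0.0] * 4 for _ in range(4)], i % 10) for i in range(100)]
poisoned = poison_dataset(clean, target_label=7, fraction=0.1, rng=random.Random(0))
```

A model trained on `poisoned` would tend to behave normally on clean inputs while associating the patch with label 7, which is why this kind of attack requires write access to the training data.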
To solve this problem, we can employ a federated learning averaging algorithm. We can modify the training procedure on the attacker’s phone so that it is hard for attackers to hack the input data. For instance, secure aggregation obfuscates users’ submissions for privacy protection. If the attacker cannot understand. Robust techniques and differential privacy reduce the vulnerability of federated learning, but accuracy is the tradeoff.
Differential Privacy vs Fairness
Bagdasaryan opened this part of the talk by answering the question “When should differential privacy be used?” There are many kinds of sensitive data; facial images, private messages, and financial records in particular are so sensitive and private that we do not want to share them with others. Differentially private (DP) model training involves gradient clipping, which prevents exploding gradients, and noise addition, which improves generalization. One drawback is that this technique hurts accuracy on complex data and underrepresented subgroups.
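The two ingredients named above, gradient clipping and noise addition, can be sketched as a single training step. This is a minimal illustration under stated assumptions: the function names `clip` and `dp_sgd_step` and the parameters `max_norm` and `noise_multiplier` are hypothetical, the gradients are plain Python lists, and no privacy accounting (tracking epsilon) is performed.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One update: clip each example's gradient, sum, add Gaussian noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    batch = len(clipped)
    new_weights = []
    for i, w in enumerate(weights):
        total = sum(g[i] for g in clipped)
        # Noise is scaled to the clipping bound, so it masks any single example.
        noisy = total + rng.gauss(0.0, noise_multiplier * max_norm)
        new_weights.append(w - lr * noisy / batch)
    return new_weights

rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]  # hypothetical per-example gradients
weights = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng)
```

In this example the first gradient has norm 5 and is scaled down to norm 1, while the second is left unchanged. Examples with unusually large gradients therefore lose the most signal, which gives some intuition for why harder or rarer examples can be affected more.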
Scaling the model and adding noise reduce a single data point’s influence on the model, but accuracy is the price to pay. An IBM facial recognition example with DP model training indicates that, across different values of epsilon, the accuracy gap between the two categories increases while the accuracy for both groups decreases. The effect is more serious for minorities, for whom the original model is already less accurate. A Twitter text analysis example illustrates that more complex data also means greater accuracy loss. Fair models become unfair, and unfair models become worse.
Mitigating Problems
Local training has the drawback of not helping anyone else’s phone. Federated training, however, does not benefit everyone either, as some people might be more interested in using their own data for modeling purposes. In addition, sectors such as finance and healthcare might not want a global model. Even in the case of language models, people may not benefit from the federated model. Thus, the solution is to combine local training with the federated model to balance the two.
Another research paper, Think Locally, Act Globally, by Paul Pu Liang et al., also addresses the mitigation proposed by Bagdasaryan. The authors propose augmenting federated learning with local representation learning on each device, in order to learn representations from raw data. The global model can thus be smaller, as it operates on the locally learned representations and needs fewer parameters. The paper also demonstrates that combining local and global models reduces variance both in the data and in the data distributions across devices.
Sining Zhong and Qi Leng are MS students in the Connective Media program at Cornell Tech.