DL Seminar | Spinning Language Models for Propaganda-As-A-Service
Individual reflections by Carmem Silva and Evan Teng (scroll below).
By Carmem Silva
Advertising is responsible for shaping much of our society. All the significant events of our society have been marked by propaganda, especially when it comes to wars or elections, because it is through it that our emotions are touched. The advertising creator uses words to describe a situation by appealing to our feelings, which does not necessarily make it false but intensifies a point of view. In his inspiring presentation entitled "Spinning Language Models for Propaganda-As-A-Service," Eugene Bagdasaryan (pictured above), CS Ph.D. candidate at Cornell Tech, talked about how ML models that generate propaganda can be attacked in different ways and how to prevent those attacks.
He started by explaining how advertising works. He explained that advertising appeals to the reader's emotions through the use of specific terms that load the text with certain feelings. Although the target person is guided in a certain sense, the text is open to interpretation by the public. Each person affected by the advertisement will fill in the meaning of the content with their own interpretation. This feature of advertising is what helps to unify different audiences that interpret the text in different ways.
Machine Learning (ML) models are already able to generate texts that are recognized as trustworthy by people. Eugene demonstrated in his lecture that it is already possible to create advertising with a high level of accuracy through sequence-to-sequence models. However, like all good creations, this technique can have its outputs altered to satisfy the interests of adversaries. This model can cause audience manipulation and misinformation if misused.
Eugene explained a type of attack he called Model Spinning. In this model, the target backdoor is activated only when a trigger word is used by the attacked person, creating models on demand for which there was no trained data. And in that sense, he presented two types of threats:
1. Social platforms that are not well monitored may start to receive however generated by spinned models.
2. Adversary can contaminate an ML supply chain. Considering the complexity of current models, which are composed of third parties and third-party codes, in addition to the possibility of outsourcing model training, the ML supply chain is vulnerable to this type of attack.
Eugene introduced the concept of a meta-backdoor. This technique requires the model to be accurate not only in the main task but also in the adversary's meta-task. Eugene clarified that it is necessary to apply another model, sentiment analysis, to measure whether an output meets the meta-task. Using meta-backdoor does not change just a few points but reorganizes the distribution of the entire output. Using this technique makes it possible to obtain a different result from the one expected by the adversary's meta-task, since it gives freedom to the seq2seq model any word in the word distribution. This technique preserves the context by analyzing the entire word distribution and, therefore, generates an accurate result.
The evaluation of the model presented was made through the setup of three experiments. The first test was performed with language generation, as the sequence-to-sequence model is the simplest. The first test was successful.
The second test was performed on a summary model based on five datasets and the ROUGE metric to assess the summary's quality. The summarization occurred according to the given meta-task.
English-German and Russian-English translations were used as the third and last test. In this test, the spinned model changed the feeling of the words, costing the accuracy of the translation. Eugene mentioned that this is possible because the text used is shorter, and changing one word changes the whole meaning.
To conclude, model spinning is a threat to ML models. Not only because it can contaminate an entire ML training chain but also because it can spread misinformation and manipulation of attack targets. The new technique presented by Eugene for training the models makes the outputs satisfy a given "meta-task" and not be tied to trigger words.
By Even Teng
AI and machine learning has been a hot topic in recent years, and there has been much talk and debate on how it can perform tasks even better than humans, as well as replace jobs. In this particular seminar, Bagdasaryan dives into the idea of using natural language processing to summarize news articles and create captivating titles. He then describes the possibilities of how these same models can be altered to elicit certain emotions and be used as propaganda as a service.
What is propaganda, and how does it work?
Bagdasaryan prefaces the discussion of spinning language models for propaganda by defining what propaganda is. The three key aspects of propaganda is that it appeals to the emotions, attacks not-at-issue content, and is not necessarily false. A common, modern example of using words to appeal to emotions is the usage of the title ‘essential works’ to encompass Uber drivers, food delivery service people, and other underpaid gig workers. The label ‘essential’ that is given to these workers is not necessarily false, but at the same time praises and emphasizes their importance to daily life. Bagdasaryan then connects this concept to natural language processing, and how these models can be used effectively as propaganda.
Using NLP to make an AI write summaries and titles, and problems this may pose.
Many people will not take the time to read an entire news article, and will default to reading a summary, or in other cases, just the title. Using natural language processing techniques, news and media outlets can use machine learning algorithms to train a language model and essentially make an AI write the summaries and titles for them. However, attackers with malicious intent could alter the language model to output something that they want, by changing the output word distribution. The changes would be discreet, slightly incorrect, but still plausible. This in turn poisons the data, but preserves the context. This can be done by combining the original language model with an adversarial meta-task model with fixed weights, leading to some fixed sentiment. As a result, the attacker can choose a sentiment that they would like the title or summary to elicit, such as positive, negative, toxic, entailment, and success/failure. The attacker would then make sure that the summary itself is good, and that it matches the meta-task.
What can we do about these attacks?
In terms of defending against these attacks, the defender only has blackbox access, and must be able to identify if there is a meta-task without actually knowing what the meta-task is. However, Bagdasaryan explains how the defender can measure the effects of candidate triggers to the outputs, and statistically compute whether it is likely that a meta-task model was in play with high confidence. In conclusion, sequence to sequence models can generate propaganda at scale, especially so given how easy it is to customize meta-task models and generate trigger words. However, the proposed defense mechanism can still detect meta-task models in blackbox settings. In the future, I see a world where AI-generated news articles, summaries, and titles are regulated and screened by such a defense mechanism to combat adversarial attacks targeting the elicitation of certain sentiments. I also see opportunities and potential for such targeted language models to be used for good, to reduce panic or spread positivity to the public.