By Paula Barmaimon | MS Student, Cornell Tech
Imagine being able to generate hourly updates on disasters using live, streaming information from the people experiencing them. This is what Kathy McKeown, Professor in Computer Science and Founding Director of Columbia's Data Science Institute, has been researching with her team at Columbia University. In her own words, "as NLP has matured and as the amount of data available on the internet has grown, my group has become interested in looking at seeing how we can use that data to help address societal problems." In her Digital Life Seminar talk on September 17, Dr. McKeown discussed three main areas in which the team has been working toward this purpose: disaster live updates, global information access, and methods to help prevent gang violence. We will talk about the first one in depth.
Motivated by what happened during Hurricane Sandy, McKeown's students at the time expressed interest in developing models that might help in similar situations in the future. The team's goal in this area was to analyze the subjective posts of everyday people and identify sentiment and emotion, in order to understand where problems persisted and where help was needed. While Sandy happened in New York, other natural disasters around the world, like the earthquake in Nepal and the tsunami in Indonesia, happened in areas that usually use what we call low resource languages, that is, languages for which no good data is available and for which machine translations may not be of high quality.
In order to identify sentiment and emotion in low resource language posts, the team transferred sentiment models from English to other languages: first training the sentiment model on a high resource language like English, with posts labeled as positive, negative or neutral, and then evaluating the model on the low resource language.
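The train-on-English, evaluate-on-target idea can be illustrated with a toy sketch. This is not the team's model: it swaps in a small Naive Bayes classifier, and a hypothetical bilingual dictionary (`TARGET_TO_ENGLISH`) stands in for whatever cross-lingual representation was actually learned. All data below is invented for illustration, only two of the three labels are used, and Spanish appears as the target language purely for readability.

```python
import math
from collections import Counter

# Hypothetical labeled English posts (the high resource side).
ENGLISH_TRAIN = [
    ("water is rising and we need help", "negative"),
    ("power is out and roads are flooded", "negative"),
    ("shelter is open and everyone is safe", "positive"),
    ("thank you volunteers for the food", "positive"),
]

# Hypothetical bilingual dictionary; an assumption standing in for a
# learned cross-lingual representation.
TARGET_TO_ENGLISH = {
    "necesitamos": "need", "ayuda": "help", "calles": "roads",
    "inundadas": "flooded", "gracias": "thank", "voluntarios": "volunteers",
    "todos": "everyone", "seguros": "safe",
}

def train(examples):
    """Fit a Naive Bayes model on labeled English text."""
    word_counts, label_counts = {}, Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, tokens):
    """Return the most probable label, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(label_counts[label] / total)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def to_shared_space(text, dictionary):
    """Map target-language words into the English feature space."""
    return [dictionary[w] for w in text.lower().split() if w in dictionary]

model = train(ENGLISH_TRAIN)  # trained on English only
target_posts = ["necesitamos ayuda calles inundadas",
                "gracias voluntarios todos seguros"]
predictions = [predict(model, to_shared_space(p, TARGET_TO_ENGLISH))
               for p in target_posts]
print(predictions)
```

The point of the sketch is only the shape of the setup: no labeled target-language data is used for training, so everything the classifier knows about sentiment comes from English.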
The main question that arises is: what data is available for learning? It will be hard to find an in-domain parallel corpus between English and a low resource language when talking about Twitter posts. Is there an out-of-domain parallel corpus that could be used? The Bible is a typical example, with abundant data in many different languages. But are the topics and the vocabulary appearing in the Bible the same as the ones appearing in Twitter posts? Certainly not. A third source to consider is comparable corpora, which may be, for example, similar articles written in different languages, even if they are not exact translations. If none of the above are available, monolingual corpora could be used, that is, data that is only available in the low resource language.
Another important question arises: are translations from English to these low resource languages the best possible input, or could translations from some other language perform better?
A BiLSTM model was used in experiments with 17 languages belonging to 5 main language families. The transfer model did better than the baseline methods (a rule-based method using sentiment scores from a sentiment lexicon, and majority class), and performed slightly below the supervised method. Not surprisingly, transfer worked best for languages within the same family, for those that shared morphological complexity, and for those that borrowed vocabulary from each other. In terms of corpora, in-domain parallel corpora worked best, followed by out-of-domain parallel corpora, then comparable corpora, and lastly monolingual corpora.
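For readers unfamiliar with the two baselines mentioned above, here is a minimal sketch of what a majority-class baseline and a lexicon-based rule baseline typically look like. The lexicon entries, scores and threshold are hypothetical; the talk did not specify which lexicon or rules were used.

```python
from collections import Counter

# Hypothetical sentiment lexicon: word -> polarity score (an assumption).
LEXICON = {"safe": 1.0, "thank": 1.0, "help": -0.5,
           "flooded": -1.0, "fire": -1.0}

def majority_class_baseline(train_labels):
    """Always predict the most frequent label seen in training."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda text: most_common

def lexicon_baseline(text, lexicon=LEXICON, threshold=0.0):
    """Sum the lexicon scores of the words and apply a simple rule."""
    score = sum(lexicon.get(w, 0.0) for w in text.lower().split())
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

always_majority = majority_class_baseline(["negative", "negative", "positive"])
print(always_majority("everyone is safe"))    # ignores the text entirely
print(lexicon_baseline("roads are flooded"))
```

Neither baseline learns anything about the target language, which is why they serve as a floor: a transfer model is only worth its cost if it beats them.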
Being able to identify sentiment in posts was a great achievement, but the ultimate goal was to provide updates on disasters from personal narratives. The input came from news and webpages about a particular event over a specific time t, and the goal was to track events and subevents for that time. For example, for Hurricane Sandy, subevents included the Manhattan blackout, the Breezy Point fire and the public transit outage. The data consisted of web crawls provided by NIST covering 11 different categories, including protests, climate disasters and terrorist events. The team worked on predicting salience for the input sentences first, then removing redundant sentences, and finally clustering and selecting exemplar sentences for that time t.
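The three steps (salience, redundancy removal, clustering with exemplar selection) can be sketched in a few lines. Everything here is a simplification under stated assumptions: salience is approximated by keyword overlap rather than a learned predictor, similarity is plain Jaccard overlap of words, clustering is a greedy single pass, and the thresholds, keywords and sentences are all hypothetical.

```python
def tokens(sentence):
    return set(sentence.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def salience(sentence, keywords):
    """Stand-in salience score: share of words that are event keywords."""
    toks = tokens(sentence)
    return len(toks & keywords) / len(toks) if toks else 0.0

def update_for_time_t(sentences, keywords, min_salience=0.2,
                      dup_threshold=0.8, cluster_threshold=0.3):
    # Step 1: keep salient sentences, most salient first.
    scored = [(salience(s, keywords), s) for s in sentences]
    salient = sorted((p for p in scored if p[0] >= min_salience),
                     key=lambda p: -p[0])
    # Step 2: drop near-duplicates of sentences already kept.
    unique = []
    for score, s in salient:
        if all(jaccard(s, kept) < dup_threshold for _, kept in unique):
            unique.append((score, s))
    # Step 3: greedy clustering; the first (most salient) member of each
    # cluster serves as its exemplar.
    clusters = []
    for score, s in unique:
        for cluster in clusters:
            if jaccard(s, cluster[0][1]) >= cluster_threshold:
                cluster.append((score, s))
                break
        else:
            clusters.append([(score, s)])
    return [cluster[0][1] for cluster in clusters]

keywords = {"blackout", "power", "fire", "subway", "flooded"}
sentences = [
    "power blackout in lower manhattan",
    "power blackout in lower manhattan tonight",
    "lower manhattan has no power",
    "fire spreads in breezy point",
    "i had coffee this morning",
]
summary = update_for_time_t(sentences, keywords)
print(summary)
```

With this toy input, the off-topic sentence is filtered out in step 1, the near-duplicate blackout report is dropped in step 2, and step 3 yields one exemplar per subevent: one for the blackout and one for the fire.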
We can only imagine the use cases this could have, not only in big disaster scenarios, where frequent, detailed and up-to-date information is crucial, but also in smaller communities where news is not always reported, or not updated frequently enough. Personally, I have found myself in many situations where I relied on Facebook groups or Slack channels to learn exactly where the fire near me was coming from, which streets were flooded near me or which train was not running. Peer-to-peer information often arrives faster and more reliably, when it is not the only source of information available. So why not use it to our advantage?