Boundary Detection in Natural Language

Does it often happen to you when you get a text “Let’s eat Eduard” and start wondering whether your friend has joined University of Cambridge Cannibalism Club? Well, worry no more, because Andi’s project is going to put all the missing punctuation in and eliminate your worries. This Wednesday he gave us a presentation on the topic.

We now (hopefully) see that missing punctuation, or in more scientific terms, missing segment boundaries, can be harmful. However, why is this problem important? Surely, if a human omits crucial piece of punctuation from the written text, it’s his own fault. However, there is another type of obtaining text, apart from writing, which has grew really popular recently. I’m talking about speech recognition. Whether you use Siri, Cortana, Alexa or “OK, Google”, they largely have the same challenge – from an infinite stream of words they want to break them down to segments to obtain original meaning. Of course, the other way to obtain punctuation is to carefully consider pauses in speech, but we’re going to assume we do not have that information available.

The project is split into four stages as shown below:


First, we need to gather data. For this purpose, British National Corpus was used, containing over 100 million of segments with annotated words. Some extra processing was required to remove some noise, unhelpful to the project. Finally, the corpus was filtered down to include only written text; omitting some other types of text, like meeting minutes, which were not verified and could contain grammatical and linguistic errors.

Then, we move on to training a model. For this purpose, the n-gram model was used. As Andi explained, it is a type of probabilistic language model for predicting the next item in such a sequence in the form of a (n − 1) order Markov model. In other words, we use the previous (n – 1) words to predict the next one.

Now, we can perform classification in this model. We can now estimate the probability of whether the next word or punctuation follows (n – 1) words based on the training set. After that, we can use that information to predict the punctuation by using the beam trees. Essentially we want to consider all possible branches (punctuation or next word) pruning branches with very small probability. This is beneficial, since it allows us to delay the decision point if the probabilities are similar, and only prune when the continuation of the branch was deemed to be improbable. You can see the example of this in the Figure below, when we delay the prediction whether “cat” is followed by “eats” or by a punctuation in a bi-gram model.


Finally, we perform evaluation of the model. There are two commonly used metrics – recall and precision. Recall measures how many actual punctuation we have identified. Precision measures how many punctuations we identified were actual. However, neither of these metrics is good on its own – it’s easy to get a perfect recall by having punctuation everywhere and it’s easy to get a perfect precision by generating only really obvious punctuations. To solve this, the F-score is introduced, which combines these two metrics into a single one.