At a recent meeting of the Aberdeen NLP/NLG group, I made what I thought was an obvious statement: if we build an NLG system by learning from human-written example texts, we want those texts to be high quality, that is, accurate, readable, and effective. In other words, we want high-quality training data.
Well, I thought this statement was obvious, but the reaction of some of the PhD students showed that it clearly was not. The students all realised that quantity mattered, and that more training data was better than less. But many of them had never thought about quality, or even realised that it was an issue. After all, if we want to teach a computer to write like a person, then we want lots of examples of how people write, and it may not make much sense to categorise individual examples as “good” or “bad”. I also noticed that there didn't seem to be any mention of data quality issues in a book on deep learning which we have been collectively discussing.