When using Machine Learning to solve a problem, having the right data is crucial. Unfortunately, raw data is often “unclean” and unstructured. Natural Language Processing (NLP) practitioners are familiar with this issue, as all of their data is textual. And because most Machine Learning algorithms can’t accept raw strings as inputs, word embedding methods are used to transform the data before feeding it to a learning algorithm. But this is not the only scenario where textual data arises: it can also take the form of categorical features in standard non-NLP tasks. In fact, many of us struggle with the processing of these kinds of features, so are word embeddings of any use in this case?

This article aims to show how we were able to use Word2Vec (Mikolov et al., 2013), a word embedding technique, to convert a categorical feature with a high number of modalities (distinct values) into a smaller set of easier-to-use numerical features. These features were not only easier to use but also successfully captured relationships between the modalities, similar to how classic word embeddings do with language.