This is my thesis, titled "Classification of cancer pathology reports with Deep Learning methods". It was presented to obtain the PhD degree in Smart Computing (https://smartcomputing.unifi.it/) at the University of Florence.
Natural Language Processing (NLP) is the discipline concerned with designing methods that process text. Machine Learning (ML), of which deep learning is a part, is the discipline that studies and implements methods that learn to make predictions from data. In recent years, many different ML methods have been proposed in the context of NLP. This work focuses on text classification methods in particular. Cancer registries collect pathology reports from clinical data sources and combine them with administrative data sources to identify cancer diagnoses in a specific area. Here we present a large-scale study of deep learning methods applied to cancer pathology reports written in Italian. In this study we developed several classifiers to predict ICD-O topography and morphology codes. We compared classic machine learning approaches, i.e. Support Vector Machine (SVM), with recent deep learning techniques, i.e. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Furthermore, we compared recent attention-based and hierarchical techniques, e.g. Bidirectional Encoder Representations from Transformers (BERT), with a simpler hard attention method, showing that the latter is sufficient to perform slightly better in this specific domain.