Accurate insurance claims prediction with Deep Learning


Insurance companies have a strong interest in predicting the future, because accurate predictions give them a chance to reduce financial losses. A major cause of increased costs is payment errors made while processing claims. Reprocessing the affected claims in turn accounts for a significant portion of administrative costs.

Many other sectors have long recognized the potential of self-learning software and cognitive systems. AI and Machine Learning can help companies optimize their services with higher accuracy, strengthen claims management by systematically identifying and correcting errors, and so provide tools for making better decisions.

Less well known are the opportunities that smart technology opens up for insurers. Machine Learning can help insurers screen cases efficiently, evaluate them with greater precision, and make accurate cost predictions.

The conventional approach to claims management is built on rule-based algorithms. These algorithms are inflexible: once the rules are written, they are applied in the same way to every case. Insurance companies can instead apply cognitive models to analyse and predict insurance costs and to manage claims. These models learn from historical data and find patterns that can be used to further optimise services. McKinsey has estimated that German insurers could save about 500 million euros each year by switching to Machine Learning systems.

In this article we examine how machine learning algorithms developed at LotusLabs significantly improve prediction accuracy, helping insurance companies automate their decision-making processes by learning and generalizing patterns from historical examples.

To showcase how machine learning can make a difference in claim prediction, we discuss a health insurance use case based on publicly available data.

Health Insurance Claims

The dataset consists of 1338 anonymous records of health insurance claims with 7 fields: the age of the policy holder, their gender, their body mass index (BMI), their number of children, whether they smoke, their region of residence, and the individual medical costs billed by the health insurance. The billed costs (charges) are the value we want to predict.

An example of what the data look like is shown in the following table:

Figure 1 - Table.


If we plot the correlation between all the features, we observe a positive correlation between charges and age, BMI and being a smoker. This makes sense, given that smoking and obesity are strong indicators of an unhealthy lifestyle. Although the relationship is plausible, the correlation alone is not strong enough to support firm conclusions at this point.

Figure 2 - Graph.
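
The correlation check described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the analysis code used for the figure: the field names (age, bmi, smoker, charges) are assumed from the feature list, the "yes"/"no" encoding of the smoker flag is an assumption, and the records are invented toy values rather than rows from the dataset.

```python
from math import sqrt

# Toy records invented for illustration only; they are NOT rows from the dataset.
# Field names and the "yes"/"no" smoker encoding are assumptions.
records = [
    {"age": 19, "bmi": 27.9, "smoker": "yes", "charges": 16000.0},
    {"age": 33, "bmi": 22.7, "smoker": "no", "charges": 4000.0},
    {"age": 45, "bmi": 30.1, "smoker": "no", "charges": 8000.0},
    {"age": 52, "bmi": 35.2, "smoker": "yes", "charges": 42000.0},
    {"age": 28, "bmi": 33.0, "smoker": "no", "charges": 4500.0},
    {"age": 60, "bmi": 25.8, "smoker": "no", "charges": 13000.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlations_with_charges(rows):
    """Correlate age, BMI and the smoker flag (encoded as 0/1) with charges."""
    charges = [r["charges"] for r in rows]
    columns = {
        "age": [float(r["age"]) for r in rows],
        "bmi": [float(r["bmi"]) for r in rows],
        "smoker": [1.0 if r["smoker"] == "yes" else 0.0 for r in rows],
    }
    return {name: pearson(values, charges) for name, values in columns.items()}


for name, value in correlations_with_charges(records).items():
    print(f"{name}: {value:+.2f}")
```

A coefficient close to +1 or -1 indicates a strong linear relationship, while a value near 0 only rules out a linear one, which is why the plots that follow are still needed.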

A more in-depth look at the charges, using a joint plot against BMI, is shown in the following figure. There are clear non-linearities in the relationship between the two variables. The data points might fall into two or three groups, but at this point it is still difficult to draw conclusions. We will therefore implement models that can exploit these non-linearities, such as neural networks and tree-based models.

Figure 3 - Graph.

If we further mark the data points for smokers and non-smokers, as shown in the following figure, we observe two clear trends and can easily understand what they represent. The first group, the non-smokers plotted in cyan, shows a flat trend, meaning that for them BMI is not linearly correlated with the charges. The blue group, the smokers, shows a clear and strong trend: being both obese and a smoker strongly correlates with charges, and the higher the BMI, the higher the charges.

Figure 4 - Trends shown in the data between smokers and non-smokers. Obesity does not influence the charges as much as smoking.
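
One way to quantify the two trends is to fit a separate straight line of charges against BMI for each group and compare the slopes. The sketch below uses ordinary least squares in plain Python. The rows are invented toy values chosen to mimic a flat and a steep trend, and the field names are assumptions; it does not reproduce the figure.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slope_by_group(rows, group_key="smoker", x_key="bmi", y_key="charges"):
    """Fit one line per group and return the slope for each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    return {
        group: fit_line([r[x_key] for r in members], [r[y_key] for r in members])[0]
        for group, members in groups.items()
    }


# Invented toy rows (NOT from the dataset): a steep trend for smokers,
# an almost flat one for non-smokers.
toy_rows = [
    {"smoker": "yes", "bmi": 20.0, "charges": 20000.0},
    {"smoker": "yes", "bmi": 30.0, "charges": 35000.0},
    {"smoker": "yes", "bmi": 40.0, "charges": 50000.0},
    {"smoker": "no", "bmi": 20.0, "charges": 5000.0},
    {"smoker": "no", "bmi": 30.0, "charges": 5200.0},
    {"smoker": "no", "bmi": 40.0, "charges": 5400.0},
]

slopes = slope_by_group(toy_rows)
print(slopes)
```

A large gap between the two slopes is the numerical counterpart of what the figure shows visually: the effect of BMI on charges depends on the smoking status, an interaction that a single global linear fit would miss.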

In the following sections we incorporate this information into machine learning models and then show how our deep learning model outperforms manual feature engineering and data analysis.

Our Model

Now that we have visualised and understood the dataset, we can create a model that predicts the cost of claims. To do so, we build a tailored deep learning algorithm that outperforms common machine learning models.

Figure 5 - Schematic comparison of rule-based systems, machine learning and deep learning. Deep learning systems can produce better results with less manual input.

Deep learning is a powerful class of machine learning algorithms that use artificial neural networks to understand and leverage patterns in data. Deep learning algorithms use multiple layers to progressively extract higher level features from raw data: this reduces the amount of feature extraction that is needed in other machine learning methods. The deep learning algorithm learns on its own by recognising patterns using many layers of processing. That is why the “deep” in “deep learning” refers to the number of layers through which the data is transformed. Multiple transformations automatically extract important features from raw data.

This is the opposite of more traditional rule-based methods, where manual input is needed for data analysis, feature extraction and rule creation, which is usually a tedious process.

Figure 6 - Our model schematics.

The core idea in our model is the use of entity embeddings, that is, representing each categorical variable with a vector in a separate, learned set of dimensions.

A categorical input is a variable whose values are distinct categories (or types) with no inherent numerical relationship to each other. Each category (entity) is mapped to an embedding (a vector) in a new set of dimensions, hence the term entity embedding (More on Entity Embeddings in this paper). Think of these dimensions as different characteristics in the dataset. What we find by applying this technique is a hidden (or latent) representation that works for our specific problem. The hidden representation is learned by a neural network during the standard supervised training process. By mapping similar values close to each other in the embedding space, the model identifies patterns in the categorical variables that would otherwise have been difficult to reveal. This means that we can find useful patterns without performing any feature engineering, i.e. without tagging records with extra features and without manually clustering policy holders by smoking status or BMI.
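
To make the mechanics concrete, the sketch below shows only the lookup step of an entity embedding in plain Python: each category is assigned an index, the index selects a row of a weight table, and the resulting vectors are concatenated with the numeric features to form the network input. It is not the LotusLabs model. The class and function names, the embedding sizes and the region labels are hypothetical, and the table holds untrained random values; in a real model a deep learning framework would update these weights by backpropagation during supervised training.

```python
import random


class EntityEmbedding:
    """Lookup table mapping each category of one variable to a vector."""

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        # Assign a stable integer index to every distinct category.
        self.index = {c: i for i, c in enumerate(sorted(set(categories)))}
        self.dim = dim
        # Small random initial weights, one row per category. These are
        # untrained here; training would adjust them to reduce the loss.
        self.table = [
            [rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in self.index
        ]

    def __call__(self, category):
        return self.table[self.index[category]]


def encode_record(record, embeddings, numeric_keys):
    """Concatenate numeric features with the embedding of each categorical one."""
    features = [float(record[key]) for key in numeric_keys]
    for key, embedding in embeddings.items():
        features.extend(embedding(record[key]))
    return features


# Hypothetical category labels and embedding sizes, chosen for illustration.
embeddings = {
    "region": EntityEmbedding(["north", "south", "east", "west"], dim=2),
    "smoker": EntityEmbedding(["yes", "no"], dim=1, seed=1),
}
record = {"age": 40, "bmi": 28.5, "children": 2, "region": "north", "smoker": "yes"}
network_input = encode_record(record, embeddings, ["age", "bmi", "children"])
print(len(network_input))
```

Because every category owns its own trainable vector, categories that behave similarly with respect to the target can end up close together in the embedding space, which is the effect described above.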

Now let us compare our deep learning model against some popular machine learning algorithms (XGBoost, Random Forest) to showcase its predictive accuracy. The metric we use to evaluate the regression models is the mean absolute error (MAE). As seen in the figure below, our deep learning model performs well compared to these classical machine learning models, improving the MAE by 11% in this case.

Figure 7 - MAE error comparison between different models. The lower the error the better.
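
For reference, the MAE is the average absolute difference between predicted and actual charges. The short sketch below shows how it can be computed, and how a relative improvement between two models could be derived from it. Reading the 11% as a relative reduction in MAE against a baseline is an assumption, and the numbers in the example are invented, not the results behind the figure.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all records."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae, model_mae):
    """Fraction by which model_mae is lower than baseline_mae."""
    return (baseline_mae - model_mae) / baseline_mae


# Invented toy values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(mean_absolute_error(actual, predicted))
```

Unlike squared-error metrics, the MAE is expressed in the same unit as the charges, so it reads directly as the average prediction error per claim.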

How can LotusLabs help you?

Building an AI system is clearly a complex undertaking. The right conditions must be in place to ensure that the system also works reliably in day-to-day operations, performing as planned. The factors that determine whether implementation is successful cover all levels of the insurance business.

At LotusLabs we are experts in Machine Learning and AI infrastructure. Our people work with your people, at all levels. Our methods help you find ways to put AI to work.

You want to see AI drive value in every corner of your business. But how do you get started? And how do you get there before your competition? LotusLabs helps you define an AI Roadmap that contains your vision. With the roadmap ready, you can focus on projects with the highest return and least risk.

Transform your business into an AI-driven enterprise by implementing machine learning models that solve complex business problems and drive real ROI on the path toward becoming a fully AI-supported insurer.