Machine Learning in Insurance

Abstract: The Machine Learning Working Party of the CAS identified the sparsity of published research in an insurance context as one barrier to entry for actuaries interested in machine learning (ML). The purpose of this paper is to provide references and descriptions of current research to act as a guide both for actuaries interested in learning more about this field and for actuaries interested in advancing research in machine learning.


INTRODUCTION
Bolstered by improvements in computing power and innovations in key algorithms, machine learning (ML) has been experiencing tremendous growth and expansion in many fields during the last decade. However, adoption of ML algorithms by the insurance industry has been comparatively slow, and the ultimate role of ML in actuarial practice has yet to be determined.
There are several reasons for this slow pace. Commonly cited issues include lack of available computing power, lack of knowledge, regulatory scrutiny, lack of adequate data, privacy challenges, difficulty of interpretation, and communication challenges. Nevertheless, the industry has begun to overcome these challenges in some areas, leading to increased use of ML algorithms in some domains.
ML is often discussed as a unified field, though it covers many diverse and technically distinct models and algorithms and draws from many areas such as computer science, statistics, mathematics, and bioinformatics. The methods at the center of ML are united by the concept of using a very flexible and general model, applying that model to some "training" data set, and then adjusting model parameters to find a suitable optimum for a given function (typically, though not always, with the goal of minimizing a "loss function"). Broadly speaking, ML may offer advantages over Generalized Linear Models (GLMs) or other traditional statistical models, both in terms of its ability to automatically capture non-linear relationships in data to produce more accurate models and in terms of its flexibility to take on many different functional forms.
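The generic recipe described above, a flexible model fit to "training" data by iteratively adjusting parameters to minimize a loss function, can be sketched in a few lines. This is a toy illustration only: the model is just a line, the data are synthetic, and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Toy sketch of the ML recipe: a flexible model (here a line), "training"
# data, and iterative parameter updates minimizing a loss function
# (mean squared error) by gradient descent. All data are synthetic.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, 200)  # true slope 2, intercept 1

w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    pred = w * x + b
    grad_w = 2.0 * np.mean((pred - y) * x)  # d(loss)/dw
    grad_b = 2.0 * np.mean(pred - y)        # d(loss)/db
    w, b = w - lr * grad_w, b - lr * grad_b
# w and b now approximate the least-squares fit of the training data
```

The same loop structure, with a far more flexible model in place of the line, underlies much of the ML discussed in the remainder of this paper.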
Generally, uses for ML may include:
1. Developing predictive variables to use in other methods (often referred to as "feature engineering" in ML research); this includes clustering as well as development of non-linear transformations or combinations of data elements.
2. Determining appropriate binning or clustering of variables for use in other models.
3. Dimensionality reduction for high-dimensional data.
4. Identifying non-linear relationships between variables and a predictor with minimal modeling assumptions.
5. Prediction based on sparse data sets.
6. Development of computationally tractable approximations to intractable traditional models.
In recognition of the potential for ML to provide value to actuarial science, this document seeks to provide a survey of current research in actuarial applications of ML, with a particular focus on applications of ML to Property & Casualty (P&C) insurance. This paper may therefore serve as a guide to individuals and practitioners interested in applying machine learning to a particular area, or it may serve to help researchers identify areas in which additional studies would be of greatest benefit.

ENHANCING TRADITIONAL METHODS
In this section, we focus on the use of ML algorithms to enhance traditional models, particularly with respect to clustering and binning of variables. This partial reliance on machine learning has the distinct advantage of reaping many of the predictive benefits of machine learning while retaining many familiar statistical tools that are useful for diagnosing and understanding models.
For example, ML models are often deterministic and non-parametric, and it is not always straightforward to calculate a probability or information criterion with an ML model. Some popular GLM software packages on the market rely on binning continuous variables rather than directly modeling continuous variables. In some cases, it is also preferable to bin continuous variables in order to explore the existence of non-linear relationships.
To these ends, Henckaerts et al. (2018) use generalized additive models (GAMs) to motivate binning of continuous variables for use in GLMs. In particular, they begin by developing a model of pure premium using the flexible GAM framework to model spatial and continuous variables. The next step is binning. For spatial effects, they explore the use of four different binning methods and test the impact of different numbers of bins on the goodness-of-fit of the GAM. For continuous variables, Henckaerts et al. (2018) apply evolutionary trees to test sequential splits of continuous variables to maximize goodness of fit subject to a constraint that no bin can be too small. These binned categories can then be applied in the context of a more familiar GLM.

Dai (2018) similarly uses tree-based models to perform spatial clustering for application in a GLM. The paper considers Gradient Boosting Machines (GBMs) and random forests as options for clustering using tree-based algorithms. Both GBMs and random forests work by combining the predictions of multiple smaller models; however, GBMs work by iteratively generating many very small (or "weak") decision trees, whereas random forests work by simultaneously generating many independent, larger decision trees. The author notes that random forests are easier to train and tune, easier to parallelize, and more robust to overfitting, but GBMs tend to outperform them in prediction if carefully tuned. The author also notes that, compared to traditional GLMs, tree-based models have no assumption of model structure, simplify the use of interactions, assist in dealing with missing values, and provide for built-in variable ("feature") selection.
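To make the tree-based binning idea concrete, here is a minimal sketch using an ordinary regression tree on synthetic data. This is not the evolutionary-tree algorithm of Henckaerts et al. (2018); the variable names, simulated frequency curve, and tree parameters are all illustrative, with `min_samples_leaf` standing in for the "no bin too small" constraint.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: claim frequency with a non-linear age effect (illustration only).
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, 5000)
freq = np.exp(-3.0 + 0.3 * np.abs(age - 45.0) / 10.0) + rng.normal(0.0, 0.01, 5000)

# Fit a shallow tree; its split points become candidate bin edges, with
# min_samples_leaf enforcing a minimum bin size.
tree = DecisionTreeRegressor(max_leaf_nodes=5, min_samples_leaf=250)
tree.fit(age.reshape(-1, 1), freq)

# Internal nodes hold split thresholds; leaves are marked with feature == -2.
edges = sorted(t for t, f in zip(tree.tree_.threshold, tree.tree_.feature) if f != -2)
bins = np.digitize(age, edges)  # categorical age variable for use in a GLM
```

The resulting `bins` variable can be entered as a factor in a familiar GLM, retaining the diagnostic tools of that framework while letting the tree discover the non-linear shape of the effect.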
Both papers might be described as "feature engineering": using machine learning to develop independent variables (e.g., binned categories or clusters) that can be useful in prediction tasks (e.g., ratemaking or reserving). In such applications, ML algorithms offer the advantage of being extremely flexible and, often, non-parametric. These features of ML algorithms assist in revealing unintuitive and non-linear predictive relationships. Such relationships may be missed by traditional GLMs, which typically require more direct effort on the part of the modeler to intentionally choose specific transformations of a variable, rather than searching over a broad function space to find what fits best.

LIFE INSURANCE
While the insurance industry has been tentative about adopting ML, the finance sector appears to have embraced it more readily. For this reason, many proposed applications of ML within the insurance industry come from life insurance, which shares features with finance that may make transfer of these techniques more natural. In some cases, these techniques may also be relevant to certain domains within property and casualty insurance.
One such area is mortality risk. Mortality risk changes over time for a given population, and multiple models exist for projecting mortality risk for individual populations. However, it is reasonable to expect that related populations might have related mortality risks, or that changes affecting one population may be related to changes affecting others in a systematic, if non-linear, way. In some instances, modeling such multi-population mortality risks may present a challenging or intractable optimization problem that requires significant judgment. Neural networks are a natural fit to address these kinds of problems. Taking a different approach, Gan (2013) uses a combination of the k-prototypes clustering algorithm and Gaussian process models (also known as "kriging"). K-prototypes is algorithmically similar to the well-known k-means clustering algorithm, but uses a more generalized measure of "distance" between two points that can reflect distances between both continuous and categorical variables, making it somewhat more flexible.
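The mixed-type distance at the heart of k-prototypes can be sketched in a few lines. The attribute names, record values, and `gamma` weight below are hypothetical; the formulation (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches) follows the standard k-prototypes dissimilarity.

```python
import numpy as np

def kproto_distance(x_num, x_cat, y_num, y_cat, gamma=1.0):
    """k-prototypes dissimilarity: squared Euclidean distance on numeric
    attributes plus gamma times the number of mismatched categorical
    attributes."""
    num_part = float(np.sum((np.asarray(x_num) - np.asarray(y_num)) ** 2))
    cat_part = sum(a != b for a, b in zip(x_cat, y_cat))
    return num_part + gamma * cat_part

# Two hypothetical records: (age, premium) numeric; (gender, region) categorical.
d = kproto_distance([35.0, 1200.0], ["F", "urban"],
                    [40.0, 1100.0], ["F", "rural"], gamma=0.5)
# d combines a numeric gap of 25 + 10000 with one categorical mismatch
```

Clustering then proceeds as in k-means, except that cluster "prototypes" carry a mean for each numeric attribute and a mode for each categorical one.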

SOLVENCY MONITORING
Solvency monitoring is important for insurance companies and insurance regulators to ensure that companies can meet their obligations as they fall due. The National Association of Insurance Commissioners (NAIC) has developed the property-liability Risk-Based Capital (RBC) system to calculate the amount of capital that insurance companies need to hold relative to their retained risk.
Similarly, the Solvency II Directive in the European Union requires EU insurers to hold a minimum amount of capital. RBC systems calculate the regulatory capital requirement by assessing risks such as credit risk, underwriting risk, market and operational risk, and allowing for inter-dependency among these risks. Companies with RBC ratios below certain thresholds are subject to different degrees of regulatory intervention.
These solvency capital requirements may be re-framed in the context of machine learning as (linear) decision boundary problems. The amount of capital held by a company relative to its required solvency capital is a one-dimensional decision boundary. Similarly, systems like the Insurance Regulatory Information System (IRIS) might be seen as using a multi-dimensional decision boundary. By leveraging large amounts of data and capturing non-linearities, machine learning methods may be better able to make accurate predictions, even given the high dimensionality of solvency modeling.

Support Vector Machines (SVMs) have been explored as one promising option for binary classification of companies based on solvency. SVMs work by automatically finding a dividing (hyper-)plane that separates solvent companies from insolvent ones. SVMs can have a linear decision boundary (i.e., a line above which companies are solvent and below which they are insolvent), or they can have an effectively non-linear boundary by mapping inputs into a higher-dimensional feature space using a "kernel method" (or "kernel trick") and finding a dividing hyperplane in that higher-dimensional space, which may correspond to a non-linear boundary in the original space.
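A minimal sketch of this kind of classifier, using scikit-learn's SVC with an RBF kernel on synthetic data: the two "financial ratio" features and the deliberately non-linear solvency rule are invented purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data: two hypothetical financial ratios per insurer, with a
# non-linear "solvency" rule (solvent inside a disc) for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)

# The RBF kernel maps inputs into a higher-dimensional feature space, where
# a separating hyperplane corresponds to a non-linear boundary here.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
acc = clf.score(X, y)  # in-sample accuracy; a linear kernel would do poorly
```

A linear-kernel SVM could not separate these classes, which is exactly the situation the kernel trick is designed to handle.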

Random forests can be designed to provide a probability of insurer default rather than a binary classification of "default" or "healthy." Random forests can also be used to automatically rank the importance of variables in determining insurer solvency. They have an advantage over statistical approaches like logistic regression in their ability to automatically detect and model highly non-linear relationships. However, random forests may be challenging to interpret, and may produce more highly variable results, particularly with sparse data sets.
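Both capabilities, default probabilities and variable-importance ranking, can be sketched on synthetic insurer data. The feature names and the default rule below are hypothetical; the rule is driven mainly by the first feature, so the forest should rank it most important.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: three hypothetical solvency indicators, with "default"
# driven mainly by the first one (illustration only).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.2 * rng.normal(size=500) < -1.0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
p_default = rf.predict_proba(X)[:, 1]   # probability of default, not a hard label
importance = rf.feature_importances_    # automatic variable-importance ranking
```

The probabilities come from averaging votes across the trees, which is what turns the forest's binary members into a graded default score.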

Individual Claim Reserving
Estimating future claim payments is one of the main tasks performed by actuaries on a daily basis. Such estimates are of high value for the insurance companies because they constitute one of the largest liabilities on the balance sheet. It follows that the accuracy and timing of these figures is of primary concern for all stakeholders.
Traditionally, actuaries have employed classical methodologies to perform this task. Such methodologies estimate claim liabilities from aggregate data on insurance claims. These approaches have the advantage of relative simplicity, making them easy to communicate to stakeholders. In addition, by aggregating loss information from many claims, these methods may provide more stable results; however, this stability belies the uncertainty inherent in the classical loss projections. Specifically, less mature years may be subject to considerable uncertainty. Accurate estimates of loss reserves may not be available for months or years after the losses are incurred. This contributes to uncertainty in reserve estimates and, consequently, in the profit of the company and in the available funds. This could lead to delays in important strategic decisions, which could significantly harm profitability and market share.
Machine Learning methods could fill this gap, providing an accurate estimate at a very early stage of the claim indemnification process for individual claims. In addition, ML methods make it possible to take advantage of all available claim information, unlike standard triangle-based methods that only employ information about the timing and amount of claims. Using this additional information can reduce uncertainty in claim estimates, particularly for immature claims where triangle-based methods have comparatively little data on which to base their estimates.
Moreover, ML methodologies are fully flexible, and allow actuaries to consider (almost) any kind of feature information. We are, in fact, not limited to fixed data structures (e.g., triangles, which only provide insights about claim amount, timing, and development). As an example, ML can mine claim description text data to generate new features that can improve model predictiveness.
Another advantage of ML techniques is that such algorithms can operate without extensive user assumptions, as ML techniques can estimate most parameters of interest from the data. In addition, they can update/retrain themselves automatically. ML algorithms can also be deployed in an automatic way in order to achieve an instant estimated ultimate amount when the claim is first reported.
As a result, ML methods can provide considerable savings in terms of both time and money.
Processes such as claim triage can be performed automatically and in a fraction of the time. It is also important to note that ML methodologies can adjust and adapt to changes in observations. ML methods can assist in identifying, studying, and reacting to trends more quickly than traditional methods because such trends can be discovered automatically.
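As an illustrative sketch of the instant-estimate idea described above, a gradient boosting model can map hypothetical claim features available at first report to an ultimate cost. All data, feature names, and the cost formula below are synthetic; this is not any specific paper's method.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical individual-claim features known when the claim is reported:
# claimant age, initial case reserve, and an injury severity score (1-5).
rng = np.random.default_rng(5)
n = 1000
X = np.column_stack([
    rng.uniform(18, 70, n),          # claimant age
    rng.lognormal(8.0, 1.0, n),      # initial case reserve
    rng.integers(1, 6, n),           # injury severity score
])
# Illustrative ultimate cost driven by the reserve and severity (synthetic).
ultimate = X[:, 1] * (1.0 + 0.3 * X[:, 2]) * rng.lognormal(0.0, 0.2, n)

model = GradientBoostingRegressor(random_state=0).fit(X[:800], ultimate[:800])
estimate = model.predict(X[800:])  # instant ultimate estimates for new claims
```

Once trained, such a model can be deployed so that every newly reported claim receives an estimated ultimate immediately, which is the automation advantage discussed above.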

Aggregate Reserving
Notwithstanding the promise of individual claims reserving, machine learning can also provide improvements to aggregate reserving methods by contemplating additional claims information and by capturing uncertainty in claim reserves. Machine learning models that have been used for this purpose include neural networks, random forests, gradient boosting machines, boosted Tweedie models, and Gaussian process regression models.

Neural Networks
Neural networks can be implemented to enhance classical methods such as Mack's chain ladder.

Gaussian Process Regression
A common criticism of link ratio and regression-based models is that they tend to be heavily parameterized for a problem with few degrees of freedom. Gaussian process regression has many favorable features: it is non-parametric, it has implementations in many standard software packages (for instance, Stan), it has a probabilistic interpretation, and input warping automates feature engineering. The probabilistic interpretation allows one to develop a posterior predictive distribution for predicted data points and thereby quantify the uncertainty in estimated values (or the uncertainty arising from potential measurement error in observed values).
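A minimal sketch of Gaussian process regression applied to a single development pattern, using scikit-learn rather than Stan: the paid-loss figures are invented, and the kernel choice (RBF for a smooth development curve plus WhiteKernel for measurement noise) is one reasonable assumption among many.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical cumulative paid losses for one accident year by development age.
dev = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
paid = np.array([100.0, 160.0, 195.0, 215.0, 225.0])

# RBF kernel -> smooth development curve; WhiteKernel -> measurement noise.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(dev, paid)

# Posterior predictive mean and standard deviation at a future development
# age, i.e., a point estimate together with its uncertainty.
mean, std = gp.predict(np.array([[6.0]]), return_std=True)
```

The returned standard deviation is what distinguishes this approach from a point-estimate link ratio: the reserve projection arrives with an uncertainty measure attached.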

INSURANCE FRAUD DETECTION
Statistics and machine learning provide effective technologies for fraud detection and have been applied successfully to detect activities such as money laundering, e-commerce credit card fraud, telecommunication fraud, and computer intrusion, to name but a few.
We describe the tools available for statistical fraud detection and the areas in which fraud detection technologies are most used. The conclusion is not surprising: supervised learning outperforms the other approaches. However, labeled fraud datasets may not be widely available at insurance companies, and label quality may be poor. As a result, unsupervised methods may provide at least some degree of useful information for motivating further investigation.

In another paper, Bauder, Herland, and Khoshgoftaar (2019) evaluate the predictive power of ML methods using both two separate data sets (i.e., distinct training and validation sets) and cross-validation, in which a single data set is divided into smaller training and test subsets, and find that the former provides a more realistic picture of real-world performance. Just as the application of ML in insurance raises regulatory concerns, the application of ML in health care raises ethical issues, which are addressed in Char, Shah, and Magnus (2018).
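The two evaluation schemes compared by Bauder, Herland, and Khoshgoftaar (2019), a hold-out split versus cross-validation, can be sketched as follows; the synthetic features and labels stand in for a labeled fraud data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a labeled fraud data set (hypothetical features).
rng = np.random.default_rng(6)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

# Hold-out evaluation: two separate data sets (train and validation).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_acc = LogisticRegression().fit(X_tr, y_tr).score(X_val, y_val)

# Cross-validation: one data set repeatedly split into train/test folds.
cv_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
```

Both schemes estimate out-of-sample performance; the papers' comparison concerns which estimate better reflects performance on genuinely new data.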

TELEMATICS
The use of telemetry devices has been growing in auto insurance as more carriers introduce telematics products and as more data is collected. Data collected through telematics devices promises to be useful in a variety of insurance applications: pricing, underwriting, claim response and handling, fraud detection, maintenance recommendations, and more.
Pricing is an especially promising application, as telematics variables can accurately capture a driver's behavior (presumably the proximate cause of many claims). This could potentially replace proxy variables like gender or household size. In reality, many programs have yet to incorporate telematics into their pricing; gathering and analyzing the data has been an important first step. Because raw telematics data must be organized into useful predictors, some early research has focused on using machine learning techniques to extract the most information from this raw data.
There are two main flavors of telematics that have emerged: pay-as-you-drive (PAYD) and pay-how-you-drive (PHYD). The main distinction is that PAYD focuses simply on driving habits (e.g., distance, time of day, location), while PHYD adds information about driving style (e.g., speed, braking). Depending on the specific hardware and software used, the raw data collected varies. In general, it includes basic PAYD information such as number of trips, distance and duration of trips, and time of day. It may also collect data about GPS location, speed, acceleration, and turns every second, or it may gather information on road types and conditions. Insurers are challenged to turn this high-frequency, high-dimensional data into useful covariates for predicting loss costs. Several studies have experimented with different techniques for developing these covariates.
Verbelen, Antonio, and Claeskens (2018) is an early study focused on PAYD variables. The authors compared several approaches for modeling claims frequency. Their goal was to evaluate the predictive power of telematics variables as well as compare traditional (time) and telematics (distance) exposure measures. They created four different datasets using third-party liability data from a Belgian insurer: one with only traditional rating variables (e.g. driver age, gender, postal code, and vehicle age) and time as exposure; one with only telematics data (e.g. yearly distance, number of trips, distance on road types, time of day, and day of week) and distance as exposure; one with a combination of traditional and telematics variables and time as exposure; and one with a combination of traditional and telematics variables and distance as an exposure.
The authors used generalized additive models (GAMs) as their framework. GAMs allow for more flexible relationships between continuous predictors and the response, i.e., non-linear effects. Since some of the telematics variables are compositional (i.e., proportions of different categories that sum to 1), the authors developed a novel approach to include these as predictors. Based on the model results, the authors found: that including telematics variables improved the model (but time as exposure was preferred); that differences in gender were explained by driving habits (women drove fewer miles per year); and that time of day and type of road were predictive: driving in the evening is riskier, and driving on urban roads or motorways is riskier than driving on other roads.
A subsequent series of papers investigates PHYD variables and various approaches to corral this high-dimensional data. Wüthrich (2016) started with the idea of visualizing a driver's style by plotting speed-acceleration (v-a) heatmaps. These heatmaps show the distribution of time spent at each combination of speed and acceleration. Drivers can be compared by calculating the dissimilarity between their heatmaps (each pixel of the heatmap has a value that can be used to calculate distance). The author used K-means clustering to categorize similar drivers.
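A hedged sketch of the heatmap-clustering idea on synthetic data: real heatmaps would be built from observed speed and acceleration readings rather than simulated, and the grid size and cluster count below are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for v-a heatmaps: for each driver, the fraction of
# driving time spent in each (speed, acceleration) cell of a 10x10 grid,
# flattened to a 100-dimensional vector that sums to 1.
rng = np.random.default_rng(3)
heatmaps = rng.dirichlet(np.ones(100), size=300)  # 300 hypothetical drivers

# Euclidean distance between flattened heatmaps measures dissimilarity of
# driving styles; K-means then groups drivers into style categories.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(heatmaps)
style_cluster = km.labels_  # categorical driving-style variable per driver
```

The resulting cluster label can then enter a frequency model as a categorical covariate.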
Because the categorical variable resulting from K-means is less desirable than a low-dimensional continuous variate, Gao, Meng, and Wüthrich (2019) explored two additional techniques for turning the v-a heatmaps into useful predictors. The authors compared two Principal Components Analysis (PCA) approaches to the original K-means approach. The first was Singular Value Decomposition (SVD), which is restricted to linear structure; the second, based on a neural network, gives a non-linear analog to PCA. The authors found the SVD and neural network approaches were able to provide sufficient representations of the information in the v-a heatmaps. The benefit of using these techniques in place of K-means is that continuous predictors require fewer parameters than categorical ones, which leads to less over-parameterization, and with continuous variates, new data can be simulated.
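The linear, SVD-based variant can be sketched as follows. The heatmaps are again synthetic, and retaining two components is an arbitrary illustrative choice; the neural-network bottleneck would replace the SVD step with a learned non-linear encoding.

```python
import numpy as np

# Synthetic stand-in for v-a heatmaps (300 hypothetical drivers, 100 cells).
rng = np.random.default_rng(4)
heatmaps = rng.dirichlet(np.ones(100), size=300)

# Centre each pixel, then take a truncated SVD: the leading components give
# low-dimensional, continuous driving-style scores per driver.
Xc = heatmaps - heatmaps.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
features = U[:, :2] * S[:2]  # two continuous covariates per driver
```

Unlike the K-means label, these scores are continuous, so they consume fewer model parameters and can be interpolated or simulated.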
Gao, Meng, and Wüthrich (2019) compared the predictive power of the three techniques described above (K-means, PCA, and the bottleneck neural network) in predicting claims frequency.
The authors used a Poisson Generalized Additive Model (GAM), which allows for non-linear covariate effects. All three approaches improved the out-of-sample deviance of a Poisson GAM predicting claims frequency with initial covariates of driver age and vehicle age, with PCA and the neural network outperforming K-means. The authors believe the PCA and NN approaches were more successful because they are numeric rather than categorical and therefore more granular. They found that the three approaches accounted for some of the same information. Because most accidents occur at low speeds, the authors focused on low speed intervals and longitudinal (straight-line) acceleration rates. The resulting principal component and bottleneck activation features correspond to safer driving (lower acceleration) at low speeds (5-20 km/hr, or about 3-12 mph). For severity modeling, the authors note that high speed intervals and lateral acceleration rates should be investigated.

SUMMARY
In this report, we considered applications of ML in different areas of insurance. All of the papers we discuss confirm the strong predictive power of these tools compared with traditional methods.
However, some concerns remain among practitioners, including the interpretability of results as well as regulatory and ethical issues. Although these techniques have attracted the attention of many researchers, little research addresses these concerns directly. It is the view of this working party that work in these areas can facilitate the use of such powerful tools in the insurance industry.

DISCLAIMER
While this paper is the product of a CAS working party, its findings do not represent the official view of the Casualty Actuarial Society. Moreover, while we believe the approaches we describe are very good examples of the use of machine learning techniques in various insurance contexts, we do not claim they are the only acceptable ones.

Marco De Virgilis is a Senior Actuarial Data Scientist working for Allstate in the Chicago
office, where he is currently working on developing ML solutions to actuarial problems in both ratemaking and reserving. After achieving his MSc in Actuarial Science, he worked as a reserving actuary for Direct Line Group in the London office, reporting on both personal and commercial lines. Following this experience, he worked as a consultant for Deloitte. During this time, he developed a more thorough knowledge of the actuarial market and industry practices. He then worked as a Data Scientist in the R&D department of Unipol, one of the biggest Italian insurance players. During this experience he developed analytics skills linking standard actuarial practices and modern frameworks.
Daniel Lupton is Vice President and Consulting Actuary at Taylor & Mulder, Inc, where he is responsible for reserving, pricing, and risk-focused financial examination work. Mr. Lupton is a Fellow of the Casualty Actuarial Society, a Member of the American Academy of Actuaries, a Certified Specialist in Predictive Analytics, and holds a Master's in Business Administration from the Robert H. Smith School of Business. Mr. Lupton is Chair of the Machine Learning Working Party and sits on the Ratemaking Committee and the Committee on Risk.

Liam McGrath is a Senior Consultant and Data Scientist in Willis Towers
Watson's Insurance Consulting and Technology (ICT) practice. His focus is helping insurers build predictive modeling and artificial intelligence solutions for pricing, underwriting, and claim functions. Liam received a Bachelor of Science degree in mathematics from Carnegie Mellon University. He is an Associate of the Casualty Actuarial Society, a Member of the American Academy of Actuaries, and a Chartered Property Casualty Underwriter. He currently serves on the Machine Learning Working Party. Liam has contributed to presentations and publications in various insurance forums on topics including natural language processing, model interpretability, and ethics of artificial intelligence. He is also presently a mentor in the Physical Sciences Undergraduate Mentoring program at the University of California at Irvine.