Poisson frequency models in glmnet

Simon

2021/08/29

Requirement

The glmnet package is an implementation of “Lasso and Elastic-Net Regularized Generalized Linear Models”, which applies a regularisation penalty to the model coefficients to reduce overfitting. In practical terms it can be used for automatic feature selection, as the coefficients of uninformative features are shrunk to exactly 0. glmnet has built-in support for log-link Poisson models, which are commonly used by actuaries for claim frequency and claim count models.

Unlike the glm function in R, glmnet does not accept data.frame objects as input and instead requires a model matrix. This brief tutorial shows how to fit a Poisson frequency model with glmnet in R using a sample insurance dataset.

Suggested references

Required packages

Notable packages

Inputs

Outputs

Testing a conversion of a simple data.frame to a sparse model matrix

This step is optional but provides background on how to create the model matrix and how a data.frame is converted to a matrix, a step that is also required in many other machine learning applications. The primary change is that factors and factor-like variables such as characters are one-hot encoded; for example, the column a_factor, which can be a, b, or c, is converted to the indicator columns a_factora, a_factorb and a_factorc, each of which takes a value of 0 or 1. A variable not classed as a factor, e.g. c_numeric, is treated as a numeric value in the model fit. A sparse matrix stores only the non-zero entries (zeros print as “.”), which reduces the object size when most entries are 0.

The method used here is the function sparse.model.matrix. In the formula, ~. means take all the columns within the dataset df, and -1 excludes the generated intercept column; the intercept is not required here because the glmnet fit generates its own intercept estimate. glmnet versions 3.0+ also include the makeX function, a wrapper that performs a similar job to sparse.model.matrix; note that, unlike the formula interface, it keeps every factor level rather than dropping a reference level (compare the b_chard column in the two outputs below).

library(Matrix)

# define a simple data.frame
a_factor <- factor(rep(c('a','b','c'), 2))
b_char <- rep(c('d','e','f'), 2)
c_numeric <- c(1:6)
df <- data.frame(a_factor,b_char,c_numeric)

# standard method
sparse.model.matrix(~. -1, df) # ~. includes all columns, -1 drops the intercept
## 6 x 6 sparse Matrix of class "dgCMatrix"
##   a_factora a_factorb a_factorc b_chare b_charf c_numeric
## 1         1         .         .       .       .         1
## 2         .         1         .       1       .         2
## 3         .         .         1       .       1         3
## 4         1         .         .       .       .         4
## 5         .         1         .       1       .         5
## 6         .         .         1       .       1         6
# glmnet makeX function
library(glmnet)
makeX(df, sparse = TRUE)
## 6 x 7 sparse Matrix of class "dgCMatrix"
##   a_factora a_factorb a_factorc b_chard b_chare b_charf c_numeric
## 1         1         .         .       1       .       .         1
## 2         .         1         .       .       1       .         2
## 3         .         .         1       .       .       1         3
## 4         1         .         .       1       .       .         4
## 5         .         1         .       .       1       .         5
## 6         .         .         1       .       .       1         6
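To make the size claim concrete, the sketch below compares the sparse and dense encodings of a larger one-hot-encoded factor. The variable names and dimensions here are illustrative only, not part of the tutorial dataset.

```r
library(Matrix)

# a larger example so the storage difference is visible
set.seed(1)
big_factor <- factor(sample(letters, 1e5, replace = TRUE))
df_big <- data.frame(big_factor)

m_sparse <- sparse.model.matrix(~. -1, df_big)
m_dense  <- model.matrix(~. -1, df_big)

object.size(m_sparse) # stores only the non-zero entries
object.size(m_dense)  # stores every cell, zeros included
```

With 26 levels the dense matrix stores 26 numbers per row while the sparse matrix stores roughly one, so the dense object is many times larger.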

Prepare the data for the model

We create a new data.table called dt_feature, which is a cleaned-up version of the features we want to include in our model. Note that data.table is used here as a personal preference; dplyr or a plain data.frame would work as well. For claim count GLMs we also need to create an offset value, which acts as a scaling multiplier for the earned exposure, i.e. a 1.5-year policy should have 1.5x the predicted claims. Refer to the reference document for a more detailed explanation of offsets and an example from SAS.
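The scaling behaviour of the offset can be shown in a couple of lines. With a log link, adding log(exposure) to the linear predictor multiplies the expected count by the exposure. The linear predictor value below is a made-up number purely for illustration.

```r
# with a log link, an offset of log(exposure) scales predictions
# linearly with exposure
eta <- -2.0                  # hypothetical linear predictor x'beta
exp(eta + log(1.0))          # expected claims for 1 year of exposure
exp(eta + log(1.5))          # exactly 1.5x the 1-year figure
```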

library(data.table) # used as a preference for this example
library(insuranceData)
library(glmnet)
library(Matrix)

# import the data and process the variables for analysis
data(SingaporeAuto)
dt <- setDT(SingaporeAuto)
rm(SingaporeAuto)
dt_feature <- dt[,  .(
                    SexInsured=factor(SexInsured),
                    Female=factor(Female),
                    Private=factor(PC),
                    NCDCat=factor(NCD),
                    AgeCat=factor(AgeCat),
                    VAgeCat=factor(VAgeCat)
                    )]

# create the sparse feature matrix from the factor columns
m_feature <- sparse.model.matrix(~., dt_feature)
m_offset <- log(dt[["Exp_weights"]])
m_counts <- dt[["Clm_Count"]]

Fit glmnet using cross validation

k-fold cross validation splits the data into k parts in order to help select the optimal regularisation factor lambda. There are two lambda values provided by default from the model fit object:

- lambda.min, the lambda giving the minimum mean cross-validated error; and
- lambda.1se, the largest lambda whose error is within one standard error of that minimum, which gives a more parsimonious model.

Because the cross validation folds are sampled at random, the results may vary upon each execution of the program. For reproducibility purposes the random number generator seed has been defined.

set.seed(12345) # reproducible random numbers
cvfit = cv.glmnet(m_feature, m_counts, offset = m_offset, family = "poisson")
cvfit
## 
## Call:  cv.glmnet(x = m_feature, y = m_counts, offset = m_offset, family = "poisson") 
## 
## Measure: Poisson Deviance 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.000775    35  0.3523 0.009777      17
## 1se 0.015221     3  0.3619 0.010464       1
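Both candidate penalties can also be read directly off the fitted object, which is convenient when the chosen lambda needs to be logged or reused elsewhere:

```r
# the two candidate penalties stored on the cv.glmnet fit object
cvfit$lambda.min # lambda with the lowest cross-validated deviance
cvfit$lambda.1se # largest lambda within one SE of that minimum
```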

The Poisson deviance at each value of lambda is shown on the plot.

plot(cvfit)

The model coefficients can be shown as follows for both lambda values.

cbind(coef(cvfit, s = "lambda.1se"), coef(cvfit, s = "lambda.min"))
## 23 x 2 sparse Matrix of class "dgCMatrix"
##                     s1          s1
## (Intercept) -1.9876630 -1.70480890
## (Intercept)  .          .         
## SexInsuredM  .          0.09709391
## SexInsuredU  .          .         
## Female1      .         -0.02808719
## Private1     .          .         
## NCDCat10     .         -0.26573355
## NCDCat20     .         -0.38844888
## NCDCat30     .         -0.28366046
## NCDCat40     .         -0.62968611
## NCDCat50     .         -0.61479835
## AgeCat2      .          .         
## AgeCat3      .          0.05642882
## AgeCat4      .          .         
## AgeCat5      .         -0.12825621
## AgeCat6      .          0.34274113
## AgeCat7      .          0.41041089
## VAgeCat1     .          0.17485142
## VAgeCat2     .          0.39802069
## VAgeCat3     .          0.13762082
## VAgeCat4     .         -0.22433971
## VAgeCat5    -0.1279539 -0.95229715
## VAgeCat6     .         -1.18753843
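For a rating table, actuaries usually quote multiplicative relativities rather than log-scale coefficients. A sketch of that conversion is below; note that coefficients shrunk to 0 by the lasso become a relativity of exactly 1 (no adjustment), and the names are taken from the coefficient matrix above.

```r
# convert log-link coefficients to multiplicative relativities
cf <- coef(cvfit, s = "lambda.min")
relativities <- exp(as.vector(cf))
names(relativities) <- rownames(cf)
round(relativities, 3)
```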

The predicted number of claims can be calculated as follows. Note that predict returns predictions on the link (log) scale by default, so exp is applied to convert them to expected claim counts. For brevity only the first 10 values are shown using the head function.

head(exp(predict(cvfit, newx = m_feature, newoffset = m_offset, s = "lambda.min")), 10)
##     lambda.min
## 1  0.091457290
## 2  0.077588767
## 3  0.068967793
## 4  0.112736861
## 5  0.066156960
## 6  0.092822266
## 7  0.025219947
## 8  0.031499199
## 9  0.006279252
## 10 0.018940695
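A quick sanity check on a frequency model is to compare total predicted claims against total observed claims over the fitting data. The penalised fit is not guaranteed to balance exactly the way an unpenalised GLM with a canonical link does, but the totals should be close:

```r
# actual vs expected claim counts over the whole dataset
pred <- exp(predict(cvfit, newx = m_feature, newoffset = m_offset,
                    s = "lambda.min"))
sum(pred)     # total expected claims
sum(m_counts) # total observed claims
```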