Intro

With COVID-19 keeping us all at home, I found myself with almost 2 weeks of vacation time to use up in December 2020. Which I didn’t mind – I like longer staycations, when I can really get into personal projects.

So, when I came across a Reddit post about a competition to predict motor insurance claims, I jumped right in. I’d always been curious about how insurance pricing is done. I suppose I see it as “a path I could have taken”, given I studied statistics.

The training data consisted of 228k annual claim records covering 57k policies over 4 years, with just 23 (anonymized) features to use for prediction.

The main objective was to achieve maximum profit in a market consisting of the rest of the competitors. A profit leaderboard was updated weekly (on different test samples), while immediate feedback was available on a claims RMSE leaderboard. Importantly, competing on the RMSE leaderboard was a simpler matter of predicting claim amounts, while the profit leaderboard, the ultimate goal, had a huge element of pricing strategy.

Some EDA

Claims distribution

Observations

We can see from the figures above that the distribution is very right-skewed. Only 10.2% of all the claims are non-zero, and more than half of those are sub-$1000. A minuscule 0.18% of non-zero claims are over $10,000.

This makes it a bit tricky to predict claims (at least, on an RMSE basis), as non-zero claims don’t represent a large part of the data set AND there’s a handful of massive claims that can easily blow up a squared error.
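As a rough sketch of how those figures can be checked, here's a dplyr summary, assuming the training data sits in a data frame `train` with a `claim_amount` column (both names are my assumptions, not the competition's actual schema):

```r
# Sketch: summarize the skew of the claims distribution.
# `train` and `claim_amount` are assumed names for the training data and claim column.
library(dplyr)

# train <- read.csv("training.csv")   # one row per policy-year (assumed layout)

train %>%
  summarise(
    n_records    = n(),
    pct_nonzero  = 100 * mean(claim_amount > 0),                          # ~10.2%
    pct_under_1k = 100 * mean(claim_amount[claim_amount > 0] < 1000),     # >50% of non-zero claims
    pct_over_10k = 100 * mean(claim_amount[claim_amount > 0] > 10000)     # ~0.18% of non-zero claims
  )
```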

Partway through the competition, they announced that claims in the test set would be capped at $50,000 (that is, we have reinsurance), so that decreased the skew.

Time spent/tools

I didn’t really keep track, but most of the time I spent on this competition was front-loaded, with minor to moderate tweaking afterwards. I estimate:

  • Data exploration, basic code setup (3 hrs)
  • Troubleshooting submission issues (3 hrs)
  • Learning and playing around with new methods (10 hrs)
  • Optimizing, tweaking, experimenting, evaluating (???)
  • Reading the discussion forums (4 hrs)
  • Weekly review of leaderboard results and minor changes during the last few weeks (3 hrs)
  • Attending the Town Hall, reviewing code from top 3 teams/participants (3 hrs)
  • Conducting a post-mortem, writing this blog, and troubleshooting flexdashboard (8 hrs)

Languages/tools/methods

  • R, RStudio, Google Colab
  • rpart, XGBoost, CatBoost, LightGBM, H2O

GitHub repo – coming soon!

Modeling

RMSE-optimization

Like many other competitors, I made the mistake of spending too much time on the RMSE leaderboard, even after it became clear within weeks that there was little headway to be made in trying to predict claim amounts more accurately. While the RMSE leaderboard is no longer available, the situation was such that a model that simply predicted the average claim amount seen in training scored around 504 RMSE (IIRC), while the best sophisticated model anyone could come up with scored around 497 RMSE (or was that 499.7?).

I actually tried a couple of things here that mirrored what some of the top competitors did:

  • Use of Tweedie loss, as the outcome is zero-inflated
  • Two-step prediction (predict claim>0, then claim amount if >0)

As neither of these approaches improved my RMSE, I discarded them fairly early on and didn’t think to revisit them once I started working on pricing…
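For reference, here is roughly what those two setups look like in R, sketched with xgboost (which has a built-in Tweedie objective); `train_x`, `train_y`, and `test_x` are placeholders, and none of the hyperparameters are tuned values:

```r
# Sketch 1: Tweedie-objective GBM for a zero-inflated claim amount.
# `train_x`/`test_x` are numeric feature matrices, `train_y` the (mostly zero) claim amounts.
library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(train_x), label = train_y)

tweedie_model <- xgb.train(
  params = list(
    objective              = "reg:tweedie",
    tweedie_variance_power = 1.5,   # between 1 (Poisson) and 2 (gamma); worth tuning
    eta                    = 0.05,
    max_depth              = 4
  ),
  data    = dtrain,
  nrounds = 500
)
pred_tweedie <- predict(tweedie_model, as.matrix(test_x))

# Sketch 2: frequency-severity two-step (P(claim > 0) times expected severity given a claim).
freq_model <- xgboost(data = as.matrix(train_x), label = as.numeric(train_y > 0),
                      objective = "binary:logistic", nrounds = 300, verbose = 0)
sev_model  <- xgboost(data = as.matrix(train_x[train_y > 0, ]), label = train_y[train_y > 0],
                      objective = "reg:gamma", nrounds = 300, verbose = 0)
pred_two_step <- predict(freq_model, as.matrix(test_x)) * predict(sev_model, as.matrix(test_x))
```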

My final claims-prediction model was a GBM (CatBoost) trained with hyperparameters that I tuned via cross-validation outside of Google Colab (the latter being the only submission method I could get to work). I cleaned up the features a bit and coded a few dummy variables for missing values, but otherwise left the training data as it was. Even though reinsurance capped claims at $50k, I opted to cap claims during training at $10k (only 41 claims were larger than this). My best score on the RMSE leaderboard was 500.79. :(
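A stripped-down sketch of that setup with the catboost R package; the column names and hyperparameter values below are illustrative stand-ins, not my actual tuned configuration:

```r
# Sketch: cap claims at $10k, then train a CatBoost regressor on the cleaned features.
# `train_df`/`test_df` and `claim_amount` are assumed names for the prepared data.
library(catboost)

train_df$claim_capped <- pmin(train_df$claim_amount, 10000)   # only ~41 training claims exceed $10k

features   <- setdiff(names(train_df), c("claim_amount", "claim_capped"))
learn_pool <- catboost.load_pool(data = train_df[, features], label = train_df$claim_capped)

claims_model <- catboost.train(
  learn_pool,
  params = list(
    loss_function = "RMSE",
    iterations    = 1000,     # illustrative, not the cross-validated values
    depth         = 6,
    learning_rate = 0.05
  )
)

pred_claims <- catboost.predict(claims_model, catboost.load_pool(data = test_df[, features]))
```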

Profit-optimization

Many people commented on how challenging profit-optimization was, given that the leaderboard was only updated weekly AND you were competing against others who constantly updated their pricing strategy. It was a highly reflexive, rapidly-changing environment. Unlike real life, you had zero visibility into the premia others were charging. Furthermore, the private test data changed every week, and you were only allowed a single submission each time, making it difficult to compare strategies. Wild gyrations in weekly ranking were seen, even when people did not change their model at all – you could go from -$50k profit on average to rank 1 with $20k profit, just because of how the test data changed!

Honestly, it was a pretty brutal ask. The folks at AICrowd did create a personalized report each week for our profit leaderboard submissions, so we had some idea of what our algos were doing and where they might be failing.

I realized, along with probably everyone else, that the biggest danger and profit-killer was accidentally offering the lowest premium (in a market of 10 randomly selected competitors) for a really large claim. My strategy was essentially to price policies based on the decile of predicted claims that they fell into, with higher multipliers for higher deciles. I did a bit of fine-tuning on a weekly basis to increase my premia for demographics associated with my lowest profits, and to raise my minimum premium to offset inevitable claims, but did not change my underlying claims-prediction model.
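In code, that pricing rule was roughly of the following shape; the multipliers and minimum premium here are illustrative numbers, not the values I actually submitted:

```r
# Sketch: decile-based loading on predicted claims, with a floor on the premium.
# `pred_claims` is the vector of model-predicted claim amounts for the policies being priced.
price_policies <- function(pred_claims,
                           multipliers = seq(1.1, 2.0, length.out = 10),  # illustrative loadings
                           min_premium = 25) {                            # illustrative floor
  # Assign each policy to a decile of predicted claims (1 = lowest risk, 10 = highest).
  decile <- cut(pred_claims,
                breaks = quantile(pred_claims, probs = seq(0, 1, 0.1)),
                include.lowest = TRUE, labels = FALSE)
  # Load each prediction by its decile multiplier, then enforce the minimum premium.
  pmax(pred_claims * multipliers[decile], min_premium)
}

premiums <- price_policies(pred_claims)
```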

Algorithm overview

Post-mortem

How did I do?

So I did pretty poorly on the final profit leaderboard, and I think it boils down to a couple of things:

  1. Abandoning approaches like Tweedie loss and a frequency-severity two-step model early on, even though they made sense, just because they didn’t improve performance on the RMSE leaderboard.
  2. I actually placed 5th on the Week 9 profit leaderboard (Week 11 was the final leaderboard), and for my last submission I mostly stuck with what did well then. In hindsight, given the extreme randomness of the weekly test data, and the fact that we had “lucked out” in Week 9 with test data that had no large claims (>$5k) at all, I should’ve been more cautious. The takeaway for me should have been: “I do great when there are no large claims, i.e. I’m under-estimating risk for large claims, so assign them higher premia [even though I already was].” Given that it was hard to achieve a decent RMSE for predicting claims, I’m not sure this would’ve worked reliably, but it is definitely what I should’ve done.

Below, we see my final evaluation printout. Basically, I “won” too many large claims, tanking my profit.

Final evaluation

What I learned from the Town Hall and the top 3 solutions

AICrowd wrapped up the competition with a Town Hall featuring several presenters from among the top participants, and the presentations were absolutely fantastic. Some things I learned:

  1. About 80% of the almost 200 participants/teams were from the insurance industry, including the top 3 finishers. In retrospect this makes sense, but it surprised me, as I had imagined most of the participants would be new to insurance. (And, as one of the top participants put it, there’s nothing like doing insurance pricing for your job, then doing more of the same in your leisure time. But hey, there WAS prize money.)
  2. Multi-part models are widely used, even in the insurance industry. Stacking or chaining multiple not-too-complex models (e.g. GLMs) is the key, as regulations mean that “black box” models are out of the question. Several of the presentations featured “large claims” (or even claims >$0) models, as avoiding winning those proved to be a key part of making a profit.
  3. While no one discussed which features were most important to their models’ performance, all of the presentations included feature engineering. There were some rather clever features, like (vehicle weight) × (vehicle max speed) or (vehicle weight) × (vehicle max speed)², as proxies for the amount of physical energy the vehicle could impart in a crash.
  4. While it’s impossible to tell whether models that incorporated claims history did better or not, as the test data were not released, it makes sense to use this information where possible. Since policy IDs were supplied in both training and test, and some of the test data consisted of the same policies as the training data (years 1-4 for training, year 5 for test), some people fit separate models for policies with a claims history versus “new customers” (sketched below). It complicated things, but could have given them an edge, and is obviously something that happens in real life.
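As a rough illustration of item 4, here is how the returning-vs-new split might look, assuming hypothetical column names (`id_policy`, `claim_amount`) for the policy identifier and claim amount:

```r
# Sketch: treat returning policies (seen in training years 1-4) differently from new customers.
# `train_df`/`test_df` and the column names id_policy / claim_amount are assumptions.
library(dplyr)

# Per-policy claims history built from the training years.
history <- train_df %>%
  group_by(id_policy) %>%
  summarise(past_claims  = sum(claim_amount),
            ever_claimed = as.integer(any(claim_amount > 0)),
            .groups = "drop")

# Year-5 policies split by whether we have seen them before.
test_returning <- inner_join(test_df, history, by = "id_policy")
test_new       <- anti_join(test_df,  history, by = "id_policy")

# One claims model can then use past_claims / ever_claimed for the returning group, and a
# history-free model can price the new group (being careful to build the history features
# only from years prior to the one being predicted, to avoid leakage).
```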

1st solution: https://github.com/davidlkl/Insurance-Pricing-Game

2nd solution: https://github.com/glep/pricing_game_submit

3rd solution: https://discourse.aicrowd.com/t/3rd-place-solution/5201

Tools I want to start using/improve at

Feature engineering

Feature engineering is an incredibly important step for extracting the maximum value from your data. In hindsight, I suspect it’s something I under-emphasize because the type of data I work with at my day job is high-dimensional and not straightforward to interpret. There’s no way to manually pick out, say, pairs of genes from a pool of 50,000 genes, and multiply or divide or square their expression values in a meaningful way. The high dimensionality means trying every combination is not feasible, at least not with typical sample sizes. Feature engineering ends up taking the form of dimensionality reduction instead, be that through PCA, latent factor analysis, gene set enrichment, etc.
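For what it’s worth, that kind of dimensionality reduction is often just a couple of lines, e.g. PCA on a (hypothetical) samples-by-genes expression matrix `expr`:

```r
# Sketch: PCA as a stand-in for manual feature engineering when there are far too many
# raw features (e.g. ~50,000 genes) to combine by hand. `expr` is an assumed
# samples-by-genes matrix of expression values.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)
pcs <- pca$x[, 1:10]   # first 10 principal components, usable as derived features
```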

Seeing the top contestants manually create obviously meaningful features that a tree-based algorithm would take multiple splits to represent was embarrassingly eye-opening.
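For example, the kinetic-energy proxy mentioned earlier is a one-liner; `vh_weight` and `vh_speed` are my guesses at names for the vehicle weight and max speed columns:

```r
# Sketch: an "energy" feature, proportional to m * v^2, as a proxy for how much damage a
# vehicle could do in a crash. Column names are placeholders for the anonymized features.
train_df$vh_energy <- train_df$vh_weight * train_df$vh_speed^2
test_df$vh_energy  <- test_df$vh_weight  * test_df$vh_speed^2
```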

GAMs, GA2Ms

I’d heard of GAMs before, but never really looked into them. They seem like something that could be useful even for my work, as they don’t introduce too much complexity but are more flexible than GLMs. Importantly, they allow for non-linearity, and there’s really no particular reason to expect gene expression to vary linearly/log-linearly with a given trait.
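A minimal, self-contained example with the mgcv package, using a built-in dataset purely for illustration: each `s()` term gets a smooth (spline) effect instead of a fixed linear slope.

```r
# Sketch: a GAM via mgcv, with smooth terms for each predictor.
library(mgcv)

# Illustrative only: fuel efficiency as a smooth function of weight and horsepower.
gam_fit <- gam(mpg ~ s(wt) + s(hp), data = mtcars, family = gaussian())

summary(gam_fit)            # effective degrees of freedom indicate how non-linear each smooth is
plot(gam_fit, pages = 1)    # fitted smooths with confidence bands
```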

I think the figure here sums up the trade-offs well:

Stacking

Again, this is one of those things I haven’t done a lot of, because the nature of most biomedical research begets small sample sizes. Small sample sizes mean large confidence intervals, which means it can be difficult to show that your complex model outperforms your simple model at a statistically significant level.

During the Town Hall, one presenter mentioned that “black box” models can be more acceptable to regulators if they are but one component in a stacked model (alongside more widely accepted models like GLMs), and their relative weighting can be tweaked. I find that pretty awesome; I don’t know why.
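A toy version of that idea, in which a GLM and a GBM are blended with an explicit weight; `glm_pred`, `gbm_pred`, and `y_holdout` are placeholder holdout-set predictions and outcomes:

```r
# Sketch: a two-model blend, the simplest form of stacking. The base-model predictions
# should come from a holdout or out-of-fold set, otherwise the blend weight will overfit.
blend <- function(glm_pred, gbm_pred, w = 0.3) {
  # w is the weight on the "black box" component; keeping it small and explicit is the
  # kind of knob the Town Hall comment suggested regulators are more comfortable with.
  (1 - w) * glm_pred + w * gbm_pred
}

# The weight itself can be chosen by minimizing holdout RMSE.
best_w <- optimize(
  function(w) sqrt(mean((y_holdout - blend(glm_pred, gbm_pred, w))^2)),
  interval = c(0, 1)
)$minimum
```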

ICE plots

ICE plots are similar to partial dependence plots (which show how a model’s predictions change as the value of a particular feature changes), but for every sample individually. Kind of like SHAP values, which I’ve used for GBMs, but for GLMs?
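Conceptually they’re simple to compute by hand: vary one feature over a grid while holding everything else about each observation fixed, and record every row’s prediction. A bare-bones version (the fitted model here is just a stand-in):

```r
# Sketch: individual conditional expectation (ICE) curves computed by hand.
# `model` is any fitted model with a predict() method, `df` the observations,
# `feature` the column to vary.
ice_curves <- function(model, df, feature, grid) {
  sapply(grid, function(g) {
    df_mod <- df
    df_mod[[feature]] <- g              # same grid value for every row
    predict(model, newdata = df_mod)    # one prediction per observation
  })                                    # rows = observations, columns = grid points
}

# Illustration on a simple model: one line per car, showing how predicted mpg changes with wt.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
mat  <- ice_curves(fit, mtcars, "wt", grid)

matplot(grid, t(mat), type = "l", lty = 1, col = "grey40",
        xlab = "wt", ylab = "predicted mpg", main = "ICE curves")
lines(grid, colMeans(mat), lwd = 3)     # the average of the ICE curves is the partial dependence
```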

I think explainable AI/models in general are really important, because no model is perfect and we need to understand what they are doing, why they are doing it, and how to fix them if need be.