1 Introduction

The COVID-19 pandemic is a complex and constantly evolving phenomenon that is affecting the entire world heterogeneously. The disease caused by the spread of this new form of coronavirus has rapidly propagated worldwide, affecting countries with different timing and intensity (Ceylan 2020). Italy was the first country in Europe to be hit by COVID-19, and, to date, it ranks among the countries with the highest fatality toll. Given the diverse geographical spread of the pandemic, local estimates provide an important tool to document the spread of COVID-19. However, the official data on the death toll of COVID-19 are scarce at the local level, and, when available, they are likely to suffer from substantial underreporting (Ghislandi et al. 2020).Footnote 1 Besides, such measurement error is likely to be heterogeneous even within countries (Leon et al. 2020), and this inaccuracy may lead to incorrect beliefs about how COVID-19 spreads across space and over time. A valid alternative for estimating the number of deaths caused by the pandemic (directly or indirectly) consists of considering the number of daily certified all-cause deaths (deaths from any cause, not only those related to coronavirus).Footnote 2 In particular, using historical data on the number of daily certified deaths to estimate the number of daily deaths in the absence of the pandemic could vastly reduce the uncertainty associated with COVID-19 official data, especially at a disaggregated geographical level. This study applies this approach to Italy.

As the pandemic has affected the entire country, it is not feasible to use the most common counterfactual approach that compares treated and non-treated municipalities within the country. In this unusual setting, the benchmark approach to estimate excess mortality — which we will call the “intuitive” approach — adopted by several national and international institutions, and employed in many scientific works (see Section 2), consists of comparing the actual number of cumulative all-cause deaths in 2020 with the numbers observed in the past for the same municipality.Footnote 3 Stated differently, it is a before–after analysis of the changes between post-pandemic mortality and the pre-pandemic average over the past year(s). Yet, being simply an unconditional average, this comparison does not employ any covariates, nor indeed any model, and may be sensitive to outliers and other data issues. Therefore, it may be considered as an excessively coarse approach to generate excess mortality figures, especially when employed at a fine geographical scale.

This paper, instead, makes use of machine learning (ML) and municipality-level panel data from the recent past (from 2015 to 2019) to build the counterfactual scenario. By applying ML algorithms on all-cause deaths’ data, we credibly estimate excess mortality in Italian municipalities (local administrative units, LAU) from February 21, 2020 (the official date of the first coronavirus cluster in Italy), to September 30, 2020.Footnote 4 To this aim, we compare the official number of cumulative all-cause deaths for the period February 21 to September 30, 2020, with an estimate of the number of cumulative deaths over the same period in an “ordinary” situation, i.e., in the counterfactual situation without the pandemic. We then consider the difference between observed mortality and our counterfactual predictions as the cumulative number of excess deaths, which are very likely to be due, either directly or indirectly, to COVID-19. Differently from the intuitive approach, however, our counterfactual scenario is not just a pre-pandemic mortality average but rather an ML-generated prediction for each municipality, generated using only pre-pandemic information, of an alternative, no-pandemic, “business-as-usual” mortality scenario.

It is important to note that we estimate the gross excess mortality due to COVID-19, i.e., the number of deaths due directly to COVID-19 infections as well as deaths due to the collateral effects of the lockdown. Such collateral effects have lowered the likelihood of dying from some causes such as road and workplace accidents, pollution-related diseases, or criminal activities and increased the likelihood of dying owing to the unprecedented stress on the public health system (e.g., severe delays in the hospitalization process).

We report estimates obtained using three estimation approaches: intuitive, ML, and a classical linear method. The intuitive approach sketched above is used exclusively as a benchmark to evaluate the performance of the ML techniques. We then employ three ML algorithms to estimate excess mortality for all municipalities. ML techniques allow predicting mortality trends by “training data,” i.e., by learning from past information, evaluating out-of-sample model performance on unseen “testing data”, and finally comparing predicted and actual values to derive excess mortality figures. Specifically, we employ the least absolute shrinkage and selection operator (LASSO), random forest, and stochastic gradient boosting. Finally, for the sake of comparison, we also test the performance of a straightforward OLS regression.

We show how all these methods outperform, on average, the intuitive approach adopted by Italian institutions with mean squared error (MSE) predictive gains of up to 17.8% and especially sizable improvements in smaller municipalities. Thanks to the higher accuracy of data-driven methods, we improve the estimates of local mortality by providing a more precise and reliable “counterfactual” scenario. In Section 4.2, we make use of local mortality estimates to fully document the local and temporal evolution of the COVID-19 pandemic in Italy during the first wave.

2 Literature review

In this section, we first survey the most relevant studies on excess mortality during pandemics, focusing on their methodological aspects, and then summarize the main findings of papers on mortality during the first wave of the COVID-19 pandemic in Italy.

2.1 Excess mortality estimation: Methodological aspects

Past and recent studies have dealt with pandemics’ effects on mortality trends by estimating the excess mortality across several countries.Footnote 5 In particular, various research analyzed the excess of deaths due to the influenza pandemic of 1918–1920, known as the Spanish flu. The majority of these studies applied intuitive approaches (see, among others, Mortara 1925; Pinnelli and Mancini 1998; Murray et al. 2006; Barro et al. 2020) or more sophisticated techniques in order to estimate excess mortality. For example, Ansart et al. (2009) used a regression model to calculate baseline and compare it with the observed mortality. The intuitive approach was also used by Luk et al. (2001) to compare the age-specific mortality during the Spanish flu with the one during other pandemics (1957 Asian Flu and 1968 Hong Kong Flu) and by Viboud et al. (2016) to analyze the pandemic of 1957-1959 in 39 countries of the world.

Within the European mortality monitoring project (EuroMOMO), a standardized approach aimed at monitoring the mortality excess due to influenza in Europe was developed.Footnote 6 According to this approach, the mortality baseline was calculated applying a Poisson regression model, corrected for overdispersion. This approach was used by Mazick et al. (2010) that compared the all-cause deaths observed in 8 European countries during the 2009 A(H1N1) pandemic, known as the swine flu. Similar models were implemented to estimate excess mortality during the swine flu by Gran et al. (2013) for Norway and Yang et al. (2012) for Hong Kong.

Not surprisingly, as soon as official data on deaths have become available, a large number of scholars have gauged the COVID-19 impact on mortality through intuitive or more sophisticated approaches.Footnote 7 Several scholars have developed country-level analyses that use more sophisticated approaches to estimate excess mortality due to COVID-19. For instance, Felix-Cardoso et al. (2020) used deviation from the expected value from homolog periods (DEV) and the remainder after seasonal time series decomposition (RSTS) considering total and age- and gender-specific excess mortality in five countries (England and Wales, France, Italy, Netherlands, and Portugal). Vestergaard et al. (2020), Fouillet et al. (2020), and Sinnathamby et al. (2020) employed the statistical model developed by EuroMOMO to 24 European states, to France, and England, respectively. A similar model was used by Weinberger et al. (2020) for the USA. Jiang et al. (2020) introduced a self-normalization technique to assess the trajectory of COVID-19 deaths for 30 countries. They designed a two-stage forecasting scheme to predict cumulative deaths in the USA. Li and Linton (2021) implemented a quadratic time trend model to forecast the total number of deaths (and cases) for 191 countries. Pham (2020), likewise for the US situation, dealt with the cumulative number of deaths due to the ongoing COVID-19, based on the five-parameter logistic model. Rivera et al. (2020) proposed a semiparametric method and a conventional time series method and analyzed nine US states. Vandoros (2020), for England and Wales, applied a difference-in-differences approach in order to investigate the number of deaths during the pandemic. Vieira et al. (2020) estimated the excess mortality in Portugal by comparing observed deaths with the average and respective standard deviation of the number of daily all-cause mortality in the past years to estimate mortality by age and cause. Then, they tested an ARIMA model to validate the relevant excess mortality. At a more detailed geographical level and with the use of a model that considers age group and symptom status, there is research conducted by Hauser et al. (2020) in Hubei province, China, and six regions in Europe.

Although many papers use intuitive and counterfactual approaches, only two works, to our knowledge, employ ML techniques to estimate excess mortality, though not in the context of the COVID-19 crisis. Deprez et al. (2017) investigated two classical models for estimating mortality rates in Switzerland and, by applying a regression tree boosting machine, detected different mortality models’ weaknesses. Levantesi and Pizzorusso (2019) extended the work of Deprez et al. (2017) by investigating the ability of ML to improve the accuracy of some standard stochastic mortality models, using not only decision trees but also random forest and gradient boosting.Footnote 8

2.2 Mortality evolution during the COVID-19 pandemic in Italy

Many studies have gauged mortality trends in Italy during the COVID-19 pandemic, applying various methodologies.

First of all, two important contributions by public institutions used the intuitive approach for estimating the impact of the COVID-19 pandemic on total resident population mortality by age and gender. The Italian National Institute of Statistics (Istat) and the National Institute of Health (ISS) have continuously monitored mortality trends in Italy, providing several reports on excess mortality.Footnote 9 As explained above, these reports gauge the excess mortality as differences between observed mortality for all causes and the average for the same period in 2015–2019. Istat and ISS (2020a) analyzed the period January 1 to November 30, 2020, at the provincial and regional level and found around 50,000 excess deaths for the first wave of the pandemic, over 90% of which in northern Italy. Deaths were concentrated (72%) in the over 85 population. On the contrary, they found no relevant changes in the number of deaths for the period from June to September. The second institutional contribution comes from the National Institute of Social Security (INPS) (2020) that compiled a report for the first quarter of 2020, using the intuitive approach but weighting deaths for the resident population. The analysis is conducted at the provincial level with a focus on northern municipalities. The report highlighted that on April 30, the municipalities with the highest excess mortality were those located in the provinces of Bergamo, Brescia, Cremona, Lodi (Lombardy region), and Piacenza (Emilia Romagna).

Besides, many scholars deal with excess mortality at the regional level, provincial level, and municipality level using intuitive or more sophisticated approaches, above in the early phase of pandemic, when data availability was limited.

Regarding the regional level, Modi et al. (2020) use the conditional mean with a Gaussian process and synthetic control method to estimate the weekly excess of deaths for Italian regions up to April 11. They found that excess mortality is considerable in northern regions, and the cumulative deaths are higher than the deaths officially attributed to COVID-19. Investigating provincial-level trends, Ceriani and Verme (2020) compared mortality rates in 2020 with mortality rates in 2015–2019 in February to April. They found that the increase of daily mortality rates (greater for males and older) was more than 100% in mid-February in many Italian provinces (Bergamo, Cremona, Lodi, Piacenza, and Brescia), the same that experienced the highest deaths due directly to COVID-19 in the middle of March 2020. Scortichini et al. (2020) estimated the mortality excess at the provincial level with a two-stage analysis. They calculated the baseline risk by applying a quasi-Poisson regression model taking into account trends and weather conditions. Their results show an increase of deaths equal to 29.5% from the expected mortality in the period 15th of February 15th of May 2020. The majority of excess deaths (71%) were found in the northern regions (Lombardy, Emilia Romagna, and Veneto).

Given the higher incidence of COVID-19 in the north of Italy, many scholars estimated excess mortality at the municipal level, focusing on northern cities. Buonanno et al. (2020), by combining official statistics, retrospective data, and original data (i.e., obituaries and death notices), provided an estimate of excess mortality in Lombardy municipalities. They found that deaths recorded as directly due to COVID-19 constitute only half of the excess deaths verified in March. In comparison, Depalo (2021) applied partial identification to administrative data at the municipality level to estimate the number of deaths, the number of infections, and mortality rates from COVID-19 in Lombardy. He calculated that in March 2020, there were between 10,000 and 18,500 deaths, more than in the period 2015–2019.

Gibertoni et al. (2021) extended the excess mortality study on Lombardy, Emilia Romagna, and Veneto’s municipalities, differentiating for several age groups. For each subgroup, a simple linear regression on the deaths in 2015–2019 was used, and the parameters were applied to estimate the expected number of deaths in 2020. They found that the excess of deaths presents several differences among the municipalities, also within regions, and that the differences between men and women are higher in under 75.

Looking at the whole Italian territory, Del Re and Meridiani (2020) compared the excess mortality trend (calculated by subtracting a model obtained from the average deaths in 2015–2019) with the COVID-19 deaths for eight principal Italian cities considering the February to April period. They found a high excess, particularly in Brescia, and concluded that all-cause mortality excess follows the COVID-19 mortality trend. Differences between all-cause deaths and COVID-19 deaths were found in gender and age. Michelozzi et al. (2020) compared the total excess mortality across the 31 biggest Italian municipalities up to April with the deaths directly due to COVID-19. They use the intuitive approach, considering the average of the 5 previous years. They found that half of the mortality excess was due to COVID-19 deaths. Like other scholars, they observed differences between the age groups and a less relevant impact of officially COVID-19 deaths on the older people’s mortality excess. Ciminelli and Garcia-Mandicó (2020) estimated the excess mortality for 4100 northern municipalities from February 21 to March 31 by running a difference-in-differences regression model using data on mortality from 2016 as a control to mortality in 2020. They found that mortality excess was particularly high in Codogno and Alzano Lombardo (in the provinces of Lodi and Bergamo, respectively) and at the borders between Lombardy and other regions. Biggeri et al. (2020) identified spatial clusters of excess mortality for all Italian municipalities for the period January to April 2020. The mortality excess was estimated by applying a Bayesian model and considering as a baseline the average deaths 2015–2019. They found that the excess deaths are particularly diffuse in Lombardy, Piedmont, and Emilia Romagna regions. Blangiardo et al. (2020) provided a measure of the weekly excess mortality for all the Italian municipalities over the period January to April 2020 by predicting the expected mortality with a Poisson distribution and specifying a Bayesian hierarchical model on the log mortality relative risk. They found a relevant excess death in the northern regions and marked geographical differences also inside regions and between close locations.

3 Data and methodology

3.1 Data

On December 3, 2020, Istat released data on the daily number of all-cause deaths from January 1 to September 30 on all 7,904 Italian municipalities.Footnote 10 In addition, Istat released data on the daily number of all-cause deaths for all the Italian municipalities for the years 2015–2019. We used such historical data and other variables to estimate excess death during the coronavirus outbreak in Italy. Our dependent variable is the cumulative number of deaths (per 10,000 inhabitants) calculated for three different periods: February 21 to March 31, February 21 to June 30, and February 21 to September 30.

We feed the ML models with 16 selected covariates covering aspects strongly related to deaths such as demographic, health system, economic, and contamination (air pollution) variables. This set of variables allows us to estimate the mortality trend for 2020 in the counterfactual situation, i.e., without the outbreak of coronavirus, in a more accurate way.

As COVID-19 is thought to be more lethal among men and the elderly (Dowd et al. 2020; Dudel et al. 2020), we control for the age structure, i.e., the share of men in the population, the share of those aged 65+ (overall as well as only men), and the share of those aged 80+ (overall as well as only men). Further, we control for the resident population, the overall number of deaths in the previous year, and the overall number of deaths in the period from January 1 to February 20, 2020, i.e., the 7 weeks before the coronavirus outbreak in Italy.

We also control for the number of employees as this is likely related to the heterogeneous spread of the contagion (see Ascani et al. 2020), for the share of employment in manufacturing, and for PM-10 as a measure of air quality. The latter two variables take into account that the most vulnerable people are those affected by respiratory diseases, conditions associated with high mortality in COVID-19 infection, which are more widespread in industrialized areas.Footnote 11 For similar reasons, we also control for population density and the degree of urbanization of the municipality.

As for healthcare characteristics, we control for a dummy variable equal to 1 if there is a hospital in the municipality and another dummy variable equal to 1 if there is a hospital in at least one of the neighboring municipalities. Lastly, as the lockdown imposed after the coronavirus outbreak surely decreased the number of deaths due to road accidents, we control for the number of deaths due to road accidents in the previous year. This way, we compare municipalities with similar mortality rates due to road accidents. Table 1 reports the yearly average values of the dependent and the explanatory variables for the period 2015–2020.

Table 1 Descriptive statistics

3.2 Methodology

We use two different sets of approaches to estimate excess mortality: intuitive and ML. Estimating excess mortality requires a comparison of the observed mortality data with counterfactual mortality figures, i.e., with an estimate of the local mortality that would have been observed had the pandemic never hit Italy. This is not an easy task because, although heterogeneously, the entire country was affected by COVID-19, so that there are no “untreated” units but only “treated” ones. This implies that we cannot construct the counterfactual scenario by looking at the number of all-cause deaths in 2020 for non-affected municipalities.Footnote 12 As a remedy for this peculiar situation, we exploit the statistical power of the time series data on all-cause deaths for the period 2015–2019 to build counterfactual mortality estimates and, in turn, derive the impact of coronavirus on local mortality.

3.2.1 The intuitive approach

The intuitive approach is straightforward: it consists of comparing the actual trend of cumulative deaths in 2020 with the trends of cumulative deaths observed in the past for the same municipality (in the previous year or for the average of previous years). In this case, we consider the average of the cumulative number of annual deaths in the period 2015–2019. This approach is easy to interpret, but it does not allow to take into account unobserved factors such as flu epidemics or climatic conditions, which can vary over time and could be highly sensitive to small changes in the input data — especially when applied at a high granularity level. The intuitive approach has recently been used in several official reports, as mentioned in Section 2.Footnote 13

3.2.2 The ML approach

Our aim is to use ML techniques to generate a “counterfactual” scenario with predictions of what mortality figures would have been under “ordinary” conditions, i.e., if the “treatment” represented by COVID-19 would not have happened.

ML primarily deals with prediction. The recent explosion of ML in the economics literature is indeed due to the exploration of the so-called prediction policy problems (Kleinberg et al. 2015). However, recent developments have begun to exploit the predictive power of ML to also tackle causal inference research questions. In particular, Varian (2014, 2016) was the first to note that, since counterfactual building is essentially a predictive task, ML tools are, in principle, fit for the job. In a panel data setting, one could exploit the time series of pre-treatment observations to generate an artificial control group and a counterfactual trend in the no-treatment scenario. Then, by comparing the observed outcome of the treated units with their potential outcome of the counterfactual scenario, one could readily retrieve treatment effects as the difference between the two.

Following the intuition in Varian (2014, 2016), early applications have started to test the potential of this machine learning control method (MLCM) in addressing the so-called fundamental problem of causal inference, i.e., the fact that one cannot observe the potential outcome of the treated unit under the no-treatment scenario. In particular, we are aware of only seven papers employing this novel approach: Abrell et al. (2019), Benatia (2020), Benatia and de Villemeur (2019), Bijnens et al. (2019), Burlig et al. (2020), Cicala (2017), and Souza (2019). All these studies are confined to the field of energy economics and deal either with electricity markets or with the evaluation of energy efficiency policies with the exception of Bijnens et al. (2019) who look at the employment impact of the suspension of wage indexation in Belgium. Burlig et al. (2020) and Cicala (2017) have both treated and untreated units in their samples and make use of the untreated group in their counterfactual building, while Souza (2019) exploits staggered adoption to predict counterfactuals thanks to the temporal overlap of treated and not-yet-treated units. Conversely, Abrell et al. (2019), Benatia (2020), Benatia and de Villemeur (2019), and Bijnens et al. (2019) do not have an original control group but only treated units in settings with simultaneous treatment.

Importantly, Benatia (2020) is the only one to look at causal effects related to the COVID-19 crisis: it applies a neural network model to study the impact of containment measures on electricity demand. The COVID-19 pandemic is a peculiar case that lends itself well to the implementation of the ML counterfactual building method: given the worldwide spread of the contagion, it is hard to find plausible control groups as, for example, a combination of untreated units which could be used in a synthetic control approach. This makes it the ideal setting to harness the power of ML in generating an artificial counterfactual scenario.

To the best of our knowledge, this is among the first papers employing the MLCM in a setting with no control group, first proposed by Varian (2016), to evaluate the impact of a global exogenous shock affecting all the available units. Our ML predictive exercise is also closely related to the methodological strand of econometric literature, which deals with program evaluation and counterfactual building in a panel data setting (see, among others, Abadie and Cattaneo (2018) and the references therein). With respect to this literature, however, the distinctive feature of the MLCM lies in the exclusive use of pre-treatment information to generate the counterfactual scenario.

In the flourishing literature on the use of ML for policy purposes, a key concern regards two trade-offs: the one between bias and variance and the one between accuracy and interpretability. While the first is a general issue of ML techniques, the latter is distinctive of fields in which ML is used in the service of public policies that also require taking into consideration communication and accountability aspects. More complex “black box” methods tend to be both more accurate but less, if at all, interpretable. So, the choice of the appropriate technique can often fall on simpler algorithms to get more intuitive, and hence explainable, outputs at the expense of a loss in predictive accuracy. In our case, however, we are not interested in producing a transparent predictive model that clearly explains how the algorithm relates the features to the output. We just want to produce the most accurate estimates possible. This different framework is an advantage for our purpose because the trade-off between accuracy and interpretability does not impose a constraint on the selection of the techniques to employ. Thus, we opt for a mix of methods: a simpler algorithm and two black box techniques and show that, for this particular task, the simpler method performs better than the most complex ones, at least at the aggregate level.

Specifically, we adopt the following ML algorithms, LASSO, and two methods based on regression trees, random forest and stochastic gradient boosting. These algorithms are characterized by growing degrees of complexity and flexibility.Footnote 14 LASSO is a relatively simple technique that assumes an underlying linear relationship between the outcome and the predictors. In LASSO, the model is penalized for the sum of the absolute values of the weights. The implication of this regularization is that, depending on the value of the hyperparameter λ, LASSO forces the coefficients of uncorrelated or weakly correlated predictors exactly to zero, thus performing variable selection. This makes LASSO less flexible but more interpretable than standard OLS, as it produces a sparse model in which the outcome is related only to a smaller subset of the predictors. By contrast, random forest and stochastic gradient boosting are fully non-linear methods, based on the aggregation of many decision trees.Footnote 15 Random forest builds several different decision trees based on bootstrapped training samples and uses at each split of the trees only a random subset of the predictors as split candidates, thus decorrelating the trees from one another. The key difference with stochastic gradient boosting is that while random forest grows trees in parallel, stochastic gradient boosting grows them sequentially. Similarly to the random forest, stochastic gradient boosting is based on the aggregation and growth of many decision trees. But unlike random forest, stochastic gradient boosting does not involve bootstrap sampling, as each tree is based on the “residual” of previously grown trees, i.e., each tree is fitted on a modified version of the original dataset (Hastie et al. 2009). To make the results more comparable across the three ML methods and capture potential interactions between variables in all the selected ML models, we also include, for LASSO, all the pairwise interactions between the predictors as additional features, for a total of 256 covariates employed for this algorithm.Footnote 16

In the ML literature, the typical routine is to randomly divide the sample in a training set, in which the model is built and tuned, and a testing set, in which its predictive power is tested through an evaluation of its out-of-sample predictive accuracy. In order to solve the other trade-off we mentioned above, the bias-variance trade-off, cross-validation on the training sample can be employed to select the best-performing values of key tuning parameters that regulate the complexity or flexibility of the algorithms and reduce the risk of overfitting.

Bearing the above in mind, for each of the three different periods (February 21 to March 31, February 21 to June 30, and February 21 to September 30), we implement the following process: (i) we split the 2015–2018 pooled dataset into a training sample, made up of 80% of the municipalities, and a testing sample, which consists of the remaining 20%; (ii) we train and tune all the algorithms on the training sample, on which we perform a tenfold cross-validation to select the best-performing tuning hyperparameters of each algorithm;Footnote 17 (iii) we test how well the algorithms perform in predicting observed mortality on unseen data, i.e., on the 2015–2018 testing sample; (iv) we test model performance on the entire 2019 sample and show that algorithm performance is stable over time and that all the ML methods perform, on average, better than the commonly adopted intuitive method; (v) we repeat this routine on the pooled 2015–2019 data, to train the models on as much data as possible so as to maximize the accuracy gains; (vi) we use the models built on the 2015–2019 dataset to predict, for the 2020 sample, estimates of local mortality in a “no-COVID” scenario; and (vii) we derive municipality-level excess deaths for all the municipalities by subtracting the ML counterfactual estimates from the observed 2020 mortality data released by Istat. Municipality-level excess death estimates are relevant, especially because they can be considered as one of the best possible proxies to gauge the local magnitude of the COVID-19 impact. The underlying assumptions of our methodology are exactly the same of those underlying the intuitive method, namely, that (i) we are in a relatively static environment, at least on average (Burlig et al. 2020), i.e., mortality trends are generally not very volatile over time, with the exception of extraordinary events such as a war or a pandemic, which seems reasonable in our context; (ii) the difference between the observed and counterfactual mortality outcomes is due to the total impact of the exogenous “shock” represented by the COVID-19 pandemic.Footnote 18 Below we will provide descriptive and factual evidence to demonstrate the credibility of both assumptions.

Although our dataset is a panel of Italian municipalities, in our benchmark models, we consider the data as pooled, following Andini et al. (2018). In fact, we treat each municipality–year pair as if it was a single observation. We do not deem this as a problematic aspect as we focus on a relatively short time period, whereas such a choice could entail issues if some of the variables we use (either the outcome or the predictors) were to exhibit drastic changes over time within the same municipality, and this is unlikely to happen in a short time span. In any case, in the results section, we also report the predictive performance of an ML model, which explicitly incorporates the longitudinal component of the sample via the inclusion of municipality and year dummies and show that there are no predictive gains with respect to our benchmark estimates. The descriptive statistics reported in Table 1 show that the year-by-year variation in mortality data and key predictors is rather low, reinforcing the validity of assumption (i) in our context. Finally, an in-time placebo test performed on the year 2019, discussed in Section 4.2, will serve to underpin the plausibility of both assumptions.

Importantly, we apply our random splitting on municipalities, not on municipality–year pairs. Going for the latter would make the same municipality appear in the training set in 1 year (e.g., 2016) but in the testing set in another year (say, 2018). To the extent that the predictors do not change or vary slowly over time, this would produce downward-biased estimates of the mean squared error (MSE), because the testing data would not truly be “unseen” data, but data very similar to their counterparts for the corresponding municipality in the years that appear in the training sample. By splitting on municipalities, instead, we are sure that the training and testing samples do not share any municipality in common, and the same municipality can only appear either in the training or in the testing set for the entire time span.

Finally, for the sake of comparison of ML performance with that of a simpler, linear, and widely adopted method, we also report the 2019 predictive performance of a standard OLS model run on the full 2015–2018 sample.

4 Estimates

4.1 Predictive power of all methods used

We begin the empirical analysis by examining the forecasting performance of all methods used. Given the presence of approaches different from ML, we have opted for evaluating the performance of the various approaches in the estimation of the number of deaths per 10,000 inhabitants in an “ordinary” year. We select 2019 as the “ordinary” year and use data from 2015 to 2018 to test the predictive power of all methods. For forecast evaluation, we employ the mean squared error (MSE), i.e., the mean of the squared value of the prediction errors, as the main model selection approach. However, as MSE is very sensitive to outliers, we also report the mean absolute error (MAE), i.e., the mean of the absolute value of the errors. These two measures are our metrics to conduct a comparative analysis of predictive performances. Table 2 reports the results for all the sets of methods at the dates of March 31, June 30, and September 30.

Table 2 A comparison of predictive accuracy across the different methods

The crucial insight from Panel A of Table 2 is that all ML techniques and OLS always perform better than the intuitive approach. On average, MSE is reduced by up to 17.8% (February 21 to March 31), 17.4% (February 21 to June 30), 16.8% (February 21 to September 30), compared to using the mainstream approach adopted by several Italian and international institutions. This is the key result of the paper. The pattern is qualitatively consistent when considering MAE. A closer look at individual performances reveals that LASSO is the best-performing ML algorithm, both in terms of MSE and MAE. Random forest and boosting fare worse, but not by much, and still lead to a substantial improvement in precision over the intuitive approaches. When considering MSE, LASSO is followed by boosting, but when looking at MAE, random forest outperforms boosting. These rankings remain consistent for all three periods under scrutiny.Footnote 19 While the superior performance of a linear method like LASSO is somewhat unusual compared to other works in the ML literature, it can be explained by the nature of the predictive problem and the limited number of observations and predictors. In such circumstances, it can be expected that simpler algorithms can match the performance of more complex methods (de Blasio et al. 2020).Footnote 20 Indeed, this is why the OLS performance is comparable to that of the ML routines. Note that, in a few instances, OLS is even more accurate than some of the ML algorithms; however, it is never the best-performing technique, and there is always an ML technique that is superior. Yet, it fares always better than the intuitive approaches, so one may be willing to trade-off a lower MSE/MAE for a more straightforward methodology that does not require the complexity of ML techniques.

Panel B of Table 2 reports the performance of all estimators by population size. First and foremost, the magnitude of the prediction error is inversely proportional to the municipality size. While such sharp inter-class heterogeneity may seem striking, this is actually not surprising when taking into account that the dependent variable is defined as the number of deaths per 10,000 inhabitants and that the variability of growth rates of any variable in small municipalities is substantially higher. Second, concerning model performance, there is also some heterogeneity depending on population size. For example, LASSO and OLS tend to perform worse for large municipalities. The intuitive method, instead, works fine for big cities, where it becomes competitive with ML performance but is imprecise and too coarse for small- and medium-sized municipalities. This confirms our research hypothesis that this method, being based on unconditional averaging, can be very sensitive to outliers and small changes in the input data, features that matter most when looking at smaller municipalities than larger ones, where mortality figures tend to be much more stable. Indeed, municipalities above the 50,000 population cutoff, above which the intuitive method becomes competitive with ML, account for only 1.85% (146/7,904) of the total number of Italian municipalities. For more than 98% of our sample, all the ML methods perform far and above the official approach. As in Panel A of Table 2, OLS is never the most accurate technique. Besides, it tends to become less and less competitive as municipality size gets bigger and as the time span increases, with an especially pronounced drop in accuracy for municipalities above 50,000 inhabitants. This is a serious limitation of the OLS algorithm with respect to the ML algorithms, which prove instead to be comparatively more stable across all population cutoffs and time periods under scrutiny.

In the following section, we show the excess death mortality estimates for 2020 as computed by the best-performing algorithm for each of the four population cutoffs (see Panel B of Table 2).

4.2 Deriving excess death figures during the first wave of the COVID-19 pandemic

The excess mortality estimates from all-cause deaths are not uniform throughout Italy, as can be seen in Fig. 1. Significant differences emerge across and within geographical areas. The excess mortality estimates obtained via ML techniques for the period from February 21 to September 30 are particularly high in various northern municipalities. In particular, at the end of March (Fig. 1A), 393 municipalities, mostly concentrated in Lombardy, record an estimated excess of deaths over 300%. Further, 1,251 municipalities (28.57% of northern municipalities) record a percentage of excess mortality higher than 100%. They are mostly located in the Lombardy region and the western part of the Emilia Romagna region. Many of these territories are located in the provinces that record the highest number of infections up to that date, namely, the provinces of Milan, Bergamo, Brescia, Cremona, and Lodi (Lombardy), and the provinces of Piacenza, Parma, and Reggio Emilia (Emilia Romagna). This is in line with the results of Ceriani and Verme (2020), Modi et al. (2020), and Biggeri et al. (2020), which analyzed the period February to April. Several municipalities in the provinces of Cuneo, Alessandria (Piedmont), and Imperia (Liguria) constituted clusters with excess mortality higher than 100%, and similar clusters were in the provinces of Trento (Trentino-Alto Adige) and Pesaro-Urbino (Marche), consistently with the findings of Blangiardo et al. (2020). On June 30 (Fig. 1B), only very few municipalities in Bergamo and Brescia provinces recorded an estimated excess higher than 300%. Lombardy region continued to record the highest number of municipalities with an estimated excess of deaths over 100%, while in the provinces of Piacenza, Parma, and Reggio Emilia (Emilia Romagna), the majority of municipalities recorded an excess between 50 and 100% compared to the counterfactual scenario. A decrease in excess of deaths below 100% occurs for several municipalities in Cuneo and Imperia provinces. On September 30 (Fig. 1C), the number of municipalities with an excess higher than 100% in the Lombardy region is sensibly lower (212), but many municipalities (574) still record an excess of deaths above 50%. Clusters of municipalities with estimated excess mortality above 100% were observed over time in other northern regions affected by the spread of the virus, such as Piedmont and Liguria. The decrease observed in the previous period in Cuneo and Imperia provinces continues in the subsequent period, and on September 30, the municipalities with a mortality excess between 50% and 100% are considerably less in comparison to the end of June. In the other geographical areas, the estimated mortality excess was much lower for the whole period. From February 21 to March 31 (Fig. 1A), the majority of municipalities recorded estimated excess mortality under 50%, with very few municipalities with excess mortality over 100%. On June 30 (Fig. 1B), the number of municipalities with an excess of deaths between 20% and 100% was much lower than the previous snapshot, and in the central and southern Italy, municipalities with higher excess mortality deaths seemed to be less widespread compared to the snapshots on March 31. The substantially lower excess death figures in central and southern areas are also confirmed by the results of Blangiardo et al. (2020) and Scortichini et al. (2020) who focus on similar time spans. By September 30 (Fig. 1C), there are no visible excess mortality clusters in these areas. Fig. 2, Table 3, and Table 4 in the Appendix clearly show that the excess death estimates in the center-south of Italy can be considered in line with what one would expect to happen in an “ordinary” year (2019 in our case), while the observed trend in the north of Italy is extraordinarily abnormal.Footnote 21 The consistency of the estimates with the geographical evolution of COVID-19 cases confirms that the estimated mortality excess from all-cause deaths is connected, both directly and indirectly, to the COVID-19 pandemic. A note of caution pertains to small municipalities where even sporadic deaths could determine large variations in percentage terms, especially when considering a short time interval such as February 21 to March 31.Footnote 22

Fig. 1
figure 1figure 1figure 1

Percentage of municipal excess deaths detected from February 21, 2020, to September 30, 2020, with respect to the counterfactual scenario estimated via ML techniques. A From February 21, 2020, to March 31, 2020 (note: excess mortality estimates for the north of Italy, 23,603; official number of COVID-19 deaths, 11,011. Gap between these estimates on March 31, 12,592). B From February 21, 2020, to June 30, 2020 (note: excess mortality estimates for the north of Italy, 40,001; official number of COVID-19 deaths, 29,752. Gap between these estimates on June 30, 10,249). C From February 21, 2020, to September 30, 2020 (note: excess mortality estimates for the north of Italy, 39,362; official number of COVID-19 deaths, 30,580. Gap between these estimates on September 30, 8,782)

Our study allows drawing preliminary conclusions about the effects of the pandemic on the spatio-temporal evolution of the local excess deaths during the first wave. The estimated excess mortality for some municipalities between February 21 and June 30 is lower than the values estimated for the period February 21 to March 31. As pointed out by Istat and ISS (2020b), the worst-affected provinces in the first phase of the crisis start to experience a decrease of deaths in the second half of April. This decrease may be connected with demographic changes, behavioral factors, pandemic control measures, and also with the improved diagnostic capacity and a minor pressure on the national health system (Istat and ISS, 2020a). However, by June 30, the estimated excess mortality remains high in many municipalities in northern Italy and abnormal clusters of excess mortality persist in well-defined areas. The decrease in the estimates of excess mortality emerges even more when looking at the whole period from February 21 to September 30. Although small clusters of municipalities with high excess deaths persist, above all in the Lombardy region, the local mortality excess is lower and substantially less widespread. The period June to September is considered a transition period in the spread of the COVID-19 infection, and according to the Istat and ISS (2020a), during the summer months, the mortality trend was in line with the average observed in the previous years. Therefore, the cumulative mortality excess observed in our map at the end of September in several northern municipalities is entirely due to the excess of the all-cause deaths during the first phase of the pandemic.

By aggregating the municipality excess mortality estimates in the north of Italy, we find an increase of 23,603 deaths compared to the counterfactual scenario for the period from February 21 to March 31, 2020. This estimate is in line with the figures obtained by INPS, Istat, and ISS (2020a) by employing the intuitive approach and suggests that the “official” number of deaths directly due to COVID-19 (11,011) might be severely underestimated. However, the gap between our excess mortality estimates (39,362) and the “official” number of deaths directly due to COVID-19 in Northern Italy (30,580) becomes slightly smaller by the end of September. This finding suggests that, despite delays in reporting the attribution of deaths to COVID-19 and irrespective of potential harvesting effects, the tracking capacity of Italian authorities improved over the course of the pandemic.

5 Conclusions

In this work, we propose a more sophisticated approach to produce local estimates of excess mortality during the first wave (February to September 2020) of the COVID-19 pandemic in Italy. Specifically, as COVID-19 is a major global exogenous “shock” affecting all municipalities, we adopt a novel MLCM to build a more plausible counterfactual scenario of mortality figures for 2020 in the absence of the pandemic. We show that the ability of ML methods to build the artificial control group in a “no-COVID-19” scenario outperforms the mainstream intuitive approach adopted by Italian institutions, with particularly sizable predictive gains at the more granular level of small- and medium-sized municipalities. Importantly, our methodological improvements do not depend on any additional identifying condition, i.e., our assumptions are exactly the same as those underpinning the intuitive method.

After showing that these methodologies improve performance by up to 17.8%, we build a municipality-level dataset of the first-wave excess death mortality figures by comparing counterfactual and observed mortality data in each municipality. This dataset, which is shared jointly with the paper, is intended to be available to the general public and to researchers interested in investigating local determinants and territorial factors that may have contributed to the rapid and heterogeneous spread of the first wave of the pandemic across Italy, as well as for evaluating policy responses at a local level and compare spatio-temporal differences with the evolution of the ongoing second wave. We hope our methodological contribution will lead to further refinements of the current approaches targeted at estimating mortality during the current pandemic and, in turn, to a broader understanding of the spread of the virus in Italy and the efficacy of the policies adopted to contain its impacts.

Our methodological framework could be extended to other countries and, possibly, to the entire European Union, to study the temporal and spatial evolution of the first as well as the ongoing second wave. Besides, we emphasize that there is room for more research as, in principle, the inclusion of a richer set of covariates, a comprehensive integration of spatial autocorrelation patterns (such as the inclusion of additional features capturing economic links across municipalities), or the adoption of other ML methods, may further increase the statistical accuracy of our approach. Nonetheless, our main conclusion remains that the intuitive approach can lead to imprecise excess mortality estimates, especially when applied at a high spatial granularity, and should be possibly replaced by more advanced data-driven techniques.