Much ado about two: reconsidering retransformation and the two-part model in health econometrics

https://doi.org/10.1016/S0167-6296(98)00030-7

Abstract

In health economics applications involving outcomes (y) and covariates (x), it is often the case that the central inferential problems of interest involve E[y|x] and its associated partial effects or elasticities. Many such outcomes have two fundamental statistical properties: y≥0; and the outcome y=0 is observed with sufficient frequency that the zeros cannot be ignored econometrically. This paper (1) describes circumstances where the standard two-part model with homoskedastic retransformation will fail to provide consistent inferences about important policy parameters; and (2) demonstrates some alternative approaches that are likely to prove helpful in applications.

Introduction

Many outcomes (y) studied empirically in health economics have two fundamental statistical properties: (a) y≥0; and (b) the outcome y=0 is observed sufficiently frequently that the zeros cannot be ignored econometrically. Such data structures are observed in health applications as diverse as health care utilization/expenditure, the use of unhealthy commodities like tobacco and alcohol, and physicians' time allocations to alternative uses. Given exogenous covariates x, econometric applications in which such data structures are encountered have typically relied on one or more of the following (and, generally, competing) three well-known strategies.

The two-part model (2PM) assumes that Pr(y>0|x) is governed by a parametric binary probability model like logit or probit (part one), and that E[ln(y)|y>0,x] is a linear function of x, e.g., E[ln(y)|y>0,x]=xβ (part two). The sample selection model (SSM) assumes that there are two linear equations determining the observed outcome. The first equation is z=xξ1+ν, the second equation is w=xξ2+υ, where the error terms (ν, υ) are typically assumed to follow a bivariate normal distribution. In this model, the outcome ln(y)=w is observed only if z>0; regression methods like Heckman's approach (Heckman, 1979) then estimate a Mills-ratio-corrected linear regression of ln(y) on x using only the subsample of observations for which z>0. Tobit and related models assume that a latent outcome y* satisfies y*|x ~ N(xω,τ²) and that the observed y is given by y=max(0,y*).

While the choice among these or other competing estimation strategies is clearly a first-order analytical issue, this paper tackles a set of somewhat more subtle issues in estimation and inference encountered with applications of two-part models. To wit: While part two of the 2PM has been demonstrated in many empirical settings to be a useful estimator of the parameters β, how these estimates are used is an altogether separate matter. In addressing such concerns, this paper has two main purposes.

The first is to suggest that reliable/consistent estimates of β—while necessary elements of the 2PM framework for conducting inference about important policy parameters—will generally not be sufficient for such purposes. The second is to demonstrate some alternative approaches that are likely to prove useful in applications. In particular, since it will often be the case that many inferential problems of interest involve E[y|x] and its associated partial effects δ(x)=[δj(x)]=[∂E[y|x]/∂xj] and/or elasticities η(x)=[ηj (x)]=[∂ln(E[y|x])/∂ln(xj)], it is fundamentally important to recognize that the parameters β are but one feature of such expectations and related quantities.

That is, how one proceeds from inferences about properties of E[ln(y)|y>0, x] (where the elements of β are the key to inference) to inferences about properties of E[y|x] entails at least two separate considerations. The first is removing the conditioning on y>0; the second is transforming back from ln(y)-space to y-space. Both issues have been discussed extensively in the literature and both are involved in the following analysis, although the perspective here departs materially from that typically maintained in the literature.
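The two steps can be written out as a short worked derivation. This is a sketch under stated assumptions rather than the paper's own exposition: π(x) is read here as Pr(y>0|x), ε denotes ln(y)−xβ for y>0, x_j is taken to be continuous, and ρ(x) is a label introduced only for this illustration to denote the retransformation factor.

```latex
% Assumed notation: \pi(x) = \Pr(y>0 \mid x); \varepsilon = \ln(y) - x\beta for y>0.
% \rho(x) is a hypothetical label for the retransformation factor.
\begin{align*}
% Step 1: remove the conditioning on y>0 (the zeros contribute nothing to the mean).
E[y \mid x] &= \pi(x)\, E[y \mid y>0, x] \\
% Step 2: move from ln(y)-space back to y-space.
E[y \mid y>0, x] &= \exp(x\beta)\, \rho(x),
  \qquad \rho(x) \equiv E[\exp(\varepsilon) \mid y>0, x] \\
% Consequence for a continuous covariate x_j:
\frac{\partial \ln E[y \mid y>0, x]}{\partial x_j}
  &= \beta_j + \frac{\partial \ln \rho(x)}{\partial x_j}
\end{align*}
```

The last line shows why β alone need not deliver the quantities of interest: β_j equals the proportional effect on E[y|y>0,x] only when ρ(x) does not vary with x_j, as it would, for example, if ε were independent of x.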

Some prominent areas of potential applicability of the ideas discussed here are noted at this juncture.

Analysts working in the fields of outcomes research, disease management, etc., often utilize large samples of individual-level outcomes on various measures of health care utilization, expenditures, or outcomes (`claims data'). Such datasets typically contain information only on individuals for whom some positive amount of utilization or expenditure is observed over some specific time period. As such, a common objective in such research is to draw inferences about the determinants of E[y|y>0, x], perhaps augmenting such inferences with information about Pr(y>0|x) obtained from other data sources. As the main arguments of this paper will illustrate, using common methods like loglinear regression with retransformation must be approached with care if inferences about properties of E[y|y>0, x] drawn from analysis of claims datasets are to be reliable. The alternative approaches proposed here should be useful in many applications of interest to outcomes researchers.

It is common practice in empirical health economics to model individuals' demands for health care in a two-part context: whether, over some time period, the individual obtains or uses any care at all; and, if so, how much care (e.g., how many physician visits) is obtained. The two components of this process may differ in their economic determinants as well as their policy relevance (e.g., Pohlmeier and Ulrich, 1995). Consider the example of childhood immunizations (Mullahy, 1997b): An analyst may be concerned both about whether a child has obtained any immunizations by age two, as well as the extent to which the child is on-schedule for immunizations by that age. Another example is screening: Issues may arise regarding both whether an individual has ever been screened for a particular disease and, if so, the frequency with which such screening occurs.

In many econometric studies of substance use/abuse (tobacco, alcohol, illicit drugs), analysts have often examined phenomena that bifurcate naturally into two components: whether or not individuals consume the commodity, and how much of the commodity is consumed by users; or whether or not the use of the commodity influences labor market participation and, if so, whether or not the commodity's use affects hours worked or wages earned. Lost in much of this discourse, however, is a consideration of what these two sets of estimates imply overall for key parameters like E[y|x] and its associated partial effects. In some cases, parameters other than E[y|x] may be of interest (e.g., Manning et al., 1995), but in other applications the conditional mean is likely to be a prominent consideration (e.g., Mullahy, 1997a).

The following sections describe some fundamental properties of the 2PM model that—while overlooked often in applications—turn out to have critical implications for inference, and suggest some reformulations of the 2PM that provide for straightforward inference in the context of some nonlinear regression structures.

The plan for the paper is as follows. Section 2 presents the statistical preliminaries and describes in detail the two-part model. Section 3 discusses issues involved in inference based on the 2PM about properties of E[y|x]. Section 4 suggests alternatives to the 2PM, discusses issues involved in their estimation, and proposes a set of specification tests. Section 5 presents results of a simulation exercise. Section 6 reports an empirical study of doctor visits based on the 1992 National Health Interview Survey. Section 7 offers conclusions.

Fundamental statistical issues

It is assumed that the analyst observes a random sample of N observations on (yi, xi), where xi=(1,xi1) is a k-vector of covariates. There are assumed to be N+ observations for which yi>0 and N0 observations for which yi=0, with N = N+ + N0. The index sets for observations i corresponding to these samples are denoted S+={i|yi>0} and S0={i|yi=0}. Unless necessary for clarity, the `i' subscripts will be suppressed.

With y≥0, it is meaningful

Identification and estimation

Should the data be up to the task of identifying the parameters α and β, then the 2PM is, in one sense, identified. Yet, in the absence of further assumptions (e.g., lognormality of f(y|y>0,x)), it is important to note that the standard specification of the 2PM does not generally permit one to recover E[y|y>0,x] and, therefore, E[y|x] since identification of E[ln(y)|y>0,x] (as given in Eq. (4)) is not sufficient to identify E[y|y>0,x]. As such, the 2PM thus formulated does not have an

Main ideas

The central idea of this paper is the following. A model that captures the basic essence of, but is in general not identical to, part two of the homoskedastic 2PM replaces Eq. (4) with the assumption that

    E[y|y>0,x] = exp(xβM) = μM(x)

so that

    y = exp(xβM)exp(εM), y>0

where E[exp(εM)|y>0, x]=1
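The contrast between log-linear regression with homoskedastic retransformation and a direct exponential specification of E[y|y>0,x] can be illustrated with a small simulation. The following is a minimal sketch, not the paper's simulation design or its estimators: it assumes a single binary covariate, lognormal errors whose variance differs by group, and a pooled smearing factor in the spirit of Duan (1983). All function names and parameter values are hypothetical. Because the covariate is binary, both fits have closed forms (group means of ln(y) for OLS, logs of group means of y for the exponential-mean model); with continuous covariates an iterative estimator would be needed instead.

```python
import math
import random


def simulate_positive_outcomes(n_per_group=20000, beta=(1.0, 0.5),
                               sigma=(0.5, 1.5), seed=12345):
    """Draw y>0 with ln(y) = b0 + b1*x + eps, eps|x ~ N(0, sigma[x]^2), x in {0, 1}."""
    rng = random.Random(seed)
    return [(x, math.exp(beta[0] + beta[1] * x + rng.gauss(0.0, sigma[x])))
            for x in (0, 1) for _ in range(n_per_group)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def log_ols_with_smearing(data):
    """OLS of ln(y) on (1, x), retransformed with one pooled smearing factor."""
    m0 = mean(math.log(y) for x, y in data if x == 0)
    m1 = mean(math.log(y) for x, y in data if x == 1)
    b0, b1 = m0, m1 - m0  # exact OLS solution when x is binary
    # Homoskedastic retransformation: average exp(residual) over ALL observations.
    smear = mean(math.exp(math.log(y) - b0 - b1 * x) for x, y in data)
    return {x: math.exp(b0 + b1 * x) * smear for x in (0, 1)}


def exponential_mean_fit(data):
    """Fit E[y|y>0,x] = exp(g0 + g1*x) directly.

    The model is saturated in a binary x, so its fitted values are the
    group means of y and the coefficients are logs of those means.
    """
    m0 = mean(y for x, y in data if x == 0)
    m1 = mean(y for x, y in data if x == 1)
    g0, g1 = math.log(m0), math.log(m1) - math.log(m0)
    return {x: math.exp(g0 + g1 * x) for x in (0, 1)}


def true_means(beta=(1.0, 0.5), sigma=(0.5, 1.5)):
    """Population E[y|x] for a lognormal outcome: exp(mu + sigma^2 / 2)."""
    return {x: math.exp(beta[0] + beta[1] * x + sigma[x] ** 2 / 2) for x in (0, 1)}


data = simulate_positive_outcomes()
truth = true_means()
smeared = log_ols_with_smearing(data)
direct = exponential_mean_fit(data)
for x in (0, 1):
    print(f"x={x}: true={truth[x]:.2f} "
          f"smearing={smeared[x]:.2f} exp-mean={direct[x]:.2f}")
```

In this design the log-scale coefficients are estimated well, yet the pooled smearing factor overstates the mean for the low-variance group and understates it for the high-variance group, whereas the exponential-mean fit tracks both group means. The example is deliberately stylized; it shows the mechanism only and says nothing about magnitudes in real data.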

A simulation experiment

One clear implication of the preceding discussion is that while the 2PM may be a consistent estimator of the parameters β, its utility as concerns estimation of the partial effects ∂E[y|y>0,x]/∂x and, therefore, δ(x), may be limited. A brief simulation experiment underscores the importance of this distinction. (For purposes of this section, β will denote both β from 2PM as well as βM from M2PM.)

The design is intentionally one where neither the 2PM nor the M2PM-2 estimator will be a consistent

An empirical example of health care utilization

This section presents some empirical illustrations of the concepts and issues discussed in the previous sections. The various estimators are compared and contrasted in terms of their performance in a single sample, and the results of some specification tests are reported.

Summary and discussion

Both the algebraic and the empirical results presented here suggest that one should approach use of the standard (homoskedastic) 2PM with considerable caution in microeconometric applications where interest centers on E[y|x] and its associated partial effects. The basic identifying assumption for β in that model, namely E[ε|y>0,x]=0, is not sufficiently powerful to identify other parameters of interest—E[y|x], δ(x), etc.—even if π(x) is properly specified and identified. One may make the

Acknowledgements

This research has been supported by Grant AA10393 from the NIH Office of Research on Women's Health and NIAAA to NBER, by NIAAA Grant AA10392 to the University of Minnesota, and by a grant from the David and Lucile Packard Foundation to NBER. The initial stimulus for this paper was provided by some enlightening remarks by Will Manning on heteroskedastic retransformations (see Manning, 1998, for a formal exposition). Thanks are owed to Will Manning, Joao Santos Silva, Jon Skinner, Frank

References (31)

  • Davidson, R., MacKinnon, J.G., 1993. Estimation and Inference in Econometrics. Oxford Univ. Press, New...
  • Duan, N., 1983. Smearing estimate: a nonparametric retransformation method. J. Am. Stat. Assoc.
  • Duan, N., 1983. A comparison of alternative models for the demand for medical care. J. Business Econ. Stat.
  • Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman & Hall, New...
  • Eichner, M., et al., 1997. Health expenditure persistence and the feasibility of medical savings accounts. In: Poterba,...