Benford’s law and reported income in Latin American household surveys

This is an experiment that contrasts labor income as reported in household surveys, for a rich set of Latin American countries, with Benford’s law. This ‘first-digit law’ states that in many collections of numbers the leading digit is not uniformly distributed: small digits (‘1’, ‘2’, ‘3’) account for most of the distribution. Benford’s law has been applied to a variety of data, from physical constants and natural observational data to street addresses and social media statistics. Probably its most famous use case is in the United States and Europe, where it serves to detect tax fraud (a suggestion made by Hal Varian in a 1972 issue of The American Statistician). The intuition is that fabricated figures do not follow the Benford distribution, while real data do. Other uses include, inter alia, detecting fraud in scientific research and spotting non-human social media accounts.
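
For reference, the benchmark can be written down explicitly. This is the standard textbook statement of the law rather than anything specific to these surveys; here d denotes the first significant digit:

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \dots, 9
```

Under this formula a leading ‘1’ appears about 30.1% of the time, a ‘2’ about 17.6%, and a ‘3’ about 12.5%, so the first three digits together cover roughly 60% of cases, while a leading ‘9’ appears only about 4.6% of the time.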

Applications of Benford’s law to assess the trustworthiness of survey data include Kayser (2019) and Judge & Schechter (2009). In a nutshell, the former study finds that individually reported income in a set of harmonized longitudinal surveys on aging does not follow Benford’s law strictly, and the latter suggests that fabrication of data by enumerators is common in developing countries, especially in longer (more tiresome) surveys.

I test Benford’s law on monthly reported main-occupation labor income in the official, publicly available household surveys of the following countries: Argentina’s 2018 EPHC, Brazil’s 2018 PNADC, Chile’s 2017 CASEN, Costa Rica’s 2018 ENAHO, Ecuador’s 2018 ENEMDU, Mexico’s 2018 ENIGH, and Uruguay’s 2018 ECH. The following table summarizes each survey’s sample of employed individuals with non-missing reported income greater than 0. No treatment for outliers was performed, and labor income is reported for each country in nominal local-currency units.

So, can we trust reported income data from household surveys?

Figure 1 illustrates the distribution of first significant digits for main-occupation labor income in Latin American countries, without using survey weights. As a benchmark, the black line represents Benford’s law distribution of first significant digits. The underlying data was created using the following code in Stata:

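use "survey_data.dta", clear
foreach name in ARG BRA CHL CRI ECU MEX URY {
    preserve
    * Keep one country, and only positive, non-missing labor income
    keep if country_name == "`name'"
    keep if !missing(labor_income) & labor_income > 0
    * First significant digit: first character of the number written as a string
    g benfords_individual_income = real(substr(string(labor_income), 1, 1))
    dis in red "`name'"
    * Unweighted first-digit frequencies: all workers, employees, self-employed
    tab benfords_individual_income
    tab benfords_individual_income if employee==1
    tab benfords_individual_income if self_employed==1
    restore
}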

Figure 1 mainly serves to show that these datasets do not adhere strictly to Benford’s law. This could mean that these workers’ reported income is not fully reliable (in comparison, for example, to administrative records). However, while this could be true, Benford’s law does not shed light on the underlying causes of misreporting. For example, it could be, as Judge & Schechter (2009) point out, that longer surveys raise the probability of enumerators fabricating data, particularly if they are not well paid. It could also come from workers under-reporting (or over-reporting) their income. Or it could be a simple problem of recall. But on that point we can only speculate.
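
One way to put a number on how far a set of first digits sits from the benchmark is a Pearson chi-square statistic against the Benford shares. The Stata block below is a minimal sketch of that idea, not the code behind the figures. It runs on simulated, hypothetical incomes (log-uniform between 1,000 and 1,000,000, which follow Benford’s law closely by construction) so that it can be run on its own. The seed, the sample size, and the variable names observed, total_obs, expected, chi2_term and chi2 are illustrative assumptions. It also ignores survey weights and survey design, so the p-value is only indicative.

```stata
* Illustrative sketch on simulated data, not the survey microdata
clear
set seed 12345
set obs 10000

* Hypothetical incomes, log-uniform between 1,000 and 1,000,000
generate labor_income = 10^(3 + 3*runiform())

* Same first-digit extraction as in the loop above
generate benfords_individual_income = real(substr(string(labor_income), 1, 1))

* Collapse to one row per digit with its observed count
* (a digit that never occurs gets no row; not an issue with this sample)
contract benfords_individual_income, freq(observed)
egen total_obs = total(observed)

* Expected count under Benford's law: N * log10(1 + 1/d)
generate expected = total_obs * log10(1 + 1/benfords_individual_income)

* Pearson chi-square statistic; 9 digits - 1 = 8 degrees of freedom
generate chi2_term = (observed - expected)^2 / expected
egen chi2 = total(chi2_term)

list benfords_individual_income observed expected, noobs
display "chi2(8) = " %9.2f chi2[1] "   p-value = " %6.4f chi2tail(8, chi2[1])
```

To try this on a real survey, one would replace the simulated block with the country-specific sample kept inside the loop. With samples as large as these, even small departures from the benchmark are likely to be flagged as statistically significant, so the size of the gaps between observed and expected counts is probably more informative than the p-value alone.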

Perhaps what I found most interesting is that self-employed individuals’ reported income actually follows Benford’s law more closely (Figure 2b). This could be explained as a recall issue, where the self-employed, mostly small-business owners, are more aware of their daily cash flows than employees are. Could it be that they have less fear of the ‘tax man’ because most of them benefit from simplified tax regimes for ‘small businesses’?

What do you think?
