How I Learned to Stop Worrying and Built a Walmart Consumer Price Index (CPI) (Part I)
The motivation for this story can be dark, so let’s keep it short. I created a Python-Selenium crawler to assess whether data from Walmart could help build a CPI for countries where authoritarian regimes may be misrepresenting inflation statistics.
The case study for this exercise is Nicaragua, but the code should also work for Mexico, Guatemala, Honduras, El Salvador, and Costa Rica (fork & loop ;). In particular, I used a Selenium Chrome driver to crawl through Walmart’s grocery store items and download their name, price, weight, and discount information. A couple of post-extraction variables are the country and a pandas datetime stamp for building the time series later on.
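As a rough sketch of that post-extraction step (the column names and sample rows here are my own illustration, not taken from the repo):

```python
import pandas as pd

# Illustrative rows; the real ones come out of the Selenium crawler.
items = [
    {"name": "Arroz 80/20", "price": 32.50, "weight": "4 lb", "discount": 0.0},
    {"name": "Frijol rojo", "price": 28.00, "weight": "2 lb", "discount": 0.1},
]

df = pd.DataFrame(items)
# Post-extraction variables: the country and a pandas datetime stamp,
# so repeated downloads can later be stacked into a time series.
df["country"] = "NI"
df["timestamp"] = pd.Timestamp.now().normalize()
```

Normalizing the timestamp to midnight makes rows from the same download day group cleanly when the series is assembled.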
The repository for the code is here, and the video below shows the scraping snippet at work. The crawler’s core loop continuously scrolls down each Walmart page to trigger the asynchronous AJAX (lazy-loading) content at the bottom (the site uses numbered pages, but each page loads its items lazily). On average, Nicaragua’s Walmart site holds 50 pages of food and beverages (~1,500 single items).
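A minimal sketch of such a scroll loop, assuming the common pattern of comparing page heights between scrolls (the crawler in the repo may differ). The `FakeDriver` stand-in lets the snippet run without a browser; with Selenium you would pass a real Chrome driver instead:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Keep scrolling until the page height stops growing,
    i.e., all lazy-loaded items have been fetched."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the AJAX requests time to complete
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no new content appeared
            break
        last_height = new_height
    return last_height

# Stand-in for a Selenium driver so the sketch runs headlessly here:
# each scroll "loads" more content until the page stops growing.
class FakeDriver:
    def __init__(self):
        self.height = 1000
        self.scrolls = 0
    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            self.scrolls += 1
            if self.scrolls < 3:  # pretend two rounds of lazy-loading
                self.height += 500
            return None
        return self.height

final_height = scroll_to_bottom(FakeDriver(), pause=0.0)
```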
A little text wrangling, and we get sorted lists of grocery items by popularity, price, and weight. A simple top 15 for each category is illustrated in the following set of graphs.
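The weight-sorting step needs the weight pulled out of each item’s name string first. A hypothetical helper for that wrangling might look like this (the function name, regex, and sample strings are my assumptions):

```python
import re

def weight_in_grams(text):
    """Pull a weight out of a scraped item string like
    'Azucar 500 g' or 'Aceite 1.5 kg' and convert it to grams."""
    match = re.search(r"([\d.]+)\s*(kg|g|lb)", text.lower())
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    factor = {"g": 1, "kg": 1000, "lb": 453.6}
    return value * factor[unit]

items = ["Azucar 500 g", "Aceite 1.5 kg", "Harina 1 lb"]
by_weight = sorted(items, key=weight_in_grams, reverse=True)
```

The same parsed field can then feed the price-per-gram comparisons behind the rankings.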
More interesting than the rankings is the ability to extract consumer price data from Walmart sites across several countries. Given the prevalence and popularity of the retail giant in these countries, it may serve as a useful, albeit imperfect, gauge of consumer wellbeing.
In the future, I plan to populate the series with bimonthly downloads of new items and prices. This time series of foods and beverages will be the basis for a Consumer Price Index for Nicaragua. I will obtain each product’s weight in the index from Nicaragua’s CPI methodology; if not to make the series fully comparable, at least to follow the same end-result coefficients, taken from the income-expenditure household surveys that helped build it. We could also use measures of text similarity as a first-tier filter to identify which of the hundreds of items downloaded each time belong in the index. In Part II, hopefully, we will have a long enough series to do some time-series decomposition analysis in Python!
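The index calculation that paragraph describes could be sketched as a simple Laspeyres-style weighted average of price relatives. Every number below is invented for illustration; the real expenditure weights would come from Nicaragua’s CPI methodology:

```python
# Toy Laspeyres-style index: each product's price relative
# (current price / base price) weighted by its expenditure share.
base_prices = {"rice": 30.0, "beans": 25.0, "oil": 60.0}
new_prices  = {"rice": 33.0, "beans": 25.0, "oil": 66.0}
weights     = {"rice": 0.5,  "beans": 0.3,  "oil": 0.2}  # shares sum to 1

index = 100 * sum(
    weights[k] * new_prices[k] / base_prices[k] for k in weights
)
# Rice and oil both rose 10%; beans held steady, so the index lands near 107.
```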
Hop on and let me know what you think!