David Huh

Statistical Analysis as Consultants Part I: Percentiles and Map

Blog Post created by David Huh on Dec 21, 2017

A client recently asked me if I could develop a report that analyzes the primary and secondary areas of their business. I told her that if she could elaborate on what "primary" and "secondary" markets mean, I would be glad to help. She and her analysis team went away to discuss it; after a few days the team still could not reach a consensus. Simply put, one person can define primary and secondary markets one way and another person another way.


I understood the client’s vision and the difficulty she faced in articulating the exact specifications of primary and secondary markets. But the lack of definitions gave me an opportunity to be creative. I also had to consider that clients typically prefer intuitive tools over sophisticated tools that they do not understand.


Imagine a task that involves two participants. The first participant is a regular person and the second is a car technician. On a table there are two sets of the following, one set for each person: 100 nails, a wooden plank, a regular hammer, and a sophisticated powered hammer from the shop. Each participant is given 5 minutes to drive as many nails as they can into the wooden plank.


Given that the regular person has no experience with the power tool while the seasoned car technician has ample experience, which tool does each participant use? Because of the time constraint, the regular person would probably go with the basic hammer rather than try to learn the sophisticated tool from the shop. Even though the powered hammer is the ideal tool for the task, an untrained person would choose the tool that he or she is more comfortable with, especially in an environment with resource constraints.


In this spirit, while developing the report, I decided to introduce percentiles as a simple statistical tool that captures the client’s intent.


The percentile is a familiar statistical measure that is often seen in, for example, ACT and SAT scores. The School of Public Health at Boston University defines a percentile as “a value in the distribution that holds a specified percentage of the population below it.” If a student receives a test score in the 95th percentile, her score is above 95 percent of the population, or within the top 5%. Simple enough.
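To make the definition concrete, here is a minimal sketch in Python that computes the percentage of a population falling below a given value. The function name `percentile_rank` and the scores are invented for illustration; they are not part of the linked workbook.

```python
def percentile_rank(value, population):
    """Return the percentage of the population strictly below `value`."""
    if not population:
        raise ValueError("population must not be empty")
    below = sum(1 for x in population if x < value)
    return 100.0 * below / len(population)


# Hypothetical test scores, invented for illustration: 1, 2, ..., 100.
scores = list(range(1, 101))

# A score of 96 has 95 of the 100 scores below it.
rank = percentile_rank(96, scores)
print(rank)  # 95.0
```

A student with this score sits in the 95th percentile, or within the top 5%.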


By integrating percentiles into a map, the combination can capture the areas with high business activity. An activity can be abstracted as any measurable volume. For the purpose of this article, the linked workbook applies the technique, combining percentiles and maps, to public data from the City of Chicago (Figure 1).


Fig. 1 - Food Inspection Density in Chicago and surrounding areas


The city’s food inspection data includes information on the status, types and locations of inspections conducted by city officials. The heatmap represents the volume of inspections. Percentiles are calculated over the distribution of inspections. Zip codes are used to aggregate the individual locations on the map. As a result, zip codes with more inspections are assigned darker shades of red, and areas with sparse volumes are colored in shades of green.
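The aggregation described above can be sketched roughly as follows: count inspections per zip code, then rank each zip code against the others. This is only an illustration of the idea, not the workbook's actual calculation. The record layout (a `zip` field), the placeholder zip codes `Z1` to `Z4`, and the counts are all assumptions made up for the example.

```python
from collections import Counter


def inspections_per_zip(records):
    """Count inspection records per zip code."""
    return Counter(r["zip"] for r in records)


def percentile_ranks(volume_by_zip):
    """Map each zip code to the percentage of zip codes with a lower volume."""
    volumes = list(volume_by_zip.values())
    return {
        z: 100.0 * sum(1 for other in volumes if other < v) / len(volumes)
        for z, v in volume_by_zip.items()
    }


# Invented inspection records: Z1 has 5 inspections, Z2 has 3, Z3 has 1, Z4 has 2.
records = [
    {"zip": z}
    for z, n in [("Z1", 5), ("Z2", 3), ("Z3", 1), ("Z4", 2)]
    for _ in range(n)
]

volume_by_zip = inspections_per_zip(records)
ranks = percentile_ranks(volume_by_zip)
print(ranks)  # {'Z1': 75.0, 'Z2': 50.0, 'Z3': 0.0, 'Z4': 25.0}
```

A color scale can then be driven by the rank: the higher the rank, the darker the red.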


Fig. 2 - Areas of top 1% inspections in Chicago

Fig. 3 - Area of top 1% inspections did not pass


Looking at Figure 2, by selecting the Top 1% (that is, the 99th percentile) of the number of inspections, we can see that zip codes 60614 and 60647 have the highest number of inspections compared to other areas. We can go a step further and select the ‘Failed’ value under the Result dropdown (Figure 3). From this, we see that 60614 has the most food safety violations in the metropolitan area and falls in the top 1% of the distribution.


Fig. 4 - Inspections in Evanston

Fig. 5 - Areas of top 1% inspections in Evanston


Now, instead of looking at the bigger market, let’s shift our focus to a suburb of Chicago and see how the technique responds to the change. In the City dropdown, I selected Evanston, as seen in Figure 4. As expected, the market in Evanston is much smaller, with only two zip codes, compared to the Chicago metropolitan area. In statistical terms, the population distribution shrank in size. Here, the population distribution is the distribution of inspections.


The change in the market and population distribution does not affect the core concept and functionality of percentiles. When the Top 1% is selected, as seen in Figure 5, only one of the two zip codes remains: 60201, with 7 inspections. Note that zip code 60614, which we saw earlier with all values checked under the City dropdown, no longer shows in the visualization because 60614 is not part of Evanston.


The technique retrieves the relevant information before calculating the percentile distribution. Even though the size of the market changed between the City of Chicago and suburban Evanston, the technique adapts to the change in the population distribution while retaining the concept of top and bottom markets. The top 1% of food code violations in a large metropolitan city like Chicago and the top 1% of violations in a smaller area like Evanston are both alarming. The top 1% of Chicago is of more concern because the market is much bigger. But regardless of the size of the region, the top 1% represents the primary area of concern.
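The order of operations matters here: filter to the selected market first, then compute the percentile over what remains. The sketch below illustrates that order under a few assumptions. The cutoff uses linear interpolation between sorted values, which may differ from how the workbook computes it, and the city names, zip codes, and counts are invented placeholders rather than the Chicago data.

```python
def percentile_value(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def top_areas_in_city(rows, city, p):
    """Filter rows to one city, then keep areas at or above its p-th percentile."""
    # Step 1: restrict the population to the selected market.
    volumes = {r["zip"]: r["inspections"] for r in rows if r["city"] == city}
    # Step 2: compute the cutoff over the filtered distribution only.
    threshold = percentile_value(list(volumes.values()), p)
    return {z: v for z, v in volumes.items() if v >= threshold}


# Invented rows: a large market with four zip codes and a small one with two.
rows = [
    {"city": "Metro", "zip": "M1", "inspections": 120},
    {"city": "Metro", "zip": "M2", "inspections": 95},
    {"city": "Metro", "zip": "M3", "inspections": 40},
    {"city": "Metro", "zip": "M4", "inspections": 10},
    {"city": "Suburb", "zip": "S1", "inspections": 7},
    {"city": "Suburb", "zip": "S2", "inspections": 3},
]

print(top_areas_in_city(rows, "Metro", 99))   # {'M1': 120}
print(top_areas_in_city(rows, "Suburb", 99))  # {'S1': 7}
```

Because the cutoff is recomputed for each market, a zip code with only 7 inspections can still be the top 1% of a small suburb, while the large market's top zip code never leaks into the suburb's result.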


The percentile technique gives quantitative descriptions to qualitative business concepts. At the beginning of the article, I shared a story where the client and her team struggled to articulate the meanings of primary and secondary markets, especially when the scope of the business changes depending on which subset of the business is selected (the Chicago metropolitan area or Evanston). With a report that integrates the percentile technique, one can spell out what a person means by primary and secondary markets.


For example, suppose a marketing manager proposes that, for the upcoming quarter, her team expand the business’s primary market by introducing a new internet campaign. If an executive asks her what she means by "primary market", she can point to a specific percentile figure, such as the areas in the top quarter of sales (at or above the 75th percentile). The technique enables precise communication when formulating business strategies.
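One way to turn such a statement into something checkable is to label each area by where its sales fall relative to agreed percentile cutoffs. The sketch below is only an illustration: the 75th percentile cutoff for "primary" follows the example above, while the 50th percentile cutoff for "secondary", the "other" label, and the sales figures are assumptions invented here. It uses `statistics.quantiles` from the Python standard library with the inclusive (linear interpolation) method.

```python
import statistics


def label_markets(volume_by_area, primary_pct=75, secondary_pct=50):
    """Label each area as primary, secondary, or other by percentile cutoffs."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(
        list(volume_by_area.values()), n=100, method="inclusive"
    )
    primary_cut = cuts[primary_pct - 1]
    secondary_cut = cuts[secondary_pct - 1]

    labels = {}
    for area, volume in volume_by_area.items():
        if volume >= primary_cut:
            labels[area] = "primary"
        elif volume >= secondary_cut:
            labels[area] = "secondary"
        else:
            labels[area] = "other"
    return labels


# Invented sales volumes per area.
sales = {"A": 100, "B": 80, "C": 60, "D": 40, "E": 20}

labels = label_markets(sales)
print(labels)
# {'A': 'primary', 'B': 'primary', 'C': 'secondary', 'D': 'other', 'E': 'other'}
```

The cutoffs are parameters on purpose: the point of the technique is that the team agrees on the numbers, not that any particular number is correct.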


The technique illustrated is an example of a template model that can be adapted to provide insights into a wide range of business problems. In the food inspections report, we saw that the report adapts to a change of environment, from a large city to a suburb, when retrieving the relevant top areas of inspections. The technique can be applied to capture where a local office is drawing its revenue from or where the highest-achieving students are coming from. The technique does not have to depend on a geographical map. We can abstract the notion of a map into its elemental structure, a mathematical space. Without going into technical details, this technique requires two things: a space, such as a geographical map, and a population distribution over which percentiles can be calculated.


One example that does not use a geographical space is mapping the medical profiles of patients who have cancer. The map is replaced with the medical profile of each person. The specific areas (zip codes) from the previous example are replaced with features within the medical profiles, such as age group, eating habits, genetic attributes, symptoms and previous diagnoses. The population distribution consists of the volume of each feature across the medical profiles. Instead of comparing zip codes as in the food inspection example, this case explores which features appear most often among breast cancer patients. For example, the model may show a BRCA1 genetic defect as a top 1% (99th percentile) feature across the medical profiles of many breast cancer patients.
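The same idea can be sketched without any map at all: the "space" becomes a set of features, and the "volume" becomes how many profiles contain each feature. The sketch below uses abstract placeholder features (`feature_a` to `feature_d`) and four invented profiles; it makes no claim about real medical data, and the function name `top_features` is hypothetical.

```python
import statistics
from collections import Counter


def top_features(profiles, pct=99):
    """Return features whose frequency is at or above the pct-th percentile."""
    # The volume of a feature is the number of profiles that contain it.
    counts = Counter(feature for profile in profiles for feature in profile)
    # Index pct - 1 of the 99 cut points is the pct-th percentile.
    cut = statistics.quantiles(
        list(counts.values()), n=100, method="inclusive"
    )[pct - 1]
    return {feature: n for feature, n in counts.items() if n >= cut}


# Invented profiles; each is a set of placeholder feature names.
profiles = [
    {"feature_a", "feature_b"},
    {"feature_a", "feature_c"},
    {"feature_a", "feature_b", "feature_d"},
    {"feature_a"},
]

print(top_features(profiles))  # {'feature_a': 4}
```

Only the keys changed, from zip codes to features; the percentile step is the same as in the map example.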


The technique is a model and not a solution to business problems. The fact that the model places a genetic mutation in BRCA1 within the top 1% of features for breast cancer (i.e. most common among patients) does not necessarily mean that the mutation causes cancer. Milton Friedman, a prominent figure in economics, explains the scope of models in his essay "The Methodology of Positive Economics":

“abstract model and its ideal types but also of a set of rules…. The ideal types are not intended to be descriptive; they are designed to isolate the features that are crucial for a particular problem.” - Milton Friedman in "The Methodology of Positive Economics"

The strength of this technique is its ability to highlight the important elements of the business problem at hand: the area where a large portion of revenue is generated, the area with the most food code violations, or a leading feature associated with cancer. The technique highlights specific features by simplifying the business problem and the environment with assumptions and rules. In the food inspection report, the bad neighborhood for food code violations is determined only by the volume of inspections. Further analysis is required to conclude definitively that the identified area is indeed plagued with kitchens infested with roaches. The model is effective in its ability to deliver a clear answer given the simplified environment. And the simplicity of the model allows both technical and business users to understand its scope.


This article is the first part of a series in which I will unveil techniques that I have found useful when consulting in the data industry. While I find sophisticated techniques attractive, I believe that, as a consultant, choosing a parsimonious technique is more prudent than choosing a methodology that clients cannot easily use in their decision making. Future parts of the series will continue to unveil specific examples and techniques that are aligned with the spirit of this story.


External Links:

Food Inspection Density Map of Chicago

Food Inspection Data