2) Does a White-Sounding Name Help to Get a Job Interview?

Vitor Kamada

E-mail: econometrics.methods@gmail.com

Last updated: 9-15-2020

Let’s load the dataset from Bertrand & Mullainathan (2004).

import pandas as pd
path = "https://github.com/causal-methods/Data/raw/master/" 
df = pd.read_stata(path + "lakisha_aer.dta")
df.head(4)
id ad education ofjobs yearsexp honors volunteer military empholes occupspecific ... compreq orgreq manuf transcom bankreal trade busservice othservice missind ownership
0 b 1 4 2 6 0 0 0 1 17 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 b 1 3 3 6 0 1 1 0 316 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 b 1 4 1 6 0 0 0 0 19 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 b 1 3 4 6 0 1 0 1 313 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

4 rows × 65 columns

Let’s restrict the analysis to the variables ‘call’ and ‘race’.

call: 1 = the applicant was called back for an interview; 0 otherwise.

race: w = White, and b = Black.

callback = df.loc[:, ('call', 'race')]
callback
call race
0 0.0 w
1 0.0 w
2 0.0 b
3 0.0 b
4 0.0 w
... ... ...
4865 0.0 b
4866 0.0 b
4867 0.0 w
4868 0.0 b
4869 0.0 w

4870 rows × 2 columns

Let’s calculate the number of observations (size) and the mean of the variable ‘call’, broken down by race.

We have the same number (2,435) of curricula vitae (CVs) for Black and White applicants.

Only 6.45% of Black applicants received a callback, whereas 9.65% of White applicants did.

Therefore, White applicants are about 50% more likely to receive a callback for an interview.

In other words, for every 10 CVs that a White applicant sends to get 1 job interview, a Black applicant needs to send about 15 CVs to get the same outcome.
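
The 10-versus-15 figure follows directly from the callback rates: the expected number of CVs per callback is the reciprocal of the rate. A quick back-of-the-envelope check, using the rounded rates reported below:

```python
# Expected number of CVs needed to get one callback = 1 / callback rate
rate_white = 0.0965  # callback rate for White-sounding names
rate_black = 0.0645  # callback rate for Black-sounding names

cvs_white = 1 / rate_white  # about 10 CVs
cvs_black = 1 / rate_black  # about 15-16 CVs

print(f"White applicants: ~{cvs_white:.1f} CVs per callback")
print(f"Black applicants: ~{cvs_black:.1f} CVs per callback")
```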

# Display 4 decimal places
pd.set_option('display.precision', 4)

import numpy as np
callback.groupby('race').agg([np.size, np.mean])
call
size mean
race
b 2435.0 0.0645
w 2435.0 0.0965

Somebody might argue that this difference of 3.2 percentage points (9.65 − 6.45) does not necessarily imply discrimination against Blacks.

You could argue that White applicants receive more callbacks because they have more education, experience, and skills, and not because of their skin color.

Specifically, you could argue that a White applicant is more likely to receive a callback because they are more likely to have a college degree (a signal of qualifications) than a Black applicant.

If you extract a random sample of the US population, or check data from the US Census, you will see that Blacks are less likely to have a college degree than Whites. This is an indisputable fact.

Let’s check the proportion of Black and White applicants with a college degree in the dataset from Bertrand & Mullainathan (2004).

Originally, the variable ‘education’ was coded as: 4 = college graduate, 3 = some college, 2 = high school graduate, 1 = some high school, and 0 = not reported.

Let’s create the variable ‘college’: 1 if the applicant completed a college degree; 0 otherwise.

df['college'] = np.where(df['education'] == 4, 1, 0)

We can see that 72.3% of Black applicants have a college degree. The proportion of White applicants with a college degree is very similar: 71.6%.

Why are these numbers not representative of the US population, and why are the two values so close to each other?

Because the data are not a random sample drawn from the real population.

college = df.loc[:, ('college', 'race')]
college.groupby('race').agg([np.size, np.mean])
college
size mean
race
b 2435 0.7228
w 2435 0.7162

Bertrand & Mullainathan (2004) produced experimental data. They created the CVs. They randomly assigned a Black-sounding name (e.g., Lakisha or Jamal) to half of the CVs and a White-sounding name (e.g., Emily or Greg) to the other half.

Randomizing race via the name makes the two groups, White and Black, equal (very similar) to each other on all observable and unobservable factors.
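
A small simulation (not part of the original study; the numbers are invented for illustration) shows why: if we randomly split a pool of CVs in two, any pre-existing characteristic ends up with nearly the same mean in both halves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariate: years of experience for 4870 fictitious CVs
yearsexp = rng.normal(loc=7.8, scale=5.0, size=4870)

# Randomly assign a Black-sounding name (1) to half of the CVs
# and a White-sounding name (0) to the other half
treatment = rng.permutation(np.repeat([0, 1], 2435))

mean_control = yearsexp[treatment == 0].mean()
mean_treated = yearsexp[treatment == 1].mean()

# The two group means are very close: randomization balances the covariate
print(mean_control, mean_treated)
```

The same logic applies to any characteristic, observed or not: random assignment does not let the covariate influence which group a CV lands in.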

Let’s check this statement for other factors in the CVs. The names of the variables are self-explanatory, and more information can be obtained by reading the paper by Bertrand & Mullainathan (2004).

resume = ['college', 'yearsexp', 'volunteer', 'military',
          'email', 'workinschool', 'honors',
          'computerskills', 'specialskills']
both = df.loc[:, resume]
both.head()
college yearsexp volunteer military email workinschool honors computerskills specialskills
0 1 6 0 0 0 0 0 1 0
1 0 6 1 1 1 1 0 1 0
2 1 6 0 0 0 1 0 1 0
3 0 6 1 0 1 0 0 1 1
4 0 22 0 0 1 1 0 1 0

Let’s use a different code to calculate the mean of the variables for the whole sample (both Whites and Blacks) and for the samples split between Blacks and Whites.

See that the average years of experience (yearsexp) is 7.84 for the whole sample, 7.83 for Blacks, and 7.86 for Whites.

If you check all the variables, the mean values for Blacks are very close to the mean values for Whites. This is a consequence of randomization.

We also calculate the standard deviation (std), a measure of variation around the mean. Note that the standard deviation is pretty much the same between the whole sample and the split samples. As with the mean, you should not expect to see much difference among the standard deviations in the case of experimental data.

The standard deviation of the variable years of experience is about 5 years. Roughly speaking, most observations (about 68%) lie between 1 std below the mean and 1 std above the mean, that is, in the interval [2.84, 12.84].
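
The 68% figure comes from the normal distribution; a quick check with simulated data (synthetic, not the experiment’s):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "years of experience" with mean 7.84 and std 5, as in the sample
x = rng.normal(loc=7.84, scale=5.0, size=100_000)

mean, std = x.mean(), x.std()

# Fraction of observations within one standard deviation of the mean
within_1_std = np.mean((x > mean - std) & (x < mean + std))
print(round(within_1_std, 3))  # close to 0.68
```

For the real ‘yearsexp’ variable, which is discrete and skewed, the share within one std can deviate from 68%; the rule is only a rough normal-approximation guide.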

black = both[df['race']=='b']
white = both[df['race']=='w']
summary = {'mean_both': both.mean(),   'std_both': both.std(),
           'mean_black': black.mean(), 'std_black': black.std(),
           'mean_white': white.mean(), 'std_white': white.std()}
pd.DataFrame(summary)
mean_both std_both mean_black std_black mean_white std_white
college 0.7195 0.4493 0.7228 0.4477 0.7162 0.4509
yearsexp 7.8429 5.0446 7.8296 5.0108 7.8563 5.0792
volunteer 0.4115 0.4922 0.4144 0.4927 0.4086 0.4917
military 0.0971 0.2962 0.1018 0.3025 0.0924 0.2897
email 0.4793 0.4996 0.4797 0.4997 0.4789 0.4997
workinschool 0.5595 0.4965 0.5610 0.4964 0.5581 0.4967
honors 0.0528 0.2236 0.0513 0.2207 0.0542 0.2265
computerskills 0.8205 0.3838 0.8324 0.3735 0.8086 0.3935
specialskills 0.3287 0.4698 0.3273 0.4693 0.3302 0.4704

Why do we care so much about the table above, which shows that the average White and average Black applicants are pretty much the same?

Because the argument that White applicants are more likely to receive a callback due to their higher level of education cannot hold if both groups, White and Black applicants, have a similar level of education.

Neither unobserved nor unmeasurable factors, such as motivation or psychological traits, can be used to justify the different callback rates.

In an experiment, only the treatment variable (race) is exogenously manipulated. Everything else is kept constant; consequently, variations in the outcome variable (callbacks) can only be attributed to variations in the treatment variable (race).

Therefore, an experimental study eliminates the confounding factors present in observational studies.

Experiments are the gold standard in hard science: the most rigorous way to claim causality.

Surveys, censuses, and observational data extracted directly from reality cannot be used to establish causality. They might be useful to capture association, but not causation.

Formally, we can write the causal model below. The 3 lines are equivalent. We can claim that \(\beta\) has a “causal” interpretation only if the treatment variable (\(T\)) was randomized. In the absence of randomization, \(\beta\) captures only correlation.

\[Outcome = Intercept + Slope*Treatment + Error\]
\[Y=\alpha+\beta T +\epsilon\]
\[callbacks = \alpha+\beta race+\epsilon\]

Let’s estimate the model above using the ordinary least squares (OLS) method.

In the statsmodels library of Python, the intercept is represented by a constant column with value 1.

Let’s create the variable ‘treatment’: 1 = Black applicant, and 0 = White applicant.

The variable ‘call’ is the outcome variable (Y).

df['Intercept'] = 1
df['Treatment'] = np.where(df['race'] =='b', 1, 0)
import statsmodels.api as sm
ols = sm.OLS(df['call'], df[['Intercept', 'Treatment']],
                    missing='drop').fit()

Let’s print the results.

print(ols.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0965      0.006     17.532      0.000       0.086       0.107
Treatment     -0.0320      0.008     -4.115      0.000      -0.047      -0.017
==============================================================================

Now we can write the fitted model as:

\[\widehat{callbacks} = 0.0965 - 0.032 \, Treatment\]

We already obtained these coefficients with the code at the beginning of this section, reproduced here:

callback.groupby('race').agg([np.size, np.mean])
call
size mean
race
b 2435.0 0.0645
w 2435.0 0.0965

See that the value of the Intercept is 9.65%. This is the proportion of White applicants that received a callback for an interview.

The coefficient of the treatment variable is -3.2 percentage points. The interpretation is that being a Black applicant “causes” a 3.2 percentage point (6.45% − 9.65%) reduction in callbacks for an interview.

Remember that 3.2 percentage points is a big magnitude, as it represents about a 50% differential. In practical terms, a Black applicant has to send about 15 CVs to secure one interview, rather than the 10 CVs of a White applicant.
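
The match between the OLS coefficients and the group means is not a coincidence: with a single binary regressor, the intercept is exactly the control-group mean and the slope is exactly the difference in means. A minimal sketch with synthetic 0/1 data (callback probabilities borrowed from the results above, everything else invented):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 2435
t = np.repeat([0, 1], n)  # 0 = White-sounding name, 1 = Black-sounding name

# Simulated callbacks with the observed rates (9.65% vs 6.45%)
y = rng.binomial(1, np.where(t == 1, 0.0645, 0.0965))

# OLS slope and intercept for the simple regression y = a + b*t
b = np.cov(t, y)[0, 1] / np.var(t, ddof=1)
a = y.mean() - b * t.mean()

print(a, y[t == 0].mean())                      # intercept = control-group mean
print(b, y[t == 1].mean() - y[t == 0].mean())   # slope = difference in means
```

This algebraic identity holds for any binary regressor, which is why the regression table and the groupby table tell the same story.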

The coefficient of the treatment variable is also statistically significant at the 1% significance level (\(\alpha = 0.01\)).

The t-value of -4.115 is the ratio:

\[t = \frac{coefficient}{standard\ error} =\frac{-0.032}{0.008} = -4.115\]
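We can reproduce this t-statistic (approximately) from the two callback rates alone, using the standard error of a difference between two independent proportions. Note that this unpooled formula differs slightly from the homoskedastic standard error that OLS reports:

```python
import math

p_w, p_b, n = 0.0965, 0.0645, 2435  # callback rates and group size

diff = p_b - p_w  # -0.032

# Standard error of the difference between two independent proportions
se = math.sqrt(p_w * (1 - p_w) / n + p_b * (1 - p_b) / n)

t = diff / se
print(round(se, 4), round(t, 2))  # se ≈ 0.0078, t ≈ -4.11
```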

The null hypothesis is:

\[H_0: \beta=0\]

The t-value of about -4 means that the observed value (-3.2 percentage points) is 4 standard errors below the hypothesized value (\(\beta=0\)). The p-value, the probability of getting such a value by chance, is virtually 0. Therefore, we reject the null hypothesis that the treatment effect is 0.

What defines an experiment?

The randomization of the treatment variable (T).

It automatically makes the treatment variable (T) independent of other factors:

\[T \perp Other \ Factors\]

In an experiment, the addition of other factors to the regression should not affect the estimate of the coefficient of the treatment variable (\(\beta\)). If you see substantial changes in \(\beta\), you can infer that you are not working with experimental data.

Note that in observational studies you must always control for other factors; otherwise, you will have an omitted variable bias problem.
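
A small simulation (synthetic data, invented for illustration) makes both points concrete: with randomized T, adding the covariate X barely moves \(\hat{\beta}\); but when T is correlated with X, as in observational data, omitting X biases \(\hat{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

x = rng.normal(size=n)  # a covariate that affects the outcome

# Experimental data: T randomized, hence independent of x
t_exp = rng.integers(0, 2, size=n)
y_exp = 1.0 * t_exp + 2.0 * x + rng.normal(size=n)  # true treatment effect = 1

def ols(y, X):
    """OLS coefficients via least squares; columns of X are the regressors."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
b_short = ols(y_exp, np.column_stack([const, t_exp]))[1]       # omit x
b_long = ols(y_exp, np.column_stack([const, t_exp, x]))[1]     # control for x
print(b_short, b_long)   # both close to the true effect 1.0

# Observational-style data: T depends on x, so omitting x biases the estimate
t_obs = (x + rng.normal(size=n) > 0).astype(float)
y_obs = 1.0 * t_obs + 2.0 * x + rng.normal(size=n)
b_biased = ols(y_obs, np.column_stack([const, t_obs]))[1]
print(b_biased)          # far above 1.0: omitted variable bias
```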

Let’s estimate the multiple regression below:

\[y=\alpha+\beta T + Other\ Factors+\epsilon\]
other_factors = ['college', 'yearsexp', 'volunteer', 'military',
          'email', 'workinschool', 'honors',
          'computerskills', 'specialskills']
multiple_reg = sm.OLS(df['call'],
                      df[['Intercept', 'Treatment'] + other_factors],
                      missing='drop').fit()

We can see that, as expected, the coefficient of the Treatment (-3.1%) didn’t change much with the additional control variables.

print(multiple_reg.summary().tables[1])
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.0547      0.015      3.727      0.000       0.026       0.083
Treatment         -0.0311      0.008     -4.032      0.000      -0.046      -0.016
college            0.0068      0.009      0.768      0.443      -0.010       0.024
yearsexp           0.0029      0.001      3.630      0.000       0.001       0.005
volunteer         -0.0032      0.011     -0.295      0.768      -0.024       0.018
military          -0.0033      0.014     -0.232      0.817      -0.032       0.025
email              0.0143      0.011      1.285      0.199      -0.008       0.036
workinschool       0.0008      0.009      0.093      0.926      -0.016       0.018
honors             0.0642      0.018      3.632      0.000       0.030       0.099
computerskills    -0.0202      0.011     -1.877      0.061      -0.041       0.001
specialskills      0.0634      0.009      7.376      0.000       0.047       0.080
==================================================================================

Exercises

1| In the literature of racial discrimination, there are more than 1000 observational studies for each experimental study. Suppose you read 100 observational studies that indicate that racial discrimination is real. Suppose that you also read 1 experimental study that claims no evidence of racial discrimination. Are you more inclined to accept the result of 100 observational studies or the result of one experimental study? Justify your answer.

2| When is the difference in group means biased, failing to capture the average causal effect? Justify your answer using math equations.

3| Interpret the 4 values of the contingency table below. Specifically, state the meaning and compare the values.

The variable ‘h’: 1 = higher quality curriculum vitae; 0 = lower quality curriculum vitae. This variable was randomized.

Other variables were previously defined.

contingency_table = pd.crosstab(df['Treatment'], df['h'], 
                                values=df['call'], aggfunc='mean')
contingency_table
h 0.0 1.0
Treatment
0 0.0850 0.1079
1 0.0619 0.0670

4| I created an interaction variable ‘h_Treatment’ that is the pairwise multiplication of the variables ‘h’ and ‘Treatment’.

How can you use the coefficients of the regression below to get the values of the contingency table in exercise 3? Show the calculations.

df['h_Treatment'] = df['h']*df['Treatment']
interaction = sm.OLS(df['call'],
                      df[['Intercept', 'Treatment', 'h', 'h_Treatment'] ],
                      missing='drop').fit()
print(interaction.summary().tables[1])                      
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.0850      0.008     10.895      0.000       0.070       0.100
Treatment      -0.0231      0.011     -2.094      0.036      -0.045      -0.001
h               0.0229      0.011      2.085      0.037       0.001       0.045
h_Treatment    -0.0178      0.016     -1.142      0.253      -0.048       0.013
===============================================================================

5| I ran the regression below without the interaction term ‘h_Treatment’. Could the coefficients below be used to get the values of the contingency table in exercise 3? If yes, show the exact calculations.

interaction = sm.OLS(df['call'],
                      df[['Intercept', 'Treatment', 'h'] ],
                      missing='drop').fit()
print(interaction.summary().tables[1])    
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0894      0.007     13.250      0.000       0.076       0.103
Treatment     -0.0320      0.008     -4.116      0.000      -0.047      -0.017
h              0.0141      0.008      1.806      0.071      -0.001       0.029
==============================================================================

6| Write code to get the contingency table below:

h               0.0       1.0
firstname
Aisha      0.010000  0.037500
Allison    0.121739  0.068376

Inside the table are the callback rates broken down by curriculum vitae quality. What is the callback rate for Kristen and Lakisha? Why are the rates so different? Could we justify the rate difference by arguing that one is more educated and qualified than the other?

7| Use the data from Bertrand & Mullainathan (2004) to test if Whites and Blacks have the same average years of experience. State the null hypothesis. Write the mathematical formula of the test. Interpret the result.

8| Think outside the box like Bertrand & Mullainathan (2004). Propose a practical way to figure out whether blue eyes and blond hair cause higher salaries. Be specific on how to implement a randomization strategy in practice.

Reference

Bertrand, Marianne, and Sendhil Mullainathan. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94 (4): 991-1013.