# 8) Do Hosts Discriminate against Black Guests in Airbnb?

[Vitor Kamada](https://www.linkedin.com/in/vitor-kamada-1b73a078)

E-mail: econometrics.methods@gmail.com

Last updated: 11-1-2020

Edelman et al. (2017) found that guests with Black-sounding names are 16% less likely to be accepted on Airbnb than guests with White-sounding names. This result is not a mere correlation, because the race signal was randomized. The only difference between the Black and White guest profiles is the name; in every other respect, the guests are the same.

Let's open the dataset of Edelman et al. (2017). Each row is an Airbnb property in July 2015. The sample comprises all properties in Baltimore, Dallas, Los Angeles, St. Louis, and Washington, DC.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.precision', 3)

# Data from Edelman et al. (2017)
path = "https://github.com/causal-methods/Data/raw/master/" 
df = pd.read_csv(path + "Airbnb.csv")
df.head(5)

Unnamed: 0,host_response,response_date,number_of_messages,automated_coding,latitude,longitude,bed_type,property_type,cancellation_policy,number_guests,bedrooms,bathrooms,cleaning_fee,price,apt_rating,property_setup,city,date_sent,listing_down,number_of_listings,number_of_reviews,member_since,verified_id,host_race,super_host,host_gender,host_age,host_gender_1,host_gender_2,host_gender_3,host_race_1,host_race_2,host_race_3,guest_first_name,guest_last_name,guest_race,guest_gender,guest_id,population,whites,...,host_gender_FF,host_gender_M,host_gender_MM,host_gender_MF,host_gender_same_sex,host_age_cat,ten_reviews,five_star_property,multiple_listings,shared_property,shared_bathroom,has_cleaning_fee,strict_cancellation,young,middle,old,pricey,price_median,log_price,white_proportion,black_proportion,asian_proportion,hispanic_proportion,tract_listings,log_tract_listings,simplified_host_response,graph_bins,yes,baltimore,dallas,los_angeles,sl,dc,total_guests,raw_black,prop_black,any_black,past_guest_merge,filled_september,pr_filled
0,Yes,2015-07-19 08:26:17,2.0,1.0,34.081,-118.27,Real Bed,House,Flexible,3.0,3.0,3.0,30.0,99.0,5.0,Private Room,Los-Angeles,2015-07-19 01:34:00,0.0,1.0,8.0,March 2008,1.0,white,,M,young/middle,M,M,.,white,white,.,Brad,Walsh,white,male,6.0,3340.0,1789.0,...,0,1,0,0,0,1.0,0,1,0,1,0,1,0,0,1,0,0,0,4.595,0.536,0.03,0.145,0.557,16,2.773,Yes,Yes,1.0,0,0,1,0,0,11.0,0.0,0.0,0.0,matched (3),1,0.412
1,No or unavailable,2015-07-14 14:13:39,,1.0,38.911,-77.02,,House,Moderate,2.0,2.0,2.0,,125.0,5.0,Private Room,Washington,2015-07-14 09:53:00,0.0,3.0,185.0,September 2008,1.0,hisp,,F,young,F,F,F,white,hisp,hisp,Brad,Walsh,white,male,6.0,2143.0,847.0,...,0,0,0,0,0,0.0,1,1,1,1,0,0,0,1,0,0,0,1,4.828,0.395,0.448,0.057,0.089,19,2.944,No,No,0.0,0,0,0,0,1,167.0,0.0,0.0,0.0,matched (3),1,0.686
2,Request for more info (Can you verify? How man...,2015-07-20 16:24:08,2.0,0.0,34.005,-118.481,Pull-out Sofa,Apartment,Strict,1.0,1.0,1.0,100.0,135.0,5.0,Private Room,Los-Angeles,2015-07-20 11:25:00,0.0,2.0,20.0,September 2008,0.0,white,,F,middle/young,F,F,.,white,white,.,Brad,Walsh,white,male,6.0,5700.0,4648.0,...,0,0,0,0,0,1.0,1,1,1,1,1,1,1,0,1,0,0,1,4.905,0.815,0.046,0.054,0.119,21,3.045,Requests more information,Conditional No,0.0,0,0,1,0,0,19.0,0.0,0.0,0.0,matched (3),0,0.331
3,I will get back to you,2015-07-20 06:47:38,,0.0,34.092,-118.282,,House,Strict,8.0,8.0,8.0,115.0,319.0,5.0,Entire Place,Los-Angeles,2015-07-20 02:44:00,0.0,1.0,42.0,September 2008,1.0,white,,mix,middle,M,mix,mix,white,white,mult,Tanisha,Jackson,black,female,15.0,2235.0,1393.0,...,0,0,0,0,0,2.0,1,1,0,0,0,1,1,0,1,0,1,1,5.765,0.623,0.043,0.109,0.381,11,2.398,Not sure or check later,Conditional No,0.0,0,0,1,0,0,41.0,0.0,0.0,0.0,matched (3),0,0.536
4,Message not sent,.,,1.0,38.83,-76.897,Real Bed,House,Strict,2.0,2.0,2.0,35.0,40.0,5.0,Private Room,Washington,.,0.0,1.0,37.0,October 2008,0.0,mult,,FF,middle/young,FF,FF,.,mult,mult,.,Lakisha,Jones,black,female,11.0,4696.0,482.0,...,1,0,0,0,1,1.0,1,1,0,1,0,1,1,0,1,0,0,0,3.689,0.103,0.809,0.034,0.057,2,0.693,,,,0,0,0,0,1,28.0,0.0,0.0,0.0,matched (3),1,0.555


The chart below shows that Black guests receive fewer "Yes" responses from hosts than White guests. Somebody might argue that the results of Edelman et al. (2017) are driven by differences in intermediate host responses, such as conditional answers or non-response. For example, one could argue that requests from Black guests are more likely to be treated as fake accounts or spam. However, note that the discrimination result is driven by the "Yes" and "No" responses, not by the intermediate ones.

In [2]:
# Data for bar chart
count = pd.crosstab(df["graph_bins"], df["guest_black"])

import plotly.graph_objects as go

node = ['Conditional No', 'Conditional Yes', 'No',
        'No Response', 'Yes']
fig = go.Figure(data=[
    go.Bar(name='Guest is white', x=node, y=count[0]),
    go.Bar(name='Guest is African American', x=node, y=count[1]) ])

fig.update_layout(barmode='group',
  title_text = 'Host Responses by Race',
  font=dict(size=18) )

fig.show()
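
Raw counts are only directly comparable when the two groups have the same size. A minimal sketch of one way to guard against that, using made-up toy data (not the Airbnb file) and the `normalize` option of `pd.crosstab` to turn counts into within-group shares:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "graph_bins": ["Yes", "Yes", "No", "No Response",
                   "Yes", "No", "No", "Conditional Yes"],
    "guest_black": [0, 0, 0, 0, 1, 1, 1, 1],
})

# normalize="columns" divides each count by its column total,
# so every column sums to 1 and the categories become shares.
share = pd.crosstab(toy["graph_bins"], toy["guest_black"],
                    normalize="columns")
print(share.round(2))
```

The same call with `df` in place of `toy` would give the share of each response category by guest race, which could be passed to the bar chart instead of `count`.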

Let's replicate the main results of Edelman et al. (2017).

In [3]:
import statsmodels.api as sm

df['const'] = 1 

# Column 1
# The default missing='drop' of statsmodels does not apply
# to the cluster variable. Therefore, rows with missing values
# must be dropped explicitly, as below, before computing the
# clustered standard errors.
df1 = df.dropna(subset=['yes', 'guest_black', 'name_by_city'])
reg1 = sm.OLS(df1['yes'], df1[['const', 'guest_black']])
res1 = reg1.fit(cov_type='cluster',
                cov_kwds={'groups': df1['name_by_city']})

# Column 2
vars2 = ['yes', 'guest_black', 'name_by_city', 
        'host_race_black', 'host_gender_M']
df2 = df.dropna(subset = vars2)
reg2 = sm.OLS(df2['yes'], df2[['const', 'guest_black',
                    'host_race_black', 'host_gender_M']])
res2 = reg2.fit(cov_type='cluster',
                cov_kwds={'groups': df2['name_by_city']})

# Column 3
vars3 = ['yes', 'guest_black', 'name_by_city', 
         'host_race_black', 'host_gender_M',
         'multiple_listings', 'shared_property',
         'ten_reviews', 'log_price']
df3 = df.dropna(subset = vars3)
reg3 = sm.OLS(df3['yes'], df3[['const', 'guest_black',
                    'host_race_black', 'host_gender_M',
                    'multiple_listings', 'shared_property',
                    'ten_reviews', 'log_price']])
res3 = reg3.fit(cov_type='cluster',
                cov_kwds={'groups': df3['name_by_city']})

columns =[res1, res2, res3]
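
To make the idea of clustered standard errors more concrete, here is a rough sketch of the textbook cluster-robust "sandwich" formula written in plain NumPy. The function `cluster_se` and the simulated data are hypothetical and only for illustration. statsmodels may apply a finite-sample correction on top of this formula, so the numbers need not match its output exactly.

```python
import numpy as np

def cluster_se(X, y, groups):
    """OLS coefficients and cluster-robust standard errors (no correction)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        rows = groups == g
        score = X[rows].T @ resid[rows]  # sum of x_i * e_i within cluster g
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Simulated data (hypothetical): 40 clusters of 25 observations each
rng = np.random.default_rng(0)
n_clusters, size = 40, 25
groups = np.repeat(np.arange(n_clusters), size)
treat = np.repeat(rng.integers(0, 2, n_clusters), size)  # assigned per cluster
shock = np.repeat(rng.normal(0, 1, n_clusters), size)    # shared within cluster
y = 0.5 - 0.1 * treat + shock + rng.normal(0, 1, groups.size)
X = np.column_stack([np.ones(groups.size), treat])

beta, se_cluster = cluster_se(X, y, groups)
# Treating every row as its own cluster ignores the within-cluster correlation
_, se_unclustered = cluster_se(X, y, np.arange(groups.size))
print("coefficients:   ", beta.round(3))
print("clustered SE:   ", se_cluster.round(3))
print("unclustered SE: ", se_unclustered.round(3))
```

When each row is its own cluster, the formula reduces to the usual heteroskedasticity-robust (HC0) estimator, which is why the cluster variable must be free of missing values and aligned row by row with the regression sample.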


In [4]:
# Library to print publication-quality regression
# tables in LaTeX, HTML, etc.
!pip install stargazer



In column 1, White-sounding names are accepted 49% of the time, whereas Black-sounding names are accepted around 41% of the time. Therefore, a Black name carries a penalty of 8 percentage points. This result is remarkably robust to the set of control variables in columns 2 and 3.
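
The 8 percentage point gap and the "16% less likely" figure quoted at the start of the chapter are two views of the same estimate. A small sketch of the arithmetic, using the rounded coefficients from column 1 of the table (the variable names are only illustrative):

```python
white_rate = 0.488   # constant in column 1: acceptance rate for White names
black_gap = -0.080   # coefficient on guest_black in column 1

# Absolute difference: percentage points
black_rate = white_rate + black_gap

# Relative difference: the gap as a share of the White acceptance rate
relative_drop = -black_gap / white_rate

print(f"Acceptance rate for Black names: {black_rate:.3f}")
print(f"Absolute penalty: {-black_gap * 100:.1f} percentage points")
print(f"Relative penalty: {relative_drop:.1%}")
```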

In [5]:
# Settings for a nice table
from stargazer.stargazer import Stargazer
stargazer = Stargazer(columns)
stargazer.title('The Impact of Race on Likelihood of Acceptance')
stargazer

Dependent variable: yes

| | (1) | (2) | (3) |
|---|---|---|---|
| const | 0.488*** | 0.497*** | 0.755*** |
| | (0.012) | (0.013) | (0.067) |
| guest_black | -0.080*** | -0.080*** | -0.087*** |
| | (0.017) | (0.017) | (0.017) |
| host_gender_M | | -0.050*** | -0.048*** |


The table below presents summary statistics for the hosts and properties. In a randomized experiment, the mean of each control variable should be approximately the same in the treatment group and the control group, and therefore close to the mean of the full sample.

In [6]:
control = ['host_race_white', 'host_race_black', 'host_gender_F',
           'host_gender_M', 'price', 'bedrooms', 'bathrooms',
           'number_of_reviews', 'multiple_listings', 'any_black',
           'tract_listings', 'black_proportion']

df.describe()[control].T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| host_race_white | 6392.0 | 0.634 | 0.482 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| host_race_black | 6392.0 | 0.078 | 0.269 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| host_gender_F | 6392.0 | 0.376 | 0.485 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| host_gender_M | 6392.0 | 0.298 | 0.457 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| price | 6302.0 | 181.108 | 1280.228 | 10.0 | 75.0 | 109.0 | 175.0 | 100000.0 |
| bedrooms | 6242.0 | 3.177 | 2.265 | 1.0 | 2.0 | 2.0 | 4.0 | 16.0 |
| bathrooms | 6285.0 | 3.169 | 2.264 | 1.0 | 2.0 | 2.0 | 4.0 | 16.0 |
| number_of_reviews | 6390.0 | 30.869 | 72.505 | 0.0 | 2.0 | 9.0 | 29.0 | 1208.0 |
| multiple_listings | 6392.0 | 0.326 | 0.469 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| any_black | 6390.0 | 0.282 | 0.45 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


The balance tests (t-tests) below show no statistically significant differences in host and property characteristics between Black and White guests: all reported p-values are above 0.10.

In [7]:
result = []

for var in control:
    # Regress the covariate on the treatment dummy; the p-value of
    # guest_black is the t-test for equal means across groups
    pvalue = sm.OLS(df[var], df[['const', 'guest_black']],
                    missing='drop').fit().pvalues['guest_black']
    result.append(pvalue)

In [8]:
ttest = df.groupby('guest_black')[control].agg(['mean']).T
ttest['p-value'] = result
ttest

| | | guest_black = 0.0 | guest_black = 1.0 | p-value |
|---|---|---|---|---|
| host_race_white | mean | 0.643 | 0.626 | 0.154 |
| host_race_black | mean | 0.078 | 0.078 | 0.972 |
| host_gender_F | mean | 0.381 | 0.372 | 0.439 |
| host_gender_M | mean | 0.298 | 0.299 | 0.896 |
| price | mean | 166.429 | 195.815 | 0.362 |
| bedrooms | mean | 3.178 | 3.176 | 0.962 |
| bathrooms | mean | 3.172 | 3.167 | 0.927 |
| number_of_reviews | mean | 30.709 | 31.03 | 0.86 |
| multiple_listings | mean | 0.321 | 0.33 | 0.451 |
| any_black | mean | 0.287 | 0.277 | 0.382 |
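
The logic of a balance check can be illustrated with simulated data. The sketch below is only an illustration under stated assumptions: the covariate `price` and the assignment `treated` are invented, and a Welch t-statistic (unequal variances) stands in for the regression-based test used above.

```python
import numpy as np

def welch_t(a, b):
    """Welch t-statistic for the difference in means between a and b."""
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return diff / se

rng = np.random.default_rng(0)
n = 5000

# Hypothetical covariate, fixed before the treatment is assigned
price = rng.lognormal(mean=4.7, sigma=0.6, size=n)

# Random assignment that ignores the covariate entirely
treated = rng.integers(0, 2, size=n).astype(bool)

t_stat = welch_t(price[treated], price[~treated])
print(f"Mean price, treated: {price[treated].mean():.2f}")
print(f"Mean price, control: {price[~treated].mean():.2f}")
print(f"Welch t-statistic:   {t_stat:.2f}")
```

Because the assignment is independent of the covariate, the two group means differ only by sampling noise, and across repeated draws the t-statistic should exceed 1.96 in absolute value only about 5% of the time.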


## Exercises

1| To the best of my knowledge, the three most important empirical papers in the literature on racial discrimination are Bertrand & Mullainathan (2004), Oreopoulos (2011), and Edelman et al. (2017). These three papers use field experiments to identify causal effects and rule out confounding factors. Search the Internet and compile a reference list of experimental papers about racial discrimination.

2| Tell me a topic that you are passionate about. Compile a reference list of experimental papers about your topic.

3| Somebody argues that specific names drive the results of Edelman et al. (2017). In the tables below, you can see that only a few different names represent Black and White guests. How can this criticism be refuted? What can you do to show that the results are not driven by specific names?

In [9]:
female = df['guest_gender']=='female'
df[female].groupby(['guest_race', 'guest_first_name'])['yes'].mean()

guest_race  guest_first_name
black       Lakisha             0.433
            Latonya             0.370
            Latoya              0.442
            Tamika              0.482
            Tanisha             0.413
white       Allison             0.500
            Anne                0.567
            Kristen             0.486
            Laurie              0.508
            Meredith            0.498
Name: yes, dtype: float64

In [10]:
male = df['guest_gender']=='male'
df[male].groupby(['guest_race', 'guest_first_name'])['yes'].mean()

guest_race  guest_first_name
black       Darnell             0.412
            Jamal               0.354
            Jermaine            0.379
            Kareem              0.436
            Leroy               0.371
            Rasheed             0.409
            Tyrone              0.377
white       Brad                0.419
            Brent               0.494
            Brett               0.466
            Greg                0.467
            Jay                 0.581
            Todd                0.448
Name: yes, dtype: float64

4| Is there any potential research question that can be explored based on the table below? Justify.

In [11]:
pd.crosstab(index= [df['host_gender_F'], df['host_race']],
            columns=[df['guest_gender'], df['guest_race']], 
            values=df['yes'], aggfunc='mean')

Columns are guest_gender and guest_race; cell values are the mean of yes.

| host_gender_F | host_race | female, black | female, white | male, black | male, white |
|---|---|---|---|---|---|
| 0 | UU | 0.4 | 0.542 | 0.158 | 0.381 |
| 0 | asian | 0.319 | 0.378 | 0.474 | 0.511 |
| 0 | black | 0.444 | 0.643 | 0.419 | 0.569 |
| 0 | hisp | 0.464 | 0.571 | 0.375 | 0.478 |
| 0 | mult | 0.568 | 0.727 | 0.408 | 0.357 |
| 0 | unclear | 0.444 | 0.5 | 0.444 | 0.333 |
| 0 | unclear_three votes | 0.476 | 0.392 | 0.368 | 0.367 |
| 0 | white | 0.383 | 0.514 | 0.386 | 0.449 |
| 1 | UU | 0.444 | 0.25 | 0.333 | 0.75 |
| 1 | asian | 0.429 | 0.607 | 0.436 | 0.46 |


5| In Edelman et al. (2017), the variable "name_by_city" was used to cluster the standard errors. How was the variable "name_by_city" created based on other variables? Show the code.



6| Use the data from Edelman et al. (2017) to test the homophily hypothesis that hosts might prefer guests of the same race. Produce a nice table using the library Stargazer. Interpret the results. 

7| Overall, people know that socioeconomic status is correlated with race. Fryer & Levitt (2004) showed that distinct/unique African American names are correlated with lower socioeconomic status. Edelman et al. (2017: 17) clearly state: "Our findings cannot identify whether the discrimination is based on race, socioeconomic status, or a combination of these two."
Propose an experimental design to disentangle the effect of race from socioeconomic status. Explain your assumptions and describe the procedures in detail.

## References

Bertrand, Marianne, and Sendhil Mullainathan. (2004). [Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination](https://github.com/causal-methods/Papers/raw/master/Are%20Emily%20and%20Greg%20More%20Employable%20than%20Lakisha%20and%20Jamal.pdf). American Economic Review, 94 (4): 991-1013. 

Edelman, Benjamin, Michael Luca, and Dan Svirsky. (2017). [Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment](https://github.com/causal-methods/Papers/raw/master/Racial%20Discrimination%20in%20the%20Sharing%20Economy.pdf). American Economic Journal: Applied Economics, 9 (2): 1-22.

Fryer, Roland G., Jr., and Steven D. Levitt. (2004). The Causes and Consequences of Distinctively Black Names. Quarterly Journal of Economics, 119 (3): 767–805.

Oreopoulos, Philip. (2011). [Why Do Skilled Immigrants Struggle in the Labor Market? A Field Experiment with Thirteen Thousand Resumes](https://github.com/causal-methods/Papers/raw/master/Oreopoulos/Why%20Do%20Skilled%20Immigrants%20Struggle%20in%20the%20Labor%20Market.pdf). American Economic Journal: Economic Policy, 3 (4): 148-71.
