3) Are Females More Likely to Complete High School Under Islamic or Secular Regime?

Vitor Kamada

E-mail: econometrics.methods@gmail.com

Last updated: 10-3-2020

Let’s open the data from Meyersson (2014). Each row represents a municipality in Turkey.

# Load data from Meyersson (2014)
import numpy as np
import pandas as pd
path = "https://github.com/causal-methods/Data/raw/master/" 
df = pd.read_stata(path + "regdata0.dta")
df.head()
  province_pre   ilce_pre belediye_pre ... nonagshr1530m anyc
0        Adana     Aladag       Aladag ...      0.273176  1.0
1        Adana     Aladag       Akoren ...      0.146221  0.0
2        Adana Buyuksehir   Buyuksehir ...      0.505949  1.0
3        Adana     Ceyhan     Sarimazi ...      0.347439  0.0
4        Adana     Ceyhan      Sagkaya ...      0.208333  0.0

5 rows × 236 columns

The variable ‘hischshr1520f’ is the proportion of females aged 15-20 who completed high school according to the 2000 census. Unfortunately, age is aggregated. It is unlikely that 15- and 16-year-old teenagers have finished high school in Turkey. It would be better to have the data broken down by age. As the 15- and 16-year-olds cannot be removed from the analysis, the proportion of females aged 15-20 who completed high school is very low: 16.3%.

The variable ‘i94’ is 1 if an Islamic mayor won the 1994 municipal election, and 0 if a secular mayor won. The Islamic party governed 12% of the municipalities in Turkey.

# Drop missing values
df = df.dropna(subset=['hischshr1520f', 'i94'])

# Display 4 decimal places
pd.set_option('display.precision', 4)

# Summary Statistics
df.loc[:, ('hischshr1520f', 'i94')].describe()[0:3]
hischshr1520f i94
count 2632.0000 2632.0000
mean 0.1631 0.1197
std 0.0958 0.3246

The average high school attainment for females aged 15-20 is 14% in the municipalities governed by an Islamic mayor versus 16.6% in the municipalities governed by a secular mayor.

This is a naive comparison, because the data is not from an experiment. The mayor type was not randomized and cannot be randomized in practice. For example, poverty might lead to a higher level of religiosity and lower educational achievement. It might be poverty, rather than religion, that causes the lower rate of high school attainment.

df.loc[:, ('hischshr1520f')].groupby(df['i94']).agg([np.size, np.mean])
size mean
i94
0.0 2317.0 0.1662
1.0 315.0 0.1404

The graph “Naive Comparison” shows that the control group and the treatment group are determined by the variable ‘iwm94’: the Islamic win margin. This variable is centered at 0. Therefore, if the win margin is above 0, the Islamic mayor won the election. On the other hand, if the win margin is below 0, the Islamic mayor lost the election.

In terms of average high school attainment, the difference between the treatment group (14%) and the control group (16.6%) is -2.6 percentage points. The problem with comparing municipal outcomes using observational data is that the treatment group is not similar to the control group. Therefore, confounding factors might bias the results.

import matplotlib.pyplot as plt

# Scatter plot with vertical line
plt.scatter(df['iwm94'], df['hischshr1520f'], alpha=0.2)
plt.vlines(0, 0, 0.8, colors='red', linestyles='dashed')

# Labels
plt.title('Naive Comparison')
plt.xlabel('Islamic win margin')
plt.ylabel('Female aged 15-20 with high school')

# Control vs Treatment
plt.text(-1, 0.7, r'$\bar{y}_{control}=16.6\%$', fontsize=16,
         bbox={'facecolor':'yellow', 'alpha':0.2})
plt.text(0.2, 0.7, r'$\bar{y}_{treatment}=14\%$', fontsize=16,
         bbox={'facecolor':'yellow', 'alpha':0.2})
plt.show()
[Figure: Naive Comparison]

This 2.6 percentage point difference in high school attainment between municipalities governed by an Islamic mayor and those governed by a secular mayor is statistically significant at the 1% level. The magnitude is also relevant, given that the mean value of high school completion is 16.3%. However, note that this is a naive comparison and likely to be biased.

# Naive Comparison
df['Intercept'] = 1
import statsmodels.api as sm
naive = sm.OLS(df['hischshr1520f'], df[['Intercept', 'i94']],
                    missing='drop').fit()
print(naive.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1662      0.002     83.813      0.000       0.162       0.170
i94           -0.0258      0.006     -4.505      0.000      -0.037      -0.015
==============================================================================

One way to figure out whether the naive comparison is likely to be biased is to check whether the municipalities ruled by an Islamic mayor are different from the municipalities ruled by a secular mayor.

The municipalities where the Islamic mayor won have a higher Islamic vote share in 1994 (41% vs 10%), a larger number of parties receiving votes (5.9 vs 5.5), a larger log population (8.3 vs 7.7), a higher population share below 19 years old (44% vs 40%), a larger household size (6.4 vs 5.75), a higher proportion of district centers (39% vs 33%), and a higher proportion of province centers (6.6% vs 1.6%).

df = df.rename(columns={"shhs"   : "household",
                        "merkezi": "district",
                        "merkezp": "province"})

control = ['vshr_islam1994', 'partycount', 'lpop1994',
           'ageshr19', 'household', 'district', 'province']
full = df.loc[:, control].groupby(df['i94']).agg([np.mean]).T
full.index = full.index.get_level_values(0)
full
i94 0.0 1.0
vshr_islam1994 0.1012 0.4145
partycount 5.4907 5.8889
lpop1994 7.7745 8.3154
ageshr19 0.3996 0.4453
household 5.7515 6.4449
district 0.3375 0.3937
province 0.0168 0.0667

One way to make the control and treatment groups similar to each other is to use multiple regression. The interpretation of the coefficient of the treatment variable ‘i94’ is ceteris paribus, that is, the impact of an Islamic mayor on high school attainment holding everything else constant. The catch is that “everything else constant” means only the factors that are controlled for in the regression. This is an imperfect solution because, in practical terms, it is not possible to control for all factors that affect the outcome variable. However, compared to the simple regression, the multiple regression is likely to suffer less from omitted variable bias.

The multiple regression below challenges the result of the naive comparison. The Islamic regime has a positive impact: 1.4 percentage points higher high school completion compared with the secular regime. The result is statistically significant at the 5% level.

multiple = sm.OLS(df['hischshr1520f'],
                      df[['Intercept', 'i94'] + control],
                      missing='drop').fit()
print(multiple.summary().tables[1])                      
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.2626      0.015     17.634      0.000       0.233       0.292
i94                0.0139      0.006      2.355      0.019       0.002       0.026
vshr_islam1994    -0.0894      0.013     -6.886      0.000      -0.115      -0.064
partycount        -0.0038      0.001     -3.560      0.000      -0.006      -0.002
lpop1994           0.0159      0.002      7.514      0.000       0.012       0.020
ageshr19          -0.6125      0.021    -29.675      0.000      -0.653      -0.572
household          0.0057      0.001      8.223      0.000       0.004       0.007
district           0.0605      0.004     16.140      0.000       0.053       0.068
province           0.0357      0.010      3.499      0.000       0.016       0.056
==================================================================================

The result of the multiple regression looks counterintuitive. How can the sign of the treatment variable change?

Let’s look at the data from another perspective. The graph “Naive Comparison” is a scatterplot of all municipalities. Each dot is one municipality. It is hard to see any pattern or trend.

Let’s plot the same graph, but with the municipalities aggregated into 29 bins along the running variable, the Islamic win margin. These bins are the blue circles in the graph below. The size of each circle is proportional to the number of municipalities used to calculate the mean value of high school completion.
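The binning step can be sketched with plain pandas. This is a rough equivalent, assuming rdd.bin_data cuts the running variable into equal-width bins; the data below is simulated for illustration, not the Meyersson data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Simulated running variable and outcome (not the Meyersson data)
toy = pd.DataFrame({'iwm94': rng.uniform(-1, 1, 500)})
toy['hischshr1520f'] = (0.16 + 0.03 * (toy['iwm94'] >= 0)
                        + rng.normal(0, 0.05, 500))

# Cut the running variable into 29 equal-width bins and average
# the outcome within each bin, keeping the bin counts
toy['bin'] = pd.cut(toy['iwm94'], bins=29)
binned = toy.groupby('bin', observed=True).agg(
    hischshr1520f=('hischshr1520f', 'mean'),
    iwm94=('iwm94', 'mean'),
    n_obs=('hischshr1520f', 'size'))
print(binned.head())
```

Plotting the mean outcome against the mean running variable of each bin reproduces the kind of binned scatterplot shown below.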

If you look carefully near the cutoff (the vertical red line), where the Islamic win margin = 0, you will see a discontinuity, or jump, in the level of high school completion.

# Library for Regression Discontinuity
!pip install rdd
from rdd import rdd

# Aggregate the data in 29 bins
threshold = 0
data_rdd = rdd.truncated_data(df, 'iwm94', 0.99, cut=threshold)
data_binned = rdd.bin_data(data_rdd, 'hischshr1520f', 'iwm94', 29)

# Labels
plt.title('Comparison using aggregate data (Bins)')
plt.xlabel('Islamic win margin')
plt.ylabel('Female aged 15-20 with high school')

# Scatterplot 
plt.scatter(data_binned['iwm94'], data_binned['hischshr1520f'],
    s = data_binned['n_obs'], facecolors='none', edgecolors='blue')

# Red Vertical Line
plt.axvline(x=0, color='red')

plt.show()
[Figure: Comparison using aggregate data (Bins)]

Maybe you are not convinced that there is a discontinuity, or jump, at the cutoff point. Let’s plot the same graph with 10 bins and restrict the bandwidth (range) of the Islamic win margin. Rather than choosing an arbitrary bandwidth (h), let’s use the method developed by Imbens & Kalyanaraman (2012) to get the optimal bandwidth that minimizes the mean squared error.

The optimal bandwidth (\(\hat{h}\)) is approximately 0.24; that is, let’s take a window of 0.24 below and above the cutoff to create the 10 bins.

#  Optimal Bandwidth based on Imbens & Kalyanaraman (2012)
#  This bandwidth minimizes the mean squared error.
bandwidth_opt = rdd.optimal_bandwidth(df['hischshr1520f'],
                              df['iwm94'], cut=threshold)
bandwidth_opt
0.2398161605552802

Below are the 10 bins: 5 bins in the control group, where the Islamic win margin < 0, and 5 bins in the treatment group, where the Islamic win margin > 0. Note that high school completion jumps from 13.8% to 15.5% between indices 4 and 5, respectively bins 5 and 6. The values 13.8% and 15.5% were computed based on 141 and 106 municipalities, respectively (‘n_obs’).

#  Aggregate the data in 10 bins using Optimal Bandwidth
data_rdd = rdd.truncated_data(df, 'iwm94', bandwidth_opt, cut=threshold)
data_binned = rdd.bin_data(data_rdd, 'hischshr1520f', 'iwm94', 10)
data_binned
0 hischshr1520f iwm94 n_obs
0 0.0 0.1769 -0.2159 136.0
1 0.0 0.1602 -0.1685 142.0
2 0.0 0.1696 -0.1211 162.0
3 0.0 0.1288 -0.0737 139.0
4 0.0 0.1381 -0.0263 141.0
5 0.0 0.1554 0.0211 106.0
6 0.0 0.1395 0.0685 81.0
7 0.0 0.1437 0.1159 58.0
8 0.0 0.1408 0.1633 36.0
9 0.0 0.0997 0.2107 19.0

In the graph “Comparison using Optimum Bandwidth”, a blue line was fitted to the control group (5 bins), and an orange line was fitted to the treatment group (5 bins). Now the discontinuity, or jump, is clear. This method is called Regression Discontinuity (RD). The red vertical line (\(\hat{\tau}_{rd}=3.5\)%) is the increase in high school completion. Note that this method mimics an experiment: the municipalities where the Islamic party barely won are likely to be similar to those where it barely lost. The intuition is that “barely won” and “barely lost” result from a quasi-random process, like flipping a coin; the reverse election result could have occurred at random. On the other hand, it is hard to imagine Islamic mayors losing in municipalities where they won by a strong margin of 30%.

# Scatterplot
plt.scatter(data_binned['iwm94'], data_binned['hischshr1520f'],
    s = data_binned['n_obs'], facecolors='none', edgecolors='blue')

# Labels
plt.title('Comparison using Optimum Bandwidth (h = 0.24)')
plt.xlabel('Islamic win margin')
plt.ylabel('Female aged 15-20 with high school')

# Regression
x = data_binned['iwm94']
y = data_binned['hischshr1520f']

c_slope , c_intercept = np.polyfit(x[0:5], y[0:5], 1)
plt.plot(x[0:6], c_slope*x[0:6] + c_intercept)

t_slope , t_intercept  = np.polyfit(x[5:10], y[5:10], 1)
plt.plot(x[4:10], t_slope*x[4:10] + t_intercept)

# Vertical Line
plt.vlines(0, 0, 0.2, colors='green', alpha =0.5)
plt.vlines(0, c_intercept, t_intercept, colors='red', linestyles='-')

# Plot Black Arrow
plt.gca().arrow(0, (t_intercept + c_intercept)/2, 
         dx = 0.15, dy =-0.06, head_width=0.02,
         head_length=0.01, fc='k', ec='k')

# RD effect
plt.text(0.1, 0.06, r'$\hat{\tau}_{rd}=3.5\%$', fontsize=16,
         bbox={'facecolor':'yellow', 'alpha':0.2})

plt.show()
[Figure: Comparison using Optimum Bandwidth]
# RD effect given by the vertical red line
t_intercept - c_intercept
0.03584571077550233

Let’s restrict the sample to the municipalities where the Islamic mayor won or lost by a margin of at most 5%. The control group and the treatment group are more similar to each other than in the comparison using the full sample at the beginning of this chapter.

However, this similarity is still far from a “perfect experiment”. Part of the reason is the small sample size of the control and treatment groups. Therefore, when we run the Regression Discontinuity, it is advisable to add the control variables.

# bandwidth (h) = 5%
df5 = df[df['iwm94'] >= -0.05]
df5 = df5[df5['iwm94'] <= 0.05]

sample5 = df5.loc[:, control].groupby(df5['i94']).agg([np.mean]).T

sample5.index = full.index.get_level_values(0)
sample5
i94 0.0 1.0
vshr_islam1994 0.3026 0.3558
partycount 5.9730 5.8807
lpop1994 8.2408 8.2791
ageshr19 0.4415 0.4422
household 6.2888 6.4254
district 0.4595 0.4037
province 0.0338 0.0826

Let’s formalize the theory of Regression Discontinuity.

Let \(D_r\) be a dummy variable: 1 if the unit of analysis receives the treatment, and 0 otherwise. The subscript \(r\) indicates that the treatment (\(D_r\)) is a function of the running variable \(r\).

\[D_r \in \{0, 1\}\]

In the Sharp Regression Discontinuity, the treatment (\(D_r\)) is determined by the running variable (\(r\)).

\[D_r = 1, \ if \ r \geq r_0\]
\[D_r = 0, \ if \ r < r_0\]

where, \(r_0\) is an arbitrary cutoff or threshold.

The most basic specification of the Regression Discontinuity is:

\[Y = \beta_0+\tau D_r+ \beta_1r+\epsilon\]

where \(Y\) is the outcome variable, \(\beta_0\) the intercept, \(\tau\) the impact of the treatment variable (\(D_r\)), \(\beta_1\) the coefficient of the running variable (\(r\)), and \(\epsilon\) the error term.

Note that in an experiment the treatment is randomized, but in Regression Discontinuity the treatment is completely determined by the running variable. The opposite of a random process is a deterministic process. It is counterintuitive, but deterministic assignment has the same effect as randomization when the rule (cutoff) that determines the treatment assignment is arbitrary.

In general, the credibility of observational studies is weak because of the fundamental problem of omitted variable bias (OVB): many unobserved factors inside the error term might be correlated with the treatment variable. Therefore, a big mistake in the regression framework is to leave the running variable inside the error term.

Among all estimators, Regression Discontinuity is probably the closest method to the gold standard, the randomized experiment. The main drawback is that Regression Discontinuity only captures the local average treatment effect (LATE). It is unreasonable to generalize the results to entities outside the bandwidth.

Using a Regression Discontinuity with a bandwidth of 5%, the impact of an Islamic mayor is 4 percentage points higher female high school completion. This result is statistically significant at the 5% level.

#  Real RD specification
#  Meyersson (2014) doesn't use the interaction term because
#  the results are unstable. In general, the coefficient
#  of the interaction term is not statistically significant.
# df['i94_iwm94'] = df['i94']*df['iwm94']
# RD = ['Intercept', 'i94', 'iwm94', 'i94_iwm94']

RD = ['Intercept', 'i94', 'iwm94']

# bandwidth of 5%
df5 = df[df['iwm94'] >= -0.05]
df5 = df5[df5['iwm94'] <= 0.05]
rd5 = sm.OLS(df5['hischshr1520f'],
                      df5[RD + control],
                      missing='drop').fit()
print(rd5.summary()) 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          hischshr1520f   R-squared:                       0.570
Model:                            OLS   Adj. R-squared:                  0.554
Method:                 Least Squares   F-statistic:                     36.32
Date:                Wed, 28 Oct 2020   Prob (F-statistic):           1.67e-40
Time:                        17:41:01   Log-Likelihood:                 353.09
No. Observations:                 257   AIC:                            -686.2
Df Residuals:                     247   BIC:                            -650.7
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.3179      0.043      7.314      0.000       0.232       0.403
i94                0.0399      0.016      2.540      0.012       0.009       0.071
iwm94             -0.4059      0.284     -1.427      0.155      -0.966       0.154
vshr_islam1994    -0.0502      0.060     -0.842      0.401      -0.168       0.067
partycount        -0.0003      0.003     -0.074      0.941      -0.007       0.007
lpop1994           0.0091      0.005      1.718      0.087      -0.001       0.020
ageshr19          -0.7383      0.065    -11.416      0.000      -0.866      -0.611
household          0.0075      0.002      3.716      0.000       0.004       0.011
district           0.0642      0.010      6.164      0.000       0.044       0.085
province           0.0191      0.019      1.004      0.316      -0.018       0.057
==============================================================================
Omnibus:                       17.670   Durbin-Watson:                   1.658
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               19.403
Skew:                           0.615   Prob(JB):                     6.12e-05
Kurtosis:                       3.546   Cond. No.                         898.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using a Regression Discontinuity with the optimal bandwidth of approximately 0.24, calculated based on Imbens & Kalyanaraman (2012), the impact of an Islamic mayor is 2.1 percentage points higher female high school completion. This result is statistically significant at the 1% level.

Therefore, the Regression Discontinuity estimators indicate that the naive comparison is biased in the wrong direction.

# bandwidth_opt is approximately 0.24
df27 = df[df['iwm94'] >= -bandwidth_opt]
df27 = df27[df27['iwm94'] <= bandwidth_opt]
rd27 = sm.OLS(df27['hischshr1520f'],
                      df27[RD + control],
                      missing='drop').fit()
print(rd27.summary()) 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          hischshr1520f   R-squared:                       0.534
Model:                            OLS   Adj. R-squared:                  0.530
Method:                 Least Squares   F-statistic:                     128.8
Date:                Wed, 28 Oct 2020   Prob (F-statistic):          5.95e-161
Time:                        17:41:01   Log-Likelihood:                 1349.9
No. Observations:                1020   AIC:                            -2680.
Df Residuals:                    1010   BIC:                            -2630.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.2943      0.023     12.910      0.000       0.250       0.339
i94                0.0214      0.008      2.775      0.006       0.006       0.036
iwm94             -0.0343      0.038     -0.899      0.369      -0.109       0.041
vshr_islam1994    -0.0961      0.030     -3.219      0.001      -0.155      -0.038
partycount        -0.0026      0.002     -1.543      0.123      -0.006       0.001
lpop1994           0.0135      0.003      4.719      0.000       0.008       0.019
ageshr19          -0.6761      0.032    -20.949      0.000      -0.739      -0.613
household          0.0072      0.001      6.132      0.000       0.005       0.010
district           0.0575      0.006     10.364      0.000       0.047       0.068
province           0.0390      0.010      3.788      0.000       0.019       0.059
==============================================================================
Omnibus:                      179.124   Durbin-Watson:                   1.610
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              373.491
Skew:                           1.001   Prob(JB):                     7.90e-82
Kurtosis:                       5.186   Cond. No.                         270.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercises

1| Use the data from Meyersson (2014) to run a Regression Discontinuity: a) with the full sample, and b) with a bandwidth of 0.1 (10% on both sides). Use the same specification as the two examples in this chapter. Interpret the coefficient of the treatment variable. Which result is more credible, “a” or “b”? Justify.

2| Below is the histogram of the variable Islamic win margin. Do you see any discontinuity or abnormal pattern at the cutoff = 0? What is the rationale for investigating whether something strange is going on around the cutoff of the running variable?

import plotly.express as px
fig = px.histogram(df, x="iwm94")
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0 = 0, y1 = 1,
      xref= 'x', x0 = 0, x1 = 0)])
fig.show()

3| I modified the variable “Islamic win margin” for educational purposes. Suppose this is the real running variable from Meyersson (2014). See the histogram below. In this hypothetical situation, what can you infer about the elections in Turkey? Is there a problem with running Regression Discontinuity in this situation? If yes, what can you do to solve the problem?

def corrupt(variable):
    if variable <= 0 and variable >= -.025:
        return 0.025
    else:   
        return variable

df['running'] = df["iwm94"].apply(corrupt)

fig = px.histogram(df, x="running")
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0 = 0, y1 = 1,
      xref= 'x', x0 = 0, x1 = 0)])
fig.show()

4| Explain the graph below to somebody who is an expert in Machine Learning but is not trained in Causal Inference. Could the variable “Islamic vote share” be used as the running variable? Speculate.

def category(var):
    if var <= 0.05 and var >= -.05:
        return "5%"
    else:   
        return "rest"

df['margin'] = df["iwm94"].apply(category)

fig = px.scatter(df, x="vshr_islam1994", y="iwm94", color ="margin",
                 labels={"iwm94": "Islamic win margin",
                         "vshr_islam1994": "Islamic vote share"})
fig.update_layout(shapes=[
    dict(
      type= 'line',
      yref= 'paper', y0 = 1/2, y1 = 1/2,
      xref= 'x', x0 = 0, x1 = 1)])
fig.show()

5| Are males more likely to complete high school under an Islamic or a secular regime? Justify your answer based on data and rigorous analysis. The variable “hischshr1520m” is the proportion of males aged 15-20 with a high school education.

Reference

Imbens, G., & Kalyanaraman, K. (2012). Optimal Bandwidth Choice for the Regression Discontinuity Estimator. The Review of Economic Studies, 79(3), 933-959.

Meyersson, E. (2014). Islamic Rule and the Empowerment of the Poor and Pious. Econometrica, 82(1), 229-269.