
Set Up Random Data for Regression Using Data Simulation to Run Regression in Two Ways in Python

Use-case: the relation and difference between the sample mean difference and regression coefficients in linear data models


How to generate random data for linear regression in Python: this is a Python tutorial on setting up or generating random sample data (simulated data) with the goal of understanding how regression works. In this tutorial, we provide a practical statistical lecture and the relevant Python code to

  1. simulate data based on a user-defined linear model,
  2. run regression models in two ways (using the ready-made Python OLS function, and using a custom user-defined function based on the linear-algebra formula for the regression coefficients), and
  3. compare the sample mean difference with the regression results when calculating the average treatment effect.

This statistical regression and Python tutorial for beginners also provides an example for learning the concept of conditional dependence in regression, and shows how covariate features affect the estimation of the average treatment effect in bivariate or multivariate linear regression (OLS, Ordinary Least Squares) models.

A video tutorial for this blog is available here; in it I explain the steps below in more detail.

Check out these related tutorials at your convenience:

  • For python related tutorials, see this playlist of video tutorials: https://www.youtube.com/playlist?list=PL_b86y-oyLzAQOf0W7MbCXYxEkv27gPN6
  • For statistical and econometric related tutorials, see this playlist of video tutorials: https://www.youtube.com/watch?v=aHBquefG6dQ&list=PL_b86y-oyLzDTtPT8Y1zTt4kLpdOBhxOZ

Step 1: Packages

# basic packages
import pandas as pd
import numpy as np
# data viz packages
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style
# statistics: regression (statsmodels) and the logistic function (expit)
import statsmodels.formula.api as smf
from scipy.special import expit
# for drawing causal graphs
import graphviz as gr
%matplotlib inline

Step 2: Data simulation

We simulate a linear model with the following information:

  • a binary treatment W, which is NOT randomly assigned
  • a covariate X
  • an outcome variable y

The linear data model and some related concepts:
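
The model itself is not written out in the text here, but it can be read directly off the simulation code below; as a reconstruction, the data-generating process is:

$$X \sim \mathcal{N}(20,\ 100^2), \qquad W \mid X \sim \mathrm{Bernoulli}\big(\sigma\big((X - \bar{X}) / s_X\big)\big), \qquad y = 10 - 5W + 2X + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,\ 5^2)$$

where $\sigma$ is the logistic (expit) function, $\bar{X}$ and $s_X$ are the sample mean and standard deviation of X, and the true average treatment effect of W on y is -5.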

The code for simulating the data, with the true parameters for the regression coefficients and the feature distributions:

style.use("fivethirtyeight")

np.random.seed(123)
n = 100000
# covariate: X ~ Normal(mean=20, sd=100), rounded
X = np.random.normal(20, 100, n).round()
# treatment: the probability of W=1 increases with X, so W is NOT randomly assigned
W = np.random.binomial(1, expit((X - X.mean()) / X.std())).astype(bool)
# outcome: y = 10 - 5*W + 2*X + noise, with noise sd = 5
y = np.random.normal(10 - 5 * W + 2 * X, 5)

data = pd.DataFrame(dict(y=y, X=X, W=W))
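
Because the probability of treatment rises with X, treated and untreated units differ systematically in the covariate. A quick sanity check of the simulated data (this snippet is an addition, not part of the original code):

# mean of the covariate X by treatment status; the two groups should differ
# markedly, confirming that W is NOT randomly assigned
print(data.groupby("W")["X"].mean())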

Step 4: The sample mean difference is not equal to the regression coefficient of a binary variable in multivariate regression

Although the true coefficient of W on y in the simulation is negative (-5), the simple sample mean difference of y between the two W groups, calculated below, is positive!
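
A minimal sketch of that naive comparison (the exact snippet is not shown in the original; the calculation is just a difference of group means):

# average outcome for treated minus untreated units
naive_diff = data.loc[data["W"], "y"].mean() - data.loc[~data["W"], "y"].mean()
print(naive_diff)  # positive, even though the simulated effect of W is -5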

Why? The reason is the presence of the covariate X in the regression. In a bivariate linear regression with a single binary feature, the sample mean difference is exactly equal to the regression coefficient.
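
You can verify that claim directly with a regression of y on W alone (a sketch, not code from the original post):

# with only the binary W in the model, the W coefficient equals the naive mean difference
model_biv = smf.ols('y ~ W', data=data).fit()
print(model_biv.params)  # the W term reproduces naive_diff from above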

Here, since X and W are correlated, we need to look at the conditional sample mean difference instead.

Step 5: Visualize the relation between y, X, and W

plt.figure(figsize=(10, 6))
# outcome vs. covariate, colored by treatment status
sns.scatterplot(x="X", y="y", hue="W", data=data, s=70)
plt.show()
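
The graphviz package imported in Step 1 is not used in any of the snippets shown here; presumably it is meant for drawing the causal structure behind this simulation. A minimal sketch of what that graph could look like (an assumption, not code from the original):

# causal graph of the simulation: X drives both treatment assignment and the outcome
g = gr.Digraph()
g.edge("X", "W")  # X affects treatment assignment (W is not random)
g.edge("X", "y")  # X affects the outcome directly
g.edge("W", "y")  # the treatment effect we want to estimate (-5 in the simulation)
g  # in a Jupyter notebook, this renders the graph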

Step 6.1: Performing regression in two methods
Method One: the ready-made function ols from statsmodels.formula.api

model1 = smf.ols('y ~ X+W', data=data).fit()
model1.summary().tables[1]
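
With the covariate X included, the estimated W coefficient should come out close to the true -5, in contrast to the positive naive mean difference from Step 4. To pull the point estimates out of the fitted model:

print(model1.params)  # expect values near 10 (intercept), 2 (X), and -5 (the W term)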

Step 6.2: Method Two: regression using the linear algebraic formula

The formula is explained in more detail in the video.
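
For reference (the full derivation is in the video), the closed-form estimator that the code below implements is the normal-equation solution

$$\hat{\beta} = (Z^{\top} Z)^{-1} Z^{\top} y, \qquad e = y - Z\hat{\beta},$$

where Z is the design matrix containing X, W, and a column of ones for the intercept, and e are the residuals.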

The code:

data.loc[:,'W'] = data['W'].astype('int')  # boolean treatment -> 0/1
Z = data[['X','W']].assign(intercep=1)     # design matrix with an intercept column
y = data["y"]

# the regress helper is not defined in the original snippet; this is the standard
# closed-form OLS estimator beta = (Z'Z)^(-1) Z'y
def regress(y, Z):
    return np.linalg.solve(Z.T.dot(Z), Z.T.dot(y))

beta = regress(y, Z)
e = y - Z.dot(beta)  # residuals
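
As a quick check (not part of the original snippet), the coefficients from the linear-algebra route should agree with the statsmodels fit from Step 6.1 up to floating-point error:

print(beta)           # ordered as the columns of Z: X, W, intercep
print(model1.params)  # same estimates from statsmodels (labels and ordering differ)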
