A masterpiece of clarity.
Nice death star easter egg!
wonderful lecture bravo!
In Slide #40, regarding Estimation: I feel it should be a sum over i (Σ_i) rather than a sum over x (Σ_x).
Currently it's 1/n * Σ_x ( E[Y | T=1, X=x] - E[Y | T=0, X=x] )
I feel it should be 1/n * Σ_i ( E[Y | T=1, X_i] - E[Y | T=0, X_i] ), which can be rewritten as Σ_x P(X=x) * ( E[Y | T=1, X=x] - E[Y | T=0, X=x] )
You are absolutely right. Unfortunately, some typos might stay in the videos, even if they have been fixed in the book.
Reason:
Say there are four subgroups with the following conditional average treatment effects: 1, 0.5, 1.5, 2.5.
Say P(X=x) = [0.5, 0.2, 0.2, 0.1].
Say there are 100 subjects in total.
With the first equation, the ATE will be (1/100) * (1 + 0.5 + 1.5 + 2.5) = (1/100) * 5.5 = 0.055.
With the second equation, the ATE will be 0.5*1 + 0.2*0.5 + 0.2*1.5 + 0.1*2.5 = 1.15.
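For concreteness, here is a minimal Python check of the two formulas, using the hypothetical subgroup CATEs and P(X=x) values from the comment above:

```python
# Numerical check of the two formulas (numbers from the example above).
cates = [1, 0.5, 1.5, 2.5]   # CATE for each subgroup x
p_x = [0.5, 0.2, 0.2, 0.1]   # P(X = x) for each subgroup
n = 100                      # total number of subjects

# First equation: sum over the 4 subgroups, then divide by n.
ate_slide = sum(cates) / n
print(ate_slide)  # 0.055

# Second equation: weight each subgroup's CATE by P(X = x).
ate_weighted = sum(p * c for p, c in zip(p_x, cates))
print(ate_weighted)  # 1.15
```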
Brady, does the causal inference literature say anything about "knowing that confounding variables are present, but not being able to know or measure what they are"? This would hint to the domain expert that there's something else influencing the decision.
Also, in terms of the shoe example, since we know being drunk is contributing to the outcome, it wouldn't really be a confounder once we know about it, right?
On the final estimation example:
Question 1: By controlling for age, our estimated ATE matches the actual ATE; whereas by controlling for both age & 'protein excreted in urine', our estimated ATE is just 0.85.
Question 2: What's the causal graph with both age & protein excreted in the urine?
{ age → blood_pressure }, where age is the confounding variable.
Actual ATE: 1.05 & estimated ATE: 1.05 (both from the "mean of differences" and from the regression coefficient).
I'm not sure I see a question in there haha. It sounds like you are describing the code. Note: some of that code is for Chapter 4, where we actually write down the causal graph, so it might not all make sense without Chapters 3 and 4.
@@BradyNealCausalInference Cool, will wait for chapter 3 & 4 to be covered
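For readers coming back after Chapter 4: the drop to 0.85 in the thread above is the kind of thing that happens when you additionally condition on a variable that is itself caused by the treatment and/or the outcome. Here is a rough, self-contained sketch of such a setup — the data-generating process and all coefficients are invented for illustration, not the course's actual code (see Section 2.5 of the book for that):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: age confounds treatment and outcome;
# "protein" is downstream of both the treatment and the outcome,
# so conditioning on it biases the estimate.
age = rng.normal(50, 10, n)
t = (rng.random(n) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(float)
y = 1.05 * t + 0.2 * age + rng.normal(0, 1, n)   # true ATE = 1.05
protein = 0.5 * t + 0.3 * y + rng.normal(0, 1, n)

def coef_on_t(extra_covariates):
    """Coefficient on t from OLS of y on [1, t] + extra covariates."""
    X = np.column_stack([np.ones(n), t] + extra_covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print(coef_on_t([age]))           # ~1.05: adjusting for the confounder works
print(coef_on_t([age, protein]))  # biased: conditioned on a downstream variable
```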
In unconfoundedness, does conditioning on X mean that if we fill the "went to sleep with shoes" group with ALL DRUNK PEOPLE, and also fill the "went to sleep without shoes" group with DRUNK PEOPLE, that is a workaround for filling both groups with random people selected by a coin flip? The negative aspect of this is that some data will be lost, because we only care about a subset of the dataset (e.g. DRUNK=1, ignoring all data with DRUNK=0)?
Hi Brady, thanks for your awesome lecture. But I have a question about ignorability and exchangeability. In Causal Inference: What If, the joint independence of the potential outcomes produced by randomization is referred to as full exchangeability: randomization makes the potential outcomes jointly independent of the treatment T, which implies, but is not implied by, exchangeability. So why does randomization/ignorability mean joint independence rather than marginal independence?
@Brady: In Jason A. Roy's Coursera course, both the no-interference assumption & "only one way of getting treatment" are clubbed together under SUTVA;
whereas in your example of "golden retriever or other dog", which I guess violates the "only one way of getting treatment" assumption, you're putting it under the consistency assumption.
@@Theviswanath57 Not entirely sure I understand your comment, but are you saying this:
"SUTVA is satisfied if unit (individual) i's outcome is simply a function of unit i's treatment. Therefore, SUTVA is a combination of consistency and no interference (and also deterministic potential outcomes)."
If so, that sounds right to me. That's taken from Section 2.3.5 of the course book (not everything makes it into the lecture)
@@BradyNealCausalInference makes sense, thanks
Thanks for the great lecture again! I learnt a lot and I have a few questions:
1. The fundamental problem of causal inference refers to the fact that, for each individual, we only get to observe one potential outcome. The way to get around this is to make assumptions that convert a causal estimand into a statistical estimand. So far, the course seems to deal with average treatment effects. To estimate individual treatment effects, do we need more assumptions? Will we cover that in the course?
2. For the positivity assumption: if for some covariates P(T = 1 | X = x) is very close to 0 or 1, the estimation will be fine if we have access to the full distribution, but when estimating from finite samples it will lead to large variance. So to get a good estimate of the treatment effects, we would want P(T = 1 | X = x) to stay away from the extremes — is this correct? This also reminds me of the bias-variance tradeoff: including more covariates reduces confounding (bias), but may lead to estimates with high variance. Does this make sense?
3. This is more of a comment: I think the lecture mentions that including more covariates is better (correct me if I am wrong). I think it may be worthwhile to mention that this is not always the case, for example X -> C
1. Awesome question. Makes me think you already know the answer haha ;). To move from ATEs to ITEs, we do need to make stronger assumptions. The stronger assumptions we need to make have to do with the specific functional form and noise distribution (in addition to the causal graph). This corresponds to moving from Level 2 to Level 3 of Pearl's ladder. We will see this later in the course when we get to counterfactuals.
2. You are exactly right on both counts. When we get to estimation in week 5, we will actually see that people sometimes just drop specific examples where P(T = 1 | X = x) is too close to 0 or 1. Your bit about the bias-variance tradeoff is also right (usually).
3. Right again. I mention this in sidenote 8 of Chapter 2 in the book (www.bradyneal.com/Introduction_to_Causal_Inference-Sep1_2020-Neal.pdf). I think I meant to use weak language in the lecture (e.g. "there is a general perception that this is the case"). If I used strong language (e.g. "this is the case"), would you mind linking me to it, as I should probably correct that with an annotation.
4. I do everything with PowerPoint and TikZ (since I use TikZ for the book, might as well just reuse those figures in the slides). I sometimes use Inkscape when I need more flexibility than both of those can easily provide.
@@BradyNealCausalInference Thanks for the detailed explanation! For 3, it could be just my perceptual bias :) You did mention this is not the general case. But just for reference, at 34:32: "for unconfoundedness, the general idea (which is not always true) is that the more covariates you condition on, the more likely you are to have satisfied unconfoundedness." For 4, may I know how you integrate the LaTeX with PowerPoint?
How is the two groups (shoe sleepers and non-shoe sleepers) not being comparable considered a separate reason for association not being causation? Isn't it indirectly a confounder as well?
Hey Brady, thanks for the great course!! In slide 17: why does E[Y(1)|T=1] become E[Y|T=1]? And same for E[Y(0)|T=0] = E[Y|T=0]?
My understanding: because the condition is T=1, Y(T) = Y(1) = Y. That's my own way of explaining it. If T could be either 1 or 0, it couldn't be simplified like this.
It's after applying the consistency assumption because we are guaranteed that for T=t, we will get Y(t), so Y | T = t is sufficient.
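To make that step explicit, it is just the consistency assumption (T = t implies Y = Y(t)) applied inside the conditional expectation; a short sketch in LaTeX form:

```latex
% Consistency: T = t \implies Y = Y(t). Conditional on T = 1,
% the random variables Y(1) and Y therefore coincide:
\mathbb{E}[Y(1) \mid T = 1] = \mathbb{E}[Y \mid T = 1],
\qquad
\mathbb{E}[Y(0) \mid T = 0] = \mathbb{E}[Y \mid T = 0].
```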
Hello Brady. I have a silly doubt, what is the difference between Y(0) and Y | T= 0 ?
Y(0) corresponds to "take a random person in the whole population and force them to take treatment 0." Y | T = 0 corresponds to "take a random person from the subpopulation that happened to take treatment 0." Some of the comments in the threads on this video might also be helpful: ua-cam.com/video/eg-bFhNKbnY/v-deo.html
@@BradyNealCausalInference This is a very helpful formulation, that I recommend to be included in the course (unless it's already there and I missed it)
Hi Brady! Thank you so much for those lovely pedagogical videos! There is something I am struggling to wrap my head around though, and I was wondering if somebody (you or some other kind soul) could help me with it here. You presented ignorability as resulting from an assumption of independence between the potential outcomes Y(1), Y(0) and the treatment, leading to E[Y(1)|T=0] = E[Y(1)|T=1]. Doesn't this independence mean that the treatment basically has no causal effect on Y? Instead of removing the arrow from X to T, aren't we removing all arrows leading to T?
To explain my confusion in other words: if the expectation of the outcome Y(1) does not change whether we give T or not, doesn't it mean that T is not causal for Y? I am obviously having a logic flaw here somewhere, so I would be glad if someone could help me see it :)
I think I am confusing Y(1) with Y=1 here, while in fact it is Y|do(T=1). It takes some getting used to...
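For anyone with the same confusion: ignorability is an independence between the *potential outcomes* and the treatment assignment, not between the observed outcome and the treatment. A sketch of the distinction:

```latex
% Ignorability: (Y(1), Y(0)) \perp\!\!\!\perp T, i.e. the assignment
% carries no information about the potential outcomes:
\mathbb{E}[Y(1) \mid T = 0] = \mathbb{E}[Y(1) \mid T = 1] = \mathbb{E}[Y(1)].
% It does NOT say Y \perp\!\!\!\perp T: the observed outcome Y = Y(T)
% still depends on T (via consistency), so in general
\mathbb{E}[Y \mid T = 1] = \mathbb{E}[Y(1)] \neq \mathbb{E}[Y(0)] = \mathbb{E}[Y \mid T = 0].
```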
Hey, you say that the approach at the end, where you train a regression of the form y=at + bx only works because the treatment effect is the same for all individuals (ATE=CATE). I don't think this is correct. In fact, the paper which introduced the Double Machine Learning approach starts off by showing that for the case of y = at + g(x), standard approaches which predict y well will give biased estimators for a (although granted, the Double Machine Learning approach really starts to shine when y=f(x)t + g(x)). Do you have any intuition on why the linear regression approach works so well here? Is it because the outcome variable depends linearly on both the treatment and the feature? Will it always work well in such cases? My intuition says no, that confoundedness can still mess you up. Maybe it's just a quirk of this exact dataset?
Hello Brady, thank you for the awesome video :) I came over here to get an intuitive understanding of causality. I have a question about lecture slide 14. If the groups T=1 and T=0 are comparable, shouldn't it be drunk on the right if it is sober on the left? Based on my understanding, let's say I am the topmost guy in both groups (T=1, T=0). How can I be included in both the "go to sleep with shoes on" group and the "without shoes on" group under the same condition, "drunk"? Please correct me if I am wrong. Thanks!
The same person cannot be included in both groups; it's just that the number of people in both groups is almost the same, due to randomization.
I SAW THE DEATH STAR!
Hi Brady, on page 18, I understand your point, but I have a question about the definition of E[Y(1)|T=0]. If we observe T=0, then what is the meaning of Y(1) here?
Y(1) given that you observe T = 0 is the outcome you would have observed if you had taken T = 1. It isn't something that we can observe (usually)! I think I give the intuition for this on the potential outcomes intuition slide.
@@BradyNealCausalInference So the observation T=0 is independent of the potential outcome Y(1); then we can also get E[Y(1)] - E[Y(0)] = E[Y(1)|T=0] - E[Y(0)|T=1], right? But we cannot use the consistency law there; therefore, in ICI, Eq. (2.3), it's E[Y(1)] - E[Y(0)] = E[Y(1)|T=1] - E[Y(0)|T=0]. Is my understanding correct?
Hi Brady, thanks for the great lectures! I read The Book of Why by Judea Pearl. Is there any difference between the potential outcomes framework and the counterfactual calculation in Pearl's book? I saw some comments in the book where Judea thought the missing-value interpretation was wrong. What methodology do you recommend in practical applications? Or are they just the same?
I think the two languages share a lot more than a lot of people seem to think. To me, they are simply different notations and different ways to formulate the assumptions. You should be able to understand both, so I include them both in the first month of the course. I use both, depending on the setting or who I'm talking to.
Hi Brady, thanks for this lecture. It is super great. I have one question about the fourth assumption for identification, i.e. consistency. To illustrate the concept, you mentioned an example with two different types of dogs as multiple versions of the treatment. I am wondering: is it really a problem? I guess one can always define a specific version of the treatment as T, right? Thank you!
Yes, that just means being sufficiently specific about how you define the treatment.
Independently & identically distributed = ignorability/exchangeability.
Agree?
Great course!
Any particular books or review papers that you could recommend to read in more detail?
Have you found any?
In the consistency example, I got the point that we can't have multiple treatments (like different types of drugs as a treatment). But does it have to have the same outcome always? I mean, is it possible to have a case where I take a pill one day and I get better, but I take the pill another day and the headache does not get better?
A violation of consistency is like needing to add more nodes to the causal graph: for example, the dog type in the given example, along with whether the person got a dog. Similarly, if the pill's effect is different each day, a "day" node needs to be added to the causal graph.
31:13 that split of a second when you see the Death Star 😂
Was looking for this comment 😂
Slide #40: The naive estimate might have been estimated through the following regression equation: Y_i = alpha + beta * T_i.
Is alpha_hat 5.33?
Not quite. That simple regression and taking the coefficient from the regression is actually what I describe for slide *41*. And in your comment, *beta* hat is actually the ATE estimate (5.33), not alpha hat. In the notation I use in slide 41 (different from yours), it is alpha hat that is 5.33.
@@BradyNealCausalInference yeah that's right, I was a little confused; thanks
Where can I get the data?
@@Theviswanath57 See the GitHub link in Section 2.5 of the book for the data generation and estimation code.
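As a sanity check on the identity in this thread — with an intercept and a single binary regressor, the OLS coefficient on T equals the naive difference of group means — here is a small sketch on simulated data (the data are made up; this is not the course dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
t = rng.integers(0, 2, n).astype(float)
y = 2.0 + 5.33 * t + rng.normal(0, 1, n)  # made-up data, true effect 5.33

# Naive estimate: difference of group means.
naive = y[t == 1].mean() - y[t == 0].mean()

# OLS of y on [1, t]; take the coefficient on t.
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(naive, beta[1])  # identical up to floating-point error
```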
Thanks for the great lecture again. I have a few questions about the textbook.
On page 8: "A natural quantity that comes to mind is the associational difference: … Then, maybe E[Y(1)]-E[Y(0)] equals E[Y|T=1]-E[Y|T=0]."
From these sentences, I got a little confused about what "maybe ~ equals" means.
In addition, I have a question about the description of "Consistency" on page 14. I understand Y(t) intuitively, but I don't intuitively understand "whereas Y(T) is the potential outcome for the actual value of treatment that we observe." Do you have an example?
Basically, it's just like a train of thought that is common to go down. "maybe E[Y(1)]-E[Y(0)] equals E[Y|T=1]-E[Y|T=0]" is the more formal way of writing "maybe causation equals association (correlation equals causation)." Of course, this thinking is often incorrect :)
@@chadpark9248 For a given individual, they will observe a specific value, say t', for the random variable T. That means that they will observe the potential outcome Y(t'). So the realized value, t', of T gets connected to the observed outcome Y in that way (assuming consistency). Similarly, Y(T) corresponds to the potential outcome that we observe when we know the realized value of the treatment random variable T. It is distinct from Y(1), Y(0), or Y(t) which is meant to denote a specific potential outcome, that isn't related to the random variable T at all (even though, we use the same letter, but in lower case, for Y(t)).
@@BradyNealCausalInference Thank you for your detailed explanation.
Great lecture, but starting at 20:02 I got lost: how is E[Y(1) | T=0] not a contradiction? If you do(T=1), then doesn't that force T=1?
Yes, but T=0 is *conditioning* on T=0, not doing T=0. So conditioning on T=0 means "look at the people who happened to not take the treatment." Then, for those people, Y(1) means "what would have happened had they taken the treatment?"
@@BradyNealCausalInference Thanks so much for taking the time to respond! This clarification helped me be able to move forward.
@@scotth.hawley1560 Glad to hear it! Thanks for bearing with me on the slow response time haha.
Can we have a non-linear cause and effect relationship? In that case, how do we estimate the exact effect ?
Yes! You'd use the same estimator that is used in slide 40, but with a nonlinear model instead of linear regression. You can also use any of the other estimators that we discuss in week 6 of the course.
@@BradyNealCausalInference Thanks, will definitely check the week 6 course. I asked because, if there is non-linearity with respect to T, then Y_hat = alpha * T + alpha' * T^2 + alpha'' * T^3 + ... + beta * X. Then which coefficient would give us the causal effect of T on Y?
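One way to read the answer above: with a nonlinear outcome model there is no single coefficient to read off; instead you fit a model mu(t, x) ≈ E[Y | T=t, X=x] and average mu(1, X_i) - mu(0, X_i) over the sample (the slide-40 estimator with the conditional expectations replaced by model predictions). A sketch using scikit-learn, with an invented data-generating process:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
t = (rng.random(n) < 0.5).astype(float)            # randomized treatment
y = np.exp(0.5 * t) + x**2 + rng.normal(0, 0.1, n)  # nonlinear in t and x

# Fit an outcome model mu(t, x) approximating E[Y | T=t, X=x].
model = GradientBoostingRegressor().fit(np.column_stack([t, x]), y)

# Slide-40-style estimate: average mu(1, x_i) - mu(0, x_i) over the sample.
mu1 = model.predict(np.column_stack([np.ones(n), x]))
mu0 = model.predict(np.column_stack([np.zeros(n), x]))
print((mu1 - mu0).mean())  # should be roughly exp(0.5) - 1 ≈ 0.649
```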
@Brady: In Slide #41, I am wondering whether the estimate should be sigma_x ( P(X=x) * ( E[ Y/T=1, X=x] - E(Y/T=0, X=x) ) ).
In your variant, essentially we are saying that P(X=x) is the same for all x; please correct me if I am wrong.
@@Theviswanath57 In slide 40, it is that equation that you write, assuming that you meant "E[Y | T=1, X=x] - E[Y | T=0, X=x]" when you wrote "E[ Y/T=1, X=x] - P(Y/T=0, X=x)." However, in slide 41, we use a completely different way to estimate the ATE: linear regression, then taking the coefficient of the regression. In general, it is not equal to the correct equation from slide 40. It is only equal when E[Y | T=1, X=x] - E[Y | T=0, X=x] is the same for all x (i.e. the treatment effect is the same for all individuals). I don't actually include the specific equation for the estimate in slide 41, but you can get it using the closed-form solution to linear regression. You can see the exact code that I used for this in Section 2.5 of the course book.
@@BradyNealCausalInference regarding P(Y/T=0, X=x), yes I mean E(Y/T=0, X=x).
Understood on "It is only equal when E[Y | T=1, X=x] - E[Y | T=0, X=x] is the same for all x (i.e. the treatment effect is the same for all individuals)."
@Brady: if we have P(X=x) as part of the equation, is the ATE estimate unbiased even if E[Y | T=1, X=x] - E[Y | T=0, X=x] is not the same for all x?
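My understanding (a sketch, not something stated in the lecture): yes — with P(X=x) in the sum, the grouped estimator targets the ATE even when the CATE varies with x, whereas the single regression coefficient converges to a variance-weighted average of the CATEs, which can differ. An invented simulation illustrating the gap:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Made-up setup: the CATE differs across the two X groups (1 vs 3,
# so ATE = 2), and the treatment propensity also differs across groups.
x = rng.integers(0, 2, n).astype(float)   # P(X=1) = 0.5
e = np.where(x == 1, 0.9, 0.5)            # P(T=1 | X)
t = (rng.random(n) < e).astype(float)
tau = np.where(x == 1, 3.0, 1.0)          # CATE(x)
y = tau * t + 2.0 * x + rng.normal(0, 1, n)

# Slide-40 estimator: per-group difference of means, weighted by P(X=x).
ate_adj = sum(
    (x == v).mean()
    * (y[(t == 1) & (x == v)].mean() - y[(t == 0) & (x == v)].mean())
    for v in (0.0, 1.0)
)

# Coefficient on T from OLS of y on [1, t, x].
X = np.column_stack([np.ones(n), t, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(ate_adj)   # ~2.0, the true ATE
print(beta[1])   # ~1.53 here: a variance-weighted average of the CATEs
```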
Reporting a mistake: around 5:03, Brady said T=0 for taking the pill. It should be T=1.
Is there a textbook or course website?
Website: causalcourse.com
Book: www.bradyneal.com/causal-inference-course#course-textbook
Amazing video. One question. The example at the end of the lecture seems like a simple linear regression. Does it mean that when we run linear regression, we are doing causal inference? What is the difference between regression and causal inference, here?
Thanks for the lecture! I have a question around ua-cam.com/video/5x_pPemAVxs/v-deo.html: Is E[Y(1) - Y(0)] (here the individual subscript i is implicit) properly defined since some data are missing?
@mingmingchen7154 As you have pointed out, it is a biased estimate, and Brady explains this clearly afterwards.