A list of puns related to "Omitted Variable Bias"
Is there a link between the two? My first two research questions fail the Hosmer-Lemeshow goodness-of-fit test (p < .05). RQ2 also appears to have omitted variable bias, since its coefficient estimate comes out higher than in the fuller model. (I'm looking at all three predictor variables in RQ1, and only one of those same predictors in RQ2.) There is no omitted variable bias in RQs 3 and 4, and the Hosmer-Lemeshow test came back fine (p > .05) for those two RQs.
I understand that some consider the Hosmer-Lemeshow test to be obsolete, but I'm being required to use it.
Need help for an assignment. The problem says:
Suppose we have a model y = B0 + B1x1 + B2x2 + B3x3 + u. But, due to the lack of data, or by ignorance, we estimate the following equation:
y = B0 + B1x1
Given that the independent variables aren't correlated with one another, is B1 a biased estimator?
I know it is not, since all the independent variables are statistically independent from one another. However, I'd like to explain it a bit further. Appreciate your help in advance, dear colleagues.
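Not a substitute for the written argument, but here is a minimal simulation of exactly this point (all coefficients invented): when x2 and x3 are generated independently of x1, the slope on x1 from the short regression is centered on the true B1.

```python
# Hypothetical sketch: omitted regressors uncorrelated with x1 -> no bias in B1.
import numpy as np

rng = np.random.default_rng(0)
B0, B1, B2, B3 = 1.0, 2.0, -1.5, 0.7   # made-up "true" coefficients
n, reps = 500, 2000
slopes = []

for _ in range(reps):
    # x1, x2, x3 drawn independently, so they are uncorrelated by construction
    x1, x2, x3 = rng.normal(size=(3, n))
    u = rng.normal(size=n)
    y = B0 + B1 * x1 + B2 * x2 + B3 * x3 + u

    # Short regression: y on a constant and x1 only
    X = np.column_stack([np.ones(n), x1])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(beta_hat[1])

print(np.mean(slopes))  # ~2.0: the short-regression slope is centered on B1
```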
Isn't it true that OVB can cause a regression coefficient to appear statistically significant when it shouldn't, because the estimate is biased (and its precision misstated)? If we then take a model that had OVB and add the necessary explanatory variables in, can the significance level of that coefficient change? For example, consider the discussion on Wikipedia: if, after determining the significance of x without z, we were to add z back into the model, can the significance of x now be different?
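For what it's worth, a quick simulated illustration (every number below is invented): x has a small true effect, z is correlated with x and has a large effect, and the apparent significance of x changes once z is added.

```python
# Hypothetical sketch: the significance of x can change once z is added.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)             # x and z are correlated
y = 0.1 * x + 1.5 * z + rng.normal(size=n)   # x has only a small true effect

short = sm.OLS(y, sm.add_constant(x)).fit()                           # z omitted
long = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()      # z included

print(short.params[1], short.pvalues[1])  # inflated slope, often "significant"
print(long.params[1], long.pvalues[1])    # slope near 0.1, often not significant
```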
I am working on a project in Stata and the ovtest command for OVB is consistently returning F statistics in the 12-and-above range, leading me to reject the null of no omitted variables. Even my kitchen-sink model rejected the null. Am I interpreting the test result wrong, or is it something else, like functional-form misspecification, that's causing the numbers to get wonky? Also, if it helps, the model has two dummy interaction variables.
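Not a diagnosis of your model, but for intuition: Stata's ovtest is a Ramsey RESET test, which augments the regression with powers of the fitted values and F-tests them jointly, so it reacts to functional-form problems (curvature, mis-modelled interactions) just as readily as to classically omitted variables. A rough Python analogue on made-up data, so you can see what a rejection is reacting to:

```python
# Rough analogue of a RESET-type test (what Stata's ovtest does), on made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)  # true relationship is nonlinear

X = sm.add_constant(x)
restricted = sm.OLS(y, X).fit()                       # linear specification

# Augment with powers of the fitted values, then F-test their joint significance
fit = restricted.fittedvalues
X_aug = np.column_stack([X, fit**2, fit**3])
augmented = sm.OLS(y, X_aug).fit()

f_stat, p_value, df_diff = augmented.compare_f_test(restricted)
print(f_stat, p_value)  # large F / tiny p: the linear specification is rejected
```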
Hi r/askstatistics - I'm in the middle of the stats portion of a data science degree and we are covering regression, particularly omitted variable bias (OVB).
I've spent the past few days researching this topic; however, I cannot seem to find any examples that use systems with more than a single X variable for the base/"false" model. What if the baseline regression model were, say, Y = B_0 + B_1X_1 + B_2X_2 + e? I found an old post in this group from 8 years ago which gave a good overview (https://www.reddit.com/r/statistics/comments/p336c/omitted_variable_bias/), but I would like to see some worked examples if possible, if somebody could point me in the right direction.
This brings me onto a bigger question - aside from describing what OVB is and how the general direction of the OV might affect a particular coefficient - is it common practice to quantify exactly the OVB exerted by the omitted variable? Say I had a base model, then found an extra variable to add to it which altered the coefficients, reduced their SE's and decreased their p-values, while boosting r^2 and adjusted r^2...would highlighting this effect and the resultant directional change in the coefficient be sufficient to prove OVB, or is there a more meaty calculation/proof I could give?
Thanks in advance for your help!
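One way to build a worked example yourself, and to quantify the bias rather than just sign it: the single-regressor formula generalizes to bias = (coefficient on the omitted variable) × (coefficients from the auxiliary regression of the omitted variable on the included ones). A sketch with invented numbers; the identity holds exactly in-sample when the long-regression coefficient is used.

```python
# Hypothetical sketch: two included regressors, one omitted, bias quantified via
# the auxiliary regression of the omitted variable on the included ones.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = 0.4 * x1 - 0.3 * x2 + rng.normal(size=n)    # the omitted variable
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 1.5 * x3 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_inc = np.column_stack([np.ones(n), x1, x2])    # the base/"false" model's design
b_short = ols(X_inc, y)                          # biased estimates
b_long = ols(np.column_stack([X_inc, x3]), y)    # close to (1.0, 2.0, 1.0, 1.5)

gamma = ols(X_inc, x3)                           # auxiliary regression: x3 on (1, x1, x2)
implied_bias = b_long[3] * gamma                 # exact in-sample identity
print(b_short - b_long[:3])                      # gap between short and long estimates
print(implied_bias)                              # matches, up to floating-point error
```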
Haven't posted in a good long while, so let's get down to it.
Because I'm a filthy SJW, I sometimes lurk /r/socialjustice101. Today, I saw this post, which draws heavily from this podcast episode. The post basically makes claims that ethnic diversity reduces group performance, productivity, and has other undesirable features. We'll talk through some simple econometrics first, then I'll address why making the claim that ethnic diversity is causally-related to the aforementioned outcomes MUST be made with care, and that in the case of the podcast/blog and the post in general, it usually isn't. Indeed, many of the papers cited don't even make that claim (or even the claims attributed to them).
To clarify, I'm not making the claim that diversity has positive economic effects, merely that the negative effects mentioned in the OP are misstated, overstated, or in some cases are inherently unknowable in their current specification. Let's get started.
Spurious correlations are all around us. Ice cream sales are correlated with drownings. Obesity rates correlate with the housing bubble. We've all heard the maxim correlation doesn't imply causation, but it can be a hard thing to shake ourselves out of, especially when faced with messy causal inference we find intuitive. For instance, it may be intuitive for some to observe that increasing the aggregate number of differences in a population would make that population less productive. What goes unanswered here is the question: "why might ethnic differences make a country less productive?" No doubt you could make some broad appeal that ethnic homogeneity creates a sense of community germane to greater cooperation, and indeed that is the claim, but usually made in the reverse: that groups in diverse societies have low trust/sympathy for other groups dissimilar from themselves.
The subtle point here is that racial mistrust or animus is the channel through which productivity falls, not racial diversity. The thing to consider here is the unobserved counterfactual: that a country with high ethnic diversity, but no racism, has an unknown (though possibly nonnegative) relationship between productivity and diversity. This might seem like splitting hairs, but it's actually quite important in contending w
I will try to explain omitted variable bias because it was a concept that eluded me for a long time. I thought it was a problem with the OLS estimator. It turns out that OLS always gives us the population regression function (PRF), but sometimes the PRF is not what we're looking for.
Imagine that we're trying to explain the relationship between income (y) and schooling (s). We set up a model
y = α + βs + ε
where the stochastic error term ε captures the fact that there exists no deterministic relationship between schooling and income. The first thing to note is that the relationship above is a tautology. For every α and β I can construct an ε such that ε = y - (α + βs). In order for the coefficients to be uniquely determined we need to place restrictions on the error term.
We can say that cov(s,ε) = 0. Note that this is true by construction and is not an assumption. The error term owes its life to the coefficients (α, β) and we've decided to define their values such that cov(s,ε) = 0 is true. We can now consistently estimate their values by OLS and the PRF is α + βs.
These coefficients are not very interesting by themselves. If we make the assumption that E[ε|s] = 0 then the PRF is also the CEF. In other words, if the CEF happens to be (approximately) linear then we can interpret β as the expected change in income if schooling increases by one unit. This number is interesting, but it has no causal interpretation. People with more schooling also tend to have higher ability. The coefficient β captures both the direct causal effect of schooling and the indirect effect of ability. In order to find a coefficient with a causal interpretation we need to somehow isolate the effect of schooling. To make this more concrete, imagine that there exists a deterministic relationship in reality
y = α + δs + γa + x'β
where a is ability and x consists of all other variables that affect income. The parameters in this function have a causal interpretation. If schooling increases by one unit then income increases by δ. This is the coefficient we're looking for.
We do not observe all the variables that affect income, so we define a random variable which captures their aggregate contribution to earnings
u = γa + x'β
so that the relationship can be stated as
y = α + δs + u
Note that the error term now has an economic interpretation. If we say that cov(s,u)=0 then this is an assumption. Since we already said that schooling is correlated with ability and ability affects income, it's not a very credible assumption. Imagine now
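A minimal simulation of this schooling/ability story (every number below invented): the slope from regressing y on s alone recovers the PRF slope, δ + γ·cov(a,s)/var(s), not the causal δ.

```python
# Hypothetical sketch of the schooling/ability example: the slope from
# regressing y on s alone equals delta + gamma * cov(a, s) / var(s).
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
alpha, delta, gamma = 1.0, 0.5, 0.8      # invented "causal" parameters
a = rng.normal(size=n)                    # ability
s = 2.0 + 1.0 * a + rng.normal(size=n)    # schooling, correlated with ability
y = alpha + delta * s + gamma * a + rng.normal(size=n)

X = np.column_stack([np.ones(n), s])
beta = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(beta)                                                     # the PRF slope, not delta
print(delta + gamma * np.cov(a, s)[0, 1] / np.var(s, ddof=1))   # same number, by the OVB formula
```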
I've been seeing a lot of whining about the pump shotgun needing to be nerfed but I think people are forgetting a crucial aspect of why many feel the pump shotgun is overpowered.
It's pretty clear that it takes a certain amount of skill to use a pump shotgun effectively which is why most inexperienced players run the tactical shotgun to start out. This creates a weird dynamic where being a good player is associated with running a pump over a tactical shotgun.
What basically ends up happening is that people who chose to run tactical shotguns because it is easier to use end up losing more fights to players running pump shotguns. They then jump to the conclusion that the pump shotgun is overpowered without considering that it might not be the pump shotgun but the players who run pumps being better than those who run tac. This is very clear omitted variable bias because it assumes the pump shotgun, rather than the skill of the player, is what won the fight.
The pump allows you to take more of an advantage from building and positioning, which is why they are better in the hands of good players. There is no reason to nerf this so that worse players can beat better players more often in shotgun fights. Nerfing the pump to go toe to toe with the tac just lowers the skill ceiling overall, which is bad for the game.
TLDR: Pumps are associated with being better so pump players kill tac players more often, creating the misconception that the pump is massively overpowered.
I cross-posted this question to Cross Validated if you'd prefer answering it there.
Say I'm using multiple logistic regression to help caterers in a large city predict the probability invited adults will come to a wedding. Say I have a proprietary dataset of likely relevant predictor variables for each invited guest's traits, like age, gender, marital status, how far away they live from the event site, and whether the event is on a Saturday, and I have guests' attendance history to past weddings (1 = they attended; 0 = they didn't).
Say the results show all of the coefficients are significant (age: younger people are more likely to attend; gender: women are more likely to attend; marital status: single people are more likely to attend; how far away they live: the closer you live, the more likely you'll attend; Saturday: people are more likely to attend Saturday weddings).
Say I show those results to the caterers and teach them how to calculate the predicted probability of attendance given an invited guest's traits. However, the caterers unanimously proclaim they don't have access to all those data. They don't know the age, gender, or marital status of the guests, but they do know how far away they live and whether the wedding is on a Saturday.
The caterers ask me if I can re-run the model using only those variables they have information on, so that they can feasibly employ the predicted probability calculations with their own data. What are the implications of this? Is there a better strategy?
Usually, omitted-variable bias is a concern because leaving out relevant variables "results in the model attributing the effect of the missing variables to the estimated effects of the included variables". But is that a bad thing here?
For instance, I assume some of the omitted variables correlate with the included ones, like marital status and how far away they live, with married folk being more likely to live in the suburbs, farther away from the event locations in the downtown area of the city. Would the caterers' ability to control for how far away they live capture the effects of marital status too, then, so that all in all the reduced model will still be effective at predicting attendance?
So I feel like I've got the intuition of omitted variable bias down - excluding relevant variables leads to the over/underestimation of coefficients in your regression model as the estimated coefficients are "doing the work" of the excluded relevant variables.
However, I'm struggling to understand the interpretation of the bias term.
So suppose the true model is (in matrix form):
y = Xβ + Zδ + ε (1)
And the misspecified model which we estimate:
y = Xβ + ν (2)
The usual OLS estimate for β is (X'X)^-1 X'y
Plugging in the true model and taking expectations gives
E(β̂) = β + (X'X)^-1 X'Zδ
So here's my problem: I can see that bias is obviously present (the expectation of the estimated parameter isn't equal to the parameter). However, I can't interpret the bias term. Put differently, what does
(X'X)^-1 X'Zδ mean?
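One numerical way to see it (invented X, Z, and δ): (X'X)^-1 X'Z is exactly the matrix of coefficients you would get from regressing each column of Z on X, so the bias term is "how the omitted variables load on the included ones, scaled by their true coefficients δ".

```python
# Numerical check of E[beta_hat] = beta + (X'X)^-1 X'Z delta, with invented values.
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # included regressors
Z = 0.7 * X[:, [1]] + rng.normal(size=(n, 1))            # omitted, correlated with X
beta = np.array([1.0, 2.0])
delta = np.array([1.5])
y = X @ beta + Z @ delta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS that omits Z

P = np.linalg.solve(X.T @ X, X.T @ Z)                    # (X'X)^-1 X'Z: slopes of Z-on-X
print(beta_hat)                                          # approx beta + P @ delta
print(beta + (P @ delta))
```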
The most common example of the endogeneity loop that I refer to is the relationship between a city's number of police officers and its crime rate. Counterintuitively, cities with more police officers have more crime. This is because while the number of police has an effect on the crime rate, the crime rate also impacts the number of police.
Endogeneity occurs, in my understanding, when a regression's error term is correlated with an explanatory variable. When regressing the number of police on the crime rate, how would endogeneity occur? Is it the lack of lagged values of the crime rate (OVB)?
Hi, does anyone have a good source on omitted variable bias in regressions in more complex systems than 2 variables?
I am very familiar with the basic case y = b1x1 + b2x2: if we run y = b1x1, then the estimated b1 = b1 + b2*d_{x1x2}, where d_{x1x2} is the slope from regressing x2 on x1. I am looking for articles/textbooks that deal with specifications with more variables and the consequences of correlations in larger systems, when multiple variables are omitted and multiple included, etc. If anyone has any references/ideas that would be great! cheers!
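I don't have a specific reference at hand, but the mechanics generalize directly: with several included and several omitted regressors, the bias on the included coefficients is Γδ, where δ holds the omitted variables' coefficients and Γ stacks the coefficients from regressing each omitted variable on all the included ones (plus a constant). A sketch with invented numbers, with two included and two omitted variables whose contributions partly offset:

```python
# Made-up sketch: two included and two omitted regressors; the total bias on each
# included coefficient is the sum of the per-omitted-variable contributions.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1, x2 = rng.normal(size=(2, n))
z1 = 0.6 * x1 + rng.normal(size=n)               # omitted, pulls the x1 slope up
z2 = -0.5 * x1 + 0.3 * x2 + rng.normal(size=n)   # omitted, pulls it back down
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 0.8 * z1 + 0.9 * z2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_inc = np.column_stack([np.ones(n), x1, x2])
b_short = ols(X_inc, y)

# Auxiliary regressions of each omitted variable on the included set
G = np.column_stack([ols(X_inc, z1), ols(X_inc, z2)])   # Gamma, shape (3, 2)
bias = G @ np.array([0.8, 0.9])                         # Gamma @ delta
print(b_short[1:])                                      # biased slopes
print(np.array([2.0, 1.0]) + bias[1:])                  # matches, up to simulation noise
```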
Hi guys!
First post, hopefully this won't be too disappointing/basic a question.
I am estimating a regression, let's say of the form
Y=B0+B1(X1)+B2(X1*X2)+u
1st question: If B2 (the coefficient on X1*X2) is significant, does this imply a correlation between X1 and X2 in determining Y (i.e. should we also include X2 on its own in the above regression) ?
2nd question: We now have another regression of the form
Y=A0+A1(X1)+u where X1 is the same X1 as above
If B2 is insignificant, but the inclusion of X1*X2 makes the coefficient on X1 (B1) significant (i.e. A1 was insignificant), does this also imply a correlation between X1 and X2, or between X1 and X1*X2, in determining Y?
Sorry if this is confusing...
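Not an answer, but one way to poke at both questions is to simulate data where you control the truth (all coefficients and the X1-X2 correlation below are invented) and compare the three specifications; it also shows how correlated X1 and X1*X2 can be, which drives the kind of coefficient/significance shifts you describe.

```python
# Invented-data sketch for comparing the three specifications in the post.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # assume X1 and X2 are correlated
y = 1.0 + 0.5 * x1 + 0.7 * x2 + 0.4 * x1 * x2 + rng.normal(size=n)

m_a = sm.OLS(y, sm.add_constant(x1)).fit()                                  # Y ~ X1
m_b = sm.OLS(y, sm.add_constant(np.column_stack([x1, x1 * x2]))).fit()      # Y ~ X1 + X1*X2
m_c = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x1 * x2]))).fit()  # full model

print(np.corrcoef(x1, x1 * x2)[0, 1])   # how correlated the interaction is with X1
for m in (m_a, m_b, m_c):
    print(m.params, m.pvalues)
```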
Hello,
I am trying to show that the estimator for some variable X_1 (beta) is biased when I leave out a variable (X_3) from the true population model.
True population model: Y=X_1' \beta_1 + X_2' \beta_2 + X_3' \beta_3 +e
Model used: Y = X_1' \delta_1 + X_2' \delta_2 + u
My work (for notation let X1=X_1):
\delta_1= E[X1X1'|X2]^{-1} E[X1Y|X2]
replacing Y by the true population model:
\delta_1 = \beta_1 + X2' \beta_2 E[X1X1'|X2]^{-1} E[X1|X2] + \beta_3 E[X1X1'|X2]^{-1} E[X1X3'|X2] + 0
I am assuming that since I am controlling for X2 in the model, the term X2' \beta_2 * E[X1X1'|X2]^{-1} * E[X1|X2] should be equal to 0. However I am not sure if this is the correct assumption to make, since it would imply that E[X1|X2]=0 or beta_2=0.
I believe that delta_1 will be affected by the conditional covariance of X1 and X3, E[X1X3'|X2], and by \beta_3, which represents the conditional covariance of X3 and Y given X2.
How do I demonstrate that \delta_1 is biased?
Thank you!
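Not a proof, but a numerical sketch of this setup (all coefficients invented) may help check the algebra: after controlling for X2, the coefficient on X1 in the short model differs from \beta_1 by \beta_3 times the coefficient from the auxiliary regression of X3 on X1 and X2.

```python
# Invented-numbers sketch: delta_1 from the model that omits X3 differs from beta_1
# by beta_3 times the coefficient on X1 from regressing X3 on (1, X1, X2).
import numpy as np

rng = np.random.default_rng(9)
n = 300_000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
x3 = 0.6 * x1 + 0.3 * x2 + rng.normal(size=n)   # omitted, related to X1 even given X2
beta1, beta2, beta3 = 1.0, -0.5, 0.8
y = beta1 * x1 + beta2 * x2 + beta3 * x3 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_used = np.column_stack([np.ones(n), x1, x2])
delta = ols(X_used, y)                 # the model that omits X3
gamma = ols(X_used, x3)                # auxiliary regression of X3 on (1, X1, X2)

print(delta[1])                        # approx beta1 + beta3 * gamma[1], not beta1
print(beta1 + beta3 * gamma[1])
```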
Hello!
I'm using panel data with the following variables (y variables: abnormal return, cumulative abnormal return; x variables: GovScore, log_BM, log_Size, CAR, AR, EnvirScore, ESGScore, Size, Lev, SocialScore). These variables are all measured over a short event window of about a month, so the score variables in particular stay the same throughout; does that mean they are time-invariant?
My goal was to carry out regressions and do the Hausman test to see if I should use fixed effects or random effects, and then hopefully to move on to find the relation between returns and ESG scores.
But when I did the code:
This happened
I've read in other threads about using a 'hybrid model', but few questions seem to run into problems at this early stage of just running the xtreg command. Does it mean there is something fundamentally wrong with my dataset (my dataset is strongly balanced) and that it would be better to treat it as an OLS model? But that would be very weird since I am carrying out an event study.
https://preview.redd.it/am3s9lyeigy61.png?width=730&format=png&auto=webp&s=0ab2ef0035f7429746c052f985920bc5ccd468a7
And when I did normal regressions, this was the result; does that mean the result is insignificant whether or not it is treated as panel data?
https://preview.redd.it/xpkmstocpey61.png?width=721&format=png&auto=webp&s=ad73778de5b6531b21a8033305ea9ed9bdd84eaa
This relates to a previous question posted on the server: https://www.statalist.org/forums/forum/general-stata-discussion/general/1608634-princeton-s-event-study-did-not-generate-the-predicted-return
original post:
Thank you very much, Emma
Hi, I just want to ask what the title says.
If we write this:
my $omitted;
my @omitted;
What is the default value when the initializer is omitted?
I think it is undef, (), or both, but I can't determine which.
I have already done my own research, but I want to know whether it is really correct.
my test code is here.
Knowing that a change in the variable X (student/teacher ratio in a class) causes an inverse change in the variable Y (average SAT points in that class), and knowing that X is also negatively correlated with the omitted variable Z (% of native English speakers in that class), does the estimator betaCap overestimate or underestimate the real beta, i.e. the effect of X on Y holding Z constant?
Edit: betaCap counts the effect of Z as if it were part of the effect of X on Y; it doesn't "know" that it is distorted.
My teacher says it underestimates, but this answer doesn't make any sense to me.
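A sketch with invented numbers matching the signs described (beta < 0, Z raising Y, cov(X, Z) < 0): the short-regression slope equals beta + gamma·cov(X,Z)/var(X), and with gamma > 0 and cov(X,Z) < 0 that extra term is negative, so betaCap comes out more negative than beta; whether that counts as "over-" or "under-estimating" depends on whether you mean the signed number or the magnitude of the effect.

```python
# Invented numbers matching the signs described: beta < 0, gamma > 0, cov(X, Z) < 0.
import numpy as np

rng = np.random.default_rng(10)
n = 500_000
Z = rng.normal(size=n)                    # % native speakers (standardized)
X = -0.6 * Z + rng.normal(size=n)         # student/teacher ratio, cov(X, Z) < 0
beta, gamma = -2.0, 3.0
Y = beta * X + gamma * Z + rng.normal(size=n)

XX = np.column_stack([np.ones(n), X])
beta_cap = np.linalg.lstsq(XX, Y, rcond=None)[0][1]

bias = gamma * np.cov(X, Z)[0, 1] / np.var(X, ddof=1)
print(beta_cap, beta + bias)   # beta_cap is more negative than the true beta
# Whether that is "over-" or "under-estimating" depends on whether you mean
# the signed number (smaller) or the magnitude of the effect (larger).
```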
Hello friendly econometrics gurus,
I am looking for some help in solving an exercise.
I have the following task:
Prove that this expectation holds:
https://preview.redd.it/8l8s7g018mz51.png?width=342&format=png&auto=webp&s=338b0ae26118457aec4351ad9792e9d16d23ef25
I know the following:
https://preview.redd.it/x44gp7a58mz51.png?width=624&format=png&auto=webp&s=b7497d4037057e4cd07aad4843cae0a8741b71ea
https://preview.redd.it/efk349u68mz51.png?width=445&format=png&auto=webp&s=fe1718726fb8378db5c1db99ce35a0372ad4f4cb
https://preview.redd.it/zz0x6ot98mz51.png?width=303&format=png&auto=webp&s=a9c30f9c49d1fe0b1b589ca1f262b974c19cde5c
https://preview.redd.it/cnzoarfc8mz51.png?width=462&format=png&auto=webp&s=20278e60f3f8cd2c1764c6bd87e126965424ade6
I have attempted this myself, but I am not sure if I have made the right assumptions and followed the correct rules of conditional expectation to achieve my results, or if it's just wishful math :P
https://preview.redd.it/os0ndcip8mz51.jpg?width=2048&format=pjpg&auto=webp&s=6a0ed71566876f7daaf6fe5450689873b0dc0fb1
I appreciate any guidance and please ask if something is unclear.
The above link shows 2 runs on my end, with the setups being completely identical thanks to the power of 1 stone. According to Dokkan's information on Chain Battle, the 3-pick event should ONLY be influenced by their synergy (links, categories, type). However, I picked 3 Pikkons as well, who clearly have way less overall synergy, and somehow got a score a couple thousand above the Vegetas. Now, you could ascribe the 15k score difference to the minimal 0.02-second difference, but that would still not explain why the Pikkons had the same result to begin with, considering Vegeta is supposed to blow Pikkon out of the water based on the official information.
I would run T-tests, but I can't because there's just too much uncertainty about the multipliers of each variable, so for now I will just have to leave it at some unknown variable (most likely red glow) that impacts the 3 pick event despite Dokkan's official information.
Hi Reddit, Thank you for helping. Please let me know if this belongs in a different sub.
For work, my team has a hypothesis about a relationship between two variables, and a data set with thousands of outcomes. The outcomes are suggesting the opposite of our hypothesis, but I believe there's a chicken-and-the-egg situation, and I'm wondering if there's a better way to define the problem and expected result, and account for the bias I believe is occurring.
Imagine you have two variables in each record of data; we'll call them X and Y. X represents the number of times a user takes an action for a particular record object. Y represents the total time, start to finish, of that object's existence; let's call it a turn-around time.
Y represents a numerical time value, made up of events A, the beginning, and B, the ending, for each record. Meanwhile, X is a whole number, equalling the sum of independent actions {a, b, c, [...], n} that occur for this object within event Y.
When we think about the real-world scenario, we expect each independent action to increase the likelihood of event B occurring, which, for the business objective, is a good thing. The turnaround time Y should be minimized; so if each action has a chance of causing event B and therefore ending variable Y, we expect that more actions, and by extension the higher the value is of X, would correspond to a lower value of Y. Our hypothesis was that the values of X carry a negative correlation with the values of Y (increased # of actions = lower turnaround time).
To frame it with an example:
However, the data has suggested the opposite; when looking at outcomes, the data set shows that typically more actions correspond to a longer value for Y. When we dive into examples to understand why, it feels that the TAT value Y is almost dependent on X; outcomes that carry a longer turnaround time have more opportunities for actions to occur. Also, when comparing examples in the outcomes data, the time spent between actions taken was not perfectly consistent, but fairly similar; John and Brock would more or less both be completing 100
In an omitted-variable regression model with, say, 2 variables (one of which is omitted): if the omitted variable is uncorrelated with the included regressor, will the OLS estimate of the coefficient on the variable retained in the model be consistent?