A list of puns related to "Omitted Variable Bias"
Is there a link between the two? My first two research questions fail the Hosmer-Lemeshow goodness-of-fit test (p < .05). RQ2 also appears to have omitted variable bias, since its coefficient estimate comes out higher than in the fuller model. (I'm looking at all three predictor variables in RQ1, and only one of those same predictors in RQ2.) There is no omitted variable bias in RQs 3 and 4, and the Hosmer-Lemeshow test came back fine (p > .05) for those two RQs.
I understand that some consider the Hosmer-Lemeshow test to be obsolete, but I'm being required to use it.
Need help for an assignment. The problem says:
Suppose we have a model y = B0 + B1x1 + B2x2 + B3x3 + u. But, due to the lack of data, or by ignorance, we estimate the following equation:
y = B0 + B1x1
Given that the independent variables aren't correlated with one another, is B1 a biased estimator?
I know it is not, since all the independent variables are statistically independent from one another. However, I'd like to explain it a bit further. Appreciate your help in advance, dear colleagues.
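Not a substitute for the written argument, but here is a minimal simulation of exactly this point (all coefficients invented): when x2 and x3 are generated independently of x1, the slope on x1 from the short regression is centered on the true B1.

```python
# Hypothetical sketch: omitted regressors uncorrelated with x1 -> no bias in B1.
import numpy as np

rng = np.random.default_rng(0)
B0, B1, B2, B3 = 1.0, 2.0, -1.5, 0.7   # made-up "true" coefficients
n, reps = 500, 2000
slopes = []

for _ in range(reps):
    # x1, x2, x3 drawn independently, so they are uncorrelated by construction
    x1, x2, x3 = rng.normal(size=(3, n))
    u = rng.normal(size=n)
    y = B0 + B1 * x1 + B2 * x2 + B3 * x3 + u

    # Short regression: y on a constant and x1 only
    X = np.column_stack([np.ones(n), x1])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(beta_hat[1])

print(np.mean(slopes))  # ~2.0: the short-regression slope is centered on B1
```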
Isn't it true that OVB can cause a regression coefficient to appear statistically significant when it shouldn't, because the estimate is biased (and its precision misstated)? If we then take a model that had OVB and add the necessary explanatory variables in, can the significance level of that coefficient change? For example, consider the discussion on Wikipedia: if, after determining the significance of x without z, we were to add z back into the model, can the significance of x now be different?
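For what it's worth, a quick simulated illustration (every number below is invented): x has a small true effect, z is correlated with x and has a large effect, and the apparent significance of x changes once z is added.

```python
# Hypothetical sketch: the significance of x can change once z is added.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)             # x and z are correlated
y = 0.1 * x + 1.5 * z + rng.normal(size=n)   # x has only a small true effect

short = sm.OLS(y, sm.add_constant(x)).fit()                           # z omitted
long = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()      # z included

print(short.params[1], short.pvalues[1])  # inflated slope, often "significant"
print(long.params[1], long.pvalues[1])    # slope near 0.1, often not significant
```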
I am working on a project in Stata and the ovtest command for OVB is consistently returning F statistics in the 12-and-above range, leading me to reject the null of no omitted variables. Even my kitchen-sink model rejected the null. Am I interpreting the test result wrong, or is it something else, like functional-form misspecification, that's causing the numbers to get wonky? Also, if it helps, the model has two dummy interaction variables.
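Not a diagnosis of your model, but for intuition: Stata's ovtest is a Ramsey RESET test, which augments the regression with powers of the fitted values and F-tests them jointly, so it reacts to functional-form problems (curvature, mis-modelled interactions) just as readily as to classically omitted variables. A rough Python analogue on made-up data, so you can see what a rejection is reacting to:

```python
# Rough analogue of a RESET-type test (what Stata's ovtest does), on made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)  # true relationship is nonlinear

X = sm.add_constant(x)
restricted = sm.OLS(y, X).fit()                       # linear specification

# Augment with powers of the fitted values, then F-test their joint significance
fit = restricted.fittedvalues
X_aug = np.column_stack([X, fit**2, fit**3])
augmented = sm.OLS(y, X_aug).fit()

f_stat, p_value, df_diff = augmented.compare_f_test(restricted)
print(f_stat, p_value)  # large F / tiny p: the linear specification is rejected
```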
Hi r/askstatistics - I'm in the middle of the stats portion of a data science degree and we are covering regression, particularly omitted variable bias (OVB).
I've spent the past few days researching this topic; however, I cannot seem to find any examples that use systems with more than a single X variable for the base/"false" model. What if the baseline regression model were, say, Y = B_0 + B_1X_1 + B_2X_2 + e? I found an old post in this group from 8 years ago which gave a good overview (https://www.reddit.com/r/statistics/comments/p336c/omitted_variable_bias/), but I would like to see some worked examples if possible, if somebody could point me in the right direction.
This brings me onto a bigger question - aside from describing what OVB is and how the general direction of the OV might affect a particular coefficient - is it common practice to quantify exactly the OVB exerted by the omitted variable? Say I had a base model, then found an extra variable to add to it which altered the coefficients, reduced their SE's and decreased their p-values, while boosting r^2 and adjusted r^2...would highlighting this effect and the resultant directional change in the coefficient be sufficient to prove OVB, or is there a more meaty calculation/proof I could give?
Thanks in advance for your help!
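One way to build a worked example yourself, and to quantify the bias rather than just sign it: the single-regressor formula generalizes to bias = (coefficient on the omitted variable) × (coefficients from the auxiliary regression of the omitted variable on the included ones). A sketch with invented numbers; the identity holds exactly in-sample when the long-regression coefficient is used.

```python
# Hypothetical sketch: two included regressors, one omitted, bias quantified via
# the auxiliary regression of the omitted variable on the included ones.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = 0.4 * x1 - 0.3 * x2 + rng.normal(size=n)    # the omitted variable
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 1.5 * x3 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_inc = np.column_stack([np.ones(n), x1, x2])    # the base/"false" model's design
b_short = ols(X_inc, y)                          # biased estimates
b_long = ols(np.column_stack([X_inc, x3]), y)    # close to (1.0, 2.0, 1.0, 1.5)

gamma = ols(X_inc, x3)                           # auxiliary regression: x3 on (1, x1, x2)
implied_bias = b_long[3] * gamma                 # exact in-sample identity
print(b_short - b_long[:3])                      # gap between short and long estimates
print(implied_bias)                              # matches, up to floating-point error
```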
Haven't posted in a good long while, so let's get down to it.
Because I'm a filthy SJW, I sometimes lurk /r/socialjustice101. Today, I saw this post, which draws heavily from this podcast episode. The post basically makes claims that ethnic diversity reduces group performance, productivity, and has other undesirable features. We'll talk through some simple econometrics first, then I'll address why making the claim that ethnic diversity is causally-related to the aforementioned outcomes MUST be made with care, and that in the case of the podcast/blog and the post in general, it usually isn't. Indeed, many of the papers cited don't even make that claim (or even the claims attributed to them).
To clarify, I'm not making the claim that diversity has positive economic effects, merely that the negative effects mentioned in the OP are misstated, overstated, or in some cases are inherently unknowable in their current specification. Let's get started.
Spurious correlations are all around us. Ice cream sales are correlated with drownings. Obesity rates correlate with the housing bubble. We've all heard the maxim correlation doesn't imply causation, but it can be a hard thing to shake ourselves out of, especially when faced with messy causal inference we find intuitive. For instance, it may be intuitive for some to observe that increasing the aggregate number of differences in a population would make that population less productive. What goes unanswered here is the question: "why might ethnic differences make a country less productive?" No doubt you could make some broad appeal that ethnic homogeneity creates a sense of community germane to greater cooperation, and indeed that is the claim, but usually made in the reverse: that groups in diverse societies have low trust/sympathy for other groups dissimilar from themselves.
The subtle point here is that racial mistrust or animus is the channel through which productivity falls, not racial diversity. The thing to consider here is the unobserved counterfactual: that a country with high ethnic diversity, but no racism, has an unknown (though possibly nonnegative) relationship between productivity and diversity. This might seem like splitting hairs, but it's actually quite important in contending w
I will try to explain omitted variable bias because it was a concept that eluded me for a long time. I thought it was a problem with the OLS estimator. It turns out that OLS always gives us the population regression function (PRF), but sometimes the PRF is not what we're looking for.
Imagine that we're trying to explain the relationship between income (y) and schooling (s). We set up a model
y = α + βs + ε
where the stochastic error term ε captures the fact that there exists no deterministic relationship between schooling and income. The first thing to note is that the relationship above is a tautology. For every α and β I can construct an ε such that ε = y - (α + βs). In order for the coefficients to be uniquely determined we need to place restrictions on the error term.
We can say that cov(s,ε) = 0. Note that this is true by construction and is not an assumption. The error term owes its life to the coefficients (α, β) and we've decided to define their values such that cov(s,ε) = 0 is true. We can now consistently estimate their values by OLS and the PRF is α + βs.
These coefficients are not very interesting by themselves. If we make the assumption that E[ε|s] = 0 then the PRF is also the CEF. In other words, if the CEF happens to be (approximately) linear then we can interpret β as the expected change in income if schooling increases by one unit. This number is interesting, but it has no causal interpretation. People with more schooling also tend to have higher ability. The coefficient β captures both the direct causal effect of schooling and the indirect effect of ability. In order to find a coefficient with a causal interpretation we need to somehow isolate the effect of schooling. To make this more concrete, imagine that there exists a deterministic relationship in reality
y = α + δs + γa + x'β
where a is ability and x consists of all other variables that affect income. The parameters in this function have a causal interpretation. If schooling increases by one unit then income increases by δ. This is the coefficient we're looking for.
We do not observe all the variables that affect income, so we define a random variable which captures their aggregate contribution to earnings
u = γa + x'β
so that the relationship can be stated as
y = α + δs + u
Note that the error term now has an economic interpretation. If we say that cov(s,u)=0 then this is an assumption. Since we already said that schooling is correlated with ability and ability affects income, it's not a very credible assumption. Imagine now
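A minimal simulation of this schooling/ability story (every number below invented): the slope from regressing y on s alone recovers the PRF slope, δ + γ·cov(a,s)/var(s), not the causal δ.

```python
# Hypothetical sketch of the schooling/ability example: the slope from
# regressing y on s alone equals delta + gamma * cov(a, s) / var(s).
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
alpha, delta, gamma = 1.0, 0.5, 0.8      # invented "causal" parameters
a = rng.normal(size=n)                    # ability
s = 2.0 + 1.0 * a + rng.normal(size=n)    # schooling, correlated with ability
y = alpha + delta * s + gamma * a + rng.normal(size=n)

X = np.column_stack([np.ones(n), s])
beta = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(beta)                                                     # the PRF slope, not delta
print(delta + gamma * np.cov(a, s)[0, 1] / np.var(s, ddof=1))   # same number, by the OVB formula
```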
I've been seeing a lot of whining about the pump shotgun needing to be nerfed but I think people are forgetting a crucial aspect of why many feel the pump shotgun is overpowered.
It's pretty clear that it takes a certain amount of skill to use a pump shotgun effectively which is why most inexperienced players run the tactical shotgun to start out. This creates a weird dynamic where being a good player is associated with running a pump over a tactical shotgun.
What basically ends up happening is that people who chose to run tactical shotguns because it is easier to use end up losing more fights to players running pump shotguns. They then jump to the conclusion that the pump shotgun is overpowered without considering that it might not be the pump shotgun but the players who run pumps being better than those who run tac. This is very clear omitted variable bias because it assumes the pump shotgun, rather than the skill of the player, is what won the fight.
The pump allows you to take more of an advantage from building and positioning, which is why they are better in the hands of good players. There is no reason to nerf this so that worse players can beat better players more often in shotgun fights. Nerfing the pump to go toe to toe with the tac just lowers the skill ceiling overall, which is bad for the game.
TLDR: Pumps are associated with being better so pump players kill tac players more often, creating the misconception that the pump is massively overpowered.
I cross-posted this question to Cross Validated if you'd prefer answering it there.
Say I'm using multiple logistic regression to help caterers in a large city predict the probability invited adults will come to a wedding. Say I have a proprietary dataset of likely relevant predictor variables for each invited guest's traits, like age, gender, marital status, how far away they live from the event site, and whether the event is on a Saturday, and I have guests' attendance history to past weddings (1 = they attended; 0 = they didn't).
Say the results show all of the coefficients are significant (age: younger people are more likely to attend; gender: women are more likely to attend; marital status: single people are more likely to attend; how far away they live: the closer you live, the more likely you'll attend; Saturday: people are more likely to attend Saturday weddings).
Say I show those results to the caterers and teach them how to calculate the predicted probability of attendance given an invited guest's traits. However, the caterers unanimously proclaim they don't have access to all those data. They don't know the age, gender, or marital status of the guests, but they do know how far away they live and whether the wedding is on a Saturday.
The caterers ask me if I can re-run the model using only those variables they have information on, so that they can feasibly employ the predicted probability calculations with their own data. What are the implications of this? Is there a better strategy?
Usually, omitted-variable bias is a concern because leaving out relevant variables "results in the model attributing the effect of the missing variables to the estimated effects of the included variables". But is that a bad thing here?
For instance, I assume some of the omitted variables correlate with the included ones, like marital status and how far away they live, with married folk being more likely to live in the suburbs, farther away from the event locations in the downtown area of the city. Would the caterers' ability to control for how far away they live capture the effects of marital status too, then, so that all in all the reduced model will still be effective at predicting attendance?
So I feel like I've got the intuition of omitted variable bias down - excluding relevant variables leads to the over/underestimation of coefficients in your regression model as the estimated coefficients are "doing the work" of the excluded relevant variables.
However, I'm struggling to understand the interpretation of the bias term.
So suppose the true model is (in matrix form):
y = Xβ + Zδ + ε (1)
And the misspecified model which we estimate:
y = Xβ + ν (2)
The usual OLS estimate for β is (X'X)^-1 X'y
Plugging in the true model and taking expectations gives
E(β̂) = β + (X'X)^-1 X'Zδ
So here's my problem: I can see that bias is obviously present (the expectation of the estimated parameter isn't equal to the parameter). However, I can't interpret the bias term. Put differently, what does
(X'X)^-1 X'Zδ mean?
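One numerical way to see it (invented X, Z, and δ): (X'X)^-1 X'Z is exactly the matrix of coefficients you would get from regressing each column of Z on X, so the bias term is "how the omitted variables load on the included ones, scaled by their true coefficients δ".

```python
# Numerical check of E[beta_hat] = beta + (X'X)^-1 X'Z delta, with invented values.
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # included regressors
Z = 0.7 * X[:, [1]] + rng.normal(size=(n, 1))            # omitted, correlated with X
beta = np.array([1.0, 2.0])
delta = np.array([1.5])
y = X @ beta + Z @ delta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS that omits Z

P = np.linalg.solve(X.T @ X, X.T @ Z)                    # (X'X)^-1 X'Z: slopes of Z-on-X
print(beta_hat)                                          # approx beta + P @ delta
print(beta + (P @ delta))
```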
The most common example of the endogeneity loop that I refer to is the relationship between a city's number of police officers and its crime rate. Counterintuitively, cities with more police officers have more crime. This is because while the number of police has an effect on the crime rate, the crime rate also impacts the number of police.
Endogeneity occurs, in my understanding, when a regression's error term is correlated with an explanatory variable. When regressing the number of police on the crime rate, how would endogeneity occur? Is it the lack of lagged values of the crime rate (OVB)?
Hi, does anyone have a good source on omitted variable bias in regressions in more complex systems than 2 variables?
I am very familiar with the basic case y = b1x1 + b2x2: if we run y = b1x1, then the estimated b1 = b1 + b2*d_{x1x2}, where d_{x1x2} is the slope from regressing x2 on x1. I am looking for articles/textbooks that deal with specifications with more variables and the consequences of correlations in larger systems, when multiple variables are omitted and multiple included, etc. If anyone has any references/ideas that would be great! cheers!
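I don't have a specific reference at hand, but the mechanics generalize directly: with several included and several omitted regressors, the bias on the included coefficients is Γδ, where δ holds the omitted variables' coefficients and Γ stacks the coefficients from regressing each omitted variable on all the included ones (plus a constant). A sketch with invented numbers, with two included and two omitted variables whose contributions partly offset:

```python
# Made-up sketch: two included and two omitted regressors; the total bias on each
# included coefficient is the sum of the per-omitted-variable contributions.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1, x2 = rng.normal(size=(2, n))
z1 = 0.6 * x1 + rng.normal(size=n)               # omitted, pulls the x1 slope up
z2 = -0.5 * x1 + 0.3 * x2 + rng.normal(size=n)   # omitted, pulls it back down
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 0.8 * z1 + 0.9 * z2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_inc = np.column_stack([np.ones(n), x1, x2])
b_short = ols(X_inc, y)

# Auxiliary regressions of each omitted variable on the included set
G = np.column_stack([ols(X_inc, z1), ols(X_inc, z2)])   # Gamma, shape (3, 2)
bias = G @ np.array([0.8, 0.9])                         # Gamma @ delta
print(b_short[1:])                                      # biased slopes
print(np.array([2.0, 1.0]) + bias[1:])                  # matches, up to simulation noise
```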
Hi guys!
First post, hopefully this won't be too disappointing/basic a question.
I am estimating a regression, let's say of the form
Y=B0+B1(X1)+B2(X1*X2)+u
1st question: If B2 (the coefficient on X1*X2) is significant, does this imply a correlation between X1 and X2 in determining Y (i.e. should we also include X2 on its own in the above regression) ?
2nd question: We now have another regression of the form
Y=A0+A1(X1)+u where X1 is the same X1 as above
If B2 is insignificant, but the inclusion of X1*X2 makes the coefficient on X1 (B1) significant (i.e. A1 was insignificant), does this also imply a correlation between X1 and X2, or between X1 and X1*X2, in determining Y?
Sorry if this is confusing...
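Not an answer, but one way to poke at both questions is to simulate data where you control the truth (all coefficients and the X1-X2 correlation below are invented) and compare the three specifications; it also shows how correlated X1 and X1*X2 can be, which drives the kind of coefficient/significance shifts you describe.

```python
# Invented-data sketch for comparing the three specifications in the post.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # assume X1 and X2 are correlated
y = 1.0 + 0.5 * x1 + 0.7 * x2 + 0.4 * x1 * x2 + rng.normal(size=n)

m_a = sm.OLS(y, sm.add_constant(x1)).fit()                                  # Y ~ X1
m_b = sm.OLS(y, sm.add_constant(np.column_stack([x1, x1 * x2]))).fit()      # Y ~ X1 + X1*X2
m_c = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x1 * x2]))).fit()  # full model

print(np.corrcoef(x1, x1 * x2)[0, 1])   # how correlated the interaction is with X1
for m in (m_a, m_b, m_c):
    print(m.params, m.pvalues)
```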
Hello,
I am trying to show that the estimator for some variable X_1 (beta) is biased when I leave out a variable (X_3) from the true population model.
True population model: Y=X_1' \beta_1 + X_2' \beta_2 + X_3' \beta_3 +e
Model used: Y = X_1' \delta_1 + X_2' \delta_2 + u
My work (for notation let X1=X_1):
\delta_1= E[X1X1'|X2]^{-1} E[X1Y|X2]
replacing Y by the true population model:
\delta_1 = \beta_1 + X2' \beta_2 E[X1X1'|X2]^{-1} E[X1|X2] + \beta_3 E[X1X1'|X2]^{-1} E[X1X3'|X2] + 0
I am assuming that since I am controlling for X2 in the model, the term X2' \beta_2 * E[X1X1'|X2]^{-1} * E[X1|X2] should be equal to 0. However I am not sure if this is the correct assumption to make, since it would imply that E[X1|X2]=0 or beta_2=0.
I believe that delta_1 will be affected by the conditional covariance of X1 and X3, E[X1X3'|X2], and by \beta_3, which represents the conditional covariance of X3 and Y given X2.
How do I demonstrate that \delta_1 is biased?
Thank you!
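Not a proof, but a numerical sketch of this setup (all coefficients invented) may help check the algebra: after controlling for X2, the coefficient on X1 in the short model differs from \beta_1 by \beta_3 times the coefficient from the auxiliary regression of X3 on X1 and X2.

```python
# Invented-numbers sketch: delta_1 from the model that omits X3 differs from beta_1
# by beta_3 times the coefficient on X1 from regressing X3 on (1, X1, X2).
import numpy as np

rng = np.random.default_rng(9)
n = 300_000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
x3 = 0.6 * x1 + 0.3 * x2 + rng.normal(size=n)   # omitted, related to X1 even given X2
beta1, beta2, beta3 = 1.0, -0.5, 0.8
y = beta1 * x1 + beta2 * x2 + beta3 * x3 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_used = np.column_stack([np.ones(n), x1, x2])
delta = ols(X_used, y)                 # the model that omits X3
gamma = ols(X_used, x3)                # auxiliary regression of X3 on (1, X1, X2)

print(delta[1])                        # approx beta1 + beta3 * gamma[1], not beta1
print(beta1 + beta3 * gamma[1])
```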
Hello!
I'm using panel data with the following variables (y variables: abnormal return, cumulative abnormal return; x variables: GovScore, log_BM, log_Size, CAR, AR, EnvirScore, ESGScore, Size, Lev, SocialScore). These variables are all measured over a short event window of about a month, so the score variables in particular stay the same throughout; does that mean they are time-invariant?
My goal was to carry out regressions and do the Hausman test to see if I should use fixed effects or random effects, and then hopefully to move on to find the relation between returns and ESG scores.
But when I did the code:
This happened
I've read in other threads about using a 'hybrid model', but few questions seem to run into problems at this early stage of just running the xtreg command. Does it mean there is something fundamentally wrong with my dataset (my dataset is strongly balanced) and that it would be better to treat it as an OLS model? But that would be very weird since I am carrying out an event study.
https://preview.redd.it/am3s9lyeigy61.png?width=730&format=png&auto=webp&s=0ab2ef0035f7429746c052f985920bc5ccd468a7
And when I did normal regressions, this was the result; does that mean the result is insignificant whether or not it is treated as panel data?
https://preview.redd.it/xpkmstocpey61.png?width=721&format=png&auto=webp&s=ad73778de5b6531b21a8033305ea9ed9bdd84eaa
This relates to a previous question posted on the server: https://www.statalist.org/forums/forum/general-stata-discussion/general/1608634-princeton-s-event-study-did-not-generate-the-predicted-return
original post:
Thank you very much, Emma
Hi, I just want to ask what the title says.
If we write this:
my $omitted;
my @omitted;
What is the default value when the initializer is omitted?
I think it is undef, (), or both, but I can't determine which.
I have already done my own research, but I want to know whether it is really correct.
my test code is here.
Knowing that a change in the variable X (student/teacher ratio in a class) causes an inverse change in the variable Y (average SAT points in that class), and knowing that X is also negatively correlated with the omitted variable Z (% of native English speakers in that class), does the estimator betaCap overestimate or underestimate the real beta, i.e. the effect of X on Y holding Z constant?
Edit: betaCap counts the effect of Z as if it were part of the effect of X on Y; it doesn't "know" that it is distorted.
My teacher says it underestimates, but this answer doesn't make any sense to me.
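A sketch with invented numbers matching the signs described (beta < 0, Z raising Y, cov(X, Z) < 0): the short-regression slope equals beta + gamma·cov(X,Z)/var(X), and with gamma > 0 and cov(X,Z) < 0 that extra term is negative, so betaCap comes out more negative than beta; whether that counts as "over-" or "under-estimating" depends on whether you mean the signed number or the magnitude of the effect.

```python
# Invented numbers matching the signs described: beta < 0, gamma > 0, cov(X, Z) < 0.
import numpy as np

rng = np.random.default_rng(10)
n = 500_000
Z = rng.normal(size=n)                    # % native speakers (standardized)
X = -0.6 * Z + rng.normal(size=n)         # student/teacher ratio, cov(X, Z) < 0
beta, gamma = -2.0, 3.0
Y = beta * X + gamma * Z + rng.normal(size=n)

XX = np.column_stack([np.ones(n), X])
beta_cap = np.linalg.lstsq(XX, Y, rcond=None)[0][1]

bias = gamma * np.cov(X, Z)[0, 1] / np.var(X, ddof=1)
print(beta_cap, beta + bias)   # beta_cap is more negative than the true beta
# Whether that is "over-" or "under-estimating" depends on whether you mean
# the signed number (smaller) or the magnitude of the effect (larger).
```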
Hello friendly econometrics gurus,
I am looking for some help in solving an exercise.
I have the following task:
Prove that this expectation holds:
https://preview.redd.it/8l8s7g018mz51.png?width=342&format=png&auto=webp&s=338b0ae26118457aec4351ad9792e9d16d23ef25
I know the following:
https://preview.redd.it/x44gp7a58mz51.png?width=624&format=png&auto=webp&s=b7497d4037057e4cd07aad4843cae0a8741b71ea
https://preview.redd.it/efk349u68mz51.png?width=445&format=png&auto=webp&s=fe1718726fb8378db5c1db99ce35a0372ad4f4cb
https://preview.redd.it/zz0x6ot98mz51.png?width=303&format=png&auto=webp&s=a9c30f9c49d1fe0b1b589ca1f262b974c19cde5c
https://preview.redd.it/cnzoarfc8mz51.png?width=462&format=png&auto=webp&s=20278e60f3f8cd2c1764c6bd87e126965424ade6
I have attempted this myself, but I am not sure if I have made the right assumptions and followed the correct rules of conditional expectation to achieve my results, or if it's just wishful math :P
https://preview.redd.it/os0ndcip8mz51.jpg?width=2048&format=pjpg&auto=webp&s=6a0ed71566876f7daaf6fe5450689873b0dc0fb1
I appreciate any guidance and please ask if something is unclear.
The above link shows 2 runs on my end, with the setups being completely identical thanks to the power of 1 stone. According to Dokkan's information on Chain Battle, the 3-pick event should ONLY be influenced by their synergy (links, categories, type). However, I picked 3 Pikkons as well, who clearly have way less overall synergy, and somehow got a score a couple thousand above the Vegetas. Now, you could ascribe the 15k score difference to the minimal 0.02-second difference, but that would still not explain why the Pikkons had the same result to begin with, considering Vegeta is supposed to blow Pikkon out of the water based on the official information.
I would run T-tests, but I can't because there's just too much uncertainty about the multipliers of each variable, so for now I will just have to leave it at some unknown variable (most likely red glow) that impacts the 3 pick event despite Dokkan's official information.
Hi Reddit, Thank you for helping. Please let me know if this belongs in a different sub.
For work, my team has a hypothesis about a relationship between two variables, and a data set with thousands of outcomes. The outcomes are suggesting the opposite of our hypothesis, but I believe there's a chicken-and-the-egg situation, and I'm wondering if there's a better way to define the problem and expected result, and account for the bias I believe is occurring.
Imagine you have two variables in each record of data; we'll call them X and Y. X represents the number of times a user takes an action for a particular record object. Y represents the total time, start to finish, of that object's existence; let's call it a turn-around time.
Y represents a numerical time value, made up of events A, the beginning, and B, the ending, for each record. Meanwhile, X is a whole number, equalling the sum of independent actions {a, b, c, [...], n} that occur for this object within event Y.
When we think about the real-world scenario, we expect each independent action to increase the likelihood of event B occurring, which, for the business objective, is a good thing. The turnaround time Y should be minimized; so if each action has a chance of causing event B and therefore ending variable Y, we expect that more actions, and by extension the higher the value is of X, would correspond to a lower value of Y. Our hypothesis was that the values of X carry a negative correlation with the values of Y (increased # of actions = lower turnaround time).
To frame it with an example:
However, the data has suggested the opposite; when looking at outcomes, the data set shows that typically more actions correspond to a longer value for Y. When we dive into examples to understand why, it feels that the TAT value Y is almost dependent on X; outcomes that carry a longer turnaround time have more opportunities for actions to occur. Also, when comparing examples in the outcomes data, the time spent between actions taken was not perfectly consistent, but fairly similar; John and Brock would more or less both be completing 100
In an omitted-variable regression model with, say, 2 variables (one of which is omitted): if the omitted variable is uncorrelated with the included regressor, will the OLS estimate of the coefficient on the variable retained in the model be consistent?