A list of questions related to "Clustered standard errors"
I more or less understand each one from a technical perspective, but I still can't tell when one is better than the other, what the main differences between the techniques are, or why both are needed beyond addressing violations of the constant-variance assumption.
Could someone clarify the concept with examples, or point me to a source that explains the difference between the two and their applications?
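For concreteness, a hedged R sketch, assuming the two techniques in question are heteroskedasticity-robust versus clustered standard errors (df, y, x, and id are hypothetical names): the robust version relaxes only the constant-variance assumption, while the clustered version additionally allows arbitrary correlation within each id group.
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC(), vcovCL()
m <- lm(y ~ x, data = df)
coeftest(m)                                     # classical SEs (i.i.d. errors)
coeftest(m, vcov = vcovHC(m))                   # robust to heteroskedasticity only
coeftest(m, vcov = vcovCL(m, cluster = ~ id))   # robust to within-cluster correlation too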
Hi,
I am running a random-effects regression with log wage as my dependent variable and year dummies, race dummies, and years of education as my control variables. I am clustering on the id variable. However, I cannot make anything of the results, as they are quite ambiguous: my expectation was that, in theory, the clustered standard errors should be smaller. I would really appreciate it if someone could explain the rule of thumb for which quantities to look at when making the comparison.
Results without clustered standard errors:
Random-effects GLS regression Number of obs = 4,360
Group variable: nr Number of groups = 545
R-sq: within = 0.1625, between = 0.1296, overall = 0.1448
Obs per group: min = 8, avg = 8.0, max = 8
Wald chi2(10) = 819.51
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
lwage Coef. Std. Err. z P>|z| [95% Conf. Interval]
d81 .1193902 .021487 5.56 0.000 .0772765 .1615039
d82 .1781901 .021487 8.29 0.000 .1360764 .2203038
d83 .2257865 .021487 10.51 0.000 .1836728 .2679001
d84 .2968181 .021487 13.81 0.000 .2547044 .3389318
d85 .3459333 .021487 16.10 0.000 .3038196 .388047
d86 .4062418 .021487 18.91 0.000 .3641281 .4483555
d87 .4730023 .021487 22.01 0.000 .4308886 .515116
educ .0770943 .009177 8.40 0.000 .0591076 .0950809
black -.1225637 .0496994 -2.47 0.014 -.2199728 -.0251546
hispan .024623 .0446744 0.55 0.582 -.0629371 .1121831
_cons .4966384 .1122718 4.42 0.000 .2765897 .7166871
sigma_u .34337144
sigma_e .35469771
rho .48377912 (fraction of variance due to u_i)
Results with clustered standard errors:
Random-effects GLS regression Number of obs = 4,360
Group variable: nr Number of groups = 545
R-sq: within = 0.1625, between = 0.1296, overall = 0.1448
Obs per group: min = 8, avg = 8.0, max = 8
Wald chi2(10) = 494.13
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. Err. adjusted for 545 clusters in nr)
lwage Coef. Robust Std. Err. z P>|z| [95% Conf. Interval]
d81 .1193902 ... [remaining rows truncated]
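(A minimal R sketch of the same comparison, assuming the data sit in a hypothetical data frame wagepan with columns nr, year, lwage, educ, black, and hispan; note that with positive within-person error correlation the clustered standard errors typically come out larger, not smaller.)
library(plm)     # random-effects GLS estimator
library(lmtest)  # coeftest()
re <- plm(lwage ~ factor(year) + educ + black + hispan,
          data = wagepan, index = c("nr", "year"), model = "random")
coeftest(re)                                        # conventional standard errors
coeftest(re, vcov = vcovHC(re, cluster = "group"))  # clustered on nr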
I ran OLS and got significant results, but a reviewer is asking for fixed effects. When I run fixed effects, my result disappears. Now the question is whether there is an endogeneity problem; I ran GMM, and the reviewer said it is not convincing on endogeneity. What should I do?
Hello, I have a question regarding clustered standard errors. For my research I need to use these. I have a dataset containing observations for different firms over different years. What commands should I use for these clustered standard errors?
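In Stata this is the vce(cluster firm) option on regress/xtreg. In R, a minimal sketch with the sandwich package (df, y, x, firm, and year are hypothetical names):
library(lmtest)    # coeftest()
library(sandwich)  # vcovCL()
m <- lm(y ~ x, data = df)
coeftest(m, vcov = vcovCL, cluster = ~ firm)          # clustered by firm
coeftest(m, vcov = vcovCL, cluster = ~ firm + year)   # two-way: firm and year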
Hi All,
I have a quick econometrics question. What is the best solution for clustering standard errors when you have few (N < 50 or even N < 25) clusters? Is it better to use a small-cluster adjustment to the covariance matrix (i.e., HC2 or HC3)? Or is it better to bootstrap the standard errors? If bootstrapping, does it matter whether it is pairwise/xy or "wild"?
This is for a staggered difference-in-differences, by the way (panel data with unit-level clusters), not clustered treatment (i.e., randomization at the village level), if that matters. It's for my thesis, not homework. My advisors did not have very useful advice on this question, so I'm asking here. I'm using "wild" bootstrapped SEs for my paper now, but it is taking an eternity to run the models and adjust the errors (bootstrapping is slow), and I'm wondering if there is a better way to do it. However, I'm not sure whether there are fundamental differences across these solutions when it comes to adjusting for few clusters (sorry, I need to brush up on my quant fundamentals).
Thanks in advance for any help!
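A hedged sketch of both routes in R, where m is an lm fit and df$state holds the small set of clusters (hypothetical names); clubSandwich provides the CR2 small-sample correction with Satterthwaite degrees of freedom, and sandwich::vcovBS provides a wild cluster bootstrap covariance (the Cameron-Gelbach-Miller test-statistic version lives in packages such as fwildclusterboot, which is usually much faster than hand-rolled bootstraps):
library(clubSandwich)  # CR2 small-sample cluster correction
library(sandwich)      # vcovBS() bootstrap covariances
library(lmtest)        # coeftest()
coef_test(m, vcov = "CR2", cluster = df$state)   # CR2 + Satterthwaite df, no resampling
coeftest(m, vcov = vcovBS(m, cluster = ~ state, type = "wild", R = 999))  # wild cluster bootstrap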
Hi,
I have a data set where people are buying a quantity of a good at a range of prices, under two different conditions. I've been instructed to compare conditions by running a paired t-test on the average demand, clustering standard errors at the subject level.
Unfortunately, I'm somewhat out of my depth (probably in terms of stats and R knowledge) and would greatly appreciate some help.
In the image below, Column A is the subject ID, Column B is the condition (high and low dose), Row 1 is price, C3:M26 is the demand.
https://preview.redd.it/9yw3cqdqq6631.png?width=1766&format=png&auto=webp&s=1514c1d085df39a0d8ae62201c919361d35061c5
Thanks
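A minimal R sketch of that instruction, assuming the data have been reshaped long into a hypothetical data frame dat with columns subject, condition ("high"/"low"), and demand; averaging within subject x condition and pairing on subject is what makes the comparison "clustered" at the subject level:
library(dplyr)
library(tidyr)
avg <- dat %>%
  group_by(subject, condition) %>%
  summarise(mean_demand = mean(demand), .groups = "drop")  # one average per subject x condition
wide <- pivot_wider(avg, names_from = condition, values_from = mean_demand)
t.test(wide$high, wide$low, paired = TRUE)  # paired on subject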
Hi all,
I have a model including a regressor generated by another model and I cluster the standard errors by firm and year. Do I still need to bootstrap my standard errors to overcome the generated regressor problem?
Many thanks!
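As I understand it, clustering handles the dependence but not the extra sampling variation from the estimated first step, so the usual fallback is a pairs (cluster) bootstrap that re-runs both stages in each draw. A rough sketch with hypothetical names (df with columns firm, y, x, z, w; z_hat is the generated regressor); note it resamples firms only, and preserving the year dimension as well is messier:
set.seed(1)
firms <- unique(df$firm)
boot_b <- replicate(499, {
  bd <- do.call(rbind, lapply(sample(firms, replace = TRUE),
                              function(f) df[df$firm == f, ]))
  bd$z_hat <- fitted(lm(z ~ w, data = bd))      # step 1: regenerate the regressor
  coef(lm(y ~ z_hat + x, data = bd))["z_hat"]   # step 2: main model
})
sd(boot_b)  # bootstrap SE reflecting both estimation steps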
Hi all,
I'm doing a regression on the effect of voter ID laws. I have been informed that I need to use clustered standard errors, but frankly I am way, way out of my depth on this one.
Is anyone familiar enough with the concept to help me through it? I'm using R.
Thanks a bunch.
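Since voter ID laws vary at the state level, state is the usual clustering variable. A minimal sketch with the estimatr package (df, turnout, strict_id, and the controls are hypothetical names):
library(estimatr)  # lm_robust() with built-in clustering
m <- lm_robust(turnout ~ strict_id + age + income,  # hypothetical controls
               data = df, clusters = state)
summary(m)  # cluster-robust (CR2 by default) standard errors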
I was asked to cluster my standard errors in SAS models. The person I am working with uses Stata and showed me the cluster command he uses at the end of his models. My SAS/Stata translation guide is not helpful here. All I am finding online is the SURVEYREG procedure, which produces robust standard errors (I am assuming robust and clustered are the same or similar, based on what I am reading). However, the SURVEYREG procedure does not work when I have models with dichotomous outcome variables.
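Robust and clustered are related but not identical: clustering allows correlated errors within each group, not just unequal variances. For the dichotomous case, SAS's PROC SURVEYLOGISTIC (the logistic counterpart of SURVEYREG) also accepts a CLUSTER statement. For reference, the same idea in R rather than SAS, with hypothetical names df, y, x, and id:
library(lmtest)    # coeftest()
library(sandwich)  # vcovCL()
logit <- glm(y ~ x, data = df, family = binomial)
coeftest(logit, vcov = vcovCL, cluster = ~ id)  # cluster-robust SEs for a logit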
I am trying to make a bar graph with clustered bars. I measured three variables repeatedly across 3 participants. I would like to graph the average value for each participant and each variable with error bars showing the standard deviation of each, clustered by the variable name.
My data is organized as follows:
| | AVG1 | SD1 | AVG2 | SD2 | AVG3 | SD3 |
|---|---|---|---|---|---|---|
| Person 1 | 10 | 2 | 15 | 2 | 56 | 7 |
| Person 2 | 25 | 3 | 45 | 4 | 76 | 10 |
| Person 3 | 30 | 1 | 35 | 5 | 23 | 3 |
So ideally I would want three clusters of bars. The first cluster would have 3 bars of heights 10, 25, and 30, with error bars ranging 8-12, 22-28, and 29-31 respectively. Then I would want the second and third variables included in the chart.
The way my data is organized, I get 6 clusters instead of 3 and I cannot adjust the error bars to reflect the differences between individuals/variables.
How can I organize my data to achieve my goal? I have all the raw data still in the same document.
I saw I am supposed to include my Excel version on here, but I do not know what it is. I am on a PC running Windows 10, though.
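The usual fix is reshaping to long format: one row per person-variable pair, with the average and SD as columns. As an illustration outside Excel, a hedged R/ggplot2 sketch using the numbers from the table above:
library(ggplot2)
long <- data.frame(
  person   = rep(c("Person 1", "Person 2", "Person 3"), each = 3),
  variable = rep(c("Var 1", "Var 2", "Var 3"), times = 3),
  avg      = c(10, 15, 56,  25, 45, 76,  30, 35, 23),
  sd       = c( 2,  2,  7,   3,  4, 10,   1,  5,  3)
)
ggplot(long, aes(x = variable, y = avg, fill = person)) +
  geom_col(position = position_dodge(width = 0.9)) +      # 3 clusters of 3 bars
  geom_errorbar(aes(ymin = avg - sd, ymax = avg + sd),    # e.g. 10 +/- 2 -> 8-12
                position = position_dodge(width = 0.9), width = 0.2)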
Hi All,
Relatively straightforward question I could not find an answer to online: are cluster-robust standard errors needed when analyzing panel data using a first-difference model? I know you should cluster SEs at the unit level when analyzing panel data with FEs, but doing so with first differences seems wrong, since the differenced errors are unlikely to be serially correlated at the unit level. I'm presenting relatively soon and don't want to get this wrong :/
I'm unlikely to use an RE model, but since I'm asking about FD models... might as well learn as much as possible from you stats geniuses!
Thanks!
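For what it's worth: if the level errors were i.i.d., first differencing induces an MA(1) pattern in the differenced errors, so clustering at the unit level is still the safe default; if the differenced errors really are uncorrelated, clustering costs little. A minimal plm sketch with hypothetical names (panel, id, year, y, x):
library(plm)     # model = "fd" for first differences
library(lmtest)  # coeftest()
fd <- plm(y ~ x, data = panel, index = c("id", "year"), model = "fd")
coeftest(fd)                                        # conventional SEs
coeftest(fd, vcov = vcovHC(fd, cluster = "group"))  # clustered at the unit level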
Hi everyone,
I'm having problems replicating some research, where observations are at the firm-year level and the author has industry & time fixed effects, but standard errors clustered at the firm level.
To simplify, my data looks something like this:
Firm Year Industry Y X1 X2 X3
1 2000 3 0.3 0 0 0.78
1 2001 3 0.4 0 0 0.70
2 2000 1 0.3 0 0 0.78
2 2001 1 0.3 1 0 0.78
3 2000 3 0.3 0 1 0.78
3 2001 3 0.3 0 1 0.78
I cannot add industry fixed effects by doing
FE <- plm(Y ~ X1 + X2 + X3, data = panel, index = c("Industry", "Year"), model = "within", effect = "twoways")
as multiple firms have the same industry in the same year. I saw some recommendations to add industry as a dummy via factor(), which I've tried to implement like this:
FE <- plm(Y ~ X1 + X2 + X3 + factor(Industry), data = panel, index = c("Year"), model = "within", effect = "individual")
Is this equivalent to industry and time fixed effects? If so, how would I go about clustering the standard errors at the firm level?
If my approach to the fixed effects is wrong, how could I do both the fixed effects and the clustering?
Thanks for any help! (crossposted this in /r/AskStatistics as both seem relevant)
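Since industry and year do not uniquely index the rows, one standard workaround is to skip plm and enter both fixed effects as dummies in lm(), then cluster on firm with sandwich; a sketch using the column names above:
library(lmtest)    # coeftest()
library(sandwich)  # vcovCL()
fe <- lm(Y ~ X1 + X2 + X3 + factor(Industry) + factor(Year), data = panel)
coeftest(fe, vcov = vcovCL, cluster = ~ Firm)  # industry & year FE, firm-clustered SEs
With many industries/years, something like fixest::feols(Y ~ X1 + X2 + X3 | Industry + Year, data = panel, cluster = ~ Firm) does the same absorption much faster.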
Practically, I can simulate it and see it on my own. However, I'm looking for an analytical explanation of this phenomenon.
Any reference book/PDF is also appreciated.
Secondly, does that increase the error on the test set, or does the error stay the same? Unfortunately I have come across both answers (the sources of which I can't recall right now). Any pointers are appreciated.
It seems to me that when thinking about endogeneity, or about what the sampling distribution of the estimator looks like (e.g., whether to cluster at the state level), these are all just arguments you have to justify verbally, by appeal to logic and know-how, and ultimately are not questions the data can settle for sure (since standard errors and the like involve the error term, which is unobservable). Is this correct?
Dear all,
I am running an OLS regression using stargazer and I would like to cluster standard errors two ways. I am not sure that I have specified the two-way clustering correctly. Could you confirm that the following is the correct way to do it?
stargazer(
  coeftest(model1, vcov. = vcovCL, cluster = ~ FundID1 + FundID2),
  coeftest(model2, vcov. = vcovCL, cluster = ~ FundID1 + FundID2),
  type = 'latex', font.size = 'tiny'
)
The code works. Standard errors are greater when I cluster as I did (cluster = ~ FundID1 + FundID2) than when clustering with only cluster = ~ FundID1 or only cluster = ~ FundID2. However, I could not find anything like this online. Did I do it correctly?
Many thanks,
I just want to understand the formula, not memorize it merely to solve questions.
Based on an actual, fun debate I had last week:
Recall standard deviation is a measure of dispersion. Is the standard error (SE) truly a standard deviation? (There is only one correct answer and three false choices)
(same LI poll here https://www.linkedin.com/posts/bionicturtle_frm-activity-6887464252769206272-pCMY)
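For context, the textbook definition: a standard error is the standard deviation of an estimator's sampling distribution. For the sample mean, with the unknown population [;\sigma;] replaced in practice by the sample [;s;],
`[;\operatorname{SE}(\bar{X}) = \sqrt{\operatorname{Var}(\bar{X})} = \sigma/\sqrt{n} \approx s/\sqrt{n};]`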
Seriously. One scanner is 6 hours, all four scanners is 1 hour. So you deploy them all and just sit...for an entire 60 minutes, doing other things than the game. Or dick around and poke around in some caves out of curiosity for exotics. That's it.
So I sat here for an entire hour of not playing the game. To me that's deeply flawed game design, requiring you to just sit there and NOT play it. I get that the game wanted me to repair the shelters the antennas sit in from severe weather, but in the end that just ended up not really mattering. Build the four shelters for the antennas, wait an entire hour.
I think it's time for me to step away from this game for a few months until changes are made. I just landed on a prospect and... got right back in the pod and took off again, because what's the point -- to unlock some more workshop items? Nah. I think I've seen enough and waited around / grinded enough. 130 hours in, I've got a feel for it all.
I hope the devs see this and watch their declining player counts carefully.
I see that for logistic regression the standard errors of the coefficients can be computed (as in "How to compute the standard errors of a logistic regression's coefficients"), which amounts to taking
`[;\sqrt{\operatorname{diag}\left((X^T V X)^{-1}\right)};]`, where `[;V;]` is a diagonal matrix whose diagonal entries are `[;\pi_a(1 - \pi_a);]`, the probability of being in class A times its complement.
___
Looking at the same for linear regression (based on my understanding of "Standard errors for multiple regression coefficients?"), we can compute the standard errors of the coefficients by
`[;\sqrt{\operatorname{diag}\left(\sigma^2 (X^T X)^{-1}\right)};]`,
where `[;\sigma^2;]` is the variance of the residuals (again, as per my understanding).
___
From the above I have 2 questions:
___
Normally I'd just use R or statsmodels, but I'm building a custom library for encrypted ML/stats, and I need to build all of this from scratch.
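A from-scratch R sketch of both formulas on simulated data, cross-checked against the built-ins (purely illustrative; none of this is the poster's actual library):
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n))                     # design matrix with intercept
# Linear regression: SE = sqrt(diag(sigma^2 * (X'X)^-1))
y  <- X %*% c(1, 2) + rnorm(n)
b  <- solve(t(X) %*% X, t(X) %*% y)
s2 <- sum((y - X %*% b)^2) / (n - ncol(X))  # residual variance estimate
se_lin <- sqrt(diag(s2 * solve(t(X) %*% X)))
# cross-check: summary(lm(y ~ X[, 2]))$coefficients[, "Std. Error"]
# Logistic regression: SE = sqrt(diag((X'VX)^-1)), V = diag(pi * (1 - pi))
z   <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))
fit <- glm(z ~ X[, 2], family = binomial)
V   <- diag(fitted(fit) * (1 - fitted(fit)))
se_log <- sqrt(diag(solve(t(X) %*% V %*% X)))
# cross-check: summary(fit)$coefficients[, "Std. Error"]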