Dummy variables - Multiple regression

Hello folks,

It seems that for dummy variables, if we have n categories, we must use n − 1 dummy variables.

I can't understand the reason the CFA curriculum gives -- why would using all n dummy variables violate assumption 2 (i.e., that no exact linear relationship exists between two or more of the independent variables)? If you can explain it in a "dummy" way, it would be appreciated!

"The reason for the need to use n βˆ’ 1 dummy variables is that we must not violate assumption 2 that no exact linear relationship must exist between two or more of the independent variables. If we were to make the mistake of including dummy variables for all n categories rather than n βˆ’ 1, the regression procedure would fail due to the complete violation of assumption 2."

πŸ‘︎ 5
πŸ’¬︎
πŸ“…︎ Jan 28 2022
🚨︎ report
Getting value 1 for 2 dummy variables in the same year

Dear Stata users,

I have a panel data set of 300 municipalities covering 2000 to 2020. I have created three dummy variables (high_dev, low_dev and med_dev) using Built_area (percentage), Agri_area (percentage) and NaturalForest_area (percentage).

The code that I am running in Stata is:

gen high_dev = 1 if (Built_area > 32.71 & Agri_area < 33.03 & NaturalForest_area < 11.11) & !missing(Built_area, Agri_area, NaturalForest_area)

//dummy medium share development

gen med_dev = 1 if !inrange(Built_area, 10.13, 32.71) & !inrange(Agri_area, 33.03, 56.32) & !inrange(NaturalForest_area, 11.11, 15.18) & !missing(Built_area, Agri_area, NaturalForest_area)

//dummy low share development

gen low_dev = 1 if (Built_area < 10.13 & Agri_area > 56.32 & NaturalForest_area > 15.18) & !missing(Built_area, Agri_area, NaturalForest_area)

However, for a specific municipality I am getting the value 1 for both med_dev and low_dev, which is not what I want: each municipality should be exactly one of high_dev, med_dev or low_dev.

If someone could help me out I would appreciate it!

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/nhlomid
πŸ“…︎ Jan 22 2022
🚨︎ report
[Q] stupid question about dummy variable

Hi everyone, I have collected four years of daily S&P 500 returns.

I would like to do a regression analysis with time as the independent variable and stock returns as the dependent variable.

I would like to compare the risk/return ratio pre-covid and post-covid.

So should I add a dummy variable, 1 = covid, 0 = pre-covid?

Or do I just make two separate linear regression models and split the data, since the periods are two years each?

If I add a dummy variable, will I get two sets of results? Sorry if it's a dumb question; I don't think I've used dummy variables before, or if I did it was a really long time ago. Thank you for the help!
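
For what it's worth, with a dummy you fit one model and get one set of results; the dummy's coefficient is the estimated pre/post difference, which two separate regressions never test directly. A sketch of the usual setups (hypothetical column names: `ret` for the return, `day` for the time index, `covid` for the 0/1 dummy):

```
# covid = 0 for pre-covid days, 1 for covid-period days
fit_shift <- lm(ret ~ covid, data = daily)        # covid coefficient = difference in mean return
fit_trend <- lm(ret ~ day * covid, data = daily)  # interaction also lets the time trend differ
summary(fit_trend)
```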

πŸ‘︎ 6
πŸ’¬︎
πŸ‘€︎ u/internetxoutlaw
πŸ“…︎ Jan 08 2022
🚨︎ report
What's a base category? I've got a project and it's asking me to create dummy variables, but I have to use one of the categories as a 'base category'. Please help, thank you.
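
For reference: the base (or reference) category is the one you do not create a dummy for. Its mean is carried by the intercept, and every dummy coefficient is measured relative to it. A small R sketch with made-up data:

```
season <- factor(c("winter", "spring", "summer", "winter", "spring"))

# model.matrix() shows the coding a regression uses: the first level
# ("spring", alphabetically) gets no column -- it is the base category,
# absorbed into the intercept.
model.matrix(~ season)
```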
πŸ‘︎ 6
πŸ’¬︎
πŸ“…︎ Jan 07 2022
🚨︎ report
Help with dummy variables

Hi everyone.

I am new to Stata and don't know how to use it yet. I am working with data from the GSS survey. I want to code the variable ethnicity into several dummy variables. Because the list is so long, I want to group countries into, say, Europe, East Asia, South America, etc. There are 45 countries listed in my descriptive statistics, but no numeric codes. So the problem is: how do I know how the countries are numbered, and how do I find that out? Is there a way in Stata to look up variables in detail?

For example, I would run this code if I knew that France is sixteenth on the list and Italy is twenty-fifth:

gen europe = (ethnic==16 | ethnic==25)

But right now I don't know what number they are on the list, so how would I find that out?

Help would be very appreciated.

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/DreadfulViewer
πŸ“…︎ Dec 08 2021
🚨︎ report
Demand Planning Help - POS Forecasting, Model Selection, Dummy Variables, Shipment Forecasting

Hi All,

Hope everyone's doing well.

I entered Demand Planning about two months ago (switching from Production Planning). One of my first projects is to put together the tools and processes for generating a shipment plan as part of the monthly S&OP cycle. I thought I would share my approach to see what you all think and ask a few questions along the way.

So my organization sits directly upstream of the retailers. There are about 660 retail stores that carry our products. As for products, we have ~300 SKUs within 6-7 categories. Approximately half of the SKUs have stable demand with some seasonality; the other half contains items with sporadic demand, as well as items that are available only once per year. This post is mostly about the items with regular demand.

Being a small company, we don't have access to any large-scale forecasting/S&OP software. So I am using a combination of R, Power BI, and Excel to support the process. Here are the steps in my workflow:

Raw data - POS data is available for each week and SKU. Store-level POS data is not available from the retailer. There are 3+ years of data. My company stores this data in a SQL database, and we can easily pull it using available tools.

Power BI - Being new to the role, my first instinct was to pull all the POS data, the Item Master, and the Fiscal Calendar into PBI and create some DAX measures - Total Sales, Sales LY, YTD Sales, YTD Sales LY, % Changes, etc. Then I used these measures to create some visuals, and used filters to cycle through the items and get a feel for their demand patterns.

R - The next step was to fit time series models to the data. Here I mainly used the tidyverse, tsibble, and fable packages. I fitted the following models for each SKU in the dataset (see the sketch after this list):

  1. ETS - the fable package automatically selects the appropriate components and smoothing parameters
  2. ARIMA - the fable package automatically chooses the pdq-PDQ components and their respective coefficients
  3. TSLM - I fitted a multiple regression model with trend and months as dummy variables
  4. Initially, I wanted to use the RMSE scores to pick the best model for each SKU, but first I wanted to get a feel for the outputs of each model by visualizing their forecasts. So once the models were fit, I had R create forecasts for each SKU in monthly buckets from Nov 2021 until the end of 2022, which left three candidate forecasts for each SKU.
  5. I then wrote these forecasts …
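
For anyone following along, a condensed sketch of that R step: fable's model() can fit all three specifications in one call. Table and column names here (`sales_monthly`, `units`) are placeholders, not the poster's actual objects:

```
library(fable)
library(tsibble)
library(dplyr)

fits <- sales_monthly |>                       # a tsibble keyed by SKU
  model(
    ets   = ETS(units),                        # components chosen automatically
    arima = ARIMA(units),                      # pdq/PDQ chosen automatically
    tslm  = TSLM(units ~ trend() + season())   # trend + month dummies
  )

fits |> forecast(h = "14 months")              # Nov 2021 through Dec 2022
accuracy(fits)                                 # RMSE etc. per SKU and model
```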

πŸ‘︎ 9
πŸ’¬︎
πŸ‘€︎ u/Effulgere
πŸ“…︎ Dec 07 2021
🚨︎ report
Vizzy For Dummies Pt. 3 Variables youtu.be/7gA0NmG8KTI
πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/deltlead
πŸ“…︎ Dec 24 2021
🚨︎ report
Can someone in Stats 2B03 please explain dummy variables, because I'm a dummy

I'm having trouble understanding them.

πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/Forward_Duck_9524
πŸ“…︎ Dec 12 2021
🚨︎ report
help with adding dummy variables

I have a dataset of elementary grades, and I created dummy variables equal to 1 if the student was in advanced classes and 0 if they were in normal classes. I want a per-student total of these dummies: for example, a student who was in advanced classes in second and fourth grade and regular everywhere else would sum to 2. Is there a function I could use to get started? I have tried creating a variable equal to the dummies, and mutating a variable equal to the dummies added together, but neither works. I am very lost; any help is great.
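
A sketch of one common way to do this in R with dplyr, assuming a long data frame with one row per student-grade and a 0/1 `advanced` dummy (all names here are guesses):

```
library(dplyr)

# long format: one row per student-grade
grades |>
  group_by(student_id) |>
  summarise(n_advanced = sum(advanced, na.rm = TRUE))

# or, if the dummies sit in separate columns (adv_grade1, adv_grade2, ...):
grades |>
  mutate(n_advanced = rowSums(across(starts_with("adv_"))))
```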

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/LeeHyeri94
πŸ“…︎ Dec 08 2021
🚨︎ report
Create dummy if one variable matches another by group

I have a large data frame that looks like this.

I am pasting the dput output of my df below:

```

Polish_panel_hist_06 <- structure(list(
  id = c(32, 32, 32, 32, 32, 32, 32, 12668031110, 12668031110, 12668031110),
  survey_date = structure(c(17167, 17167, 17167, 17167, 17167, 17167, 17167, 15034, 15034, 15034), class = "Date"),
  survey_year = c(2017, 2017, 2017, 2017, 2017, 2017, 2017, 2011, 2011, 2011),
  mom_dob = c(1991, 1991, 1991, 1991, 1991, 1991, 1991, 1987, 1987, 1987),
  date = structure(c(10592, 10957, 11323, 11688, 12053, 12418, 12784, 14304, 14669, 15034), class = "Date"),
  date_year = c(1999, 2000, 2001, 2002, 2003, 2004, 2005, 2009, 2010, 2011),
  mom_age = c(7, 8, 9, 10, 11, 12, 13, 21, 22, 23),
  newborn = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  stock = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  family500_year = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
  nchild1 = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, NA, NA, NA),
  nchild2 = c(NA, NA, NA, NA, NA, NA, NA, 2010, 2010, 2010),
  nchild3 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  nchild4 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  nchild5 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  nchild6 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  nchild7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  nchild8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  nchild9 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  nchild10 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  educcat = c(2, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  educcat_college = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  hh_income_net = c(3410, 3410, 3410, 3410, 3410, 3410, 3410, 7978.7001953125, 7978.7001953125, 7978.7001953125),
  hh_income_annual_usd = c(10912, 10912, 10912, 10912, 10912, 10912, 10912, 25531.840625, 25531.840625, 25531.840625),
  hh_income_annual_log = c(9.29761838008324, 9.29761838008324, 9.29761838008324, 9.29761838008324, 9.29761838008324, 9.29761838008324, 9.29761838008324, 10.1476816041898, 10.1476816041898, 10.1476816041898),
  marital_stat = c(10, 10, 10, 10, 10, 10, 10, 20, 20, 20),
  maritalcat = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  rural = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
  age_sq = c(625, 625, 625, 625, 625, 625, 625, 529, 529, 529),
  emp_stat = c(3, 3, 3, 3, 3, 3, 3, 6, 6, 6),
  occupation = c("98", "98", "98", "98", "98", "98", "98", "48", "48", "48"),
  disability_stat = c(2, 2, 2, 2, 2,
…
```
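
The post is cut off above, but going by the title, one pattern for "dummy = 1 if one variable matches another within a group" in dplyr looks like this. The column choices below (flagging rows where `date_year` equals a child's birth year) are only a guess from the dput:

```
library(dplyr)

Polish_panel_hist_06 |>
  group_by(id) |>
  mutate(birth_year_dummy = as.integer(
    date_year %in% na.omit(c(nchild1, nchild2))   # compare within each id
  )) |>
  ungroup()
```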

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/amb1274
πŸ“…︎ Nov 22 2021
🚨︎ report
Are fixed effects the same as including dummy variables in a regression?

Say I have panel data and I include dummy variables for 'region' in my regression. Can I interpret the results as 'region' fixed effects?
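
The usual short answer: for one-way fixed effects, yes. The least-squares dummy variable (LSDV) estimator and the within (demeaning) estimator give identical slope coefficients. A quick numerical check in base R with simulated data:

```
set.seed(42)
region <- rep(letters[1:5], each = 40)
x <- rnorm(200) + as.numeric(factor(region))        # x correlated with region
y <- 2 * x + rnorm(200, mean = as.numeric(factor(region)))

# LSDV: region dummies in the regression
coef(lm(y ~ x + factor(region)))["x"]

# Within estimator: demean y and x by region, then regress
y_dm <- y - ave(y, region)
x_dm <- x - ave(x, region)
coef(lm(y_dm ~ x_dm))["x_dm"]                       # same slope as above
```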

πŸ‘︎ 6
πŸ’¬︎
πŸ‘€︎ u/vaishakh1000
πŸ“…︎ Nov 09 2021
🚨︎ report
Deflate the dependent variable or use time dummies?

Hi all :) I'm new here. I have a panel dataset and my dependent variable is nominal productivity at the firm level, covering almost all industries/sectors in the economy. Since the DV is in nominal terms, I wish to remove the effect of prices, as they might bias my estimates by artificially inflating productivity. Unfortunately, data on firm-specific or industry-specific prices is not available. Could someone help me figure out the difference between including time dummies and deflating the productivity variable by the GDP deflator? I think the latter is more accurate, but at the same time I would be deflating all firms by the same deflator rather than an industry-specific one. It might be even better to deflate productivity while also including time dummies, to capture national time-varying variables other than prices. Thanks in advance for your help.
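
One relationship worth noting: if the DV is in logs and the deflator varies only by year, dividing by the deflator just subtracts log(deflator) for that year, which year dummies absorb exactly, so the other coefficients are unchanged. A quick simulated check (a sketch, not your data):

```
set.seed(7)
year <- rep(2010:2015, each = 50)
x <- rnorm(300)
deflator <- 1.02^(year - 2010)                 # common to all firms in a year
y_nominal <- exp(0.5 * x + rnorm(300)) * deflator

# Same coefficient on x either way, because factor(year) soaks up log(deflator):
coef(lm(log(y_nominal)            ~ x + factor(year)))["x"]
coef(lm(log(y_nominal / deflator) ~ x + factor(year)))["x"]
```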

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/hye125
πŸ“…︎ Oct 20 2021
🚨︎ report
Violation of PH by dummy variables in a Cox PH model?

I ran a Cox model in Python, but the library says that my dummy variables violate the PH assumption. The dummy variables correspond to states in the USA. How can a state be time-varying? That is, how can it violate the PH assumption, and what is the solution?
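
For what it's worth: the dummy itself doesn't change over time; what the test flags is that its estimated effect on the hazard appears to change with time, and a common remedy is to stratify on the offending variable. Your setup is in Python, but a sketch of the analogous idea in R's survival package (using the built-in lung data, with `inst` standing in for the state variable):

```
library(survival)

# stratify on the categorical variable instead of estimating a coefficient for it
fit <- coxph(Surv(time, status) ~ age + ph.ecog + strata(inst), data = lung)
cox.zph(fit)   # tests the PH assumption for each remaining term
```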

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/loaswera
πŸ“…︎ Nov 09 2021
🚨︎ report
Question on correct usage of dummy variables

Data set: Default of Credit Card Clients Data Set

So I am trying to create dummy variables for the X6 through X10 variables in the data set above. The measurement scale is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Now, some of the variables do not have observations covering the full range of this scale. Given that one category should be left off when one-hot coding for a regression, and that my goal is to ask what the probability of default is given a particular level of payment delay, should I do any of the following:

  1. Simply leave off one category from each variable, and accept that in some cases I will only be able to describe the probability of, say, having level 1, 2, or 6, given that none of the other levels were observed for a particular variable
  2. Add a column of 0's for each set of dummy variables, so the count matches the full scale of possible values minus 1 (see the sketch below)
  3. A third option I have missed

If anything I have said is unclear, please let me know.
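
On option 2: in R you can get exactly that behavior by declaring the full level set up front; unobserved levels then become all-zero dummy columns. A sketch (`df` and `pay_sep` are placeholder names; the level set is the scale quoted above):

```
delay_levels <- c(-1, 1:9)                  # the documented scale

# fixing the levels keeps unobserved categories in the coding
pay_sep <- factor(df$X6, levels = delay_levels)

# -1 becomes the base; unobserved levels show up as all-zero columns,
# for which lm()/glm() will simply return NA coefficients
model.matrix(~ pay_sep, data = df)
```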

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/helios1014
πŸ“…︎ Sep 17 2021
🚨︎ report
Issue with Difference in Differences Estimation in RStudio Using Dummy Variables

All,

Tl;dr: I'm getting the error "Coefficients: (1 not defined because of singularities)." This is my first time using RStudio and doing a difference-in-differences estimation.

I'm attempting to learn RStudio to try my first two-period panel regression, but I'm running into issues with my dummy independent variable. I'm attempting to model the effect that ending unemployment benefits had on a state's U3 unemployment rate between the May and July BLS LAUS reports. I have the month of May coded as 0 and July coded as 1 for the variable 'Month'. For whether or not a state ended unemployment benefits early, I have the variable 'End.Early' coded 0 for no and 1 for yes. I am using the 50 states, so n = 100.

Below is the printout of what I'm putting into RStudio and what I receive back. I figured that with two dummy variables this should be a very easy regression to see whether ending the benefits was statistically significant or not, but I can't get End.Early to generate useful information no matter how I try to regress it:

> U3May=subset(testdat,Month=="0")

> U3July=subset(testdat,Month=="1")

> diff_U3=U3July$Unemployed..-U3May$Unemployed..

> diff_EndEarly=U3July$End.Early-U3July$End.Early

> model1=lm(diff_U3~diff_EndEarly)

> summary(model1)

Call:
lm(formula = diff_U3 ~ diff_EndEarly)

Residuals:
    Min      1Q  Median      3Q     Max
 -1.096  -0.296   0.004   0.304   0.904

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.09600    0.06833   1.405    0.166
diff_EndEarly       NA         NA      NA       NA

Residual standard error: 0.4832 on 49 degrees of freedom
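
One thing jumps out in the printout: `diff_EndEarly = U3July$End.Early - U3July$End.Early` subtracts the variable from itself, so it is zero for every state, and a constant-zero regressor is exactly what triggers the singularities message. A sketch of the more usual specifications, assuming the column names shown above:

```
# the treatment indicator itself (not differenced) varies across states
EndEarly <- U3May$End.Early
model_diff <- lm(diff_U3 ~ EndEarly)

# equivalently, the standard two-period DiD on the stacked data;
# the Month:End.Early coefficient is the difference-in-differences estimate
model_did <- lm(Unemployed.. ~ Month * End.Early, data = testdat)
summary(model_did)
```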

πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/ShockGryph
πŸ“…︎ Sep 03 2021
🚨︎ report
I have created a dummy variables table and need to do a linear regression on the data (plus threat bias data?). I managed to use lm() to get some results, but it is only including the low and mod (not high) levels for effortful control. Does anyone know how to help? Sorry, this is confusing. reddit.com/gallery/p46msh
πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/sophiemae19
πŸ“…︎ Aug 14 2021
🚨︎ report
How to handle lots of dummy control variables?

I am doing a regression in which one of my control variables is the country each of my species occurs in; however, since species can occur in multiple countries, I have a large table with a 0/1 column per country for each species. Is there any way to get R to put these into the regression without manually adding each column, so it considers country membership as a variable rather than each column individually? The issue with doing it manually is that (a) it would be very long and clunky, and (b) it is more open to mistakes. By the way, I am not very experienced with regressions, so apologies if this is really obvious!
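
A sketch of how this is often handled in R without typing every column, assuming the country indicators share a name prefix (names here are hypothetical):

```
# collect the indicator columns by name and build the formula programmatically
country_cols <- grep("^country_", names(df), value = TRUE)
f <- reformulate(c("range_size", country_cols), response = "richness")
fit <- lm(f, data = df)
```

Note that because species occur in multiple countries, the indicators are not levels of one categorical variable, so a single factor() term can't capture the overlap; including the 0/1 columns directly does.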

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/LeafHGG
πŸ“…︎ Jul 30 2021
🚨︎ report
[Q] [R] Does anyone have a citation for the equivalence of an interaction regression model between binary regressors and using dummy variables for each combination between the binary regressors?

It's pretty easy to show on paper that the two models must be equivalent, but my professor (who isn't the greatest at stats) wants a citation anyway. The problem is that this is such a trivial result that I'm having trouble finding an academic source that actually spells it out. Any recommendations on where to look?
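
Not a citation, but for completeness: the two parameterizations span the same column space, so their fitted values match exactly, which is easy to verify numerically (a sketch with simulated data):

```
set.seed(3)
a <- rbinom(100, 1, 0.5)
b <- rbinom(100, 1, 0.5)
y <- 1 + 2*a - b + 3*a*b + rnorm(100)

m_int   <- lm(y ~ a * b)                  # interaction parameterization
m_cells <- lm(y ~ 0 + interaction(a, b))  # one dummy per (a, b) cell

all.equal(fitted(m_int), fitted(m_cells)) # TRUE
```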

πŸ‘︎ 16
πŸ’¬︎
πŸ‘€︎ u/post_it_notes
πŸ“…︎ Jun 15 2021
🚨︎ report
Linear Regression w/ Dummy Variables

I'm learning about regression using categorical variables. It mostly makes sense to me, but none of the video explanations I can find really explain the "why" of the n − 1 dummy variables. For example, if I'm using north (x1), south (x2), east (x3), and west as independent variables, then west is x1 = 0, x2 = 0, x3 = 0. If I do a regression in Excel, where do I look to see the p-value and coefficient of west? It isn't clear to me why we have to drop one of the dummy variables.

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/topdeadcntr
πŸ“…︎ Jul 24 2021
🚨︎ report
[Calculus 2: US College] When is it appropriate to use a dummy variable in anti-differentiation? In this situation, it wasn't appropriate. The mistake is circled in pink. But why? Also, my work goes from the left column (up to down) and then to the right column (up to down once more)
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/theasianjose
πŸ“…︎ Aug 09 2021
🚨︎ report
Dummy variable on a VAR model

Hey guys, I'm currently learning about vector autoregressive (VAR) models. I'm not really good with RStudio but I'm still learning. Also, sorry for my bad English; it's not my main language.

I'm attempting to run a VAR model on international WTI oil prices and Colombian gas prices, in order to analyze whether changes in oil prices affect Colombian gas prices. I'm using data from 2017 to 2021.

The thing is, I have a dummy variable (because I need to mark the recent pandemic period), but I really don't know how to incorporate it into the model. I tried using the "season" and "exogen" arguments, but I get an error.

How can I estimate a VAR model with a dummy variable?

Any help will be really appreciated!
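
If you're using the vars package, exogenous dummies go in through `exogen` as a named matrix whose rows line up with the endogenous series; a sketch with made-up objects (`y` and `pandemic` are assumptions, not your data):

```
library(vars)

# y: a (T x 2) matrix or data frame of WTI and Colombian gas prices
# pandemic: a length-T 0/1 vector marking the pandemic period
exog <- matrix(pandemic, ncol = 1, dimnames = list(NULL, "pandemic"))

fit <- VAR(y, p = 2, type = "const", exogen = exog)
summary(fit)

# forecasting then needs future values of the dummy, with matching column names
predict(fit, n.ahead = 12,
        dumvar = matrix(1, nrow = 12, ncol = 1,
                        dimnames = list(NULL, "pandemic")))
```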

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/Ophifella
πŸ“…︎ Sep 16 2021
🚨︎ report
[Q] Comparing linear regression models using dummy variables

For my research I've explored the effects of predictors on green space visitation before and during the COVID-19 pandemic. The result is two linear regression models over two periods of time for the same population group. I was wondering how to compare these two models, and I have found online that an alternative to a Chow test is to "run the model with dummy variables". I know what dummy variables are, but I'm confused about how this method works in practice. Is anyone familiar with it? Any help is appreciated!
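
The usual mechanics, sketched with placeholder names: pool the two periods, add a period dummy, interact it with every predictor, and test the dummy and interaction terms jointly; that joint F test reproduces the Chow test, and the individual interaction coefficients show which predictors changed between periods:

```
pooled     <- lm(visits ~ period * (income + distance + age), data = both_periods)
restricted <- lm(visits ~ income + distance + age,            data = both_periods)
anova(restricted, pooled)   # joint F test: do the two periods share one model?
```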

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Jimbawzz
πŸ“…︎ Jun 18 2021
🚨︎ report
[Q] Controlling for year effects using dummy variables

I am currently reading the master's thesis of a friend of mine. The aim of the thesis is to measure the impact of the size of a private equity fund on the fund's performance using a regression analysis. Among other control variables, he includes the year in which a given fund was launched. My friend argues that this controls for differences in economic conditions specific to each year. What makes me wonder is the implementation of this control variable: for each year he creates a dummy variable in the following way:

2010: 1 if the fund was launched in 2010 and 0 otherwise

2011: 1 if the fund was launched in 2011 and 0 otherwise

2012: 1 if the fund was launched in 2012 and 0 otherwise

...

My understanding is that year is an interval-scaled variable. Is it legitimate to control for year effects in this way? What would be an alternative (better?) way to control for possible effects emanating from the launch year of the fund? I have no experience with panel data regression, but I thought that might be a more appropriate approach here.
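
For what it's worth, the dummy-per-year scheme is standard practice (launch-year fixed effects); treating year as one interval-scaled regressor would instead force a single linear time trend. In R the whole dummy set is one term (a sketch; names are hypothetical):

```
# factor() expands launch_year into one dummy per year, minus a base year;
# other controls enter the formula as usual
fit <- lm(performance ~ fund_size + factor(launch_year), data = funds)
```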

Thanks

πŸ‘︎ 9
πŸ’¬︎
πŸ‘€︎ u/Fteddy91
πŸ“…︎ May 12 2021
🚨︎ report
Count if dummy variable occurs before another dummy variable

Dear all,

I have a dataset that looks something like the table below, and I would like to count the number of times dummy_2 occurs before dummy_1, and vice versa. In my study, the dummies are news releases, and my guess is that one reveals information about the other (the information content of the two releases is similar), which could explain why one is not significant.

Does anyone know how I could accomplish this?

Many thanks in advance!

Date        variable  dummy_1  dummy_2
YYYY-MM-DD  x_1       0        1
YYYY-MM-DD  x_2       1        0
YYYY-MM-DD  x_3       0        0
YYYY-MM-DD  x_4       0        0
YYYY-MM-DD  x_5       1        0
YYYY-MM-DD  x_6       0        1
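
One way to count orderings in R, assuming the data are sorted by Date and each row has at most one release (a sketch; adapt to whatever software you're using):

```
# keep only the rows where either release occurs, in date order
events <- df[df$dummy_1 == 1 | df$dummy_2 == 1, ]
type   <- ifelse(events$dummy_1 == 1, "d1", "d2")

# adjacent pairs where a dummy_2 release comes right before a dummy_1 release
sum(head(type, -1) == "d2" & tail(type, -1) == "d1")
# and the reverse ordering
sum(head(type, -1) == "d1" & tail(type, -1) == "d2")
```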
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/SPJ19
πŸ“…︎ Jul 06 2021
🚨︎ report
Whether to standardize a dummy coded variable...and how to interpret when I don't

I'm making some difficult decisions about standardized coefficients. I have a multilevel linear regression model with a pre vs. post measure (coded 0 and 1), a continuous predictor, their interaction, and random intercepts for participants.

To obtain standardized coefficients for all of the variables, should I standardize the dummy coded variable? This changes the model because, now, holding this variable at 0 means holding it somewhere in the ether between pre and post, and the numbers no longer correspond directly to my coefficients.

For now, I opted not to standardize the dummy coded variable. Everything else in the model is standardized, though, including the dependent variable. But I'm stuck on how to report the coefficient for the dummy code: is it technically not a standardized coefficient?

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/rachelebenjamin
πŸ“…︎ Jun 23 2021
🚨︎ report
Creating a dummy variable = 1 if another variable is ever 1 by ID.

Hello, I have a data frame that looks roughly like this:

ID  Var1  NewVar
1   1     1
1   1     1
1   1     1
1   1     1
2   0     0
2   0     0
3   1     1
3   0     1
3   0     1

I want to create NewVar = 1 if Var1 is ever 1 for a particular ID. I'm not sure how to do this in R yet and am looking for any advice. It's probably a simple thing; I just don't know the syntax here.
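
A base R one-liner covers this: take the group maximum of Var1 within each ID (a sketch matching the column names above):

```
# ave() applies FUN within each ID and returns a vector of the original length
df$NewVar <- ave(df$Var1, df$ID, FUN = max)

# dplyr equivalent:
# df |> group_by(ID) |> mutate(NewVar = as.integer(any(Var1 == 1))) |> ungroup()
```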

-Thanks!

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Caconym32
πŸ“…︎ Jul 19 2021
🚨︎ report
Can anyone explain what the dummy variable is and prove it?

Why can the integral with c(w)e^(iwx) and the integral with c(t)e^(itx) be combined?

https://ibb.co/jrn6t58

I mean, dt and dw are different?
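
For context: the integration variable is a "bound" (dummy) variable; it exists only inside the integral, so renaming it does not change the value. Substituting t = w (so dt = dw) gives

∫ c(w)e^(iwx) dw = ∫ c(t)e^(itx) dt,

the same way the index in a sum can be relabeled (Σ_k a_k = Σ_j a_j). That is why the two integrals can be merged under one integral sign: dt and dw are just different labels for the same dummy variable.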

πŸ‘︎ 7
πŸ’¬︎
πŸ‘€︎ u/JacksonSteel
πŸ“…︎ May 31 2021
🚨︎ report
