A list of puns related to "Stratified Sampling"
I'm a data analyst by trade who's been asked by work to do a more data sciencey thing. I can tell my boss I'm not suited for this with zero repercussions, but it seemed like a good way to learn something new. My skill level with SQL is moderate-advanced, but I'm very much a beginner to R. I'm requesting some advice on where to start with a type of sampling I'll explain below.
I have population A at an individual level (~20k people) along with their age, gender, and an estimate of income for their zip (I'm aware that this isn't the ideal way to estimate income - it was requested we don't go deeper than that because it's not that important). I have the same variables for population B (but with ~50m potential people).
What I need to do is pull a random sample (of equal size) from B that approximates population A in terms of age, gender, and estimated income. SQL is a horrible tool for this. I'm inexperienced with R and don't know where to start. Is this stratified random sampling? And if so, do I need way more experience to pull something of this caliber off? Or is it not nearly as complex as I'd imagine due to the few fields I'm trying to group by?
Really just looking to get the right name for the type of sampling I'm looking to do so I can do some further research, but also fine with hearing "this is so far beyond your experience that you should just tell your boss this needs to be done by a data scientist." Thanks!
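To make the goal concrete, here's a rough sketch of the stratified approach in Python/pandas (all file and column names are made up for illustration; with ~50m rows the actual draw would probably happen in the database, but the logic is the same):

```python
import pandas as pd

# Hypothetical files and column names, purely for illustration.
pop_a = pd.read_csv("population_a.csv")   # ~20k rows: age, gender, zip_income
pop_b = pd.read_csv("population_b.csv")   # ~50m rows: same columns

# Shared bin edges so both populations use identical strata.
age_edges = [0, 25, 35, 45, 55, 65, 120]
income_edges = pop_a["zip_income"].quantile([0, .2, .4, .6, .8, 1]).values

def add_stratum(df):
    df = df.copy()
    df["stratum"] = (
        pd.cut(df["age"], bins=age_edges).astype(str) + "|"
        + df["gender"].astype(str) + "|"
        + pd.cut(df["zip_income"], bins=income_edges, include_lowest=True).astype(str)
    )
    return df

pop_a, pop_b = add_stratum(pop_a), add_stratum(pop_b)

# How many people we need from each stratum so the sample mirrors A.
targets = pop_a["stratum"].value_counts()

# Draw that many at random from the matching stratum of B.
sample_b = pd.concat(
    grp.sample(n=min(int(targets.get(name, 0)), len(grp)), random_state=1)
    for name, grp in pop_b.groupby("stratum")
)
```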
I tried posting on r/statistics but don't have enough karma yet :( Hopefully someone here can help.
I'm using the following guide to calculate the margin of error for stratified sampling:
Stratified Sampling: Analysis (stattrek.com)
If I'm following the same steps and calculate the ME for only the boys (stratum population 10,000 and total population 10,000), I'd get an average of 70 with an ME of 4.74, so the range at 95% confidence is 65.26 to 74.74. Is this the proper way to calculate the margin of error for each individual stratum? If it is not correct, can you point me to a good source to do this? Thanks.
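For reference, this is the per-stratum formula I'm applying (my notation; z is about 1.96 at 95% confidence, n_h and N_h are the stratum sample and population sizes, and s_h is the within-stratum sample standard deviation):

$$\mathrm{ME}_h = z \sqrt{\left(1 - \frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}}$$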
I have seen the model score being used as the basis of stratified sampling in order to evaluate a Machine Learning model. For example, let's say we have 1000 samples. We get the model scores for these 1000 samples and then perform stratified sampling on top of it to compute the final metric (say Precision and Recall).
So we would divide the 1000 scored samples according to model score into, say, 3 buckets of scores: [0, 0.33), [0.33, 0.67) and [0.67, 1]. Now we sample 100 samples from each of these buckets, calculate PR on the individual buckets, and use the bucket weights to find the aggregate Precision/Recall.
This whole strategy is new to me. Is there some academic research papers or youtube video/course where I can learn more about this?
I have a follow up question as well:
Let's say we have a model M1 for which we performed this kind of sampling and calculated the metrics. Now we have a new model M2, and we want to find out whether the new model provides a performance boost. Would we evaluate model M2 on the earlier stratified sample only, or would we perform a similar kind of stratified sampling according to the model scores of M2 (since the model scores now come from separate distributions)? If we generate a new stratified sampled dataset for this, would the results of M1 and M2 be comparable?
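To make the setup concrete, here is a rough Python sketch of the procedure as I understand it (column names, thresholds, and the synthetic data are mine, not from any paper):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical scored data: model score, true label, thresholded prediction.
df = pd.DataFrame({
    "score": rng.uniform(size=1000),
    "y_true": rng.integers(0, 2, size=1000),
})
df["y_pred"] = (df["score"] >= 0.5).astype(int)

# Strata defined on the model score.
df["bucket"] = pd.cut(df["score"], bins=[0, 0.33, 0.67, 1.0], include_lowest=True)

# Pretend these 100 rows per bucket are the ones we label/review.
labelled = df.groupby("bucket", observed=True).sample(n=100, random_state=0)

# Per-bucket precision/recall, weighted by each bucket's share of all 1000 rows.
weights = df["bucket"].value_counts(normalize=True)
precision = recall = 0.0
for bucket, grp in labelled.groupby("bucket", observed=True):
    tp = ((grp["y_pred"] == 1) & (grp["y_true"] == 1)).sum()
    p = tp / max((grp["y_pred"] == 1).sum(), 1)
    r = tp / max((grp["y_true"] == 1).sum(), 1)
    precision += weights[bucket] * p
    recall += weights[bucket] * r
```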
Hi there. I'm a humanitarian assistance manager, forced by current circumstances to dust off my knowledge of probability sampling. As above, is it possible to use a stratified/cluster sampling method? The strata would be at the province level. The clusters would be geographic areas where assistance is being delivered. If it helps, see the map below. The orange boundaries are provinces and the greenish boundaries are our areas of operations. Many thanks!
https://preview.redd.it/v8q2nmrpcuu51.jpg?width=1435&format=pjpg&auto=webp&s=ba006c801642795275473c9edb5d96283432d972
I am trying to understand the advantages of stratified sampling over simple random sampling. Wikipedia says: "stratification gives smaller error in estimation" [1].
Do we know when exactly this is the case?
Also, can we know by how much the sampling error is reduced in comparison to simple random sampling?
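As far as I understand the usual textbook comparison (e.g. Cochran's Sampling Techniques), with proportional allocation and ignoring finite population corrections, the variances of the estimated mean are roughly

$$\operatorname{Var}_{\mathrm{srs}}(\bar y) \approx \frac{S^2}{n}, \qquad \operatorname{Var}_{\mathrm{prop}}(\bar y_{st}) \approx \frac{1}{n}\sum_h W_h S_h^2,$$

where W_h = N_h/N and S_h is the within-stratum standard deviation. Since the total variance decomposes approximately as $S^2 \approx \sum_h W_h S_h^2 + \sum_h W_h (\bar Y_h - \bar Y)^2$, the reduction is roughly the between-strata term: stratification helps exactly when the strata means differ, and buys essentially nothing if they are all equal.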
I had a question regarding sampling methodology,
I want to interview PRMs (disabled passengers). However, I have chosen not to interview passengers with mental disabilities or impairments, due to ethical procedures.
I had the notion that this sampling technique would be regarded as stratified random sampling, since I divided the total population into smaller strata. Or would this still be regarded as just random sampling, or something else?
Within ... Airport, you can distinguish mentally impaired passengers from a multi-coloured lanyard provided to them and so I want to use this identification tool to exclude them from the interview process.
Thank you for any help,
Let's say you're conducting a job satisfaction survey of teachers in a county. The key here is that you will need to publish a report for every school surveyed, in addition to a report at the county level. So you decide on a stratified design, with schools as your stratification variable.
Now, let's say there are 10,000 teachers total in the county. For now, let's ignore the issue of non-response. For a 95% confidence level and 3% margin of error, you would need a sample size of 964.
Next, you do a proportional allocation to determine the sample size for each school. This should suffice for the county-level report.
Now let's say you have a school with 50 teachers. Proportional allocation would give you a sample size of 5, which again, should be fine when you're reporting results at the county-level. But is this sample size large enough if you need to report the exact same survey results, but just at the school level? Would techniques like weighting be enough to compensate?
I would think that it would not be, because in order to achieve a 95% confidence level with 3% margin of error for a population of 50, you would need a sample size of 48.
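For what it's worth, both figures follow from the usual sample-size formula for a proportion with a finite population correction (z = 1.96, p = 0.5, margin of error E = 0.03); a quick check in Python:

```python
def required_n(N, E=0.03, z=1.96, p=0.5):
    # Sample size for estimating a proportion, with finite population correction.
    return N * z**2 * p * (1 - p) / (E**2 * (N - 1) + z**2 * p * (1 - p))

print(round(required_n(10_000)))  # 964 -- the whole county
print(round(required_n(50)))      # 48  -- a single school of 50 teachers
```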
Essentially, my coworker is trying to make the case for why the sample size of 5 would be totally okay. Somehow that doesn't seem right to me, but I'm not sure how to explain why.
I am designing a survey for which I would like to stratify "events" by state/province, creating 86 strata for the particular dataset I'm working with. However, there are some strata with low counts.
Using proportional allocation with a budgeted sample size of n = 400 events out of N = 9046, some strata are not selected because they occur so infrequently in the sampling frame, i.e. (n/N)·N_h < 1, so the inclusion probability for units in some strata is 0. In this case, to ensure that all businesses have some probability of inclusion, should I instead use simple random sampling?
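To illustrate the arithmetic (the stratum size below is hypothetical): any stratum with fewer than N/n, roughly 9046/400 = 22.6 units, gets an expected proportional allocation below one unit.

```python
n, N = 400, 9046

def expected_allocation(N_h):
    # Proportional allocation: n_h = n * N_h / N.
    return n * N_h / N

print(N / n)                    # ~22.6 units needed for an expected allocation of 1
print(expected_allocation(12))  # ~0.53 for a hypothetical 12-unit stratum
```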
I'm trying to build a logistic regression model for some survey data where people list how likely they are to enroll in a program (Not at all, Not very, Somewhat, Very, and Extremely likely) based off factors like age and familiarity with the program.
The problem is, about 40% of people say Somewhat likely, and my models will only predict responses to be Not very likely, or Somewhat likely. I'm also dealing with a sample size of around 1100.
I've thought about doing some sort of stratified sampling, where I take a larger sample with replacement from those who say anything other than "Somewhat likely" or "Not very likely", but I'm not sure how to implement this sort of strategy in R, or how effective it would be with such a small sample.
Basically, I've spent a few days on this, banged my head around, and can't come up with any meaningful way to represent the data. Any help would be appreciated.
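The kind of resampling I have in mind, sketched in Python/pandas for clarity (the file and column names are made up; I'd still need to translate this into R, and I'm not sure it's statistically sound with n around 1100):

```python
import pandas as pd

survey = pd.read_csv("survey.csv")   # hypothetical file with a "likelihood" column

target_per_class = 300   # oversample every response level up to the same count

balanced = pd.concat(
    grp.sample(n=target_per_class, replace=True, random_state=1)
    for _, grp in survey.groupby("likelihood")
)
```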
Can I combine the two methods?
Or is it still a Convenience Sample if the firms that make up the total sampling population are not totally randomly selected?
I'm hoping I can get some help figuring out how I can randomly select either the first or second submission from a sample of 20,000 individuals.
Background:
What I need to do:
So I'm gathering I need to figure out how to do stratified random sampling on the 20,000 submission dataset to get to a 10,000 submission dataset, right? And each person would be a group, and I want to choose one submission from each of the groups.
I found this, but I'm getting lost around steps 4 & 5 when I try to conceptualize applying it to my dataset and ending up with exactly 10,000 submissions from exactly 10,000 people.
Can anyone help me figure this out?
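In case it helps clarify what I'm after, the operation in Python/pandas terms (column names are hypothetical):

```python
import pandas as pd

# 20,000 submissions from 10,000 people (two each); hypothetical columns.
subs = pd.read_csv("submissions.csv")

# Randomly keep exactly one submission per person -> 10,000 rows.
one_each = subs.groupby("person_id").sample(n=1, random_state=42)
assert len(one_each) == one_each["person_id"].nunique()
```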
Hello all,
I am very, very new to Stata. I need to figure out the best approach to do stratified random sampling.
Some background: I have ~170 words and their abstractness ratings from a very large word database. Within the 170 words, I grouped them up into five bins based on their abstractness ratings (lowest 20% of ratings to highest 20% of ratings). I want to randomly sample 50 total words for my study; 10 from each bin to ensure all abstractness levels are being represented fairly in the sample.
The first suggestion from my supervisor was the command "sample 50, count"; however, this appeared to just sample 50 words regardless of which bins they belong to - the resulting sample had an uneven representation of bins. Intuitively, I feel like my best bet would be to manually sample 10 from each bin somehow.
The bins are labelled - is there a way I can command Stata to sample from a particular bin by typing something like "sample 10, [bin label]"? Is there a better way to do stratified sampling?
Thanks so much in advance!
Are there any good learning resources on how to implement stratified sampling in tensorflow? And with tf.records specifically?
Closest I've found is Hvass-Labs tutorial series which covers tf.records, but no stratified sampling implementation. https://www.youtube.com/watch?v=oxrcZ9uUblI https://github.com/Hvass-Labs/TensorFlow-Tutorials
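To make the question concrete, the closest thing I've pieced together so far is something like this (file names are hypothetical, and I'm not sure it's the idiomatic approach):

```python
import tensorflow as tf

# One TFRecord file per class/stratum (hypothetical paths).
paths = ["class_0.tfrecord", "class_1.tfrecord"]

per_class = [tf.data.TFRecordDataset(p).repeat() for p in paths]

# Interleave records so each stratum appears with the chosen probability.
balanced = tf.data.experimental.sample_from_datasets(
    per_class, weights=[0.5, 0.5], seed=42
)
dataset = balanced.batch(32)
```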
What is your go-to strategy when it comes to sampling from rather correlated data, especially where the feature space is high-dimensional, e.g. N << D? I have seen approaches that use single-linkage clustering to guarantee a minimal distance between training folds and test sets. However, it seems rather hard to optimize the edge distance as a hyperparameter, at least when using AUROC or similar cost functions, as this would again favor correlation between training fold and test fold (or training set and test set).
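For context, one concrete version of the single-linkage idea, sketched with scikit-learn on synthetic data (the distance threshold is a placeholder, not a recommendation):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # N << D, hypothetical data
y = rng.integers(0, 2, size=100)

# Single-linkage clustering with a distance cut-off; each cluster becomes a
# "group" so that highly similar samples never straddle train and test.
groups = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="single"
).fit_predict(X)

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    pass  # fit / evaluate a model here
```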
I know that for histogram bin size, the Freedman-Diaconis rule is designed to select a good size without the statistician having to first look at the data and decide their best guess at what the bin size should be.
Similar to choosing the bin sizes, stratified random sampling needs sizes to be chosen for the intervals of prior data that will be the strata when basing it on a numerical variable.
Is the Freedman-Diaconis rule valid for deciding this? Or is there something similar for stratified random sampling? Or does the statistician just have to look at the prior data of the population and make a guess?
I started wondering this after seeing a TEDx talk about sortition, where the speaker suggested we use stratification in the selection process. I wondered how, with many continuous variables (like wealth, earnings, and age [and possibly latent spaces of things like location, race, and gender]), the intervals could be decided upon.
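For reference, the Freedman-Diaconis width is 2·IQR/n^(1/3), and numpy will compute the resulting edges directly, which is what I would naively reuse as strata boundaries (the data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=5000)   # hypothetical variable

# Freedman-Diaconis bin edges: width = 2 * IQR / n ** (1/3).
edges = np.histogram_bin_edges(income, bins="fd")
print(len(edges) - 1, "bins")
```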
Hi,
I'm trying to calculate some information (proportion, std. error) of a stratified random sample without knowing the population size.
I know the value of the sample size, that the sample stratification proportionally reflects the population stratification and that the population could be considered large.
Unfortunately, here is where my process breaks down. All of the formulas I'm familiar with require the population size N, and most require the sample size n as well, e.g. https://i.imgur.com/4lLWl8n.png
I've considered using the sample size in place of population size. I may be able to get an estimate this way although I'm not sure if this is appropriate.
Calculating the standard error is a bit more difficult, as the formula I have requires both the sample and population sizes. If I set them equal, it would produce 0, since (N - n) is a multiplier in the formula.
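For reference, the kind of formula I mean for a stratified proportion (my notation; W_h = n_h/n, which equals N_h/N under proportional allocation):

$$\operatorname{SE}(\hat p_{st}) = \sqrt{\sum_h W_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\hat p_h (1 - \hat p_h)}{n_h}}$$

If the population is effectively infinite, I'd expect the correction factor (1 - n_h/N_h) to tend to 1 rather than 0, so perhaps it simply drops out - but I'd like confirmation that this is the right way to think about it.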
I feel like I'm overlooking something very obvious. If anyone could point me in the right direction on how to obtain these statistics without a known population size I'd really appreciate it.
Thanks in advance!
So my company has always used a proportional stratified sample weighting methodology that is:
stratum weight = ( Nh / N ) * (n/nh)
Where nh is the sample size for stratum h, Nh is the population size for stratum h, N is total population size, and n is total sample size.
But lately there is a desire to use a different feature to get the proportion (one that isn't a population at all). So the new formula would be:
stratum weight = ( Xh / X ) * (n/nh)
Where Xh is an aggregate currency value for the stratum and X is the population estimate for this number.
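To make the comparison concrete, here is how I'd compute both weightings side by side (all the numbers are hypothetical):

```python
# Hypothetical stratum summaries: N_h, n_h, X_h (aggregate currency value).
strata = {
    "small": (800, 40, 1_000_000),
    "large": (200, 40, 9_000_000),
}
N = sum(v[0] for v in strata.values())
n = sum(v[1] for v in strata.values())
X = sum(v[2] for v in strata.values())

for name, (N_h, n_h, X_h) in strata.items():
    w_pop = (N_h / N) * (n / n_h)   # current population-based weight
    w_cur = (X_h / X) * (n / n_h)   # proposed currency-based weight
    print(name, round(w_pop, 3), round(w_cur, 3))
```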
Is this kind of mix ok to do?
Are there things we should consider when doing this?
What is an alternative?
I suggested using the currency value as a further stratification dimension, but data quality/availability is somewhat prohibitive, and it would also likely make the strata very small and expose us to more bias.
Thoughts?
Is there an easier way or a general rule to follow?
https://vle.mathswatch.co.uk/images/questions/question1254.png
How do I solve this?
Whenever I bring in data for a neural net, I have to convert it into numpy arrays (which is fine), but then I have to use code to break the numpy array into stratified random arrays, and doing this well can take a lot of code. Upon googling, I can find no premade easy library to do this that's compatible with tensorflow/numpy arrays (the sklearn one can't be converted back to numpy arrays as far as I can tell). Does anyone know of one?
If not I'll make my own optimized little module for it and publish it to save the rest of the world (and me in the future) some time.
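For context, this is the kind of stratified split I mean, shown with scikit-learn for clarity; I'd want a numpy/TF-friendly module with a similar interface:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(0, 3, size=1000)

# Stratified split: class proportions in y are preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```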
I am trying to downsample two extremely uneven data sets. I have:
One with 40 million rows of data; the other with 4K rows of data.
I want to downsample in a meaningful way and compare the model results of XGBoost, ANN, etc. to the same models trained on a purely random sample.
I am using Kmeans and randomly sampling proportionally within each cluster.
Do you know of any other ways to accomplish this?
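Roughly what I'm doing now, sketched in Python (file name, feature columns, cluster count, and target size are all placeholders):

```python
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

big = pd.read_csv("big_table.csv")          # hypothetical 40M-row table
features = big[["f1", "f2", "f3"]].values   # hypothetical feature columns

# Cluster the majority data, then sample the same fraction from every cluster.
big["cluster"] = MiniBatchKMeans(n_clusters=50, random_state=0).fit_predict(features)

frac = 4_000 / len(big)   # downsample roughly to the size of the small table
downsampled = big.groupby("cluster").sample(frac=frac, random_state=0)
```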
With stratified sampling, if you want to sample the population of the United States, you would take a random sample of people from each state.
With cluster sampling, you would group all the states into random groups of 2 or more states, randomly sample one or more of these groups, and all the people in them would be your sample?
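A toy sketch of the distinction as I understand it (states and numbers are made up):

```python
import random

random.seed(0)
states = ["CA", "TX", "NY", "FL"]
people = [(state, i) for state in states for i in range(1000)]

# Stratified: draw some people from EVERY state (stratum).
stratified = [
    person
    for state in states
    for person in random.sample([p for p in people if p[0] == state], 50)
]

# Cluster: group states, randomly pick whole groups, keep EVERYONE in them.
clusters = [["CA", "TX"], ["NY", "FL"]]
picked = random.choice(clusters)
cluster_sample = [p for p in people if p[0] in picked]
```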