Stratified random sampling

I’m a data analyst by trade who’s been asked by work to do a more data sciencey thing. I can tell my boss I’m not suited for this with zero repercussions, but it seemed like a good way to learn something new. My skill level with SQL is moderate-advanced, but I’m very much a beginner with R. I’m requesting some advice on where to start with a type of sampling I’ll explain below.

I have population A at an individual level (~20k people) along with their age, gender, and an estimate of income for their zip (I’m aware that this isn’t the ideal way to estimate income; it was requested we don’t go deeper than that because it’s not that important). I have the same variables for population B (but with ~50m potential people).

What I need to do is pull a random sample (of equal size) from B that approximates population A in terms of age, gender, and estimated income. SQL is a horrible tool for this. I’m inexperienced with R and don’t know where to start. Is this stratified random sampling? And if so, do I need way more experience to pull something of this caliber off? Or is it not nearly as complex as I’d imagine, given the few fields I’m trying to group by?

Really just looking to get the right name for the type of sampling I’m looking to do so I can do some further research, but also fine with hearing “this is so far beyond your experience that you should just tell your boss this needs to be done by a data scientist.” Thanks!
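For what it's worth, one way to read the task above is as quota-style stratified sampling: bin everyone into (age band, gender, income band) cells, count how many people A has in each cell, and draw that many at random from the same cell in B. Below is a minimal Python sketch of that idea. The field names (`age`, `gender`, `zip_income`) and the band widths are assumptions made up for illustration, not anything from the post.

```python
import random
from collections import Counter, defaultdict

def stratum(person):
    """Map a person to a coarse (age band, gender, income band) cell.
    Band widths are arbitrary choices for this sketch."""
    age_band = min(person["age"] // 10, 8)                 # 0-9, 10-19, ..., 80+
    income_band = min(person["zip_income"] // 25_000, 6)   # 25k-wide bands, capped
    return (age_band, person["gender"], income_band)

def matched_sample(pop_a, pop_b, seed=0):
    """Draw from pop_b as many people per cell as pop_a has in that cell."""
    rng = random.Random(seed)
    target = Counter(stratum(p) for p in pop_a)   # quota per cell, taken from A
    pool = defaultdict(list)
    for p in pop_b:
        pool[stratum(p)].append(p)
    sample = []
    for cell, quota in target.items():
        candidates = pool.get(cell, [])
        # If B is short in a cell, take everyone there (the sample falls short).
        sample.extend(rng.sample(candidates, min(quota, len(candidates))))
    return sample
```

With ~50m rows in B it may be more practical to compute the cell for each row in SQL and only do the per-cell random draw in code, but the logic stays the same.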

👍 3
👤 u/GodKamnitDenny
📅 Dec 10 2021

Stratified sampling margin of error

I tried posting on r/statistics but don't have enough karma yet :( Hopefully someone here can help.

I'm using the following guide to calculate the margin of error for stratified sampling:

Stratified Sampling: Analysis (stattrek.com)

If I follow the same steps and calculate the ME for only the boys (stratum population 10,000 and total population 10,000), I get an average of 70 with an ME of 4.74, so the range at 95% confidence is 65.26 to 74.74. Is this the proper way to calculate the margin of error for each individual stratum? If it is not correct, can you point me to a good source for doing this? Thanks.
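As a rough illustration of the per-stratum calculation being asked about: treating one stratum as its own simple random sample, the margin of error for its mean is the critical value times the standard error, with a finite population correction. The sketch below assumes a normal critical value (1.96 for 95%); the inputs in the usage line are hypothetical and are not the numbers from the linked guide.

```python
import math

def stratum_margin_of_error(sample_sd, n, N, z=1.96):
    """Margin of error for one stratum's mean.
    sample_sd: sample standard deviation within the stratum
    n: stratum sample size, N: stratum population size
    z: critical value (1.96 is the usual normal value for 95%)"""
    fpc = 1 - n / N                          # finite population correction
    se = math.sqrt(fpc * sample_sd ** 2 / n)
    return z * se

# Hypothetical inputs, for illustration only
me = stratum_margin_of_error(sample_sd=10.0, n=36, N=10_000)
print(f"70 ± {me:.2f}")
```

For small stratum samples a t critical value is often used in place of 1.96, which widens the interval somewhat.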

👍 2
👤 u/_MyNameIs__
📅 Sep 07 2021

[D] Stratified Sampling using model score?

I have seen the model score being used as the basis of stratified sampling in order to evaluate a Machine Learning model. For example, let's say we have 1000 samples. We get the model scores for these 1000 samples and then perform stratified sampling on top of it to compute the final metric (say Precision and Recall).

So we would divide the 1000 scored samples according to model scores, let's say into 3 buckets of scores: [0, 0.33), [0.33, 0.67) and [0.67, 1]. Now we sample 100 samples from each of these buckets, calculate PR on the individual buckets, and use the bucket weights to find the aggregate precision and recall.
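A minimal sketch of how the aggregation step might work, under one assumption: each labeled item stands in for N_h / n_h items of its bucket, and the weighted true positives, false positives, and false negatives are pooled before computing precision and recall (rather than averaging per-bucket precision). The data layout and the 0.5 decision threshold are made up for illustration.

```python
def stratified_precision_recall(buckets, threshold=0.5):
    """buckets: list of dicts with
         "size":   number of scored items in the bucket (N_h)
         "sample": list of (score, true_label) pairs labeled from that bucket
    Returns (precision, recall) estimated for the full scored set."""
    tp = fp = fn = 0.0
    for b in buckets:
        w = b["size"] / len(b["sample"])   # each sampled item represents w items
        for score, label in b["sample"]:
            predicted = score >= threshold
            if predicted and label:
                tp += w
            elif predicted and not label:
                fp += w
            elif not predicted and label:
                fn += w
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

Pooling weighted counts keeps a sparsely populated bucket from dominating the aggregate just because it was sampled as heavily as a large one.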

This whole strategy is new to me. Are there any academic research papers or YouTube videos/courses where I can learn more about this?

I have a follow-up question as well:

Let's say we have a model M1 for which we performed this kind of sampling and calculated the metrics. Now we have a new model M2 and we want to find out whether it improves performance. Would we evaluate M2 on the earlier stratified sample only, or would we perform a similar kind of stratified sampling according to the model scores of M2 (since the model scores now come from a different distribution)? If we generate a new stratified sample for this, would the results of M1 and M2 be comparable?

👍 2
👤 u/MrBrightside73
📅 Mar 04 2021

Is there such a thing as a Stratified - Cluster Sampling Methodology Combo?

Hi there. I'm a humanitarian assistance manager, forced by current circumstances to dust off my knowledge of probability sampling. As the title asks, is it possible to combine stratified and cluster sampling? The strata would be at province level. The clusters would be geographic areas where assistance is being delivered. If it helps, see the map: the orange boundaries are provinces and the greenish boundaries are our areas of operations. Many thanks!

https://preview.redd.it/v8q2nmrpcuu51.jpg?width=1435&format=pjpg&auto=webp&s=ba006c801642795275473c9edb5d96283432d972

👍 2
👤 u/i2rsantos
📅 Oct 23 2020

What is Stratified Sampling?
👍 2
👤 u/robofied
📅 Jan 30 2021

Stratified sampling should reduce the sampling error, but how do I know by how much?

I am trying to understand the advantages of stratified sampling over simple random sampling. Wikipedia says: "stratification gives smaller error in estimation" [1].

Do we know when exactly this is the case?

Also, can we know by how much the sampling error is reduced in comparison to simple random sampling?
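One standard way to look at this, sketched under the assumptions of proportional allocation and sampling fractions small enough to ignore the finite population correction: write W_h = N_h / N for the stratum weights, S_h^2 for the within-stratum variances, and \bar{Y}_h for the stratum means (this notation is introduced here, not taken from the post).

```latex
\operatorname{Var}_{\text{SRS}}(\bar{y}) \approx \frac{S^2}{n},
\qquad
\operatorname{Var}_{\text{prop}}(\bar{y}_{st}) \approx \frac{1}{n}\sum_h W_h S_h^2

S^2 \approx \sum_h W_h S_h^2 + \sum_h W_h \left(\bar{Y}_h - \bar{Y}\right)^2

\operatorname{Var}_{\text{SRS}}(\bar{y}) - \operatorname{Var}_{\text{prop}}(\bar{y}_{st})
\approx \frac{1}{n}\sum_h W_h \left(\bar{Y}_h - \bar{Y}\right)^2 \;\ge\; 0
```

So under these assumptions the gain is the between-stratum part of the variance: large when the stratum means differ a lot, close to zero when the strata look alike. With other allocations the gain is not guaranteed.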

👍 3
👤 u/mct2011
📅 Feb 15 2020

Machine learning for stratified sampling

I am trying to downsample extremely uneven data sets. I have:

  • One with 40 million rows of data
  • The other with 4K rows of data

I want to downsample in a meaningful way and compare to purely random sampling.

I am using Kmeans and randomly sampling proportionally within each cluster.

Do you know of any other ways to accomplish this?
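For reference, the "sample proportionally within each cluster" step described above can be sketched as follows. It assumes cluster labels have already been assigned to every row (for example by k-means); the function name and the rule of keeping at least one row per cluster are illustrative choices, not part of the post.

```python
import random
from collections import defaultdict

def proportional_sample(labels, n_total, seed=0):
    """labels: one cluster label per row. Returns row indices such that each
    cluster contributes roughly in proportion to its size."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)
    frac = n_total / len(labels)           # overall sampling fraction
    chosen = []
    for rows in by_cluster.values():
        # At least one row per cluster, never more than the cluster holds.
        k = min(len(rows), max(1, round(frac * len(rows))))
        chosen.extend(rng.sample(rows, k))
    return chosen
```

Because of rounding and the one-row minimum, the total can differ slightly from `n_total`.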

👍 3
👤 u/CkmCpvis
📅 Jun 26 2020

[Q] Sampling Methodology Random or Stratified

I have a question regarding sampling methodology.

I want to interview PRMs (Disabled Passengers). However, I have chosen not to interview passengers with mental disabilities or impairments, for ethical reasons.

I had the notion that this sampling technique would be regarded as stratified random sampling, as I have divided the total population into smaller strata. Or would this still be regarded as just random sampling, or something else?

Within ... Airport, you can distinguish mentally impaired passengers by the multi-coloured lanyard provided to them, so I want to use this identification tool to exclude them from the interview process.

Thank you for any help.

👍 7
👤 u/dimension3000
📅 Aug 22 2019

[Question] Stratified sampling and sample size

Let's say you're conducting a job satisfaction survey of teachers in a county. The key here is that you will need to publish a report for every school surveyed, in addition to a report at the county level. So you decide on a stratified design, with schools as your stratification variable.

Now, let's say there are 10,000 teachers total in the county. For now, let's ignore the issue of non-response. For a 95% confidence level and 3% margin of error, you would need a sample size of 964.

Next, you do a proportional allocation to determine the sample size for each school. This should suffice for the county-level report.

Now let's say you have a school with 50 teachers. Proportional allocation would give you a sample size of 5, which again, should be fine when you're reporting results at the county-level. But is this sample size large enough if you need to report the exact same survey results, but just at the school level? Would techniques like weighting be enough to compensate?

I would think that it would not be, because in order to achieve a 95% confidence level with 3% margin of error for a population of 50, you would need a sample size of 48.
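The figures quoted above can be reproduced with the usual sample size formula for a proportion with a finite population correction, assuming the most conservative proportion p = 0.5 and z = 1.96. The sketch below also inverts the formula to show the margin of error that a sample of 5 out of 50 would actually give. Function names are made up for illustration.

```python
import math

def required_sample_size(N, margin=0.03, z=1.96, p=0.5):
    """Unrounded sample size for estimating a proportion in a population of N."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2   # size for an infinite population
    return n0 / (1 + (n0 - 1) / N)            # finite population correction

def achieved_margin(n, N, z=1.96, p=0.5):
    """Margin of error obtained from a simple random sample of n out of N."""
    return z * math.sqrt(p * (1 - p) / n * (N - n) / (N - 1))

print(round(required_sample_size(10_000)))   # 964
print(round(required_sample_size(50)))       # 48
print(f"{achieved_margin(5, 50):.2f}")       # 0.42
```

In other words, under these assumptions a school-level result based on 5 of 50 teachers carries a margin of error of roughly ±42 percentage points, which is one concrete way to frame the concern.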

Essentially, my coworker is trying to make the case for why the sample size of 5 would be totally okay. Somehow that doesn't seem right to me, but I'm not sure how to explain why.

👍 2
📅 Apr 10 2020

[Q] Should I use simple random sampling instead of stratified sampling when some strata have low counts?

I am designing a survey for which I would like to stratify "events" by state / province, creating h = 86 strata for the particular dataset I'm working with. However, there are some strata with low counts.

Using proportional allocation with a budgeted sample size of n = 400 events out of N = 9046, some strata are not selected because they occur so infrequently in the sampling frame, i.e. n/N · N_h < 1, so the inclusion probability for units in some strata is 0. In this case, to ensure that all businesses have some probability of inclusion, should I instead use simple random sampling?
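One commonly used alternative to abandoning stratification is to keep it but guarantee a minimum sample per stratum, then allocate the rest proportionally. A minimal sketch of that idea follows; the function name, the minimum of 1, and the largest-remainder rounding are illustrative assumptions.

```python
def allocate(stratum_sizes, n_total, minimum=1):
    """Proportional allocation that guarantees `minimum` units per stratum.
    stratum_sizes: dict mapping stratum -> population count N_h."""
    N = sum(stratum_sizes.values())
    alloc = {h: min(minimum, N_h) for h, N_h in stratum_sizes.items()}
    remaining = n_total - sum(alloc.values())
    if remaining < 0:
        raise ValueError("n_total is too small to give every stratum the minimum")
    # Spread what is left in proportion to stratum size, capped at N_h.
    shares = {h: remaining * N_h / N for h, N_h in stratum_sizes.items()}
    for h, N_h in stratum_sizes.items():
        alloc[h] = min(N_h, alloc[h] + int(shares[h]))
    # Hand out leftover units by largest fractional remainder, skipping full strata.
    order = sorted(shares, key=lambda h: shares[h] - int(shares[h]), reverse=True)
    for h in order:
        if sum(alloc.values()) >= n_total:
            break
        if alloc[h] < stratum_sizes[h]:
            alloc[h] += 1
    return alloc
```

The allocation is then no longer strictly proportional, so the analysis would need design weights of N_h / n_h per stratum.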

👍 7
👤 u/cdlm89
📅 Nov 27 2019

Stratified Sampling

I'm trying to build a logistic regression model for some survey data where people list how likely they are to enroll in a program (Not at all, Not very, Somewhat, Very, and Extremely likely) based on factors like age and familiarity with the program.

The problem is, about 40% of people say Somewhat likely, and my models will only predict responses to be Not very likely or Somewhat likely. I'm also dealing with a sample size of around 1100.

I've thought about doing some sort of stratified sampling, where I take a larger sample with replacement of those who say anything other than Somewhat likely or Not very likely, but I don't know how to implement this sort of strategy in R, or how effective it would be with such a small sample.

Basically, I've spent a few days on this, banged my head around, and can't come up with any meaningful way to represent the data. Any help would be appreciated.

👍 2
👤 u/laisant
📅 Jan 07 2019

Combining Convenience Sampling Method with Stratified Random Sampling Method?

Can I combine the two methods?

Or is it still a Convenience Sample if the firms that make up the total sampling population are not totally randomly selected?

👍 3
👤 u/throwbitz
📅 Jan 09 2020

Stratified random sampling

I'm hoping I can get some help figuring out how I can randomly select either the first or second submission from a sample of 20,000 individuals.

Background:

  • I'm working with a dataset of 100,000 submissions made over 2 years. Each row is a submission.
  • Because someone could make a submission each year, about 20,000 of the 100,000 submissions were made by someone submitting one in Year 1 and one in Year 2.
  • It's capped at one submission a year, so there are no individuals with more than 2 submissions.
  • Individuals have unique identifiers (ID)
  • Years are labeled as 1 or 2.

What I need to do:

  • From the 20,000 multiple submission subsample, I need to randomly choose exactly one submission from each individual. I want to end up with a 10,000 submission dataset with one of the two submissions from every repeat submitter.
  • I can't randomly select 10,000. I need to ensure that I have one entry from each of those 10,000 people who submitted two. If I were to randomly select, I might end up with Year 1 and Year 2 submissions for one person, and zero submissions from someone else.
  • For reasons of avoiding bias, I can't just choose everyone's first submission or everyone's second submission. I have to randomly choose first or second.
  • When I have my 10,000 subsample, I will merge it back into the original dataset, creating a 90,000 submission dataset.

So I'm gathering I need to figure out how to do stratified random sampling on the 20,000 submission dataset to get to a 10,000 submission dataset, right? And each person would be a group, and I want to choose one submission from each of the groups.
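The "one random submission per person" step described above can be sketched quite directly: group rows by ID and pick one row at random from each group. The sketch below is in Python and assumes each row is a dict with an `id` key, which is a made-up layout for illustration.

```python
import random
from collections import defaultdict

def one_per_person(rows, seed=0):
    """Group rows by individual and keep exactly one randomly chosen row each."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)
    # rng.choice picks uniformly, so Year 1 and Year 2 are equally likely
    # for anyone who has both.
    return [rng.choice(submissions) for submissions in by_id.values()]
```

Run on the full 100,000-row dataset, this would keep the single submission for one-time submitters and a random one for repeat submitters, which should give the 90,000-row dataset without a separate merge step.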

I found this, but I'm getting lost around steps 4 & 5 when I try to conceptualize applying it to my dataset and ending up with exactly 10,000 submissions from exactly 10,000 people.

Can anyone help me figure this out?

👍 3
📅 Dec 19 2019

Stratified sampling - best approach?

Hello all,

I am very, very new to Stata. I need to figure out the best approach to do stratified random sampling.

Some background: I have ~170 words and their abstractness ratings from a very large word database. Within the 170 words, I grouped them up into five bins based on their abstractness ratings (lowest 20% of ratings to highest 20% of ratings). I want to randomly sample 50 total words for my study; 10 from each bin to ensure all abstractness levels are being represented fairly in the sample.

The first suggestion from my supervisor was to use the command "sample 50, count". However, this appeared to just sample 50 words regardless of which bins they belong to, and the resulting sample had an uneven representation of bins. Intuitively, I feel like my best bet would be to manually sample 10 from each bin somehow.

The bins are labelled. Is there a way I can command Stata to sample from a particular bin by typing something like "sample 10, [bin label]"? Is there a better way to do stratified sampling?
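The underlying operation is "draw a fixed number at random within each group". In Stata itself, I believe `sample` accepts a `by()` option for this (something along the lines of `sample 10, count by(bin)`), but it is worth checking `help sample`. As a language-neutral illustration, here is the same logic as a small Python sketch; the `(word, bin_label)` layout is an assumption.

```python
import random
from collections import defaultdict

def sample_per_bin(words, per_bin=10, seed=0):
    """words: list of (word, bin_label) pairs.
    Returns a dict mapping each bin label to `per_bin` randomly chosen words."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for word, label in words:
        bins[label].append(word)
    # random.sample draws without replacement within each bin
    return {label: rng.sample(members, per_bin) for label, members in bins.items()}
```

With five bins this yields exactly 10 words per bin, 50 in total.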

Thanks so much in advance!

👍 3
👤 u/its_liiiiit_fam
📅 Jan 24 2019

[D] Stratified sampling and tfrecords

Are there any good learning resources on how to implement stratified sampling in tensorflow? And with tf.records specifically?

Closest I've found is Hvass-Labs tutorial series which covers tf.records, but no stratified sampling implementation. https://www.youtube.com/watch?v=oxrcZ9uUblI https://github.com/Hvass-Labs/TensorFlow-Tutorials

👍 4
👤 u/ReacH36
📅 Apr 01 2019

[D] stratified sampling from highly correlated data

What is your go-to strategy when it comes to sampling from rather correlated data, especially where the feature space is rather high-dimensional, e.g. N << D? I have seen approaches that use single-linkage clustering to guarantee a minimal distance between training folds and test sets. However, it seems rather hard to optimize edge distance as a hyperparameter, at least when using AUROC or similar cost functions, as this would again favor correlation between training fold and test fold (or training set and test set).

👍 4
👤 u/N-iodosuccinimide
📅 Nov 15 2017

Can the interval sizing for stratified random sampling be done automatically?

I know that for histogram bin size, the Freedman-Diaconis rule is designed to select a good size without the statistician having to first look at the data and decide their best guess at what the bin size should be.

Similar to choosing the bin sizes, stratified random sampling needs sizes to be chosen for the intervals of prior data that will be the strata when basing it on a numerical variable.

Is the Freedman-Diaconis rule valid for deciding this? Or is there something similar for stratified random sampling? Or does the statistician just have to look at the prior data of the population and make a guess?
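For reference, the Freedman-Diaconis rule sets the bin width to 2 · IQR / n^(1/3). The sketch below only shows what the rule computes; whether the resulting width makes sense as a stratum width is exactly the open question here, and the code does not settle it. The quartiles come from Python's `statistics.quantiles` with its default method, which is one of several quartile conventions.

```python
import statistics

def freedman_diaconis_width(values):
    """Bin width suggested by the Freedman-Diaconis rule: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
    return 2 * (q3 - q1) / len(values) ** (1 / 3)
```

Note that the width shrinks as n grows, so applied to a large population it would suggest many narrow intervals.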


I started wondering about this after seeing a TEDx talk about sortition in which the speaker suggested we use stratification in the selection process. I wondered how, with many continuous variables (like wealth, earnings, and age [and possibly latent spaces of things like location, race, and gender]), the intervals could be decided upon.

👍 5
👤 u/JNCressey
📅 Jul 13 2018

Stratified Sampling - statistics without known population size

Hi,

I'm trying to calculate some information (proportion, std. error) of a stratified random sample without knowing the population size.

I know the value of the sample size, that the sample stratification proportionally reflects the population stratification and that the population could be considered large.

Unfortunately, here is where my process breaks down. All of the formulas I'm familiar with require population size N, and most require sample size n as well, e.g. https://i.imgur.com/4lLWl8n.png

I've considered using the sample size in place of population size. I may be able to get an estimate this way although I'm not sure if this is appropriate.

Calculating the standard error is a bit more difficult, as the formula I have requires both sample and population sizes. If I set them equal, it would produce 0, as (N-n) is a multiplier in the formula.
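A sketch of the usual way around this, assuming the standard stratified estimator of a proportion: the population size only enters through the stratum weights W_h and the finite population correction. With proportional allocation the weights can be taken from the sample (W_h ≈ n_h / n), and for a large population the correction tends to 1 rather than 0. The notation below is introduced for illustration.

```latex
\hat{p}_{st} = \sum_h W_h \hat{p}_h,
\qquad W_h = \frac{N_h}{N} \approx \frac{n_h}{n}

\widehat{\operatorname{SE}}(\hat{p}_{st})
= \sqrt{\sum_h W_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}}
\;\xrightarrow{\;N_h \to \infty\;}\;
\sqrt{\sum_h W_h^2 \, \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}}
```

So rather than substituting n for N (which forces the correction to 0), the correction term is dropped altogether when the population is large relative to the sample.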

I feel like I'm overlooking something very obvious. If anyone could point me in the right direction on how to obtain these statistics without a known population size I'd really appreciate it.

Thanks in advance!

👍 4
👤 u/5k1rm15h
📅 May 13 2018

Stratified Sampling question

So my company has always used a proportional stratified sample weighting methodology that is:

stratum weight = ( Nh / N ) * (n/nh)

Where nh is the sample size for stratum h, Nh is the population size for stratum h, N is total population size, and n is total sample size.
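As a small worked illustration of the weighting formula above, here is a Python sketch. The dict-based layout and the example numbers in the comment are hypothetical.

```python
def stratum_weights(pop_sizes, sample_sizes):
    """weight_h = (N_h / N) * (n / n_h), as in the formula above.
    pop_sizes: stratum -> N_h, sample_sizes: stratum -> n_h."""
    N = sum(pop_sizes.values())
    n = sum(sample_sizes.values())
    return {h: (pop_sizes[h] / N) * (n / sample_sizes[h]) for h in pop_sizes}

# Hypothetical example: an 80/20 population sampled 50/50
weights = stratum_weights({"a": 800, "b": 200}, {"a": 50, "b": 50})
```

A useful check on any variant of this formula is that the weights, summed over all sampled units, add up to the total sample size n; that holds here because the N_h / N shares sum to 1.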

But lately there is a desire to use a different feature to get the proportion (one that isn't a population count at all). So the new formula would be:

stratum weight = ( Xh / X ) * (n/nh)

Where Xh is an aggregate currency value for the stratum and X is the population estimate for this number.

Is this kind of mix ok to do?

Are there things we should consider when doing this?

What is an alternative?

I suggested using the currency value as a further stratification dimension, but data quality/availability is somewhat prohibitive, and it would also likely make the strata very small and expose us to more bias.

Thoughts?

👍 3
👤 u/foshogun
📅 Jun 17 2016

ELI5: What are stratified and multistage area sampling, and how do they work?
👍 2
📅 Jul 06 2017

When should you choose Stratified sampling over random sampling?

Is there an easier way or a general rule to follow?

👍 11
👤 u/Ani10
📅 Dec 05 2015

Stratified sampling question

https://vle.mathswatch.co.uk/images/questions/question1254.png

How do I solve this?

👍 2
👤 u/LibraFraudster
📅 Mar 05 2017

Doing democracy differently: Sortition (with some constraints, such as Stratified Random Sampling) | Brett Hennig youtube.com/watch?v=-FsOH…
👍 3
👤 u/howsci
📅 Nov 06 2017

Smart Way to Do Stratified Random Sampling

Whenever I bring in data for a neural net, I have to convert it into numpy arrays (which is fine), but then I have to use code to break the numpy array into stratified random arrays, and doing this well can take a lot of code. Upon googling, I can find no premade easy library to do this that's compatible with tensorflow/numpy arrays (the sklearn one can't be converted back to numpy arrays as far as I can tell). Does anyone know of one?
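For what it's worth, a stratified split can be done on indices alone, which sidesteps any array conversion: build train and test index lists per class, then index the arrays with them. The sketch below uses only the standard library; the function name and the 20% test fraction are illustrative. As far as I know, scikit-learn's `train_test_split` with its `stratify` argument also returns NumPy arrays when given NumPy arrays, so that may be worth rechecking too.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same
    share in both parts. labels: one class label per row."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                        # random order within the class
        cut = round(len(idxs) * test_fraction)   # rows of this class sent to test
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return train, test
```

With NumPy arrays `X` and `y`, the returned lists can be used directly as `X[train]`, `y[train]`, `X[test]`, `y[test]`.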

If not I'll make my own optimized little module for it and publish it to save the rest of the world (and me in the future) some time.

👍 5
👤 u/inferno596
📅 Jun 30 2017

Machine learning for stratified sampling

I am trying to downsample extremely uneven data sets. I have:

  • One with 40 million rows of data
  • The other with 4K rows of data

I want to downsample in a meaningful way and compare the model results of XGBoost, ANN, etc to the same models trained on purely random sampling.

I am using Kmeans and randomly sampling proportionally within each cluster.

Do you know of any other ways to accomplish this?

👍 2
👤 u/CkmCpvis
📅 Jun 27 2020

Question About Stratified and Cluster sampling

With stratified sampling, if you want to sample the population of the United States, you would take a random sample of people from each state.

With cluster sampling, you would group all the states into random groups of 2 or more states, randomly sample one or more of these groups, and all the people in them would be your sample?
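A toy sketch of the contrast being asked about, under the common textbook reading: in stratified sampling every group contributes some randomly chosen members, while in one-stage cluster sampling a random subset of groups is chosen and everyone in those groups is included. In practice the clusters are usually naturally occurring groups (states or counties themselves, say) rather than random groupings of states. The function names and data layout are made up for illustration.

```python
import random

def stratified_sample(groups, per_group, seed=0):
    """Every group (stratum) contributes `per_group` randomly chosen members."""
    rng = random.Random(seed)
    return [p for members in groups.values() for p in rng.sample(members, per_group)]

def cluster_sample(groups, n_clusters, seed=0):
    """`n_clusters` whole groups are chosen at random; all their members are kept."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(groups), n_clusters)
    return [p for g in picked for p in groups[g]]
```

With five groups of ten people, `stratified_sample(groups, 2)` returns 10 people covering all five groups, while `cluster_sample(groups, 2)` returns all 20 people from just two groups.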

👍 2
👤 u/stagforce
📅 Jan 11 2019
