Sampling Simplified!

Arkita Toshniwal

October 12, 2020

Sampling is the process of drawing a predetermined number of observations from a larger population, allowing us to learn about the population from a subset of it.

Let’s understand the relevance of sampling with a case study. During the 1936 Presidential election, which pitted Landon against Roosevelt, the Literary Digest conducted a very large poll, sending mail to about 10 million people. It received 2.4 million answers and predicted, with very high confidence, that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes.

Why?

If you noticed, only 2.4 million people responded, which is less than 25% of those who received the mail. The prediction therefore rested only on the people who chose to answer, and they need not resemble the whole electorate. This is a sampling bias (specifically, nonresponse bias).
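
The effect of a low, uneven response rate can be illustrated with a small simulation. This is only a sketch: the electorate size and the response rates below are made-up assumptions for illustration, not figures from the actual 1936 poll.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical electorate: 62% support candidate A, 38% support candidate B.
population = ["A"] * 62_000 + ["B"] * 38_000
true_share = population.count("A") / len(population)

# Assumed (made-up) response rates: B supporters mail the poll back more often.
response_rate = {"A": 0.15, "B": 0.40}

# Large mail poll: everyone gets a ballot, but only some people answer.
responses = [v for v in population if rng.random() < response_rate[v]]
poll_estimate = responses.count("A") / len(responses)

# Small simple random sample in which every selected person answers.
small_sample = rng.sample(population, 1_000)
sample_estimate = small_sample.count("A") / len(small_sample)

print(f"true share of A:         {true_share:.3f}")
print(f"mail poll (n={len(responses)}):  {poll_estimate:.3f}")
print(f"random sample (n=1000):  {sample_estimate:.3f}")
```

Under these assumptions, the much larger mail poll lands far from the true share, while the small random sample lands close to it. Sample size alone does not fix a biased sample.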

Often, while working on machine learning problems, you might come across such sampling biases, which lead to false predictions. You can reduce this risk by applying the right sampling technique.

Types of Sampling

  • Simple Random Sampling – Here, every individual is chosen entirely by chance and each member of the population has an equal chance of being selected. It is the most direct method of probability sampling. Steps to sample your data:
    • Identify and define target population
    • Select Sampling Frame
    • Select your sampling method
    • Define sampling size
    • Collect the required data
  • Systematic Sampling – Here, the first individual is selected randomly, and the others are selected using a fixed ‘sampling interval’. Say our population size is x and we must select a sample of size n. Then the sampling interval is x/n, and each subsequent individual is x/n positions after the previous one. Steps to sample your data:
    • Calculate the sampling interval
    • Select a random start between 1 and the sampling interval
    • Repeatedly add the sampling interval to select subsequent individuals
  • Stratified Sampling – Here, the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the sample (for example, a test set) is representative of the overall population. Steps to sample your data:
    • Divide the population into smaller subgroups, or strata, based on the members’ shared attributes and characteristics
    • Take a random sample from each stratum in a number that is proportional to the size of the stratum
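
The steps for simple random and systematic sampling above can be sketched in plain Python. The function names and the numbered sampling frame are hypothetical, chosen only for illustration, and the sketch assumes the sample size is no larger than the population.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Pick n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=None):
    """Pick a random start, then every k-th member, with k = len(population) // n."""
    rng = random.Random(seed)
    interval = len(population) // n   # sampling interval (x/n in the text)
    start = rng.randrange(interval)   # 0-based random start inside the first interval
    return [population[start + i * interval] for i in range(n)]


# Hypothetical sampling frame: 1,000 numbered individuals.
frame = list(range(1, 1001))

random_picks = simple_random_sample(frame, 10, seed=1)
systematic_picks = systematic_sample(frame, 10, seed=1)

print(sorted(random_picks))
print(systematic_picks)
```

With a frame of 1,000 and a sample size of 10, the sampling interval is 100, so the systematic picks are evenly spaced 100 positions apart, while the simple random picks have no fixed spacing.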

In most cases, stratified sampling works like a charm. With scikit-learn, it is as simple as adding one keyword argument when splitting your data.

       from sklearn.model_selection import train_test_split

       # X holds the features and y the labels; stratify=y keeps the class proportions of y in both splits
       X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
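
To see what the `stratify` argument does, here is a minimal sketch on made-up, imbalanced labels. The data and the `random_state` value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 180 samples of class 0 and 20 of class 1.
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 180 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Share of class 1 in the full data and in each split.
print("overall:", y.mean())
print("train:  ", y_train.mean())
print("test:   ", y_test.mean())
```

In this sketch, class 1 makes up 10% of the full data, and the stratified split keeps that same share in both the training and the test set.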

During this pandemic, a Sero Survey is being conducted, which involves testing people to identify antibodies against the virus.

The enhanced exercise covers health facilities in each district, where people are tested. In such cases, when you want to identify antibodies, it becomes important to pick a sample that includes people of different age groups, genders, and areas.

So, let’s go through each sampling method you just read about, one by one, and find out which suits this case best.

When you choose simple random sampling, there is a real chance of ending up with mostly people of the same age or gender, or from the same area/zone. You have no control over this because your samples are picked purely by chance, so you might not get an accurate idea of the number of infections.
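
How much a simple random sample can drift from the population can be checked with a quick simulation. This is a sketch on a made-up district; the group sizes and sample size are assumptions for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical district: 10% of residents are in the "60+" age group.
residents = ["60+"] * 1_000 + ["under 60"] * 9_000
population_share = residents.count("60+") / len(residents)

# Draw many simple random samples of 50 people and record the 60+ share in each.
shares = []
for _ in range(1_000):
    sample = rng.sample(residents, 50)
    shares.append(sample.count("60+") / len(sample))

print(f"population share of 60+: {population_share:.2f}")
print(f"smallest sample share:   {min(shares):.2f}")
print(f"largest sample share:    {max(shares):.2f}")
```

Some samples contain far fewer people aged 60+ than the population does and some contain far more, which is exactly the risk described above.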

Next, consider systematic sampling. Because of the sampling interval, the sample is spread across the population, which may give you a better result than simple random sampling. Still, it does not guarantee that the sample is representative of the overall population.

For such cases, where your sample needs to act as a representative of the overall population, stratified sampling works best. It picks the right number of samples from each stratum (groups based on age, gender, and area) to give you a picture of your population. Here, your sampled data set reflects the proportions of all strata in the population.

However, stratified sampling requires proper knowledge of the characteristics of the population to create strata.
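
Proportional stratified sampling can be sketched in plain Python as follows. The `stratified_sample` helper, the age groups, and their sizes are hypothetical and serve only as an illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Sample the same fraction from every stratum, so strata stay proportional."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # group records by stratum

    sample = []
    for members in strata.values():
        k = round(fraction * len(members))  # proportional allocation for this stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical survey frame: (person_id, age_group) pairs.
people = (
    [(i, "0-17") for i in range(200)]
    + [(i, "18-59") for i in range(200, 800)]
    + [(i, "60+") for i in range(800, 1000)]
)

sample = stratified_sample(people, key=lambda p: p[1], fraction=0.1, seed=3)
print(Counter(age for _, age in sample))
```

Each age group contributes 10% of its members, so the sample keeps the 20/60/20 split of the frame. When stratum sizes do not divide evenly, rounding can make the total sample size differ slightly from the target.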

Model accuracy depends heavily on how the data is sampled. No matter how good your algorithm is, if your data isn’t sampled properly, it can give you good accuracy on your training data but not on real-world data. A good sampling strategy can sometimes pull the whole project forward, while a bad one can give us incorrect results.
