Data sampling is an essential part of most research, scientific, and data experiments. It is one of the most important factors determining the accuracy of a research or survey result: if the sample has not been drawn properly, the final results and conclusions can be significantly affected. Many sampling techniques can be used to gather a data sample, and the right one depends on the need and the situation. In this blog post, I will cover the following topics:
– Terminology: Population and Sample
– Random Sampling
– Systematic Sampling
– Cluster Sampling
– Weighted Sampling
– Stratified Sampling
Introduction to Population and Sample
To start with, let’s have a look at some basic terminology. It is important to learn the concepts of population and sample. The population is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.
Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.
Random Sampling
The simplest data sampling technique that creates a random sample from the original population is Random Sampling. In this approach, every observation in the population has the same probability of being selected during the sample generation process. Random Sampling is usually used when we don’t have any prior information about the target population.
For example, consider the random selection of 3 individuals from a population of 10 individuals. Each individual has an equal chance of ending up in the sample: the probability of being picked on any single draw is 1/10, and when the 3 individuals are drawn without replacement, the overall probability of being included in the sample is 3/10.
Random Sampling: Python Implementation
First, we generate random data that will serve as population data. Specifically, we draw 10K data points from a Normal distribution with mean mu = 10 and standard deviation std = 2. After this, we create a Python function called random_sampling() that takes the population data and the desired sample size and produces a random sample as output.
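The steps above can be sketched as follows. This is a minimal sketch, not necessarily the original implementation: it assumes NumPy, sampling without replacement, and a hypothetical seed argument added only for reproducibility.

```python
import numpy as np


def random_sampling(population, sample_size, seed=None):
    """Draw a simple random sample without replacement (assumed behaviour)."""
    rng = np.random.default_rng(seed)
    # Every element of the population is equally likely to be picked
    return rng.choice(population, size=sample_size, replace=False)


# Population: 10K draws from a Normal distribution with mu = 10, std = 2
rng = np.random.default_rng(2023)
population = rng.normal(loc=10, scale=2, size=10_000)

sample = random_sampling(population, sample_size=500, seed=1)
print(len(sample))
print(sample.mean(), population.mean())
```

With a sample of a few hundred points, the sample mean should land close to the population mean of about 10.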
Systematic Sampling
Systematic sampling is a probability sampling approach in which elements of the target population are selected starting from a random point and then at a fixed sampling interval.
Stated differently, systematic sampling is a form of probability sampling in which members of the population are selected at regular intervals to form a sample. We calculate the sampling interval by dividing the population size by the desired sample size.
Note that Systematic Sampling usually produces a random sample, but it does not address bias in the created sample.
Systematic Sampling: Python Implementation
We generate the population data as in the previous case. We then create a Python function called systematic_sample() that takes the population data and the sampling interval and produces a systematic sample as output.
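A function along these lines might look like the sketch below. The random starting point within the first interval and the seed argument are assumptions of this sketch; the interval is computed as population size divided by desired sample size, as described above.

```python
import numpy as np


def systematic_sample(population, interval, seed=None):
    """Pick a random start in the first interval, then every interval-th element."""
    rng = np.random.default_rng(seed)
    start = rng.integers(0, interval)  # random start in [0, interval)
    return population[start::interval]


# Same kind of population as before: 10K draws from Normal(mu = 10, std = 2)
rng = np.random.default_rng(2023)
population = rng.normal(loc=10, scale=2, size=10_000)

desired_sample_size = 500
interval = len(population) // desired_sample_size  # 10_000 // 500 = 20

sample = systematic_sample(population, interval, seed=1)
print(interval, len(sample))
```

Because the population size here is an exact multiple of the interval, every possible starting point yields the same sample size.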
Cluster Sampling
Cluster sampling is a probability sampling technique where we divide the entire population into multiple clusters (groups) based on certain clustering criteria. Then we randomly select one or more clusters using simple random or systematic sampling.
For example, suppose you want to conduct an experiment evaluating the performance of sophomores in business education across Europe. It is impossible to conduct an experiment that involves a student from every university across the EU. Instead, by using Cluster Sampling, we can group the universities of each country into one cluster. Together, these clusters cover the entire sophomore student population in the EU. Next, you can use simple random sampling or systematic sampling to select the cluster(s) for your research study.
Cluster Sampling: Python Implementation
First, we generate data that will serve as population data. It has 10K observations and consists of the following 4 variables:
- id: an identifier for each observation
- price: generated using a Uniform distribution
- event_type: a categorical variable with 3 possible values {type1, type2, type3}
- click: a binary variable taking values {0: no click, 1: click}
        id     price event_type  click
0        0  1.767794      type2      0
1        1  2.974360      type2      0
2        2  2.903518      type2      0
3        3  3.699454      type2      1
4        4  1.416739      type1      0
...    ...       ...        ...    ...
9995  9995  3.689656      type2      1
9996  9996  1.929186      type3      0
9997  9997  2.393509      type3      1
9998  9998  1.276473      type2      1
9999  9999  3.959585      type2      1

[10000 rows x 4 columns]
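Population data of this shape could be generated as in the sketch below. The helper name get_population_data(), the price range of 1 to 4, and the equal probabilities for event_type and click are assumptions made for illustration; the original parameters are not stated, so the values will not match the table above.

```python
import numpy as np
import pandas as pd


def get_population_data(n=10_000, seed=2023):
    """Hypothetical helper: build a population with the 4 variables listed above."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "id": np.arange(n),
        # Assumed range; the source only says the price is Uniform
        "price": rng.uniform(1, 4, size=n),
        # Assumed equal probability for each event type
        "event_type": rng.choice(["type1", "type2", "type3"], size=n),
        # Assumed 50/50 split between no click (0) and click (1)
        "click": rng.integers(0, 2, size=n),
    })


data = get_population_data()
print(data.shape)
print(data.head())
```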
Then the function get_clustered_Sample() takes as inputs the original data, the number of observations per cluster, and the number of clusters you want to select, and produces a clustered sample as output.
        id     price event_type  click  cluster
4847  4847  3.813680      type3      0       17
567    567  1.642347      type2      0       17
8982  8982  3.741744      type3      1       17
2321  2321  2.192724      type3      0       17
5045  5045  3.645671      type2      0       17
...    ...       ...        ...    ...      ...
5681  5681  3.175308      type1      0       90
882    882  2.676477      type2      1       90
2090  2090  3.861775      type3      1       90
907    907  1.947100      type3      0       90
2723  2723  2.557626      type1      0       90

[200 rows x 5 columns]
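A function with this interface could be sketched as follows. It assumes that clusters are formed by shuffling the rows and cutting them into equally sized blocks, which is consistent with the numeric cluster column in the output above but is not confirmed by the source. The parameter names and the seed argument are hypothetical, and the demo data is regenerated under the same assumptions as before.

```python
import numpy as np
import pandas as pd


def get_clustered_Sample(df, n_per_cluster, num_select_clusters, seed=None):
    """Assign rows to equally sized random clusters, then keep a few clusters."""
    rng = np.random.default_rng(seed)
    n_clusters = len(df) // n_per_cluster
    # Shuffle the row positions and drop any remainder that does not fill a cluster
    order = rng.permutation(len(df))[: n_clusters * n_per_cluster]
    clustered = df.iloc[order].copy()
    # Consecutive blocks of n_per_cluster rows share one cluster id
    clustered["cluster"] = np.repeat(np.arange(n_clusters), n_per_cluster)
    # Simple random selection of whole clusters
    chosen = rng.choice(n_clusters, size=num_select_clusters, replace=False)
    return clustered[clustered["cluster"].isin(chosen)]


# Demo population (assumed parameters, for illustration only)
rng = np.random.default_rng(2023)
n = 10_000
data = pd.DataFrame({
    "id": np.arange(n),
    "price": rng.uniform(1, 4, size=n),
    "event_type": rng.choice(["type1", "type2", "type3"], size=n),
    "click": rng.integers(0, 2, size=n),
})

sample = get_clustered_Sample(data, n_per_cluster=100, num_select_clusters=2, seed=1)
print(sample.shape)  # (200, 5)
```

In a real study the clusters would usually be natural groups such as countries or universities; the random blocks here only mimic the structure of the example output.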
Weighted Sampling
In some experiments, you might need each item's sampling probability to be proportional to a weight associated with that item, that is, the proportions of the types of observations should be taken into account. For example, you might need a sample of search engine queries in which each query is weighted by the number of times it has been performed, so that the sample can be analyzed for overall impact on the user experience. In this case, Weighted Sampling is preferable to Random Sampling or Systematic Sampling.
Weighted Sampling is a data sampling method with weights that is intended to compensate for the selection of specific observations with unequal probabilities (oversampling), non-coverage, non-responses, and other types of bias. If a biased data set is not adjusted and a simple random sampling type of approach is used instead, then the population descriptors (e.g., mean, median) will be skewed and will fail to correctly represent the population.
Weighted Sampling addresses the bias in the sample by creating a sample that takes into account the proportions of the types of observations in the population. Hence, Weighted Sampling usually produces a random and unbiased sample.
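The search query example can be illustrated with a short sketch. The queries, their counts, and the function name weighted_item_sample() are hypothetical; the only idea taken from the text is that an item's selection probability is proportional to its weight.

```python
import numpy as np


def weighted_item_sample(items, weights, sample_size, seed=None):
    """Sample items with probability proportional to their weights."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    probabilities = weights / weights.sum()  # normalise weights to sum to 1
    # Sampling with replacement: frequent queries can appear many times
    return rng.choice(items, size=sample_size, replace=True, p=probabilities)


# Hypothetical queries and the number of times each was performed
queries = ["weather", "news", "python tutorial", "rare query"]
counts = [5000, 3000, 1900, 100]

sample = weighted_item_sample(queries, counts, sample_size=1000, seed=1)
values, frequencies = np.unique(sample, return_counts=True)
print(dict(zip(values, frequencies)))
```

Frequently performed queries dominate the sample, which is the point when the goal is to measure overall impact on users.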
Weighted Sampling: Python Implementation
The function get_weighted_sample() takes as inputs the original data and the desired sample size, and produces a weighted sample as output. Note that the proportions in this case are defined based on the click event. That is, we compute the proportion of data points with click = 1 (let’s say X%) and with click = 0 (Y%, where Y% = 100 - X%), and then we generate a random sample that also contains X% observations with click = 1 and Y% observations with click = 0.
                 id     price event_type  click
event_type
type1      0   6780  1.200188      type1      1
           1   8830  2.990630      type1      1
           2   8997  3.483728      type1      0
           3   7541  2.402993      type1      1
           4   4460  2.959203      type1      0
...             ...       ...        ...    ...
type3      29  5058  3.426289      type3      1
           30  5855  3.852197      type3      0
           31  6295  2.679898      type3      0
           32  8978  1.115072      type3      1
           33  7730  1.208441      type3      1

[100 rows x 4 columns]
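The proportion-preserving idea described for get_weighted_sample() could be sketched as below. This is an assumption-laden sketch rather than the original function: the by and seed parameters are hypothetical, rounding may leave the sample one row off the requested size, and the output layout will differ from the table above.

```python
import numpy as np
import pandas as pd


def get_weighted_sample(df, sample_size, by="click", seed=None):
    """Sample so that the share of each value of `by` matches the population."""
    proportions = df[by].value_counts(normalize=True)
    parts = []
    for value, share in proportions.items():
        n_rows = int(round(share * sample_size))  # rows to draw for this value
        parts.append(df[df[by] == value].sample(n=n_rows, random_state=seed))
    return pd.concat(parts)


# Demo population (assumed parameters, for illustration only)
rng = np.random.default_rng(2023)
n = 10_000
data = pd.DataFrame({
    "id": np.arange(n),
    "price": rng.uniform(1, 4, size=n),
    "event_type": rng.choice(["type1", "type2", "type3"], size=n),
    "click": rng.integers(0, 2, size=n),
})

sample = get_weighted_sample(data, sample_size=100, seed=1)
print(data["click"].mean(), sample["click"].mean())
```

The two printed click rates should agree up to rounding.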
Stratified Sampling
Stratified Sampling is a data sampling approach where we divide a population into homogeneous subpopulations called strata, based on specific characteristics (e.g., age, race, gender identity, location, event type).
Every member of the population studied should be in exactly one stratum. Each stratum is then sampled using another probability sampling method (in this post, Cluster Sampling), allowing data scientists to estimate statistical measures for each subpopulation. We rely on Stratified Sampling when the population’s characteristics are diverse and we want to ensure that every characteristic is properly represented in the sample.
So, as implemented in this post, Stratified Sampling is simply the combination of Clustered Sampling and Weighted Sampling.
Stratified Sampling: Python Implementation
The function get_stratified_sample() takes as inputs the original data, the desired sample size, and the number of clusters needed, and produces a stratified sample as output. Note that this function first performs weighted sampling using the click event and then performs clustered sampling using the event_type.
       id     price event_type  click  cluster
0    5131  2.707995      type1      0       45
1    5102  1.677190      type1      0       45
2    7370  1.893061      type1      0       45
3    4207  2.491246      type1      0       45
4    8909  3.252655      type1      1       45
..    ...       ...        ...    ...      ...
96   3254  2.637625      type3      0       85
97   1555  1.196040      type3      1       85
98   7627  3.240507      type3      1       85
99   6405  1.607379      type3      0       85
100  1075  2.471806      type3      0       85

[202 rows x 5 columns]
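For comparison, here is a sketch of the textbook form of proportional stratified sampling. It is deliberately not a reconstruction of get_stratified_sample(): instead of combining weighted and clustered sampling, it treats event_type as the stratum and draws a simple random sample of the same fraction from every stratum. The function name, parameters, and demo data are hypothetical, and DataFrameGroupBy.sample requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd


def proportional_stratified_sample(df, sample_size, stratum_col="event_type", seed=None):
    """Draw the same fraction of rows from every stratum."""
    fraction = sample_size / len(df)
    # Each stratum contributes roughly fraction * (stratum size) rows
    return df.groupby(stratum_col, group_keys=False).sample(
        frac=fraction, random_state=seed
    )


# Demo population (assumed parameters, for illustration only)
rng = np.random.default_rng(2023)
n = 10_000
data = pd.DataFrame({
    "id": np.arange(n),
    "price": rng.uniform(1, 4, size=n),
    "event_type": rng.choice(["type1", "type2", "type3"], size=n),
    "click": rng.integers(0, 2, size=n),
})

sample = proportional_stratified_sample(data, sample_size=100, seed=1)
print(sample["event_type"].value_counts(normalize=True).sort_index())
print(data["event_type"].value_counts(normalize=True).sort_index())
```

Every event type appears in the sample in roughly the same share as in the population, which is the guarantee stratification is meant to provide.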
Conclusion
In this blog post, we covered the concepts of population and sample and walked through five data sampling techniques: Random Sampling, Systematic Sampling, Cluster Sampling, Weighted Sampling, and Stratified Sampling.
Random Sampling is the simplest option and is usually used when we have no prior information about the target population. Systematic Sampling selects observations at a fixed interval from a random starting point. Cluster Sampling divides the population into clusters and randomly selects whole clusters. Weighted Sampling and Stratified Sampling take the proportions of the types of observations into account so that the sample represents the population properly.
Which technique to use depends on the need and the situation, but the goal is always the same: a sample that is a true, unbiased representation of the population.
About the Author — That’s Me!
I am Tatev, Senior Machine Learning and AI Researcher. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands.
With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelor's and Master's, along with over 5 years of hands-on experience in the Data Science industry, in Machine Learning and AI including NLP, LLMs and GenAI, I’ve gathered this knowledge to share with you.
Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning and AI, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!