Python Random Sample from List: Without Replacement, With Replacement & NumPy


Picking random elements from a list is one of those tasks that sounds trivial until you need to do it correctly. Should the same element be able to appear twice? Does order matter? Do some elements have a higher probability of being picked? Should the result be reproducible for debugging?

Each of these questions points to a different Python tool. random.sample() gives you sampling without replacement. random.choices() gives you sampling with replacement. numpy.random.choice() handles both and works efficiently on large arrays. sklearn.utils.random.sample_without_replacement() returns integer indices rather than elements and is aimed at sampling from large populations inside machine learning pipelines.

This guide covers all four methods in detail (syntax, parameters, working code, edge cases, and when to use each), plus a manual shuffle-based approach, seeding for reproducibility, weighted sampling, a performance comparison, and five real-world use cases from data science and software development. Board Infinity's guide on boolean in Python covers the foundational Python data types that underpin how random functions handle edge cases like empty lists and duplicate elements.

Who This Guide Is For

Sampling With vs Without Replacement: The Core Distinction

Before looking at any code, understanding the difference between sampling with and without replacement is essential - it determines which tool to use for every sampling task.

In most data science and machine learning contexts, sampling without replacement is the correct default - you want each data point to appear at most once in a sample to avoid bias. Sampling with replacement is specifically used for bootstrapping (estimating confidence intervals by resampling) and data augmentation (artificially expanding training datasets). If in doubt, without replacement is almost always what you need.
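The distinction can be seen in a few lines. The following is a minimal sketch using only the standard library; the `population` list is an illustrative placeholder.

```python
import random

population = ["a", "b", "c", "d"]

# Without replacement: each position in the population is used at most once,
# so k cannot exceed len(population).
without = random.sample(population, k=3)

# With replacement: every draw sees the full population again,
# so repeats are possible and k may exceed len(population).
with_repl = random.choices(population, k=6)

print(without)    # three distinct letters
print(with_repl)  # six letters, repeats guaranteed because 6 > 4
```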

Method 1: random.sample() - Python Random Sample Without Replacement

random.sample() is the standard Python library function for random sampling without replacement. It returns a new list containing k unique elements chosen randomly from the population, without modifying the original.

Syntax and Parameters
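As a quick reference, the signature is summarised in the comments of the sketch below. The `counts` parameter requires Python 3.9 or later; the `colors` list is an illustrative placeholder.

```python
import random

# random.sample(population, k, *, counts=None)
#   population: a sequence (list, tuple, str, range, ...)
#   k:          number of elements to draw, with 0 <= k <= len(population)
#   counts:     optional repeat count per element (Python 3.9+)

colors = ["red", "blue"]

# With counts=[3, 2] this behaves like sampling from
# ["red", "red", "red", "blue", "blue"].
draw = random.sample(colors, k=4, counts=[3, 2])
print(draw)
```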

Basic Usage - Sample from List
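A minimal sketch of basic usage, assuming a small illustrative `fruits` list. The output differs on every run, so no specific result is shown.

```python
import random

fruits = ["apple", "banana", "cherry", "date", "elderberry"]

# Draw three distinct elements; the result is a new list in random order.
picked = random.sample(fruits, 3)
print(picked)

# The original list is left untouched.
print(fruits)

# Any sequence works: strings yield a list of characters,
# tuples yield a list of their items.
letters = random.sample("python", 2)
numbers = random.sample((10, 20, 30, 40), 2)
print(letters, numbers)
```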

Edge Cases and Error Handling
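The sketch below walks through the edge cases most likely to come up: a sample size larger than the population, a sample size of zero, a full-length sample, and input containing duplicate values. The `items` list is an illustrative placeholder.

```python
import random

items = [1, 2, 3]

# k larger than the population raises ValueError (so does a negative k).
error_message = None
try:
    random.sample(items, 5)
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", exc)

# k = 0 is valid and returns an empty list.
empty_draw = random.sample(items, 0)

# k = len(items) returns a random permutation of the whole list.
full_draw = random.sample(items, len(items))

# Duplicate values sit at distinct positions, so both 1s can be selected.
dupes = random.sample([1, 1, 2], 3)

print(empty_draw, full_draw, dupes)
```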

If you need to sample k numbers from 0 to N-1 where N is very large (millions), do NOT create a list first. Use random.sample(range(N), k) instead. The range object uses O(1) memory regardless of N, while list(range(N)) creates all N integers in memory. This is the standard Python idiom for memory-efficient index sampling.
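A short sketch of the range-based idiom, with `N` and `k` chosen arbitrarily for illustration:

```python
import random

N = 10_000_000  # population size; no list of this length is ever built
k = 5

# range(N) is a lazy sequence, so this draws k distinct integers
# from 0 to N-1 without materialising all N values.
indices = random.sample(range(N), k)
print(indices)
```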

Method 2: random.choices() - Python Random Sample With Replacement

random.choices() (added in Python 3.6) selects k elements from the population with replacement - the same element can be chosen multiple times. It also supports weighted sampling, where some elements have a higher probability of being selected than others.
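The following sketch shows uniform and weighted draws. The `outcomes` list and the weight values are illustrative assumptions, not recommendations.

```python
import random

outcomes = ["win", "draw", "loss"]

# Uniform sampling with replacement: repeats are expected.
uniform = random.choices(outcomes, k=10)

# Relative weights: "loss" is 8 times as likely as "win" or "draw".
# Weights do not need to sum to 1.
weighted = random.choices(outcomes, weights=[1, 1, 8], k=10)

# The same distribution expressed as cumulative weights.
cumulative = random.choices(outcomes, cum_weights=[1, 2, 10], k=10)

print(uniform)
print(weighted)
print(cumulative)
```

Passing both weights and cum_weights in the same call raises a TypeError, so pick one form.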

random.sample() vs random.choices(): Key Differences

Method 3: numpy.random.choice() - For Arrays and Data Science

numpy.random.choice() is the NumPy equivalent for random sampling and is the preferred tool in data science contexts when working with NumPy arrays. It supports both with and without replacement, weighted sampling, and works efficiently with large datasets.

In NumPy 1.17 and later, the recommended approach is to use the Generator API: rng = numpy.random.default_rng(seed=42) then rng.choice(population, size=k, replace=False). The Generator uses the PCG64 bit generator by default, which has better statistical properties than the Mersenne Twister (MT19937) behind the legacy API, and it keeps random state in an explicit object rather than in a module-level global. The legacy np.random.seed() / np.random.choice() API still works but is no longer recommended for new code.
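A minimal sketch of the Generator API, assuming NumPy 1.17 or later is installed. The `data` array and the probability values are illustrative.

```python
import numpy as np

# One explicit generator object carries the random state.
rng = np.random.default_rng(seed=42)

data = np.arange(100, 110)

# Without replacement: four distinct values from the array.
no_repeat = rng.choice(data, size=4, replace=False)

# With replacement: size may exceed len(data).
with_repeat = rng.choice(data, size=15, replace=True)

# Weighted sampling: p must have one probability per element and sum to 1.
weighted = rng.choice(["a", "b", "c"], size=5, p=[0.5, 0.3, 0.2])

print(no_repeat)
print(with_repeat)
print(weighted)
```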

Method 4: sklearn.utils.random.sample_without_replacement() - For ML Pipelines

sklearn.utils.random.sample_without_replacement() is a specialised function from scikit-learn designed for high-performance sampling without replacement in machine learning pipelines. It samples integer indices rather than actual elements, and offers three algorithmic methods optimised for different population/sample size ratios.

Unlike random.sample() which returns actual elements, sklearn's sample_without_replacement returns an array of integer indices from 0 to n_population-1. Use these to index into your data: data[indices] or df.iloc[indices]. This design decouples the sampling logic from the data structure and works efficiently with any array-like object in scikit-learn's ecosystem.
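The index-based workflow can be sketched as follows, assuming scikit-learn and NumPy are installed. The `data` array is an illustrative placeholder.

```python
import numpy as np
from sklearn.utils.random import sample_without_replacement

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Returns distinct integer indices in the range [0, n_population).
indices = sample_without_replacement(
    n_population=len(data),
    n_samples=3,
    random_state=0,  # fixed seed for repeatable output
)

# Use the indices to select rows from the actual data.
subset = data[indices]

print(indices)
print(subset)
```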

Seeding for Reproducibility

Reproducibility is non-negotiable in data science and machine learning. When you sample training data, split datasets, or generate random batches, you need to guarantee that running the same code tomorrow produces the same result as today.
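A minimal sketch of two ways to seed the standard library generator. The seed values are arbitrary; a dedicated random.Random instance is one way to avoid touching the module-level state that other code may share.

```python
import random

# Option 1: seed the module-level generator.
random.seed(42)
first = random.sample(range(100), 5)

random.seed(42)
second = random.sample(range(100), 5)

print(first == second)  # True: same seed, same sequence of draws

# Option 2: an isolated generator that does not affect global state.
rng = random.Random(7)
a = rng.sample(range(100), 5)
b = random.Random(7).sample(range(100), 5)

print(a == b)  # True
```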

Method 5: Manual Sampling Using Shuffle

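For cases requiring complete control or environments without NumPy, you can implement sampling manually: shuffle a copy of the list with random.shuffle(), which shuffles in place and returns None, then take the first k elements.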
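One way to sketch the shuffle-based approach is shown below. The helper name `sample_by_shuffle` is hypothetical; it copies the input first because random.shuffle() reorders a list in place.

```python
import random

def sample_by_shuffle(population, k):
    """Return k elements without replacement by shuffling a copy."""
    if not 0 <= k <= len(population):
        raise ValueError("k must be between 0 and len(population)")
    pool = list(population)  # copy so the caller's list is not reordered
    random.shuffle(pool)     # in-place shuffle of the copy
    return pool[:k]

deck = ["A", "K", "Q", "J", "10"]
hand = sample_by_shuffle(deck, 2)
print(hand)
print(deck)  # unchanged
```

This shuffles the whole population even when k is small, so random.sample() is usually the better choice when k is much smaller than the list.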

Real-World Use Cases

Use Case 1: Train/Test Split Without Replacement
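A minimal standard-library sketch of a train/test split by index. The helper name `train_test_split_indices`, the 20% test fraction, and the seed are illustrative assumptions.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Split row indices into disjoint train and test index lists."""
    rng = random.Random(seed)
    n_test = int(n_rows * test_fraction)
    # Sampling without replacement guarantees no row lands in the test set twice.
    test_idx = set(rng.sample(range(n_rows), n_test))
    train_idx = [i for i in range(n_rows) if i not in test_idx]
    return train_idx, sorted(test_idx)

records = [f"row-{i}" for i in range(10)]
train_idx, test_idx = train_test_split_indices(len(records))

train = [records[i] for i in train_idx]
test = [records[i] for i in test_idx]
print(len(train), len(test))
```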

Use Case 2: Bootstrap Sampling With Replacement
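The sketch below estimates a percentile bootstrap confidence interval for a mean. The helper name `bootstrap_mean_ci`, the number of resamples, and the sample values are illustrative assumptions; this is a simple percentile method, not the only way to build a bootstrap interval.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Each resample has the same size as the data and is drawn WITH replacement.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

measurements = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.2, 3.0]
lower, upper = bootstrap_mean_ci(measurements)
print(lower, upper)
```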

Use Case 3: A/B Test Group Assignment
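One possible sketch of a balanced A/B assignment: shuffle the users once, then cut the list in half. The helper name `assign_ab`, the user IDs, and the seed are hypothetical.

```python
import random

def assign_ab(user_ids, seed=2024):
    """Randomly split users into two groups whose sizes differ by at most one."""
    rng = random.Random(seed)
    # Sampling all elements without replacement yields a shuffled copy.
    shuffled = rng.sample(user_ids, k=len(user_ids))
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}

users = [f"user-{i}" for i in range(9)]
groups = assign_ab(users)
print(len(groups["A"]), len(groups["B"]))
```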

Use Case 4: Data Augmentation With Weighted Sampling
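A minimal sketch of weighted oversampling, assuming the goal is to draw under-represented classes more often. The labels, the inverse-frequency weighting, and the sample count are illustrative assumptions, not a prescribed augmentation scheme.

```python
import random
from collections import Counter

# Hypothetical imbalanced dataset: (sample_id, label) pairs.
labels = ["cat"] * 8 + ["dog"] * 2
samples = list(enumerate(labels))

# Weight each sample by the inverse of its class frequency,
# so each class has roughly equal total weight.
freq = Counter(labels)
weights = [1 / freq[label] for _, label in samples]

rng = random.Random(1)
augmented = rng.choices(samples, weights=weights, k=20)

print(Counter(label for _, label in augmented))
```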

Use Case 5: Lottery Draw
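A lottery draw is sampling without replacement from a range of numbers. The sketch below assumes a 6-from-49 format and a hypothetical helper name `draw_lottery`; it defaults to random.SystemRandom, which draws from the operating system's entropy source, because the default Mersenne Twister generator is not suitable where unpredictability matters.

```python
import random

def draw_lottery(pool_size=49, picks=6, rng=None):
    """Draw `picks` distinct numbers from 1..pool_size, sorted for display."""
    rng = rng or random.SystemRandom()
    return sorted(rng.sample(range(1, pool_size + 1), picks))

ticket = draw_lottery()
print(ticket)
```

Pass a seeded random.Random instance as rng when a repeatable draw is needed for testing.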

Performance Comparison

Pure Python, without replacement: random.sample(). Pure Python, with replacement or weighted: random.choices(). Working with NumPy arrays or DataFrames: numpy.random.choice(). In a scikit-learn pipeline with large populations: sklearn.utils.random.sample_without_replacement(). When unsure: random.sample() is a reasonable default that covers most everyday cases.

Common Mistakes and How to Avoid Them

Conclusion

Python gives you multiple tools for random sampling, each designed for a specific context. random.sample() is the default for sampling without replacement from any Python sequence. random.choices() handles sampling with replacement and weighted probability distributions. numpy.random.choice() is the tool of choice in data science contexts. sklearn.utils.random.sample_without_replacement() is the specialist for ML pipelines where you need optimal performance sampling indices from very large populations.

Three things to take away: first, always set a seed when sampling in experiments or any reproducibility-sensitive context. Second, the default for most data science tasks is without replacement - use random.choices() with replacement only when you explicitly need bootstrapping or augmentation. Third, numpy.random.choice() with the Generator API is the modern NumPy standard - prefer it over the legacy np.random.seed() API in new code.

The best next step is to practice with real datasets. Board Infinity's guide on building a data science portfolio has project ideas where random sampling techniques like train/test splits, bootstrap sampling, and cross-validation are applied in realistic contexts.

Frequently Asked Questions

Q1. How do I get a random sample from a list in Python? Use random.sample(your_list, k) where k is the number of elements you want. This returns a new list of k unique elements chosen randomly without replacement. Example: random.sample([1, 2, 3, 4, 5], 3) might return [4, 1, 3].

Q2. What is the difference between random.sample() and random.choices()? random.sample() samples without replacement - each element appears at most once and k cannot exceed the population size. random.choices() samples with replacement - the same element can appear multiple times and k can be any non-negative integer, including values larger than the population size. random.choices() also supports weighted probabilities; random.sample() does not.

Q3. How do I do Python random sample without replacement? Use random.sample(population, k) - it always samples without replacement. For NumPy arrays, use numpy.random.choice(arr, size=k, replace=False). For ML pipelines, use sklearn.utils.random.sample_without_replacement(n_population, n_samples).

Q4. How do I do Python random sample with replacement? Use random.choices(population, k=k) from the standard library. For NumPy, use numpy.random.choice(arr, size=k, replace=True). For weighted sampling with replacement, use random.choices(population, weights=[...], k=k).

Q5. How do I make random.sample() reproducible? Set a seed before calling it: random.seed(42) then random.sample(...). For NumPy, use the Generator API: rng = numpy.random.default_rng(seed=42) then rng.choice(...). With the same seed, the same sequence of calls, and the same library version, you get the same result.

Q6. Can random.sample() return duplicates? random.sample() never returns an element from the same index twice. However, if your input list contains duplicate values like [1, 1, 2, 3], both instances of 1 can appear in the result because sample() treats each index as unique, not each value.

Q7. What is random choice without replacement in Python? It means selecting items from a collection where each selected item cannot be picked again. In Python, random.sample() implements this. random.choice() (singular) picks only one element and does not prevent duplicates if called multiple times.

