一个场景
当内存无法加载全部数据时,如何从包含未知大小的数据流中随机选取k
个数据,并且要保证每个数据被抽取到的概率相等。
水塘抽样
把上述问题具体化为给定一个未知大小数组data[]
,从中选取k
个数,要求每个数被选中的概率相等。假设数组长度为n
,则每个数被选中概率应为k/n
。
策略是:
- 对最前面
k
个数,直接放到数组chosen[]
即chosen[i] = data[i] 0 <= i < k
。 - 对
data[i], i >= k
,先生成个[0...i)
间的随机数(左闭右开),假设rand(0, i) = j
,则如果j < k
,则做替换chosen[j] = data[i]
。否则,nothing happens. - 如此反复直到结束。
Java
实现如下:
public class ReservoirSampling {
private Random rand = new Random();
public int[] sampling(int[] data, int k) {
int[] chosen = new int[k];
int i = 0;
for (int d : data) {
if (i < k) {
chosen[i] = d;
} else {
int j = rand.nextInt(i);
if (j < k) {
chosen[j] = d;
}
}
++i;
}
return chosen;
}
}
每个元素被选中概率相等的证明如下:
- 对前
k
个数,直接放到了数组,所以概率为1. - 从第
k+1
即data[k] (0 based)
开始,根据以上策略,data[k]
此次被选中概率为k/(k+1)
,而已经被选中的数此次被踢掉的概率则为1/(k+1)
,即被保留的概率为1 - 1/(k + 1) = k/(k+1)
。可见,到第k+1
个数时,每个数被选中的概率均为k/(k+1)
。 - 以此类推,到第
k+2
个数即data[k+1]
此次被选中概率为k/(k+2)
,已被选中的数此次被踢掉概率1/(k+2)
,被保留概率(k+1)/(k+2)
,在上一步和这一步均被保留才称之为保留,所以概率为k/(k+1) * (k+1)/(k+2) = k/(k+2)
。到第k+2
个数时,每个数被选中的概率k/(k+2)
。 - 以此类推到第
n
个数,每个数被选中的概率为k/n
。
Leetcode 398. Random Pick Index
Given an array of integers with possible duplicates, randomly output the index of a given target number. You can assume that the given target number must exist in the array.
Note: The array size can be very large. Solution that uses too much extra space will not pass the judge.
Example:
int[] nums = new int[] {1,2,3,3,3};
Solution solution = new Solution(nums);
// pick(3) should return either index 2, 3, or 4 randomly. Each index should have equal probability of returning.
solution.pick(3);
// pick(1) should return 0. Since in the array only nums[0] is equal to 1.
solution.pick(1);
其实就是k=1
的水塘抽样。代码如下:
private int[] nums;
private Random rand = new Random();
public Solution(int[] nums) {
this.nums = nums;
}
public int pick(int target) {
int count = 0;
int index = 0;
for (int i = 0; i < nums.length; i++) {
if (target == nums[i]) {
if (rand.nextInt(++count) == 0) {
index = i;
}
}
}
return index;
}
Leetcode 382. Linked List Random Node
Given a singly linked list, return a random node’s value from the linked list. Each node must have the same probability of being chosen.
Example 1:
Input:
[“Solution”, “getRandom”, “getRandom”, “getRandom”, “getRandom”, “getRandom”]
[[[1, 2, 3]], [], [], [], [], []]
Output
[null, 1, 3, 2, 2, 3]
Explanation \
Solution solution = new Solution([1, 2, 3]);
solution.getRandom(); // return 1
solution.getRandom(); // return 3
solution.getRandom(); // return 2
solution.getRandom(); // return 2
solution.getRandom(); // return 3
// getRandom() should return either 1, 2, or 3 randomly. Each element should have equal probability of returning.
依然是k=1
的水塘抽样。
private int size = 0;
private ListNode head;
private Random rand = new Random();
public Solution(ListNode head) {
this.head = head;
}
public int getRandom() {
ListNode cur = head;
ListNode chosen = null;
int count = 1;
while (cur != null) {
if (rand.nextInt(count++) == 0) {
chosen = cur;
}
cur = cur.next;
}
return chosen.val;
}
此题完全可以在初始化时得到这个链表长度N
,然后取0 based rand(N)
节点即可。复杂度略有差别而已。所以题目改成以下形式更适合:
public int getRandom(ListNode head) {
}