I recently needed to select a few random elements from a big database table and was looking for alternatives to ORDER BY RAND() (because of its performance issues). Because the IDs are not continuous, the solutions proposed in the linked article are not sufficient.
So the idea is to fetch all IDs from the table, pick X random IDs and query these directly. The memory overhead of reading an array of about 100,000 IDs is not too big, so the problem is reduced to how to pick random IDs efficiently with PHP.
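To make the idea concrete, here is a minimal sketch of the two-query approach. The table and column names (items, id) are made up for illustration, the ID list is hardcoded instead of being fetched, and the actual database call is left as a comment:

```php
<?php
// Step 1 (sketched): $ids holds all IDs fetched from the table,
// e.g. via "SELECT id FROM items" — hardcoded here for illustration.
$ids = [3, 17, 42, 99, 123, 500, 1001];
$k   = 3;

// Step 2: pick $k distinct IDs in PHP.
$keys = (array) array_rand($ids, $k); // cast: array_rand returns a scalar for $k == 1
$randomIds = [];
foreach ($keys as $key) {
    $randomIds[] = $ids[$key];
}

// Step 3: query only those rows with a parameterized IN clause.
$placeholders = implode(',', array_fill(0, count($randomIds), '?'));
$sql = "SELECT * FROM items WHERE id IN ($placeholders)";
// $stmt = $pdo->prepare($sql); $stmt->execute($randomIds);
```

The rest of this article is only about making step 2 fast.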
Problem definition
- Given an array $array of $n integers, draw $k distinct elements, with $k << $n.
- The array is indexed numerically from 0 to $n-1 and does not contain duplicates.
The naive approach
This is what I did first:
shuffle($array);
$result = array_splice($array, 0, $k);
But this is relatively slow because the whole array has to be shuffled.
Another obvious solution would be using the built-in function array_rand():

$keys = array_rand($array, $k);
$result = array();
foreach ($keys as $key) {
    $result[] = $array[$key];
}
Update: I added the above method for the sake of completeness. In the first tests I did not include it, because array_rand() is flawed. From the documentation:

array_rand uses the libc generator, which is slower and less-random than Mersenne Twister

Also, it does not shuffle the output array, so an additional shuffle($result) might be necessary.
I asked a question on StackOverflow to get recommendations for a more performant solution and it spawned quite a discussion. So I decided to examine the answers closely and run a few benchmarks.
Use random keys
This solution looks straightforward:

$randomArray = [];
while (count($randomArray) < $k) {
    $randomKey = mt_rand(0, count($array) - 1);
    $randomArray[$randomKey] = $array[$randomKey];
}

But it was criticized for several reasons:
- Large gaps in the output array. Luckily that is not a problem in PHP, because PHP arrays are not real arrays but ordered maps. For example, an array with the keys 23, 666 and 42 has exactly three elements, in the order they have been added. With array_values() I can normalize the result if necessary.
- The final output is not shuffled. Again, this is not true for PHP arrays, because they do not behave like arrays in other languages: insertion order is preserved, regardless of the keys (see above).
- It takes a long time if $k approaches $n. True, but irrelevant for my use case, where $k is much smaller than $n.
- It is biased. This one is a downer. But I could not believe it, so let's see if the argument withstands scrutiny:

“it is biased because some picks have more probability than others while all possible picks should be equi-probable”
- Given that mt_rand() produces a sufficiently random number, the probability P($randomKey == $i) for each key $i of the array is 1/$n in every iteration.
- The probability P($randomKey has not been drawn before) is ($n-N)/$n, with N being the count of already drawn items.
- For the first iteration, the code will always draw an element that has not been drawn before, and therefore all elements have the same probability of being drawn as the first item. No bias so far!
- For the next iterations, we have to look at two cases: 1) $randomKey is a key that has already been drawn, and 2) $randomKey is a key that has not been drawn yet. In the first case, nothing happens: we assign the same element to the same key again and repeat. There is no next drawn item until $randomKey is eventually a key that has not been drawn yet. All iterations until then can be ignored completely, because they did not have any effect. That means that for the item that is eventually drawn as the next item, we can assume that its key has not been drawn before.
- Now let's get our hands dirty with a proof by induction: for the (N+1)th item, the probability P($randomKey == $i) for each key $i of the array is equal to

  P($randomKey == $i | $randomKey has not been drawn before)¹
  = P($randomKey == $i ∩ $randomKey has not been drawn before) / P($randomKey has not been drawn before)

  This is 0 for keys that have been drawn before (the intersection is empty), and (1/$n) / (($n-N)/$n) = 1/($n-N) for keys that have not been drawn before. Yay, still no bias!

Forgive my mix of code and mathematical notation; I hope it is still clear enough if you know a bit about statistics.
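The argument can also be checked empirically. The following sketch (my own addition, not from the StackOverflow discussion) runs the random-keys loop many times and counts how often each key gets picked; with $n = 5 and $k = 2, an unbiased selection should pick every key in about $k/$n = 40% of the trials:

```php
<?php
// Empirical check: with n = 5 and k = 2, each key should be picked
// in about k/n = 40% of the trials if the selection is unbiased.
$n = 5; $k = 2; $trials = 100000;
$array  = range(100, 100 + $n - 1);
$counts = array_fill(0, $n, 0);

for ($t = 0; $t < $trials; $t++) {
    $randomArray = [];
    while (count($randomArray) < $k) {
        $randomKey = mt_rand(0, $n - 1);
        $randomArray[$randomKey] = $array[$randomKey];
    }
    foreach (array_keys($randomArray) as $key) {
        $counts[$key]++;
    }
}

foreach ($counts as $key => $count) {
    printf("key %d: %.3f\n", $key, $count / $trials);
}
// Each printed ratio should be close to 0.4, up to statistical noise.
```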
Richard Durstenfeld's partial Fisher-Yates shuffle
This was a promising approach, as it mimics the internal behavior of shuffle() but stops after $k elements. I'll quote George Reith's complete explanation, because I could not explain it better:
This function performs a shuffle on only $n elements, where $n is the number of random elements you want to pick. It will also work on associative arrays and sparse arrays. $array is the array to work on and $n is the number of random elements to retrieve.

If we define $max_index as count($array) - 1 - $iteration, it works by generating a random number between 0 and $max_index, picking the key at that index, and replacing its index with the value at $max_index so that it can never be picked again, as $max_index will be one less at the next iteration and unreachable.
function rand_pluck($array, $n) {
    $array_keys = array_keys($array);
    $array_length = count($array_keys);
    $max_index = $array_length - 1;
    $iterations = min($n, $array_length);
    $random_array = array();
    while ($iterations--) {
        $index = mt_rand(0, $max_index);
        $value = $array_keys[$index];
        $array_keys[$index] = $array_keys[$max_index];
        array_push($random_array, $array[$value]);
        $max_index--;
    }
    return $random_array;
}
The only criticism it got was that it is still O(n), because array_keys() is called on the whole array. But array_keys() should actually be relatively fast and, if it really matters, it is only necessary to make the function work with associative and sparse arrays. For my use case, it could be replaced with range(0, count($array) - 1).
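A sketch of that simplification for continuously indexed arrays (the function name rand_pluck_indexed is my own; the swapping logic is the same as in rand_pluck above):

```php
<?php
// Variant of rand_pluck for continuously indexed arrays:
// instead of array_keys(), build the index pool with range().
function rand_pluck_indexed(array $array, $n) {
    $keys       = range(0, count($array) - 1);
    $max_index  = count($array) - 1;
    $iterations = min($n, count($array));
    $result     = [];
    while ($iterations--) {
        $index = mt_rand(0, $max_index);
        $result[] = $array[$keys[$index]];
        $keys[$index] = $keys[$max_index]; // move the last reachable key into the hole
        $max_index--;                      // shrink the pool so no key is picked twice
    }
    return $result;
}

$picked = rand_pluck_indexed(range(1, 1000), 5);
```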
Another Fisher-Yates shuffle
This variation claims to be “strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and produces output which is proper array with consecutive keys (not needing extra array_values etc..)”.
function random_pick($a, $n) {
    $N = count($a);
    $n = min($n, $N);
    $picked = array_fill(0, $n, 0);
    $backup = array_fill(0, $n, 0);
    // partially shuffle the array, and generate unbiased selection simultaneously
    // this is a variation on the Fisher-Yates-Knuth shuffle
    for ($i = 0; $i < $n; $i++) { // O(n) times
        $selected = mt_rand(0, --$N); // unbiased sampling N * N-1 * N-2 * ... * N-n+1
        $value = $a[$selected];
        $a[$selected] = $a[$N];
        $a[$N] = $value;
        $backup[$i] = $selected;
        $picked[$i] = $value;
    }
    // restore partially shuffled input array from backup
    // optional step; if not needed it can be skipped, e.g. when $a is passed by value and hence copied
    for ($i = $n - 1; $i >= 0; $i--) { // O(n) times
        $selected = $backup[$i];
        $value = $a[$N];
        $a[$N] = $a[$selected];
        $a[$selected] = $value;
        $N++;
    }
    return $picked;
}
Not bad! For the benchmarks I will test once with the “restore” part and $array passed by reference, and once without this part and the array passed by value.
Benchmark results
My benchmark script can be found in this Gist (it uses Sebastian Bergmann’s PHP_Timer).
The results were not quite as expected:
(Tested on PHP 5.5.27)
Shuffle and Splice
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| shuffle_and_splice        | 10       | 1000      | 1 ms    |
| shuffle_and_splice        | 10       | 10000     | 4 ms    |
| shuffle_and_splice        | 10       | 100000    | 60 ms   |
| shuffle_and_splice        | 10       | 1000000   | 773 ms  |
| shuffle_and_splice        | 100      | 1000      | 0 ms    |
| shuffle_and_splice        | 100      | 10000     | 3 ms    |
| shuffle_and_splice        | 100      | 100000    | 66 ms   |
| shuffle_and_splice        | 100      | 1000000   | 785 ms  |
| shuffle_and_splice        | 1000     | 10000     | 2 ms    |
| shuffle_and_splice        | 5000     | 10000     | 3 ms    |
| shuffle_and_splice        | 9000     | 10000     | 2 ms    |
| shuffle_and_splice        | 1000     | 100000    | 60 ms   |
| shuffle_and_splice        | 10000    | 100000    | 62 ms   |
| shuffle_and_splice        | 50000    | 100000    | 61 ms   |
| shuffle_and_splice        | 90000    | 100000    | 62 ms   |
|---------------------------+----------+-----------+---------|
Total memory usage: 313,467,232 Bytes
array_rand
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| value_by_array_rand       | 10       | 1000      | 0 ms    |
| value_by_array_rand       | 10       | 10000     | 0 ms    |
| value_by_array_rand       | 10       | 100000    | 2 ms    |
| value_by_array_rand       | 10       | 1000000   | 5 ms    |
| value_by_array_rand       | 100      | 1000      | 0 ms    |
| value_by_array_rand       | 100      | 10000     | 1 ms    |
| value_by_array_rand       | 100      | 100000    | 2 ms    |
| value_by_array_rand       | 100      | 1000000   | 23 ms   |
| value_by_array_rand       | 1000     | 10000     | 1 ms    |
| value_by_array_rand       | 5000     | 10000     | 2 ms    |
| value_by_array_rand       | 9000     | 10000     | 3 ms    |
| value_by_array_rand       | 1000     | 100000    | 3 ms    |
| value_by_array_rand       | 10000    | 100000    | 6 ms    |
| value_by_array_rand       | 50000    | 100000    | 21 ms   |
| value_by_array_rand       | 90000    | 100000    | 33 ms   |
|---------------------------+----------+-----------+---------|
Total memory usage: 136,696,784 Bytes
Random Keys
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| random_keys               | 10       | 1000      | 0 ms    |
| random_keys               | 10       | 10000     | 0 ms    |
| random_keys               | 10       | 100000    | 0 ms    |
| random_keys               | 10       | 1000000   | 0 ms    |
| random_keys               | 100      | 1000      | 0 ms    |
| random_keys               | 100      | 10000     | 0 ms    |
| random_keys               | 100      | 100000    | 0 ms    |
| random_keys               | 100      | 1000000   | 0 ms    |
| random_keys               | 1000     | 10000     | 0 ms    |
| random_keys               | 5000     | 10000     | 3 ms    |
| random_keys               | 9000     | 10000     | 9 ms    |
| random_keys               | 1000     | 100000    | 0 ms    |
| random_keys               | 10000    | 100000    | 6 ms    |
| random_keys               | 50000    | 100000    | 41 ms   |
| random_keys               | 90000    | 100000    | 134 ms  |
|---------------------------+----------+-----------+---------|
Total memory usage: 136,678,552 Bytes
Richard Durstenfeld's Fisher-Yates
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| richard_durstenfeld       | 10       | 1000      | 0 ms    |
| richard_durstenfeld       | 10       | 10000     | 2 ms    |
| richard_durstenfeld       | 10       | 100000    | 19 ms   |
| richard_durstenfeld       | 10       | 1000000   | 194 ms  |
| richard_durstenfeld       | 100      | 1000      | 0 ms    |
| richard_durstenfeld       | 100      | 10000     | 1 ms    |
| richard_durstenfeld       | 100      | 100000    | 18 ms   |
| richard_durstenfeld       | 100      | 1000000   | 200 ms  |
| richard_durstenfeld       | 1000     | 10000     | 2 ms    |
| richard_durstenfeld       | 5000     | 10000     | 4 ms    |
| richard_durstenfeld       | 9000     | 10000     | 6 ms    |
| richard_durstenfeld       | 1000     | 100000    | 18 ms   |
| richard_durstenfeld       | 10000    | 100000    | 27 ms   |
| richard_durstenfeld       | 50000    | 100000    | 68 ms   |
| richard_durstenfeld       | 90000    | 100000    | 104 ms  |
|---------------------------+----------+-----------+---------|
Total memory usage: 273,073,288 Bytes
Fisher Yates (by reference, with backup)
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| fisher_yates_by_reference | 10       | 1000      | 0 ms    |
| fisher_yates_by_reference | 10       | 10000     | 2 ms    |
| fisher_yates_by_reference | 10       | 100000    | 18 ms   |
| fisher_yates_by_reference | 10       | 1000000   | 186 ms  |
| fisher_yates_by_reference | 100      | 1000      | 0 ms    |
| fisher_yates_by_reference | 100      | 10000     | 1 ms    |
| fisher_yates_by_reference | 100      | 100000    | 19 ms   |
| fisher_yates_by_reference | 100      | 1000000   | 183 ms  |
| fisher_yates_by_reference | 1000     | 10000     | 2 ms    |
| fisher_yates_by_reference | 5000     | 10000     | 5 ms    |
| fisher_yates_by_reference | 9000     | 10000     | 9 ms    |
| fisher_yates_by_reference | 1000     | 100000    | 19 ms   |
| fisher_yates_by_reference | 10000    | 100000    | 32 ms   |
| fisher_yates_by_reference | 50000    | 100000    | 92 ms   |
| fisher_yates_by_reference | 90000    | 100000    | 151 ms  |
|---------------------------+----------+-----------+---------|
Total memory usage: 313,446,936 Bytes
Fisher Yates (by value)
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| fisher_yates_by_value     | 10       | 1000      | 0 ms    |
| fisher_yates_by_value     | 10       | 10000     | 1 ms    |
| fisher_yates_by_value     | 10       | 100000    | 11 ms   |
| fisher_yates_by_value     | 10       | 1000000   | 113 ms  |
| fisher_yates_by_value     | 100      | 1000      | 0 ms    |
| fisher_yates_by_value     | 100      | 10000     | 1 ms    |
| fisher_yates_by_value     | 100      | 100000    | 12 ms   |
| fisher_yates_by_value     | 100      | 1000000   | 114 ms  |
| fisher_yates_by_value     | 1000     | 10000     | 1 ms    |
| fisher_yates_by_value     | 5000     | 10000     | 2 ms    |
| fisher_yates_by_value     | 9000     | 10000     | 5 ms    |
| fisher_yates_by_value     | 1000     | 100000    | 12 ms   |
| fisher_yates_by_value     | 10000    | 100000    | 18 ms   |
| fisher_yates_by_value     | 50000    | 100000    | 48 ms   |
| fisher_yates_by_value     | 90000    | 100000    | 73 ms   |
|---------------------------+----------+-----------+---------|
Total memory usage: 225,067,424 Bytes
Summary
- For small values of $k, the “random keys” approach was always by far the fastest; only when $k grew beyond 50% of $n did it slow down, because of more collisions.
- Somewhere between 50% and 90%, the “shuffle and splice” approach takes over. As expected, it takes roughly the same time for each $n, independent of $k.
- The “array_rand” approach is better than “shuffle and splice”, but still only gains an advantage over “random keys” for big values of $k.
- The Fisher-Yates implementations were disappointing, every single one. I am sure this is not a problem with the algorithm per se (it is used by shuffle() itself), but with the massive overhead of PHP arrays. An implementation in C would probably yield different results.
Out of interest, I tried replacing the arrays in both Fisher-Yates variants with SplFixedArray instances, the closest you can get to real arrays in PHP.
Fisher Yates (Fixed Array)
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| fisher_yates_fixed_array  | 10       | 1000      | 0 ms    |
| fisher_yates_fixed_array  | 10       | 10000     | 0 ms    |
| fisher_yates_fixed_array  | 10       | 100000    | 3 ms    |
| fisher_yates_fixed_array  | 10       | 1000000   | 32 ms   |
| fisher_yates_fixed_array  | 100      | 1000      | 0 ms    |
| fisher_yates_fixed_array  | 100      | 10000     | 0 ms    |
| fisher_yates_fixed_array  | 100      | 100000    | 3 ms    |
| fisher_yates_fixed_array  | 100      | 1000000   | 32 ms   |
| fisher_yates_fixed_array  | 1000     | 10000     | 1 ms    |
| fisher_yates_fixed_array  | 5000     | 10000     | 3 ms    |
| fisher_yates_fixed_array  | 9000     | 10000     | 4 ms    |
| fisher_yates_fixed_array  | 1000     | 100000    | 4 ms    |
| fisher_yates_fixed_array  | 10000    | 100000    | 9 ms    |
| fisher_yates_fixed_array  | 50000    | 100000    | 31 ms   |
| fisher_yates_fixed_array  | 90000    | 100000    | 50 ms   |
|---------------------------+----------+-----------+---------|
Total memory usage: 144,681,520 Bytes
The improvement is significant, because element swapping is much faster, but due to the overhead of converting the input array to a fixed array, it still cannot beat the “random keys” approach. However, for large $k it is better than “shuffle and splice”.
Richard Durstenfeld’s Fisher Yates (Fixed Array)
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| richard_durstenfeld_fixed | 10       | 1000      | 0 ms    |
| richard_durstenfeld_fixed | 10       | 10000     | 3 ms    |
| richard_durstenfeld_fixed | 10       | 100000    | 20 ms   |
| richard_durstenfeld_fixed | 10       | 1000000   | 214 ms  |
| richard_durstenfeld_fixed | 100      | 1000      | 0 ms    |
| richard_durstenfeld_fixed | 100      | 10000     | 2 ms    |
| richard_durstenfeld_fixed | 100      | 100000    | 21 ms   |
| richard_durstenfeld_fixed | 100      | 1000000   | 215 ms  |
| richard_durstenfeld_fixed | 1000     | 10000     | 2 ms    |
| richard_durstenfeld_fixed | 5000     | 10000     | 4 ms    |
| richard_durstenfeld_fixed | 9000     | 10000     | 6 ms    |
| richard_durstenfeld_fixed | 1000     | 100000    | 21 ms   |
| richard_durstenfeld_fixed | 10000    | 100000    | 28 ms   |
| richard_durstenfeld_fixed | 50000    | 100000    | 59 ms   |
| richard_durstenfeld_fixed | 90000    | 100000    | 91 ms   |
|---------------------------+----------+-----------+---------|
Total memory usage: 281,069,000 Bytes
No big difference here. The problem is that we operate on two arrays, so either we have to convert both (even more overhead) or we get only half of the speedup. I converted the one that is both read and written to a fixed array; converting both was even slower in the end.
PHP 7 / HHVM
If you are using PHP 7 or HHVM, I have interesting news for you: the Fisher-Yates shuffle with the array passed by reference really takes advantage of the improvements in these platforms. It is not significantly better than the random-keys approach for small $k and slower than the shuffle-and-splice approach for $k > $n / 2, but overall it has the best performance, so it is a good choice if you don't know in advance how the numbers relate.
These are the results on PHP 7.0.0-dev for this particular algorithm (I won’t repeat all results, the article is already long enough…):
|------------------------------------------------------------|
| function                  | $k       | $n        | time    |
|---------------------------+----------+-----------+---------|
| fisher_yates_by_reference | 10       | 1000      | 0 ms    |
| fisher_yates_by_reference | 10       | 10000     | 0 ms    |
| fisher_yates_by_reference | 10       | 100000    | 0 ms    |
| fisher_yates_by_reference | 10       | 1000000   | 0 ms    |
| fisher_yates_by_reference | 100      | 1000      | 0 ms    |
| fisher_yates_by_reference | 100      | 10000     | 0 ms    |
| fisher_yates_by_reference | 100      | 100000    | 0 ms    |
| fisher_yates_by_reference | 100      | 1000000   | 0 ms    |
| fisher_yates_by_reference | 1000     | 10000     | 2 ms    |
| fisher_yates_by_reference | 5000     | 10000     | 6 ms    |
| fisher_yates_by_reference | 9000     | 10000     | 10 ms   |
| fisher_yates_by_reference | 1000     | 100000    | 2 ms    |
| fisher_yates_by_reference | 10000    | 100000    | 12 ms   |
| fisher_yates_by_reference | 50000    | 100000    | 60 ms   |
| fisher_yates_by_reference | 90000    | 100000    | 92 ms   |
|---------------------------+----------+-----------+---------|
Total memory usage: 33,961,160 Bytes
Conclusion
For the given scenario, picking $k
items from an array with $n
elements, where $k
is not greater than $n / 2
, the following code seems to be unbeatable so far:
$randomArray = [];
while (count($randomArray) < $k) {
    $randomKey = mt_rand(0, count($array) - 1);
    $randomArray[$randomKey] = $array[$randomKey];
}
If the output array should be continuously indexed, add $randomArray = array_values($randomArray); at the end.

If the input array is not continuously indexed, you can add $array = SplFixedArray::fromArray($array, false); at the beginning. That should be better than using array_values(), because accessing the SplFixedArray is faster, and if we have to copy the data anyway, why not copy it into a more efficient structure? But as always, please run your own benchmarks instead of believing everything on the internet 😉
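Putting the conclusion together for a non-continuously indexed input, here is a minimal sketch (the sample data is made up; SplFixedArray::fromArray with false as the second argument drops the original keys and reindexes from 0):

```php
<?php
// Sketch: random-keys selection on a sparse/associative input,
// normalized first via SplFixedArray (drops keys, reindexes from 0).
$array = [23 => 'a', 666 => 'b', 42 => 'c', 7 => 'd', 1000 => 'e'];
$k = 3;

$fixed = SplFixedArray::fromArray($array, false); // indices now 0..n-1
$n = count($fixed);

$randomArray = [];
while (count($randomArray) < $k) {
    $randomKey = mt_rand(0, $n - 1);
    $randomArray[$randomKey] = $fixed[$randomKey];
}
$result = array_values($randomArray); // continuous 0..k-1 keys, if needed
```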
Notes:
- P(A|B) is the probability of A, given B ↩
Maybe I am wrong, but I think that calling count() in a while loop is a performance loss, because it will be called on each iteration. Additionally, count() on $array will be executed at least twice (the second time in the mt_rand() line).
So I think that this would be better:
$arrayLength = count($array);
$randomArray = [];
while ($arrayLength < $k) {
    $randomKey = mt_rand(0, $arrayLength - 1);
    $randomArray[$randomKey] = $array[$randomKey];
}
I thought it would not make much difference, but you are right: with $arrayLength = count($array) and using that in mt_rand() we can improve the results a bit more. However, don't use $arrayLength in the while condition; there we need to check the size of the result array.
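Combining both points of the thread, the corrected loop would look like this (my own consolidation of the two comments above):

```php
<?php
$array = range(1, 1000); // example input
$k = 10;

$arrayLength = count($array); // cache count($array) outside the loop
$randomArray = [];
while (count($randomArray) < $k) { // check the size of the *result* array here
    $randomKey = mt_rand(0, $arrayLength - 1);
    $randomArray[$randomKey] = $array[$randomKey];
}
```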