This is a stochastic streaming sketch that enables near real-time analysis of the * approximate distribution of values from a very large stream in a single pass, requiring only * that the values are comparable. * The analysis is obtained using get_quantile() or get_quantiles() functions or the * inverse functions get_rank(), get_PMF() (Probability Mass Function), and get_CDF() * (Cumulative Distribution Function). * *
As of May 2020, this implementation produces serialized sketches which are binary-compatible * with the equivalent Java implementation only when template parameter T = float * (32-bit single precision values). * *
Given an input stream of N numeric values, the absolute rank of any specific * value is defined as its index (0 to N-1) in the hypothetical sorted stream of all * N input values. * *
The normalized rank (rank) of any specific value is defined as its * absolute rank divided by N. * Thus, the normalized rank is a value between zero and one. * In the documentation for this sketch absolute rank is never used so any * reference to just rank should be interpreted to mean normalized rank. * *
This sketch is configured with a parameter k, which affects the size of the sketch * and its estimation error. * *
The estimation error is commonly called epsilon (or eps) and is a fraction * between zero and one. Larger values of k result in smaller values of epsilon. * Epsilon is always with respect to the rank and cannot be applied to the * corresponding values. * *
The relationship between the normalized rank and the corresponding values can be viewed * as a two dimensional monotonic plot with the normalized rank on one axis and the * corresponding values on the other axis. If the y-axis is specified as the value-axis and * the x-axis as the normalized rank, then y = get_quantile(x) is a monotonically * increasing function. * *
The functions get_quantile(rank) and get_quantiles(...) translate ranks into * corresponding values. The functions get_rank(value), * get_CDF(...) (Cumulative Distribution Function), and get_PMF(...) * (Probability Mass Function) perform the opposite operation and translate values into ranks. * *
The getPMF(...) function has about 13 to 47% worse rank error (depending * on k) than the other queries because the mass of each "bin" of the PMF has * "double-sided" error from the upper and lower edges of the bin as a result of a subtraction, * as the errors from the two edges can sometimes add. * *
The default k of 200 yields a "single-sided" epsilon of about 1.33% and a * "double-sided" (PMF) epsilon of about 1.65%. * *
A get_quantile(rank) query has the following guarantees: *
A get_rank(value) query has the following guarantees: *
A get_PMF() query has the following guarantees: *
A get_CDF(...) query has the following guarantees; *
From the above, it might seem like we could make some estimates to bound the * value returned from a call to get_quantile(). The sketch, however, does not * let us derive error bounds or confidences around values. Because errors are independent, we * can approximately bracket a value as shown below, but there are no error estimates available. * Additionally, the interval may be quite large for certain distributions. *
* Note that this method has a fairly large overhead (microseconds instead of nanoseconds) * so it should not be called multiple times to get different quantiles from the same * sketch. Instead use get_quantiles(), which pays the overhead only once. *
* For floating point types: if the sketch is empty this returns NaN.
* For other types: if the sketch is empty this throws runtime_error.
*
* @param fraction the specified fractional position in the hypothetical sorted stream.
* These are also called normalized ranks or fractional ranks.
* If fraction = 0.0, the true minimum value of the stream is returned.
* If fraction = 1.0, the true maximum value of the stream is returned.
* If the parameter inclusive=true, the given rank is considered inclusive (includes the weight of an item)
*
* @return the approximation to the value at the given fraction
*/
using quantile_return_type = typename quantile_sketch_sorted_view
* This returns an array that could have been generated by using get_quantile() for each * fractional rank separately, but would be very inefficient. * This method incurs the internal set-up overhead once and obtains multiple quantile values in * a single query. It is strongly recommend that this method be used instead of multiple calls * to get_quantile(). * *
If the sketch is empty this returns an empty vector.
*
* @param fractions given array of fractional positions in the hypothetical sorted stream.
* These are also called normalized ranks or fractional ranks.
* These fractions must be in the interval [0.0, 1.0], inclusive.
* If the parameter inclusive=true, the given fractions are considered inclusive (include weights of items)
*
* @return array of approximations to the given fractions in the same order as given fractions
* in the input array.
*/
template
If the sketch is empty this returns an empty vector.
*
* @param num an integer that specifies the number of evenly-spaced fractional ranks.
* This must be an integer greater than 0. A value of 1 will return the min value.
* A value of 2 will return the min and the max value. A value of 3 will return the min,
* the median and the max value, etc.
*
* @return array of approximations to the given number of evenly-spaced fractional ranks.
*/
template
The resulting approximation has a probabilistic guarantee that can be obtained from the * get_normalized_rank_error(false) function. * *
If the sketch is empty this returns NaN.
*
* @param value to be ranked
* @return an approximate rank of the given value
*/
template
The resulting approximations have a probabilistic guarantee that can be obtained from the * get_normalized_rank_error(true) function. * *
If the sketch is empty this returns an empty vector.
*
* @param split_points an array of m unique, monotonically increasing values
* that divide the input domain into m+1 consecutive disjoint intervals.
* The definition of an "interval" is inclusive of the left split point (or minimum value) and
* exclusive of the right split point, with the exception that the last interval will include
* the maximum value.
* It is not necessary to include either the min or max values in these split points.
*
* @return an array of m+1 doubles each of which is an approximation
* to the fraction of the input stream values (the mass) that fall into one of those intervals.
* If the template parameter inclusive=false, the definition of an "interval" is inclusive of the left split point and exclusive of the right
* split point, with the exception that the last interval will include the maximum value.
* If the template parameter inclusive=true, the definition of an "interval" is exclusive of the left split point and inclusive of the right
* split point.
*/
template
If the sketch is empty this returns an empty vector.
*
* @param split_points an array of m unique, monotonically increasing values
* that divide the input domain into m+1 consecutive disjoint intervals.
* The definition of an "interval" is inclusive of the left split point (or minimum value) and
* exclusive of the right split point, with the exception that the last interval will include
* the maximum value.
* It is not necessary to include either the min or max values in these split points.
*
* @return an array of m+1 double values, which are a consecutive approximation to the CDF
* of the input stream given the split_points. The value at array position j of the returned
* CDF array is the sum of the returned values in positions 0 through j of the returned PMF
* array.
*/
template