Utils
utils module
statista.utils
merge_small_bins(bin_count_observed, bin_count_fitted_data)
Merge small bins for goodness-of-fit tests (e.g., chi-square).
This utility merges adjacent "small" bins (those whose expected count is < 5) starting from the right-most bin and moving left, accumulating small bins until their combined expected count is >= 5. If a large (>= 5) bin is encountered while there is an accumulation, that accumulation is merged into that bin. If the left edge is reached with a remaining accumulation that was never merged into a large bin, the accumulation is appended as its own bin.
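The right-to-left pass described above can be sketched in plain Python. This is a minimal illustration of one reading of the description, not the library's source: the name `merge_pass_sketch` and the `min_expected` argument are hypothetical, the function returns lists rather than NumPy arrays, and it leaves out rescaling. It assumes small bins keep accumulating until a large bin is met.

```python
from typing import List, Sequence, Tuple


def merge_pass_sketch(
    observed: Sequence[float],
    expected: Sequence[float],
    min_expected: float = 5.0,  # hypothetical parameter; the text fixes the threshold at 5
) -> Tuple[List[float], List[float]]:
    """Merge small bins from right to left (illustrative sketch, no rescaling)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")

    merged_obs: List[float] = []
    merged_exp: List[float] = []
    acc_obs = 0.0
    acc_exp = 0.0
    pending = False  # True while small bins are waiting to be merged

    # Walk from the right-most bin towards the left edge.
    for obs, exp in zip(reversed(observed), reversed(expected)):
        if exp < min_expected:
            # Small bin: add it to the running accumulation.
            acc_obs += obs
            acc_exp += exp
            pending = True
        else:
            # Large bin: absorb any pending accumulation, then keep the bin.
            if pending:
                obs += acc_obs
                exp += acc_exp
                acc_obs, acc_exp, pending = 0.0, 0.0, False
            merged_obs.append(obs)
            merged_exp.append(exp)

    # Left edge reached with small bins still pending: keep them as one bin.
    if pending:
        merged_obs.append(acc_obs)
        merged_exp.append(acc_exp)

    # Restore low-to-high bin order.
    merged_obs.reverse()
    merged_exp.reverse()
    return merged_obs, merged_exp
```

Under this reading, expected counts `[9, 8, 3, 2]` have their two right-most bins folded into the bin with expected count 8, while expected counts `[2, 2, 11]` keep the two left-most small bins as a single bin of their own.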
After merging, the expected counts are rescaled so that their sum equals the total observed count (required by Pearson's chi-square test), preserving the expected proportions within the merged structure.
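The rescaling step amounts to multiplying every merged expected count by the ratio of the observed total to the expected total. The following is a minimal sketch under that assumption; `rescale_expected_sketch` is a hypothetical helper written for illustration, and it works on plain Python numbers rather than NumPy arrays.

```python
from typing import List, Sequence


def rescale_expected_sketch(
    observed: Sequence[float], expected: Sequence[float]
) -> List[float]:
    """Scale expected counts so they sum to the observed total (illustrative sketch)."""
    total_observed = sum(observed)
    total_expected = sum(expected)
    # With plain Python numbers this raises ZeroDivisionError when total_expected is 0.
    factor = total_observed / total_expected
    # A single common factor leaves each bin's share of the total unchanged.
    return [e * factor for e in expected]
```

For example, observed counts `[30, 10]` with expected counts `[15, 5]` give a factor of 2, so the expected counts become `[30.0, 10.0]`: their sum now matches the observed total of 40, and the first bin still holds three quarters of it.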
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bin_count_observed | List[float] | Observed counts per original bin. Must be the same length as the other input sequence. | required |
| bin_count_fitted_data | List[float] | Expected (model-fitted) counts per original bin. Must be the same length as the other input sequence. | required |
Returns:
| Type | Description |
|---|---|
| tuple[ndarray, ndarray] | Tuple[np.ndarray, np.ndarray]: Two 1D numpy arrays |
Raises:
| Type | Description |
|---|---|
| ZeroDivisionError | If the total expected count across merged bins is 0, rescaling cannot be performed (division by zero). This can happen if all expected counts are zero. |
| ValueError | If the input sequences have different lengths. |
Notes
- The function assumes a one-to-one correspondence of observed and expected bins. If lengths differ, only a partial zip would occur; to avoid silent truncation a ValueError is raised.
- Merging proceeds from right to left and the result is then reversed back to low-to-high order.
- The "< 5" rule is a common heuristic for chi-square tests to ensure adequate expected counts per bin.
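To show where the merged bins are used, here is a minimal sketch of Pearson's chi-square statistic, the sum over bins of (observed - expected)^2 / expected. The helper `pearson_chi_square_sketch` is hypothetical and is not part of `statista.utils`; it assumes the bins have already been merged and rescaled so that every expected count is positive.

```python
from typing import Sequence


def pearson_chi_square_sketch(
    observed: Sequence[float], expected: Sequence[float]
) -> float:
    """Pearson chi-square statistic for matching observed/expected bins (sketch)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Each bin contributes its squared deviation relative to the expected count,
    # which is why very small expected counts can dominate the statistic.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Because each term divides by the expected count, a bin with a very small expected count can inflate the statistic, which is the practical motivation for merging such bins first.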
Examples:
- Merge tail small bins with the nearest large bin on the left
- No merging when all expected counts are >= 5
- Accumulated leftmost small bins remain as their own bin if no large bin is found to the left
- Expected counts are rescaled to match the observed total while preserving proportions
Source code in src\statista\utils.py