Waqqas Dadabhoy
17 Feb 2021
•
6 min read
onemusicapi.com allows searching for releases by passing in track positions and durations. The search allows the track durations to be off
by a couple of seconds, (TODO figure out the reasons for variation), so we need a fast method for performing fuzzy search on the track
durations.
Another requirement that makes things more complicated is that not all the input tracks need to match; a subset of the tracks matching (the actual fraction of tracks that need to match depended on other factors) need to be enough for the medium to match. However, this won't be covered in detail here.
This is the straightforward solution - a table containing the following columns:
(medium, track_id, track_position, track_duration)
It has one row per track, and an index on (track_position, track_duration). It can be queried fairly simply:
SELECT medium FROM track WHERE (position='1' AND duration BETWEEN 119000 AND 125000)
INTERSECT SELECT medium FROM track WHERE (position='2' AND duration BETWEEN 161000 AND 167000)
INTERSECT SELECT medium FROM track WHERE (position='3' AND duration BETWEEN 207000 AND 213000)
INTERSECT SELECT medium FROM track WHERE (position='4' AND duration BETWEEN 167000 AND 173000)
INTERSECT SELECT medium FROM track WHERE (position='5' AND duration BETWEEN 156000 AND 162000);
This is a more complex solution - a table containing a column for each possible position (in our case, position values up to 99 were enough):
(medium, duration_1, duration_2, ..., duration_99)
It has one row per medium, and an index on each of the duration_*
columns. Querying it is slightly more complicated, because the column
names have to be dynamic, but the eventual query is also easy to read:
SELECT * FROM track_with_duration WHERE (track_1 BETWEEN 119000 AND 125000)
AND (track_2 BETWEEN 161000 AND 167000)
AND (track_3 BETWEEN 207000 AND 213000)
AND (track_4 BETWEEN 167000 AND 173000)
AND (track_5 BETWEEN 156000 AND 162000);
Using the sample query above, we can use EXPLAIN ANALYZE to compare the performance of the approaches.
HashSetOp Intersect  (cost=1841.49..1646797.47 rows=56862 width=20) (actual time=3015.493..3015.741 rows=187 loops=1)
  ->  Append  (cost=1841.49..1646429.07 rows=147359 width=20) (actual time=2635.974..3000.008 rows=90650 loops=1)
        ->  Result  (cost=1841.49..1350456.02 rows=56862 width=20) (actual time=2635.973..2636.224 rows=195 loops=1)
              ->  HashSetOp Intersect  (cost=1841.49..1349887.40 rows=56862 width=20) (actual time=2635.962..2636.199 rows=195 loops=1)
                    ->  Append  (cost=1841.49..1349471.62 rows=166311 width=20) (actual time=2113.309..2616.742 rows=109821 loops=1)
                          ->  Result  (cost=1841.49..1000769.41 rows=56862 width=20) (actual time=2113.308..2113.590 rows=244 loops=1)
                                ->  HashSetOp Intersect  (cost=1841.49..1000200.79 rows=56862 width=20) (actual time=2113.297..2113.562 rows=244 loops=1)
                                      ->  Append  (cost=1841.49..999690.25 rows=204216 width=20) (actual time=1249.196..2082.270 rows=179735 loops=1)
                                            ->  Result  (cost=1841.49..552217.31 rows=56862 width=20) (actual time=1249.195..1252.290 rows=2243 loops=1)
                                                  ->  HashSetOp Intersect  (cost=1841.49..551648.69 rows=56862 width=20) (actual time=1249.184..1252.075 rows=2243 loops=1)
                                                        ->  Append  (cost=1841.49..551228.95 rows=167896 width=20) (actual time=334.363..1206.468 rows=171241 loops=1)
                                                              ->  Subquery Scan on "*SELECT* 1"  (cost=1841.49..221349.88 rows=65327 width=20) (actual time=334.363..652.896 rows=58032 loops=1)
                                                                    ->  Bitmap Heap Scan on track  (cost=1841.49..220696.61 rows=65327 width=16) (actual time=334.359..647.252 rows=58032 loops=1)
                                                                          Recheck Cond: ((("position")::text = '1'::text) AND (duration >= 119000) AND (duration <= 125000))
                                                                          Heap Blocks: exact=56456
                                                                          ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..1825.16 rows=65327 width=0) (actual time=14.961..14.961 rows=58032 loops=1)
                                                                                Index Cond: ((("position")::text = '1'::text) AND (duration >= 119000) AND (duration <= 125000))
                                                              ->  Subquery Scan on "*SELECT* 2"  (cost=2892.32..329039.59 rows=102569 width=20) (actual time=45.984..540.047 rows=113209 loops=1)
                                                                    ->  Bitmap Heap Scan on track track_1  (cost=2892.32..328013.90 rows=102569 width=16) (actual time=45.982..530.188 rows=113209 loops=1)
                                                                          Recheck Cond: ((("position")::text = '2'::text) AND (duration >= 161000) AND (duration <= 167000))
                                                                          Heap Blocks: exact=108172
                                                                          ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..2866.68 rows=102569 width=0) (actual time=27.499..27.500 rows=113209 loops=1)
                                                                                Index Cond: ((("position")::text = '2'::text) AND (duration >= 161000) AND (duration <= 167000))
                                            ->  Subquery Scan on "*SELECT* 3"  (cost=4155.33..446451.87 rows=147354 width=20) (actual time=67.255..815.769 rows=177492 loops=1)
                                                  ->  Bitmap Heap Scan on track track_2  (cost=4155.33..444978.33 rows=147354 width=16) (actual time=67.253..800.264 rows=177492 loops=1)
                                                        Recheck Cond: ((("position")::text = '3'::text) AND (duration >= 207000) AND (duration <= 213000))
                                                        Heap Blocks: exact=166223
                                                        ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..4118.49 rows=147354 width=0) (actual time=38.543..38.544 rows=177492 loops=1)
                                                              Index Cond: ((("position")::text = '3'::text) AND (duration >= 207000) AND (duration <= 213000))
                          ->  Subquery Scan on "*SELECT* 4"  (cost=3088.04..347870.65 rows=109449 width=20) (actual time=42.540..494.850 rows=109577 loops=1)
                                ->  Bitmap Heap Scan on track track_3  (cost=3088.04..346776.16 rows=109449 width=16) (actual time=42.538..485.399 rows=109577 loops=1)
                                      Recheck Cond: ((("position")::text = '4'::text) AND (duration >= 167000) AND (duration <= 173000))
                                      Heap Blocks: exact=105027
                                      ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..3060.68 rows=109449 width=0) (actual time=24.663..24.663 rows=109577 loops=1)
                                            Index Cond: ((("position")::text = '4'::text) AND (duration >= 167000) AND (duration <= 173000))
        ->  Subquery Scan on "*SELECT* 5"  (cost=2554.41..295236.25 rows=90497 width=20) (actual time=28.559..357.307 rows=90455 loops=1)
              ->  Bitmap Heap Scan on track track_4  (cost=2554.41..294331.28 rows=90497 width=16) (actual time=28.558..349.863 rows=90455 loops=1)
                    Recheck Cond: ((("position")::text = '5'::text) AND (duration >= 156000) AND (duration <= 162000))
                    Heap Blocks: exact=87153
                    ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..2531.78 rows=90497 width=0) (actual time=16.112..16.112 rows=90455 loops=1)
                          Index Cond: ((("position")::text = '5'::text) AND (duration >= 156000) AND (duration <= 162000))
Planning Time: 0.621 ms
JIT:
  Functions: 43
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.811 ms, Inlining 46.029 ms, Optimization 169.492 ms, Emission 95.533 ms, Total 316.865 ms
Execution Time: 3069.140 ms
Bitmap Heap Scan on track_with_duration  (cost=2884.32..4012.21 rows=1 width=861) (actual time=30.243..31.718 rows=187 loops=1)
  Recheck Cond: ((track_1 >= 119000) AND (track_1 <= 125000) AND (track_5 >= 156000) AND (track_5 <= 162000))
  Filter: ((track_2 >= 161000) AND (track_2 <= 167000) AND (track_3 >= 207000) AND (track_3 <= 213000) AND (track_4 >= 167000) AND (track_4 <= 173000))
  Rows Removed by Filter: 1861
  Heap Blocks: exact=2028
  ->  BitmapAnd  (cost=2884.32..2884.32 rows=286 width=0) (actual time=29.864..29.865 rows=0 loops=1)
        ->  Bitmap Index Scan on track_with_duration_track_1  (cost=0.00..1012.90 rows=50847 width=0) (actual time=11.551..11.551 rows=58032 loops=1)
              Index Cond: ((track_1 >= 119000) AND (track_1 <= 125000))
        ->  Bitmap Index Scan on track_with_duration_track_5  (cost=0.00..1871.16 rows=93873 width=0) (actual time=16.223..16.223 rows=90455 loops=1)
              Index Cond: ((track_5 >= 156000) AND (track_5 <= 162000))
Planning Time: 0.926 ms
Execution Time: 32.672 ms
As can be seen in the query plans above, querying the narrow table was over 90x slower compared to the wide table (3070ms vs 34ms).
There are several possible reasons the wide table is faster:
The wide table is almost 8.5x smaller than the narrow table (16,711,759 rows vs 141,308,213 rows). This reduces the amount of data that needs to be processed in several ways:
per-row overhead is reduced
track position information is stored implicitly in the table structure, instead of explicitly in every row
Position information is not written to the indexes; instead the
index is chosen based on the column name in the query. This allows
the index to be smaller. This can be seen in the query plan above,
where only the indexes for track_1 & track_5 are used, the other
indexes are not scanned, saving disk reads. However, the
disadvantage is that the possible position values need to be
enumerated beforehand, and an index created for each value. This is
possible in our case, which is why a wide table was a good fit.
Better statistics can be kept on the wide table (e.g. duration_99 will have mostly NULL values, because few media have that many
tracks), because Postgresql keeps per-column
statistics.
This allows PostgreSQL to choose more selective indexes, so that fewer disk pages need to be read. This can be seen in the query plan
above, where the wide table only required 2 of the indexes to be scanned, and the planner's estimate was very close to the actual
number of rows matched.
While a wide table speeds up querying, it has tradeoffs that make it unsuitable for use in many scenarios. The main ones are:
O(n^(k-1)) columns, where k is the number of dimensions, and
n is the number of possible values for each dimensions. Even with 3 dimensions & 99 possible values, the number of columns needed becomes
9801, which is far above the limit of 1600 columns in a PostgreSQL table.In conclusion, a wide table provides a significant benefit to querying performance, assuming the data can be formatted as a wide table.
Waqqas Dadabhoy
See other articles by Waqqas
Ground Floor, Verse Building, 18 Brunswick Place, London, N1 6DZ
108 E 16th Street, New York, NY 10003
Join over 111,000 others and get access to exclusive content, job opportunities and more!