Fuzzy searching integer ranges in PostgreSQL

Waqqas Dadabhoy

17 Feb 2021


onemusicapi.com allows searching for releases by passing in track positions and durations. The search allows the track durations to be off by a couple of seconds, so we need a fast method for performing a fuzzy search on the track durations.

Another requirement that complicates things is that not all of the input tracks need to match: a matching subset of the tracks (the exact fraction required depends on other factors) is enough for the medium to match. However, this won't be covered in detail here.

Solution 1 (narrow table)

This is the straightforward solution - a table containing the following columns:

(medium, track_id, position, duration)
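A minimal DDL sketch of this layout looks like the following (the column types and the uuid keys are assumptions for illustration; the real schema may differ):

-- Narrow layout: one row per track; durations are in milliseconds.
CREATE TABLE track (
    medium   uuid    NOT NULL,  -- the medium (e.g. disc) the track belongs to
    track_id uuid    NOT NULL,
    position text    NOT NULL,  -- track position, stored as text ('1' .. '99')
    duration integer NOT NULL   -- track duration in milliseconds
);

-- Composite index used by the range queries below.
CREATE INDEX track_position_duration_idx ON track (position, duration);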

It has one row per track, and a composite index on (position, duration). It can be queried fairly simply:

SELECT medium FROM track WHERE (position='1' AND duration BETWEEN 119000 AND 125000)
INTERSECT SELECT medium FROM track WHERE (position='2' AND duration BETWEEN 161000 AND 167000)
INTERSECT SELECT medium FROM track WHERE (position='3' AND duration BETWEEN 207000 AND 213000)
INTERSECT SELECT medium FROM track WHERE (position='4' AND duration BETWEEN 167000 AND 173000)
INTERSECT SELECT medium FROM track WHERE (position='5' AND duration BETWEEN 156000 AND 162000);

Solution 2 (wide table)

This is a more complex solution - a table containing a column for each possible position (in our case, position values up to 99 were enough):

(medium, track_1, track_2, ..., track_99)
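Writing out 99 columns and 99 single-column indexes by hand is tedious, so one option is to generate the DDL; this is a sketch under the assumption that each track_N column holds an integer duration in milliseconds:

-- Generate the wide table and one index per track_N column.
-- The uuid primary key and the integer column type are assumptions.
DO $$
DECLARE
    cols text;
BEGIN
    SELECT string_agg(format('track_%s integer', n), ', ' ORDER BY n)
      INTO cols
      FROM generate_series(1, 99) AS n;

    EXECUTE format('CREATE TABLE track_with_duration (medium uuid PRIMARY KEY, %s)', cols);

    FOR i IN 1 .. 99 LOOP
        EXECUTE format(
            'CREATE INDEX track_with_duration_track_%1$s ON track_with_duration (track_%1$s)', i);
    END LOOP;
END $$;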

It has one row per medium, and an index on each of the track_* columns. Building the query is slightly more complicated, because the column names have to be generated dynamically, but the eventual query is easy to read:

SELECT * FROM track_with_duration WHERE (track_1 BETWEEN 119000 AND 125000)
AND (track_2 BETWEEN 161000 AND 167000)
AND (track_3 BETWEEN 207000 AND 213000)
AND (track_4 BETWEEN 167000 AND 173000)
AND (track_5 BETWEEN 156000 AND 162000);
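For reference, one way to assemble that query from a list of (position, duration) pairs is to build the column names with format(); the VALUES list and the ±3000 ms tolerance below are chosen to reproduce the ranges in the example query:

-- Build the wide-table query text from (position, duration) input pairs.
SELECT 'SELECT * FROM track_with_duration WHERE '
       || string_agg(
              format('(track_%s BETWEEN %s AND %s)',
                     position, duration - 3000, duration + 3000),
              ' AND ' ORDER BY position) AS query_text
FROM (VALUES (1, 122000), (2, 164000), (3, 210000),
             (4, 170000), (5, 159000)) AS t(position, duration);

In the application the generated SQL would normally use bind parameters for the duration bounds rather than literals; the snippet above just shows the shape of the query being produced.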

Performance Comparison

Using the sample queries above, we can run EXPLAIN ANALYZE to compare the performance of the two approaches.

Narrow Table

HashSetOp Intersect  (cost=1841.49..1646797.47 rows=56862 width=20) (actual time=3015.493..3015.741 rows=187 loops=1)
  ->  Append  (cost=1841.49..1646429.07 rows=147359 width=20) (actual time=2635.974..3000.008 rows=90650 loops=1)
        ->  Result  (cost=1841.49..1350456.02 rows=56862 width=20) (actual time=2635.973..2636.224 rows=195 loops=1)
              ->  HashSetOp Intersect  (cost=1841.49..1349887.40 rows=56862 width=20) (actual time=2635.962..2636.199 rows=195 loops=1)
                    ->  Append  (cost=1841.49..1349471.62 rows=166311 width=20) (actual time=2113.309..2616.742 rows=109821 loops=1)
                          ->  Result  (cost=1841.49..1000769.41 rows=56862 width=20) (actual time=2113.308..2113.590 rows=244 loops=1)
                                ->  HashSetOp Intersect  (cost=1841.49..1000200.79 rows=56862 width=20) (actual time=2113.297..2113.562 rows=244 loops=1)
                                      ->  Append  (cost=1841.49..999690.25 rows=204216 width=20) (actual time=1249.196..2082.270 rows=179735 loops=1)
                                            ->  Result  (cost=1841.49..552217.31 rows=56862 width=20) (actual time=1249.195..1252.290 rows=2243 loops=1)
                                                  ->  HashSetOp Intersect  (cost=1841.49..551648.69 rows=56862 width=20) (actual time=1249.184..1252.075 rows=2243 loops=1)
                                                        ->  Append  (cost=1841.49..551228.95 rows=167896 width=20) (actual time=334.363..1206.468 rows=171241 loops=1)
                                                              ->  Subquery Scan on "*SELECT* 1"  (cost=1841.49..221349.88 rows=65327 width=20) (actual time=334.363..652.896 rows=58032 loops=1)
                                                                    ->  Bitmap Heap Scan on track  (cost=1841.49..220696.61 rows=65327 width=16) (actual time=334.359..647.252 rows=58032 loops=1)
                                                                          Recheck Cond: ((("position")::text = '1'::text) AND (duration >= 119000) AND (duration <= 125000))
                                                                          Heap Blocks: exact=56456
                                                                          ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..1825.16 rows=65327 width=0) (actual time=14.961..14.961 rows=58032 loops=1)
                                                                                Index Cond: ((("position")::text = '1'::text) AND (duration >= 119000) AND (duration <= 125000))
                                                              ->  Subquery Scan on "*SELECT* 2"  (cost=2892.32..329039.59 rows=102569 width=20) (actual time=45.984..540.047 rows=113209 loops=1)
                                                                    ->  Bitmap Heap Scan on track track_1  (cost=2892.32..328013.90 rows=102569 width=16) (actual time=45.982..530.188 rows=113209 loops=1)
                                                                          Recheck Cond: ((("position")::text = '2'::text) AND (duration >= 161000) AND (duration <= 167000))
                                                                          Heap Blocks: exact=108172
                                                                          ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..2866.68 rows=102569 width=0) (actual time=27.499..27.500 rows=113209 loops=1)
                                                                                Index Cond: ((("position")::text = '2'::text) AND (duration >= 161000) AND (duration <= 167000))
                                            ->  Subquery Scan on "*SELECT* 3"  (cost=4155.33..446451.87 rows=147354 width=20) (actual time=67.255..815.769 rows=177492 loops=1)
                                                  ->  Bitmap Heap Scan on track track_2  (cost=4155.33..444978.33 rows=147354 width=16) (actual time=67.253..800.264 rows=177492 loops=1)
                                                        Recheck Cond: ((("position")::text = '3'::text) AND (duration >= 207000) AND (duration <= 213000))
                                                        Heap Blocks: exact=166223
                                                        ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..4118.49 rows=147354 width=0) (actual time=38.543..38.544 rows=177492 loops=1)
                                                              Index Cond: ((("position")::text = '3'::text) AND (duration >= 207000) AND (duration <= 213000))
                          ->  Subquery Scan on "*SELECT* 4"  (cost=3088.04..347870.65 rows=109449 width=20) (actual time=42.540..494.850 rows=109577 loops=1)
                                ->  Bitmap Heap Scan on track track_3  (cost=3088.04..346776.16 rows=109449 width=16) (actual time=42.538..485.399 rows=109577 loops=1)
                                      Recheck Cond: ((("position")::text = '4'::text) AND (duration >= 167000) AND (duration <= 173000))
                                      Heap Blocks: exact=105027
                                      ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..3060.68 rows=109449 width=0) (actual time=24.663..24.663 rows=109577 loops=1)
                                            Index Cond: ((("position")::text = '4'::text) AND (duration >= 167000) AND (duration <= 173000))
        ->  Subquery Scan on "*SELECT* 5"  (cost=2554.41..295236.25 rows=90497 width=20) (actual time=28.559..357.307 rows=90455 loops=1)
              ->  Bitmap Heap Scan on track track_4  (cost=2554.41..294331.28 rows=90497 width=16) (actual time=28.558..349.863 rows=90455 loops=1)
                    Recheck Cond: ((("position")::text = '5'::text) AND (duration >= 156000) AND (duration <= 162000))
                    Heap Blocks: exact=87153
                    ->  Bitmap Index Scan on track_position_duration_idx  (cost=0.00..2531.78 rows=90497 width=0) (actual time=16.112..16.112 rows=90455 loops=1)
                          Index Cond: ((("position")::text = '5'::text) AND (duration >= 156000) AND (duration <= 162000))
Planning Time: 0.621 ms
JIT:
  Functions: 43
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.811 ms, Inlining 46.029 ms, Optimization 169.492 ms, Emission 95.533 ms, Total 316.865 ms
Execution Time: 3069.140 ms

Wide Table

Bitmap Heap Scan on track_with_duration  (cost=2884.32..4012.21 rows=1 width=861) (actual time=30.243..31.718 rows=187 loops=1)
  Recheck Cond: ((track_1 >= 119000) AND (track_1 <= 125000) AND (track_5 >= 156000) AND (track_5 <= 162000))
  Filter: ((track_2 >= 161000) AND (track_2 <= 167000) AND (track_3 >= 207000) AND (track_3 <= 213000) AND (track_4 >= 167000) AND (track_4 <= 173000))
  Rows Removed by Filter: 1861
  Heap Blocks: exact=2028
  ->  BitmapAnd  (cost=2884.32..2884.32 rows=286 width=0) (actual time=29.864..29.865 rows=0 loops=1)
        ->  Bitmap Index Scan on track_with_duration_track_1  (cost=0.00..1012.90 rows=50847 width=0) (actual time=11.551..11.551 rows=58032 loops=1)
              Index Cond: ((track_1 >= 119000) AND (track_1 <= 125000))
        ->  Bitmap Index Scan on track_with_duration_track_5  (cost=0.00..1871.16 rows=93873 width=0) (actual time=16.223..16.223 rows=90455 loops=1)
              Index Cond: ((track_5 >= 156000) AND (track_5 <= 162000))
Planning Time: 0.926 ms
Execution Time: 32.672 ms

As can be seen in the query plans above, querying the narrow table was over 90x slower than querying the wide table (roughly 3070 ms vs 34 ms, planning plus execution).

Analyzing the performance difference

There are several possible reasons the wide table is faster:

Table Size

The wide table has almost 8.5x fewer rows than the narrow table (16,711,759 rows vs 141,308,213 rows). This reduces the amount of data that needs to be processed in two ways:

  • per-row overhead is reduced

  • track position information is stored implicitly in the table structure, instead of explicitly in every row

Index Size

Position information is not written to the indexes; instead, the index is chosen based on the column name in the query, which allows each index to be smaller. This can be seen in the query plan above, where only the indexes for track_1 and track_5 are used; the other indexes are not scanned at all, saving disk reads. The disadvantage is that the possible position values need to be enumerated up front, with an index created for each value. That was possible in our case, which is why a wide table was a good fit.
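To see how much this saves on disk, the standard size functions can be compared across the two layouts (table names taken from the plans above; the exact numbers depend on your data):

-- Compare the on-disk footprint of the two layouts (heap vs. all indexes).
SELECT pg_size_pretty(pg_table_size('track'))                 AS narrow_table,
       pg_size_pretty(pg_indexes_size('track'))               AS narrow_indexes,
       pg_size_pretty(pg_table_size('track_with_duration'))   AS wide_table,
       pg_size_pretty(pg_indexes_size('track_with_duration')) AS wide_indexes;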

Better Table Statistics

Because PostgreSQL keeps per-column statistics, better statistics can be kept on the wide table (e.g. track_99 is mostly NULL, because few media have that many tracks). This allows the planner to pick the most selective indexes, so fewer disk pages need to be read. This can be seen in the query plan above, where only 2 of the wide table's indexes were scanned, and the per-index row estimates were close to the actual row counts.
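These per-column statistics can be inspected directly via the pg_stats view; for example, to compare how often the first and last position columns are NULL (ANALYZE must have run, and the output depends on your data):

-- Inspect the per-column statistics gathered by ANALYZE.
SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'track_with_duration'
  AND attname IN ('track_1', 'track_99');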

Wide table tradeoffs

While a wide table speeds up querying, it has tradeoffs that make it unsuitable for use in many scenarios. The main ones are:

  • The fields to query by (in our case, track position) need to be defined up-front, and have a limited set of possible values (in our case, the values are integers 1 to 99).
  • The approach only works with 2 dimensions to query by (in this case, track position and duration). With more dimensions, each possible combination of values needs its own column, which means O(n^(k-1)) columns, where k is the number of dimensions and n is the number of possible values per dimension. Even with 3 dimensions and 99 possible values, 9,801 columns would be needed, far above the limit of 1,600 columns in a PostgreSQL table.

In conclusion, a wide table provides a significant query-performance benefit, as long as the data can be reshaped to fit one.
