CTO Tech Notes | Zion's No-Code Search Solution, Explained

Introduction 介绍:

Zion is a cloud-native no-code IDE (integrated development environment) that we at FunctorZ have been developing since early 2020. It aims to significantly lower the barrier to building an application capable of serving requests from a non-insignificant crowd. A very common requirement is text search, and I would like to share some of my thoughts on it.

TL;DR:

Breaking down the requirement 拆解需求:

A lot is implied when our customers ask the question: "Do you support search?" After some digging, I narrowed the implied requirements down to the following.
  • Exact matches. e.g. SELECT * FROM account WHERE phone_number = '+8612312345678'. This is the easiest type to satisfy.
  • Prefix matches. e.g. SELECT * FROM post WHERE post.title ILIKE 'functorz%'. As we shall see later, these are quite easy as well.
  • Suffix matches. e.g. SELECT * FROM post WHERE post.title ILIKE '%functorz'. These look similar to prefix matches, and indeed are quite similar.
  • Infix matches. e.g. SELECT * FROM post WHERE post.title ILIKE '%functorz%'. These are quite a bit more difficult to deal with, and require a different strategy entirely.
  • Regular-expression matches. e.g. SELECT * FROM post WHERE post.title ~ '(functorz|zion)'. Yeah, a tough nut to crack.
  • Full-text searches. e.g. SELECT * FROM post WHERE post.content @@ to_tsquery('functorz & zion'). (Postgres syntax) This one is especially difficult to deal with, and we shall reserve this for a separate post.
In addition to those "search" requirements, our customers frequently place a "unique" constraint on the text content. Although this seems innocuous on the surface, it can become a huge pain under certain circumstances.
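Before any of these patterns reach SQL, user input has to be escaped: %, _, and \ are LIKE metacharacters. A small Python sketch (the helper names are mine, not part of Zion) of how the first four kinds of matches map onto LIKE patterns:

```python
def escape_like(term: str) -> str:
    """Escape LIKE metacharacters so the term matches literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def like_pattern(term: str, kind: str) -> str:
    """Build a LIKE pattern for exact/prefix/suffix/infix matching."""
    t = escape_like(term)
    return {
        "exact": t,
        "prefix": t + "%",
        "suffix": "%" + t,
        "infix": "%" + t + "%",
    }[kind]
```

For example, like_pattern("50%_off", "infix") yields "%50\%\_off%", matching the literal text instead of treating % and _ as wildcards.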

Relevant background information 相关背景信息:

Our product Zion is opinionated. We are a start-up team, very cognizant of the resources we have, and being opinionated allows us not to get bogged down by the innumerable combinations of technology choices at every level of the stack. We also wanted as few components as possible while meeting our customers' needs.
Our database of choice is Postgres, and our API was initially generated by Hasura (currently in the process of being replaced). Hasura is a GraphQL backend that essentially provides a pass-through SQL interface and does not offer the ability to inject custom logic for specific types of queries. That means whatever we do to optimize the SQL queries generated by all those different kinds of "searches" has to happen inside the database itself.

Route to optimization 优化路径:

Setup 配置

It is not scientific, but should be representative:
  • Surface Book 2, i7-8650U + 16GB + 512GB.
  • Postgres 14 installed on Windows' NTFS (not WSL2).
  • Populated via:
CREATE DATABASE functorz ENCODING 'UTF-8';
CREATE TABLE simple (id serial PRIMARY KEY, content TEXT);
INSERT INTO simple (content) SELECT gen_random_uuid() FROM generate_series(1, 1000000);

Exact Matches  精确匹配

This one is simple enough. All it takes is to create any kind of index that supports equality tests. Before any optimization, let's establish the baseline.
functorz=# explain analyze select * from simple where content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf';
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..15554.43 rows=1 width=41) (actual time=144.976..172.778 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on simple  (cost=0.00..14554.33 rows=1 width=41) (actual time=35.305..35.305 rows=0 loops=3)
         Filter: (content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
         Rows Removed by Filter: 333333
 Planning Time: 1.331 ms
 Execution Time: 172.798 ms
(8 rows)
Notice the parallel sequential scan. Now let's just create a vanilla B-tree index on content.
functorz=# CREATE INDEX idx_simple_by_content_b_tree ON simple (content);
CREATE INDEX
functorz=# explain analyze select * from simple where content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_simple_by_content_b_tree on simple  (cost=0.42..8.44 rows=1 width=41) (actual time=0.287..0.288 rows=1 loops=1)
   Index Cond: (content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
 Planning Time: 0.394 ms
 Execution Time: 0.323 ms
(4 rows)
We now have an index scan, and execution time drops from hundreds of milliseconds to hundreds of microseconds, a roughly 500x difference on this 1-million-row dataset.

Prefix matches 前缀匹配

A bit more interesting obviously, so let's try these with the previously created B-tree index on content.
functorz=# explain analyze select * from simple where content like 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%';
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..15564.33 rows=100 width=41) (actual time=0.528..100.149 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on simple  (cost=0.00..14554.33 rows=42 width=41) (actual time=26.435..52.645 rows=0 loops=3)
         Filter: (content ~~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%'::text)
         Rows Removed by Filter: 333333
 Planning Time: 1.960 ms
 Execution Time: 100.194 ms
(8 rows)
Odd, we are not going through the index at all; we ended up with a parallel sequential scan. One would think prefix matches should be indexable, since we can quite simply express one as a conjunction of two range conditions, and B-trees are perfect for those queries. So let's try that.
functorz=# explain analyze select * from simple where content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf' and content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg';
                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_simple_by_content_b_tree on simple  (cost=0.42..8.45 rows=1 width=41) (actual time=0.113..0.113 rows=1 loops=1)
   Index Cond: ((content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text) AND (content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg'::text))
 Planning Time: 0.167 ms
 Execution Time: 0.128 ms
(4 rows)
Amazing! The index is performing its magic tricks again! So it appears that Postgres wasn't smart enough to figure out this transformation. Oh, what happened to SQL's promise of "let users express what data to retrieve, let the engine underneath take care of seamlessly retrieving it"? It turns out that the reason behind Postgres' "inability" to translate "LIKE 'value%'" was the limitations of the >= and < operators. Both only operate on a byte-value level, and once you have some collation other than "C", things can break down very quickly. So that "inability" was really Postgres' attempt to protect us from ourselves. The correct operators to use turn out to be "~>=~" and "~<~". Let's bask in the light of some magic:
functorz=# create index idx_simple_by_content_b_tree_text_pattern_ops on simple (content text_pattern_ops);
CREATE INDEX
functorz=# explain analyze select * from simple where content like 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%';
                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_simple_by_content_b_tree_text_pattern_ops on simple  (cost=0.42..2.65 rows=100 width=41) (actual time=0.041..0.042 rows=1 loops=1)
   Index Cond: ((content ~>=~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text) AND (content ~<~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cg'::text))
   Filter: (content ~~ 'f0f453fd-8f7c-44f2-bcc0-124f706786cf%'::text)
 Planning Time: 1.785 ms
 Execution Time: 0.053 ms
(5 rows)
That "text_pattern_ops" is all the "nudge" Postgres needed. "text_pattern_ops" is an "opclass". "The operator class identifies the operators to be used by the index for that column," says Postgres' official documentation. Usually we do not specify one, which just means that Postgres will choose the default operator class for that column's data type. In addition to =, the default operator class supports the following operators: <, <=, >, and >=. And that was why the "LIKE" operator on two text values, which is implemented by Postgres' "~~", wasn't getting the proper treatment. In the "Index Cond" line above, we can clearly see that the index condition involves the text versions of ">=" and "<", namely "~>=~" and "~<~". And Postgres does indeed translate the "~~" condition for us!
A natural follow up would be, what are all the operators supported by "text_pattern_ops"?
functorz=# SELECT am.amname AS index_method,
           opf.opfname AS opfamily_name,
           amop.amopopr::regoperator AS opfamily_operator
    FROM pg_am am, pg_opfamily opf, pg_amop amop
    WHERE opf.opfmethod = am.oid
      AND amop.amopfamily = opf.oid
      AND opf.opfname = 'text_pattern_ops';
 index_method |  opfamily_name   | opfamily_operator
--------------+------------------+-------------------
 btree        | text_pattern_ops | ~<~(text,text)
 btree        | text_pattern_ops | ~<=~(text,text)
 btree        | text_pattern_ops | =(text,text)
 btree        | text_pattern_ops | ~>=~(text,text)
 btree        | text_pattern_ops | ~>~(text,text)
 hash         | text_pattern_ops | =(text,text)
(6 rows)
You might have guessed: since we dropped the index that supported the plain <, <=, >, and >= operators, some queries now suffer.
functorz=# explain analyze select * from simple where content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf' and content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg';
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16596.10 rows=1 width=41) (actual time=0.648..204.607 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on simple  (cost=0.00..15596.00 rows=1 width=41) (actual time=89.163..148.741 rows=0 loops=3)
         Filter: ((content >= 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text) AND (content < 'f0f453fd-8f7c-44f2-bcc0-124f706786cg'::text))
         Rows Removed by Filter: 333333
 Planning Time: 0.586 ms
 Execution Time: 204.665 ms
(8 rows)
And yes, now we are back to sequential scans. Though one could very reasonably argue, few requirements would involve byte-wise string range lookups. And you could always choose to keep the original B-tree index.
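The range rewrite that text_pattern_ops enables can be sketched in a couple of lines: the prefix itself is the inclusive lower bound, and bumping the prefix's last character by one gives the exclusive upper bound. A simplified Python sketch (the function name is mine; it ignores the edge case where the last character is already the maximum value, which Postgres handles more carefully):

```python
def prefix_to_range(prefix: str) -> tuple[str, str]:
    """Turn `LIKE 'prefix%'` into a half-open range [lower, upper),
    suitable for a byte-ordered (C-collation) B-tree.
    Simplified: assumes the last character can be incremented."""
    lower = prefix
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return lower, upper
```

This reproduces the bounds seen in the plans above: the UUID prefix ending in "cf" yields an upper bound ending in "cg".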

Suffix matches 后缀匹配

Now that we know which operators "text_pattern_ops" supports, it is not difficult to conclude that we can't express LIKE '%something' solely in terms of the text versions of less-than, less-than-or-equal-to, equal-to, greater-than, and greater-than-or-equal-to (give it a try if you don't believe me). And that is reflected in Postgres' query plan:
functorz=# explain analyze select * from simple where content like '%f0f453fd-8f7c-44f2-bcc0-124f706786cf';
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..14388.26 rows=100 width=41) (actual time=0.599..178.746 rows=1 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Parallel Seq Scan on simple  (cost=0.00..13378.26 rows=32 width=41) (actual time=27.477..52.040 rows=0 loops=4)
         Filter: (content ~~ '%f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
         Rows Removed by Filter: 250000
 Planning Time: 0.114 ms
 Execution Time: 178.763 ms
(8 rows)
As expected, we have reverted to a sequential scan. However, once we reverse the string, "%query" effectively becomes "query%". So let's go that route.
functorz=# create index idx_simple_by_reverse_content_b_tree_text_pattern_ops on simple (reverse(content) text_pattern_ops);
CREATE INDEX
functorz=# explain analyze select * from simple where reverse(content) like 'fc687607f421-0ccb-2f44-c7f8-df354f0f%';
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on simple  (cost=92.38..4250.57 rows=5000 width=41) (actual time=0.023..0.023 rows=1 loops=1)
   Filter: (reverse(content) ~~ 'fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_simple_by_reverse_content_b_tree_text_pattern_ops  (cost=0.00..91.13 rows=5000 width=0) (actual time=0.015..0.015 rows=1 loops=1)
         Index Cond: ((reverse(content) ~>=~ 'fc687607f421-0ccb-2f44-c7f8-df354f0f'::text) AND (reverse(content) ~<~ 'fc687607f421-0ccb-2f44-c7f8-df354f0g'::text))
 Planning Time: 0.162 ms
 Execution Time: 0.047 ms
(7 rows)
And yes, as expected, we have the beautiful performance afforded to us by the lovely index.
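On the application side the trick is mechanical: to serve `content LIKE '%suffix'`, reverse the suffix and issue a prefix match against `reverse(content)`. A tiny sketch (the function name is mine):

```python
def suffix_to_reverse_prefix(suffix: str) -> str:
    """Rewrite `content LIKE '%suffix'` as a prefix pattern against
    reverse(content), so the reversed-content index can serve it."""
    return suffix[::-1] + "%"
```

Feeding it the UUID from the example produces exactly the pattern used in the query above.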

Infix matches 中缀匹配

Looking at infix matches, it is not possible to translate "LIKE '%query%'" into any combination of the "~>~" and "~<~" operators. What we need now is native "~~" operator support in the index. To find such an index type, we first need to find the opclasses that support the "~~" operator, and then find the index types compatible with those opclasses. It turns out that none of Postgres' default opclasses support "~~(text, text)", and one can run the following SQL to verify:
SELECT am.amname AS index_method,
       opf.opfname AS opfamily_name,
       amop.amopopr::regoperator AS opfamily_operator
FROM pg_am am, pg_opfamily opf, pg_amop amop
WHERE opf.opfmethod = am.oid
  AND amop.amopfamily = opf.oid
  AND amop.amopopr::regoperator = '~~(text, text)'::regoperator;
However, an extension that ships with Postgres by default, called "pg_trgm", supports "~~(text, text)". As the name suggests, this extension gives Postgres some trigram-related abilities. So let us install it with CREATE EXTENSION pg_trgm; and try the index types it supports.
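Before building the indexes, it helps to see what pg_trgm actually extracts from a string. A rough pure-Python approximation of its trigram extraction (simplified: the real extension also splits the input into words on non-alphanumeric characters; the padding and windowing shown here follow its documented behavior):

```python
def trigrams(word: str) -> set[str]:
    """Approximate pg_trgm's trigram extraction for a single
    alphanumeric word: lowercase, pad with two leading blanks and
    one trailing blank, then take every 3-character window."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}
```

trigrams("0ccb") yields only five trigrams, which is why a short pattern like '%0ccb%' leaves the index with many candidate rows to recheck, while a long UUID pattern narrows the candidates to almost nothing.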
functorz=# create index idx_simple_by_content_gist_trgm_ops on simple USING GIST (content gist_trgm_ops);
CREATE INDEX
Time: 28290.219 ms (00:28.290)
functorz=# explain analyze select * from simple where content like '%fc687607f421-0ccb-2f44-c7f8-df354f0f%';
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_simple_by_content_gist_trgm_ops on simple  (cost=0.41..115.46 rows=100 width=41) (actual time=10.810..10.811 rows=0 loops=1)
   Index Cond: (content ~~ '%fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
 Planning Time: 1.145 ms
 Execution Time: 10.829 ms
(4 rows)
functorz=# explain analyze select * from simple where content like '%0ccb%';
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_simple_by_content_gist_trgm_ops on simple  (cost=0.41..115.46 rows=100 width=41) (actual time=1.774..104.431 rows=256 loops=1)
   Index Cond: (content ~~ '%0ccb%'::text)
   Rows Removed by Index Recheck: 24
 Planning Time: 1.793 ms
 Execution Time: 104.460 ms
(5 rows)
functorz=# create index idx_simple_by_content_gin_trgm_ops on simple USING GIN (content gin_trgm_ops);
CREATE INDEX
Time: 18051.048 ms (00:18.051)
functorz=# explain analyze select * from simple where content like '%fc687607f421-0ccb-2f44-c7f8-df354f0f%';
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on simple  (cost=200.98..311.19 rows=100 width=41) (actual time=11.378..11.379 rows=0 loops=1)
   Recheck Cond: (content ~~ '%fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
   ->  Bitmap Index Scan on idx_simple_by_content_gin_trgm_ops  (cost=0.00..200.95 rows=100 width=0) (actual time=11.375..11.375 rows=0 loops=1)
         Index Cond: (content ~~ '%fc687607f421-0ccb-2f44-c7f8-df354f0f%'::text)
 Planning Time: 0.294 ms
 Execution Time: 11.505 ms
(6 rows)
functorz=# explain analyze select * from simple where content like '%0ccb%';
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on simple  (cost=12.88..123.09 rows=100 width=41) (actual time=0.472..0.863 rows=256 loops=1)
   Recheck Cond: (content ~~ '%0ccb%'::text)
   Rows Removed by Index Recheck: 24
   Heap Blocks: exact=279
   ->  Bitmap Index Scan on idx_simple_by_content_gin_trgm_ops  (cost=0.00..12.85 rows=100 width=0) (actual time=0.421..0.421 rows=280 loops=1)
         Index Cond: (content ~~ '%0ccb%'::text)
 Planning Time: 0.373 ms
 Execution Time: 0.923 ms
(8 rows)
It looks like GIST is slower to build than GIN, and much slower on shorter search patterns, so slow that GIST gets less than a 2x speedup compared to a parallel sequential scan on this 1M-row table. In fact, the GIST index takes more space, too, due to its tree structure.
functorz=# SELECT psai.indexrelname AS index_name,
           pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
           pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
    FROM pg_tables t
    LEFT JOIN pg_class c ON t.tablename = c.relname
    LEFT JOIN pg_index i ON c.oid = i.indrelid
    LEFT JOIN pg_stat_all_indexes psai ON i.indexrelid = psai.indexrelid
    WHERE t.schemaname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY 1;
                       index_name                       | table_size | index_size
--------------------------------------------------------+------------+------------
 idx_simple_by_content_b_tree                           | 73 MB      | 56 MB
 idx_simple_by_content_b_tree_text_pattern_ops          | 73 MB      | 56 MB
 idx_simple_by_content_gin_trgm_ops                     | 73 MB      | 121 MB
 idx_simple_by_content_gist_trgm_ops                    | 73 MB      | 200 MB
 idx_simple_by_reverse_content_b_tree_text_pattern_ops  | 73 MB      | 56 MB
 simple_pkey                                            | 73 MB      | 21 MB
So why do we even bother with GIST? Insert/update speed. Due to GIN's "inverted index" nature, each insertion/update can trigger updates to a huge number of pages, significantly slowing writes. Further discussion of GIN indexes deserves its own post.
functorz=# INSERT INTO simple_gin_only SELECT nextval('simple_gin_only_id_seq'), gen_random_uuid() FROM generate_series(1, 100000);
INSERT 0 100000
Time: 2761.322 ms (00:02.761)
functorz=# INSERT INTO simple_gist_only SELECT nextval('simple_gist_only_id_seq'), gen_random_uuid() FROM generate_series(1, 100000);
INSERT 0 100000
Time: 4084.976 ms (00:04.085)
functorz=# INSERT INTO simple_gin_only SELECT nextval('simple_gin_only_id_seq'), gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text FROM simple order by id limit 10000;
INSERT 0 10000
Time: 2669.833 ms (00:02.670)
functorz=# INSERT INTO simple_gist_only SELECT nextval('simple_gist_only_id_seq'), gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text FROM simple order by id limit 10000;
INSERT 0 10000
Time: 1263.832 ms (00:01.264)
What we observe is somewhat consistent with Postgres 9.4's documentation on this topic: as the size of the text grows, GIN's performance degradation is much more severe. So this becomes a trade-off: is your workload read-heavy or write-heavy?
Both GIN and GIST support a ton of text operations, and other than size and insert/update performance, there seems to be no particular reason left to rely on B-tree indexes for text fields.
functorz=# select * into simple_btree_only from simple;
functorz=# create index idx_simple_btree_only_by_content on simple_btree_only using btree (content text_pattern_ops);
CREATE INDEX
functorz=# INSERT INTO simple_btree_only SELECT nextval('simple_btree_only_id_seq'), gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text || gen_random_uuid()::text FROM simple order by id limit 10000;
INSERT 0 10000
Time: 412.707 ms

Regular-expression matches 正则表达式匹配

Remember when we said "both GIN and GIST support a ton of text operations"? Well the payback is quick on this one.
functorz=# explain analyze select * from simple_gin_only where content ~ 'ba1.*abc';
                                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on simple_gin_only  (cost=249.65..398.44 rows=135 width=41) (actual time=2.126..2.338 rows=37 loops=1)
   Recheck Cond: (content ~ 'ba1.*abc'::text)
   Rows Removed by Index Recheck: 28
   Heap Blocks: exact=65
   ->  Bitmap Index Scan on idx_simple_gin_only_by_content_gin_trgm_ops  (cost=0.00..249.61 rows=135 width=0) (actual time=2.106..2.106 rows=83 loops=1)
         Index Cond: (content ~ 'ba1.*abc'::text)
 Planning Time: 0.454 ms
 Execution Time: 2.384 ms
(8 rows)
functorz=# explain analyze select * from simple_gist_only where content ~ 'ba1.*abc';
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_simple_gist_only_by_content_gist_trgm_ops on simple_gist_only  (cost=0.41..156.79 rows=136 width=41) (actual time=7.026..264.619 rows=42 loops=1)
   Index Cond: (content ~ 'ba1.*abc'::text)
   Rows Removed by Index Recheck: 28
 Planning Time: 0.230 ms
 Execution Time: 264.655 ms
(5 rows)
Again, GIST seems like a rather clear loser, though it still performs better than a sequential scan. So the answer would most likely be: "just use GIN with trigrams".

Uniqueness constraint 唯一性约束

Neither GIN nor GIST indexes can be used to enforce a unique constraint. In fact, according to Postgres' own documentation, "currently, only B-tree indexes can be declared unique." That is usually not too much of a problem, but what about a huge string that doesn't fit in a single page? All B-tree entries must fit inside a page, which in most cases is 8kB in size. And if we try to create a (unique) index on a column that does contain strings larger than that, we are greeted with the following error message.
functorz=# insert into simple_btree_only values (nextval('simple_btree_only_id_seq'), repeat('a', 100000000));
ERROR:  index row requires 1144712 bytes, maximum size is 8191
But there are ways out. We can shorten the entries stored in a B-tree index by hashing the content first (the digest function below comes from the pgcrypto extension).
functorz=# CREATE UNIQUE INDEX udx_simple_btree_only_by_content_hash ON simple_btree_only USING btree (digest(content, 'sha512'));
CREATE INDEX
functorz=# insert into simple_btree_only values (1, 'f0f453fd-8f7c-44f2-bcc0-124f706786cf');
ERROR:  duplicate key value violates unique constraint "udx_simple_btree_only_by_content_hash"
DETAIL:  Key (digest(content, 'sha512'::text))=(\xcf187c6795e52b388d721afeae11e86d115742bf22deae5e360b028643b51a1a1d13dd646fa72b90c8fe16edd372c2eeb76cdebe6424b18ce0a0aecc76290889) already exists.
The problem with this approach is that although we now have a B-tree index that is orders of magnitude better at point selects than GIN or GIST indexes (the reason is again worth another post), we can't actually utilize it unless we modify our queries to operate on digest(content, 'sha512') rather than the plain content. Doable, though not great.
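If the rewriting is done in the application layer, it amounts to computing the same bytes that Postgres's digest(content, 'sha512') produces and binding them as the comparison value. A sketch using Python's standard hashlib (the SQL string is illustrative; the database is UTF-8 per the setup above):

```python
import hashlib

# Hypothetical lookup that can use the hash-based unique index:
LOOKUP_SQL = "SELECT * FROM simple_btree_only WHERE digest(content, 'sha512') = %s;"

def content_digest(content: str) -> bytes:
    """Mirror Postgres's digest(content, 'sha512') for a text value,
    which hashes the UTF-8 byte representation of the string."""
    return hashlib.sha512(content.encode("utf-8")).digest()
```

Binding content_digest(user_input) as the parameter lets the point select hit the expression index instead of scanning the table.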
Since we are already hashing via the "digest" function, why don't we just use a hash index? We can't directly create a unique hash index, as stated in the aforementioned documentation. However, for some reason that is yet unclear to me, we can create an exclusion constraint that automatically creates a hash index to support it. The same index can also be used for point selects.
functorz=# alter table simple_hash_only add constraint udx_simple_hash_only_by_content exclude using hash (content with =);
ALTER TABLE
Time: 3299.922 ms (00:03.300)
functorz=# explain analyze select * from simple_hash_only where content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf';
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using udx_simple_hash_only_by_content on simple_hash_only  (cost=0.00..2.22 rows=1 width=41) (actual time=0.037..0.039 rows=1 loops=1)
   Index Cond: (content = 'f0f453fd-8f7c-44f2-bcc0-124f706786cf'::text)
 Planning Time: 0.214 ms
 Execution Time: 0.065 ms
(4 rows)
functorz=# insert into simple_hash_only values (1, 'f0f453fd-8f7c-44f2-bcc0-124f706786cf');
ERROR:  conflicting key value violates exclusion constraint "udx_simple_hash_only_by_content"
DETAIL:  Key (content)=(f0f453fd-8f7c-44f2-bcc0-124f706786cf) conflicts with existing key (content)=(f0f453fd-8f7c-44f2-bcc0-124f706786cf).

Conclusion 结论:

The longer version:

TL;DR plus the following: If you are storing long strings (greater than at least a few hundred characters on average) and/or searching with long patterns, use GIST instead of GIN, especially when your workload is comparatively write-heavy. If you know or can enforce the maximum length of the stored strings, a B-tree can be used for both uniqueness and point selects.

P.S.

At Zion, we strongly believe that it is our duty to figure out, automatically, the technical details surrounding issues like text search for our users, and to apply the solutions for them automatically. It is not reasonable at all to expect the average Joe to dig into the innards of a database to figure out the best search strategy for his use case. To this end, we are constantly hiring brilliant minds to join us. If you are like me, someone who enjoys thoroughly understanding various technologies and is delighted by tough challenges, do not hesitate to contact me at jiangyaokai@functorz.com.
2022-08-25