-
Notifications
You must be signed in to change notification settings - Fork 17
Add BRIN support for spoint and sbox #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I can see, you are indenting using spaces. pgsphere follows PostgreSQL coding style guidelines. Could you please indent this code using tabs? You can try just run pg_indent over new source files.
Makefile
Outdated
gnomo.o | ||
|
||
EXTENSION = pg_sphere | ||
DATA_built = pg_sphere--1.0.sql | ||
DOCS = README.pg_sphere COPYRIGHT.pg_sphere | ||
REGRESS = init tables points euler circle line ellipse poly path box index \ | ||
contains_ops contains_ops_compat bounding_box_gist gnomo | ||
contains_ops contains_ops_compat bounding_box_gist gnomo spoint_brin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should spoint_brin test be used only on BRIN_CHECK?
brin.c
Outdated
} | ||
|
||
// | ||
// Define operators procedures below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PostgreSQL coding style, we evade C++ style comments.
In general looks very good, but please take a look to minor notes by my review. |
Thank you @akorotkov for the review. I'll fix the minor notes ASAP, and let you know when it will be done. |
Hi @akorotkov I've then fixed what you reported: I've used pgindent for source code, and corrected the Makefile in order to execute regression test of BRIN code just for proper PostgreSQL versions. Sorry for my late reply, it was a busy period. It should be now ready to be merged, but let me know if there are further needed fixes. |
You still have a tab on line 34 of the Makefile (where you added This is a fantastic contribution! |
Hi @esabol indeed I passed to pgindent only source code, not the Makefile. It works, but I can fix the style also for this. Thank you for your review! |
I've also fixed the Makefile as per @esabol suggestion: other lines where indented with spaces so I uniformly indented the added line (n. 34) with spaces too. |
I think we also need to bump extension version, so that existing users can upgrade. Another issue is instances pg_upgrade'd from pre-9.5. PostgreSQL doesn't offer built-in mechanism which could add them a BRIN index support. But we can do that using manual SQL-script and describe its usage in README. |
…in case of upgrades from pre9.5 releases
Hi @akorotkov I've added the following changes:
|
It's been a while. Any progress? |
@esabol I've applied the fixes required by @akorotkov, adding a command to support upgrades from pre-9.5 releases through pg_upgrade (similarly to what I've done for BRINs in PostGIS) and improving the documentation too. This is ready for the final review. In the meantime, I've also tried to perform some benchmark on an artificial sample of spherical points, varying the size of the sample itself. I'm attaching here one of the plot I've obtained: it reports the time needed to select 1 spherical point included in a particular spherical sector using both BRINs or GiST indices, as the size of the whole sample increase, keeping low values for the configured shared buffers in PostgreSQL in order to better see how BRINs are useful for big datasets (note the behaviours' change at ~10M points). |
Nice work. The performance degradation for less than ~10M points is disappointing, if I am being honest. |
@esabol it is expected that BRINs performance is lower than a tree index. Block Range INdices map per range, than the retrieved ranges after the index scan have to be sequentially inspected to find exact matches. In this sense, BRINs are more "lossy" than GiST indices. But the cost to inspect ranges' map is as lower than the cost to inspect a GiST index as the sample size is larger. BRINs are specifically thought for big dataset, more than an alternative of GiST indices. Furthermore, consider that the sample used to generate the plot is an artificial one, running on a small DB server. It would be good to see how the BRINs perform with a real dataset. |
I guess I'd just hoped that BRINs would see a benefit at ~1M points instead of ~10M. Many of our largest catalogs are in the millions of rows, not tens of millions. Of course, that will change over time. Catalogs only keep getting bigger. |
@esabol Block Range INdexing is configurable, i.e. you can set the number of data pages included in each range (in the above example, I considered 16 pages per range) as lower as you are able to be close to the desired performance. This will increase the effort for the maintenance of the index (time needed for creation, final size of the index, impact during insertions, time needed for resummarization) that will be more similar to the one needed for the GiST indices, but also the performance will be more similar for small datasets, keeping the advantages for larger ones. The parameter that configure this behaviour is the |
@gbroccolo, right. BRIN performance is strongly dependent on how keys are correlated with layout of tuples on heap pages. We can imagine two corner cases for BRIN performance:
So as I get, the experiment you did is close to the first scenario. In real-life, distribution could be different, and expected to be somewhere in the middle of these two corner cases. I would note, that I found experiment you did quite surprising. BRIN is expected to be slower on selective queries (low fraction of tuples is selected), and faster on non-selective queries (high fraction of tuples is selected). So, for me most straight-forward benchmark would be following: for some large-enough dataset run queries with different selectivity. With fraction of selected tuples higher than some threshold, BRIN should outperform GiST. As I get in your experiment, every query selects 1 point. And we can see serious degradation of GiST near 10^7 dataset size. I can explain it as following. Before 10^7 points, GiST index fits OS cache. After that, lookups in GiST index becomes more expensive, because of random disk seeks. While BRIN index performs better thanks to smaller size of sequential disk access pattern. In order to help me validate my thoughts, could you please provide more details on hard 8000 ware you use during benchmark: amount of RAM, type of disk (hard disk or SSD), PostgreSQL settings etc? |
Regarding pull request itself, I've some more notes:
|
Please excuse me, regrettably sscan.c had been prematurely deleted by me on the master branch. I have now reverted the deletion. |
@mnullmei, no problem, thanks. |
Hi @akorotkov, let me give further details about my benchmarks: I considered an Ubuntu Xenial VM with 2GB RAM, the storage unit was based on SSD, but consider the driver virtualization. I considered PostgreSQL 9.6 + pgSphere built from this current branch, and default configuration for PostgreSQL. The artificial dataset was inserted in the DB in order to keep spatially close objects as more as possible close in data page tuples on disk, so the benchmark was actually more similar to Apart of the graph I attached in my previous reply, I've produced several graphs, in order to inspect data access to disk, how performance changes with the selectivity of the query, and so on. I attach here the graph you requested, reporting the time needed to select a variable number of points included in a spherical sector, from 1 single point to 1 million. BRINs start to perform closer to GiST indices at ~O(100k) points, so performance start to increase as the selectivity is lower and lower. For a further validation of your thoughts, I'm also attaching the number of indices & data blocks accessed from the storage, as the selectivity of the query decreases. As you mentioned, the reduced size of the index allows it to be completely cached in the database's shared buffers in memory. As a consequence no index block needs to be read from disk in case of BRIN, and also the number of data blocks read from disk is quite lower than GiST case as the selectivity of the query decreases. Regarding the PR: let me rebase on master in order to include the fix of @mnullmei, and have a look to the |
@akorotkov I rebased the PR branch to the latest |
If anyone is interested in reviving this feature, I recommend submitting a PR to the currently active pgsphere repo: https://github.com/postgrespro/pgsphere |
Hi @esabol, to be fair I don't know anymore if the PR would be useful after 5 years. Maybe now catalogs with ~O(10M) start to be more common, anyhow it's a pity because the PR was ready with all the required fixes and changes came up already 5 years ago. Before starting working on this again, my question is: would the PR (considering the limitations of the Block Range INdexing) be still of interest for pgsphere? |
@gbroccolo wrote:
I don't know that I can answer that for all pgsphere users, but I do agree that ~O(10M) datasets are more common these days. That said, I have opened an issue to discuss the matter in the new repo. Looking at the changes in this PR, I don't think it would be that difficult to get a similar PR working with the new pgsphere repo. It's kind of up to you, I think. |
Update: There's now a working PR for adding BRIN support to the https://github.com/postgrespro/pgsphere/ repo that is based on @gbroccolo's PR here with essentially just some updates to the Makefile. |
As discussed in the mailing list, I opened the PR here. Support for searches within spherical circles will be added in the future.