8000 Implementation of a one-level PGM based on the Eytzinger array by RomaA2000 · Pull Request #22 · gvinciguerra/PGM-index · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@RomaA2000
Copy link
Contributor
@RomaA2000 RomaA2000 commented Apr 13, 2021

Since this structure implies the presence of binary search between levels, we thought that we could replace a regular array with one that is optimized for cache queries, accordingly, this speeds up binary search. We also added types for a single-level PGM and a BPGM with this optimization. In this version, we store both arrays, since after transferring the data over them, it is impossible to iterate in the desired order.

@gvinciguerra
Copy link
Owner
gvinciguerra commented Apr 15, 2021

Hi @RomaA2000, using the Eytzinger layout is a cool idea, thanks for opening this PR! I have a suggestion that should improve the cache efficiency and the memory usage of your implementation of PGMIndexEytzinger.

It seems that you are using an IdxHolder struct that stores a pair <value, position of the segment with key = value inside this->segments> so that when you search in EytzingerArray<IdxHolder> you can find the position of the segment in this->segments, correct?

Why not replacing this->segments and eytzinger_first_layer with an EytzingerArray<Segment>? This way you don't need to store additional sizeof(T) bytes for IdxHolder.value(because the segment's value is stored only once in the corresponding Segment struct), plus additional sizeof(size_t) bytes for IdxHolder.idx (because the segments are stored in the Eytzinger layout) 😉


For what concerns the BucketingPGMIndexEytzinger alias defined here:

using BucketingPGMIndexEytzinger = BucketingPGMIndex<K, Epsilon, TopLevelBitSize, Floating, PGMIndexEytzinger<K, Epsilon, Floating>>;
it is equivalent to a standard BucketingPGMIndex. This is because the argument PGMIndexEytzinger<K, Epsilon, Floating> passed to the fifth template parameter of BucketingPGMIndex (i.e. PGMType) is used only in this line of BucketingPGMIndex:
using Segment = typename PGMType::Segment;

But PGMIndexEytzinger<K, Epsilon, Floating>::Segment is defined as

using Segment = typename PGMIndex<K, Epsilon, 0,Floating>::Segment;

So BucketingPGMIndex::Segment == typename PGMType::Segment == PGMIndex<...>::Segment.

Nonetheless, combining BucketingPGMIndex with the Eytzinger layout is a cool idea too that is worth developing.


Thanks again @RomaA2000. I leave this PR open if you'd like to continue to work on this.

@gvinciguerra gvinciguerra added the help wanted Extra attention is needed label Apr 15, 2021
@gshanemiller
Copy link
gshanemiller commented Oct 1, 2022

Would completing this PR do anything (indirectly) for DynamicPGMIndex::lower_bound_bl because from what I can see DynamicPGMIndex::find spends 50-75% of its time in this function either in line 459 or 459 waiting on the prefetches. In consequence it pushes search performance into the next order of 10 per operation compared to PGMIndex e.g. 3-4x worse. I rarely need a few mutations, so I'll need to consider if should figure out how to improve DynamicPGMIndex or somehow extend PGMIndex to meet needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

help wanted Extra attention is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

0