Skip to content

Index Bug #160

@poppindouble

Description

@poppindouble

Our database has some index out of sync bugs.

What is the issue

Steps to reproduce the bug

  1. Insert two json files with the following format
{
	"id": 1,
	"age": 70, 
	"name": "Josh",
},
{
	"id": 2,
	"age": 80,
	"name": "David"
}

Let's assume the id of these two Unit are 1 and 2.

  1. Create Index

We call execute_create_index to create index for current grouping.

Till so far, our index works as expected. We can query our db with select by age=70, or select by age=80, we will get the expected result.

  1. Now we will use the unique feature in ImmuxDB, we insert following json file:
{
	"id": 1,
	"age": 100, 
	"name": "Josh",
}

We are changing the age from 70 to 100.

  1. If we query our db with select by age=70, or select by age=100, both of the query will get the result:
{
	"id": 1,
	"age": 100, 
	"name": "Josh",
}

Which is not the expected result of select by age=70.

This bug will also lead to the problem of versioning, similar to the above example, after we call execute_create_index, if we do a revert operation, the index will have the result like above, which is out of sync.

The reason behind this bug

  1. When we call execute_create_index, we are actually only creating index for current db, let's have a look in our execute_create_index function.
        let prefix: StoreKeyFragment = grouping.marshal().into();
        let get_by_prefix = Instruction::DataAccess(DataInstruction::Read(
            DataReadInstruction::GetMany(GetManyInstruction {
                height: None,
                targets: GetManyTargetSpec::KeyPrefix(prefix),
            }),
        ));

We are using prefix here to extract all the unit in current grouping, which is the "newest" unit for each unit_id, our history data are stored in the InstructionRecord, which are not being "indexed".

  1. When we insert a new unit which has the same unit_id as the previous unit (step 3 in the above example), filed age is already being recorded in indexed_names, the updates_for_index will be called:
let reverse_index: ReverseIndex = {
        let mut index = ReverseIndex::new();
        for target in &insert.targets {
            match &target.content {
                UnitContent::JsonString(json_string) => {
                    match serde_json::from_str::<JsonValue>(json_string) {
                        Err(_error) => continue,
                        Ok(json_value) => {
                            for name in indexed_names.clone() {
                                index.index_new_json(target.id, &json_value, &name)?;
                            }
                        }
                    }
                }
                _ => continue,
            }
        }
        index
    };

if the new inserted unit has the same unit_id as before, which will leads to both old id_list_key and the new id_list_key will point to the same unit_id.

The versioning problem also comes out from this reason, when we do revert operation, essentialy, we just insert a new unit with some existed unit_id.

but we didn't update our index properly in the following code:

match self.get_journal(key) {
            Err(error) => Err(error),
            Ok(mut journal) => {
                match find_appropriate_height(&journal.update_heights, &target_height) {
                    None => Err(VkvError::CannotFindSuitableVersion.into()),
                    Some(height) => {
                    	// didn't update index here.
                        let value = self.get_value_after_height(key, &height)?;
                        journal.update_heights.push(next_height);
                        journal.value = value;
                        self.set_journal(key, &journal)
                    }
                }
            }
        }

How to fix it.

The essential problem of this bug is that the index is out of sync, only the "newest" unit are indexed when we call "execute_create_index", and then when "new" unit with the same unit_id are inserted, our index system is not updated properly, this particular unit_id are existed in both id_list(corresponding id_list_key are get_store_key_of_indexed_id_list(&insert.grouping, &property_name, &old_unit_content); and get_store_key_of_indexed_id_list(&insert.grouping, &property_name, &new_unit_content);).

I have two ideas how to solve this problem:

Solution 1

When we call "execute_create_index", we create index for everything including history data. This solution requires that we need to scan all history data as well. This can be achieved by following steps:

  1. We read all the units in the specific grouping(We can achieve this by prefix).
  2. We have all the unit_id from step 1, and for each unit_id, we do inspect.
  3. For each result of inspect is the historical data regarding to that specific unit_id, at this point, we can know these informations: grouping, unit_id, unit_content, NameProperty, as well as its chain_height.
  4. Instead of saving the unit_id only, we should also pass in chain_height as well.
pub fn add_to_index(&mut self, name: &PropertyName, property_bytes: &[u8], id: UnitId) -> () {
        let key = (name.to_owned(), property_bytes.to_owned());
        match self.inner.get(&key) {
            Some(id_list) => {
                let mut new_list = id_list.to_owned();
                // new_list.push(id); Instead of only saving the id
                // we should save the combination of (unit_id, chain_height)
                self.inner.insert(key, new_list)
            }
            None => {
                // let new_list = IdList::new(vec![id]);
                // same as above
                self.inner.insert(key, new_list)
            }
        };
    }

So, we are not only saving unit_id but also saving its corresponding chain_height together. I don't
have a proper name for the combination of (unit_id, chain_height) for now, if you can help with naming, that would be great.

When we are query our DB with SelectCondition::NameProperty, what we get will be a list of the combination of (unit_id, chain_height), so, the following problem is how to get the unit when we know its unit_id and chain_height. We don't have a specific API for this kind of query yet, but we can use inspect to get it.

For this solution, we also need to remember to update our index when we do revert operation.

Solution 2

We only keep the "newest" index, for each corresponding unit_id, we don't support SelectCondition::NameProperty for its historical data.
In the above example, when we query our db with select by age=70, or select by age=100, both of the query will get the result:

{
	"id": 1,
	"age": 100, 
	"name": "Josh",
}

but with solution 2, we will only get the above result if we do select by age=100. We don't get the above result if we query our db with select by age=70. At least our db is correct in this way.

To achieve this, we need to keep our index update to date, which means we can keep our create_index as it is for now, but every time when we insert a new unit, or we revert unit or units, we need to update our db's index.

To update our index, we need to do a read operation regarding to this unit_id first, find out what is the unit_contents of this unit_id in current db, then, we go through all the NameProperty, we need to do a query SelectCondition::NameProperty, and delete its unit_id in the corresponding indexed_id_list.

After we call execute_create_index, We need to do this update_index operation in several places in our codebase.

  1. Everytime when we insert a new unit or multi units.
  2. Everytime when we revert a unit or multi units.

Comparison

Solution 1:

Pros:

  1. Give us really nice query feature, we can query historical data by SelectCondition::NameProperty as well. the cost in insert and revert is lower then solution 2.

Cons:

  1. really heavy work load at the execute_create_index period, it go through all the historical data in the whole grouping.
  2. a huge problem when we do SelectCondition::NameProperty on some "hot" data, since everytime we query by NameProperty, for each existed unit_id and chian_height, we involve one inspect operation, which is really costly.

Solution 2:

Pros:

  1. not that much work load at the execute_create_index period.
  2. SelectCondition::NameProperty is not costly.

Cons:

  1. We lost the feature that we can index historical data.
  2. Since we need to update our indexed_id_list, everytime when we insert a new unit or units, or revert a unit or units, before we can actually perform these operations, we need to do following steps:

(I) A read operation regarding to this unit_id.
(II) Go through all the NameProperty, for each NameProperty, do a query SelectCondition::NameProperty.
(III) Delete its unit_id in the corresponding indexed_id_list

insert and revert operation in solution 2 is costly then solution 1.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions