WIP: perf(syncs): implement partial idempotent syncs interface #2309

Closed
zachgoll wants to merge 1 commit from zachgoll/partial-syncs into main
zachgoll commented 2025-05-26 18:17:02 +08:00 (Migrated from github.com)

The current behavior for a "data sync" of a "Syncable" (Family, PlaidItem, Account) is a "sync everything every time" model. This behavior was put in place early on to keep the data syncing process simple and predictable, and to avoid the burden of keeping state to determine the date from which each successive sync should start. With the size of our user base, this model has reached its limit, and we need to move towards "partial data syncs". This PR implements logic to keep track of each syncable's last known "good data window" and uses that information to decide where subsequent sync operations start.

Syncables are idempotent

Each Syncable record now has a date column called data_synced_through, which represents the latest date through which we have synced data. This allows for a simpler model to compute partial syncs:

  • Each Syncable knows what date window to sync next (it simply reads data_synced_through)
  • Processes that affect the state of a Syncable (e.g. updating an entry date or value) must remember to update data_synced_through so the "sync cache" is invalidated starting from the date on which the modification first affects the syncable's series of balances and state
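
To make the invalidation rule concrete, here is a minimal plain-Ruby sketch. `SyncableSketch` and `invalidate_from` are hypothetical names used only for illustration and are not part of this PR. It assumes the next sync restarts at the marker date itself, and that a nil marker already means "full sync".

```ruby
require "date"

# Hypothetical stand-in for an Account-like syncable (not the PR's model).
SyncableSketch = Struct.new(:data_synced_through) do
  # Move the marker back to the first affected date so the next sync
  # recomputes from there. The marker never moves forward here, and a
  # nil marker (full sync pending) is left untouched.
  def invalidate_from(date)
    return if data_synced_through.nil?

    self.data_synced_through = [data_synced_through, date].min
  end
end

account = SyncableSketch.new(Date.new(2025, 5, 15))
account.invalidate_from(Date.new(2025, 5, 10)) # an entry dated 2025-05-10 was edited
```

If an entry's date changes, both the old and the new date are affected, so a caller would presumably pass the earlier of the two.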

Syncs still track the sync window

While sync_later no longer accepts start/end date arguments, the Sync record still captures the window_start_date and window_end_date of each sync.

Furthermore, syncs can be triggered with sync_later(clear_cache: true) to perform a "full sync" that ignores and resets the data_synced_through column. These sorts of syncs should be used sparingly and only for data repair.
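
As a rough illustration of how the recorded window could follow from data_synced_through and clear_cache, consider the sketch below. The `sync_window` helper is hypothetical, and two details are assumptions rather than PR behavior: the window ends on the current day, and a nil start date stands for "from the beginning of the syncable's history".

```ruby
require "date"

# Hypothetical helper (not the PR's code): derive the window a Sync
# record would capture for one sync run.
def sync_window(data_synced_through:, today:, clear_cache: false)
  # A full sync ignores the marker, so the window has no lower bound.
  window_start = clear_cache ? nil : data_synced_through

  { window_start_date: window_start, window_end_date: today }
end

partial = sync_window(data_synced_through: Date.new(2025, 5, 15), today: Date.new(2025, 5, 20))
full    = sync_window(data_synced_through: Date.new(2025, 5, 15), today: Date.new(2025, 5, 20), clear_cache: true)
```

A syncable that has never been synced has a nil marker, so it would get the same unbounded window as a clear_cache sync.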

Repairing data

If a change in code logic affects the calculation of historical balances and requires a full re-sync of all syncables in the DB, the easiest way to force it is a quick scope.update_all(data_synced_through: nil), which forces each syncable to do a full sync the next time it syncs.

In the future, as sync logic becomes more stable, we may think about adding a SYNC_VERSION to handle this:

# Not implemented in PR, just documentation for future possible state

# config/initializers/sync_version.rb
SYNC_VERSION = 1

# app/models/sync.rb
if clear_cache || syncable.sync_version < SYNC_VERSION || syncable.data_synced_through.nil?
  # If any of these are true, perform a full sync
end

Handling sync direction

Depending on the type of account, we may sync data in either the forward or backward direction. For example, Plaid-connected accounts have a known, "source of truth" current balance that we start with and work backwards from in reverse chronological order, while manual accounts start at a balance of 0 and work chronologically forward to the current date.

Regardless of sync direction, data_synced_through is a chronological indicator.

So if the current day is 2025-05-20 and data_synced_through is 2025-05-15:

  • A reverse sync would start at 2025-05-20 and sync backwards to 2025-05-15
  • A forward sync would start at 2025-05-15 and sync forwards to 2025-05-20
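
The two directions above can be sketched in plain Ruby as follows. `sync_dates` is a hypothetical helper, not code from this PR. It assumes both endpoints are included in the window and only shows the order in which dates would be visited.

```ruby
require "date"

# Hypothetical helper (not the PR's code): list the dates a sync would
# visit, in processing order, for the given direction.
def sync_dates(data_synced_through:, today:, direction:)
  dates = (data_synced_through..today).to_a

  case direction
  when :forward then dates          # manual accounts: oldest to newest
  when :reverse then dates.reverse  # Plaid accounts: newest to oldest
  else raise ArgumentError, "unknown direction: #{direction.inspect}"
  end
end

forward = sync_dates(data_synced_through: Date.new(2025, 5, 15), today: Date.new(2025, 5, 20), direction: :forward)
reverse = sync_dates(data_synced_through: Date.new(2025, 5, 15), today: Date.new(2025, 5, 20), direction: :reverse)
```

Both calls cover the same six dates. Only the processing order differs, which is why data_synced_through can stay a purely chronological marker.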

Pull request closed
