Transaction Data Sharding

Transaction data sharding means that the accounting journal can be split into multiple files.

The benefit of using a shard scheme for the accounting journal is more logical data management, especially when the journal contains many transactions, when multiple parties edit the journal, or when the journal is generated by an automated system.

Tackler supports journal sharding with both Filesystem and Git backends.

It is also possible to store each transaction in its own file; this is the so-called "single transaction - single file" mode. This sharding mode is used by the performance tests, and it is recommended if transaction data is generated by an automated system.

With the "single transaction - single file" mode, it is also recommended to use UUIDs in transaction metadata and to use the same UUID as part of the file name.

The transaction UUID is printed in the Register Report, so using UUIDs with transactions makes it easier to find the actual transaction file if there is ever a need to do so.
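The naming convention above can be sketched with a small shell function that writes one transaction into its own file and uses the same UUID both in the transaction metadata and in the file name. This is only an illustration: the function name write_txn, the year/month/day directory layout, the date prefix in the file name, the .txn suffix and the account names are assumptions, and the "# uuid:" metadata line is assumed to follow the Tackler journal format. The UUID is passed in as an argument, so any generator (for example uuidgen) can supply it.

```shell
#!/bin/sh
# Hypothetical helper: store one transaction in its own file,
# named after the transaction UUID.
# Usage: write_txn <base-dir> <YYYY-MM-DD> <uuid> <description>
write_txn() {
    base_dir=$1
    txn_date=$2
    txn_uuid=$3
    txn_desc=$4

    # Split the ISO date into year, month and day
    year=${txn_date%%-*}
    rest=${txn_date#*-}
    month=${rest%%-*}
    day=${rest#*-}

    shard_dir="${base_dir}/${year}/${month}/${day}"
    mkdir -p "${shard_dir}" || return 1

    txn_file="${shard_dir}/${txn_date}-${txn_uuid}.txn"
    # Account names and amount below are placeholders
    cat > "${txn_file}" <<EOF
${txn_date} '${txn_desc}
   # uuid: ${txn_uuid}
   Expenses:Example  10.00
   Assets:Cash

EOF
    # Print the path of the created file
    printf '%s\n' "${txn_file}"
}
```

With this layout, a UUID taken from the Register Report can be mapped back to its file with a plain file name search, for example: find txns -name "*<uuid>*".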

Sharding Schemes

The two most common shard schemes are time based and topic based sharding.

Examples of time based shards:

  • year/month (e.g. txns/2019/01/)

  • year/month/day (e.g. txns/2019/01/31)

  • iso-year/iso-week (e.g. txns/2019/W10)

  • iso-year/iso-week/iso-week-date (e.g. Monday 2017-01-02 → txns/2017/W01/1)

Examples of topic based shards by customer:

  • txns/Customers/ACME

  • txns/Customers/Initech
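The time based layouts listed above can be derived directly from a transaction date. The following is a minimal sketch, assuming GNU date (for the -d option) and a hypothetical helper name shard_path with made-up scheme names; it is not part of Tackler.

```shell
#!/bin/sh
# Hypothetical helper: print the shard directory for a transaction date.
# Usage: shard_path <scheme> <YYYY-MM-DD>
# Assumes GNU date, which supports the "-d" option.
shard_path() {
    scheme=$1
    txn_date=$2
    case "${scheme}" in
        month)     fmt='%Y/%m' ;;       # year/month
        day)       fmt='%Y/%m/%d' ;;    # year/month/day
        week)      fmt='%G/W%V' ;;      # iso-year/iso-week
        week-date) fmt='%G/W%V/%u' ;;   # iso-year/iso-week/iso-week-date
        *) echo "unknown scheme: ${scheme}" >&2; return 1 ;;
    esac
    # -u keeps parsing and formatting in UTC, independent of local time zone
    printf 'txns/%s\n' "$(date -u -d "${txn_date}" "+${fmt}")"
}
```

For example, shard_path week-date 2017-01-02 prints txns/2017/W01/1, and shard_path month 2019-01-31 prints txns/2019/01.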

Tackler doesn’t care how you shard the txn data, or whether you shard it at all. However, sharding makes a lot of sense with the Git storage backend and when there is a lot of data. If transactions are generated automatically, it’s recommended to use the single transaction - single file model and an applicable shard scheme to store the journal.

Regardless of the sharding scheme used, it is possible to group txns with different group-by operators in the Balance Group report.

Subset of Transaction Data by Shard

A subset of transactions can be selected either with Transaction Filters or with shards.

The major difference is that with Transaction Filters all data is first parsed and only then filtered. With a sharding scheme, the "filtering" happens before the journal files are even parsed. On the other hand, sharding lacks all the fancy filtering options.

File scanning starts from the top level directory identified by the input.fs.dir setting.
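Because only the directory tree below input.fs.dir is scanned, it can be useful to preview how many journal files a shard contains before running a report. A minimal sketch with find, assuming that journal files use the .txn suffix as in the examples of this document; the helper name count_shard_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count journal files below a shard directory.
# Assumes journal files end with ".txn".
# Usage: count_shard_files <shard-dir>
count_shard_files() {
    # tr strips the padding some wc implementations add
    find "$1" -type f -name '*.txn' | wc -l | tr -d ' '
}
```

Comparing count_shard_files txns with count_shard_files txns/2019/01 gives a rough idea of how much parsing work a month shard avoids.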

From a performance point of view, sharding probably becomes beneficial only after tens or hundreds of thousands of transactions. This is heavily affected by the operating system, filesystem and hardware used. See Performance Testing for further details.

Example of Month Based Sharding

With data sharding it is very straightforward to generate reports from only a selected set of accounting data. For example, with shards based on month it is possible to generate monthly reports with the following piece of shell script:

report_year=$1
report_month=$2
# Drop year and month so that "$@" holds only the extra tackler options
shift 2

tackler \
   --config journal.toml \
   --input.fs.dir="txns/${report_year}/${report_month}" \
   "$@"

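The same idea extends to a whole year by looping over the month shards. The sketch below is an illustration only: the function name report_year_by_month and the TACKLER override variable are assumptions, and it expects to be run from the directory that contains journal.toml and txns/. It uses only the --config and --input.fs.dir options shown in this document.

```shell
#!/bin/sh
# Hypothetical helper: run one report per month shard of a year.
# Usage: report_year_by_month <year> [extra tackler options...]
# TACKLER can be set to override the command, e.g. TACKLER=echo for a dry run.
report_year_by_month() {
    report_year=$1
    shift
    for report_month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        shard="txns/${report_year}/${report_month}"
        # Skip months without a shard directory
        [ -d "${shard}" ] || continue
        "${TACKLER:-tackler}" \
            --config journal.toml \
            --input.fs.dir="${shard}" \
            "$@"
    done
}
```

Setting TACKLER=echo before calling report_year_by_month 2019 prints the commands that would be run, one line per existing month shard, without executing them.
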
How to Test Shard Schemes

A tool called pta-generator can be used to generate test transactions with different sharding schemes.

The pta-generator is used to generate test data for Tackler’s performance testing, and it can generate journals from 10 to one million transactions.

PTA-Generator supports the following sharding schemes:

single

The journal is one single file: txns/filename.txn

month

The journal is diced into 12 shards by month based on transaction time: txns/2024/05/filename.txn

txn

The journal is diced into single transactions, i.e. each transaction is in its own file: txns/2024/05/15/filename.txn

PTA-Generator Installation

See the pta-generator repository for details, but installation can be done with:

cargo install --locked pta-generator

Generate Journals for Shard Test

Generate test data for the shard demo:

pta-generator audit \
   --path test \
   --shard-type month \
   --set-size 1e3

Generate reports based on a partial journal, here the transactions from May 2024:

tackler \
   --config test/audit/set-1e3-month.toml \
   --input.fs.dir txns/2024/05/