Transaction Data Sharding
Transaction data sharding means that the accounting journal can be split into multiple files.
The benefit of using a shard scheme for the accounting journal is more logical data management, especially when there are many transactions in the journal, when multiple parties edit the journal, or when the journal is generated by an automated system.
Tackler supports journal sharding with both Filesystem and Git backends.
It is also possible to store each transaction in its own file; this is the so-called "single transaction - single file" mode. This sharding mode is used by the performance tests, and it is recommended if transaction data is generated by an automated system.
With "single transaction - single file" mode it is also recommended to use UUIDs in the transaction metadata and to use the same UUID as part of the file name.
The transaction UUID is printed in the Register Report, so using UUIDs with transactions makes it easier to find the actual transaction file, should there be any need to do so.
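As a minimal sketch of the lookup described above: if the UUID is part of the file name, the file of a transaction seen in a report can be located with a plain `find`. The helper name `find_txn_file` and the `txns` directory in the usage line are assumptions for illustration only; this is not a Tackler command.

```shell
# Hypothetical helper: locate the journal file of a transaction by its UUID.
# Assumes the UUID is part of the file name, as recommended above.
# $1 = journal root directory, $2 = transaction UUID
find_txn_file() {
    find "$1" -type f -name "*$2*"
}

# Usage (the UUID is whatever the Register Report printed):
#   find_txn_file txns "<uuid-from-register-report>"
```

Because the match is done on file names only, no journal file has to be opened or parsed to find the transaction.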
Sharding Schemes
The two most common shard schemes are time-based and topic-based sharding.
Example of time based shards:
- year/month (e.g. txns/2019/01/)
- year/month/day (e.g. txns/2019/01/31)
- iso-year/iso-week (e.g. txns/2019/W10)
- iso-year/iso-week/iso-week-date (e.g. Monday 2017-01-02 → txns/2017/W01/1)
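The time-based layouts above can be derived from a transaction date. The following is a sketch only: the function name `shard_dir` and its scheme names are hypothetical, and it assumes GNU `date` (the `-d` option), which is not available on every platform.

```shell
# Hypothetical helper: print the shard directory for a date under each scheme.
# Assumes GNU date ("-d" option).
# %G = ISO year, %V = ISO week number, %u = ISO weekday (1 = Monday).
# $1 = scheme (month | day | week | weekdate), $2 = date (YYYY-MM-DD)
shard_dir() {
    case "$1" in
        month)    date -d "$2" +txns/%Y/%m ;;
        day)      date -d "$2" +txns/%Y/%m/%d ;;
        week)     date -d "$2" +txns/%G/W%V ;;
        weekdate) date -d "$2" +txns/%G/W%V/%u ;;
        *)        echo "unknown scheme: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   shard_dir weekdate 2017-01-02    # prints txns/2017/W01/1
```

The week-based schemes use the ISO year (`%G`) rather than the calendar year (`%Y`), because the two differ around New Year: Sunday 2017-01-01 belongs to ISO week 2016/W52.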
Example of topic based shards by customers:
- txns/Customers/ACME
- txns/Customers/Initech
Tackler doesn’t care how, or whether, you shard transaction data. However, sharding makes a lot of sense with the Git storage backend and when there is a lot of data. If transactions are generated automatically, it’s recommended to use the single transaction - single file model together with a suitable shard scheme to store the journal.
Regardless of the sharding scheme used, it is possible to group transactions with the different group-by operators of the Balance Group report.
Subset of Transaction Data by Shard
A subset of transactions can be selected either with Transaction Filters or with shards.
The major difference is that with Transaction Filters all data is first parsed and only then filtered. With a sharding scheme, the "filtering" happens before the journal files are even parsed. On the other hand, sharding has none of the richer selection options that Transaction Filters provide.
File scanning starts from the top-level directory identified by the input.fs.dir setting.
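To get a feel for how much a shard narrows the input, the journal files below a directory can be counted before running any report. This is a rough sketch: `count_txn_files` is a hypothetical helper, and it assumes that journal files use the `.txn` suffix seen in the examples on this page. Which files Tackler actually picks up depends on its own configuration.

```shell
# Hypothetical helper: count journal files below a directory.
# Assumes journal files use the ".txn" suffix.
# $1 = directory to scan
count_txn_files() {
    find "$1" -type f -name '*.txn' | wc -l | tr -d ' '
}

# Usage: compare the whole journal with a single month shard.
#   count_txn_files txns
#   count_txn_files txns/2019/01
```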
From a performance point of view, sharding probably becomes beneficial only after tens or hundreds of thousands of transactions. This is heavily affected by the operating system, the filesystem and the hardware used. See Performance Testing for further details.
Example of month based sharding
With data sharding it is very straightforward to generate reports from only a selected set of accounting data. For example, with month-based shards it is possible to generate monthly reports with the following piece of shell script:
report_year=$1
report_month=$2
# Drop year and month so that "$@" holds only the extra arguments
shift 2
tackler \
--config journal.toml \
--input.fs.dir="txns/${report_year}/${report_month}" \
"$@"
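Building on the script above, reports for a whole year can be produced by looping over the month shards that actually exist. This is a sketch under assumptions: `month_shards` is a hypothetical helper, and `month-report.sh` is a made-up file name for the script above.

```shell
# Hypothetical helper: list the month shards that exist for a year.
# $1 = journal root (e.g. txns), $2 = year
month_shards() {
    for m in 01 02 03 04 05 06 07 08 09 10 11 12; do
        if [ -d "$1/$2/$m" ]; then
            echo "$m"
        fi
    done
}

# Usage (month-report.sh is a hypothetical name for the script above):
#   for m in $(month_shards txns 2019); do
#       ./month-report.sh 2019 "$m"
#   done
```

Checking for the directory first avoids running a report against a month that has no shard.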
How to Test Shard Schemes
A tool called pta-generator can be used to generate test transactions with different sharding schemes.
The pta-generator is used to generate test data for Tackler’s performance testing, and it can generate journals of 10 to one million transactions.
PTA-Generator supports the following sharding schemes:
- single: The journal is one single file: txns/filename.txn
- month: The journal is diced into 12 shards by month, based on transaction time: txns/2024/05/filename.txn
- txn: The journal is diced into single transactions, i.e. each transaction is in its own file: txns/2024/05/15/filename.txn
PTA-Generator Installation
See the pta-generator repository for details; installation can be done with:
cargo install --locked pta-generator