-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up DataWarehouseExport.generate! #564
Speed up DataWarehouseExport.generate! #564
Conversation
copied from the Trello ticket (https://dxw.zendesk.com/agent/tickets/10463): Yesterday when reviewing the changes to
Timur confirms that this value is actually needed in the report which we generate here. However, this need to ‘cached’ the invoice total has been previously anticipated (thanks again Russell) and looking at: and 95bc038#diff-01992bb902ba0c51605767b8c48e0288 we see that the value is computed on ingestion and stored as
|
Background It was proposed that the very slow daily `DataWarehouseExport.generate!` could be improved by adding indexes. Back in July 2018 Tekin identified that `Submissions#created_at` should be indexed (#23) as it started to be used in `Task#latest_submission` (f46dd46). However `submissions.created_at` is not used in any of the four 'extractions' which make up the daily data warehouse export. But there are some other unindexed fields involved in extractions, as follows: `Export::Tasks::Extract` -> uses `Task#updated_at` (unindexed) `Export::Submissions::Extract` -> uses `Submission#updated_at` (unindexed) `Export::Invoices::Extract` -> uses `SubmissionEntry#updated_at` (unindexed) `Export::Contracts::Extract` -> uses `SubmissionEntry#updated_at` (unindexed) Accordingly there are some PRs coming through which: - 559: Add index to `submissions.created_at` - 560: Add indexes to `tasks.updated_at`, `submissions.updated_at`, `submission_entries.updated_at` - 561: Increase `maintenance_work_mem` setting for Postgres to allow these indexes to be added This commit improves the speed of one of the 4 extractions: `Export::Submissions::Extract`. It was identified (thanks Russell Garner!) that the subquery joining 'invoices' was a major bottleneck. In local tests, removing the join reduced a sample query from 26s to 0.4s. In order to properly remove this element from the query: - it has been recognised that `entry_count` (the count of submission entries of the type 'invoice') is not needed as users derive this information using other tooling in the 'data warehouse' - the `total_management_charge` projection could be removed, as since d7db363 this value has been precomputed and stored on ingestion as `submissions.management_charge_total`. This field was already being returned as part of the top level selection `SELECT submissions.*`. - the `invoice_value` projection (`invoices.total_value`) is not required as again this information is available in the 'data warehouse'. - now that `invoice_entry_count` is no longer available it is now necessary find a new way to determine the `submission_type` ('no_business' or 'file'). It's been agreed that the presence (or not) of `submission_file_type` is sufficient to know if there’s a ‘file’ to be had or whether on the other hand it's a case of ‘no business’. As the structure of the exports (the headings or columns) has changed we've taken the opportunity to describe both: - the expected CSV headers more clearly as a vertical list, and - the values of the example row more closely by ensuring that the expected values as well as being present are also in the expected column.
fc5b2a6
to
fd9c337
Compare
As per Ed's comment #564 (comment) I have re-added the total value of the submission to the export
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lozette good stuff!
I've left a small comment, suggesting an additional assertion to verify that the TotalSpend
value is included in the report.
I can't actually approve as GitHub considers me the author...
|
||
expect(submission_record.fetch('SubmissionID')) | ||
.to eq(submission.id) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be prudent to verify that expected value of TotalSpend
is actually included in the report? e.g.
expect(submission_record.fetch('TotalSpend'))
.to eq(123.45)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I missed that test!
As per Ed's comment #564 (comment) I have re-added the total value of the submission to the export
ab43327
to
a6698e4
Compare
As per Ed's comment #564 (comment) I have re-added the total value of the submission to the export
a6698e4
to
1da5955
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving @edavey 's work and Ed has approved mine!
Background
It was proposed that the very slow daily
DataWarehouseExport.generate!
could be improved by adding indexes. Back in July 2018 Tekin identified
that
Submissions#created_at
should be indexed(#23) as it
started to be used in
Task#latest_submission
(f46dd46).
However
submissions.created_at
is not used in any of the four'extractions' which make up the daily data warehouse export. But
there are some other unindexed fields involved in extractions, as follows:
Export::Tasks::Extract
-> usesTask#updated_at
(unindexed)Export::Submissions::Extract
-> usesSubmission#updated_at
(unindexed)Export::Invoices::Extract
-> usesSubmissionEntry#updated_at
(unindexed)Export::Contracts::Extract
-> usesSubmissionEntry#updated_at
(unindexed)Accordingly there are some PRs coming through which:
submissions.created_at
tasks.updated_at
,submissions.updated_at
,submission_entries.updated_at
maintenance_work_mem
setting for Postgres toallow these indexes to be added
This commit
This commit improves the speed of one of the 4 extractions:
Export::Submissions::Extract
.It was identified (thanks Russell Garner!) that the subquery joining
'invoices' was a major bottleneck. In local tests, removing the join
reduced a sample query from 26s to 0.4s. In order to properly remove
this element from the query:
it has been recognised that
entry_count
(the count of submissionentries of the type 'invoice') is not needed as users derive this
information using other tooling in the 'data warehouse'
the
total_management_charge
projection could be removed,as since
d7db363
this value has been precomputed and stored on ingestion as
submissions.management_charge_total
. This field was alreadybeing returned as part of the top level selection
SELECT submissions.*
.the
invoice_value
projection (invoices.total_value
) isnot required as again this information is available in the
'data warehouse'.
now that
invoice_entry_count
is no longer available it isnow necessary find a new way to determine the
submission_type
('no_business' or 'file'). It's been agreed that the presence
(or not) of
submission_file_type
is sufficient to know ifthere’s a ‘file’ to be had or whether on the other hand it's
a case of ‘no business’.
As the structure of the exports (the headings or columns) has
changed we've taken the opportunity to describe both:
the expected CSV headers more clearly as a vertical list,
and
the values of the example row more closely by ensuring
that the expected values as well as being present are
also in the expected column.