Speed up DataWarehouseExport.generate! #564

edavey · 2019-12-06T21:03:19Z

Background

It was proposed that the very slow daily DataWarehouseExport.generate!
could be improved by adding indexes. Back in July 2018 Tekin identified
that Submissions#created_at should be indexed
(#23) as it
started to be used in Task#latest_submission
(f46dd46).

However submissions.created_at is not used in any of the four
'extractions' which make up the daily data warehouse export. But
there are some other unindexed fields involved in extractions, as follows:

Export::Tasks::Extract -> uses Task#updated_at (unindexed)
Export::Submissions::Extract -> uses Submission#updated_at (unindexed)
Export::Invoices::Extract -> uses SubmissionEntry#updated_at (unindexed)
Export::Contracts::Extract -> uses SubmissionEntry#updated_at (unindexed)

Accordingly there are some PRs coming through:

- 559: Add index to submissions.created_at
- 560: Add indexes to tasks.updated_at, submissions.updated_at,
  submission_entries.updated_at
- 561: Increase maintenance_work_mem setting for Postgres to
  allow these indexes to be added

This commit

This commit improves the speed of one of the 4 extractions:
Export::Submissions::Extract.

It was identified (thanks Russell Garner!) that the subquery joining
'invoices' was a major bottleneck. In local tests, removing the join
reduced a sample query from 26s to 0.4s. In order to properly remove
this element from the query:

- it has been recognised that entry_count (the count of submission
  entries of the type 'invoice') is not needed, as users derive this
  information using other tooling in the 'data warehouse'
- the total_management_charge projection could be removed, as since
  d7db363 this value has been precomputed and stored on ingestion as
  submissions.management_charge_total. This field was already
  being returned as part of the top level selection
  SELECT submissions.*.
- the invoice_value projection (invoices.total_value) is
  not required as again this information is available in the
  'data warehouse'.
- now that invoice_entry_count is no longer available it is
  necessary to find a new way to determine the submission_type
  ('no_business' or 'file'). It's been agreed that the presence
  (or not) of submission_file_type is sufficient to know if
  there’s a ‘file’ to be had or whether on the other hand it's
  a case of ‘no business’.
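The rule agreed in the last point can be sketched in plain Ruby. This is an illustration only: the helper name submission_type_for is hypothetical, the real extract may well make this decision in SQL rather than in Ruby, and treating a blank string the same as a missing value is an assumption.

```ruby
# Hypothetical helper, not taken from the codebase: derive the export's
# submission_type from whether a submission_file_type is present.
def submission_type_for(submission_file_type)
  # Assumption: nil and blank (empty or whitespace-only) both mean
  # that no file was attached to the submission.
  if submission_file_type.to_s.strip.empty?
    'no_business'
  else
    'file'
  end
end

puts submission_type_for('xlsx')
puts submission_type_for(nil)
```

The point of the sketch is that the decision needs only a column already on submissions, so no join against the invoice entries is required.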

As the structure of the exports (the headings or columns) has
changed we've taken the opportunity to describe both:

- the expected CSV headers more clearly as a vertical list,
  and
- the values of the example row more closely by ensuring
  that the expected values as well as being present are
  also in the expected column.
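As a rough illustration of the two points above, the sketch below lists headers vertically and checks each value against its own column by pairing headers with values. It is not the project's spec: SubmissionID and TotalSpend appear elsewhere in this PR, but SubmissionType, the sample values, and the helper record_from are assumptions made for the example.

```ruby
# Headers written one per line, so that a change to the export's
# columns shows up as a one-line diff. SubmissionType is a
# hypothetical header name used only for illustration.
EXPECTED_HEADERS = %w[
  SubmissionID
  SubmissionType
  TotalSpend
].freeze

# Hypothetical helper: pair each header with the value in the same
# position, so a value can be looked up by its column name.
def record_from(headers, values)
  raise ArgumentError, 'header/value count mismatch' unless headers.size == values.size

  headers.zip(values).to_h
end

# Made-up example row, in the same order as the headers above.
example_values = ['abc-123', 'file', '123.45']
record = record_from(EXPECTED_HEADERS, example_values)

# Fetching by header name fails loudly (KeyError) if a column is missing,
# and comparing the fetched value checks it sits in the right column.
puts record.fetch('SubmissionType')
```

Checking values by header name rather than by mere presence in the row catches the case where two columns are swapped.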

edavey · 2019-12-11T08:06:20Z

copied from the Trello ticket (https://dxw.zendesk.com/agent/tickets/10463):

Yesterday when reviewing the changes to Export::Submissions::Extract we agreed that a tweak is needed. Contrary to my note that:

the invoice_value projection (invoices.total_value) is
not required as again this information is available in the
‘data warehouse’.

Timur confirms that this value is actually needed in the report which we generate here.

However, this need to ‘cache’ the invoice total has been previously anticipated (thanks again Russell) and looking at:

72bd881

and

95bc038#diff-01992bb902ba0c51605767b8c48e0288

we see that the value is computed on ingestion and stored as submissions.invoice_total so most of the work in restoring this value from a ‘cached’ value is done. We do need to:

- add the field (submissions.total_value) back into the top-level ‘select’ statement in app/models/export/submissions/extract.rb
- add the field back into the exported report in app/models/export/submissions/row.rb

As per Ed's comment #564 (comment) I have re-added the total value of the submission to the export

edavey

@lozette good stuff!

I've left a small comment, suggesting an additional assertion to verify that the TotalSpend value is included in the report.

I can't actually approve as GitHub considers me the author...

edavey · 2020-01-06T15:05:07Z

spec/models/export/relation_spec.rb

+
+        expect(submission_record.fetch('SubmissionID'))
+          .to eq(submission.id)
+


Would it be prudent to verify that the expected value of TotalSpend is actually included in the report? e.g.

        expect(submission_record.fetch('TotalSpend'))
          .to eq(123.45)

?

Good call, I missed that test!

lozette

Approving @edavey's work and Ed has approved mine!

lozette force-pushed the feature/1162-improve-performance-of-data-warehouse-export branch from fc5b2a6 to fd9c337 Compare January 3, 2020 13:48

lozette pushed a commit that referenced this pull request Jan 6, 2020

(Re-)Add Submission total value to the report export CSV

ab43327

lozette changed the title ~~WIP: Speed up DataWarehouseExport.generate!~~ Speed up DataWarehouseExport.generate! Jan 6, 2020

lozette pushed a commit that referenced this pull request Jan 6, 2020

(Re-)Add Submission total value to the report export CSV

a6698e4

lozette force-pushed the feature/1162-improve-performance-of-data-warehouse-export branch from ab43327 to a6698e4 Compare January 6, 2020 15:26

(Re-)Add Submission total value to the report export CSV

1da5955

lozette force-pushed the feature/1162-improve-performance-of-data-warehouse-export branch from a6698e4 to 1da5955 Compare January 6, 2020 15:31

lozette approved these changes Jan 6, 2020

lozette merged commit ce3dbe9 into develop Jan 6, 2020

lozette deleted the feature/1162-improve-performance-of-data-warehouse-export branch January 6, 2020 15:35
