Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up DataWarehouseExport.generate! #564

Merged

Conversation

edavey
Copy link
Contributor

@edavey edavey commented Dec 6, 2019

Background

It was proposed that the very slow daily DataWarehouseExport.generate!
could be improved by adding indexes. Back in July 2018 Tekin identified
that Submissions#created_at should be indexed
(#23) as it
started to be used in Task#latest_submission
(f46dd46).

However submissions.created_at is not used in any of the four
'extractions' which make up the daily data warehouse export. But
there are some other unindexed fields involved in extractions, as follows:

Export::Tasks::Extract -> uses Task#updated_at (unindexed)
Export::Submissions::Extract -> uses Submission#updated_at (unindexed)
Export::Invoices::Extract -> uses SubmissionEntry#updated_at (unindexed)
Export::Contracts::Extract -> uses SubmissionEntry#updated_at (unindexed)

Accordingly there are some PRs coming through which:

  • 559: Add index to submissions.created_at
  • 560: Add indexes to tasks.updated_at, submissions.updated_at,
    submission_entries.updated_at
  • 561: Increase maintenance_work_mem setting for Postgres to
    allow these indexes to be added

This commit

This commit improves the speed of one of the 4 extractions:
Export::Submissions::Extract.

It was identified (thanks Russell Garner!) that the subquery joining
'invoices' was a major bottleneck. In local tests, removing the join
reduced a sample query from 26s to 0.4s. In order to properly remove
this element from the query:

  • it has been recognised that entry_count (the count of submission
    entries of the type 'invoice') is not needed as users derive this
    information using other tooling in the 'data warehouse'

  • the total_management_charge projection could be removed,
    as since
    d7db363
    this value has been precomputed and stored on ingestion as
    submissions.management_charge_total. This field was already
    being returned as part of the top level selection
    SELECT submissions.*.

  • the invoice_value projection (invoices.total_value) is
    not required as again this information is available in the
    'data warehouse'.

  • now that invoice_entry_count is no longer available it is
    now necessary find a new way to determine the submission_type
    ('no_business' or 'file'). It's been agreed that the presence
    (or not) of submission_file_type is sufficient to know if
    there’s a ‘file’ to be had or whether on the other hand it's
    a case of ‘no business’.

As the structure of the exports (the headings or columns) has
changed we've taken the opportunity to describe both:

  • the expected CSV headers more clearly as a vertical list,
    and

  • the values of the example row more closely by ensuring
    that the expected values as well as being present are
    also in the expected column.

@edavey
Copy link
Contributor Author

edavey commented Dec 11, 2019

copied from the Trello ticket (https://dxw.zendesk.com/agent/tickets/10463):

Yesterday when reviewing the changes to Export::Submissions::Extract we agreed that a tweak is needed. Contrary to my note that:

  • the invoice_value projection (invoices.total_value) is
    not required as again this information is available in the
    ‘data warehouse’.

Timur confirms that this value is actually needed in the report which we generate here.

However, this need to ‘cached’ the invoice total has been previously anticipated (thanks again Russell) and looking at:

72bd881

and

95bc038#diff-01992bb902ba0c51605767b8c48e0288

we see that the value is computed on ingestion and stored as submissions.invoice_total so most of the work in restoring this value from a ‘cached’ value is done. We do need to:

  • add the field (submissions.total_value) back in to the top-level ’select’ statement in app/models/export/submissions/extract.rb
  • add the field back into the exported report app/models/export/submissions/row.rb

Background

It was proposed that the very slow daily `DataWarehouseExport.generate!`
could be improved by adding indexes. Back in July 2018 Tekin identified
that `Submissions#created_at` should be indexed
(#23) as it
started to be used in `Task#latest_submission`
(f46dd46).

However `submissions.created_at` is not used in any of the four
'extractions' which make up the daily data warehouse export. But
there are some other unindexed fields involved in extractions, as follows:

`Export::Tasks::Extract` -> uses `Task#updated_at` (unindexed)
`Export::Submissions::Extract` -> uses `Submission#updated_at` (unindexed)
`Export::Invoices::Extract` -> uses `SubmissionEntry#updated_at` (unindexed)
`Export::Contracts::Extract` -> uses `SubmissionEntry#updated_at` (unindexed)

Accordingly there are some PRs coming through which:

- 559: Add index to `submissions.created_at`
- 560: Add indexes to `tasks.updated_at`, `submissions.updated_at`,
       `submission_entries.updated_at`
- 561: Increase `maintenance_work_mem` setting for Postgres to
       allow these indexes to be added

This commit improves the speed of one of the 4 extractions:
`Export::Submissions::Extract`.

It was identified (thanks Russell Garner!) that the subquery joining
'invoices' was a major bottleneck. In local tests, removing the join
reduced a sample query from 26s to 0.4s. In order to properly remove
this element from the query:

- it has been recognised that `entry_count` (the count of submission
entries of the type 'invoice') is not needed as users derive this
information using other tooling in the 'data warehouse'

- the `total_management_charge` projection could be removed,
as since
d7db363
this value has been precomputed and stored on ingestion as
`submissions.management_charge_total`. This field was already
being returned as part of the top level selection
`SELECT submissions.*`.

- the `invoice_value` projection (`invoices.total_value`) is
not required as again this information is available in the
'data warehouse'.

- now that `invoice_entry_count` is no longer available it is
now necessary find a new way to determine the `submission_type`
('no_business' or 'file'). It's been agreed that the presence
(or not) of `submission_file_type`  is sufficient to know if
there’s a ‘file’ to be had or whether on the other hand it's
a case of ‘no business’.

As the structure of the exports (the headings or columns) has
changed we've taken the opportunity to describe both:

- the expected CSV headers more clearly as a vertical list,
 and

- the values of the example row more closely by ensuring
  that the expected values as well as being present are
  also in the expected column.
@lozette lozette force-pushed the feature/1162-improve-performance-of-data-warehouse-export branch from fc5b2a6 to fd9c337 Compare January 3, 2020 13:48
lozette pushed a commit that referenced this pull request Jan 6, 2020
As per Ed's comment #564 (comment)
I have re-added the total value of the submission to the export
@lozette lozette changed the title WIP: Speed up DataWarehouseExport.generate! Speed up DataWarehouseExport.generate! Jan 6, 2020
Copy link
Contributor Author

@edavey edavey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lozette good stuff!

I've left a small comment, suggesting an additional assertion to verify that the TotalSpend value is included in the report.

I can't actually approve as GitHub considers me the author...


expect(submission_record.fetch('SubmissionID'))
.to eq(submission.id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be prudent to verify that expected value of TotalSpend is actually included in the report? e.g.

expect(submission_record.fetch('TotalSpend'))
  .to eq(123.45)

?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I missed that test!

lozette pushed a commit that referenced this pull request Jan 6, 2020
As per Ed's comment #564 (comment)
I have re-added the total value of the submission to the export
@lozette lozette force-pushed the feature/1162-improve-performance-of-data-warehouse-export branch from ab43327 to a6698e4 Compare January 6, 2020 15:26
As per Ed's comment #564 (comment)
I have re-added the total value of the submission to the export
@lozette lozette force-pushed the feature/1162-improve-performance-of-data-warehouse-export branch from a6698e4 to 1da5955 Compare January 6, 2020 15:31
Copy link

@lozette lozette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving @edavey 's work and Ed has approved mine!

@lozette lozette merged commit ce3dbe9 into develop Jan 6, 2020
@lozette lozette deleted the feature/1162-improve-performance-of-data-warehouse-export branch January 6, 2020 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants