Database cleanup pipeline #964

Draft
nwillhoft wants to merge 190 commits into base: release/mvp

@nwillhoft nwillhoft commented Oct 10, 2024

[DRAFT] Nextflow pipeline to clean up databases.

JIRA ticket: https://www.ebi.ac.uk/panda/jira/browse/ENSCORESW-4404

Description

A Nextflow pipeline that exports a database's SQL to file and stores it. The source database can optionally be dropped afterwards to free up storage space.

Use case

If a db host contains old or unused dbs, this pipeline can be used to dump their SQL, put the dump files in a convenient place and then remove the dbs from the host.
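As a rough illustration of the ordering described above (dump, store, and only then drop), here is a minimal shell sketch for a single db. It is not the pipeline's actual implementation: `archive_db`, `DRY_RUN` and the argument order are hypothetical names made up for this example, and it assumes MySQL credentials are picked up from the user's client config rather than passed on the command line. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: archive_db, DRY_RUN and the argument order are hypothetical
# and are not part of the db_cleanup pipeline.
set -eu

# Usage: archive_db HOST PORT USER DB OUT_DIR [DROP_SOURCE_DB]
# The body runs in a subshell so its variables do not leak to the caller.
archive_db() (
    host="$1"; port="$2"; user="$3"; db="$4"; out_dir="$5"
    drop_source_db="${6:-false}"
    dump_file="${out_dir}/${db}.sql"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default): print the commands instead of touching the server.
        echo "mysqldump --host=${host} --port=${port} --user=${user} --result-file=${dump_file} ${db}"
        echo "gzip ${dump_file}"
        if [ "${drop_source_db}" = "true" ]; then
            echo "mysql --host=${host} --port=${port} --user=${user} -e 'DROP DATABASE \`${db}\`'"
        fi
        return 0
    fi

    # Dump, compress and verify the archive; stop here if any step fails,
    # so the source db is never dropped without a readable dump on disk.
    mkdir -p "${out_dir}" &&
        mysqldump --host="${host}" --port="${port}" --user="${user}" \
            --result-file="${dump_file}" "${db}" &&
        gzip -f "${dump_file}" &&
        gzip -t "${dump_file}.gz" || return 1

    if [ "${drop_source_db}" = "true" ]; then
        mysql --host="${host}" --port="${port}" --user="${user}" \
            -e "DROP DATABASE \`${db}\`"
    fi
)
```

For example, `archive_db myhost 3306 me old_db /tmp/dumps true` prints the dump, compress and drop commands without running them; setting `DRY_RUN=0` would execute them. Keeping the drop as the last, opt-in step mirrors the advice below to test with `drop_source_db` set to false first.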

Example of how to run the pipeline:

# set up environment
module load nextflow
salloc -t 04:00:00 --mem=8G -p debug
export NEXTFLOW_DIR=/hps/software/users/ensembl/infrastructure/nwillhoft/ensembl-production/
export DATA_DIR=/hps/nobackup/flicek/ensembl/infrastructure/nwillhoft/
cd $DATA_DIR/
source /hps/software/users/ensembl/ensw/swenv/initenv default
pyenv activate production-tools

# help message
nextflow run /hps/software/users/ensembl/infrastructure/nwillhoft/ensembl-production/nextflow/workflows/db_cleanup/main.nf --help

# set up your config file, update email address on command line and run
# NB. please test first with `drop_source_db` set to false in config (and when happy feel free to change to true)
# NB. please see note below and test on a single db to start with
nextflow run $NEXTFLOW_DIR/nextflow/workflows/db_cleanup/main.nf -N <email>@ebi.ac.uk

Benefits

This pipeline will make it easier to automate the removal of old dbs.

Possible Drawbacks

Not an intended drawback, but running the pipeline on more than 1 db at a time appears to cause a bottleneck in the dbcopy-client processing. This needs further testing, as the pipeline is set up to process everything in parallel to be as efficient as possible. As an example, I tried copying 3 dbs from st6 to core-prod-1 and the copy step alone took over 24 hours, whereas copying 1 db at a time typically takes around an hour or less for this step.

Testing

  • Have you added/modified unit tests to test the changes? Tests so far are with nf-schema to validate parameters
  • If so, do the tests pass? N/A
  • Have you run the entire test suite and no regression was detected? No
  • TravisCI passed on your branch? Python 3.7 build passes. Python 3.8 and Perl builds are erroring. The Perl 5.14 build seems to be erroring due to Perl module installation issues.

Dependencies

If applicable, define what code dependencies were added and/or updated.

The only external code dependency is the nf-schema plugin (plugin/nf-schema), used within Nextflow.

marcoooo and others added 30 commits June 9, 2023 09:05
Recompiled dependencies to use hive default branch.
* Added a failure if gzip fails to complete properly

* Switched backtick gzip to perl package in mysqldump and dumpfile

* bugfix

* revert and fixed the compress calls

* revert and fixed the compress calls

* unpack the arrayref to array

* fixed flat file to pass array reference.

* Removed automatic flow for tsvs

* Readded flow to tsvs and fixed bug with a conditional

* Readded param_required and added optional flow

* Swapped array is empty test for a better one

* removed array test as the files should alwalys be generated

* Swapped array is empty test for a better one and fixed flow

* Modified array ref

---------

Co-authored-by: vinay-ebi <vinay@ebi.ac.uk>
fix data_files path removig the vertebrates folder
Updated download URL for miRBase miRNA.dat file
Updated download URL for miRBase miRNA.dat file
Updated .gitignore + patch DQ forgotten when initially branching
…-fixes-110

# Conflicts:
#	scripts/py/regulation_ftp_symlinks.py
Co-authored-by: Tamara El Naboulsi <ten@codon-login-06.ebi.ac.uk>
…s we actually requested 9.

Added SLURM specifications
@nwillhoft nwillhoft changed the title [DRAFT] Database cleanup pipeline Database cleanup pipeline Feb 4, 2025
@nwillhoft nwillhoft self-assigned this Feb 4, 2025
@nwillhoft nwillhoft marked this pull request as ready for review February 4, 2025 14:43
@nwillhoft nwillhoft requested review from mira13 and dpopleton February 4, 2025 14:44
dpopleton
dpopleton previously approved these changes Feb 4, 2025
@dpopleton dpopleton dismissed their stale review February 4, 2025 15:01

Should have failed it for merging to the wrong branch

@dpopleton dpopleton self-requested a review February 4, 2025 15:01

@dpopleton dpopleton left a comment

This should merge into release/mvp, not main.
We could also merge into main, if that is your intention.

@nwillhoft nwillhoft changed the base branch from main to release/mvp February 4, 2025 15:12
@nwillhoft nwillhoft marked this pull request as draft February 4, 2025 15:13