
Quickstart Hail (Eng)

CloudFormation stack preparation

1. Prepare the AWS credentials and apply them in the terminal. 

export AWS_DEFAULT_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="{ACCESS_KEY}"
export AWS_SECRET_ACCESS_KEY="{SECRET_ACCESS_KEY}"
export AWS_SESSION_TOKEN="{SESSION_TOKEN}"
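Before moving on, it can help to confirm that all four variables are actually exported in the current shell. The sketch below is a minimal check under that assumption; the function name check_aws_env is illustrative and not part of the quickstart.

```shell
# Hypothetical helper: verify that the four variables exported above are set.
# Returns 0 when all of them are non-empty, 1 otherwise.
check_aws_env() {
  local missing=0 name
  for name in AWS_DEFAULT_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
    # printenv prints the value of an exported variable (nothing if it is unset).
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is `check_aws_env && aws sts get-caller-identity`, which should print the account and ARN the credentials resolve to if they are valid.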

2. Create an S3 bucket in the region where you want to launch this CloudFormation stack.

I recommend including your own initials in the bucket name. Bucket names must be unique, so the bucket cannot be created if the name already exists.

image.png

aws s3 mb s3://{bucket name}-{region} --region {region}
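The naming pattern above can be checked locally before calling the CLI. The following is a rough sketch: make_bucket_name is a hypothetical helper, and it enforces only a conservative subset of the S3 naming rules (it rejects dots, for example, even though S3 allows them). It cannot tell whether the name is already taken; only the mb command can.

```shell
# Hypothetical helper: build "<prefix>-<region>" and check it against a
# conservative subset of the S3 naming rules (3-63 characters; lowercase
# letters, digits, and hyphens; starts and ends with a letter or digit).
make_bucket_name() {
  local name="$1-$2"
  if [ "${#name}" -lt 3 ] || [ "${#name}" -gt 63 ]; then
    echo "invalid length: $name" >&2
    return 1
  fi
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  if ! printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9-]*[a-z0-9]$'; then
    echo "invalid characters: $name" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

For example, with the placeholder initials hk: `AWS_BUCKET=$(make_bucket_name hk us-east-1) && aws s3 mb "s3://$AWS_BUCKET" --region us-east-1`.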

Clone this repository, then sync its contents to the S3 bucket you created earlier.

export AWS_BUCKET={bucket name}-{region}
git clone https://github.com/hmkim/quickstart-hail.git
cd quickstart-hail
aws s3 sync . s3://$AWS_BUCKET/quickstart-hail/ --exclude ".git/*"

3. Open the Amazon S3 console and check that the bucket and the quickstart-hail directory exist.

image.png


Run the CloudFormation stack

1. Go to the CloudFormation console.

Screenshot_2024-06-21_at_10.26.55_PM.png

2. Create a new stack. Select With new resources (standard).

Screenshot_2024-06-21_at_10.27.22_PM.png

3. Go to the Amazon S3 console, select hail-launcher.template.yaml in the templates directory you uploaded earlier, and click Copy URL. The path is as follows:

{bucket name} > quickstart-hail > templates > hail-launcher.template.yaml

image.png

When creating a CloudFormation stack, enter this URL and create the stack.

image.png
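If you prefer to compose the template URL rather than copy it, the sketch below prints it from the bucket name and region. It assumes the regional virtual-hosted-style form of S3 object URLs and the upload path used in the sync step; template_url is a hypothetical helper, and the URL shown by Copy URL in the console remains the authoritative value.

```shell
# Hypothetical helper: print the HTTPS URL of the launcher template for a
# bucket ($1) and region ($2), following the path used in the sync step.
template_url() {
  printf 'https://%s.s3.%s.amazonaws.com/quickstart-hail/templates/hail-launcher.template.yaml\n' "$1" "$2"
}
```

For example, `template_url "$AWS_BUCKET" "$AWS_DEFAULT_REGION"` prints the URL for the bucket and region set earlier.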

4. Enter the information needed to create the stack.

Enter a name for the stack.

image.png

Select a VPC, then select one subnet within that VPC. For this exercise, select a public subnet.

Untitled 2.png

Configure the stack to create additional buckets as needed.

Enter the name of the existing bucket where the quickstart-hail folder was uploaded, and check the region.

Untitled 3.png

Untitled 4.png

5. Finally, press the Next button to create the stack.

Untitled 5.png

Untitled 6.png

6. Check stack creation in CloudFormation.

Untitled 7.png

If the top-level stack shows CREATE_COMPLETE and the following portfolio appears in its Outputs, the stack was created correctly.

Untitled 8.png
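The same check can be made from the terminal. The two functions below are a minimal sketch that wraps `aws cloudformation describe-stacks`; the function names are illustrative, and the stack name is whatever you entered when creating the stack.

```shell
# Print the status of the stack named $1, for example CREATE_COMPLETE.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text
}

# Print the value of output key $2 (for example BucketHail) of stack $1.
stack_output() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" --output text
}
```

For example, `stack_status my-hail-stack` (a placeholder name) should print CREATE_COMPLETE once creation has finished. `stack_output` needs to be pointed at whichever stack lists the key you are looking for in its Outputs tab.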

Create an AMI for Hail and VEP

Pre-downloading VEP Data and Storing in Bucket

For VEP, you can pre-download the data and store it in the bucket created or specified through the stack (use the BucketHail value from CloudFormation's Outputs).

image.png

Untitled 10.png

Download VEP data using the wget command:

wget ftp://ftp.ensembl.org/pub/release-112/variation/vep/homo_sapiens_vep_112_GRCh37.tar.gz

Upload the downloaded file to your bucket. Note that the bucket name is the value of the BucketHail key shown in CloudFormation's Outputs:

aws s3 cp homo_sapiens_vep_112_GRCh37.tar.gz s3://{defined Hail S3 bucket}/vep/cache/
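The file name and URL above are built from an Ensembl release number (112) and an assembly name (GRCh37). The sketch below makes that explicit; the helper names are hypothetical, and the pattern is inferred from the single URL shown in this guide, so other releases or assemblies may not follow it exactly.

```shell
# Hypothetical helper: compose the cache file name from a release number ($1)
# and an assembly name ($2).
vep_cache_file() {
  printf 'homo_sapiens_vep_%s_%s.tar.gz\n' "$1" "$2"
}

# Hypothetical helper: compose the download URL for the same release and assembly.
vep_cache_url() {
  printf 'ftp://ftp.ensembl.org/pub/release-%s/variation/vep/%s\n' "$1" "$(vep_cache_file "$1" "$2")"
}

# Usage, mirroring the two commands above:
#   wget "$(vep_cache_url 112 GRCh37)"
#   aws s3 cp "$(vep_cache_file 112 GRCh37)" "s3://{defined Hail S3 bucket}/vep/cache/"
```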

AMI Build

1. Access the CodeBuild console and initiate the build process for each new AMI. Select Start build > Start with overrides.

image.png

image.png

2. In the Environment section, expand Additional configuration and input the required values.

image.png

image.png

HAIL_VERSION 0.2.105
HTSLIB_VERSION 1.20
SAMTOOLS_VERSION 1.20

* For building with hail-vep option (includes VEP installation):

HAIL_VERSION 0.2.105
HTSLIB_VERSION 1.20
SAMTOOLS_VERSION 1.20
VEP_VERSION 107
RODA_BUCKET <value of the BucketHail key in CloudFormation Outputs>
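The same overrides can, in principle, be passed from the terminal with `aws codebuild start-build`. The following is a sketch under assumptions: this guide does not state the CodeBuild project names, so the project name is passed in as an argument, and both function names are illustrative.

```shell
# Hypothetical helper: format one NAME VALUE pair in the shorthand accepted by
# `aws codebuild start-build --environment-variables-override`.
env_override() {
  printf 'name=%s,value=%s,type=PLAINTEXT\n' "$1" "$2"
}

# Start a Hail-only build with the versions listed in this step.
# $1 is the CodeBuild project name, which this guide does not specify.
start_hail_build() {
  aws codebuild start-build --project-name "$1" \
    --environment-variables-override \
      "$(env_override HAIL_VERSION 0.2.105)" \
      "$(env_override HTSLIB_VERSION 1.20)" \
      "$(env_override SAMTOOLS_VERSION 1.20)"
}
```

A build with the hail-vep option would add two more pairs in the same way, one for VEP_VERSION and one for RODA_BUCKET.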

image.png

3. Check the build status in CodeBuild.

Hail (without VEP): The Hail image build completes in about 20 minutes.

image.png

image.png

Hail (VEP): The VEP version build takes approximately 1 hour and 38 minutes to complete.

Screenshot_2024-06-24_at_9.31.00_AM.png

Screenshot_2024-06-24_at_9.31.12_AM.png

You can find the resulting AMI in either the AMIs menu of the Amazon EC2 console or the CodeBuild logs.
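The AMI ID can also be looked up from the terminal. The sketch below lists the AMIs owned by your account and prints the ID of the most recently created one. It assumes the newest AMI in the account is the one just built, which may not hold if other AMIs are being created at the same time; latest_own_ami is a hypothetical name.

```shell
# Print the ID of the most recently created AMI owned by this account.
# CreationDate is an ISO 8601 timestamp, so a plain text sort orders it by time.
latest_own_ami() {
  aws ec2 describe-images --owners self \
    --query 'Images[].[CreationDate,ImageId,Name]' --output text |
    sort | tail -n 1 | awk '{ print $2 }'
}
```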

image.png

EMR Cluster Setup and Jupyter Environment Configuration

EMR Cluster Setup

1. In the CloudFormation service console's Outputs tab, click the portfolio link.

image.png

2. In the AWS Service Catalog portfolio, locate the relevant Product, click the Access tab, then select Grant access.

image.png

3. Add permissions. Search for the user who should have access to the Hail products, select that user, and click Grant access.

image.png

image.png

4. After confirming the access permissions, navigate to Products in the Provisioning menu.

image.png

5. With permissions granted, you should now see 2 Products listed.

image.png

6. Select the Hail EMR Cluster product and click Launch product.

Untitled 19.png

7. Enter the required launch information.

Either enter a name manually or click Generate name.

Untitled 20.png

Specify the Hail AMI you created earlier. You can find the AMI ID in the EC2 service's AMIs section (as previously described).

Untitled 21.png

Input the Cluster name and Hail AMI ID. You can leave all other settings at their default values.

Untitled 22.png

8. Click Launch product at the bottom of the page.

Screenshot_2024-07-08_at_11.47.56_AM.png

SageMaker Notebook Setup

1. Similarly, select Launch product in the Product menu.

image.png

2. Provide a name for your Hail notebook instance. Keep all other settings at their defaults.

Untitled 24.png

3. Click Launch product at the bottom of the page.

Screenshot_2024-07-08_at_11.47.56_AM.png

Note: You can also monitor the product deployment progress through CloudFormation.

image.png

Screenshot_2024-07-08_at_12.08.40_PM.png

GWAS Practice using Hail

1. Launch your notebook. Find the URL in CloudFormation's Outputs tab. Clicking it will automatically connect you to your notebook instance in Amazon SageMaker.

Untitled 26.png

2. Select Open JupyterLab to start the notebook interface.

Untitled 27.png

3. We'll work with two notebooks in this practice session:

  • common-notebooks/plotting-tutorial.ipynb
  • common-notebooks/GWAS-tutorial.ipynb

Screenshot_2024-06-24_at_3.26.55_PM.png

Locate your previously created EMR cluster and update the Cluster Name in the second cell.

To run the notebook cells in one pass, place your cursor in the cell where you want to begin, then run the cells as a batch.
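If you do not remember the cluster name, one option is to list the active clusters from a terminal. This is a minimal sketch around `aws emr list-clusters`; the function name is illustrative.

```shell
# List active EMR clusters, one per line, as tab-separated ID, name, and state.
list_active_clusters() {
  aws emr list-clusters --active \
    --query 'Clusters[].[Id,Name,Status.State]' --output text
}
```

The second column of the output is the name to enter in the notebook cell.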

Untitled 28.png

When the tutorial code runs successfully, you should see results similar to these:

Screenshot_2024-07-08_at_12.21.50_PM.png

Untitled 29.png

Screenshot_2024-07-08_at_12.22.18_PM.png

Additional Information

VEP configuration

In the S3 bucket, select the json file object and click Copy S3 URI.

Untitled 30.png

Example: vep-configuration-GRCh37.json

{
        "command": [
                "/opt/ensembl-vep/vep",
                "--format", "vcf",
                "--dir_plugins", "/opt/vep/plugins",
                "--dir_cache", "/opt/vep/cache",
                "--json",
                "--everything",
                "--allele_number",
                "--no_stats",
                "--cache", "--offline",
                "--minimal",
                "--assembly", "GRCh37",
                "--plugin", "LoF,human_ancestor_fa:/opt/vep/loftee_data/human_ancestor.fa.gz,filter_position:0.05,min_intron_size:15,conservation_file:/opt/vep/loftee_data/phylocsf_gerp.sql,gerp_file:/opt/vep/loftee_data/GERP_scores.final.sorted.txt.gz",
                "-o", "STDOUT"
        ],
        "env": {
                "PERL5LIB": "/opt/vep"
        },
        "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
}

In the vep-tutorial code, replace the value shown below with the S3 object URI you copied above.

Screenshot_2024-07-11_at_11.07.00_AM.png

VEP Plugin Installation

If you need to modify VEP plugin installations (additions, etc.), you'll need to rebuild the AMI. The VEP installation code used to build the AMI is in vep_install.sh. Modify this script and rebuild the AMI as needed.

For customizing Hail, VEP tool installation, and AMI building, refer to these resources:

Dynamically Expanding EMR Cluster EBS (HDFS) Volume

When working with large datasets, you may find the initially configured cluster volume capacity insufficient. You can dynamically expand the EBS volume by following the guidance in this blog post:

https://aws.amazon.com/ko/blogs/big-data/dynamically-scale-up-storage-on-amazon-emr-clusters/