
Hail with Amazon EMR notebook

Notebook setup

  1. In EMR Studio, create a new studio in the same VPC.
  2. Create a new Workspace (notebook) for the studio; once it is ready, a new tab for JupyterHub opens automatically.
  3. Start a new notebook using the python3 kernel.

Installing Hail on the EMR notebook

pip install hail
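
A quick way to confirm the install is to start Hail from the notebook and materialize a tiny simulated dataset. This is a minimal sketch, assuming the python3 kernel, where hl.init() starts a local Spark context if none is supplied:

# Sanity check after `pip install hail` (a sketch, not the official tutorial):
# initialize Hail, generate a small simulated matrix table, and show a few rows.
import hail as hl

hl.init()  # starts Spark and the Hail backend if none is running

mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)
mt.rows().show(5)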

Preparation for using the s3a protocol later (Hadoop-AWS module)

  1. SSH into the primary node (as a sudo user)
  2. Go to the jars directory: cd /home/emr-notebook/.local/lib/python3.9/site-packages/pyspark/jars
  3. Download the two jar files in that directory with the following commands (use curl -o rather than a shell redirect, so the file is written with sudo privileges; a sketch for verifying the result from the notebook follows this list):
    1. sudo curl -sSL "https://search.maven.org/remotecontent?filepath=org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar" -o ./hadoop-aws-3.3.2.jar
    2. sudo curl -sSL "https://search.maven.org/remotecontent?filepath=com/amazonaws/aws-java-sdk-bundle/1.12.99/aws-java-sdk-bundle-1.12.99.jar" -o ./aws-java-sdk-bundle-1.12.99.jar
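
To verify from the notebook that the kernel actually sees the new jars, a small check like the following can help (a sketch; it assumes the kernel imports the pyspark installed under /home/emr-notebook/.local):

# List the downloaded jars in the jars directory of the pyspark package
# that this kernel imports.
import glob
import os
import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(glob.glob(os.path.join(jars_dir, "hadoop-aws-*.jar")))
print(glob.glob(os.path.join(jars_dir, "aws-java-sdk-bundle-*.jar")))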

Installing the dependencies

https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.6/

https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/

Two jars are missing from the Java classpath that the notebook uses. Running the Python shell directly on the cluster does not require this step, because the required dependencies are already present in the Hadoop environment; there you only need to point SPARK_HOME at those jars. The notebook, however, hosts a separate environment for all installed dependencies. In addition, the Hadoop build on the cluster (the AWS distribution) is versioned and packaged slightly differently from what ships under SPARK_HOME: the notebook environment uses the upstream (external) Hadoop client, which cannot connect to S3 on its own.
We therefore need to download the AWS SDK jar and the hadoop-aws jar and place them in the notebook environment's jar collection.
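
Once the jars are in place, Hail can be pointed at s3a paths. A minimal sketch, assuming the jars above are installed and that the EMR instance profile supplies the S3 credentials; the bucket path is a hypothetical placeholder:

# Initialize Hail with s3a support (sketch; bucket path is illustrative).
import hail as hl

hl.init(spark_conf={
    # Route s3a:// URLs through the hadoop-aws S3AFileSystem.
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    # Pick up credentials from the EC2 instance profile on EMR.
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
})

# Hypothetical path; replace with a real bucket and key.
mt = hl.import_vcf("s3a://your-bucket/path/sample.vcf.bgz",
                   reference_genome="GRCh38")
print(mt.count())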

EC2 monitoring of the primary node instance

[Screenshot: EC2 monitoring metrics for the primary node]

EC2 monitoring of the core node instance

[Screenshot: EC2 monitoring metrics for the core nodes]

Download the example notebook

hail-tutorial.zip

 

References