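Task 1: Download the latest human reference genome sequence (FASTA format) and the corresponding annotation file (GFF format) from the Ensembl database.
Task 2: Download the transcriptome sequencing (RNA-seq) data for the human GM12878 cell line (GEO: GSE88583)
https://www.encodeproject.org/experiments/ENCSR843RJV/
and run quality control (FastQC) and filtering (Trim Galore) on it.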
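I. Preparation
(1) Required software
- MobaXterm (MobaXterm_Installer_v25.2.zip): used to connect to the server
- fastqc: quality analysis
- trim_galore: filtering
(2) Required resources
1) Human genome
- Homo_sapiens.GRCh38.114.gtf (GTF annotation file)
- Homo_sapiens.GRCh38.dna.primary_assembly.fa (human genome sequence)
2) GM12878 sequencing data
- ENCLB518OAU
- ENCLB919DEB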
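The two genome files listed above can be fetched from the Ensembl FTP site. The following is a minimal sketch that only builds the download URLs. The release number 114 is taken from the GTF file name above; the `ftp.ensembl.org/pub/release-<N>/...` directory layout and the `.gz` suffix are assumptions that should be checked against the FTP listing, and `ensembl_urls` is a hypothetical helper name.

```python
# Sketch: build Ensembl download URLs for the human genome and annotation.
# Assumption: the usual Ensembl FTP layout (pub/release-<N>/fasta|gtf/...).

BASE = "https://ftp.ensembl.org/pub"


def ensembl_urls(release, assembly="GRCh38"):
    """Return URLs for the primary-assembly FASTA and the GTF annotation."""
    root = f"{BASE}/release-{release}"
    return {
        "fasta": f"{root}/fasta/homo_sapiens/dna/"
                 f"Homo_sapiens.{assembly}.dna.primary_assembly.fa.gz",
        "gtf": f"{root}/gtf/homo_sapiens/"
               f"Homo_sapiens.{assembly}.{release}.gtf.gz",
    }


urls = ensembl_urls(114)
for kind, url in urls.items():
    # Each URL can be handed to wget, followed by gunzip on the result.
    print(kind, url)
```

Each printed URL can then be downloaded on the server with `wget` and unpacked with `gunzip` to obtain the `.fa` and `.gtf` files named above.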
(3) Install Miniconda
1) Installation
    # Download and run the Miniconda installer
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh

    # Add conda to PATH ($HOME is used because ~ does not expand inside double quotes)
    echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
    source ~/.bashrc

    # Confirm that conda is available
    conda --version

    conda create -n myenv
    conda activate myenv
2) Switch to a mirror
    echo 'channels:
      - defaults
    show_channel_urls: true
    default_channels:
      - https://mirror.lzu.edu.cn/anaconda/pkgs/main
      - https://mirror.lzu.edu.cn/anaconda/pkgs/r
      - https://mirror.lzu.edu.cn/anaconda/pkgs/msys2
    custom_channels:
      conda-forge: https://mirror.lzu.edu.cn/anaconda/cloud
      pytorch: https://mirror.lzu.edu.cn/anaconda/cloud
    ' | tee ~/.condarc

    conda config --set custom_channels.bioconda https://mirror.lzu.edu.cn/anaconda/cloud/
3) Install the software with conda
    conda create -n myenv python=3.6  # create the environment; be sure to specify the Python version
    conda activate myenv
    conda install -c bioconda fastqc trim-galore snakemake
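After installation it is worth confirming that the three tools are actually visible in the activated environment. The following is a small sketch using only the Python standard library. `find_tools` is a hypothetical helper name. Note that the conda package is called `trim-galore` while the executable is `trim_galore`.

```python
# Sketch: confirm that the tools installed with conda are visible on PATH.
import shutil

TOOLS = ["fastqc", "trim_galore", "snakemake"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


found = find_tools(TOOLS)
for name, path in found.items():
    print(f"{name}: {path if path else 'NOT FOUND (is myenv activated?)'}")
```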
II. Writing the Snakemake workflow
A Snakemake workflow simplifies the process. The rules for writing Snakemake files are not explained in detail here.
(1) Create the Snakefile (UTF-8 encoded)
    SAMPLES = ["ENCFF824LLV", "ENCFF974EKR"]
    INPUT_DIR = "data"
    FASTQC_OUTPUT_DIR = "results/fastqc"
    TRIMMED_OUTPUT_DIR = "results/trimmed"
    configfile: "config/config.yaml"

    # Final targets: FastQC reports, trimmed reads and trimming reports for every sample
    rule all:
        input:
            expand(f"{FASTQC_OUTPUT_DIR}/{{sample}}_fastqc.html", sample=SAMPLES),
            expand(f"{TRIMMED_OUTPUT_DIR}/{{sample}}_trimmed.fq.gz", sample=SAMPLES),
            expand(f"{TRIMMED_OUTPUT_DIR}/{{sample}}.fastq.gz_trimming_report.txt", sample=SAMPLES)

    # Quality check of the raw reads
    rule fastqc_original:
        input:
            f"{INPUT_DIR}/{{sample}}.fastq.gz"
        output:
            html=f"{FASTQC_OUTPUT_DIR}/{{sample}}_fastqc.html",
            zip=f"{FASTQC_OUTPUT_DIR}/{{sample}}_fastqc.zip"
        shell:
            """
            mkdir -p {FASTQC_OUTPUT_DIR}
            fastqc --outdir {FASTQC_OUTPUT_DIR} {input}
            """

    # Adapter and quality trimming (single-end)
    rule trim_galore:
        input:
            f"{INPUT_DIR}/{{sample}}.fastq.gz"
        output:
            trimmed=f"{TRIMMED_OUTPUT_DIR}/{{sample}}_trimmed.fq.gz",
            # Trim Galore names the report after the full input file name
            report=f"{TRIMMED_OUTPUT_DIR}/{{sample}}.fastq.gz_trimming_report.txt"
        params:
            adapter="CTGTCTCTTATACACATCT"
        threads: 4
        shell:
            """
            mkdir -p {TRIMMED_OUTPUT_DIR}
            trim_galore \
                --gzip \
                --adapter {params.adapter} \
                --length 20 \
                --output_dir {TRIMMED_OUTPUT_DIR} \
                --cores {threads} \
                {input}
            """
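The Snakefile mixes Python f-strings with Snakemake wildcards, which is why `sample` appears in doubled braces. The plain-Python sketch below shows what the f-string leaves behind. `simple_expand` is a hypothetical stand-in for the single-wildcard case of `expand()`, not Snakemake's actual implementation.

```python
# Sketch: why the Snakefile writes {{sample}} inside f-strings.
FASTQC_OUTPUT_DIR = "results/fastqc"
SAMPLES = ["ENCFF824LLV", "ENCFF974EKR"]

# The f-string fills in FASTQC_OUTPUT_DIR and turns {{sample}} into {sample},
# leaving a pattern that Snakemake can treat as a wildcard.
pattern = f"{FASTQC_OUTPUT_DIR}/{{sample}}_fastqc.html"
print(pattern)  # results/fastqc/{sample}_fastqc.html


def simple_expand(pattern, sample):
    """Hypothetical stand-in for expand() with one wildcard: one path per sample."""
    return [pattern.format(sample=s) for s in sample]


targets = simple_expand(pattern, sample=SAMPLES)
for target in targets:
    print(target)
```

With single braces, the f-string would try to substitute a Python variable named `sample` at parse time, and no wildcard would remain for Snakemake to match.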
The Snakefile must follow this format exactly; as in Python, indentation is significant.
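The rules read their inputs from `data/<sample>.fastq.gz`, so the two FASTQ files named in `SAMPLES` have to be placed there before the workflow is run. The sketch below pairs each accession with a download URL and the local path the rules expect. The `/files/<accession>/@@download/` URL pattern is an assumption about the ENCODE portal and should be checked against the experiment page; `fastq_source_and_target` is a hypothetical helper name.

```python
# Sketch: map each sample in SAMPLES to a download URL and the local path
# that the fastqc_original and trim_galore rules read from.
# Assumption: ENCODE serves files at /files/<accession>/@@download/<name>.
SAMPLES = ["ENCFF824LLV", "ENCFF974EKR"]
INPUT_DIR = "data"
ENCODE = "https://www.encodeproject.org"


def fastq_source_and_target(sample):
    """Return (download URL, local path) for one FASTQ accession."""
    url = f"{ENCODE}/files/{sample}/@@download/{sample}.fastq.gz"
    path = f"{INPUT_DIR}/{sample}.fastq.gz"
    return url, path


plan = [fastq_source_and_target(s) for s in SAMPLES]
for url, path in plan:
    # Equivalent shell command: wget -O <path> <url>
    print(f"{url} -> {path}")
```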
(2) Configuration file
    trim_galore:
      cores: 8
    name: rnaseq-pipeline
    channels:
      - bioconda
      - conda-forge
      - defaults
    dependencies:
      - python=3.10
      - fastqc=0.12.1
      - trim-galore=0.6.9
      - snakemake=8.16.0
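Once the `configfile:` directive has loaded this YAML, Snakemake exposes its contents as a dictionary named `config`. The Snakefile hard-codes `threads: 4` in rule `trim_galore`; the sketch below shows how the `cores` value could be looked up instead. A plain dict stands in for the loaded configuration so the snippet runs on its own, and `trim_galore_cores` is a hypothetical helper name.

```python
# Sketch: how values from config/config.yaml are looked up in a Snakefile.
# In a real Snakefile, `config` is filled in by the `configfile:` directive;
# here a plain dict stands in for it.
config = {"trim_galore": {"cores": 8}}

DEFAULT_CORES = 4  # the value hard-coded as `threads: 4` in rule trim_galore


def trim_galore_cores(cfg):
    """Return the configured core count, falling back to the default."""
    return cfg.get("trim_galore", {}).get("cores", DEFAULT_CORES)


print(trim_galore_cores(config))  # 8
print(trim_galore_cores({}))      # 4
```

Inside the Snakefile, the equivalent would be to write `threads: config["trim_galore"]["cores"]` in rule `trim_galore`.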