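Genome assembly: evaluating Nanopore data (Arabidopsis Nanopore reads)

1. Installing the software

Create a conda environment, install nanoQC, NanoPlot and NanoStat, and run the commands below: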

## Nanopore QC software
conda create -n nanoqc
conda activate nanoqc
mamba install -c bioconda nanoqc

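## The NanoPlot author also wrote several filtering and comparison tools: NanoFilt, NanoStat, NanoLyse and NanoComp
## Installing them with pip inside the conda environment is easier; installing with conda kept failing with errors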
pip install NanoPlot ## plot
pip install nanostat ## stat report
pip install NanoFilt ## filter nanopore reads
pip install NanoLyse ## Remove reads mapping to the lambda phage genome from a fastq file.
## Run the statistics report and the plots on the raw reads
nohup NanoStat  --fastq ../CRR302667.fastq.gz -t 10   --tsv  --outdir 01.StatReports -n stat &
nohup NanoPlot -t 10 --fastq ../CRR302667.fastq.gz --plots hex dot kde -o 01.Nanoplot -p Ath -cm Viridis &

NanoStat only computes statistics on the raw Nanopore data. NanoFilt can then be used for the downstream filtering; it filters mainly on:
quality, length and GC content
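
The three filtering criteria can be sketched in a few lines of Python. This is only an illustration of the idea, not NanoFilt's actual code: the function names and the parameters min_q, min_len, min_gc and max_gc are hypothetical, and the mean quality is assumed to be averaged over error probabilities rather than over raw Phred scores.

```python
import math

def mean_quality(qual, offset=33):
    """Mean Phred score of one read, averaged over error probabilities."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in qual]
    return -10 * math.log10(sum(probs) / len(probs))

def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def keep_read(seq, qual, min_q=10.0, min_len=500, min_gc=0.0, max_gc=1.0):
    """Return True if the read passes the length, quality and GC filters."""
    if not seq or len(seq) < min_len:
        return False
    if mean_quality(qual) < min_q:
        return False
    return min_gc <= gc_fraction(seq) <= max_gc

# Toy reads as (sequence, quality string); "5" encodes Q20 and "%" encodes Q4.
reads = [
    ("ACGT" * 200, "5" * 800),  # long enough, high quality
    ("ACGT" * 200, "%" * 800),  # long enough, low quality
    ("ACGT" * 50, "5" * 200),   # high quality, too short
]
kept = [keep_read(seq, qual) for seq, qual in reads]
print(kept)  # [True, False, False]
```

Averaging over error probabilities gives a lower value than a plain arithmetic mean of the Phred scores whenever the per-base qualities vary, because low-quality bases dominate the average.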

2. Using the software

(1) nanoQC

The nanoQC help text is shown below; the main parameter to set is -l:

usage: nanoQC [-h] [-v] [-o OUTDIR] [--rna] [-l MINLEN] fastq
Investigate nucleotide composition and base quality.
positional arguments:
  fastq                 Reads data in fastq.gz format.

options:
  -h, --help            show this help message and exit
  -v, --version         Print version and exit.
  -o OUTDIR, --outdir OUTDIR
                        Specify directory in which output has to be created.
  --rna                 Fastq is from direct RNA-seq and contains U nucleotides.
  -l MINLEN, --minlen MINLEN
                        Filters the reads on a minimal length of the given range. Also plots the given length/2 of the
                        begin and end of the reads.

Commands used:

## nanoQC
## -l sets the minimum read length
nohup nanoQC ../CRR302667.fastq.gz -o 01.nanoQC_res -l 1000 &   
nohup nanoQC ../CRR302667.fastq.gz -o 01.nanoQC_res2k -l 2000 &
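
To make the effect of -l more concrete, here is a rough sketch based only on the help text above: reads shorter than the minimum length are skipped, and the first and last minlen/2 positions of the remaining reads are summarised. The function end_composition is hypothetical and is not taken from nanoQC itself; it only counts bases per position, whereas the real report also shows base quality.

```python
from collections import Counter

def end_composition(seqs, minlen):
    """Per-position base counts for the first and last minlen // 2 bases."""
    window = minlen // 2
    start = [Counter() for _ in range(window)]
    end = [Counter() for _ in range(window)]
    for seq in seqs:
        if len(seq) < minlen:
            continue  # reads shorter than minlen are left out
        for i in range(window):
            start[i][seq[i]] += 1
            end[i][seq[len(seq) - window + i]] += 1
    return start, end

# Toy input: the third read is shorter than minlen and is ignored.
seqs = ["AACCGGTT", "AAGGGGTT", "ACG"]
start, end = end_composition(seqs, minlen=4)
print([dict(c) for c in start])  # [{'A': 2}, {'A': 2}]
print([dict(c) for c in end])    # [{'T': 2}, {'T': 2}]
```

A larger -l value therefore changes two things at once: fewer short reads contribute, and a longer stretch at each read end is inspected.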

The output consists of a log file and an HTML report (the HTML report is the one to look at):

-rw-r--r-- 1 debian debian 164097 8  30 21:14 nanoQC.html
-rw-r--r-- 1 debian debian    385 8  30 21:14 NanoQC.log

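The report covers read length, base composition and base quality.
[Figure: nanoQC report]
With -l 2000 the results do look somewhat better; base composition and base quality in particular improve noticeably.
[Figure: nanoQC report with -l 2000]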

(2) NanoPlot

NanoPlot has many parameters. The official usage examples are:

EXAMPLES:
NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed  
NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots dot --legacy hex
NanoPlot -t 12 --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000 -o bamplots_downsampled

Running NanoPlot:

## The hex option produced no plot; the reason is given on GitHub:
## --plots uses the plotly package to plot kde and dot plots. Hex option will be ignored.
## The hex plot can be produced with --legacy hex instead
## --downsample can be used to subsample the whole dataset
nohup NanoPlot -t 10 --fastq ../CRR302667.fastq.gz --plots hex dot kde -o 01.Nanoplot -p Ath -cm Viridis &

PNG images written:

-rw-r--r-- 1 debian debian 46966 8  30 23:20 AthLengthvsQualityScatterPlot_dot.png
-rw-r--r-- 1 debian debian 99265 8  30 23:20 AthLengthvsQualityScatterPlot_kde.png
-rw-r--r-- 1 debian debian 27722 8  30 23:20 AthNon_weightedHistogramReadlength.png
-rw-r--r-- 1 debian debian 37099 8  30 23:20 AthNon_weightedLogTransformed_HistogramReadlength.png
-rw-r--r-- 1 debian debian 34650 8  30 23:20 AthWeightedHistogramReadlength.png
-rw-r--r-- 1 debian debian 38920 8  30 23:20 AthWeightedLogTransformed_HistogramReadlength.png
-rw-r--r-- 1 debian debian 36923 8  30 23:20 AthYield_By_Length.png

HTML and log files are also written. The overall report is AthNanoPlot-report.html, which contains the statistics summary and the plots.

-rw-r--r-- 1 debian debian  486051 8  30 23:20 AthLengthvsQualityScatterPlot_dot.html
-rw-r--r-- 1 debian debian  723285 8  30 23:20 AthLengthvsQualityScatterPlot_kde.html
-rw-r--r-- 1 debian debian    2693 8  30 23:20 AthNanoPlot_20220830_2151.log
-rw-r--r-- 1 debian debian 1540597 8  30 23:20 AthNanoPlot-report.html
-rw-r--r-- 1 debian debian   29207 8  30 23:20 AthNon_weightedHistogramReadlength.html
-rw-r--r-- 1 debian debian   30051 8  30 23:20 AthNon_weightedLogTransformed_HistogramReadlength.html
-rw-r--r-- 1 debian debian   32743 8  30 23:20 AthWeightedHistogramReadlength.html
-rw-r--r-- 1 debian debian   39886 8  30 23:20 AthWeightedLogTransformed_HistogramReadlength.html
-rw-r--r-- 1 debian debian  189660 8  30 23:20 AthYield_By_Length.html

There is also a text file, AthNanoStats.txt, with summary statistics for the whole dataset:

General summary:         
Mean read length:                 18,541.3
Mean read quality:                    11.1
Median read length:                7,818.0
Median read quality:                  11.2
Number of reads:               3,064,191.0
Read length N50:                  46,452.0
STDEV read length:                26,536.0
Total bases:              56,814,196,989.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:	3064191 (100.0%) 56814.2Mb
>Q7:	3064123 (100.0%) 56814.2Mb
>Q10:	2168595 (70.8%) 40456.1Mb
>Q12:	1055916 (34.5%) 19383.7Mb
>Q15:	6640 (0.2%) 12.3Mb
Top 5 highest mean basecall quality scores and their read lengths
1:	21.0 (1)
2:	19.0 (1)
3:	19.0 (1)
4:	19.0 (1)
5:	18.9 (358)
Top 5 longest reads and their mean basecall quality score
1:	495032 (12.4)
2:	457760 (8.7)
3:	439434 (9.1)
4:	438143 (8.7)
5:	431286 (9.7)
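
As a reading aid for the summary above, the sketch below shows how two of its entries are commonly defined: the read length N50 and the count, fraction and bases of reads above a quality cutoff. The function names n50 and above_cutoff are hypothetical, the toy reads are made up, and the strict "greater than" comparison is an assumption based on the wording "above quality cutoffs".

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no reads given")

def above_cutoff(reads, cutoff):
    """Count, fraction and total bases of reads with mean quality above cutoff."""
    passing = [length for length, quality in reads if quality > cutoff]
    return len(passing), len(passing) / len(reads), sum(passing)

# Toy reads as (length, mean quality).
reads = [(10, 12.0), (8, 9.5), (6, 11.0), (4, 7.2), (2, 13.1)]
print(n50([length for length, _ in reads]))  # 8
print(above_cutoff(reads, 10))               # (3, 0.6, 18)
```

The same arithmetic can be checked against the report: 2,168,595 of 3,064,191 reads above Q10 is about 70.8%, and 56,814,196,989 bases divided by 3,064,191 reads gives the mean read length of about 18,541.3.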

NanoFilt usage examples:

EXAMPLES:
  gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 500 --headcrop 50 | minimap2 genome.fa - | samtools sort -O BAM -@24 -o alignment.bam -
  gunzip -c reads.fastq.gz | NanoFilt -q 12 --headcrop 75 | gzip > trimmed-reads.fastq.gz
  gunzip -c reads.fastq.gz | NanoFilt -q 10 | gzip > highQuality-reads.fastq.gz

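Personally I find the stat output the most useful: it shows the mean read length, the quality distribution and so on.
The PNG plots are only good for a quick look; their image quality is poor.
(The software needs network access at run time, otherwise no PNG images are produced.)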

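Overall, the sequencing quality of this Arabidopsis Nanopore dataset is still rather poor and cannot compare with second-generation (short-read) sequencing.
I also have some doubts about how this software computes its quality statistics, so I asked the author in an issue on the GitHub repository.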

MinIONQC can also be used to assess Nanopore data, but it works from the basecaller output (basecalls from fast5), so it cannot be used when only a fastq file is available:

The benefit of MinIONQC is that it works directly with the sequencing_summary.txt
 files produced by ONT's Albacore or Guppy base callers. 

References:
https://github.com/wdecoster/nanoQC
https://github.com/wdecoster/NanoPlot