04 Aug 2016
The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
It is sometimes more productive to preprocess the FASTA/FASTQ files before mapping the sequences to the genome - manipulating the sequences to produce better mapping results.
The FASTX-Toolkit tools perform some of these preprocessing tasks.
$ wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
$ tar -xjf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
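# Then add the extracted executables to the PATH environment variable
Reference: http://www.bbioo.com/lifesciences/40-115086-1.html
Removing adapter sequences with the FASTX-Toolkit default settings produced the following error: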
[root@localhost fastq]# fastx_clipper -i 1.fastq -o 1.fa
fastx_clipper: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
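This is a FASTQ quality-score encoding problem.
In theory, Sanger FASTQ stores quality as ord($Q) - 33,
while Illumina uses ord($Q) - 64.
Some recently received data actually uses the ord($Q) - 33 encoding,
so fastx_clipper, which the error shows is subtracting 64 ('#' is ASCII 35, and 35 - 64 = -29), fails:
fastx_clipper: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
Adding the -Q 33 option makes the command run successfully.
The explanation above is an answer found online.
ASCII->SCORE
According to the ASCII-to-score chart, the S and L encodings subtract 33; X, I and J subtract 64.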
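The offset arithmetic can be checked with a small Python sketch. It is only an illustration of the two encodings discussed above; the function name `phred_scores` is made up for this example and is not part of the FASTX-Toolkit.

```python
def phred_scores(quality, offset):
    """Decode a FASTQ quality string into numeric scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


# '#' is ASCII 35: a valid score under the +33 encoding,
# but a negative (invalid) score when 64 is subtracted instead.
print(phred_scores("#", 33))  # [2]
print(phred_scores("#", 64))  # [-29]
```

The second value matches the "quality value -29" reported in the fastx_clipper error message.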
Command Line Arguments
- Most tools show usage information with **-h**.
- Tools can read from STDIN and write to STDOUT, or from a specific input file (**-i**) and specific output file (**-o**).
- Tools can operate silently (producing no output if everything was OK), or print a short summary (**-v**).
If output goes to STDOUT, the summary will be printed to STDERR.
If output goes to a file, the summary will be printed to STDOUT.
- Some tools can compress the output with GZIP (**-z**).
The test data is test.fastq, which contains 9 records (9 × 4 lines) in Illumina 1.8+ Phred+33 format.
Convert FASTQ files to FASTA files.
$ fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-r] = Rename sequence identifiers to numbers. (The identifier on the header line is replaced by a counter,
giving for example: >1
ATCGTGT
>2
ACGTA)
[-n] = keep sequences with unknown (N) nucleotides.
Default is to discard such sequences.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA output file. default is STDOUT.
$ fastq_to_fasta -Q 33 -i test.fastq -n -v -o test.fastq_TofastaKeepN.fasta
Input: 9 reads.
Output: 9 reads.
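As a rough illustration of what the conversion does, here is a minimal Python sketch. It assumes well-formed 4-line FASTQ records, and the function and parameter names (`fastq_to_fasta`, `keep_n`, `rename`) are hypothetical stand-ins for the -n and -r options, not the tool's implementation.

```python
def fastq_to_fasta(lines, keep_n=False, rename=False):
    """Convert FASTQ lines (4 lines per record assumed) into FASTA lines."""
    out = []
    kept = 0
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        # Like the default behaviour: drop reads containing N unless keep_n is set
        if not keep_n and "N" in seq.upper():
            continue
        kept += 1
        # Like -r: replace the identifier with a running number
        name = str(kept) if rename else header[1:]
        out.append(">" + name)
        out.append(seq)
    return out


records = ["@r1", "ACGT", "+", "IIII", "@r2", "ANGT", "+", "IIII"]
print(fastq_to_fasta(records))
print(fastq_to_fasta(records, keep_n=True, rename=True))
```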
Chart Quality Statistics and Nucleotide Distribution
FASTX Statistics
$ fastx_quality_stats -h
usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]
version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-i INFILE] = FASTA/Q input file. default is STDIN.
If FASTA file is given, only nucleotides
distribution is calculated (there's no quality info).
[-o OUTFILE] = TEXT output file. default is STDOUT.
[-N] = New output format (with more information per nucleotide/cycle). The old format is the default.
The output TEXT file will have the following fields (one row per column):
column = column number (1 to 36 for a 36-cycles read solexa file)
count = number of bases found in this column.
min = Lowest quality score value found in this column.
max = Highest quality score value found in this column.
sum = Sum of quality score values for this column.
mean = Mean quality score value for this column.
Q1 = 1st quartile quality score (the 25th-percentile quality value).
med = Median quality score.
Q3 = 3rd quartile quality score.
IQR = Inter-Quartile range (Q3-Q1).
lW = 'Left-Whisker' value (for boxplotting).
rW = 'Right-Whisker' value (for boxplotting).
A_Count = Count of 'A' nucleotides found in this column.
C_Count = Count of 'C' nucleotides found in this column.
G_Count = Count of 'G' nucleotides found in this column.
T_Count = Count of 'T' nucleotides found in this column.
N_Count = Count of 'N' nucleotides found in this column.
max-count = max. number of bases (in all cycles)
The *NEW* output format:
cycle (previously called 'column') = cycle number
max-count
For each nucleotide in the cycle (ALL/A/C/G/T/N):
count = number of bases found in this column.
min = Lowest quality score value found in this column.
max = Highest quality score value found in this column.
sum = Sum of quality score values for this column.
mean = Mean quality score value for this column.
Q1 = 1st quartile quality score.
med = Median quality score.
Q3 = 3rd quartile quality score.
IQR = Inter-Quartile range (Q3-Q1).
lW = 'Left-Whisker' value (for boxplotting).
rW = 'Right-Whisker' value (for boxplotting).
$ fastx_quality_stats -Q 33 -i test.fastq -o test.fastq_qualityStats
Note: "column" and "cycle number" both refer to the base position within a read; the record below, for example, has 40 columns.
@ERR013180.1 HWI-EAS-249_38:2:1:2:857/1
TTTTCTTGTTCTTGACTCTTCTGCATAAGTANTTAAATCC
+
BBBBBCBB=BCBB6BBBB6!!!!!!!!!!!!!!!!!!!!!
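To make the per-column idea concrete, the following Python sketch computes a few of the simpler fields (count, min, max, sum, mean and nucleotide counts) for each column. It is an illustration only: quartiles and whiskers are left out, the Phred+33 offset is an assumption, and `column_stats` is a name invented for this example.

```python
from collections import Counter


def column_stats(reads, offset=33):
    """reads: list of (sequence, quality) pairs. Returns one dict per column."""
    n_cols = max(len(seq) for seq, _ in reads)
    stats = []
    for col in range(n_cols):
        # Only reads long enough to reach this column contribute to it
        covering = [(s, q) for s, q in reads if col < len(s)]
        quals = [ord(q[col]) - offset for _, q in covering]
        bases = Counter(s[col] for s, _ in covering)
        row = {
            "column": col + 1,
            "count": len(quals),
            "min": min(quals),
            "max": max(quals),
            "sum": sum(quals),
            "mean": sum(quals) / len(quals),
        }
        for base in "ACGTN":
            row[base + "_Count"] = bases[base]
        stats.append(row)
    return stats


for row in column_stats([("AC", "I!"), ("AG", "5I"), ("T", "I")]):
    print(row)
```

Shorter reads simply stop contributing to later columns, which is why the count can drop toward the end of the table.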
$ fastq_quality_boxplot_graph.sh -h
Solexa-Quality BoxPlot plotter
Generates a solexa quality score box-plot graph
Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]
[-p] - Generate PostScript (.PS) file. Default is PNG image.
[-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
(Here, the input is the output file of fastx_quality_stats.)
[-o OUTPUT] - Output file name. default is STDOUT.
[-t TITLE] - Title (usually the solexa file name) - will be plotted on the graph.
# gnuplot must be installed first
$ sudo yum install gnuplot-minimal.x86_64
$ fastq_quality_boxplot_graph.sh -i test.fastq_qualityStats -t test.fastq_boxplot -o test.fastq_boxplot.png
test.fastq_boxplot.png
$ fastx_nucleotide_distribution_graph.sh -h
FASTA/Q Nucleotide Distribution Plotter
Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]
[-p] - Generate PostScript (.PS) file. Default is PNG image.
[-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
[-o OUTPUT] - Output file name. default is STDOUT.
[-t TITLE] - Title - will be plotted on the graph.
$ fastx_nucleotide_distribution_graph.sh -i test.fastq_qualityStats -o test.fastq_nucleotide_distribution_graph.png
test.fastq_boxplot.png
The blank white area in the plot above appears because some of the sequenced reads contain fewer than 40 bases.
$ fastx_artifacts_filter -h
usage: fastx_artifacts_filter [-h] [-v] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-z] = Compress output with GZIP.
[-v] = Verbose - report number of processed reads.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
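Is this a filter for "artificial" reads? My test data contains one record whose quality string consists entirely of '!', yet it was not filtered out, and I am not sure why. Since the tool also accepts FASTA input, which carries no quality information, the filter presumably looks at the sequence itself rather than at quality scores.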
Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)
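In other words, identical sequences are merged into a single record and the read count is reported. To test this, some records in the test data need to be duplicated first.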
$ fastx_collapser -h
usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-v] = verbose: print short summary of input/output counts
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
The output looks like this:
>1-1
TTTTCTTGTTCTTGACTCTTCTGCATAAGTANTTAAATCC
>2-1
TATGCATCACATTCTTCTGGTTCTACTTTGCNATTTATCT
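The collapsing step can be sketched in Python as follows. The `>rank-count` identifier format is inferred from the output shown above; sorting by descending count and the name `collapse` are assumptions made for this illustration.

```python
from collections import Counter


def collapse(sequences):
    """Merge identical sequences and emit FASTA lines named '>rank-count'."""
    counts = Counter(sequences)
    # Most frequent first; ties keep first-seen order because sorted() is stable
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    out = []
    for rank, (seq, n) in enumerate(ranked, start=1):
        out.append(">{}-{}".format(rank, n))
        out.append(seq)
    return out


print(collapse(["AAA", "CCC", "CCC"]))
```

With input in which every sequence is unique, every identifier ends in `-1`, as in the output above.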
$ fastx_uncollapser -h
usage: fasta_uncollapser [-c N] [-h] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-v] = verbose: print short summary of input/output counts
[-c N] = Assume input is a tabular file (not FASTA file),
And the collapsed identifier (e.g. '1-1000') is on column N.
[-i INFILE] = FASTA/Tabular input file. default is STDIN.
[-o OUTFILE] = FASTA/Tabular output file. default is STDOUT.
Shortening reads in FASTQ or FASTA files (removing barcodes or noise).
$ fastx_trimmer -h
usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-f N] = First base to keep. Default is 1 (=first base).
[-l N] = Last base to keep. Default is entire read.
[-t N] = Trim N nucleotides from the end of the read.
'-t' can not be used with '-l' and '-f'.
[-m MINLEN] = With [-t], discard reads shorter than MINLEN.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
In summary, these options select which part of each read is kept: -f keeps bases from position N onward, -l keeps bases up to position N (counting from the left), and -t removes N bases from the right end.
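The three options map naturally onto string slicing. The Python sketch below illustrates the 1-based, inclusive positions described in the help text; `trim` and `trim_end` are hypothetical helper names, not the tool's code.

```python
def trim(seq, first=1, last=None):
    """Keep bases first..last (1-based, inclusive), in the spirit of -f and -l."""
    end = len(seq) if last is None else last
    return seq[first - 1:end]


def trim_end(seq, n, min_len=0):
    """Drop n bases from the right end (like -t); None if shorter than min_len (like -m)."""
    kept = seq[:max(len(seq) - n, 0)]
    return kept if len(kept) >= min_len else None


print(trim("ACGTACGT", first=3, last=6))  # GTAC
print(trim_end("ACGTACGT", 3))            # ACGTA
print(trim_end("ACGT", 3, min_len=2))     # None
```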
Renames the sequence identifiers in FASTQ/A file.
$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)
[-n TYPE] = rename type:
SEQ - use the nucleotides sequence as the name.
COUNT - use simply counter as the name.
[-h] = This helpful help screen.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
$ fastx_renamer -Q 33 -n SEQ -v -i test.fastq -o test.fastq_rename.fastq
Renamed: 9 reads.
That is, the nucleotide sequence itself is used as the identifier on the first line of each FASTQ record.
Removing sequencing adapters / linkers
$ fastx_clipper -h
usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.7
[-h] = This helpful help screen.
[-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
[-l N] = discard sequences shorter than N nucleotides. default is 5.
[-d N] = Keep the adapter and N bases after it.
(using '-d 0' is the same as not using '-d' at all. which is the default).
[-c] = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
[-C] = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
[-k] = Report Adapter-Only sequences.
[-n] = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-D] = DEBUG output.
[-M N] = require minimum adapter alignment length of N.
If less than N nucleotides aligned with the adapter - don't clip it.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
$ fastx_clipper -Q 33 -a TTAAATCC -l -c -v -i test.fastq -o test.fastq_clipperTest_onlyFirstRecordHaveAdapter.fastq
Clipping Adapter: TTAAATCC
Min. Length: 0
Input: 9 reads.
Output: 2 reads.
discarded 0 too-short reads.
discarded 0 adapter-only reads.
discarded 7 N reads.
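Some options do not seem to affect the output here: -c and -C appear to take effect only together with -v. Note, however, that the command above passes -l without a value, so the following -c was probably consumed as the argument of -l (the summary reports "Min. Length: 0"), which may be why -c appeared to do nothing.
The -D option is quite handy for inspecting how the adapters are clipped. It is easier to try than to describe, so experiment with it yourself.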
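The interplay of the options can be illustrated with a simplified Python sketch. It uses exact substring matching, which is a deliberate simplification and not the matching method fastx_clipper uses; the function name `clip` and its keyword arguments are hypothetical stand-ins for -l, -n, -c and -C.

```python
def clip(seq, adapter, min_len=5, keep_n=False,
         only_clipped=False, only_unclipped=False):
    """Return the clipped read, or None if the read would be discarded."""
    pos = seq.find(adapter)
    clipped = pos != -1
    if clipped:
        seq = seq[:pos]          # drop the adapter and everything after it
    if only_clipped and not clipped:
        return None              # like -c: keep only reads that had the adapter
    if only_unclipped and clipped:
        return None              # like -C: keep only reads without the adapter
    if len(seq) < min_len:
        return None              # like -l: too short after clipping
    if not keep_n and "N" in seq:
        return None              # default: discard reads containing N
    return seq


print(clip("ACGTACGTTTAAATCC", "TTAAATCC"))               # ACGTACGT
print(clip("ACGTACGT", "TTAAATCC", only_clipped=True))    # None
print(clip("ACNTACGTTTAAATCC", "TTAAATCC", keep_n=True))  # ACNTACGT
```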
Producing the Reverse-complement of each sequence in a FASTQ/FASTA file.
$ fastx_reverse_complement -h
usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
As the name suggests.
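For readers who want to see the operation spelled out, here is a minimal Python sketch. It handles only A/C/G/T/N, and reversing the quality string for FASTQ records is what one would expect so that scores stay aligned with their bases; the function names are made up for this example.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_record(seq, quality):
    """For FASTQ: the quality string is reversed so each score stays with its base."""
    return reverse_complement(seq), quality[::-1]


print(reverse_complement("ATGCN"))                  # NGCAT
print(reverse_complement_record("ATGC", "IH5!"))    # ('GCAT', '!5HI')
```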
requires two perl modules: PerlIO::gzip and GD::Graph::bars.
Splitting FASTQ/FASTA files containing multiple samples
where is barcode
$ fastx_barcode_splitter.pl -h
Barcode Splitter, by Assaf Gordon (gordon@cshl.edu), 11sep2008
This program reads FASTA/FASTQ file and splits it into several smaller files,
Based on barcode matching.
(That is, it splits a FASTQ/FASTA file by barcode sequence.)
FASTA/FASTQ data is read from STDIN (format is auto-detected.)
Output files will be writen to disk.
Summary will be printed to STDOUT.
usage: /home/lang/fastx-tool/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol]
[--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]
Arguments:
--bcfile FILE - Barcodes file name. (see explanation below.)
--prefix PREFIX - File prefix. will be added to the output files. Can be used
to specify output directories.
--suffix SUFFIX - File suffix (optional). Can be used to specify file
extensions.
--bol - Try to match barcodes at the BEGINNING of sequences.
(What biologists would call the 5' end, and programmers
would call index 0.)
--eol - Try to match barcodes at the END of sequences.
(What biologists would call the 3' end, and programmers
would call the end of the string.)
NOTE: one of --bol, --eol must be specified, but not both.
--mismatches N - Max. number of mismatches allowed. default is 1.
--exact - Same as '--mismatches 0'. If both --exact and --mismatches
are specified, '--exact' takes precedence.
--partial N - Allow partial overlap of barcodes. (see explanation below.)
(Default is not partial matching)
--quiet - Don't print counts and summary at the end of the run.
(Default is to print.)
--debug - Print lots of useless debug information to STDERR.
--help - This helpful help screen.
Example (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is
the barcodes file):
$ cat s_2_100.txt | /home/lang/fastx-tool/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 \
--prefix /tmp/bla_ --suffix ".txt"
Barcode file format
-------------------
Barcode files are simple text files. Each line should contain an identifier
(descriptive name for the barcode), and the barcode itself (A/C/G/T),
separated by a TAB character. Example:
#This line is a comment (starts with a 'number' sign)
BC1 GATCT
BC2 ATCGT
BC3 GTGAT
BC4 TGTCT
For each barcode, a new FASTQ file will be created (with the barcode's
identifier as part of the file name). Sequences matching the barcode
will be stored in the appropriate file.
Running the above example (assuming "mybarcodes.txt" contains the above
barcodes), will create the following files:
/tmp/bla_BC1.txt
/tmp/bla_BC2.txt
/tmp/bla_BC3.txt
/tmp/bla_BC4.txt
/tmp/bla_unmatched.txt
The 'unmatched' file will contain all sequences that didn't match any barcode.
Barcode matching
----------------
** Without partial matching:
Count mismatches between the FASTA/Q sequences and the barcodes.
The barcode which matched with the lowest mismatches count (providing the
count is small or equal to '--mismatches N') 'gets' the sequences.
Example (using the above barcodes):
Input Sequence:
GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
Matching with '--bol --mismatches 1':
GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch, BC1)
ATCGT (4 mismatches, BC2)
GTGAT (3 mismatches, BC3)
TGTCT (3 mismatches, BC4)
This sequence will be classified as 'BC1' (it has the lowest mismatch count).
If '--exact' or '--mismatches 0' were specified, this sequence would be
classified as 'unmatched' (because, although BC1 had the lowest mismatch count,
it is above the maximum allowed mismatches).
Matching with '--eol' (end of line) does the same, but from the other side
of the sequence.
** With partial matching (very similar to indels):
Same as above, with the following addition: barcodes are also checked for
partial overlap (number of allowed non-overlapping bases is '--partial N').
Example:
Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
(Same as above, but note the missing 'G' at the beginning.)
Matching (without partial overlapping) against BC1 yields 4 mismatches:
ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (4 mismatches)
Partial overlapping would also try the following match:
-ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
GATCT (1 mismatch)
Note: scoring counts a missing base as a mismatch, so the final
mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
If running with '--mismatches 2' (meaning allowing upto 2 mismatches) - this
seqeunce will be classified as BC1.
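The non-partial matching rule for --bol can be reproduced in a few lines of Python. This sketch covers only the "without partial matching" case described in the help text; `mismatches` and `classify` are names chosen for this illustration, and the barcodes are the ones from the example barcode file.

```python
def mismatches(barcode, seq):
    """Count mismatches between a barcode and the start of a sequence (--bol)."""
    window = seq[:len(barcode)]
    # Positions missing from a too-short read count as mismatches as well
    return sum(a != b for a, b in zip(barcode, window)) + (len(barcode) - len(window))


def classify(seq, barcodes, max_mismatches=1):
    """Return the id of the best-matching barcode, or 'unmatched'."""
    best_id, best = "unmatched", max_mismatches + 1
    for bc_id, bc in barcodes.items():
        m = mismatches(bc, seq)
        if m < best:
            best_id, best = bc_id, m
    return best_id


barcodes = {"BC1": "GATCT", "BC2": "ATCGT", "BC3": "GTGAT", "BC4": "TGTCT"}
read = "GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"
print({bc_id: mismatches(bc, read) for bc_id, bc in barcodes.items()})
print(classify(read, barcodes, max_mismatches=1))  # BC1
print(classify(read, barcodes, max_mismatches=0))  # unmatched
```

The mismatch counts printed for the four barcodes (1, 4, 3, 3) agree with the worked example in the help text.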
Changes the width of sequence lines in a FASTA file
$ fasta_formatter -h
usage: fasta_formatter [-h] [-i INFILE] [-o OUTFILE] [-w N] [-t] [-e]
Part of FASTX Toolkit 0.0.13 by gordon@cshl.edu
[-h] = This helpful help screen.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-w N] = max. sequence line width for output FASTA file.
When ZERO (the default), sequence lines will NOT be wrapped -
all nucleotides of each sequences will appear on a single
line (good for scripting).
[-t] = Output tabulated format (instead of FASTA format).
Sequence-Identifiers will be on first column,
Nucleotides will appear on second column (as single line).
[-e] = Output empty sequences (default is to discard them).
Empty sequences are ones who have only a sequence identifier,
but not actual nucleotides.
Input Example:
>MY-ID
AAAAAGGGGG
CCCCCTTTTT
AGCTN
Output example with unlimited line width [-w 0]:
>MY-ID
AAAAAGGGGGCCCCCTTTTTAGCTN
Output example with max. line width=7 [-w 7]:
>MY-ID
AAAAAGG
GGGTTTT
TCCCCCA
GCTN
Output example with tabular output [-t]:
MY-ID AAAAAGGGGGCCCCCTTTTAGCTN
example of empty sequence:
(will be discarded unless [-e] is used)
>REGULAR-SEQUENCE-1
AAAGGGTTTCCC
>EMPTY-SEQUENCE
>REGULAR-SEQUENCE-2
AAGTAGTAGTAGTAGT
GTATTTTATAT
The help text explains this clearly enough.
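As a quick cross-check of the wrapping behaviour, here is a small Python sketch. It assumes the sequence lines of one record are simply joined and re-cut at a fixed width; `wrap_fasta` is a hypothetical helper, not the tool's code.

```python
def wrap_fasta(seq_lines, width=0):
    """Join the sequence lines of one record and re-wrap them at `width` (0 = no wrapping)."""
    seq = "".join(seq_lines)
    if width <= 0:
        return [seq]
    return [seq[i:i + width] for i in range(0, len(seq), width)]


record = ["AAAAAGGGGG", "CCCCCTTTTT", "AGCTN"]
print(wrap_fasta(record))           # ['AAAAAGGGGGCCCCCTTTTTAGCTN']
print(wrap_fasta(record, width=7))  # ['AAAAAGG', 'GGGCCCC', 'CTTTTTA', 'GCTN']
```

Note that re-wrapping the input example this way gives lines that differ from the `-w 7` example printed in the help screen, so the help screen's sample output looks like it was typed by hand rather than generated.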
Converts FASTA sequences from/to RNA/DNA
$ fasta_nucleotide_changer -h
usage: fasta_nucleotide_changer [-h] [-z] [-v] [-i INFILE] [-o OUTFILE] [-r] [-d]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-z] = Compress output with GZIP.
[-v] = Verbose mode. Prints a short summary.
with [-o], summary is printed to STDOUT.
Otherwise, summary is printed to STDERR.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-r] = DNA-to-RNA mode - change T's into U's.
[-d] = RNA-to-DNA mode - change U's into T's.
It simply swaps U and T in the sequences.
Filters sequences based on quality
$ fastq_quality_filter -h
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-q N] = Minimum quality score to keep.
[-p N] = Minimum percent of bases that must have [-q] quality. (The percentage is evaluated per read.)
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
The test data contains 9 records (9 × 4 lines), two of which include the lowest quality character '!'.
$ fastq_quality_filter -i test.fastq -Q 33 -v -q 20 -p 50 -o test.fastq_filterq20p50.fastq
Quality cut-off: 20
Minimum percentage: 50
Input: 9 reads.
Output: 7 reads.
discarded 2 (22%) low-quality reads.
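The -q/-p rule can be written down as a short Python sketch. It assumes Phred+33 input and treats "must have [-q] quality" as "score greater than or equal to -q"; the function name `passes_quality_filter` is invented for this illustration.

```python
def passes_quality_filter(quality, min_quality, min_percent, offset=33):
    """True if at least min_percent of the bases have a score >= min_quality."""
    if not quality:
        return False
    good = sum(1 for ch in quality if ord(ch) - offset >= min_quality)
    # Compare as integers to avoid floating-point rounding at the boundary
    return good * 100 >= min_percent * len(quality)


# 'I' decodes to 40 and '!' to 0 under Phred+33
print(passes_quality_filter("IIIIII!!!!", 20, 50))  # True  (60% of bases pass)
print(passes_quality_filter("IIII!!!!!!", 20, 50))  # False (40% of bases pass)
```

Under this rule a read whose second half is all '!' falls just below or at the 50% line, which is consistent with a couple of such reads being discarded in the run above.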
Trims (cuts) sequences based on quality
$ fastq_quality_trimmer -h
usage: fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-t N] = Quality threshold - nucleotides with lower
quality will be trimmed (from the end of the sequence).
[-l N] = Minimum length - sequences shorter than this (after trimming)
will be discarded. Default = 0 = no minimum length.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTQ input file. default is STDIN.
[-o OUTFILE] = FASTQ output file. default is STDOUT.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
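From testing, the -t option behaves as follows:
when the sequencing quality of a base in a FASTQ read falls below the threshold, the part of the read from that base to the end is removed.
For example, if ATGCTGAG has quality scores 33 22 11 33 21 30 25 23 and you choose a threshold of 30, the tool scans from right to left for the first base below 30 and then deletes everything after that base.
Personally, I do not find this very useful.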
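One plausible reading of the help text ("nucleotides with lower quality will be trimmed from the end of the sequence") is that trailing bases are dropped for as long as their score is below the threshold. The Python sketch below implements that reading only; it is an assumption, the real tool may behave differently, and `quality_trim` is a name made up here.

```python
def quality_trim(seq, scores, threshold, min_len=0):
    """Drop trailing bases while their score is below threshold.

    Returns (sequence, scores), or None if the result is shorter than min_len.
    """
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], scores[:end]


print(quality_trim("ATGCTGAG", [33, 22, 11, 33, 21, 30, 25, 23], 30))
# ('ATGCTG', [33, 22, 11, 33, 21, 30])
```

Under this reading, low-quality bases in the middle of the read (22, 11 and 21 here) are kept, because trimming stops at the first base from the right that meets the threshold.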
$ fastq_quality_converter -h
usage: fastq_quality_converter [-h] [-a] [-n] [-z] [-i INFILE] [-f OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-a] = Output ASCII quality scores (default).
[-n] = Output numeric quality scores.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA output file. default is STDOUT.
Masks nucleotides with ‘N’ (or other character) based on quality
$ fastq_masker -h
usage: fastq_masker [-h] [-v] [-q N] [-r C] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-q N] = Quality threshold - nucleotides with lower quality will be masked
Default is 10.
[-r C] = Replace low-quality nucleotides with character C. Default is 'N'
[-z] = Compress output with GZIP.
[-i INFILE] = FASTQ input file. default is STDIN.
[-o OUTFILE] = FASTQ output file. default is STDOUT.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
In other words, with the -q option set, every base whose quality score falls below the threshold is replaced by N.
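A minimal Python sketch of the masking rule, assuming Phred+33 input; the function name `mask` and its parameters are hypothetical stand-ins for -q and -r.

```python
def mask(seq, quality, threshold=10, replacement="N", offset=33):
    """Replace every base whose score is below threshold with the replacement character."""
    return "".join(
        replacement if ord(q) - offset < threshold else base
        for base, q in zip(seq, quality)
    )


# Scores under Phred+33: 'I' = 40, '!' = 0, '+' = 10
print(mask("ACGT", "I!I+"))                # ANGT
print(mask("ACGT", "I!I+", threshold=11))  # ANGN
```

A score exactly equal to the threshold is kept in this sketch, since the help text says only nucleotides with lower quality are masked.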