Fastx toolkit   


04 Aug 2016

studyFastx-toolkit

0 Introduction:

The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

It is sometimes more productive to preprocess the FASTA/FASTQ files before mapping the sequences to the genome - manipulating the sequences to produce better mapping results.

The FASTX-Toolkit tools perform some of these preprocessing tasks.

1 Installation: the precompiled binaries can be installed directly

$ wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
$ tar -xjf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
# Then add the directory containing the executable binaries to the PATH environment variable

FAQ:

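Adapted from: http://www.bbioo.com/lifesciences/40-115086-1.html

Removing adapter sequences with the default FASTX-Toolkit settings produces the following error:
[root@localhost fastq]# fastx_clipper -i 1.fastq -o 1.fa
fastx_clipper: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
This is a problem with the FASTQ quality encoding.
In theory, the quality value is computed as:
    Sanger FASTQ:    ord($Q) - 33
    Illumina FASTQ:  ord($Q) - 64
Some recently received data turned out to still use ord($Q) - 33 quality values. fastx_clipper assumes the 64 offset by default, so '#' (ord 35) decodes to 35 - 64 = -29 and the tool reports the error above.
Adding -Q 33 to the command makes it run successfully.

The above is an answer found online.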

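Figure: ASCII->SCORE

According to the figure above, the S (Sanger) and L (Illumina 1.8+) encodings subtract 33, while X (Solexa), I (Illumina 1.3+) and J (Illumina 1.5+) subtract 64.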
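
The effect of the two offsets can be checked with a small Python sketch. The function name `phred_scores` is only an illustration and is not part of FASTX-Toolkit; it applies ord(Q) - offset to each quality character.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into numeric scores using ord(Q) - offset."""
    return [ord(ch) - offset for ch in quality_string]


# '#' is a valid low score under Phred+33 ...
print(phred_scores("#", offset=33))   # [2]
# ... but becomes negative when decoded with the +64 offset,
# which matches the "quality value -29" in the error message above.
print(phred_scores("#", offset=64))   # [-29]
```

A negative decoded value is a strong hint that the file is Phred+33 and needs -Q 33.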

2 Available Tools

Command Line Arguments

- Most tools show usage information with **-h**.
- Tools can read from STDIN and write to STDOUT, or from a specific input file (**-i**) and specific output file (**-o**).
- Tools can operate silently (producing no output if everything was OK), or print a short summary (**-v**).
  If output goes to STDOUT, the summary will be printed to STDERR.
  If output goes to a file, the summary will be printed to STDOUT.
- Some tools can compress the output with GZIP (**-z**).

The test data is test.fastq, with 9 records (9×4 lines), in Illumina 1.8+ Phred+33 format.


2.1 FASTQ-to-FASTA converter

Convert FASTQ files to FASTA files.

$ fastq_to_fasta -h
    usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
                      The identifier on the header line is replaced by a number, i.e. the output becomes:
                            >1
                            ATCGTGT
                            >2
                            ACGTA
       [-n]         = keep sequences with unknown (N) nucleotides.
                      Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                      If [-o] is specified,  report will be printed to STDOUT.
                      If [-o] is not specified (and output goes to STDOUT),
                      report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.
$ fastq_to_fasta -Q 33 -i test.fastq -n -v -o test.fastq_TofastaKeepN.fasta
Input: 9 reads.
Output: 9 reads.
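
What -r and -n do can be sketched in a few lines of Python. This is only an illustration of the behavior described in the help text, not the tool's implementation; the function name `fastq_to_fasta` and the choice to number only the kept reads are assumptions.

```python
def fastq_to_fasta(lines, rename=False, keep_n=False):
    """Convert FASTQ lines (4 per record) into FASTA lines.

    rename mimics -r (numeric identifiers), keep_n mimics -n.
    Assumption: with rename=True the counter runs over kept reads only.
    """
    out = []
    kept = 0
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        if "N" in seq.upper() and not keep_n:
            continue  # default: discard reads containing N
        kept += 1
        name = str(kept) if rename else header[1:]  # drop the leading '@'
        out.append(">" + name)
        out.append(seq)
    return out


records = ["@r1", "ACGT", "+", "IIII", "@r2", "ACNT", "+", "IIII"]
print(fastq_to_fasta(records))                      # r2 is dropped (contains N)
print(fastq_to_fasta(records, rename=True, keep_n=True))
```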

2.2 FASTQ Information

Chart Quality Statistics and Nucleotide Distribution

2.2.1 FASTX Statistics

$ fastx_quality_stats -h
    usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]

    version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
       [-h] = This helpful help screen.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
                      If FASTA file is given, only nucleotides
                      distribution is calculated (there's no quality info).
       [-o OUTFILE] = TEXT output file. default is STDOUT.
       [-N] = New output format (with more information per nucleotide/cycle). The old format is the default.

    The output TEXT file will have the following fields (one row per column):
        column  = column number (1 to 36 for a 36-cycles read solexa file)
        count   = number of bases found in this column.
        min     = Lowest quality score value found in this column.
        max     = Highest quality score value found in this column.
        sum     = Sum of quality score values for this column.
        mean    = Mean quality score value for this column.
        Q1      = 1st quartile quality score.
        med     = Median quality score.
        Q3      = 3rd quartile quality score.
        IQR     = Inter-Quartile range (Q3-Q1).
        lW      = 'Left-Whisker' value (for boxplotting).
        rW      = 'Right-Whisker' value (for boxplotting).
        A_Count = Count of 'A' nucleotides found in this column.
        C_Count = Count of 'C' nucleotides found in this column.
        G_Count = Count of 'G' nucleotides found in this column.
        T_Count = Count of 'T' nucleotides found in this column.
        N_Count = Count of 'N' nucleotides found in this column.
        max-count = max. number of bases (in all cycles)
        
  The *NEW* output format:
    cycle (previously called 'column') = cycle number
    max-count
    For each nucleotide in the cycle (ALL/A/C/G/T/N):
        count   = number of bases found in this column.
        min     = Lowest quality score value found in this column.
        max     = Highest quality score value found in this column.
        sum     = Sum of quality score values for this column.
        mean    = Mean quality score value for this column.
        Q1  = 1st quartile quality score.
        med = Median quality score.
        Q3  = 3rd quartile quality score.
        IQR = Inter-Quartile range (Q3-Q1).
        lW  = 'Left-Whisker' value (for boxplotting).
        rW  = 'Right-Whisker' value (for boxplotting).
$ fastx_quality_stats -Q 33 -i test.fastq -o test.fastq_qualityStats

Note: column and cycle number both refer to the position (column) within the sequence; the record below, for example, has 40 columns.

@ERR013180.1 HWI-EAS-249_38:2:1:2:857/1
TTTTCTTGTTCTTGACTCTTCTGCATAAGTANTTAAATCC
+
BBBBBCBB=BCBB6BBBB6!!!!!!!!!!!!!!!!!!!!!
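
To make the per-column fields concrete, here is a minimal Python sketch that computes count, min, max, sum and mean for each column of a set of quality strings. The function name `column_stats` is an assumption for illustration; the quartile and whisker fields are left out because the exact method the tool uses for them is not described in the help text.

```python
def column_stats(quality_strings, offset=33):
    """Per-column count/min/max/sum/mean of quality scores across reads."""
    stats = []
    longest = max(len(q) for q in quality_strings)
    for col in range(longest):
        # Reads shorter than this column simply do not contribute to it.
        scores = [ord(q[col]) - offset for q in quality_strings if len(q) > col]
        stats.append({
            "column": col + 1,          # 1-based, like the tool's output
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "sum": sum(scores),
            "mean": sum(scores) / len(scores),
        })
    return stats


for row in column_stats(["II!", "5I"]):
    print(row)
```

Because shorter reads contribute nothing to later columns, the count field drops towards the end when read lengths differ.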

2.2.2 FASTQ Quality Chart

$ fastq_quality_boxplot_graph.sh -h
    Solexa-Quality BoxPlot plotter
    Generates a solexa quality score box-plot graph

    Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

      [-p]           - Generate PostScript (.PS) file. Default is PNG image.
      [-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
                       (In practice, this is the output file of fastx_quality_stats.)
      [-o OUTPUT]    - Output file name. default is STDOUT.
      [-t TITLE]     - Title (usually the solexa file name) - will be plotted on the graph.
# gnuplot must be installed first
$ sudo yum install gnuplot-minimal.x86_64
$ fastq_quality_boxplot_graph.sh -i test.fastq_qualityStats -t test.fastq_boxplot  -o test.fastq_boxplot.png

Figure: test.fastq_boxplot.png

2.2.3 FASTA/Q Nucleotide Distribution

$ fastx_nucleotide_distribution_graph.sh -h
    FASTA/Q Nucleotide Distribution Plotter

    Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

      [-p]           - Generate PostScript (.PS) file. Default is PNG image.
      [-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
      [-o OUTPUT]    - Output file name. default is STDOUT.
      [-t TITLE]     - Title - will be plotted on the graph.
$ fastx_nucleotide_distribution_graph.sh -i test.fastq_qualityStats -o test.fastq_nucleotide_distribution_graph.png

Figure: test.fastq_boxplot.png

The blank white area in the figure above appears because some of the sequenced reads contain fewer than 40 bases.

2.3 FASTQ/A Artifacts Filter

$ fastx_artifacts_filter -h
usage: fastx_artifacts_filter [-h] [-v] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-z]         = Compress output with GZIP.
   [-v]         = Verbose - report number of processed reads.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.

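Is this a filter for artificial sequences? In my test data, one record whose quality string consists entirely of '!' was not filtered out, and I do not know why. A likely explanation is that this tool filters on sequence content (reads made up almost entirely of one repeated nucleotide) rather than on quality scores.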

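2.4 FASTQ/A Collapser

Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)

That is, identical sequences are merged into one and the read count is reported. To test this, the test data needs to contain duplicated records.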

$ fastx_collapser -h
usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-v]         = verbose: print short summary of input/output counts
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

The output looks like this:

>1-1
TTTTCTTGTTCTTGACTCTTCTGCATAAGTANTTAAATCC
>2-1
TATGCATCACATTCTTCTGGTTCTACTTTGCNATTTATCT
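
The idea behind collapsing can be sketched in Python as follows. The identifier format (rank, a dash, then the read count) follows the output above and the '1-1000' example in the uncollapser help; the function name `collapse` and the ordering by descending count are assumptions.

```python
from collections import Counter


def collapse(sequences):
    """Merge identical sequences and emit FASTA lines named '<rank>-<count>'."""
    counts = Counter(sequences)
    fasta = []
    # most_common() sorts by count, highest first; ties keep first-seen order.
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        fasta.append(f">{rank}-{n}")
        fasta.append(seq)
    return fasta


print("\n".join(collapse(["AAA", "CCC", "CCC"])))
```

With no duplicates in the input, every identifier ends in -1, as in the output shown above.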

2.5 FASTQ/A Uncollapser (the reverse of the above)

$ fastx_uncollapser -h
usage: fasta_uncollapser [-c N] [-h] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-v]         = verbose: print short summary of input/output counts
   [-c N]       = Assume input is a tabular file (not FASTA file),
                  And the collapsed identifier (e.g. '1-1000') is on column N.
   [-i INFILE]  = FASTA/Tabular input file. default is STDIN.
   [-o OUTFILE] = FASTA/Tabular output file. default is STDOUT.

2.6 FASTQ/A Trimmer

Shortening reads in a FASTQ or FASTA file (removing barcodes or noise).

$ fastx_trimmer -h
    usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]

    version 0.0.6
       [-h]         = This helpful help screen.
       [-f N]       = First base to keep. Default is 1 (=first base).
       [-l N]       = Last base to keep. Default is entire read.
       [-t N]       = Trim N nucleotides from the end of the read.
                      '-t'  can not be used with '-l' and '-f'.
       [-m MINLEN]  = With [-t], discard reads shorter than MINLEN.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

-f keeps the bases starting from position N; -l keeps the bases up to position N, counted from the left; -t removes N bases from the right end.
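
The three options map naturally onto string slicing. The following Python sketch mirrors the semantics described above; the function name `trim` and its parameter names are illustrative assumptions.

```python
def trim(seq, first=1, last=None, trim_end=0):
    """first/last are 1-based positions like -f/-l; trim_end mimics -t."""
    if trim_end and (first != 1 or last is not None):
        raise ValueError("-t cannot be combined with -f or -l")
    if trim_end:
        return seq[:max(len(seq) - trim_end, 0)]
    return seq[first - 1:last]


read = "ACGTACGT"
print(trim(read, first=3))           # GTACGT
print(trim(read, last=5))            # ACGTA
print(trim(read, first=2, last=4))   # CGT
print(trim(read, trim_end=3))        # ACGTA
```

For FASTQ input the same slice would be applied to the quality string so that both lines stay the same length.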

2.7 FASTQ/A Renamer

Renames the sequence identifiers in FASTQ/A file.

$ fastx_renamer -h
    usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
    Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

       [-n TYPE]    = rename type:
              SEQ - use the nucleotides sequence as the name.
              COUNT - use simply counter as the name.
       [-h]         = This helpful help screen.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
$ fastx_renamer -Q 33 -n SEQ -v -i test.fastq -o test.fastq_rename.fastq
Renamed: 9 reads.

That is, the sequence itself is used as the ID on the first line of each FASTQ record.

2.8 FASTQ/A Clipper

Removing sequencing adapters / linkers

$ fastx_clipper -h
    usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]

    version 0.0.7
       [-h]         = This helpful help screen.
       [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
       [-l N]       = discard sequences shorter than N nucleotides. default is 5.
       [-d N]       = Keep the adapter and N bases after it.
                      (using '-d 0' is the same as not using '-d' at all. which is the default).
       [-c]         = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
       [-C]         = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
       [-k]         = Report Adapter-Only sequences.
       [-n]         = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                      If [-o] is specified,  report will be printed to STDOUT.
                      If [-o] is not specified (and output goes to STDOUT),
                      report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-D]         = DEBUG output.
       [-M N]       = require minimum adapter alignment length of N.
                      If less than N nucleotides aligned with the adapter - don't clip it.
                  If less than N nucleotides aligned with the adapter - don't clip it.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
$ fastx_clipper -Q 33 -a TTAAATCC -l -c -v -i test.fastq -o test.fastq_clipperTest_onlyFirstRecordHaveAdapter.fastq
Clipping Adapter: TTAAATCC
Min. Length: 0
Input: 9 reads.
Output: 2 reads.
discarded 0 too-short reads.
discarded 0 adapter-only reads.
discarded 7 N reads.

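Some options do not seem to affect the output here: -c and -C appear to have an effect only together with -v. Note, however, that in the command above -l was given without a value, so the following -c was most likely consumed as the argument of -l (hence "Min. Length: 0" instead of the default 5) and never took effect.

-D is a handy option for inspecting how the adapters were clipped; it is easier to understand by trying it yourself than from a description.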

2.9 FASTQ/A Reverse-Complement

Producing the Reverse-complement of each sequence in a FASTQ/FASTA file.

$ fastx_reverse_complement -h
usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

As the name suggests.
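
For readers who want to see the operation spelled out, here is a minimal Python sketch of reverse-complementing a read. The names `COMPLEMENT` and `reverse_complement` are illustrative; reversing the quality string along with the bases is what keeps each score attached to its base in a FASTQ record.

```python
# Translation table for the complement; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq, qual=None):
    """Return the reverse complement; if qual is given, also reverse it."""
    rc = seq.translate(COMPLEMENT)[::-1]
    if qual is None:
        return rc
    return rc, qual[::-1]


print(reverse_complement("AACGN"))        # NCGTT
print(reverse_complement("ACG", "!5I"))   # ('CGT', 'I5!')
```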

2.10 FASTA Clipping Histogram

Requires two Perl modules: PerlIO::gzip and GD::Graph::bars.

2.11 FASTQ/A Barcode splitter

Splitting FASTQ/FASTA files containing multiple samples

Figure: where is barcode

$ fastx_barcode_splitter.pl -h
Barcode Splitter, by Assaf Gordon (gordon@cshl.edu), 11sep2008

This program reads FASTA/FASTQ file and splits it into several smaller files,
Based on barcode matching.
FASTA/FASTQ data is read from STDIN (format is auto-detected.)
Output files will be writen to disk.
Summary will be printed to STDOUT.

usage: /home/lang/fastx-tool/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] 
         [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]

Arguments:

--bcfile FILE   - Barcodes file name. (see explanation below.)
--prefix PREFIX - File prefix. will be added to the output files. Can be used
          to specify output directories.
--suffix SUFFIX - File suffix (optional). Can be used to specify file
          extensions.
--bol       - Try to match barcodes at the BEGINNING of sequences.
          (What biologists would call the 5' end, and programmers
          would call index 0.)
--eol       - Try to match barcodes at the END of sequences.
          (What biologists would call the 3' end, and programmers
          would call the end of the string.)
          NOTE: one of --bol, --eol must be specified, but not both.
--mismatches N  - Max. number of mismatches allowed. default is 1.
--exact     - Same as '--mismatches 0'. If both --exact and --mismatches 
          are specified, '--exact' takes precedence.
--partial N - Allow partial overlap of barcodes. (see explanation below.)
          (Default is not partial matching)
--quiet     - Don't print counts and summary at the end of the run.
          (Default is to print.)
--debug     - Print lots of useless debug information to STDERR.
--help      - This helpful help screen.

Example (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is 
the barcodes file):

   $ cat s_2_100.txt | /home/lang/fastx-tool/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 \
    --prefix /tmp/bla_ --suffix ".txt"

Barcode file format
-------------------
Barcode files are simple text files. Each line should contain an identifier 
(descriptive name for the barcode), and the barcode itself (A/C/G/T), 
separated by a TAB character. Example:

    #This line is a comment (starts with a 'number' sign)
    BC1 GATCT
    BC2 ATCGT
    BC3 GTGAT
    BC4 TGTCT

For each barcode, a new FASTQ file will be created (with the barcode's 
identifier as part of the file name). Sequences matching the barcode 
will be stored in the appropriate file.

Running the above example (assuming "mybarcodes.txt" contains the above 
barcodes), will create the following files:
    /tmp/bla_BC1.txt
    /tmp/bla_BC2.txt
    /tmp/bla_BC3.txt
    /tmp/bla_BC4.txt
    /tmp/bla_unmatched.txt
The 'unmatched' file will contain all sequences that didn't match any barcode.

Barcode matching
----------------

** Without partial matching:

Count mismatches between the FASTA/Q sequences and the barcodes.
The barcode which matched with the lowest mismatches count (providing the
count is small or equal to '--mismatches N') 'gets' the sequences.

Example (using the above barcodes):
Input Sequence:
    GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG

Matching with '--bol --mismatches 1':
   GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
   GATCT (1 mismatch, BC1)
   ATCGT (4 mismatches, BC2)
   GTGAT (3 mismatches, BC3)
   TGTCT (3 mismatches, BC4)

This sequence will be classified as 'BC1' (it has the lowest mismatch count).
If '--exact' or '--mismatches 0' were specified, this sequence would be 
classified as 'unmatched' (because, although BC1 had the lowest mismatch count,
it is above the maximum allowed mismatches).

Matching with '--eol' (end of line) does the same, but from the other side
of the sequence.

** With partial matching (very similar to indels):

Same as above, with the following addition: barcodes are also checked for
partial overlap (number of allowed non-overlapping bases is '--partial N').

Example:
Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
(Same as above, but note the missing 'G' at the beginning.)

Matching (without partial overlapping) against BC1 yields 4 mismatches:
   ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
   GATCT (4 mismatches)

Partial overlapping would also try the following match:
   -ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
   GATCT (1 mismatch)

Note: scoring counts a missing base as a mismatch, so the final
mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
If running with '--mismatches 2' (meaning allowing upto 2 mismatches) - this 
seqeunce will be classified as BC1.
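
The non-partial matching rule from the help text (count mismatches at the start of the read, pick the barcode with the fewest, reject it if the count exceeds the limit) can be reproduced with a short Python sketch. The function names are illustrative assumptions, and only the --bol case without --partial is covered.

```python
def count_mismatches(barcode, read):
    """Mismatches between a barcode and the start of a read (--bol)."""
    return sum(1 for b, r in zip(barcode, read) if b != r)


def classify(read, barcodes, max_mismatches=1):
    """Return the name of the best-matching barcode, or 'unmatched'."""
    best_name, best = "unmatched", max_mismatches + 1
    for name, bc in barcodes.items():
        m = count_mismatches(bc, read)
        if m < best:
            best_name, best = name, m
    return best_name


barcodes = {"BC1": "GATCT", "BC2": "ATCGT", "BC3": "GTGAT", "BC4": "TGTCT"}
read = "GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"
for name, bc in barcodes.items():
    print(name, count_mismatches(bc, read))      # 1, 4, 3, 3
print(classify(read, barcodes, max_mismatches=1))  # BC1
print(classify(read, barcodes, max_mismatches=0))  # unmatched
```

The mismatch counts agree with the worked example in the help text.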

2.12 FASTA Formatter

Changes the width of sequence lines in a FASTA file

$ fasta_formatter -h
usage: fasta_formatter [-h] [-i INFILE] [-o OUTFILE] [-w N] [-t] [-e]
Part of FASTX Toolkit 0.0.13 by gordon@cshl.edu

   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-w N]       = max. sequence line width for output FASTA file.
                  When ZERO (the default), sequence lines will NOT be wrapped -
                  all nucleotides of each sequences will appear on a single 
                  line (good for scripting).
   [-t]         = Output tabulated format (instead of FASTA format).
                  Sequence-Identifiers will be on first column,
                  Nucleotides will appear on second column (as single line).
   [-e]         = Output empty sequences (default is to discard them).
                  Empty sequences are ones who have only a sequence identifier,
                  but not actual nucleotides.

Input Example:
   >MY-ID
   AAAAAGGGGG
   CCCCCTTTTT
   AGCTN

Output example with unlimited line width [-w 0]:
   >MY-ID
   AAAAAGGGGGCCCCCTTTTTAGCTN

Output example with max. line width=7 [-w 7]:
   >MY-ID
   AAAAAGG
   GGGTTTT
   TCCCCCA
   GCTN

Output example with tabular output [-t]:
   MY-ID    AAAAAGGGGGCCCCCTTTTAGCTN

example of empty sequence:
(will be discarded unless [-e] is used)
  >REGULAR-SEQUENCE-1
  AAAGGGTTTCCC
  >EMPTY-SEQUENCE
  >REGULAR-SEQUENCE-2
  AAGTAGTAGTAGTAGT
  GTATTTTATAT

The help text explains this clearly.

2.13 FASTA Nucleotide Changer

Converts FASTA sequences from/to RNA/DNA

$ fasta_nucleotide_changer -h
usage: fasta_nucleotide_changer [-h] [-z] [-v] [-i INFILE] [-o OUTFILE] [-r] [-d]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-v]         = Verbose mode. Prints a short summary.
                  with [-o], summary is printed to STDOUT.
                  Otherwise, summary is printed to STDERR.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-r]         = DNA-to-RNA mode - change T's into U's.
   [-d]         = RNA-to-DNA mode - change U's into T's.

In short, it converts between U and T in the sequences.

2.14 FASTQ Quality Filter

Filters sequences based on quality

$ fastq_quality_filter -h
    usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

    version 0.0.6
       [-h]         = This helpful help screen.
       [-q N]       = Minimum quality score to keep.
       [-p N]       = Minimum percent of bases that must have [-q] quality.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
       [-v]         = Verbose - report number of sequences.
                      If [-o] is specified,  report will be printed to STDOUT.
                      If [-o] is not specified (and output goes to STDOUT),
                      report will be printed to STDERR.

The test data has 9 records (9×4 lines), two of which contain the lowest-quality character '!'.

$ fastq_quality_filter -i test.fastq -Q 33 -v -q 20 -p 50 -o test.fastq_filterq20p50.fastq
Quality cut-off: 20
Minimum percentage: 50
Input: 9 reads.
Output: 7 reads.
discarded 2 (22%) low-quality reads.
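
How -q and -p interact can be illustrated with a small Python sketch: a read is kept when at least -p percent of its bases have a quality of -q or higher. This follows the help text only; the function name `passes_quality_filter` is an assumption, and the tool's exact rounding is not reproduced.

```python
def passes_quality_filter(qual, min_quality=20, min_percent=50, offset=33):
    """True if at least min_percent of the bases have quality >= min_quality."""
    good = sum(1 for ch in qual if ord(ch) - offset >= min_quality)
    # Compare as integers to avoid floating-point rounding.
    return good * 100 >= min_percent * len(qual)


print(passes_quality_filter("IIII!!!!"))   # True  (4 of 8 bases, exactly 50%)
print(passes_quality_filter("III!!!!!"))   # False (3 of 8 bases, 37.5%)
```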

2.15 FASTQ Quality Trimmer

Trims (cuts) sequences based on quality

$ fastq_quality_trimmer -h
usage: fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-t N]       = Quality threshold - nucleotides with lower 
                  quality will be trimmed (from the end of the sequence).
   [-l N]       = Minimum length - sequences shorter than this (after trimming)
                  will be discarded. Default = 0 = no minimum length. 
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = FASTQ output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.

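From testing, the -t option behaves as follows:

Bases at the end of a FASTQ sequence whose quality is below the threshold are cut off.

For example, if ATGCTGAG has the quality scores 33 22 11 33 21 30 25 23 and you choose a threshold of 30, the tool scans from right to left and trims bases below 30, stopping at the first base that meets the threshold (here the base with quality 30). The trailing bases with qualities 25 and 23 are removed, leaving ATGCTG.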
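
The example can be checked with a short Python sketch. It assumes, based on the help text ("nucleotides with lower quality will be trimmed (from the end of the sequence)"), that trimming proceeds from the 3' end and stops at the first base that meets the threshold; the function name `quality_trim` is illustrative.

```python
def quality_trim(seq, scores, threshold):
    """Drop trailing bases whose quality is below threshold."""
    end = len(seq)
    # Walk back from the 3' end while the last remaining base is too low.
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], scores[:end]


seq, scores = quality_trim("ATGCTGAG", [33, 22, 11, 33, 21, 30, 25, 23], 30)
print(seq, scores)   # ATGCTG [33, 22, 11, 33, 21, 30]
```

Low-quality bases in the middle of the read (22, 11 and 21 here) are untouched, because the scan stops as soon as a base passes.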

2.16 fastq_quality_converter

Personally, I think this tool is of little value.

$ fastq_quality_converter -h
usage: fastq_quality_converter [-h] [-a] [-n] [-z] [-i INFILE] [-f OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-a]         = Output ASCII quality scores (default).
   [-n]         = Output numeric quality scores.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA output file. default is STDOUT.

2.17 FASTQ Masker

Masks nucleotides with ‘N’ (or other character) based on quality

$ fastq_masker -h
usage: fastq_masker [-h] [-v] [-q N] [-r C] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

   [-h]         = This helpful help screen.
   [-q N]       = Quality threshold - nucleotides with lower quality will be masked
                  Default is 10.
   [-r C]       = Replace low-quality nucleotides with character C. Default is 'N'
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTQ input file. default is STDIN.
   [-o OUTFILE] = FASTQ output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.

In other words, the masker replaces every base whose quality score is below the -q threshold with N.
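
A minimal Python sketch of masking, following the -q and -r descriptions in the help text. The function name `mask` and the Phred+33 default offset are assumptions for illustration.

```python
def mask(seq, qual, threshold=10, replacement="N", offset=33):
    """Replace bases whose quality is below threshold with the replacement character."""
    return "".join(
        replacement if ord(q) - offset < threshold else base
        for base, q in zip(seq, qual)
    )


# 'I' decodes to 40, '!' to 0 and '+' to 10 under Phred+33.
print(mask("ACGT", "I!I+"))                     # ANGT
print(mask("ACGT", "!!!!", replacement="."))    # ....
```

Unlike the trimmer and the filter, masking keeps the read length unchanged.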

Some of the Chinese annotations were taken from: http://blog.sciencenet.cn/blog-1509670-848270.html
