study the Bedtools 抛砖引玉   

study the Bedtools 抛砖引玉

Bedtools 入门,之后就可以看官方说明文档了

06 Aug 2016

Go back studyBedtools

how to explore, process and manipulate genomic interval files (BED, GFF/GTF, VCF, and SAM/BAM) with the bedtools software package

官方自述的评价:

Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks.

本文根据官方的说明文档(Toturial)《Bedtools Documentation》 编写而成。

Version: bedtools2.26.0

安装

# 官网下载安装文件
$ tar -zxvf bedtools-2.25.0.tar.gz
$ cd bedtools2
$ make

# 编译好之后将可执行文件加入到环境变量中

基础知识:

Genome feature

  • genome features: 功能元素(比如常规蛋白质编码gene), 遗传多态性 (SNPs, INDELs, or structural variants),已经由测序或者其他方法得到的注释信息,也可以是自定义的一些特征信息。 >Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g. SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that an individual lab or researcher defines (e.g., my novel gene or variant).

  • genome features的基本信息: 染色体或者scaffold的位置, 起始位置,终止位置,哪条链,feature的name

The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “-“), and the name of the feature if one is applicable.

  • Overlapping / intersecting features: 两个genome features的区域至少有一个bp的共同片段

BED和GFF文件的一个差异

BED文件中起始坐标为0,结束坐标至少是1,; GFF中起始坐标是1而结束坐标至少是1。

格式说明:

    1. BED format

BEDTools主要使用BED格式的前三列,BED可以最多有12列。BED格式的常用列描述如下:

* chrom: 染色体信息, 如chr1, III, myCHrom, contig1112.23, 必须注明。
* start: genome feature的起始位点,从0开始, 必须有
* end: genome feature的终止位点,至少为1, 必须有
* name: genome feature的官方名称或者自定义的一个名字
* score: 可以是p值等等一些可以刻量化的数值信息
* strands: 正反链信息
    1. SAM/BAM 还有GFF 暂时先不说了

示例文件以及本教程来源

说明:本教程按照官方说明文档1和自带的tutial2进行学习

本教程的测试文件前四个(cpg.bed、exons.bed、gwas.bed和hesc.chromHmm.bed)来自于bedtools.html网页说明文件中的四个文件

从UCSC上得到注释文件

从UCSC上得到注释文件

但是无法找到全部一样的。与教程上的有一些出入!好,下面可以来看一下所拥有的BED文件:

ls
## exon.bed
## gene.bed
## getTestData.PNG
## gwas.bed
## snp.bed
## studyBedtools_cache
## studyBedtools.html
## studyBedtools.Rmd

我们暂时就用这四个BED文件来做测试,利用Rmarkdown来进行学习。

Subprogram: 类似于samtools我们来看看Bedtools来看看它具体有哪些功能:

~/Bedtools/bedtools2/bin/bedtools
## bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
## usage:    bedtools <subcommand> [options]
## 
## The bedtools sub-commands include:
## 
## [ Genome arithmetic ]
##     intersect     Find overlapping intervals in various ways.
##     window        Find overlapping intervals within a window around an interval.
##     closest       Find the closest, potentially non-overlapping interval.
##     coverage      Compute the coverage over defined intervals.
##     map           Apply a function to a column for each overlapping interval.
##     genomecov     Compute the coverage over an entire genome.
##     merge         Combine overlapping/nearby intervals into a single interval.
##     cluster       Cluster (but don't merge) overlapping/nearby intervals.
##     complement    Extract intervals _not_ represented by an interval file.
##     shift         Adjust the position of intervals.
##     subtract      Remove intervals based on overlaps b/w two files.
##     slop          Adjust the size of intervals.
##     flank         Create new intervals from the flanks of existing intervals.
##     sort          Order the intervals in a file.
##     random        Generate random intervals in a genome.
##     shuffle       Randomly redistrubute intervals in a genome.
##     sample        Sample random records from file using reservoir sampling.
##     spacing       Report the gap lengths between intervals in a file.
##     annotate      Annotate coverage of features from multiple files.
## 
## [ Multi-way file comparisons ]
##     multiinter    Identifies common intervals among multiple interval files.
##     unionbedg     Combines coverage intervals from multiple BEDGRAPH files.
## 
## [ Paired-end manipulation ]
##     pairtobed     Find pairs that overlap intervals in various ways.
##     pairtopair    Find pairs that overlap other pairs in various ways.
## 
## [ Format conversion ]
##     bamtobed      Convert BAM alignments to BED (& other) formats.
##     bedtobam      Convert intervals to BAM records.
##     bamtofastq    Convert BAM records to FASTQ records.
##     bedpetobam    Convert BEDPE intervals to BAM records.
##     bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.
## 
## [ Fasta manipulation ]
##     getfasta      Use intervals to extract sequences from a FASTA file.
##     maskfasta     Use intervals to mask sequences from a FASTA file.
##     nuc           Profile the nucleotide content of intervals in a FASTA file.
## 
## [ BAM focused tools ]
##     multicov      Counts coverage from multiple BAMs at specific intervals.
##     tag           Tag BAM alignments based on overlaps with interval files.
## 
## [ Statistical relationships ]
##     jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
##     reldist       Calculate the distribution of relative distances b/w two files.
##     fisher        Calculate Fisher statistic b/w two feature files.
## 
## [ Miscellaneous tools ]
##     overlap       Computes the amount of overlap from two intervals.
##     igv           Create an IGV snapshot batch script.
##     links         Create a HTML page of links to UCSC locations.
##     makewindows   Make interval "windows" across a genome.
##     groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
##     expand        Replicate lines based on lists of values in columns.
##     split         Split a file into multiple files with equal records or base pairs.
## 
## [ General help ]
##     --help        Print this help menu.
##     --version     What version of bedtools are you using?.
##     --contact     Feature requests, bugs, mailing lists, etc.

1. bedtools “intersect”

The intersect command is the workhorse of the bedtools suite. It compares two or more BED/BAM/VCF/GFF files and identifies all the regions in the gemome where the features in the two files overlap (that is, share at least one base pair in common).

1.1 Default behavior:取享有公共区段的交集

~/Bedtools/bedtools2/bin/bedtools intersect -a snp.bed -b exon.bed|/usr/bin/head -5
## chrX 17036   17037   rs372225483 0   +
## chrX 17036   17037   rs372225483 0   +
## chrX 17053   17054   rs771009492 0   +
## chrX 17053   17054   rs771009492 0   +
## chrX 27705   27706   rs753268402 0   +

但是:这里显示的已经不是原来a b两个文件的交集(original interval)了

1.2 Reporting the original feature in each file.

The -wa (write A) and -wb (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you where the intersections occurred, it shows you what intersected.

加上 -wa 和 -wb 参数可以显示原来的情况

~/Bedtools/bedtools2/bin/bedtools intersect -a snp.bed -b exon.bed -wa -wb|/usr/bin/head -5
## chrX 17036   17037   rs372225483 0   +   chrX    16847   17285   AI080143    0   +   16847   17285   0   1   438,    0,
## chrX 17036   17037   rs372225483 0   +   chrX    16847   17386   BX091087    0   -   16847   17386   0   1   539,    0,
## chrX 17053   17054   rs771009492 0   +   chrX    16847   17285   AI080143    0   +   16847   17285   0   1   438,    0,
## chrX 17053   17054   rs771009492 0   +   chrX    16847   17386   BX091087    0   -   16847   17386   0   1   539,    0,
## chrX 27705   27706   rs753268402 0   +   chrX    27334   30828   CV309219    0   +   27334   30828   0   3   171,130,21, 0,3342,3473,

1.3 How many base pairs of overlap were there?

加上 -wo 参数

The -wo (write overlap) option allows one to also report the number of base pairs of overlap between the features that overlap between each of the files.

~/Bedtools/bedtools2/bin/bedtools intersect -a snp.bed -b exon.bed -wo|/usr/bin/head -5
## chrX 17036   17037   rs372225483 0   +   chrX    16847   17285   AI080143    0   +   16847   17285   0   1   438,    0,  1
## chrX 17036   17037   rs372225483 0   +   chrX    16847   17386   BX091087    0   -   16847   17386   0   1   539,    0,  1
## chrX 17053   17054   rs771009492 0   +   chrX    16847   17285   AI080143    0   +   16847   17285   0   1   438,    0,  1
## chrX 17053   17054   rs771009492 0   +   chrX    16847   17386   BX091087    0   -   16847   17386   0   1   539,    0,  1
## chrX 27705   27706   rs753268402 0   +   chrX    27334   30828   CV309219    0   +   27334   30828   0   3   171,130,21, 0,3342,3473,    1

以上由在linux中的Rstudio server利用远程网页登录编辑而成,R studio中的内容可以直接进行阅读。

这样就能非常方便地理解原来的tutial中的内容了。


  1. 见bedtools.pdf 说明文件!

  2. 见bedtools.html网页说明文件!

Go back