Bedtools 入门,之后就可以看官方说明文档了
06 Aug 2016
Go backhow to explore, process and manipulate genomic interval files (BED, GFF/GTF, VCF, and SAM/BAM) with the bedtools software package
官方自述的评价:
Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks.
本文根据官方的说明文档(Toturial)《Bedtools Documentation》 编写而成。
# 官网下载安装文件
$ tar -zxvf bedtools-2.25.0.tar.gz
$ cd bedtools2
$ make
# 编译好之后将可执行文件加入到环境变量中
genome features: 功能元素(比如常规蛋白质编码gene), 遗传多态性 (SNPs, INDELs, or structural variants),已经由测序或者其他方法得到的注释信息,也可以是自定义的一些特征信息。 >Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g. SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that an individual lab or researcher defines (e.g., my novel gene or variant).
genome features的基本信息: 染色体或者scaffold的位置, 起始位置,终止位置,哪条链,feature的name
The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “-“), and the name of the feature if one is applicable.
BED文件中起始坐标为0,结束坐标至少是1,; GFF中起始坐标是1而结束坐标至少是1。
BEDTools主要使用BED格式的前三列,BED可以最多有12列。BED格式的常用列描述如下:
* chrom: 染色体信息, 如chr1, III, myCHrom, contig1112.23, 必须注明。
* start: genome feature的起始位点,从0开始, 必须有
* end: genome feature的终止位点,至少为1, 必须有
* name: genome feature的官方名称或者自定义的一个名字
* score: 可以是p值等等一些可以刻量化的数值信息
* strands: 正反链信息
说明:本教程按照官方说明文档1和自带的tutial2进行学习
本教程的测试文件前四个(cpg.bed、exons.bed、gwas.bed和hesc.chromHmm.bed)来自于bedtools.html网页说明文件中的四个文件
从UCSC上得到注释文件
但是无法找到全部一样的。与教程上的有一些出入!好,下面可以来看一下所拥有的BED文件:
ls
## exon.bed
## gene.bed
## getTestData.PNG
## gwas.bed
## snp.bed
## studyBedtools_cache
## studyBedtools.html
## studyBedtools.Rmd
我们暂时就用这四个BED文件来做测试,利用Rmarkdown来进行学习。
~/Bedtools/bedtools2/bin/bedtools
## bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
## usage: bedtools <subcommand> [options]
##
## The bedtools sub-commands include:
##
## [ Genome arithmetic ]
## intersect Find overlapping intervals in various ways.
## window Find overlapping intervals within a window around an interval.
## closest Find the closest, potentially non-overlapping interval.
## coverage Compute the coverage over defined intervals.
## map Apply a function to a column for each overlapping interval.
## genomecov Compute the coverage over an entire genome.
## merge Combine overlapping/nearby intervals into a single interval.
## cluster Cluster (but don't merge) overlapping/nearby intervals.
## complement Extract intervals _not_ represented by an interval file.
## shift Adjust the position of intervals.
## subtract Remove intervals based on overlaps b/w two files.
## slop Adjust the size of intervals.
## flank Create new intervals from the flanks of existing intervals.
## sort Order the intervals in a file.
## random Generate random intervals in a genome.
## shuffle Randomly redistrubute intervals in a genome.
## sample Sample random records from file using reservoir sampling.
## spacing Report the gap lengths between intervals in a file.
## annotate Annotate coverage of features from multiple files.
##
## [ Multi-way file comparisons ]
## multiinter Identifies common intervals among multiple interval files.
## unionbedg Combines coverage intervals from multiple BEDGRAPH files.
##
## [ Paired-end manipulation ]
## pairtobed Find pairs that overlap intervals in various ways.
## pairtopair Find pairs that overlap other pairs in various ways.
##
## [ Format conversion ]
## bamtobed Convert BAM alignments to BED (& other) formats.
## bedtobam Convert intervals to BAM records.
## bamtofastq Convert BAM records to FASTQ records.
## bedpetobam Convert BEDPE intervals to BAM records.
## bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
##
## [ Fasta manipulation ]
## getfasta Use intervals to extract sequences from a FASTA file.
## maskfasta Use intervals to mask sequences from a FASTA file.
## nuc Profile the nucleotide content of intervals in a FASTA file.
##
## [ BAM focused tools ]
## multicov Counts coverage from multiple BAMs at specific intervals.
## tag Tag BAM alignments based on overlaps with interval files.
##
## [ Statistical relationships ]
## jaccard Calculate the Jaccard statistic b/w two sets of intervals.
## reldist Calculate the distribution of relative distances b/w two files.
## fisher Calculate Fisher statistic b/w two feature files.
##
## [ Miscellaneous tools ]
## overlap Computes the amount of overlap from two intervals.
## igv Create an IGV snapshot batch script.
## links Create a HTML page of links to UCSC locations.
## makewindows Make interval "windows" across a genome.
## groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
## expand Replicate lines based on lists of values in columns.
## split Split a file into multiple files with equal records or base pairs.
##
## [ General help ]
## --help Print this help menu.
## --version What version of bedtools are you using?.
## --contact Feature requests, bugs, mailing lists, etc.
The intersect command is the workhorse of the bedtools suite. It compares two or more BED/BAM/VCF/GFF files and identifies all the regions in the gemome where the features in the two files overlap (that is, share at least one base pair in common).
~/Bedtools/bedtools2/bin/bedtools intersect -a snp.bed -b exon.bed|/usr/bin/head -5
## chrX 17036 17037 rs372225483 0 +
## chrX 17036 17037 rs372225483 0 +
## chrX 17053 17054 rs771009492 0 +
## chrX 17053 17054 rs771009492 0 +
## chrX 27705 27706 rs753268402 0 +
但是:这里显示的已经不是原来a b两个文件的交集(original interval)了
The -wa (write A) and -wb (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you where the intersections occurred, it shows you what intersected.
加上 -wa 和 -wb 参数可以显示原来的情况
~/Bedtools/bedtools2/bin/bedtools intersect -a snp.bed -b exon.bed -wa -wb|/usr/bin/head -5
## chrX 17036 17037 rs372225483 0 + chrX 16847 17285 AI080143 0 + 16847 17285 0 1 438, 0,
## chrX 17036 17037 rs372225483 0 + chrX 16847 17386 BX091087 0 - 16847 17386 0 1 539, 0,
## chrX 17053 17054 rs771009492 0 + chrX 16847 17285 AI080143 0 + 16847 17285 0 1 438, 0,
## chrX 17053 17054 rs771009492 0 + chrX 16847 17386 BX091087 0 - 16847 17386 0 1 539, 0,
## chrX 27705 27706 rs753268402 0 + chrX 27334 30828 CV309219 0 + 27334 30828 0 3 171,130,21, 0,3342,3473,
加上 -wo 参数
The -wo (write overlap) option allows one to also report the number of base pairs of overlap between the features that overlap between each of the files.
~/Bedtools/bedtools2/bin/bedtools intersect -a snp.bed -b exon.bed -wo|/usr/bin/head -5
## chrX 17036 17037 rs372225483 0 + chrX 16847 17285 AI080143 0 + 16847 17285 0 1 438, 0, 1
## chrX 17036 17037 rs372225483 0 + chrX 16847 17386 BX091087 0 - 16847 17386 0 1 539, 0, 1
## chrX 17053 17054 rs771009492 0 + chrX 16847 17285 AI080143 0 + 16847 17285 0 1 438, 0, 1
## chrX 17053 17054 rs771009492 0 + chrX 16847 17386 BX091087 0 - 16847 17386 0 1 539, 0, 1
## chrX 27705 27706 rs753268402 0 + chrX 27334 30828 CV309219 0 + 27334 30828 0 3 171,130,21, 0,3342,3473, 1
以上由在linux中的Rstudio server利用远程网页登录编辑而成,R studio中的内容可以直接进行阅读。