Featured image of post Remove all features from A that intersect/overlap with those in B

Remove all features from A that intersect/overlap with those in B

I have a bed file that containing all Isoseq reads and I would like to remove those overlapped with multiple genes in the annotation file.

The bed file looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
SM_V7_1	323629	324345	SM_V7_1#2	40	+
SM_V7_1	1215610	1253189	SM_V7_1#3	40	+
SM_V7_1	1215631	1253205	SM_V7_1#4	40	+
SM_V7_1	1220420	1253192	SM_V7_1#5	40	+
SM_V7_1	1220506	1253206	SM_V7_1#6	40	+
SM_V7_1	1220507	1245199	SM_V7_1#7	40	+
SM_V7_1	1220507	1253205	SM_V7_1#8	40	+
SM_V7_1	1220507	1253205	SM_V7_1#9	40	+
SM_V7_1	1220507	1253205	SM_V7_1#10	40	+
...
SM_V7_1	3867141	4060784	SM_V7_1#1107	40	-
SM_V7_1	3867141	4060784	SM_V7_1#1152	40	-
SM_V7_1	3867143	4060784	SM_V7_1#1309	40	-
SM_V7_1	3867151	4060784	SM_V7_1#1398	40	-
SM_V7_1	3867245	4060765	SM_V7_1#1454	40	-
...

Note that there are quite many duplicate features from different reads. I don’t want to remove the duplicate features at this step because I need the information about counts in the late merge step. For now I would like to remove those extra long reads, which normally mapped to several genes in the annotation, eg. #1107 to #1454. I have tried several tools:

gffcompare

It’s one of my favourite tools to find overlaps between two annotation files. So I did:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
awk '{print $1, "isoseq", "gene", $2, $3, $5, $6, ".", "ID=" $4}' prefiltered_mergedIsoseq.bed| tr ' ' '\t' > Filter_isoseq.gff
~/Tools/gffcompare/gffcompare -r Filter_isoseq.gff -T -o Filter_v7.3 ../../Smv7.3/Sm_v7.3_Basic-noGaps.gff
grep '#' Filter_v7.3.annotated.gtf | awk '{print $14 "\t" $12}'| tr -d \" | tr -d \; | sort -u | awk '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p,x); s=s $0} END{print s}' | awk '$3>0'|tr '\t' ','| sed 's/,/ /' | head

Which shows me the isoseq reads that overlapped with multiple genes:

~~~~~~
SM_V7_1#10831 Smp_167310,Smp_325900
SM_V7_1#11383 Smp_173250,Smp_173280
SM_V7_1#1152 Smp_067060,Smp_103610,Smp_179960,Smp_326440,Smp_333220,Smp_333870,Smp_333930,Smp_334070,Smp_334170,Smp_334240
SM_V7_1#11806 Smp_141290,Smp_141390
~~~~~~

We can see that #1152 was spotted but the others were not. Even with *-A* option I couldn't find any information about eg. #1107

### bedtools

There are two functions in bedtools: intersect and overlap

#### intersect

```bash
bedtools intersect -wo -a prefiltered_mergedIsoseq.bed -b Sm_v7.3.bed

Which nicely gave all reads mentioned above. Note that the last column is the number of base pairs of overlap between the two features.

1
2
3
4
5
SM_V7_1	3867141	4060784	SM_V7_1#1107	40	-	SM_V7_1	3867138	3869857	Smp_103610	.	-	2716
SM_V7_1	3867141	4060784	SM_V7_1#1107	40	-	SM_V7_1	3887599	3889162	Smp_333870	.	+	1563
SM_V7_1	3867141	4060784	SM_V7_1#1107	40	-	SM_V7_1	3891316	3891601	Smp_179960	.	-	285
SM_V7_1	3867141	4060784	SM_V7_1#1107	40	-	SM_V7_1	3908953	3911250	Smp_333930	.	-	2297
SM_V7_1	3867141	4060784	SM_V7_1#1107	40	-	SM_V7_1	3925420	3927716	Smp_333220	.	-	2296

overlap

1
windowBed -a prefiltered_mergedIsoseq.bed -b Sm_v7.3.bed | bedtools overlap -i stdin -cols 2,3,8,9

gave nearly the same results as above, except that it also showed the distance between nearby features on the opposite strands. For instance:

SM_V7_1 23952758 23986080 SM_V7_1#12130 40 + SM_V7_1 23986434 23992052 Smp_035480 . - -354

overlap-dist

comments powered by Disqus
CC-BY-NC 4.0
Built with Hugo Theme Stack