Pairwise sequence alignment using associative array

Using bash associative array to make pairwise alignment of sequences

When making annotation transfers using RATT, you will get a report about the general transfer stats, e.g., number of gene models being correctly transferred, and those not transferred.

But you might want to know about more details. One simple way is to align the sequences before and after transfer using pairwise sequence alignmnet. The Associative Array in Bash can simply perform this job with just the following files:

  • a list containing id pairs (e.g., id-pairs.txt)
  • single aa/nt sequence files for each gene

And the bash script could look like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
#!/usr/local/bin/bash
declare -A idpairs # associative array

while IFS=\| read id1 id2  # columns are seperated by pipe |; such as "Smp_070360|g1258.t1"
do
    idpairs[$id1]=$id2
done < id-pairs.txt

for id1 in ${!idpairs[*]}
do  # make directories with single fa files, and another for the blast output
    blastp -query query-fa/${idpairs[$id1]}.fa -subject subj-fa/$id2.fa -out blast-out/${idpairs[$id1]}-$id1.txt
done

Once you have finished the blast job, you need to parse the results and get top hits. I used the Perl script ncbiblastparser.pl, which works quite well.

comments powered by Disqus
CC-BY-NC 4.0
Built with Hugo Theme Stack