Sort Fasta By Sequence Size

April 22, 2024 Post a Comment

I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic): &g

Solution 1:

Seems like opening two files for each sequence is probably contibuting to a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:

subprocess.call(["samtools", "faidx", args.i])
withopen(args.fai) asref:

    for line inref:

        spt = line.split()
        id_ = spt[0]
        subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)

Solution 2:

What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use index file, but nevertheless may run faster.

csplit -z -f tmp_fasta_file_ $1'/>/''{*}'for file in tmp_fasta_file_*
do
  TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
  FASTA_WC+=$(echo"$file$TMP_FASTA_WC\n")
donefor filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" "'{print $1}')
docat"$filename" >> $2donerm tmp_fasta_file*

First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta

Solution 3:

Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:

cat test.fasta \
    | tee >(wc -l > nb_lines_fasta.txt) \
    | bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
    | tee >(wc -l > nb_lines_fastq.txt) \
    | fastq-sort -L \
    | tee >(wc -l > nb_lines_fastq_sorted.txt) \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    | tee >(wc -l > nb_lines_fasta_sorted.txt) \
    > test_sorted.fasta

The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.

The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.

In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.

The resulting fasta file will have the shorter sequences first. To get the longest sequences at the top of the file, add the -r option to fastq-sort.

Baca Juga

Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.

Edit

I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:

cat test.fasta \
    | bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
    | fastq-sort -L \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    > test_sorted.fasta

This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.

Solution 4:

You can also do it very conveniently with awk, check the code below:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2-| tr "\t""\n"

This and other methods have been posted in Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend this community for questions such like this one.

Learn Python Tutorials