sequence

Manipulation of FASTA/Q format files

I recommend my toolkit SeqKit, a cross-platform and efficient toolkit for FASTA/Q file manipulation, which integrates most of the functions provided by these scripts.

FASTA

fasta2tab and tab2fasta

fasta2tab and tab2fasta are used as a pair. fasta2tab transforms FASTA format into a two-column table: the first column is the header and the second is the sequence. It can also compute the reverse complement of a sequence and remove gaps. Sequence length and GC content can be output as additional columns, which can be used for filtering and sorting. tab2fasta simply transforms the table back to FASTA format. Combined with shell tools like awk and sed, they make it easy to filter and sort FASTA files.
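The round trip between FASTA and a header/sequence table can be sketched in Python as follows. This is only an illustration of the idea, not the actual implementation of fasta2tab or tab2fasta; the function names `fasta_to_rows` and `rows_to_fasta` are hypothetical, and the sketch assumes plain FASTA input.

```python
from typing import Iterable, Iterator, List, Tuple


def fasta_to_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA lines (hypothetical helper)."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def rows_to_fasta(rows: Iterable[Tuple[str, str]], width: int = 70) -> str:
    """Format (header, sequence) pairs as FASTA, wrapping sequences at `width`."""
    out: List[str] = []
    for header, seq in rows:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


fasta = ">a desc\nACGT\nAC\n>b\nGGGGGGGG\n"
rows = list(fasta_to_rows(fasta.splitlines()))
# Sort by sequence length, longest first, then write FASTA again.
rows.sort(key=lambda row: len(row[1]), reverse=True)
print(rows_to_fasta(rows, width=4), end="")
```

Once every record sits on one row, sorting and filtering become ordinary list operations, which is the same reason the table form works well with awk and sort.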

Examples

1. sort FASTA by sequence length

cat seq.fa | fasta2tab -t -l | sort -r -t"`echo -e '\t'`" -n -k3,3 \
| tab2fasta -l 70 > seq.sorted.fa

2. extract sub sequence

fasta2tab -t -sub 3,10 -rc seq.fa | tab2fasta

3. extract sequences of at least 1000 bp

cat seq.fa | fasta2tab -t -l | awk -F'\t' '$3 >= 1000' | tab2fasta -l 70

4. extract aligned sequences whose original sequence is at least 1000 bp long

cat seq.fa | fasta2tab -l2 | awk -F'\t' '$3 >= 1000' | tab2fasta -l 70

5. reverse complement sequence, uppercase, and trim gaps

zcat seq.fa.gz | fasta2tab -uc -rc -t | tab2fasta

fasta_extract_by_pattern.pl

fasta_extract_by_pattern.pl extracts FASTA sequences by header or by sequence; both exact matching and regular expression matching are supported. Query patterns can be read from files, and the result can easily be negated. Most importantly, it can read from STDIN.

Combining fasta2tab and tab2fasta with csv_grep provides the same function.

Examples

1. sequences WITH "bacteria" in header

fasta_extract_by_pattern.pl -r -p Bacteria *.fa > result.fa

2. sequences WITHOUT "bacteria" in header

fasta_extract_by_pattern.pl -r -n -p Bacteria seq1.fa seq2.fa > result.fa

3. sequences with TTSAA (AgsI digest site) in SEQUENCE. Base S stands for C or G.

fasta_extract_by_pattern.pl -r -s -p 'TT[CG]AA' seq.fa > result.fa
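Degenerate IUPAC codes such as S are not understood by a regular expression engine, so they have to be written out as character classes. A minimal Python sketch of that expansion follows. The `IUPAC` table holds the standard nucleotide ambiguity codes; the function name `iupac_to_regex` is hypothetical and is not part of fasta_extract_by_pattern.pl.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif: str) -> str:
    """Expand a degenerate motif into a regular expression (hypothetical helper)."""
    parts = []
    for base in motif.upper():
        choices = IUPAC[base]  # raises KeyError for non-IUPAC characters
        # Unambiguous bases stay literal; ambiguous ones become a character class.
        parts.append(choices if len(choices) == 1 else "[" + choices + "]")
    return "".join(parts)


pattern = re.compile(iupac_to_regex("TTSAA"), re.IGNORECASE)
print(pattern.pattern)                     # TT[CG]AA
print(bool(pattern.search("ggattcaacc")))  # True
```

Inside a character class the alternatives are simply listed, without a `|` separator.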

4. sequences (read from STDIN) with headers that match any pattern in a list file

zcat seq.fa.gz | fasta_extract_by_pattern.pl -pf name_list.txt > result.fa

fasta_common_seqs.pl

fasta_common_seqs.pl finds common sequences in multiple files. It supports comparison by header or by sequence. By storing the MD5 digests of sequences rather than the sequences themselves, it keeps memory usage low. It can also be used to remove duplicated records, by finding the common sequences between a file and its copy or soft link.
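The digest-based comparison can be sketched in Python as below. This is an illustration of the technique under stated assumptions, not the script's actual code: the names `seq_digest` and `common_digests` are hypothetical, records are assumed to be already parsed into (header, sequence) pairs, and comparing sequences case-insensitively is a choice made for this sketch.

```python
import hashlib
from typing import Iterable, Optional, Set, Tuple

Record = Tuple[str, str]  # (header, sequence)


def seq_digest(seq: str) -> str:
    """MD5 hex digest of the uppercased sequence (case folding is an assumption)."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


def common_digests(files: Iterable[Iterable[Record]]) -> Set[str]:
    """Return digests of sequences that occur in every file."""
    common: Optional[Set[str]] = None
    for records in files:
        # Only fixed-size digests are kept in memory, not the sequences.
        digests = {seq_digest(seq) for _, seq in records}
        common = digests if common is None else common & digests
    return common or set()


file1 = [("a", "ACGT"), ("b", "TTTT")]
file2 = [("x", "acgt"), ("y", "GGGG")]
shared = common_digests([file1, file2])
# Report the records of the first file whose sequence is shared by all files.
print([header for header, seq in file1 if seq_digest(seq) in shared])  # ['a']
```

Each digest has a fixed size regardless of sequence length, so the memory needed grows with the number of records rather than with the total number of bases.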

fasta_remove_duplicates.pl

fasta_remove_duplicates.pl removes duplicated records from a file or STDIN, by both sequence and header.

fasta_locate_motif.pl

fasta_locate_motif.pl finds the locations of restriction enzyme recognition sites or other motifs.
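Locating a motif comes down to reporting the coordinates of every regular expression match. The Python sketch below shows one way to do it, including overlapping hits. The function name `locate_motif` is hypothetical, only the forward strand is searched, and the 1-based inclusive coordinates are a convention chosen for this sketch rather than the documented output of fasta_locate_motif.pl.

```python
import re
from typing import List, Tuple


def locate_motif(seq: str, motif_regex: str) -> List[Tuple[int, int, str]]:
    """Return 1-based (start, end, match) for every hit, overlapping ones included."""
    # Wrapping the motif in a lookahead with a capture group lets hits overlap,
    # because the lookahead itself consumes no characters.
    pattern = re.compile("(?=(" + motif_regex + "))", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(seq):
        hits.append((m.start(1) + 1, m.end(1), m.group(1)))
    return hits


# GAATTC is the EcoRI recognition site.
print(locate_motif("GAATTCAAGAATTC", "GAATTC"))
# [(1, 6, 'GAATTC'), (9, 14, 'GAATTC')]
```

A plain `finditer` on the motif alone would skip hits that overlap an earlier match, which matters for short or repetitive motifs.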

fasta_gc_skew.py and fasta_gc_skew.plot.R

Sample output:

FASTQ

fastq2tab and tab2fastq

fastq2tab and tab2fastq are similar to fasta2tab and tab2fasta. They can be used to filter FASTQ files with the help of csv_grep.

Example: removing contaminating reads

zcat reads.fq.gz                                \
   | fastq2tab                                  \
   | csv_grep -t -pf <(cat idlist) -i -d        \
   | tab2fastq                                  \
   | gzip -c                                    \
   > reads2.fq.gz
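The filtering step of this pipeline amounts to dropping every read whose ID appears in a list. A minimal Python sketch of that logic is shown below. It is not how fastq2tab or csv_grep are implemented; the names `read_fastq` and `drop_reads` are hypothetical, records are assumed to span exactly four lines, and the read ID is assumed to be the header text up to the first space.

```python
from typing import Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (header without '@', sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Parse four-line FASTQ records; a truncated trailing record is ignored."""
    it = (line.rstrip("\n") for line in lines if line.strip())
    # Pulling from the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header[1:], seq, qual


def drop_reads(records: Iterable[FastqRecord], bad_ids: Set[str]) -> Iterator[FastqRecord]:
    """Yield only the records whose read ID is not in `bad_ids`."""
    for header, seq, qual in records:
        read_id = header.partition(" ")[0]
        if read_id not in bad_ids:
            yield header, seq, qual


fastq = "@r1 extra\nACGT\n+\nIIII\n@r2\nGGCC\n+\nFFFF\n"
contaminants = {"r1"}  # in practice, read from the ID list file
for header, seq, qual in drop_reads(read_fastq(fastq.splitlines()), contaminants):
    print("@" + header, seq, "+", qual, sep="\n")
```

Keeping the ID list in a set makes each lookup constant time, so the reads can be streamed once without loading the whole file.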

