bug#22001: Is it possible to tab separate concatenated files?
From: Assaf Gordon
Subject: bug#22001: Is it possible to tab separate concatenated files?
Date: Mon, 23 Nov 2015 18:47:49 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
Hello Kim,
On 11/23/2015 06:09 PM, Bob Proulx wrote:
> Macdonald, Kim - BCCDC wrote:
>> For Example:
>> Concatenate the files like so:
>> gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
>> gi|452742846|ref|NZ_CAFD010000002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
>> gi|452742846|ref|NZ_CAFD010000003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG
> That example shows a completely different problem. It shows that your
> input plain text files have no terminating newline, making them
> officially not plain text files but binary files.
Based on the content of your files, I'm guessing that you are working with
a mangled FASTA file.
In that case, it is possible that fixing the original files might be more
efficient than trying to amend them later on.
The original FASTA files likely looked like so:
>gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequence
TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
And I'm also guessing that with some script you've removed the ">" prefix and
joined the two lines into one.
First,
I suggest ensuring the original files have Unix-style newlines (LF) and not
Windows-style (CR-LF) or Mac-style (CR) ones.
The programs 'dos2unix' and 'mac2unix' can fix this:
simply run them on each file, and they will convert it in place.
I would also recommend ensuring each file ends with a newline.
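One possible way to do the trailing-newline check is sketched below. The function name 'ensure_trailing_newline' is made up for this illustration, and the sketch assumes the files already use LF line endings (i.e. dos2unix/mac2unix has been run). It modifies files in place, so keep a backup.

```sh
# Hypothetical helper (name invented for this sketch): append a newline
# to each given file whose last byte is not already a newline.
ensure_trailing_newline() {
    for f in "$@"; do
        # 'tail -c 1' prints the last byte of the file. Command substitution
        # strips a trailing newline, so a non-empty result means the last
        # byte is something other than a newline. Empty files are skipped.
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$f"
        fi
    done
}
```

It could then be run as: ensure_trailing_newline *.fa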
Second,
The FASTA id (the long text before your nucleotide sequence) contains spaces,
and this will make downstream processing a bit of a pain.
I would recommend trimming the FASTA identifier and keeping only the first part
(since it contains your IDs, you should have no problem
recovering the organism name later).
Example:
$ cat 1.fa
>gi|452742846|ref|NZ_CAFD010000001.1| Salmonella enterica subsp., whole genome shotgun sequence
TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
$ sed '/^>/s/ .*$//' 1.fa > 2.fa
$ cat 2.fa
>gi|452742846|ref|NZ_CAFD010000001.1|
TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
Or do it in place for all your FASTA files (be sure to have a backup first):
for i in *.fa ; do sed -i '/^>/s/ .*$//' "$i" ; done
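If you want to be able to recover the organism names later, it may help to save the ID-to-description mapping before trimming. The following is only a sketch: 'fasta_descriptions' is a name invented here, and it assumes the ID is everything up to the first space on the header line, matching the sed command above.

```sh
# Hypothetical helper (name invented for this sketch): for every FASTA
# header line, print "ID<TAB>description".
fasta_descriptions() {
    awk '/^>/ {
        id = $0
        sub(/^>/, "", id)        # drop the leading ">"
        sub(/ .*$/, "", id)      # keep only the text before the first space
        desc = $0
        sub(/^[^ ]* */, "", desc) # drop the ID and the spaces after it
        print id "\t" desc
    }' "$@"
}
```

Run it on the untrimmed files, for example: fasta_descriptions *.fa > descriptions.txt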
Third,
To combine the files and convert them into a table (1st column = ID, 2nd
column = sequence), assuming all your sequences are short and each fits on
one line, the following would work:
$ cat 2.fa
>gi|452742846|ref|NZ_CAFD010000001.1|
TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
$ cat 3.fa
>gi|452742846|ref|NZ_CAFD010000002.1|
CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
$ cat *.fa | paste - - | sed 's/^>//' > final.txt
$ cat final.txt
gi|452742846|ref|NZ_CAFD010000001.1|	TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
gi|452742846|ref|NZ_CAFD010000002.1|	CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
The resulting 'final.txt' is an easy-to-work-with tabular file.
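Since 'paste - -' simply pairs consecutive lines, it only gives the right table when every file strictly alternates between a header line and a single sequence line. A rough way to check that assumption beforehand is sketched here; 'is_two_line_fasta' is a name invented for this example.

```sh
# Hypothetical helper (name invented for this sketch): succeed only if
# every given file alternates header line, sequence line, header line, ...
is_two_line_fasta() {
    awk '
        FNR == 1 && prev % 2  { bad = 1 }  # previous file ended on a header
        FNR % 2 == 1 && !/^>/ { bad = 1 }  # odd lines must be headers
        FNR % 2 == 0 &&  /^>/ { bad = 1 }  # even lines must be sequences
        { prev = FNR }
        END { exit (bad || prev % 2) }
    ' "$@"
}
```

For example: is_two_line_fasta *.fa && cat *.fa | paste - - | sed 's/^>//' > final.txt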
Fourth,
If your FASTA files contain long multi-line sequences, like so:
>gi|452742846|ref|NZ_CAFD010000002.1|
CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTAC
GTCGACTGACGTCTGTACACCACACGTTGTGACGAGCATCGACTAGCATCAG
TTGAGCGACATCATCAGCGACGAGATCACGAGCACTAGCACTACGACTACGA
You might consider using a specialized tool to convert them to a table, such as:
http://manpages.ubuntu.com/manpages/trusty/man1/fasta_formatter.1.html (*)
or http://kirill-kryukov.com/study/tools/fasta-formatter/ .
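If installing an extra tool is not an option, a plain awk sketch along the following lines might also do. This is not how the tools above work internally; 'fasta_to_table' is a name invented here, and the sketch assumes LF line endings and that every record starts with a ">" header line.

```sh
# Hypothetical helper (name invented for this sketch): join each record's
# sequence lines and print "ID<TAB>sequence", one record per output line.
fasta_to_table() {
    awk '
        /^>/ {
            # A new header: flush the previous record, if there was one.
            if (have_record) print id "\t" seq
            id = substr($0, 2)   # header without the leading ">"
            seq = ""
            have_record = 1
            next
        }
        { seq = seq $0 }         # append sequence lines to the current record
        END { if (have_record) print id "\t" seq }
    ' "$@"
}
```

For example: fasta_to_table *.fa > final.txt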
Hope this helps,
- assaf
(* shameless plug: I wrote fasta_formatter long ago)