Transposing with GNU datamash

Data in bioinformatics often has a large number of columns.

For example, here is what a public 10X Genomics dataset (BAM) looks like when viewed with samtools:

kimoton@DESKTOP-BL78EM7:~$ samtools view http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam | head -1
[knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged.
[bam_header_read] EOF marker is absent. The input is probably truncated.
ST-K00126:314:HFYL2BBXX:7:2103:14996:4725       272     1       10001   1       3S95M   *       0       0       CCGTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC      <)-7-)7<JF7<FA7---<<A-FFFF-A<<AA----FFAJJJAAAJJAFJJJF<JJJA---FFA-AFFFFJ<FJF-FFFA7FFFJ7JFFAJ7FAFF<A      NH:i:3  HI:i:3  AS:i:91 nM:i:1  RE:A:I  BC:Z:TCGTCACG   QT:Z:AAFFFJJJ   CR:Z:TTAACTCGTAGAAGGA   CY:Z:AAFFFJJJJJJJJJJJ   CB:Z:TTAACTCGTAGAAGGA-1 UR:Z:GTCCGGCGAC UY:Z:JJJJJJJJJJ UB:Z:GTCCGGCGAC RG:Z:pbmc8k:MissingLibrary:1:HFYL2BBXX:7

The record is one very wide line, which makes it hard to read and hard to tell what the data contains.

In cases like this, pipe the output to datamash transpose.

kimoton@DESKTOP-BL78EM7:~$ samtools view http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam | head -1 | datamash transpose
[knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged.
[bam_header_read] EOF marker is absent. The input is probably truncated.
ST-K00126:314:HFYL2BBXX:7:2103:14996:4725
272
1
10001
1
3S95M
*
0
0
CCGTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
<)-7-)7<JF7<FA7---<<A-FFFF-A<<AA----FFAJJJAAAJJAFJJJF<JJJA---FFA-AFFFFJ<FJF-FFFA7FFFJ7JFFAJ7FAFF<A
NH:i:3
HI:i:3
AS:i:91
nM:i:1
RE:A:I
BC:Z:TCGTCACG
QT:Z:AAFFFJJJ
CR:Z:TTAACTCGTAGAAGGA
CY:Z:AAFFFJJJJJJJJJJJ
CB:Z:TTAACTCGTAGAAGGA-1
UR:Z:GTCCGGCGAC
UY:Z:JJJJJJJJJJ
UB:Z:GTCCGGCGAC
RG:Z:pbmc8k:MissingLibrary:1:HFYL2BBXX:7

All it does is transpose the input, so each field of the record lands on its own line, but the result is much easier to read.
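To see what the transpose does with more than one row, here is a minimal sketch on a made-up 2x3 tab-separated table. The values and the variable names (table, transposed) are illustrative and not taken from the BAM file, and the sketch assumes datamash is installed and on PATH.

```shell
# Build a tiny tab-separated table: 2 rows x 3 columns (made-up values).
table=$(printf 'a\tb\tc\n1\t2\t3\n')

# Transpose it: rows become columns, so the result is 3 rows x 2 columns.
transposed=$(printf '%s\n' "$table" | datamash transpose)

printf '%s\n' "$transposed"
# a	1
# b	2
# c	3
```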

It does the same thing as t() in R, but because it is a Linux command it can be chained into a pipeline, which is very convenient.
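Because datamash works as a filter, the same pipeline can take several records at once. One caveat: SAM records can carry different numbers of optional tags, and as far as I know datamash rejects input whose lines differ in field count unless told otherwise. The sketch below uses made-up ragged input together with the --no-strict and --filler options. Check datamash --help on your version to confirm both are available.

```shell
# Made-up ragged input: the first line has 3 fields, the second only 2.
ragged=$(printf 'a\tb\tc\n1\t2\n')

# --no-strict accepts lines with differing field counts;
# --filler sets the placeholder written where a value is missing.
filled=$(printf '%s\n' "$ragged" | datamash --no-strict --filler=. transpose)

printf '%s\n' "$filled"
```

Using a visible placeholder such as "." keeps every output row the same width, so later column-based tools do not get confused by empty fields.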

You can also list the fields in reverse order by running datamash reverse before the transpose.

kimoton@DESKTOP-BL78EM7:~$ samtools view http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam | head -1 | datamash reverse | datamash transpose
[knet_seek] SEEK_END is not supported for HTTP. Offset is unchanged.
[bam_header_read] EOF marker is absent. The input is probably truncated.
RG:Z:pbmc8k:MissingLibrary:1:HFYL2BBXX:7
UB:Z:GTCCGGCGAC
UY:Z:JJJJJJJJJJ
UR:Z:GTCCGGCGAC
CB:Z:TTAACTCGTAGAAGGA-1
CY:Z:AAFFFJJJJJJJJJJJ
CR:Z:TTAACTCGTAGAAGGA
QT:Z:AAFFFJJJ
BC:Z:TCGTCACG
RE:A:I
nM:i:1
AS:i:91
HI:i:3
NH:i:3
<)-7-)7<JF7<FA7---<<A-FFFF-A<<AA----FFAJJJAAAJJAFJJJF<JJJA---FFA-AFFFFJ<FJF-FFFA7FFFJ7JFFAJ7FAFF<A
CCGTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
0
0
*
3S95M
1
10001
1
272
ST-K00126:314:HFYL2BBXX:7:2103:14996:4725
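Note that reverse flips the order of the fields within each line, not the order of the lines themselves. A minimal sketch on a made-up two-line input (the values and the variable name reversed are illustrative, and datamash is assumed to be on PATH):

```shell
# Two made-up lines with three tab-separated fields each.
reversed=$(printf 'a\tb\tc\n1\t2\t3\n' | datamash reverse)

# Fields are reversed within each line; the line order is unchanged.
printf '%s\n' "$reversed"
# c	b	a
# 3	2	1
```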

Installation

Ubuntu

sudo apt-get install datamash

RHEL

wget http://files.housegordon.org/datamash/bin/datamash-1.0.6-1.el6.x86_64.rpm
sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm

macOS

brew install datamash

Reference

datamash - GNU Project - Free Software Foundation