seqkit 使用说明

seqkit 是 Wei Shen 使用 go 语言编写处理 fa 和 fq 文件的一把利器,当前介绍版本为0.10.1。这里不详细介绍各个函数的参数,官方给出的文档已经足够。

软件地址:https://github.com/shenwei356/seqkit

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Available Commands:
common find common sequences of multiple files by id/name/sequence
concat concatenate sequences with same ID from multiple files
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
duplicate duplicate sequences N times
faidx create FASTA index file and extract subsequence
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (with length/GC content/GC skew)
genautocomplete generate shell autocompletion script
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
head print first N FASTA/Q records
help Help about any command
locate locate subsequences/motifs, mismatch allowed
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
rmdup remove duplicated sequences by id/name/sequence
sample sample sequences by number or proportion
seq transform sequences (revserse, complement, extract ID...)
shuffle shuffle sequences
sliding sliding sequences, circular genome supported
sort sort sequences by id/name/sequence/length
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
tab2fx convert tabular format to FASTA/Q format
translate translate DNA/RNA to protein sequence
version print version information and check for update

seq

1
2
3
4
5
6
7
8
9
10
$ seqkit seq hairpin.fa.gz  #展示fa文件
>cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC
UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA

$ seqkit seq read_1.fq.gz #展示fq文件
@HWI-D00523:240:HF3WGBCXX:1:1101:2574:2226 1:N:0:CTGTAG
TGAGGAATATTGGTCAATGGGCGCGAGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTACGGG
+
HIHIIIIIHIIHGHHIHHIIIIIIIIIIIIIIIHHIIIIIHHIHIIIIIGIHIIIIHHHHHHGHIHIII
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
$ seqkit seq hairpin.fa.gz -n  #展示序列全名
cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop
cel-mir-1 MI0000003 Caenorhabditis elegans miR-1 stem-loop

$ seqkit seq hairpin.fa.gz -n -i #展示序列ID
cel-let-7
cel-lin-4
cel-mir-1

$ seqkit seq hairpin.fa.gz -n -i --id-regexp "^[^\s]+\s([^\s]+)\s" #使用正则匹配序列名
MI0000001
MI0000002
MI0000003

$ seqkit seq hairpin.fa.gz -s -w 0 #只展示序列 并设置每行碱基数为默认
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCG
GAGACAGAACUCUUCGA
AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACG
GUUUGAGCAGAU
AAAGUGACCGUACCGAGCUGCAUACUUCCUUACAUGCCCAUACUAUAUCAUAAAUGGAUAUGGAAUGUAAAGAAGUAUGUAG
AACGGGGUGGUAGU
1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ seqkit seq hairpin.fa.gz -m 50 -M 150   #过滤fq文件,使序列长度在50-150bp之间。

$ seqkit seq hairpin.fa.gz -r -p #反转录序列
>cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
UCGAAGAGUUCUGUCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCC
AAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA

$ echo -e ">seq\nACGT-actgc-ACC" | seqkit seq -g -u #去除序列gap 并大写碱基
>seq
ACGTACTGCACC

$ echo -e ">seq\nUCAUAUGCUUGUCUCAAAGAUUA" | seqkit seq --rna2dna #DNA转RNA,--dna2rna亦可
>seq
TCATATGCTTGTCTCAAAGATTA

subseq

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
$ zcat hairpin.fa.gz | seqkit subseq -r 1:12  #展示序列前12个碱基
$ zcat hairpin.fa.gz | seqkit subseq -r -12:-1 #后12个

$ cat t.fa
>seq
actgACTGactgn
$ cat t.gtf #注意gtf文件格式,必须以\t分割。
seq test CDS 5 8 . . . gene_id "A"; transcript_id "";
seq test CDS 5 8 . - . gene_id "B"; transcript_id "";
$ seqkit subseq --gtf t.gtf t.fa #使用gtf位置信息,挑选fa序列
>seq_5:8:. A
ACTG
>seq_5:8:- B
CAGT

$ seqkit subseq --gtf Homo_sapiens.GRCh38.84.gtf.gz --chr 1 --feature cds hsa.fa
> chr1.gtf.cds.fa #指定染色体和特征

$ seqkit subseq --gtf t.gtf t.fa -u 3 #另加3bp上游序列
>seq_5:8:._us:3 A
ctgACTG
>seq_5:8:-_us:3 B
agtCAGT

$ seqkit subseq --gtf t.gtf t.fa -u 3 -f #只取上游3bp序列
>seq_5:8:._usf:3 A
ctg
>seq_5:8:-_usf:3 B
agt

gff3 文件第九列格式为ID=XXXXX; gtf 文件第九列格式为 gene_id “A”; transcript_id “”

sample

1
2
seqkit sample -p 0.1 -o sample.fq.gz #取序列文件的百分之十
seqkit sample -n 1000 -o sample.fq.gz #取1000条序列

shuffle

1
2
3
4
5
seqkit shuffle hairpin.fq.gz > shuffled.fq #打乱序列顺序
[INFO] read sequences ...
[INFO] 28645 sequences loaded
[INFO] shuffle ...
[INFO] output ...

stats

1
2
3
4
5
6
7
8
9
10
11
12
13
$ seqkit stats *.f{a,q}.gz  #统计序列信息
file format type num_seqs sum_len min_len avg_len max_len
hairpin.fa.gz FASTA RNA 28,645 2,949,871 39 103 2,354
mature.fa.gz FASTA RNA 35,828 781,222 15 21.8 34
reads_1.fq.gz FASTQ DNA 2,500 567,516 226 227 229
reads_2.fq.gz FASTQ DNA 2,500 560,002 223 224 225
$ seqkit stats *.f{a,q}.gz -a #列出所有统计结果
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%)
hairpin.fa.gz FASTA RNA 28,645 2,949,871 39 103 2,354 76 91 111 0 101 0 0
mature.fa.gz FASTA RNA 35,828 781,222 15 21.8 34 21 22 22 0 22 0 0
Illimina1.8.fq.gz FASTQ DNA 10,000 1,500,000 150 150 150 150 150 150 0 150 96.16 89.71
reads_1.fq.gz FASTQ DNA 2,500 567,516 226 227 229 227 227 227 0 227 91.24 86.62
reads_2.fq.gz FASTQ DNA 2,500 560,002 223 224 225 224 224 224 0 224 91.06 87.66

faidx

1
2
3
4
5
6
7
$ seqkit faidx hairpin.fa  #建立序列索引
>hsa-let-7a-1
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>hsa-let-7a-2
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU

fq2fa

1
$ seqkit fq2fa reads_1.fq.gz -o reads1_.fa.gz  #fq转fa

convert

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
$ seqkit head -n 1 tests/Illimina1.8.fq.gz
...
#AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJ
JJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

$ seqkit convert tests/Illimina1.8.fq.gz | seqkit head -n 1 #默认转换fq文件质量值到1.8+
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Sanger
[WARN] source and target quality encoding match.
...
#AAAFAAJFFFJJJ<JJJJJFFFJFJJJJJFJJAJJJFJJFJFJJJJFAFJ<JA<FFJ7FJJFJJAAJJJJ<JJJJJJJFJJJAJJJ
JJFJJ77<JJJJ-F7A-FJFFJJJJJJ<FFJ-<7FJJJFJJ)A7)7AA<7--)<-7F-A7FA<

$ seqkit convert tests/Illimina1.8.fq.gz --to Illumina-1.5+ | seqkit head -n 1
[INFO] possible quality encodings: [Illumina-1.8+]
[INFO] guessed quality encoding: Illumina-1.8+
[INFO] converting Illumina-1.8+ -> Illumina-1.5+ #转换 Illumina1.8+ -> Illumina1.5+
...
B```e``ieeeiii[iiiiieeeieiiiiieii`iiieiieieiiiie`ei[i`[eeiVeiieii``iiii[iiiiiiieiii`iii
iieiiVV[iiiiLeV`Leieeiiiiii[eeiL[VeiiieiiH`VHV``[VLLH[LVeL`Ve`[

grep

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ zcat hairpin.fa.gz | seqkit grep -r -p ^hsa  #正则匹配序列名
>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>hsa-let-7a-2 MI0000061 Homo sapiens let-7a-2 stem-loop
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU

$ zcat hairpin.fa.gz | seqkit grep -r -p ^hsa -p ^mmu -v #2个条件并取反

$ zcat hairpin.fa.gz | seqkit grep -f list > new.fa #将需要提取的序列名放在list中

$ zcat hairpin.fa.gz | seqkit grep -s -r -i -p ^aggcg #正则匹配序列碱基,-i 忽略大小写

$ seqkit grep -s -R 1:30 -i -r -p GCTGG #-R 在前30个碱基中正则匹配

rmdup

1
2
3
4
5
6
7
8
9
10
$ zcat hairpin.fa.gz | seqkit rmdup -s -o clean.fa.gz #去除重复的序列
[INFO] 2226 duplicated records removed

$ zcat hairpin.fa.gz | seqkit rmdup -s -i -m -o clean.fa.gz -d duplicated.fa.gz -D
duplicated.detail.txt #-d输出重复序列,-D统计重复序列
$ cat duplicated.detail.txt # here is not the entire list
3 hsa-mir-424, mml-mir-424, ppy-mir-424
3 hsa-mir-342, mml-mir-342, ppy-mir-342
2 ngi-mir-932, nlo-mir-932
2 ssc-mir-9784-1, ssc-mir-9784-2

common

1
2
3
$ seqkit common file*.fa -o common.fasta  #通过ID寻找共同序列
$ seqkit common file*.fa -n -o common.fasta #通过全名
$ seqkit common file*.fa -s -i -o common.fasta #通过序列

split

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
$ seqkit split hairpin.fa.gz -s 10000  #按序列数分割文件
[INFO] split into 10000 seqs per file
[INFO] write 10000 sequences to file: hairpin.fa.part_001.gz
[INFO] write 10000 sequences to file: hairpin.fa.part_002.gz
[INFO] write 8645 sequences to file: hairpin.fa.part_003.gz

$ seqkit split hairpin.fa.gz -p 4 #按文件个数分割文件
[INFO] split into 4 parts
[INFO] read sequences ...
[INFO] read 28645 sequences
[INFO] write 7162 sequences to file: hairpin.fa.part_001.gz
[INFO] write 7162 sequences to file: hairpin.fa.part_002.gz
[INFO] write 7162 sequences to file: hairpin.fa.part_003.gz
[INFO] write 7159 sequences to file: hairpin.fa.part_004.gz

$ seqkit split hairpin.fa.gz -p 4 -2 #-2减少内存使用
[INFO] split into 4 parts
[INFO] read and write sequences to tempory file: hairpin.fa.gz.fa ...
[INFO] create and read FASTA index ...
[INFO] read sequence IDs from FASTA index ...
[INFO] 28645 sequences loaded
[INFO] write 7162 sequences to file: hairpin.part_001.fa.gz
[INFO] write 7162 sequences to file: hairpin.part_002.fa.gz
[INFO] write 7162 sequences to file: hairpin.part_003.fa.gz
[INFO] write 7159 sequences to file: hairpin.part_004.fa.gz

$ seqkit split hairpin.fa.gz -i --id-regexp "^([\w]+)\-" -2 #按ID
[INFO] split by ID. idRegexp: ^([\w]+)\-
[INFO] read and write sequences to tempory file: hairpin.fa.gz.fa ...
[INFO] create and read FASTA index ...
[INFO] create FASTA index for hairpin.fa.gz.fa
[INFO] read sequence IDs from FASTA index ...
[INFO] 28645 sequences loaded
[INFO] write 48 sequences to file: hairpin.id_cca.fa.gz
[INFO] write 3 sequences to file: hairpin.id_hci.fa.gz

$ seqkit split hairpin.fa.gz -r 1:3 -2 #按前3个碱基
[INFO] split by region: 1:3
[INFO] read and write sequences to tempory file: hairpin.fa.gz.fa ...
[INFO] read sequence IDs and sequence region from FASTA file ...
[INFO] create and read FASTA index ...
[INFO] write 463 sequences to file: hairpin.region_1:3_AUG.fa.gz
[INFO] write 349 sequences to file: hairpin.region_1:3_ACU.fa.gz
[INFO] write 311 sequences to file: hairpin.region_1:3_CGG.fa.gz

range

1
$ cat hairpin.fa | seqkit range -r 101:150  #输出范围内的序列(1:12 如同 head -n 12)

sort

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet  #按ID排序,--quiet不输出提示信息
>SEQ2
acgtnAAAA
>seq1
ACGTNcccc

$ echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet -i #不区分大小写
>seq1
ACGTNcccc
>SEQ2
acgtnAAAA

$ echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAA" | seqkit sort --quiet -i -r #不区分大小写,反转结果
>SEQ2
acgtnAAAA
>seq1
ACGTNcccc
$ echo -e ">seq1\nACGTNcccc\n>SEQ2\nacgtnAAAAnnn\n>seq3\nacgt" | seqkit sort --quiet -l #按序列长度
>seq3
acgt
>seq1
ACGTNcccc
>SEQ2
acgtnAAAAnnn

translate

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
$ seqkit translate tests/mouse-p53-cds.fna  #将DNA/RNA 翻译为蛋白序列
>lcl|AB021961.1_cds_BAA82344.1_1 [gene=p53] [protein=P53] [protein_id=BAA82344.1] [location=101..1273] [gbkey=CDS]
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEA
LRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVM
CTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCS
DGDGLAPPQHRIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCM
GGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGS
AKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSR
AHSSYLKTKKGQSTSRHKKTMVKKVGPDSD*

$ seqkit translate tests/mouse-p53-cds.fna --trim #去掉 */X(终止密码子)
>lcl|AB021961.1_cds_BAA82344.1_1 [gene=p53] [protein=P53] [protein_id=BAA82344.1] [location=101..1273] [gbkey=CDS]
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEA
LRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVM
CTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCS
DGDGLAPPQHRIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCM
GGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGS
AKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSR
AHSSYLKTKKGQSTSRHKKTMVKKVGPDSD