biopython NCBI

Accessing NCBI's Entrez databases

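Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a retrieval system that gives users access to NCBI's databases, such as PubMed, GenBank, and GEO. You can query Entrez by hand in a web browser, or access it programmatically through Biopython's Bio.Entrez module. With the second approach, a single Python script is enough to search PubMed or download records from GenBank.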

> pip install biopython  # install Biopython, which provides the Bio package

ESearch: Searching the Entrez databases

esearch takes your search parameters and returns the ID numbers of the matching articles, sequences, and so on.

Searching the literature:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are (use your own address)
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861', '14630660', '12230038']

Searching for sequences:

>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["Count"]
'25'
>>> record["IdList"]
['126789333', '37222967', '37222966', '37222965', ..., '61585492']

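Commonly used values for db: pubmed, nucleotide, protein, gene, snp, unigene. The ESearch service itself defaults to pubmed, but Entrez.esearch takes db as a required argument, so pass it explicitly.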

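The term parameter: for a literature search, term is simply the keywords. For a sequence search, NCBI has its own query syntax built on field tags. For example, Cypripedioideae[Orgn] AND matK[Gene] matches matK gene sequences from Cypripedioideae (the slipper orchids).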

biomol_mrna[properties] AND Osteichthyes[organism] restricts the results to mRNA sequences from Osteichthyes (the bony fishes).

The other Boolean operators are OR and NOT.
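The field-tag syntax described above is easy to assemble in code. The following is a minimal sketch; `build_term` is a hypothetical helper written for this illustration and is not part of Bio.Entrez. It only produces the string that you would then pass as `term`.

```python
def build_term(fields, operator="AND"):
    """Join {field_tag: value} pairs into an Entrez query string.

    Hypothetical helper for illustration; not part of Bio.Entrez.
    """
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    # Each pair becomes value[tag], e.g. matK[Gene].
    parts = [f"{value}[{tag}]" for tag, value in fields.items()]
    return f" {operator} ".join(parts)


term = build_term({"Orgn": "Cypripedioideae", "Gene": "matK"})
print(term)  # Cypripedioideae[Orgn] AND matK[Gene]
```

The resulting string can be used directly, for example `Entrez.esearch(db="nucleotide", term=term)`.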

EFetch: Downloading data from Entrez

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are (use your own address)
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="fasta", retmode="text")
>>> print(handle.read())
>EU490707.1 Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast
ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTACTTGTGAAACGTTTAA
TTACTCGAATGTATCAACAGAATTTTTTGATTTCTTCGGTTAATGATTCTAACCAAAAAGGATTTTGGGG
GCACAAGCATTTTTTTTCTTCTCATTTTTCTTCTCAAATGGTATCAGAAGGTTTTGGAGTCATTCTGGAA
ATTCCATTCTCGTCGCAATTAGTATCTTCTCTTGAAGAAAAAAAAATACCAAAATATCAGAATTTACGAT
CTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTTTTACATTTGAATTATGTGTCAGATCTACTAAT
ACCCCATCCCATCCATCTGGAAATCTTGGTTCAAATCCTTCAATGCCGGATCAAGGATGTTCCTTCTTTG
………………

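The id parameter: the ID number NCBI assigns to each sequence record. You can obtain it with Entrez.esearch.

The rettype parameter: the record format. Common values are fasta and gb.

The retmode parameter: how the returned data is encoded, either text or xml.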
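To make the role of these parameters concrete, the sketch below builds the kind of E-utilities URL that an EFetch request corresponds to. `efetch_url` is a hypothetical helper for illustration only, not a Bio.Entrez function; Bio.Entrez assembles a comparable request for you and also sends your email address, so in practice you should keep calling Entrez.efetch.

```python
from urllib.parse import urlencode

# Base address of NCBI's public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(db, uid, rettype, retmode):
    """Build an EFetch URL by hand (illustrative helper, not Bio.Entrez)."""
    query = urlencode({"db": db, "id": uid, "rettype": rettype, "retmode": retmode})
    return f"{EUTILS_BASE}efetch.fcgi?{query}"


print(efetch_url("nucleotide", "186972394", "fasta", "text"))
```

Each keyword argument ends up as one query-string field, which is why swapping rettype from fasta to gb changes the format of the record you get back.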

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")  # parse the GenBank record with SeqIO
>>> handle.close()
>>> print(record)
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())

EGQuery: Global query, counting hits per database

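EGQuery reports how many hits a search term has in each Entrez database. This is very useful when you only need the number of entries per database and not the search results themselves.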

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are (use your own address)
>>> handle = Entrez.egquery(term="biopython")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 6
pmc 62
journals 0
...
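Once the rows are available, it is often handier to look counts up by database name. The sketch below uses sample rows copied from the output above instead of a live query, and it assumes every Count value is a numeric string; `counts_by_db` is a hypothetical helper name.

```python
# Sample rows copied from the output above; a live query returns many more.
rows = [
    {"DbName": "pubmed", "Count": "6"},
    {"DbName": "pmc", "Count": "62"},
    {"DbName": "journals", "Count": "0"},
]


def counts_by_db(rows):
    """Map each database name to its hit count as an int.

    Assumes Count is always a numeric string (illustrative helper).
    """
    return {row["DbName"]: int(row["Count"]) for row in rows}


counts = counts_by_db(rows)
# Keep only the databases that actually contain hits.
nonzero = {db: n for db, n in counts.items() if n > 0}
print(nonzero)  # {'pubmed': 6, 'pmc': 62}
```

With a real result you would pass `record["eGQueryResult"]` in place of `rows`.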

Searching, downloading, and parsing Entrez nucleotide records

Get the number of Cypripedioideae entries in the gene database:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "gene":
...         print(row["Count"])
...
376

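Retrieve up to 376 IDs by setting retmax=376. Note that this search queries the nucleotide database, whereas the count above was read from the gene row; to count nucleotide hits instead, compare DbName with "nuccore" in the loop above: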

>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae", retmax=376)
>>> record = Entrez.read(handle)
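ESearch returns only the first 20 IDs unless retmax is raised, and for large result sets it can be gentler to fetch the IDs in pages using the retstart and retmax parameters. The sketch below only computes the paging windows; `page_windows` is a hypothetical helper, and the page size of 100 is an arbitrary choice for illustration.

```python
def page_windows(total, page_size):
    """Yield (retstart, retmax) pairs that together cover `total` hits."""
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield start, min(page_size, total - start)


for retstart, retmax in page_windows(376, 100):
    print(retstart, retmax)
# 0 100
# 100 100
# 200 100
# 300 76
```

Each pair would be passed on as `Entrez.esearch(db="nucleotide", term="Cypripedioideae", retstart=retstart, retmax=retmax)`, with the resulting IdList values concatenated.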

Use efetch to download the first 5 of these results:

>>> idlist = ",".join(record["IdList"][:5])
>>> print (idlist)
187237168,187372713,187372690,187372688,187372686
>>> handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")
>>> records = Entrez.read(handle)  # parse the XML response
>>> print (len(records))
5
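The same comma-joined id string works for more than five records, but long ID lists are usually easier to handle in batches. The following is a minimal sketch using the five IDs printed above; `id_batches` is a hypothetical helper and the batch size of 2 is chosen only to keep the output short.

```python
def id_batches(ids, batch_size):
    """Yield comma-joined ID strings, each holding at most batch_size IDs."""
    for i in range(0, len(ids), batch_size):
        yield ",".join(ids[i:i + batch_size])


ids = ["187237168", "187372713", "187372690", "187372688", "187372686"]
for batch in id_batches(ids, 2):
    print(batch)
# 187237168,187372713
# 187372690,187372688
# 187372686
```

Each batch string can be passed as the id argument of Entrez.efetch, one request per batch.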