Why is the HTML I extract with Python incomplete?

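I am trying to extract data from this web page, but I am running into trouble because the page's HTML is formatted inconsistently. I have a list of OGAP IDs, and for each ID in the list I want to extract the gene name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I can reliably get the gene name for each ID, but I am having trouble with the literature section. Below are two search terms that highlight the inconsistencies.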

Example of HTML that works

Search term: OG00131

Literature describing O-GlcNAcylation:
  PMID: 20068230 [CAD, ETD MS/MS];

Example of HTML that does not work

Search term: OG00020

Literature describing O-GlcNAcylation: PMID: 16408927 [Azide-tag, nano-HPLC/tandem MS]
Site has not yet been determined. Use OGlcNAcScan to predict the O-GlcNAc site.

Here is my code so far:

import urllib2

from bs4 import BeautifulSoup

# Initialize the result lists
gene_list = []
literature = []

# Test list of OGAP IDs
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]

for i in range(len(gene_listID)):
    print gene_listID[i]

    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = ".cgi?textfield=%s&select=Any" % gene_listID[i]

    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)

    # Reads the page and parses it with the "lxml" parser
    soup = BeautifulSoup(page, "lxml")

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])

    # PubMed IDs are located near the tag containing the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")

    # Based on inspection of the website, need to move up to the parent tag
    pmid_p = pmid.parent

    # Then we move to the next tag, denoted as sibling
    # (since they share the same parent (table row) tag)
    pmid_s = pmid_p.next_sibling

    #for child in pmid_s.descendants:
    #    print child

    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")

    temp_lit = []

    # Next we print the text of the data
    #print pmid_c.text

    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))

    literature.append(temp_lit)
    print "\n"

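So it seems that an element in the non-working markup is what throws the code off. Is there a way to search for any element containing the text "PMID" and return the text that follows it (and the URL, if there is a PMID number)? If not, should I just check each child, looking for the text I am interested in?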

I am using Python 2.7.10.
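One possible way to look for "PMID" without depending on how the tags are nested is to match against the plain text of the cell rather than walking the tree. The following is only a minimal sketch under that assumption: the names `PMID_PATTERN`, `extract_pmids`, `working` and `failing` are made up for illustration, and the two sample strings are the literature text from the two examples above, not the page's real HTML. It uses only the standard `re` module and should run on Python 2.7 as well as Python 3.

```python
import re

# "PMID:" followed by digits, then an optional bracketed note such as
# "[CAD, ETD MS/MS]". Whitespace, including newlines, may separate the parts.
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)\s*(?:\[([^\]]*)\])?")


def extract_pmids(cell_text):
    """Return a list of (pmid, note) tuples found anywhere in cell_text.

    note is an empty string when no bracketed note follows the ID.
    """
    return [(pmid, note.strip()) for pmid, note in PMID_PATTERN.findall(cell_text)]


# Hypothetical inputs: the literature text of the two examples in the question.
working = "Literature describing O-GlcNAcylation:\n  PMID: 20068230 [CAD, ETD MS/MS];"
failing = (
    "Literature describing O-GlcNAcylation: PMID: 16408927 "
    "[Azide-tag, nano-HPLC/tandem MS]\n"
    "Site has not yet been determined. Use OGlcNAcScan to predict the O-GlcNAc site."
)

working_hits = extract_pmids(working)
failing_hits = extract_pmids(failing)
```

In the loop from the question, something like `extract_pmids(pmid_c.text)` could stand in for the `find_all('a')` step, since the text of a BeautifulSoup tag includes the text of all its descendants regardless of which extra elements wrap it. This recovers the PMID numbers and method notes only; the `href` values would still have to come from the `<a>` tags themselves.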
