Why is the HTML I extract with Python incomplete?
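I am trying to extract data from this web page, but I am running into trouble because the page's HTML is formatted inconsistently. I have a list of OGAP IDs, and for each ID I iterate over I want to extract the gene name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I can consistently get the gene name for each ID, but I am having trouble with the literature section. Below are a couple of search terms that highlight the inconsistencies.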
Example of HTML that works
Search term: OG00131
Literature describing O-GlcNAcylation:PMID: 20068230 [CAD, ETD MS/MS];
Example of HTML that does not work
Search term: OG00020
Literature describing O-GlcNAcylation: PMID: 16408927 [Azide-tag, nano-HPLC/tandem MS]Site has not yet been determined. Use OGlcNAcScan to predict the O-GlcNAc site.
Here is the code I have so far:
    import urllib2
    from bs4 import BeautifulSoup

    # Define list of genes
    # Initialize variables
    gene_list = []
    literature = []

    # Test list
    gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]

    for i in range(len(gene_listID)):
        print gene_listID[i]
        # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
        dbOGAP = ".cgi?textfield=%s&select=Any" % gene_listID[i]
        # Opens the URL as a page
        page = urllib2.urlopen(dbOGAP)
        # Reads the page and parses it with the "lxml" parser
        soup = BeautifulSoup(page, "lxml")

        # The gene name sits in the cell right after the "Gene Name" cell
        gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
        # Drop the first character of the cell text
        print gene_name[1:]
        gene_list.append(gene_name[1:])

        # PubMed IDs are located near the span tag with the text "Data and Source"
        pmid = soup.find("span", text="Data and Source")
        # Based on inspection of the website, need to move up to the parent tag
        pmid_p = pmid.parent
        # Then we move to the next tag, its sibling (they share the same parent table-row tag)
        pmid_s = pmid_p.next_sibling
        #for child in pmid_s.descendants:
        #    print child
        # Now we search down the tree to find the next table data (td) tag
        pmid_c = pmid_s.find("td")

        temp_lit = []
        # Next we print the text of the data
        #print pmid_c.text
        if "No literature is available" in pmid_c.text:
            temp_lit.append("No literature is available")
            print "Not available"
        else:
            # and then print out a list of urls for each pubmed ID we have
            print "The following is available"
            for link in pmid_c.find_all('a'):
                # The tag includes more than just the link address.
                # For each tag found, print the address (href attribute) and extra bits.
                # link.string provides the string that appears to be hyperlinked.
                # In this case, it is the PubMed ID.
                print link.string
                temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))

        literature.append(temp_lit)
        print "\n"
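So it seems that one element in the markup is what throws the code off. Is there a way to search for any element containing the text "PMID" and return the text that follows it (and the URL, if there is a PMID number)? If not, should I just check each child, looking for the text I am interested in?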
I am using Python 2.7.10.
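To make the "search for the text PMID" idea concrete, here is a minimal sketch that ignores the tag structure and applies a regular expression to plain text, for example the string returned by `pmid_c.get_text()`. The helper name `extract_pmids` and the `PUBMED_URL` pattern are assumptions made for illustration; the links on the actual page may point elsewhere, and the two sample strings are simply the literature lines quoted above.

```python
import re

# Matches "PMID: 12345", optionally followed by a bracketed method note.
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)\s*(?:\[([^\]]*)\])?")

# Assumed URL pattern for PubMed records; the hrefs on the scraped page may differ.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/%s/"


def extract_pmids(cell_text):
    """Return one dict per PMID found in cell_text (hypothetical helper)."""
    records = []
    for pmid, method in PMID_PATTERN.findall(cell_text):
        records.append({
            "pmid": pmid,
            "method": method.strip(),  # empty string when no [..] note follows
            "url": PUBMED_URL % pmid,
        })
    return records


# The two literature strings quoted in the question.
working = "Literature describing O-GlcNAcylation:PMID: 20068230 [CAD, ETD MS/MS];"
failing = ("Literature describing O-GlcNAcylation: PMID: 16408927 "
           "[Azide-tag, nano-HPLC/tandem MS]Site has not yet been determined.")

for sample in (working, failing):
    for record in extract_pmids(sample):
        print("PMID: %s [%s] URL: %s" % (record["pmid"], record["method"], record["url"]))
```

Because the pattern works on text rather than on tags, it does not matter whether the PMID sits inside a link, a nested element, or bare text. The trade-off is that it only finds what appears literally as "PMID: <digits>", so a page that phrases the reference differently would need a different pattern.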