Why is the HTML I extract with Python incomplete? (Python)

Updated: 2024-12-22 21:49:29

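I am trying to extract data from this web page, but I am running into trouble because the page's HTML is formatted inconsistently. I have a list of OGAP IDs, and for each ID I iterate over I want to extract the gene name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I can consistently get the gene name for every ID, but I am having trouble with the literature section. Below are two search terms that highlight the inconsistencies.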

HTML sample that works

Search term: OG00131

Literature describing O-GlcNAcylation:
PMID: 20068230 [CAD, ETD MS/MS];

HTML sample that does not work

Search term: OG00020

Literature describing O-GlcNAcylation: PMID: 16408927 [Azide-tag, nano-HPLC/tandem MS]
Site has not yet been determined. Use OGlcNAcScan to predict the O-GlcNAc site.

Here is my code so far:

import urllib2
from bs4 import BeautifulSoup

# Define list of genes
# Initialize variables
gene_list = []
literature = []

# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]

for i in range(len(gene_listID)):
    print gene_listID[i]

    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = ".cgi?textfield=%s&select=Any" % gene_listID[i]

    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)

    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])

    # PubMed IDs are located near the <span> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")

    # Based on inspection of the website, need to move up to the parent tag
    pmid_p = pmid.parent

    # Then we move to the next tag, denoted as sibling (since they share a parent table row tag)
    pmid_s = pmid_p.next_sibling

    #for child in pmid_s.descendants:
    #    print child

    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []

    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))

    literature.append(temp_lit)
    print "\n"

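So it seems that one of the elements in the non-working markup is what trips up the code. Is there a way to search for any element containing the text "PMID" and return the text that follows it (and, if there is a PMID number, the URL)? If not, should I just check every child, looking for the text I am interested in?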
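To make the idea concrete, here is a minimal sketch of the "search for the text PMID" approach. It swaps tag navigation for a regular expression run over plain text, so it does not depend on how the surrounding markup is nested. It is written for Python 3 and uses only the standard library. The function name extract_pmids, the regular expression, and the PubMed URL layout are illustrative assumptions and are not taken from the page. The two input strings are the literature snippets quoted above.

```python
import re

# Assumed pattern, based only on the two samples above: "PMID: <digits>",
# optionally followed by a bracketed note such as "[CAD, ETD MS/MS]".
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)(?:\s*\[([^\]]*)\])?")

# Assumed URL layout for a PubMed record; the real page supplies its own href.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/%s/"


def extract_pmids(text):
    """Return a list of (pmid, note, url) tuples for every PMID found in text."""
    results = []
    for match in PMID_PATTERN.finditer(text):
        pmid = match.group(1)
        note = match.group(2) or ""  # empty string when no bracketed note follows
        results.append((pmid, note, PUBMED_URL % pmid))
    return results


# The two literature snippets from the question, as plain text.
working = "Literature describing O-GlcNAcylation:\nPMID: 20068230 [CAD, ETD MS/MS];"
failing = (
    "Literature describing O-GlcNAcylation: PMID: 16408927 "
    "[Azide-tag, nano-HPLC/tandem MS]\n"
    "Site has not yet been determined. "
    "Use OGlcNAcScan to predict the O-GlcNAc site."
)

for sample in (working, failing):
    for pmid, note, url in extract_pmids(sample):
        print(pmid, note, url)
```

With BeautifulSoup, the text passed to such a function could come from calling get_text() on the literature cell, so both page layouts would go through the same code path. A text with no PMID simply yields an empty list.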

I am using Python 2.7.10.
