Getting Started with the Python REfO Module
Blog posts about using REfO are almost impossible to find in China, so here is a reposted usage example from a site abroad.
Link: Regular Expressions for Objects.
For work I recently needed to do something that is very similar to regexes, but with a twist: it should operate on lists of objects, not only on strings. Luckily, Python came to the rescue with REfO, a library for doing just this.
My use case was selecting phrases from Part-of-Speech (POS) annotated text. The text was lemmatized and tagged using spaCy, which resulted in lists of [lemma, POS tag] pairs of the following form:
s = [['i', 'PRON'], ['look', 'VERB'], ['around', 'ADP'], ['me', 'PRON'], ['and', 'CCONJ'], ['see', 'VERB'], ['that', 'ADP'], ['everyone', 'NOUN'], ['be', 'VERB'], ['run', 'VERB'], ['around', 'ADV'], ['in', 'ADP'], ['a', 'DET'], ['hurry', 'NOUN']]
From these sentences we want to extract human action phrases and noun phrases, which are defined as follows, using regex-like notation:
human_action = ("he"|"she"|"i"|"they"|"we") ([VERB] [ADP])+
noun_phrase = [DET]? ([ADJ]? [NOUN])+
Translated to English, this means that human actions are first- and third-person pronouns, singular or plural, followed by one or more verb-adposition pairs (adpositions are words such as in, to, during). Noun phrases consist of an optional determiner (a, an, the) followed by one or more nouns, each optionally preceded by an adjective.
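To make the human_action definition concrete before bringing in any library, here is a minimal hand-written sketch in plain Python. It is not REfO code: the function name find_human_actions and the constant HUMAN_PRONOUNS are made up for illustration, and the scan is a simple greedy left-to-right pass that only handles this one pattern.

```python
# Hypothetical helper, not part of REfO: a hand-rolled matcher for
#   human_action = ("he"|"she"|"i"|"they"|"we") ([VERB] [ADP])+
HUMAN_PRONOUNS = {"i", "he", "she", "we", "they"}


def find_human_actions(tokens):
    """Return (start, end) index spans of human action phrases."""
    spans = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "PRON" and word in HUMAN_PRONOUNS:
            j = i + 1
            # Consume as many complete (VERB, ADP) pairs as possible.
            while (j + 1 < len(tokens)
                   and tokens[j][1] == "VERB"
                   and tokens[j + 1][1] == "ADP"):
                j += 2
            if j > i + 1:  # the "+" requires at least one pair
                spans.append((i, j))
                i = j
                continue
        i += 1
    return spans


s = [['i', 'PRON'], ['look', 'VERB'], ['around', 'ADP'], ['me', 'PRON'],
     ['and', 'CCONJ'], ['see', 'VERB'], ['that', 'ADP'],
     ['everyone', 'NOUN'], ['be', 'VERB'], ['run', 'VERB'],
     ['around', 'ADV'], ['in', 'ADP'], ['a', 'DET'], ['hurry', 'NOUN']]

for start, end in find_human_actions(s):
    print(s[start:end])
# prints [['i', 'PRON'], ['look', 'VERB'], ['around', 'ADP']]
```

Writing one such function per pattern gets tedious quickly, which is the motivation for describing the patterns declaratively instead.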
Most standard regex libraries won’t help you with this, because they work only on strings. The problem is still perfectly well described by a regular grammar, though, so after a bit of Googling I found REfO. It’s super simple to use, although you have to read the source code, because it doesn’t really have documentation.
REfO is a bit more verbose than normal regular expressions, but at least it tries to stay close to the usual regex notions. Zero-or-more repetition (*) is done using the refo.Star operator, one-or-more repetition (+) is refo.Plus, and an optional element (?) is refo.Question. The only new operator is refo.Predicate, which takes a one-argument function and matches if that function returns true when called with the element at that position. Using this we will build the functions we need:
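import refo

def pos(tag):
    # Match any [lemma, tag] pair whose POS tag equals the given tag.
    return refo.Predicate(lambda x: x[1] == tag)

def humanpron():
    # Match a pronoun whose lemma is one of the human pronouns.
    return refo.Predicate(lambda x: x[1] == 'PRON' and x[0] in {'i', 'he', 'she', 'we', 'they'})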
For matching POS, we use a helper to create a function that will match the given tag. For matching human pronouns, we also check the words, not just the POS tag.
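As an aside, the difference between zero-or-more and one-or-more repetition over a list of objects can be pictured with a few lines of plain Python. This is only a sketch of the semantics, not how REfO is implemented: run_length, match_star, match_plus and is_verb are hypothetical names, and the scan is greedy with no backtracking.

```python
def run_length(predicate, items, start):
    """Count consecutive items from `start` that satisfy `predicate`."""
    n = 0
    while start + n < len(items) and predicate(items[start + n]):
        n += 1
    return n


def match_star(predicate, items, start=0):
    """Zero or more: always succeeds, possibly consuming nothing."""
    return start + run_length(predicate, items, start)


def match_plus(predicate, items, start=0):
    """One or more: returns None when not even one item matches."""
    n = run_length(predicate, items, start)
    return start + n if n > 0 else None


def is_verb(token):
    # Same kind of test as a POS predicate: look at the tag only.
    return token[1] == "VERB"


tokens = [["be", "VERB"], ["run", "VERB"], ["around", "ADV"]]

print(match_star(is_verb, tokens, 0))  # 2: both verbs consumed
print(match_plus(is_verb, tokens, 0))  # 2: same result, one or more found
print(match_star(is_verb, tokens, 2))  # 2: zero items consumed, still a match
print(match_plus(is_verb, tokens, 2))  # None: no verb at position 2
```

Each function returns the end index of the match, so the two operators differ only in whether an empty run counts as success.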
np = refo.Question(pos('DET')) + refo.Plus(refo.Question(pos('ADJ')) + pos('NOUN'))
humanAction = humanpron() + refo.Plus(pos('VERB') + pos('ADP'))
We then just compose these predicates, concatenating them with +, and we have the two patterns we wanted. Using them is simple: you either call refo.search, which finds the first match, or refo.finditer, which returns an iterable over all matches.
for match in refo.finditer(humanAction, s):
    start = match.start()
    end = match.end()
    print(s[start:end])
[[u'i', u'PRON'], [u'look', u'VERB'], [u'around', u'ADP']]
So it’s always good to Google around for a solution: my first instinct, whipping up a parser in Parsec, would have led to a much more complicated solution. This one is nice, elegant, short and efficient.