Finding nestable string groups with regular expressions in Python - CDA数据分析师官网

2017-11-23

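I came across a small requirement online that calls for regular expressions. The original requirement reads:

Find the sentences in a text that contain "因为 ... 所以" ("because ... so") and print each one aligned around those two words: the 3 characters before "因为", everything in between, and the 3 characters after "所以". If another "因为" / "所以" pair occurs between the outer "因为" and "所以", find it as well and print it on a line of its own. The output format is:
line number, 3 characters before, *因为*, everything in between, &所以&, 3 characters after (a punctuation mark counts as one character)
2 还不是 *因为* 这里好， &所以& 没有人
One way to implement it: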
# Python 3
import re

def getPairStriList(filename):
    pairStrList = []
    # '\u56e0\u4e3a' and '\u6240\u4ee5' are the Unicode escapes for "因为" and "所以"
    pattern = re.compile('.{3}\u56e0\u4e3a.*\u6240\u4ee5.{3}')
    with open(filename, 'r', encoding='utf8') as textFile:
        for line in textFile:
            result = pattern.search(line)
            while result:
                resultStr = result.group()
                pairStrList.append(resultStr)
                # Search again strictly inside the current match for a nested pair
                result = pattern.search(resultStr, 2, len(resultStr) - 2)
    # Format each matched string and prefix its number
    for i in range(len(pairStrList)):
        s = pairStrList[i]
        s = s[:3] + ' *\u56e0\u4e3a* ' + s[5:]
        s = s[:-5] + ' &\u6240\u4ee5& ' + s[-3:]
        # The number is the match's position in the list, not the line of the file
        pairStrList[i] = str(i + 1) + ' ' + s
    return pairStrList

if __name__ == '__main__':
    for pairStr in getPairStriList('test.txt'):
        print(pairStr)
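
The nested search can be tried without creating test.txt. The following is a minimal sketch of the same idea applied to an in-memory string; the helper name find_nested_pairs and the sample sentence are assumptions made up for illustration and are not part of the original code. It relies on the optional pos and endpos arguments of a compiled pattern's search() method to restrict each follow-up search to the inside of the previous match.

```python
import re

# 3 characters, "因为" (because), anything (greedy), "所以" (so), 3 characters
PATTERN = re.compile('.{3}因为.*所以.{3}')

def find_nested_pairs(line):
    """Return the outermost match in `line`, then each pair nested inside it."""
    found = []
    result = PATTERN.search(line)
    while result:
        matched = result.group()
        found.append(matched)
        # Restrict the next search to the inside of the current match, so the
        # outer "因为" and "所以" can no longer take part in it.
        result = PATTERN.search(matched, 2, len(matched) - 2)
    return found

# Hypothetical sample sentence with one pair nested inside another
sample = '他说过因为我知道因为下雨所以路滑了所以要小心。'
for number, text in enumerate(find_nested_pairs(sample), 1):
    print(number, text)
# prints:
# 1 他说过因为我知道因为下雨所以路滑了所以要小心
# 2 我知道因为下雨所以路滑了
```

Because ".*" is greedy, the first search runs from the first "因为" to the last "所以" that still has 3 characters after it; each later search then looks for one more pair inside that result.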

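PS: Next, a look at nesting groups in Python regular expressions

Because a group is itself a complete regular expression, groups can be nested inside other groups to build more complex expressions. The following example shows nested groups: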
#python 3.6
#蔡军生
#http://blog.csdn.net/caimouse/article/details/51749579
#
import re
def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print('{!r} ({})\n'.format(pattern, desc))
        print(' {!r}'.format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            prefix = ' ' * (s)
            print(
                ' {}{!r}{} '.format(prefix,
                                    text[s:e],
                                    ' ' * (len(text) - e)),
                end=' ',
            )
            print(match.groups())
            if match.groupdict():
                print('{}{}'.format(
                    ' ' * (len(text) - s),
                    match.groupdict()),
                )
        print()
    return

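Example (this assumes the function above is saved as re_test_patterns_groups.py, the module name used in the import):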
#python 3.6
#蔡军生
#http://blog.csdn.net/caimouse/article/details/51749579
#
from re_test_patterns_groups import test_patterns
test_patterns(
'abbaabbba',
[(r'a((a*)(b*))', 'a followed by 0-n a and 0-n b')],
)

The output is as follows:
'a((a*)(b*))' (a followed by 0-n a and 0-n b)

 'abbaabbba'
 'abb'        ('bb', '', 'bb')
    'aabbb'   ('abbb', 'a', 'bbb')
         'a'  ('', '', '')
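
The order of the values in each tuple follows from how nested groups are numbered: groups are counted by their opening parentheses from left to right, so the outer group comes first and the groups nested inside it follow. The short sketch below checks this on the substring 'aabbb' and repeats it with named groups, which is what makes groupdict() non-empty. The group names rest, a_run and b_run are assumptions chosen for illustration.

```python
import re

# Numbered groups: the outer group is 1, the two groups nested in it are 2 and 3.
match = re.search(r'a((a*)(b*))', 'aabbb')
print(match.group(0))   # whole match: aabbb
print(match.group(1))   # outer group: abbb
print(match.group(2))   # first nested group: a
print(match.group(3))   # second nested group: bbb
print(match.groups())   # ('abbb', 'a', 'bbb')

# The same pattern with (hypothetical) names; named groups keep their numbers.
named = re.search(r'a(?P<rest>(?P<a_run>a*)(?P<b_run>b*))', 'aabbb')
print(named.groupdict())  # {'rest': 'abbb', 'a_run': 'a', 'b_run': 'bbb'}
print(named.group(1) == named.group('rest'))  # True
```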
Summary
That concludes this introduction to finding nestable string groups with regular expressions in Python. I hope it is helpful.