2014-04-30 12 views
18

Using BeautifulSoup to extract text without tags

My page contains something like this:

<p> 
    <strong class="offender">YOB:</strong> 1987<br /> 
    <strong class="offender">RACE:</strong> WHITE<br /> 
    <strong class="offender">GENDER:</strong> FEMALE<br /> 
    <strong class="offender">HEIGHT:</strong> 5'05''<br /> 
    <strong class="offender">WEIGHT:</strong> 118<br /> 
    <strong class="offender">EYE COLOR:</strong> GREEN<br /> 
    <strong class="offender">HAIR COLOR:</strong> BROWN<br /> 
</p> 

I want to extract the information for each individual and get YOB: 1987, RACE: WHITE, etc.

What I tried:

subc = soup.findAll('p') 
subc1 = subc[1] 
subc2 = subc1.findAll('strong') 

But this gives me only the labels YOB:, RACE:, etc., not their values.

Is there a way to get the data in the form YOB: 1987, RACE: WHITE?

Thanks, Manish


33

Just loop over all the <strong> tags and use next_sibling to get the text that follows each one. Like this:

for strong_tag in soup.find_all('strong'): 
    print strong_tag.text, strong_tag.next_sibling 

Demo:

>>> from bs4 import BeautifulSoup 
>>> html = ''' 
... <p> 
...  <strong class="offender">YOB:</strong> 1987<br /> 
...  <strong class="offender">RACE:</strong> WHITE<br /> 
...  <strong class="offender">GENDER:</strong> FEMALE<br /> 
...  <strong class="offender">HEIGHT:</strong> 5'05''<br /> 
...  <strong class="offender">WEIGHT:</strong> 118<br /> 
...  <strong class="offender">EYE COLOR:</strong> GREEN<br /> 
...  <strong class="offender">HAIR COLOR:</strong> BROWN<br /> 
... </p> 
... ''' 
>>> soup = BeautifulSoup(html) 
>>> for strong_tag in soup.find_all('strong'): 
...  print strong_tag.text, strong_tag.next_sibling 

This prints:

YOB: 1987 
RACE: WHITE 
GENDER: FEMALE 
HEIGHT: 5'05'' 
WEIGHT: 118 
EYE COLOR: GREEN 
HAIR COLOR: BROWN 
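
The same idea can be sketched in Python 3 syntax (the snippets above use Python 2 print statements). This is only an illustration: the helper name label_value_pairs is hypothetical, the built-in "html.parser" backend is an assumption, and the isinstance check is an added guard for labels that are not followed by plain text.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
</p>
"""


def label_value_pairs(markup):
    """Return (label, value) tuples for each <strong> tag followed by text.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")  # parser choice is an assumption
    pairs = []
    for strong_tag in soup.find_all("strong"):
        sibling = strong_tag.next_sibling
        # Skip labels whose next sibling is missing or is another tag.
        if isinstance(sibling, NavigableString):
            pairs.append((strong_tag.get_text(strip=True), sibling.strip()))
    return pairs


pairs = label_value_pairs(HTML)
for label, value in pairs:
    print(label, value)
```

Stripping the sibling removes the leading space that sits between the closing </strong> and the value.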
14

I think you can get it using subc1.text.

>>> html = """ 
<p> 
    <strong class="offender">YOB:</strong> 1987<br /> 
    <strong class="offender">RACE:</strong> WHITE<br /> 
    <strong class="offender">GENDER:</strong> FEMALE<br /> 
    <strong class="offender">HEIGHT:</strong> 5'05''<br /> 
    <strong class="offender">WEIGHT:</strong> 118<br /> 
    <strong class="offender">EYE COLOR:</strong> GREEN<br /> 
    <strong class="offender">HAIR COLOR:</strong> BROWN<br /> 
</p> 
""" 
>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(html) 
>>> print soup.text 


YOB: 1987 
RACE: WHITE 
GENDER: FEMALE 
HEIGHT: 5'05'' 
WEIGHT: 118 
EYE COLOR: GREEN 
HAIR COLOR: BROWN 
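
If one string per field is more convenient than a single block of text, a possible sketch (Python 3, assuming the "html.parser" backend and the same one-field-per-line layout as above) is to split the text on line breaks and drop the blank lines:

```python
from bs4 import BeautifulSoup

html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
raw_text = soup.find("p").get_text()

# Each field sits on its own source line, so split on newlines,
# trim the indentation, and discard the empty lines.
lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
print(lines)
```

This relies on the newlines present in the HTML source; markup written on a single line would need a different approach, such as the sibling-based loop.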

Or, if you want to inspect the structure, you can use .contents:

>>> p = soup.find('p') 
>>> from pprint import pprint 
>>> pprint(p.contents) 
[u'\n', 
<strong class="offender">YOB:</strong>, 
u' 1987', 
<br/>, 
u'\n', 
<strong class="offender">RACE:</strong>, 
u' WHITE', 
<br/>, 
u'\n', 
<strong class="offender">GENDER:</strong>, 
u' FEMALE', 
<br/>, 
u'\n', 
<strong class="offender">HEIGHT:</strong>, 
u" 5'05''", 
<br/>, 
u'\n', 
<strong class="offender">WEIGHT:</strong>, 
u' 118', 
<br/>, 
u'\n', 
<strong class="offender">EYE COLOR:</strong>, 
u' GREEN', 
<br/>, 
u'\n', 
<strong class="offender">HAIR COLOR:</strong>, 
u' BROWN', 
<br/>, 
u'\n'] 

Then pick the elements you need out of that list:

>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]])) 
>>> pprint(data) 
{u'EYE COLOR:': u'GREEN', 
u'GENDER:': u'FEMALE', 
u'HAIR COLOR:': u'BROWN', 
u'HEIGHT:': u"5'05''", 
u'RACE:': u'WHITE', 
u'WEIGHT:': u'118', 
u'YOB:': u'1987'} 
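
The [1::4] and [2::4] slices depend on the exact whitespace layout of the paragraph. A variant that may be less brittle, sketched here in Python 3 under the assumption that every label is followed by exactly one non-empty value, pairs up the entries of stripped_strings instead. Dropping the trailing colon from the keys is an extra choice made for this example.

```python
from bs4 import BeautifulSoup

html = """
<p>
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
p = soup.find("p")

# stripped_strings yields every non-blank string with surrounding
# whitespace removed, so labels and values simply alternate.
strings = list(p.stripped_strings)

# Assumes label, value, label, value, ...; a missing value would shift the pairs.
data = {
    label.rstrip(":"): value
    for label, value in zip(strings[0::2], strings[1::2])
}
print(data)
```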