Navigating the Tree - Going Down

BeautifulSoup 2015. 6. 28. 23:32

This post summarizes the content of http://coreapython.hosting.paran.com/etc/beautifulsoup4.html#.

The tags in an HTML document form a tree. Let's navigate that tree with BeautifulSoup.

In [1]:

from bs4 import BeautifulSoup as bs
html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

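# Name the parser explicitly. lxml is an assumption about the original setup;
# it inserts the <body> tag that html_doc omits, which soup.body relies on below.
soup = bs(html_doc, "lxml")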

A tag can contain other tags. The tags it contains are called its children.

Navigating by tag name

The simplest way to navigate is to name the tag you want. To reach <head>, type soup.head.

In [2]:

soup.head

Out[2]:

<head><title>The Dormouse's story</title></head>

In [3]:

soup.title

Out[3]:

<title>The Dormouse's story</title>

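You can repeat this to keep moving down into the children of children. Note, however, that this approach only ever reaches the first tag with the given name. soup.body.b means the first b tag beneath the body tag of soup.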

In [4]:

soup.body.b

Out[4]:

<b>The Dormouse's story</b>
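The "first tag only" behavior is easy to miss, so here is a minimal sketch of it. The two-paragraph document is a hypothetical sample, not the html_doc from this post, and find_all() is not covered in this post; it is shown only as one way to reach the tags that attribute access skips.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with two <b> tags under <body>
html = "<html><body><p><b>first</b></p><p><b>second</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching tag
first_b = soup.body.b
print(first_b)            # <b>first</b>

# find_all() returns every matching tag beneath body
all_b = soup.body.find_all("b")
print(len(all_b))         # 2
print(all_b[1].string)    # second

# A tag name that does not occur gives None instead of raising
missing = soup.body.table
print(missing)            # None
```

Because a missing tag gives None, a chain such as soup.body.table.tr fails with an AttributeError at the second step, so it can be worth checking for None before going deeper.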

.contents and .children

A tag's children are available as a list through .contents.

In [5]:

head_tag = soup.head
head_tag

Out[5]:

<head><title>The Dormouse's story</title></head>

In [6]:

head_tag.contents

Out[6]:

[<title>The Dormouse's story</title>]

.contents gives you a list, whereas .children gives you a list iterator object.

In [7]:

head_tag.children

Out[7]:

<listiterator at 0x2eea9d0>
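In practice the difference comes down to how you intend to use the result. The following is a rough sketch on a hypothetical <ul> snippet (not part of html_doc): .contents when you need len() or indexing, .children when you only want to loop.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a list with two items
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul_tag = soup.ul

# .contents is an ordinary list, so len() and indexing work
contents = ul_tag.contents
print(len(contents))      # 2
print(contents[0])        # <li>one</li>

# .children is meant for iteration over the same direct children
child_texts = [str(child.string) for child in ul_tag.children]
print(child_texts)        # ['one', 'two']
```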

.descendants

.contents and .children only reach a tag's direct children. .descendants, by contrast, reaches every descendant: children, their children, and so on.

In [8]:

head_tag.descendants

Out[8]:

<generator object descendants at 0x02C39A30>

In [9]:

for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story
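To make the "direct children versus all descendants" distinction concrete, here is a small sketch that counts both on a hypothetical nested snippet. Note that strings count as descendants too, just as the title text did above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: <div> contains <p>, which contains text and <b>
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div_tag = soup.div

direct = list(div_tag.children)          # only the <p> tag
everything = list(div_tag.descendants)   # <p>, 'Hello ', <b>, 'world'
print(len(direct))        # 1
print(len(everything))    # 4

# Keep only the tags, skipping the NavigableString nodes
tag_names = [node.name for node in everything if isinstance(node, Tag)]
print(tag_names)          # ['p', 'b']
```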

.string

If a tag has exactly one child and that child is a NavigableString, .string gives you that NavigableString.

In [10]:

title_tag = soup.head.title
title_tag.string

Out[10]:

u"The Dormouse's story"

If a tag (head) has exactly one child (title) and that child (title) has a .string, then the tag (head) is treated as having the same .string.

In [11]:

head_tag.string

Out[11]:

u"The Dormouse's story"
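The flip side of this rule is worth seeing once: when a tag has more than one child, .string has nothing unambiguous to refer to and is None. The snippet below is a sketch on a hypothetical <p> tag; get_text() is not covered in this post and is shown only as one way to get the combined text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: <p> has three children (<b>, a string, <i>)
soup = BeautifulSoup("<p><b>bold</b> and <i>italic</i></p>", "html.parser")

# <b> has exactly one child, a NavigableString
b_string = soup.b.string
print(b_string)           # bold

# <p> has several children, so .string is None
p_string = soup.p.string
print(p_string)           # None

# get_text() joins all the text beneath the tag
p_text = soup.p.get_text()
print(p_text)             # bold and italic
```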

.strings

With .strings you can get every NavigableString inside a tag, even when the tag has several children. Notice that the newlines (\n) between tags are also returned as NavigableStrings.

In [12]:

for string in soup.strings:
    print(repr(string))

u"The Dormouse's story"
u'\n'
u"The Dormouse's story"
u'\n'
u'Once upon a time there were three little sisters; and their names were\n'
u'Elsie'
u',\n'
u'Lacie'
u' and\n'
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'\n'
u'...'
u'\n'

.stripped_strings

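With .stripped_strings, strings that consist only of whitespace are skipped, and the remaining strings have their leading and trailing whitespace removed. Whitespace inside a string, such as the \n in the last sentence below, is left alone.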

In [13]:

for string in soup.stripped_strings:
    print(repr(string))

u"The Dormouse's story"
u"The Dormouse's story"
u'Once upon a time there were three little sisters; and their names were'
u'Elsie'
u','
u'Lacie'
u'and'
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'...'

License: Attribution, NonCommercial, ShareAlike

WRITTEN BY

: 히처리
python + er = pyther