Customize BeautifulSoup's prettify by tag

I was wondering whether it is possible to make prettify not create new lines on specific tags.

I would like span and a tags not to be split up, for example:

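doc = """<div><div><span>a</span><span>b</span>
<a>link</a></div><a>link1</a><a>link2</a></div>"""

from bs4 import BeautifulSoup as BS
# Name the parser explicitly so the output does not depend on which parsers are installed
soup = BS(doc, "html.parser")
print(soup.prettify())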

This is the output I want:

<div>
    <div>
        <span>a</span><span>b</span>
        <a>link</a>
    </div>
    <a>link1</a><a>link2</a>
</div>

But this is what actually gets printed:

<div>
 <div>
  <span>
   a
  </span>
  <span>
   b
  </span>
  <a>
   link
  </a>
 </div>
 <a>
  link1
 </a>
 <a>
  link2
 </a>
</div>

Placing inline tags on new lines like this adds whitespace between them, which slightly alters how the actual page looks. Here are two jsfiddles showing the difference:

anchor tags on new lines

anchor tags next to each other
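
The same effect can be checked without a browser. This is a small sketch, assuming bs4 with the built-in html.parser; the collapse helper is hypothetical and only approximates how browsers collapse runs of whitespace in inline content:

```python
from bs4 import BeautifulSoup

doc = "<div><a>link1</a><a>link2</a></div>"

# Text as written: the two links touch.
original_text = BeautifulSoup(doc, "html.parser").get_text()

# Text after a prettify round trip: every tag and string sits on its own
# line, so whitespace now separates the links.
pretty = BeautifulSoup(doc, "html.parser").prettify()
pretty_text = BeautifulSoup(pretty, "html.parser").get_text()


def collapse(text):
    """Hypothetical helper: squeeze each whitespace run into one space."""
    return " ".join(text.split())


print(repr(collapse(original_text)))  # 'link1link2'
print(repr(collapse(pretty_text)))    # 'link1 link2'
```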

If you are wondering why this matters for BeautifulSoup, it is because I am writing a web page debugger, and the prettify function would be very useful (along with other things in bs4). But if I prettify the document, I risk altering some things.

So, is there any way to customize the prettify function so that it does not break up certain tags?

2013-07-11 Ryan Saxe

I'm posting a quick hack, since I haven't found a better solution.

I'm actually using it in my project to avoid breaking textarea and pre tags. Replace ['span', 'a'] with the tags for which you want to prevent indentation.

from bs4 import BeautifulSoup

markup = """<div><div><span>a</span><span>b</span>
<a>link</a></div><a>link1</a><a>link2</a></div>"""

# Double curly brackets to avoid problems with .format()
stripped_markup = markup.replace('{', '{{').replace('}', '}}')

stripped_markup = BeautifulSoup(stripped_markup, 'html.parser')

unformatted_tag_list = []

# Swap every protected tag for a .format() placeholder, remembering its markup
for i, tag in enumerate(stripped_markup.find_all(['span', 'a'])):
    unformatted_tag_list.append(str(tag))
    tag.replace_with('{' + 'unformatted_tag_list[{0}]'.format(i) + '}')

# Prettify the placeholder version, then substitute the saved tags back in
pretty_markup = stripped_markup.prettify().format(unformatted_tag_list=unformatted_tag_list)

print(pretty_markup)

2013-08-25 12:12:17

That's exactly what I was looking for!!! Also, the "11" is the day... it's only been a month :D

Thanks for the info! :D I'll edit the post to remove my remark about the post date.

This doesn't work if the original markup contains JavaScript (really, any curly braces). Unsurprisingly, that makes .format() raise KeyError.
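
The curly-brace problem raised in the comment above can be sidestepped by not using str.format at all. Here is a minimal sketch, assuming bs4 4.8 or later (for Tag.smooth()) and the built-in html.parser; the function name prettify_keeping_inline and the placeholder format are made up for this illustration:

```python
import uuid

from bs4 import BeautifulSoup


def prettify_keeping_inline(markup, inline_tags=("span", "a")):
    """Prettify markup, emitting the given tags verbatim (hypothetical helper)."""
    names = list(inline_tags)
    soup = BeautifulSoup(markup, "html.parser")
    prefix = uuid.uuid4().hex  # random, so tokens are unlikely to clash with page text
    saved = {}

    for i, tag in enumerate(soup.find_all(names)):
        # A tag nested inside another protected tag is already part of
        # its ancestor's saved string, so skip it.
        if tag.find_parent(names) is not None:
            continue
        token = "@@" + prefix + "-" + str(i) + "@@"
        saved[token] = str(tag)
        tag.replace_with(token)

    # Merge adjacent text nodes so neighbouring placeholders share a line.
    soup.smooth()

    pretty = soup.prettify()
    # Plain string replacement: braces in the markup need no escaping.
    for token, original in saved.items():
        pretty = pretty.replace(token, original)
    return pretty


markup = """<div><div><span>a</span><span>b</span>
<a>link</a></div><a>link1</a><a>link2</a></div>"""
print(prettify_keeping_inline(markup))
```

Merging the placeholders with smooth() is what keeps neighbouring tags on one line. Whitespace that sat between protected tags in the source is kept as it was, so the indentation of a restored line may not match the rest of the output.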

The short answer is no.

The longer answer is: not easily.

I am still using bs3, so this is a hack for bs3. I am in the process of porting it to bs4.

It essentially involves subclassing Tag and BeautifulSoup and overriding the prettify (and related) methods.

Code:

import sys 
import BeautifulSoup 

class Tag(BeautifulSoup.Tag): 
    def __str__(self, encoding=BeautifulSoup.DEFAULT_OUTPUT_ENCODING,
                prettyPrint=False, indentLevel=0, pprint_exs=[]):
        """Returns a string or Unicode representation of this tag and
        its contents. To get Unicode, pass None for encoding.

        NOTE: since Python's HTML parser consumes whitespace, this
        method is not certain to reproduce the whitespace present in
        the original string."""

        encodedName = self.toEncoding(self.name, encoding)

        unflatten_here = (not self.name in pprint_exs)

        attrs = []
        if self.attrs:
            for key, val in self.attrs:
                fmt = '%s="%s"'
                if isinstance(val, basestring):
                    if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
                        val = self.substituteEncoding(val, encoding)

                    # The attribute value either:
                    #
                    # * Contains no embedded double quotes or single quotes.
                    #   No problem: we enclose it in double quotes.
                    # * Contains embedded single quotes. No problem:
                    #   double quotes work here too.
                    # * Contains embedded double quotes. No problem:
                    #   we enclose it in single quotes.
                    # * Embeds both single _and_ double quotes. This
                    #   can't happen naturally, but it can happen if
                    #   you modify an attribute value after parsing
                    #   the document. Now we have a bit of a
                    #   problem. We solve it by enclosing the
                    #   attribute in single quotes, and escaping any
                    #   embedded single quotes to XML entities.
                    if '"' in val:
                        fmt = "%s='%s'"
                        if "'" in val:
                            # TODO: replace with apos when
                            # appropriate.
                            val = val.replace("'", "&squot;")

                    # Now we're okay w/r/t quotes. But the attribute
                    # value might also contain angle brackets, or
                    # ampersands that aren't part of entities. We need
                    # to escape those to XML entities too.
                    val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)

                attrs.append(fmt % (self.toEncoding(key, encoding),
                                    self.toEncoding(val, encoding)))
        close = ''
        closeTag = ''
        if self.isSelfClosing:
            close = ' /'
        else:
            closeTag = '</%s>' % encodedName

        prev = self.findPrevious(lambda x: isinstance(x, Tag))
        prev_sib = self.findPreviousSibling(lambda x: isinstance(x, Tag))
        ex_break_detected = (self.name != prev_sib.name) if(prev_sib and prev_sib.name in pprint_exs) else False
        break_detected = (self.name != prev.name) if(prev) else False

        indentTag, indentContents = 0, 0
        if prettyPrint:
            if(break_detected or unflatten_here):
                indentContents = indentLevel + 1
            indentTag = indentLevel
            space = (' ' * (indentTag-1))
        contents = self.renderContents(encoding, prettyPrint, indentContents, pprint_exs, unflatten_here)
        if self.hidden:
            s = contents
        else:
            s = []
            attributeString = ''
            if attrs:
                attributeString = ' ' + ' '.join(attrs)
            if prettyPrint and ex_break_detected and not unflatten_here:
                s.append("\n")
            if prettyPrint and (unflatten_here or break_detected):
                s.append(space)
            s.append('<%s%s%s>' % (encodedName, attributeString, close))
            if prettyPrint and unflatten_here:
                s.append("\n")
            s.append(contents)
            if prettyPrint and contents and contents[-1] != "\n" and unflatten_here:
                s.append("\n")
            if prettyPrint and closeTag and unflatten_here:
                s.append(space)
            s.append(closeTag)
            if prettyPrint and closeTag and self.nextSibling and unflatten_here:
                s.append("\n")
            if prettyPrint and isinstance(self.nextSibling, Tag) and self.nextSibling.name != self.name and not unflatten_here:
                s.append("\n")

            s = ''.join(s)
        return s

    def renderContents(self, encoding=BeautifulSoup.DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0, pprint_exs=[], unflatten=True):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string."""
        s = []
        for c in self:
            text = None
            if isinstance(c, BeautifulSoup.NavigableString):
                text = c.__str__(encoding)
            elif isinstance(c, Tag):
                s.append(c.__str__(encoding, prettyPrint, indentLevel, pprint_exs))
            if text and prettyPrint:
                text = text.strip()
            if text:
                if prettyPrint and unflatten:
                    s.append(" " * (indentLevel-1))
                s.append(text)
                if prettyPrint and unflatten:
                    s.append("\n")
        return ''.join(s)
BeautifulSoup.Tag = Tag 

class BeautifulStoneSoup(Tag, BeautifulSoup.BeautifulStoneSoup): 
    pass 
BeautifulSoup.BeautifulStoneSoup = BeautifulStoneSoup 

class PumpkinSoup(BeautifulStoneSoup, BeautifulSoup.BeautifulSoup): 
    def __init__(self, *args, **kwargs):
        self.pprint_exs = kwargs.pop("pprint_exs", [])
        super(BeautifulSoup.BeautifulSoup, self).__init__(*args, **kwargs)

    def prettify(self, encoding=BeautifulSoup.DEFAULT_OUTPUT_ENCODING):
        return self.__str__(encoding, True, pprint_exs=self.pprint_exs)

doc = \ 
''' 
<div> 
<div> 
<span>a</span><span>b</span> 
    <a>link1</a> 
    <a>link2</a> 
<span>c</span> 
</div> 
<a>link3</a><a>link4</a> 
</div> 
''' 

soup = PumpkinSoup(doc, pprint_exs = ["a", "span"]) 
print soup.prettify()

2013-07-11 13:48:18 dilbert
