Remove all styles, scripts, and HTML tags from an HTML page

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    # create a new bs4 object from the html data; naming the parser avoids a warning
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script"]):
        script.extract()
    text = soup.get_text()
    return text

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print(cleaned)

This works for removing the script.

2015-06-01 htifcs

What is your expected output?

It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on two consecutive spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
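
To see what the whitespace cleanup does on its own, here is a minimal sketch that applies the same three steps to a plain string. The helper name normalize_text and the sample string are made up for illustration, and it assumes the separator between headline fragments is a run of two spaces.

```python
def normalize_text(text):
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on two consecutive spaces, so fragments that were
    # run together on one line end up as separate chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and put each remaining chunk on its own line.
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  THIS IS AN EXAMPLE \n\n\nI need this text captured  And this\n"
print(normalize_text(sample))
```

Splitting on a single space instead would put every word on its own line, which is probably not what you want.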

2015-06-01 03:55:18 jamescampbell

If you want a quick and dirty solution, you can use:

import re
re.sub(r'<[^>]*?>', '', value)

This is the equivalent of strip_tags in PHP. Is that what you want?
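
One caveat worth checking before relying on the regex: it only deletes the tags themselves, so whatever sits between <script> or <style> tags stays in the output. A small sketch (the wrapper name strip_tags and the snippet are just for illustration):

```python
import re


def strip_tags(value):
    # Remove anything that looks like a tag; the text between tags is untouched.
    return re.sub(r'<[^>]*?>', '', value)


snippet = "<style>.call {font-family:Arial;}</style><script>getit</script><h1>And this</h1>"
print(strip_tags(snippet))  # .call {font-family:Arial;}getitAnd this
```

So this is fine for simple markup, but for a page with inline CSS or JavaScript you would still need to remove those elements first.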

2015-06-01 04:05:31 Sanxofon

You can use decompose to completely remove the tags from the document, and the stripped_strings generator to retrieve the tag contents.

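def clean_me(html):
    soup = BeautifulSoup(html, "html.parser")
    # remove script and style elements together with their contents
    for s in soup(['script', 'style']):
        s.decompose()
    # join the remaining text fragments, each stripped of surrounding whitespace
    return ' '.join(soup.stripped_strings)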

>>> clean_me(testhtml) 
'THIS IS AN EXAMPLE I need this text captured And this'
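
In case the difference from the extract() used in the other answers is unclear, here is a rough sketch on a made-up snippet, assuming the built-in html.parser: extract() detaches the element and hands it back, while decompose() destroys it and returns nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for this illustration.
soup = BeautifulSoup("<p>keep</p><script>getit</script><style>.a{}</style>", "html.parser")

# extract() removes the element from the tree and returns it, so it can still be inspected.
removed = soup.script.extract()
print(removed.string)

# decompose() removes the element and destroys it; there is no return value to keep.
result = soup.style.decompose()
print(result)

# Only the paragraph text is left in the document.
print(soup.get_text())
```

Either method works for this cleanup; decompose is the natural choice when you never need the removed elements again.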

2015-06-01 04:21:25 styvane
