Esta biblioteca pretende tornar a análise de HTML (por exemplo, raspagem da web) o mais simples e intuitiva possível.
Ao usar esta biblioteca você obtém automaticamente:
Faça uma solicitação GET para 'python.org', usando Requests:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get( ' https://python.org/ ' )
Experimente o assíncrono e obtenha alguns sites ao mesmo tempo:
>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg ():
... r = await asession.get( ' https://python.org/ ' )
... return r
...
>>> async def get_reddit ():
... r = await asession.get( ' https://reddit.com/ ' )
... return r
...
>>> async def get_google ():
... r = await asession.get( ' https://google.com/ ' )
... return r
...
>>> results = asession.run(get_pythonorg, get_reddit, get_google)
>>> results # check the requests all returned a 200 (success) code
[<Response [200]>, <Response [200]>, <Response [200]>]
>>> # Each item in the results list is a response object and can be interacted with as such
>>> for result in results:
... print (result.html.url)
...
https://www.python.org/
https://www.google.com/
https://www.reddit.com/
Observe que a ordem dos objetos na lista de resultados representa a ordem em que foram retornados, não a ordem em que as corrotinas são passadas para o método run
, que é mostrado no exemplo pela ordem ser diferente.
Obtenha uma lista de todos os links da página, como estão (excluindo âncoras):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
Obtenha uma lista de todos os links da página, em formato absoluto (excluindo âncoras):
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
Selecione um elemento com um seletor CSS:
>>> about = r.html.find( ' #about ' , first = True )
Pegue o conteúdo de texto de um elemento:
>>> print (about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Faça uma introspecção nos atributos de um elemento:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Renderize o HTML de um elemento:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">n<a class="" href="/about/" title="">About</a>n<ul aria-hidden="true" class="subnav menu" role="menu">n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>n</ul>n</li>'
Selecione Elementos dentro de Elementos:
>>> about.find( ' a ' )
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
Pesquise links dentro de um elemento:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Pesquise texto na página:
>>> r.html.search( ' Python is a {} language ' )[ 0 ]
programming
Exemplo mais complexo de seletor CSS (copiado das ferramentas de desenvolvimento do Chrome):
>>> r = session.get( ' https://github.com/ ' )
>>> sel = ' body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p '
>>> print (r.html.find(sel, first = True ).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath também é compatível:
>>> r.html.xpath( ' /html/body/div[1]/a ' )
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
Vamos pegar um texto renderizado por JavaScript. Até 2020, o relógio de contagem regressiva do Python 2.7 (https://pythonclock.org) servirá como uma boa página de teste:
>>> r = session.get( ' https://pythonclock.org ' )
Vamos tentar ver o código renderizado dinamicamente (o relógio de contagem regressiva). Para fazer isso rapidamente, a princípio, pesquisaremos entre o último texto que vemos antes dele ('Python 2.7 irá se aposentar em...') e o primeiro texto que vemos depois dele ('Enable Guido Mode').
>>> r.html.search( ' Python 2.7 will retire in... {} Enable Guido Mode ' )[ 0 ]
'</h1>n </div>n <div class="python-27-clock"></div>n <div class="center">n <div class="guido-button-block">n <button class="js-guido-mode guido-button">'
Observe que o relógio está faltando. O método render()
pega a resposta e renderiza o conteúdo dinâmico exatamente como um navegador da web faria.
>>> r.html.render()
>>> r.html.search( ' Python 2.7 will retire in... {} Enable Guido Mode ' )[ 0 ]
'</h1>n </div>n <div class="python-27-clock is-countdown"><span class="countdown-row countdown-show6"><span class="countdown-section"><span class="countdown-amount">1</span><span class="countdown-period">Year</span></span><span class="countdown-section"><span class="countdown-amount">2</span><span class="countdown-period">Months</span></span><span class="countdown-section"><span class="countdown-amount">28</span><span class="countdown-period">Days</span></span><span class="countdown-section"><span class="countdown-amount">16</span><span class="countdown-period">Hours</span></span><span class="countdown-section"><span class="countdown-amount">52</span><span class="countdown-period">Minutes</span></span><span class="countdown-section"><span class="countdown-amount">46</span><span class="countdown-period">Seconds</span></span></span></div>n <div class="center">n <div class="guido-button-block">n <button class="js-guido-mode guido-button">'
Vamos limpar um pouco. Esta etapa não é necessária, apenas torna um pouco mais fácil visualizar o html retornado para ver o que precisamos direcionar para extrair as informações necessárias.
>>> from pprint import pprint
>>> pprint(r.html.search('Python 2.7 will retire in...{}Enable')[0])
('</h1>n'
' </div>n'
' <div class="python-27-clock is-countdown"><span class="countdown-row '
'countdown-show6"><span class="countdown-section"><span '
'class="countdown-amount">1</span><span '
'class="countdown-period">Year</span></span><span '
'class="countdown-section"><span class="countdown-amount">2</span><span '
'class="countdown-period">Months</span></span><span '
'class="countdown-section"><span class="countdown-amount">28</span><span '
'class="countdown-period">Days</span></span><span '
'class="countdown-section"><span class="countdown-amount">16</span><span '
'class="countdown-period">Hours</span></span><span '
'class="countdown-section"><span class="countdown-amount">52</span><span '
'class="countdown-period">Minutes</span></span><span '
'class="countdown-section"><span class="countdown-amount">46</span><span '
'class="countdown-period">Seconds</span></span></span></div>n'
' <div class="center">n'
' <div class="guido-button-block">n'
' <button class="js-guido-mode guido-button">')
O html renderizado possui todos os mesmos métodos e atributos acima. Vamos extrair apenas os dados que queremos do relógio para algo fácil de usar em outro lugar e introspectivo como um dicionário.
>>> periods = [element.text for element in r.html.find( ' .countdown-period ' )]
>>> amounts = [element.text for element in r.html.find( ' .countdown-amount ' )]
>>> countdown_data = dict ( zip (periods, amounts))
>>> countdown_data
{'Year': '1', 'Months': '2', 'Days': '5', 'Hours': '23', 'Minutes': '34', 'Seconds': '37'}
Ou você também pode fazer isso de forma assíncrona:
>>> async def get_pyclock ():
... r = await asession.get( ' https://pythonclock.org/ ' )
... await r.html.arender()
... return r
...
>>> results = asession.run(get_pyclock, get_pyclock, get_pyclock)
O restante do código funciona da mesma maneira que a versão síncrona, exceto que results
são uma lista contendo vários objetos de resposta; no entanto, os mesmos processos básicos podem ser aplicados acima para extrair os dados desejados.
Observe que na primeira vez que você executar o método render()
, ele fará o download do Chromium em seu diretório inicial (por exemplo, ~/.pyppeteer/
). Isso só acontece uma vez.
Você também pode usar esta biblioteca sem solicitações:
>>> from requests_html import HTML
>>> doc = """ <a href='https://httpbin.org'> """
>>> html = HTML( html = doc)
>>> html.links
{'https://httpbin.org'}
$ pipenv install requests-html
?
Somente Python 3.6 e superior são suportados.