
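Newspaper3k: Article scraping & curation

Inspired by requests for its simplicity and powered by lxml for its speed:

"Newspaper is an amazing Python library for extracting and curating articles." -- tweeted by Kenneth Reitz, author of requests
"Newspaper delivers Instapaper-style article extraction." -- The Changelog

Newspaper is a Python 3 library! Or, view our deprecated and buggy Python 2 branch.

A Glance: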

>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)

>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()

>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]

>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'

>>> import newspaper

>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
...     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
...     print(category)

http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...

>>> import requests
>>> from newspaper import fulltext

>>> html = requests.get(...).text
>>> text = fulltext(html)
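
The download, parse, and nlp steps shown above can be wrapped in a small helper. The following is a minimal sketch and not part of Newspaper itself: `summarize_article` and its `article_cls` parameter are hypothetical names introduced here, and the default path assumes newspaper3k and its NLP corpora are installed. Passing a stand-in class makes the helper usable without network access.

```python
def summarize_article(url, article_cls=None):
    """Run download -> parse -> nlp on one URL and collect the main fields.

    `summarize_article` and `article_cls` are hypothetical names for this
    sketch; by default the helper uses newspaper.Article.
    """
    if article_cls is None:
        # Imported lazily so this module loads even without newspaper3k.
        from newspaper import Article
        article_cls = Article

    article = article_cls(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, publish date, text
    article.nlp()       # compute keywords and summary

    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "keywords": article.keywords,
        "summary": article.summary,
    }
```

Keeping the three calls in one place makes the required order explicit: `nlp()` relies on the text produced by `parse()`, which in turn needs the HTML fetched by `download()`.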

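Newspaper can extract and detect languages seamlessly. If no language is specified, Newspaper will attempt to auto-detect the language.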

>>> from newspaper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'

>>> a = Article(url, language='zh')  # Chinese

>>> a.download()
>>> a.parse()

>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑（僭建）问题到立法会接受质询，并向香港民众道歉。
梁振英在星期二（12月10日）的答问大会开始之际
在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的
意图和动机。 一些亲北京阵营议员欢迎梁振英道歉，
且认为应能获得香港民众接受，但这些议员也质问梁振英有

>>> print(a.title)
港特首梁振英就住宅违建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same API :)

>>> import newspaper
>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')

>>> for category in sina_paper.category_urls():
...     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...

>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()

>>> print(article.text)
新浪武汉汽车综合 随着汽车市场的日趋成熟，
传统的“集全家之力抱得爱车归”的全额购车模式已然过时，
另一种轻松的新兴 车模式――金融购车正逐步成为时下消费者购
买爱车最为时尚的消费理念，他们认为，这种新颖的购车
模式既能在短期内
...

>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽
车网_新浪汽车_新浪网

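Support our library

It takes only one click

Docs

Check out the docs for full and detailed guides to using Newspaper.

Interested in adding a new language for us? Refer to: Docs - Adding new languages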

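Features

Multi-threaded article download framework
News URL identification
Text extraction from HTML
Top image extraction from HTML
All image extraction from HTML
Keyword extraction from text
Summary extraction from text
Author extraction from text
Google trending terms extraction
Works in 10+ languages (English, Chinese, German, Arabic, ...)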
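
To illustrate the idea behind the multi-threaded download feature listed above, here is a minimal sketch of fetching several URLs concurrently with a thread pool. It uses only Python's standard library and is not Newspaper's own download framework; `download_all` and `fetch` are hypothetical names, and `fetch` stands in for whatever function retrieves a single page.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return {url: result or exception}.

    `fetch` is any callable taking one URL. A failure on one URL is
    recorded in the result instead of aborting the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads first so they run in parallel.
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                results[url] = exc
    return results
```

Threads suit this workload because downloading is dominated by waiting on the network rather than by computation.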

>>> import newspaper
>>> newspaper.languages()

Your available languages are:
input code      full name

  ar              Arabic
  be              Belarusian
  bg              Bulgarian
  da              Danish
  de              German
  el              Greek
  en              English
  es              Spanish
  et              Estonian
  fa              Persian
  fi              Finnish
  fr              French
  he              Hebrew
  hi              Hindi
  hr              Croatian
  hu              Hungarian
  id              Indonesian
  it              Italian
  ja              Japanese
  ko              Korean
  lt              Lithuanian
  mk              Macedonian
  nb              Norwegian (Bokmål)
  nl              Dutch
  no              Norwegian
  pl              Polish
  pt              Portuguese
  ro              Romanian
  ru              Russian
  sl              Slovenian
  sr              Serbian
  sv              Swedish
  sw              Swahili
  th              Thai
  tr              Turkish
  uk              Ukrainian
  vi              Vietnamese
  zh              Chinese
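
Because the `language` argument expects one of the input codes listed above, it can help to clean up and check a code before passing it on. The following is a minimal sketch under stated assumptions: `KNOWN_LANGUAGES` is a hand-copied subset of the table above, and `normalize_language` is a hypothetical helper, not a Newspaper function.

```python
# A small subset of the codes printed by newspaper.languages() above.
KNOWN_LANGUAGES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "ja": "Japanese",
    "ru": "Russian",
    "zh": "Chinese",
}


def normalize_language(code):
    """Return a cleaned lowercase code, or raise ValueError if unknown."""
    cleaned = code.strip().lower()
    if cleaned not in KNOWN_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return cleaned
```

The returned value can then be passed as `language=` to `Article` or `newspaper.build`, as in the earlier examples.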

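Get it now

Run ✅ pip3 install newspaper3k ✅

NOT ⛔ pip3 install newspaper ⛔

On Python 3 you must install newspaper3k, not newspaper. newspaper is our Python 2 library. Although installing newspaper is simple with pip, you will run into fixable issues if you are trying to install on Ubuntu.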
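
Since both distributions are imported as `newspaper`, checking the import alone does not tell you which one is installed. The sketch below looks up the pip distribution name instead. It assumes Python 3.8 or newer (for `importlib.metadata`); `installed_version` and `describe_install` are hypothetical helper names.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_install():
    """Report which of the two distributions, if any, is installed."""
    if installed_version("newspaper3k"):
        return "newspaper3k is installed"
    if installed_version("newspaper"):
        return "the Python 2 'newspaper' package is installed; install newspaper3k instead"
    return "neither package is installed; run: pip3 install newspaper3k"
```

Calling `print(describe_install())` gives a quick answer before debugging import errors.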

If you are on Debian / Ubuntu, install using the following:

Install the pip3 command needed to install the newspaper3k package:
```
 $ sudo apt-get install python3-pip
```
Python development version, needed for Python.h:
```
 $ sudo apt-get install python-dev
```

lxml requirements:

 $ sudo apt-get install libxml2-dev libxslt-dev

For PIL to recognize .jpg images:

 $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev

NOTE: If you have trouble installing libpng12-dev, try installing libpng-dev instead.

Download the NLP-related corpora:

 $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Install the distribution via pip:
```
 $ pip3 install newspaper3k
```

If you are on OSX, install using the following; you may use either homebrew or macports:

 $ brew install libxml2 libxslt

 $ brew install libtiff libjpeg webp little-cms2

 $ pip3 install newspaper3k

 $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

Otherwise, install with the following:

NOTE: You will still most likely need to install the following libraries via your package manager:

PIL: libjpeg-dev zlib1g-dev libpng12-dev
lxml: libxml2-dev libxslt-dev
Python development version: python-dev

 $ pip3 install newspaper3k

 $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

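Donations

Your donations are greatly appreciated! They will free me up to work on this project more and to take on things like adding new features, bug-fix support, and addressing concerns with the library.

My PayPal link: https://www.paypal.me/codelucas
My Venmo handle: @Lucas-Ou-Yang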

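Development

If you'd like to contribute and hack on the newspaper project, feel free to clone a development version of this repository locally:

 $ git clone git://github.com/codelucas/newspaper.git

Once you have a copy of the source, you can embed it in your Python package, or install it into your site-packages easily:

 $ pip3 install -r requirements.txt
 $ python3 setup.py install

Feel free to give our test suite a shot; everything is mocked!

 $ python3 tests/unit_tests.py

Planning on tweaking our full-text algorithm? Add the fulltext parameter:

 $ python3 tests/unit_tests.py fulltext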

Demo

View a working online demo here: http://newspaper-demo.herokuapp.com

Here is another working online demo: http://newspaper.chinazt.cc/

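License

Authored and maintained by Lucas Ou-Yang.

Parse.ly sponsored some work on newspaper, specifically focused on automatic extraction.

Newspaper uses a lot of python-goose's parsing code. View their license here.

Please feel free to email and contact me if you run into issues or just would like to talk about the future of this library and news extraction in general!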
