Quickstart¶
Eager to get started? This page gives a quick introduction to newspaper. It assumes you already have newspaper installed; if you do not, head over to the Installation section.
Building a news source¶
Source objects are an abstraction of online news media websites like CNN or ESPN. You can initialize them in two different ways.
Building a Source will extract its categories, feeds, articles, brand, and description for you.
You may also seamlessly provide configuration parameters such as language and browser_user_agent. Navigate to the advanced section for details.
>>> import newspaper
>>> cnn_paper = newspaper.build('http://cnn.com')
>>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')
However, if needed, you may also play with the lower level Source object as described in the advanced section.
Extracting articles¶
Every news source has a set of recent articles.
The following examples assume that a news source has been initialized and built.
>>> for article in cnn_paper.articles:
...     print(article.url)
u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
...
>>> print(cnn_paper.size()) # cnn has 3100 articles
3100
Article caching¶
By default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted.
This feature exists to prevent duplicate articles and to increase extraction speed.
>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
1030
>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
2
The return value of cbs_paper.size() changes from 1030 to 2 because when we first crawled cbs we found 1030 articles. However, on our second crawl, we eliminate all articles which have already been crawled. This means 2 new articles have been published since our first extraction.
You may opt out of this feature with the memoize_articles parameter.
You may also pass in the lower level Config objects as covered in the advanced section.
>>> import newspaper
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
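The caching behaviour can be pictured as a plain seen-set of URLs. The sketch below is purely conceptual and is not newspaper's actual implementation; the function name and data structure are illustrative only:

```python
# Conceptual sketch of article memoization: remember which URLs were
# already seen and report only the new ones on each crawl.
# This is NOT newspaper's internal code, just an illustration of the idea.

def filter_new_urls(crawled_urls, seen_urls):
    """Return only URLs not seen before, and mark them as seen."""
    new_urls = [url for url in crawled_urls if url not in seen_urls]
    seen_urls.update(new_urls)
    return new_urls

seen = set()
first_crawl = ['http://cbs.com/a', 'http://cbs.com/b']
second_crawl = ['http://cbs.com/a', 'http://cbs.com/b', 'http://cbs.com/c']

print(len(filter_new_urls(first_crawl, seen)))   # all articles are new
print(len(filter_new_urls(second_crawl, seen)))  # only the unseen one remains
```

This is why the second build reports only the articles published since the previous crawl.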
Extracting Source categories¶
>>> for category in cnn_paper.category_urls():
...     print(category)
u'http://lifestyle.cnn.com'
u'http://cnn.com/world'
u'http://tech.cnn.com'
...
Extracting Source feeds¶
>>> for feed_url in cnn_paper.feed_urls():
...     print(feed_url)
u'http://rss.cnn.com/rss/cnn_crime.rss'
u'http://rss.cnn.com/rss/cnn_tech.rss'
...
Extracting Source brand & description¶
>>> print(cnn_paper.brand)
u'cnn'
>>> print(cnn_paper.description)
u'CNN.com delivers the latest breaking news and information on the latest...'
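The brand is essentially the site's second-level domain name. As a rough illustration of where such a value could come from, here is a standard-library sketch; this is an assumption for demonstration, not how newspaper computes brand internally:

```python
# Illustrative only: derive a 'brand'-like string from a URL's domain.
from urllib.parse import urlparse

def guess_brand(url):
    """Take the second-level domain of a URL as the site's brand."""
    netloc = urlparse(url).netloc         # e.g. 'www.cnn.com'
    parts = netloc.split('.')
    return parts[-2] if len(parts) >= 2 else netloc

print(guess_brand('http://cnn.com'))          # cnn
print(guess_brand('http://www.lemonde.fr'))   # lemonde
```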
News Articles¶
Article objects are abstractions of news articles. For example, a news Source would be CNN, while a news Article would be a specific CNN article. You may reference an Article from an existing news Source or initialize one by itself.
Referencing it from a Source:
>>> first_article = cnn_paper.articles[0]
Initializing an Article by itself:
>>> from newspaper import Article
>>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')
Note the similar language= named parameter above. All the config parameters described for Source objects also apply to Article objects! Source and Article objects have a very similar API.
There are endless possibilities on how we can manipulate and build articles.
Downloading an Article¶
We begin by calling download() on an article. If you are interested in how to quickly download articles concurrently with multi-threading, check out the advanced section.
>>> first_article = cnn_paper.articles[0]
>>> first_article.download()
>>> print(first_article.html)
u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> print(cnn_paper.articles[7].html)
u''  # fails, not downloaded yet
Parsing an Article¶
You may also extract meaningful content from the html, like authors and body text. You must have called download() on an article before calling parse().
>>> first_article.parse()
>>> print(first_article.text)
u'Three sisters who were imprisoned for possibly...'
>>> print(first_article.top_image)
u'http://some.cdn.com/3424hfd4565sdfgdg436/'
>>> print(first_article.authors)
[u'Eliott C. McLaughlin', u'Some CoAuthor']
>>> print(first_article.title)
u'Police: 3 sisters imprisoned in Tucson home'
>>> print(first_article.images)
['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]
>>> print(first_article.movies)
['url_to_youtube_link_1', ...] # youtube, vimeo, etc
Performing NLP on an Article¶
Finally, you may extract natural language properties from the text. You must have called both download() and parse() on the article before calling nlp().
As of the current build, nlp() features only work on western languages.
>>> first_article.nlp()
>>> print(first_article.summary)
u'...imprisoned for possibly a constant barrage...'
>>> print(first_article.keywords)
[u'music', u'Tucson', ... ]
>>> cnn_paper.articles[100].nlp()  # fails, not downloaded yet
Traceback (...
ArticleException: You must parse an article before you try to...
nlp() is expensive, as is parse(), so make sure you actually need them before calling them on all of your articles! In some cases, if you just need urls, even download() is not necessary.
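The required download() → parse() → nlp() ordering can be sketched as a tiny state guard. This is a conceptual illustration only; the class and exception names below are hypothetical and do not come from newspaper itself:

```python
# Conceptual sketch of the download -> parse -> nlp call order.
# FakeArticle and OrderError are illustrative names, not newspaper's API.

class OrderError(Exception):
    pass

class FakeArticle:
    def __init__(self):
        self.downloaded = False
        self.parsed = False

    def download(self):
        self.downloaded = True

    def parse(self):
        if not self.downloaded:
            raise OrderError('call download() before parse()')
        self.parsed = True

    def nlp(self):
        if not self.parsed:
            raise OrderError('call parse() before nlp()')
        return 'summary/keywords computed here'

article = FakeArticle()
article.download()
article.parse()
print(article.nlp())  # succeeds because the order was respected
```

Calling nlp() before parse() on such an object raises an error, mirroring the ArticleException shown above.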
Easter Eggs¶
Here are random but hopefully useful features! hot() returns a list of the top trending terms on Google using a public api. popular_urls() returns a list of popular news source urls, in case you need help choosing a news source!
>>> import newspaper
>>> newspaper.hot()
['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ... ]
>>> newspaper.popular_urls()
['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ... ]
>>> newspaper.languages()
Your available languages are:
input code full name
ar Arabic
de German
en English
es Spanish
fr French
he Hebrew
it Italian
ko Korean
no Norwegian
pt Portuguese
sv Swedish
zh Chinese