Quickstart¶
Eager to get started? This page gives a quick introduction to newspaper. It assumes you already have newspaper installed; if you do not, head over to the Installation section.
Building a news source¶
Source objects are an abstraction of online news media websites like CNN or ESPN. You can initialize them in two different ways.
Building a Source will extract its categories, feeds, articles, brand, and description for you.
You may also seamlessly provide configuration parameters such as language and browser_user_agent. Navigate to the advanced section for details.
>>> import newspaper
>>> cnn_paper = newspaper.build('http://cnn.com')
>>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')
However, if needed, you may also play with the lower level Source object as described in the advanced section.
Extracting articles¶
Every news source has a set of recent articles.
The following examples assume that a news source has been initialized and built.
>>> for article in cnn_paper.articles:
...     print(article.url)
u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
...
>>> print(cnn_paper.size()) # cnn has 3100 articles
3100
Article caching¶
By default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted.
This feature exists to prevent duplicate articles and to increase extraction speed.
>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
1030
>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
2
The return value of cbs_paper.size() changes from 1030 to 2 because when we first crawled cbs we found 1030 articles. However, on our second crawl, we eliminate all articles which have already been crawled. This means 2 new articles have been published since our first extraction.
You may opt out of this feature with the memoize_articles parameter.
You may also pass in the lower level Config objects as covered in the advanced section.
>>> import newspaper
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
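The caching behaviour can be pictured as a plain seen-set of URLs. The sketch below is purely conceptual and is not newspaper's actual implementation; the function name and data structure are illustrative only:

```python
# Conceptual sketch of article memoization: remember which URLs were
# already seen and report only the new ones on each crawl.
# This is NOT newspaper's internal code, just an illustration of the idea.

def filter_new_urls(crawled_urls, seen_urls):
    """Return only URLs not seen before, and mark them as seen."""
    new_urls = [url for url in crawled_urls if url not in seen_urls]
    seen_urls.update(new_urls)
    return new_urls

seen = set()
first_crawl = ['http://cbs.com/a', 'http://cbs.com/b']
second_crawl = ['http://cbs.com/a', 'http://cbs.com/b', 'http://cbs.com/c']

print(len(filter_new_urls(first_crawl, seen)))   # all articles are new
print(len(filter_new_urls(second_crawl, seen)))  # only the unseen one remains
```

This is why the second build reports only the articles published since the previous crawl.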
Extracting Source categories¶
>>> for category in cnn_paper.category_urls():
...     print(category)
u'http://lifestyle.cnn.com'
u'http://cnn.com/world'
u'http://tech.cnn.com'
...
Extracting Source feeds¶
>>> for feed_url in cnn_paper.feed_urls():
...     print(feed_url)
u'http://rss.cnn.com/rss/cnn_crime.rss'
u'http://rss.cnn.com/rss/cnn_tech.rss'
...
Extracting Source brand & description¶
>>> print(cnn_paper.brand)
u'cnn'
>>> print(cnn_paper.description)
u'CNN.com delivers the latest breaking news and information on the latest...'
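The brand is essentially the site's second-level domain name. As a rough illustration of where such a value could come from, here is a standard-library sketch; this is an assumption for demonstration, not how newspaper computes brand internally:

```python
# Illustrative only: derive a 'brand'-like string from a URL's domain.
from urllib.parse import urlparse

def guess_brand(url):
    """Take the second-level domain of a URL as the site's brand."""
    netloc = urlparse(url).netloc         # e.g. 'www.cnn.com'
    parts = netloc.split('.')
    return parts[-2] if len(parts) >= 2 else netloc

print(guess_brand('http://cnn.com'))          # cnn
print(guess_brand('http://www.lemonde.fr'))   # lemonde
```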
News Articles¶
Article objects are abstractions of news articles. For example, a news Source would be CNN, while a news Article would be a specific CNN article. You may reference an Article from an existing news Source or initialize one by itself.
Referencing it from a Source:
>>> first_article = cnn_paper.articles[0]
Initializing an Article by itself:
>>> from newspaper import Article
>>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')
Note the similar language= named parameter above. All the config parameters described for Source objects also apply to Article objects! Source and Article objects have a very similar API.
There are endless possibilities on how we can manipulate and build articles.
Downloading an Article¶
We begin by calling download() on an article. If you are interested in how to quickly download articles concurrently with multi-threading, check out the advanced section.
>>> first_article = cnn_paper.articles[0]
>>> first_article.download()
>>> print(first_article.html)
u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> print(cnn_paper.articles[7].html)
u''  # fails, not downloaded yet
Parsing an Article¶
You may also extract meaningful content from the html, like authors and body text. You must have called download() on an article before calling parse().
>>> first_article.parse()
>>> print(first_article.text)
u'Three sisters who were imprisoned for possibly...'
>>> print(first_article.top_image)
u'http://some.cdn.com/3424hfd4565sdfgdg436/'
>>> print(first_article.authors)
[u'Eliott C. McLaughlin', u'Some CoAuthor']
>>> print(first_article.title)
u'Police: 3 sisters imprisoned in Tucson home'
>>> print(first_article.images)
['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]
>>> print(first_article.movies)
['url_to_youtube_link_1', ...] # youtube, vimeo, etc
Performing NLP on an Article¶
Finally, you may extract natural language properties from the text. You must have called both download() and parse() on the article before calling nlp().
As of the current build, nlp() features only work on western languages.
>>> first_article.nlp()
>>> print(first_article.summary)
u'...imprisoned for possibly a constant barrage...'
>>> print(first_article.keywords)
[u'music', u'Tucson', ... ]
>>> cnn_paper.articles[100].nlp()  # fails, not downloaded yet
Traceback (...
ArticleException: You must parse an article before you try to...
nlp() is expensive, as is parse(), so make sure you actually need them before calling them on all of your articles! In some cases, if you just need urls, even download() is not necessary.
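The required download() → parse() → nlp() ordering can be sketched as a tiny state guard. This is a conceptual illustration only; the class and exception names below are hypothetical and do not come from newspaper itself:

```python
# Conceptual sketch of the download -> parse -> nlp call order.
# FakeArticle and OrderError are illustrative names, not newspaper's API.

class OrderError(Exception):
    pass

class FakeArticle:
    def __init__(self):
        self.downloaded = False
        self.parsed = False

    def download(self):
        self.downloaded = True

    def parse(self):
        if not self.downloaded:
            raise OrderError('call download() before parse()')
        self.parsed = True

    def nlp(self):
        if not self.parsed:
            raise OrderError('call parse() before nlp()')
        return 'summary/keywords computed here'

article = FakeArticle()
article.download()
article.parse()
print(article.nlp())  # succeeds because the order was respected
```

Calling nlp() before parse() on such an object raises an error, mirroring the ArticleException shown above.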
Easter Eggs¶
Here are random but hopefully useful features! hot() returns a list of the top trending terms on Google using a public api. popular_urls() returns a list of popular news source urls, in case you need help choosing a news source!
>>> import newspaper
>>> newspaper.hot()
['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ... ]
>>> newspaper.popular_urls()
['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ... ]
>>> newspaper.languages()
Your available languages are:
input code full name
ar Arabic
de German
en English
es Spanish
fr French
he Hebrew
it Italian
ko Korean
no Norwegian
pt Portuguese
sv Swedish
zh Chinese