.. _quickstart:

Quickstart
==========

Eager to get started? This page gives a good introduction to getting started
with newspaper. It assumes you already have newspaper installed. If you do
not, head over to the :ref:`Installation ` section.

Building a news source
----------------------

Source objects are an abstraction of online news media websites like CNN or ESPN.
You can initialize them in two *different* ways.

Building a ``Source`` will extract its categories, feeds, articles, brand, and
description for you.

You may also provide configuration parameters such as ``language`` and
``browser_user_agent``. Navigate to the :ref:`advanced ` section for details.

.. code-block:: pycon

    >>> import newspaper
    >>> cnn_paper = newspaper.build('http://cnn.com')

    >>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')

However, if needed, you may also work with the lower-level ``Source`` object as
described in the :ref:`advanced ` section.

Extracting articles
-------------------

Every news source has a set of *recent* articles.

The following examples assume that a news source has been
initialized and built.

.. code-block:: pycon

    >>> for article in cnn_paper.articles:
    ...     print(article.url)
    ...
    u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
    u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
    ...

    >>> print(cnn_paper.size())  # cnn has 3100 articles
    3100

Article caching
---------------

By default, newspaper caches all previously extracted articles and **eliminates any
article which it has already extracted**.

This feature exists to prevent duplicate articles and to increase extraction speed.

.. code-block:: pycon

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    2

The return value of ``cbs_paper.size()`` changes from 1030 to 2 because when we first
crawled cbs we found 1030 articles.
However, on our second crawl, we eliminate all articles that have already been
crawled. This means **2** new articles have been published since our first
extraction.

You may opt out of this feature with the ``memoize_articles`` parameter.

You may also pass in the lower-level ``Config`` objects as covered in the
:ref:`advanced ` section.

.. code-block:: pycon

    >>> import newspaper

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

Extracting Source categories
----------------------------

.. code-block:: pycon

    >>> for category in cnn_paper.category_urls():
    ...     print(category)
    ...
    u'http://lifestyle.cnn.com'
    u'http://cnn.com/world'
    u'http://tech.cnn.com'
    ...

Extracting Source feeds
-----------------------

.. code-block:: pycon

    >>> for feed_url in cnn_paper.feed_urls():
    ...     print(feed_url)
    ...
    u'http://rss.cnn.com/rss/cnn_crime.rss'
    u'http://rss.cnn.com/rss/cnn_tech.rss'
    ...

Extracting Source brand & description
-------------------------------------

.. code-block:: pycon

    >>> print(cnn_paper.brand)
    u'cnn'

    >>> print(cnn_paper.description)
    u'CNN.com delivers the latest breaking news and information on the latest...'

News Articles
-------------

Article objects are abstractions of news articles. For example, a news ``Source``
would be CNN while a news ``Article`` would be a specific CNN article.
You may reference an ``Article`` from an existing news ``Source`` or initialize
one by itself.

Referencing it from a ``Source``.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

Initializing an ``Article`` by itself.

.. code-block:: pycon

    >>> from newspaper import Article
    >>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')

Note the similar ``language=`` named parameter above. All the config parameters
described for ``Source`` objects also apply to ``Article`` objects!
**Source and Article objects have a very similar API**.

Initializing an ``Article`` while ignoring particular content types.

You can skip loading articles with a particular content type, which is useful
when you want to avoid delays caused by large PDF resources.
A default ``html`` value can be provided for each such content type and then
used to identify the actual content type of the article.

.. code-block:: pycon

    >>> from newspaper import Article
    >>> pdf_defaults = {"application/pdf": "%PDF-",
    ...                 "application/x-pdf": "%PDF-",
    ...                 "application/x-bzpdf": "%PDF-",
    ...                 "application/x-gzpdf": "%PDF-"}
    >>> pdf_article = Article(url='https://www.adobe.com/pdf/pdfs/ISO32000-1PublicPatentLicense.pdf',
    ...                       ignored_content_types_defaults=pdf_defaults)
    >>> pdf_article.download()
    >>> print(pdf_article.html)
    %PDF-

There are many ways to build and manipulate articles.

Downloading an Article
----------------------

We begin by calling ``download()`` on an article. If you are interested in how to
quickly download articles concurrently with multi-threading, check out the
:ref:`advanced ` section.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

    >>> first_article.download()

    >>> print(first_article.html)
    u'