Installation and Getting Started

Supported Python Versions: 3.8+

Supported Platforms: Linux, Windows, MacOS

This page explains how to install obscraper and run some basic commands.

Install obscraper

Install from PyPI:

$ python -m pip install obscraper
$ python -m pip show obscraper

Alternatively, get the source code:

$ git clone https://github.com/chris-mcdo/obscraper.git

Get a Single Post

Scrape data from a single post:

>>> import obscraper
>>> intro = obscraper.get_post_by_url('https://www.overcomingbias.com/2006/11/introduction.html')
>>> f'{intro.title}, by {intro.author} ({intro.word_count} words)'
'How To Join, by Robin Hanson (263 words)'

The post is represented as a Post object:

>>> type(intro)
<class 'obscraper._post.Post'>
>>> f'{intro.publish_date}'
'2006-11-20 11:00:00+00:00'
>>> intro.tags
['meta']
>>> intro.text_html
'<div class="entry-content">\n<p>How can we better believe what is true? ...'
>>> intro.plaintext
'How can we better believe what is true? ...'

A full list of post attributes can be found in the API Reference.

Get Multiple Posts

get_posts_by_urls and get_posts_by_names let you get multiple posts by their URLs / names. The “name” of a post is its URL with the same-y parts chopped off:

>>> names = [
... '2006/11/quiz_fox_or_hed',
... '2011/04/the-seti-game',
... '2013/10/stories-change-goals',
... ]
>>> posts = obscraper.get_posts_by_names(names)

This returns a dictionary whose keys are the original URLs / names, and whose values are the corresponding Post objects:

>>> type(posts)
<class 'dict'>
>>> [p.title for p in posts.values()]
['Quiz: Fox or Hedgehog?', 'The SETI Game', 'Stories Change Goals']
>>> [p.word_count for p in posts.values()]
[980, 792, 316]

Alternatively, you can get posts by their “last edited” dates:

>>> import datetime
>>> today = datetime.datetime.now(tz=datetime.timezone.utc)
>>> one_year_ago = today - 365 * datetime.timedelta(days=1)
>>> posts = obscraper.get_posts_by_edit_date(start_date=one_year_ago, end_date=today)
>>> len(posts)
142
>>> [p.title for p in posts.values() if p is not None][:5]
['Best Case Contrarians', 'Much Talk Is Sales Patter', 'My Old Man Rant',
'My 11 Bets at 10-1 Odds On 10M Covid deaths by 2022',
'To Innovate, Unify or Fragment?']

Both get_posts_by_urls and get_posts_by_edit_date return a dictionary of labels (URLs / names) and posts. This is the standard format for responses from the obscraper API.

Get All Posts

To get a list of URLs and “last edited” dates for all posts (including some no longer hosted on the overcomingbias site), you can use get_edit_dates:

>>> urls_and_dates = obscraper.get_edit_dates()
>>> len(urls_and_dates)
4353
>>> {url: str(urls_and_dates[url]) for url in list(urls_and_dates)[:5]}
{'2022/01/best-case-contrarians':
'2022-01-16 21:55:04+00:00',
'2022/01/much-talk-is-sales-patter':
'2022-01-14 20:46:35+00:00',
'2022/01/old-man-rant':
'2022-01-13 15:21:33+00:00',
'2022/01/my-11-bets-at-10-1-odds-on-10m-covid-deaths-by-2022':
'2022-01-12 19:15:10+00:00',
'2022/01/to-innovate-unify-or-fragment':
'2022-01-11 01:03:44+00:00'}

You can download all posts indirectly by using get_posts_by_edit_date, or directly using get_all_posts:

>>> all_posts = obscraper.get_all_posts()
>>> len(all_posts)
3702
>>> [p.title for p in all_posts.values() if 'Liability' in p.title]
['Innovation Liability Nightmare', 'Liability Insurance For All',
'Between Property and Liability', 'All Pay Liability',
'Require Legal Liability Insurance', 'For Doc Liability']

This may take a few (<10) minutes.

get_all_posts will send more than 4000 requests to the overcomingbias site, and download ~100MB-1GB of data. get_edit_dates requires only 1 request to the overcomingbias site, so should probably be preferred where possible.

Updating Vote and Comment Counts

Vote and comment counts are collected from separate APIs to the rest of the post data.

They can be updated using get_vote_counts and get_comment_counts:

>>> obscraper.get_vote_counts({'intro': intro.number})
{'intro': 4}
>>> obscraper.get_comment_counts({'intro': intro.disqus_id})
{'intro': 20}

Note

The vote count API appears to be broken for posts published after 2021-03-17.

Representing Post Objects using JSON

To convert a list of Post objects (or just one) to the JSON format, use the PostEncoder class:

>>> import json
>>> intro_json = json.dumps(intro, cls=obscraper.PostEncoder)
>>> intro_json
'{"name": "2006/11/introduction", "number": 18402, ...}'

This is useful when storing posts for later:

>>> write_path = '2006-11-introduction.json'
>>> with open(write_path, mode='w', encoding='utf8') as out_file:
...     json.dump(intro, out_file, cls=obscraper.PostEncoder, indent=4)

Also, the attributes of the post can be examined more easily in a file:

2006-11-introduction.json
{
    "name": "2006/11/introduction",
    "number": 18402,
    "page_type": "post",
    ...
}

To convert the JSON back into a Post object, use the PostDecoder class:

>>> intro_json
'{"name": "2006/11/introduction", "number": 18402, ...}'
>>> intro_decoded = json.loads(intro_json, cls=obscraper.PostDecoder)
>>> type(intro_decoded)
<class 'obscraper._post.Post'>
>>> intro_decoded.title
'How To Join'

Command Line Interface

obscraper also comes with a command line interface:

$ obscraper --dates "November 25, 2016" "November 30, 2016"
Getting posts edited between 2016-11-25 00:00:00+00:00 and 2016-11-30 00:00:00+00:00...
Writing posts to posts.json...
Posts successfully written to file.

You can use the CLI to get posts by their URLs or their edit dates, or to download all posts. By default the results are stored in a posts.json file in the current directory:

posts.json
[
    {
        "url": "https://www.overcomingbias.com/2016/11/myplay.html",
        "post": {
            "name": "2016/11/myplay",
            "number": 31449,
            "page_type": "post",
            ...
        }
    },
    ...
]

To see a full list of commands, use the -h / –help option.

Logging

obscraper uses python’s inbuilt logging library to monitor its activity. This is mainly useful for debugging, but if you want you can see these logs yourself by setting up a logger:

import logging
handler = logging.FileHandler('logs.txt', encoding='utf-8')
logger = logging.getLogger('obscraper')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

names = [
    '2010/08/new-hard-steps-results',
    '2009/02/the-most-important-thing'
]
posts = obscraper.get_posts_by_names(names)

# Close logging file when finished!
handler.close()
logger.removeHandler(handler)
logs.txt
AttributeNotFoundError raised when grabbing post 2009/02/the-most-important-thing
Successfully grabbed post 2010/08/new-hard-steps-results

The urllib3 library - which acts as the HTTP client - also uses logging. You can get its logs by the same method as above.

Caching

By default, obscraper caches recently accessed sites to increase post retrieval speed and reduce the load on the overcomingbias site. This cache can be cleared using clear_cache. You may want to do this if the site has recently been updated, or a post has been added.

Errors and Exceptions

obscraper tries to catch most errors before attempting to download anything. For example:

>>> obscraper.get_post_by_url(12345)
Traceback ...
TypeError: expected URL to be type str, got <class 'int'>
>>> obscraper.get_post_by_url('https://www.overcomingbias.com/blah')
Traceback ...
ValueError: expected URL to be overcomingbias post URL, got
https://www.overcomingbias.com/blah

When a URL is not found on the overcomingbias site, get_post_by_url will raise an InvalidResponseError.

By contrast, get_posts_by_urls will just return None for that particular post:

>>> urls = [
... 'https://www.overcomingbias.com/2006/11/quiz_fox_or_hed.html',
... 'https://www.overcomingbias.com/2011/04/the-seti-game.html',
... 'https://www.overcomingbias.com/2013/10/not-a-real-post.html',
... ]
>>> posts = obscraper.get_posts_by_urls(urls)
>>> posts[urls[0]].title
'Quiz: Fox or Hedgehog?'
>>> posts[urls[2]]
None

The behaviour is similar for get_post_by_name and get_posts_by_names. This is useful when you intend to download many posts, some of which may not exist.

Continue Reading

For more details on the obscraper public API, see the Public API Reference.