obscraper: scrape posts from the overcomingbias blog

obscraper lets you scrape blog posts and associated metadata from the overcomingbias blog.

It’s easy to get a single post:

>>> import obscraper
>>> intro_url = 'https://www.overcomingbias.com/2006/11/introduction.html'
>>> intro = obscraper.get_post_by_url(intro_url)
>>> intro.title
'How To Join'
>>> intro.plaintext
'How can we better believe what is true? ...'
>>> intro.internal_links
{
  'http://www.overcomingbias.com/2007/02/moderate_modera.html': 1,
  'http://www.overcomingbias.com/2006/12/contributors_be.html': 1
}
>>> intro.comments
20
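
Judging by the output above, internal_links maps each linked URL to the number of times it appears in the post. Below is a minimal sketch of combining those counts across several posts, assuming that shape. The total_internal_links helper and the stand-in dictionaries are hypothetical and not part of obscraper:

```python
from collections import Counter


def total_internal_links(link_counts_per_post):
    """Sum per-post {url: count} mappings into one Counter."""
    totals = Counter()
    for link_counts in link_counts_per_post:
        totals.update(link_counts)  # adds the counts for matching URLs
    return totals


# Stand-in data shaped like `intro.internal_links` above (not scraped).
first_post_links = {
    'http://www.overcomingbias.com/2007/02/moderate_modera.html': 1,
    'http://www.overcomingbias.com/2006/12/contributors_be.html': 1,
}
second_post_links = {
    'http://www.overcomingbias.com/2006/12/contributors_be.html': 2,
}

totals = total_internal_links([first_post_links, second_post_links])
most_linked_url, most_linked_count = totals.most_common(1)[0]
```

With real data, the same helper could take the internal_links attribute of each post returned by obscraper.get_post_by_url.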

You can also get a full list of post names and edit dates:

>>> import obscraper
>>> edit_dates = obscraper.get_edit_dates()
...
>>> len(edit_dates)
4352
>>> {name: str(edit_dates[name]) for name in list(edit_dates)[:5]}
{'2022/01/much-talk-is-sales-patter': '2022-01-14 20:46:35+00:00',
 '2022/01/old-man-rant': '2022-01-13 15:21:33+00:00',
 '2022/01/my-11-bets-at-10-1-odds-on-10m-covid-deaths-by-2022': '2022-01-12 19:15:10+00:00',
 '2022/01/to-innovate-unify-or-fragment': '2022-01-11 01:03:44+00:00',
 '2022/01/on-what-is-advice-useful': '2022-01-10 18:46:26+00:00'}
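
The str(...) output above suggests that each edit date is a timezone-aware datetime. Under that assumption, here is a sketch of picking out the posts edited on or after a cutoff. names_edited_since is a hypothetical helper, and the dictionary is a hand-copied stand-in for the result of obscraper.get_edit_dates():

```python
from datetime import datetime, timezone


def names_edited_since(edit_dates, cutoff):
    """Return post names edited at or after `cutoff`, newest first."""
    recent = [name for name, edited in edit_dates.items() if edited >= cutoff]
    return sorted(recent, key=lambda name: edit_dates[name], reverse=True)


# Stand-in for obscraper.get_edit_dates(), copied from the output above.
edit_dates = {
    '2022/01/much-talk-is-sales-patter':
        datetime(2022, 1, 14, 20, 46, 35, tzinfo=timezone.utc),
    '2022/01/old-man-rant':
        datetime(2022, 1, 13, 15, 21, 33, tzinfo=timezone.utc),
    '2022/01/on-what-is-advice-useful':
        datetime(2022, 1, 10, 18, 46, 26, tzinfo=timezone.utc),
}

cutoff = datetime(2022, 1, 12, tzinfo=timezone.utc)
recent_names = names_edited_since(edit_dates, cutoff)
```

The cutoff needs to be timezone-aware as well, since Python raises a TypeError when ordering a naive datetime against an aware one.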

For more on how to use the package, see Getting Started.

Features

  • Get posts by their URLs or edit dates, or get all posts hosted on the overcomingbias site

  • Provides detailed post metadata including post URLs, titles, authors, tags, publish dates, and last edit dates

  • Provides a summary of post content, including the full post text as HTML or plaintext and a list of hyperlinks to other overcomingbias posts

  • Asynchronous execution and caching for fast downloads

  • Use via import obscraper or the simple command line interface

  • Comprehensively tested

  • Supports Python 3.8+

Documentation

See Getting Started for an introduction to the package.

A full reference to the obscraper public API can be found at Public API Reference.

For the full details, check out the well-documented code.

Bugs/Requests

Please use the GitHub issue tracker to submit bugs or request features.

Changelog

See the Changelog for a list of fixes and enhancements in each version.

License

Copyright (c) 2022 Christopher McDonald

Distributed under the terms of the MIT license.

All overcomingbias posts are copyright the original authors.