Public API Reference

This page contains a full reference to obscraper’s public API.

Objects

Post

class obscraper.Post(name: str, number: int, page_type: str, page_status: str, page_format: str, title: str, author: str, publish_date: datetime, tags: List[str], categories: List[str], text_html: str, word_count: int, internal_links: List[str], external_links: List[str], disqus_id: Optional[str], votes: Optional[int] = None, comments: Optional[int] = None, edit_date: Optional[datetime] = None)

Class representing a single post.

name

The original year, month and abbreviated name of the post, as found in its url. E.g. ‘2010/09/jobs-explain-lots’.

Type

str

number

The unique integer identifier of the post.

Type

int

page_type

Page type, normally ‘post’. The exact definition of this field is not known.

Type

str

page_status

Page status, normally ‘publish’. The exact definition of this field is not known.

Type

str

page_format

Page format, normally ‘standard’. The exact definition of this field is not known.

Type

str

title

The title of the post, as seen on the page. E.g. ‘Jobs Explain Lots’.

Type

str

author

The name of the author of the post. E.g. ‘Robin Hanson’.

Type

str

publish_date

The (aware) datetime when the post was first published, according to the post page.

Type

datetime.datetime

tags

A list of tags associated with the post.

Type

List[str]

categories

A list of categories associated with the post.

Type

List[str]

text_html

The full text of the post in HTML format.

Type

str

word_count

The number of words in the body of the post.

Type

int

internal_links

List of hyperlinks to other posts. May contain duplicates.

Type

List[str]

external_links

List of hyperlinks to external webpages. May contain duplicates.

Type

List[str]

disqus_id

A string which uniquely identifies the post to the Disqus comment count API.

Type

str | None

votes

The number of votes the post has received.

Type

int, optional

comments

The number of comments on the post.

Type

int, optional

edit_date

The (aware) datetime when the post was last edited, according to the sitemap.

Type

datetime.datetime, optional

property plaintext

The full text of the post in plaintext format.

Type

str

property url

The URL of the post.

Type

str
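Because votes, comments and edit_date are optional, code that reads them needs to handle None. The following is a minimal sketch, assuming a Post-like object with the attribute names listed above; the summarize helper and the example values (including the word count) are invented for illustration and are not part of obscraper.

```python
from types import SimpleNamespace


def summarize(post):
    """Return a one-line summary of a Post-like object."""
    # votes and comments are Optional[int], so substitute a placeholder for None.
    votes = "n/a" if post.votes is None else post.votes
    comments = "n/a" if post.comments is None else post.comments
    return (
        f"{post.title} by {post.author}: "
        f"{post.word_count} words, {votes} votes, {comments} comments"
    )


# Stand-in with the same attribute names as obscraper.Post (values are invented).
example = SimpleNamespace(
    title="Jobs Explain Lots",
    author="Robin Hanson",
    word_count=500,
    votes=None,
    comments=12,
)
print(summarize(example))
```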

Functions

get_posts_by_names

obscraper.get_posts_by_names(names)

Get dict of posts identified by their names.

No exceptions are raised if a post or post attribute is not found; instead, “None” is returned for that post.

Parameters

names (List[str]) – A list of overcomingbias post names to scrape data for.

Returns

A dictionary whose keys are the inputted names and whose values are the corresponding posts.

Return type

Dict[str, obscraper.Post]

Raises

ValueError – If any of the input names are not valid overcomingbias post names.
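Since missing posts come back as “None” rather than raising, callers will usually want to separate the posts that were found from those that were not. A minimal sketch follows; split_found_missing and fetch are hypothetical helper names, and only obscraper.get_posts_by_names comes from the API described above.

```python
def split_found_missing(posts):
    """Split a name -> post dict into found posts and names that returned None."""
    found = {name: post for name, post in posts.items() if post is not None}
    missing = [name for name, post in posts.items() if post is None]
    return found, missing


def fetch(names):
    """Scrape the named posts; requires obscraper and network access."""
    import obscraper  # imported lazily so split_found_missing works without the package

    return split_found_missing(obscraper.get_posts_by_names(names))
```

Calling fetch(['2010/09/jobs-explain-lots']) would then return the found posts and a list of any names that could not be retrieved.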

get_vote_counts

obscraper.get_vote_counts(numbers_dict)

Get vote counts for some posts.

Unlike other functions, get_vote_counts returns 0 (rather than None) when a post is not found. This is because the vote count API returns a vote count of 0 for posts that do not exist, so it is impossible to tell whether a post is missing or simply has zero votes.

Parameters

numbers_dict (Dict[str, int]) – Dictionary whose keys are arbitrary labels (e.g. the post URLs) and whose values are post numbers to get votes for.

Returns

A dictionary whose keys are the inputted labels and whose values are the corresponding vote counts (int). The vote count is 0 if the post is not found.

Return type

Dict[str, int]

Raises

ValueError – If any of the input post numbers are not valid.

get_comment_counts

obscraper.get_comment_counts(disqus_ids_dict)

Get comment counts for some posts.

If no comment count is found, “None” is returned.

Parameters

disqus_ids_dict (Dict[str, str | None]) – Dictionary whose keys are arbitrary labels (e.g. the post URLs) and whose values are the corresponding Disqus ID strings.

Returns

A dictionary whose keys are the inputted labels and whose values are the corresponding comment counts. The comment count is “None” if the post is not found.

Return type

Dict[str, int | None]

Raises

ValueError – If any of the input Disqus IDs are not valid.
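Both count functions take a dictionary keyed by arbitrary labels. One way to build those inputs from already-scraped posts, using the post URL as the label, is sketched below. The helpers build_count_inputs and fetch_counts are hypothetical, and the stand-in values are invented purely to show the dictionary shapes.

```python
from types import SimpleNamespace


def build_count_inputs(posts):
    """Build the two input dicts, labelled by post URL, from a dict of posts."""
    numbers, disqus_ids = {}, {}
    for post in posts.values():
        if post is None:  # post could not be retrieved, so there is nothing to look up
            continue
        numbers[post.url] = post.number
        disqus_ids[post.url] = post.disqus_id  # may be None; the parameter type permits it
    return numbers, disqus_ids


def fetch_counts(posts):
    """Query both count functions; requires obscraper and network access."""
    import obscraper  # imported lazily so build_count_inputs works without the package

    numbers, disqus_ids = build_count_inputs(posts)
    return obscraper.get_vote_counts(numbers), obscraper.get_comment_counts(disqus_ids)


# Stand-in objects with invented values, used only to show the dict shapes.
posts = {
    "a": SimpleNamespace(url="https://example.org/a", number=101, disqus_id="example-disqus-id"),
    "b": None,
}
numbers, disqus_ids = build_count_inputs(posts)
print(numbers)
print(disqus_ids)
```

Keep in mind the asymmetry described above: a vote count of 0 may mean either "no votes" or "post not found", whereas a missing comment count is reported as “None”.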

get_edit_dates

obscraper.get_edit_dates()

Get a dict of post edit dates.

Returns

Dictionary whose keys are post names and values are the last edit dates of each post as “aware” datetime.datetime objects.

Return type

Dict[str, datetime.datetime]

get_all_posts

obscraper.get_all_posts()

Get all posts hosted on the overcomingbias site.

This includes vote and comment counts for each post, and their last edit dates.

Posts which are no longer hosted on the overcomingbias site are returned as “None”.

Returns

A dictionary whose keys are post names and whose values are the corresponding posts.

Return type

Dict[str, obscraper.Post]

get_posts_by_edit_date

obscraper.get_posts_by_edit_date(start_date, end_date)

Get posts edited within a given date range.

Parameters
  • start_date (datetime.datetime) – The start of the date range, as an “aware” datetime.

  • end_date (datetime.datetime) – The end of the date range, as an “aware” datetime.

Returns

A dictionary whose keys are the URLs of posts edited within the date range, and whose values are the corresponding posts.

Return type

Dict[str, obscraper.Post]

Raises

ValueError – If start_date is after end_date.
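Both arguments must be “aware” datetimes, i.e. carry timezone information. The sketch below shows one way to build such a range with the standard library; date_range_ending and fetch_recently_edited are hypothetical helper names, and the choice of UTC is an assumption rather than a requirement stated by the API.

```python
from datetime import datetime, timedelta, timezone


def date_range_ending(end, days):
    """Return (start, end) covering the `days` days up to an aware `end`."""
    if end.tzinfo is None or end.utcoffset() is None:
        raise ValueError("end must be an aware datetime")
    return end - timedelta(days=days), end


def fetch_recently_edited(days=7):
    """Fetch posts edited in the last `days` days; requires obscraper and network access."""
    import obscraper  # imported lazily so date_range_ending works without the package

    start, end = date_range_ending(datetime.now(timezone.utc), days)
    return obscraper.get_posts_by_edit_date(start, end)


start, end = date_range_ending(datetime(2021, 6, 30, tzinfo=timezone.utc), 7)
print(start.isoformat(), end.isoformat())
```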

get_posts_by_urls

obscraper.get_posts_by_urls(urls)

Get dict of posts identified by their URLs.

“None” is returned if a post could not be retrieved.

Parameters

urls (List[str]) – A list of overcomingbias post URLs to scrape data for.

Returns

A dictionary whose keys are the inputted URLs and whose values are the corresponding posts.

Return type

Dict[str, obscraper.Post]

Raises

ValueError – If any of the input URLs are not valid overcomingbias post URLs.

get_post_by_name

obscraper.get_post_by_name(name)

Get a single post by its name.

Parameters

name (str) – An overcomingbias post name, e.g. ‘2010/09/jobs-explain-lots’.

Returns

The Post corresponding to the input name.

Return type

obscraper.Post

Raises

get_post_by_url

obscraper.get_post_by_url(url)

Get a single post by its URL.

Parameters

url (str) – An overcomingbias post URL.

Returns

The Post corresponding to the input URL.

Return type

obscraper.Post

Raises

clear_cache

obscraper.clear_cache()

Clear all cached data.

url_to_name

obscraper.url_to_name(post_url)

Get the name of a post from its URL.

Parameters

post_url (str) – The URL of the post.

Returns

name – Name of the post, e.g. ‘2006/11/introduction’.

Return type

str

Raises

ValueError – If the input URL is not a valid overcomingbias post URL.

name_to_url

obscraper.name_to_url(post_name)

Get the URL of a post from its name.

Parameters

post_name (str) – The name of the post. E.g. ‘2010/09/jobs-explain-lots’.

Returns

url – The URL of the post.

Return type

str

Raises

ValueError – If the input name is not a valid overcomingbias post name.

Serializers

PostEncoder

obscraper.PostEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Encode obscraper.Post object to JSON.

All obscraper.Post attributes and properties (including post plaintext and URL) are included in the serialized object.

Inherits from json.JSONEncoder.

PostDecoder

obscraper.PostDecoder()

Decode an obscraper.Post object from JSON.

Inherits from json.JSONDecoder, implementing a special object_hook to deserialize obscraper.Post objects.
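The encoder/decoder pattern described above (a json.JSONEncoder subclass paired with a json.JSONDecoder that installs an object_hook) can be illustrated with a toy class. Everything named Mini* below is hypothetical and much simpler than obscraper.Post; the "_type" marker key and the publish date are likewise invented. This sketch shows the general technique only and is not obscraper's actual implementation.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MiniPost:
    """Hypothetical stand-in for obscraper.Post with only two fields."""
    name: str
    publish_date: datetime


class MiniPostEncoder(json.JSONEncoder):
    def default(self, o):
        # Called by json only for objects it cannot serialize natively.
        if isinstance(o, MiniPost):
            return {
                "_type": "MiniPost",
                "name": o.name,
                "publish_date": o.publish_date.isoformat(),
            }
        return super().default(o)


class MiniPostDecoder(json.JSONDecoder):
    def __init__(self):
        super().__init__(object_hook=self._hook)

    @staticmethod
    def _hook(d):
        # Rebuild a MiniPost from any JSON object carrying the marker key.
        if d.get("_type") == "MiniPost":
            return MiniPost(d["name"], datetime.fromisoformat(d["publish_date"]))
        return d


post = MiniPost("2010/09/jobs-explain-lots", datetime(2010, 9, 1, tzinfo=timezone.utc))
text = json.dumps(post, cls=MiniPostEncoder)
restored = json.loads(text, cls=MiniPostDecoder)
```

With the real classes, the corresponding calls would be json.dumps(post, cls=obscraper.PostEncoder) and json.loads(text, cls=obscraper.PostDecoder), since both inherit from the standard json classes.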

Exceptions

InvalidResponseError

exception obscraper.InvalidResponseError

An HTTP response returned unexpected content.

AttributeNotFoundError

exception obscraper.AttributeNotFoundError

An attribute could not be extracted from an HTML page.

Constants

obscraper.OB_POST_URL_PATTERN = '(^https?://www\\.overcomingbias\\.com/)(\\d{4}/\\d{2}/[a-z0-9-_%]+)(\\.html$)'

Regex pattern for “long” format overcomingbias URLs.

It consists of 3 capturing groups. The second group captures the post name.

Type

str
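To show how the second capturing group yields the post name, here is a minimal sketch using the standard re module. The pattern is copied from the constant above as a raw string; extract_name is a hypothetical helper, and the example URL is simply one constructed to match the pattern. The library's own obscraper.url_to_name may behave differently.

```python
import re

# Copied from obscraper.OB_POST_URL_PATTERN, written as a raw string.
OB_POST_URL_PATTERN = (
    r"(^https?://www\.overcomingbias\.com/)(\d{4}/\d{2}/[a-z0-9-_%]+)(\.html$)"
)


def extract_name(url):
    """Return the post name captured by the second group of the pattern."""
    match = re.search(OB_POST_URL_PATTERN, url)
    if match is None:
        raise ValueError(f"not a long-format overcomingbias post URL: {url}")
    return match.group(2)


print(extract_name("https://www.overcomingbias.com/2010/09/jobs-explain-lots.html"))
```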