Public API Reference¶
Important UPDATE: As of February 2023, the overcomingbias blog has moved to substack. As a result, this web scraper no longer works. Apologies for the inconvenience!
This page contains a full reference to obscraper’s public API.
Objects¶
Post¶
- class obscraper.Post(name: str, number: int, page_type: str, page_status: str, page_format: str, title: str, author: str, publish_date: datetime, tags: List[str], categories: List[str], text_html: str, word_count: int, internal_links: List[str], external_links: List[str], disqus_id: str | None, votes: int | None = None, comments: int | None = None, edit_date: datetime | None = None)¶
Class representing a single post.
- name¶
The original year, month and abbreviated name of the post, as found in its url. E.g. ‘2010/09/jobs-explain-lots’.
- Type:
str
- number¶
The unique integer identifier of the post.
- Type:
int
- page_type¶
Page type, normally ‘post’. I don’t know its definition.
- Type:
str
- page_status¶
Page status, normally ‘publish’. I don’t know its definition.
- Type:
str
- page_format¶
Page format, normally ‘standard’. I don’t know its definition.
- Type:
str
- title¶
The title of the post, as seen on the page. E.g. ‘Jobs Explain Lots’.
- Type:
str
- author¶
The name of the author of the post. E.g. ‘Robin Hanson’.
- Type:
str
- publish_date¶
The (aware) datetime when the post was first published, according to the post page.
- Type:
datetime.datetime
- tags¶
A list of tags associated with the post.
- Type:
List[str]
- categories¶
A list of categories associated with the post.
- Type:
List[str]
- text_html¶
The full text of the post in HTML format.
- Type:
str
- word_count¶
The number of words in the body of the post.
- Type:
int
- internal_links¶
List of hyperlinks to other posts. May contain duplicates.
- Type:
List[str]
- external_links¶
List of hyperlinks to external webpages. May contain duplicates.
- Type:
List[str]
- disqus_id¶
A string which uniquely identifies the post to the Disqus comment count API.
- Type:
str | None
- votes¶
The number of votes the post has received.
- Type:
int, optional
- comments¶
The number of comments on the post.
- Type:
int, optional
- edit_date¶
The (aware) datetime when the post was last edited, according to the sitemap.
- Type:
datetime.datetime, optional
- property plaintext¶
The full text of the post in plaintext format.
- Type:
str
- property url¶
The URL of the post.
- Type:
str
Functions¶
get_posts_by_names¶
- obscraper.get_posts_by_names(names)¶
Get dict of posts identified by their names.
No exceptions are raised if a post or post attribute is not found - instead “None” is returned for that post.
- Parameters:
names (List[str]) – A list of overcomingbias post names to scrape data for.
- Returns:
A dictionary whose keys are the inputted names and whose values are the corresponding posts.
- Return type:
Dict[str, obscraper.Post]
- Raises:
ValueError – If any of the input names are not valid overcomingbias post names.
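Because missing posts come back as “None” rather than raising, callers usually need to separate hits from misses. The following is a minimal sketch of that step. It uses a hand-built dict in place of a real result, since the scraper no longer works against the live site, and split_found_and_missing is a hypothetical helper, not part of obscraper.

```python
from typing import Dict, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def split_found_and_missing(
    results: Dict[str, Optional[T]],
) -> Tuple[Dict[str, T], List[str]]:
    """Separate retrieved posts from names whose value is None."""
    found = {name: post for name, post in results.items() if post is not None}
    missing = [name for name, post in results.items() if post is None]
    return found, missing


# Stand-in for the dict returned by get_posts_by_names; the string value
# is only a placeholder for an obscraper.Post object.
results = {
    "2010/09/jobs-explain-lots": "<Post object>",
    "2006/11/introduction": None,
}
found, missing = split_found_and_missing(results)
```

The same pattern should apply to any function on this page that reports unavailable posts as “None”.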
get_vote_counts¶
- obscraper.get_vote_counts(numbers_dict)¶
Get vote counts for some posts.
Unlike other functions, get_vote_counts returns 0 (rather than None) when a post is not found. This is because the vote count API returns a vote count of 0 for posts that do not exist - it is impossible to tell whether a post doesn’t exist or if it just has zero votes.
- Parameters:
numbers_dict (Dict[str, int]) – Dictionary whose keys are arbitrary labels (e.g. the post URLs) and whose values are post numbers to get votes for.
- Returns:
A dictionary whose keys are the inputted labels and whose values are the corresponding vote counts (int). The vote count is 0 if the post is not found.
- Return type:
Dict[str, int]
- Raises:
ValueError – If any of the input post numbers are not valid.
get_comment_counts¶
- obscraper.get_comment_counts(disqus_ids_dict)¶
Get comment counts for some posts.
If no comment count is found, “None” is returned.
- Parameters:
disqus_ids_dict (Dict[str, str | None]) – Dictionary whose keys are arbitrary labels (e.g. the post URLs) and whose values are the corresponding Disqus ID strings.
- Returns:
A dictionary whose keys are the inputted labels and whose values are the corresponding comment counts. The comment count is “None” if the post is not found.
- Return type:
Dict[str, int | None]
- Raises:
ValueError – If any of the input Disqus IDs are not valid.
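The vote and comment functions share the same label-keyed input shape but report missing data differently (0 for votes, “None” for comments). The sketch below shows one way to build both input dicts from the same labels and to read the results side by side. The post numbers, Disqus IDs and counts are made up, and the two result dicts are hand-written stand-ins for the return values, since the live APIs are no longer reachable.

```python
from typing import Dict, Optional, Tuple

# Hypothetical records: label -> (post number, Disqus ID). All values are
# invented for illustration.
posts: Dict[str, Tuple[int, Optional[str]]] = {
    "post-a": (12345, "made-up-disqus-id-a"),
    "post-b": (67890, None),
}

# Inputs in the shapes documented for get_vote_counts and get_comment_counts.
numbers_dict = {label: number for label, (number, _) in posts.items()}
disqus_ids_dict = {label: disqus_id for label, (_, disqus_id) in posts.items()}

# Stand-ins for get_vote_counts(numbers_dict) and
# get_comment_counts(disqus_ids_dict).
votes: Dict[str, int] = {"post-a": 7, "post-b": 0}
comments: Dict[str, Optional[int]] = {"post-a": 42, "post-b": None}


def describe(label: str) -> str:
    """Summarize one post, treating a None comment count as unknown."""
    count = comments[label]
    comment_text = "comments unknown" if count is None else f"{count} comments"
    # A vote count of 0 is ambiguous: the post may not exist at all.
    return f"{label}: {votes[label]} votes, {comment_text}"
```

Note that describe cannot distinguish a post with zero votes from a post that does not exist; per the description of get_vote_counts, that information is not available.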
get_edit_dates¶
- obscraper.get_edit_dates()¶
Get a dict of post edit dates.
- Returns:
Dictionary whose keys are post names and values are the last edit dates of each post as “aware” datetime.datetime objects.
- Return type:
Dict[str, datetime.datetime]
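As an illustration of working with a result of this shape, the sketch below sorts post names by recency and filters them against a cutoff. The edit_dates dict is a hand-built stand-in for the return value and its dates are made up.

```python
import datetime

utc = datetime.timezone.utc

# Stand-in for the dict returned by get_edit_dates(); the dates are invented.
edit_dates = {
    "2006/11/introduction": datetime.datetime(2021, 3, 1, 12, 0, tzinfo=utc),
    "2010/09/jobs-explain-lots": datetime.datetime(2022, 8, 15, 9, 30, tzinfo=utc),
}

# Post names ordered from most to least recently edited.
names_by_recency = sorted(edit_dates, key=edit_dates.get, reverse=True)

# Names edited on or after an (aware) cutoff.
cutoff = datetime.datetime(2022, 1, 1, tzinfo=utc)
edited_since_cutoff = [name for name, date in edit_dates.items() if date >= cutoff]
```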
get_all_posts¶
- obscraper.get_all_posts()¶
Get all posts hosted on the overcomingbias site.
This includes vote and comment counts for each post, and their last edit dates.
Posts which are no longer hosted on the overcomingbias site are returned as “None”.
- Returns:
A dictionary whose keys are post names and whose values are the corresponding posts.
- Return type:
Dict[str, obscraper.Post]
get_posts_by_edit_date¶
- obscraper.get_posts_by_edit_date(start_date, end_date)¶
Get posts edited within a given date range.
- Parameters:
start_date (datetime.datetime) – The start of the date range, as an “aware” datetime.
end_date (datetime.datetime) – The end of the date range, as an “aware” datetime.
- Returns:
A dictionary whose keys are the URLs of posts edited within the date range, and whose values are the corresponding posts.
- Return type:
Dict[str, obscraper.Post]
- Raises:
ValueError – If start_date is after end_date.
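For readers unfamiliar with the term, an “aware” datetime is one that carries timezone information. The sketch below builds a suitable pair of arguments using only the standard library and shows why a naive datetime cannot be mixed with them. The is_aware helper is hypothetical and is not part of obscraper.

```python
import datetime

# Aware datetimes: tzinfo is set, so they identify an unambiguous instant.
start_date = datetime.datetime(2022, 1, 1, tzinfo=datetime.timezone.utc)
end_date = datetime.datetime(2022, 2, 1, tzinfo=datetime.timezone.utc)


def is_aware(dt: datetime.datetime) -> bool:
    """Return True if dt carries a usable UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


# A naive datetime has no tzinfo. Python refuses to order it against an
# aware one, which is why the arguments must both be aware.
naive = datetime.datetime(2022, 1, 1)
try:
    naive < end_date
    comparison_failed = False
except TypeError:
    comparison_failed = True
```

A naive value can be made aware with naive.replace(tzinfo=datetime.timezone.utc), assuming it was meant as UTC.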
get_posts_by_urls¶
- obscraper.get_posts_by_urls(urls)¶
Get dict of posts identified by their URLs.
“None” is returned if a post could not be retrieved.
- Parameters:
urls (List[str]) – A list of overcomingbias post URLs to scrape data for.
- Returns:
A dictionary whose keys are the inputted URLs and whose values are the corresponding posts.
- Return type:
Dict[str, obscraper.Post]
- Raises:
ValueError – If any of the input URLs are not valid overcomingbias post URLs.
get_post_by_name¶
- obscraper.get_post_by_name(name)¶
Get a single post by its name.
- Parameters:
name (str) – An overcomingbias post name, e.g. ‘2010/09/jobs-explain-lots’.
- Returns:
The Post corresponding to the input name.
- Return type:
- Raises:
ValueError – If the input name is not a valid overcomingbias post name.
obscraper.InvalidResponseError – If the post could not be retrieved.
get_post_by_url¶
- obscraper.get_post_by_url(url)¶
Get a single post by its URL.
- Parameters:
url (str) – An overcomingbias post URL.
- Returns:
The Post corresponding to the input URL.
- Return type:
- Raises:
ValueError – If the input URL is not a valid overcomingbias post URL.
obscraper.InvalidResponseError – If the post could not be retrieved.
clear_cache¶
- obscraper.clear_cache()¶
Clear all cached data.
url_to_name¶
- obscraper.url_to_name(post_url)¶
Get the name of a post from its URL.
- Parameters:
post_url (str) – The URL of the post.
- Returns:
name – Name of the post, e.g. ‘2006/11/introduction’.
- Return type:
str
- Raises:
ValueError – If the input URL is not a valid overcomingbias post URL.
name_to_url¶
- obscraper.name_to_url(post_name)¶
Get the URL of a post from its name.
- Parameters:
post_name (str) – The name of the post. E.g. ‘2010/09/jobs-explain-lots’.
- Returns:
url – The URL of the post.
- Return type:
str
- Raises:
ValueError – If the input name is not a valid overcomingbias post name.
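The relationship between post URLs and post names can be illustrated with the OB_POST_URL_PATTERN regex documented under Constants on this page. This is a sketch only: sketch_url_to_name and sketch_name_to_url are hypothetical stand-ins, not the library's implementation, and the URL template in sketch_name_to_url is an assumption inferred from the pattern.

```python
import re

# Same pattern as the documented OB_POST_URL_PATTERN constant, written as
# raw string pieces. The second capturing group is the post name.
OB_POST_URL_PATTERN = (
    r"(^https?://www\.overcomingbias\.com/)"
    r"(\d{4}/\d{2}/[a-z0-9-_%]+)"
    r"(\.html$)"
)


def sketch_url_to_name(post_url: str) -> str:
    """Extract the post name from a "long" format URL, or raise ValueError."""
    match = re.search(OB_POST_URL_PATTERN, post_url)
    if match is None:
        raise ValueError(f"Not a valid post URL: {post_url!r}")
    return match.group(2)


def sketch_name_to_url(post_name: str) -> str:
    """Build a URL from a post name, or raise ValueError if it is malformed."""
    # Assumed template: https scheme, www host and a .html suffix.
    url = f"https://www.overcomingbias.com/{post_name}.html"
    if re.search(OB_POST_URL_PATTERN, url) is None:
        raise ValueError(f"Not a valid post name: {post_name!r}")
    return url


url = "https://www.overcomingbias.com/2010/09/jobs-explain-lots.html"
name = sketch_url_to_name(url)
round_trip = sketch_name_to_url(name)
```

Validating the rebuilt URL against the same pattern keeps the two directions consistent, so a name accepted by one function is always accepted by the other.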
Serializers¶
PostEncoder¶
- obscraper.PostEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶
Encode obscraper.Post object to JSON.
All obscraper.Post attributes and properties (including post plaintext and URL) are included in the serialized object.
Inherits from json.JSONEncoder.
PostDecoder¶
- obscraper.PostDecoder()¶
Decode an obscraper.Post object from JSON.
Inherits from json.JSONDecoder, implementing a special object_hook to deserialize obscraper.Post objects.
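Since both classes inherit from the standard json encoder and decoder, they would presumably be passed through the cls argument, e.g. json.dumps(post, cls=obscraper.PostEncoder) and json.loads(text, cls=obscraper.PostDecoder). The sketch below illustrates the general technique (a JSONEncoder subclass paired with a decoder that installs an object_hook) using stand-in classes and a plain dict. SketchEncoder, SketchDecoder and the "__datetime__" tag are hypothetical and say nothing about how obscraper actually lays out its JSON.

```python
import datetime
import json


class SketchEncoder(json.JSONEncoder):
    """Stand-in encoder: writes datetimes as tagged ISO 8601 strings."""

    def default(self, o):
        if isinstance(o, datetime.datetime):
            return {"__datetime__": o.isoformat()}
        return super().default(o)


class SketchDecoder(json.JSONDecoder):
    """Stand-in decoder: installs an object_hook that restores datetimes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, object_hook=self._hook, **kwargs)

    @staticmethod
    def _hook(obj):
        if "__datetime__" in obj:
            return datetime.datetime.fromisoformat(obj["__datetime__"])
        return obj


# A plain dict standing in for a post; the publish date is illustrative.
record = {
    "name": "2010/09/jobs-explain-lots",
    "publish_date": datetime.datetime(
        2010, 9, 1, 12, 0, tzinfo=datetime.timezone.utc
    ),
}
text = json.dumps(record, cls=SketchEncoder)
restored = json.loads(text, cls=SketchDecoder)
```

Tagging the datetime on the way out is what lets the object_hook recognize it on the way back in, so the round trip preserves both the value and its timezone.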
Exceptions¶
InvalidResponseError¶
- exception obscraper.InvalidResponseError¶
An HTTP response returned unexpected content.
AttributeNotFoundError¶
- exception obscraper.AttributeNotFoundError¶
An attribute could not be extracted from an HTML page.
Constants¶
- obscraper.OB_POST_URL_PATTERN = '(^https?://www\\.overcomingbias\\.com/)(\\d{4}/\\d{2}/[a-z0-9-_%]+)(\\.html$)'¶
Regex pattern for “long” format overcomingbias URLs.
It consists of 3 capturing groups. The second group captures the post name.
- Type:
str