Public API Reference¶
This page contains a full reference to obscraper’s public API.
Objects¶
Post¶
- class obscraper.Post(name: str, number: int, page_type: str, page_status: str, page_format: str, title: str, author: str, publish_date: datetime, tags: List[str], categories: List[str], text_html: str, word_count: int, internal_links: List[str], external_links: List[str], disqus_id: Optional[str], votes: Optional[int] = None, comments: Optional[int] = None, edit_date: Optional[datetime] = None)¶
Class representing a single post.
- name¶
The original year, month and abbreviated name of the post, as found in its url. E.g. ‘2010/09/jobs-explain-lots’.
- Type
str
- number¶
The unique integer identifier of the post.
- Type
int
- page_type¶
Page type, normally ‘post’. Its exact definition is not documented.
- Type
str
- page_status¶
Page status, normally ‘publish’. Its exact definition is not documented.
- Type
str
- page_format¶
Page format, normally ‘standard’. Its exact definition is not documented.
- Type
str
- title¶
The title of the post, as seen on the page. E.g. ‘Jobs Explain Lots’.
- Type
str
- author¶
The name of the author of the post. E.g. ‘Robin Hanson’.
- Type
str
- publish_date¶
The (aware) datetime when the post was first published, according to the post page.
- Type
datetime.datetime
- tags¶
A list of tags associated with the post.
- Type
List[str]
- categories¶
A list of categories associated with the post.
- Type
List[str]
- text_html¶
The full text of the post in HTML format.
- Type
str
- word_count¶
The number of words in the body of the post.
- Type
int
- internal_links¶
List of hyperlinks to other posts. May contain duplicates.
- Type
List[str]
- external_links¶
List of hyperlinks to external webpages. May contain duplicates.
- Type
List[str]
- disqus_id¶
A string which uniquely identifies the post to the Disqus comment count API.
- Type
str | None
- votes¶
The number of votes the post has received.
- Type
int, optional
- comments¶
The number of comments on the post.
- Type
int, optional
- edit_date¶
The (aware) datetime when the post was last edited, according to the sitemap.
- Type
datetime.datetime, optional
- property plaintext¶
The full text of the post in plaintext format.
- Type
str
- property url¶
The URL of the post.
- Type
str
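Because internal_links and external_links may contain duplicates, a caller who needs each target only once can de-duplicate them while keeping first-seen order. The following is a minimal sketch: `unique_links` is a hypothetical helper (not part of obscraper), and the sample URLs are made up for illustration.

```python
from typing import List


def unique_links(links: List[str]) -> List[str]:
    """Drop repeated links while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(links))


# Made-up sample standing in for the external_links attribute of a Post.
sample_links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]
deduplicated = unique_links(sample_links)
```

The same helper would apply unchanged to internal_links, since both attributes are plain lists of strings.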
Functions¶
get_posts_by_names¶
- obscraper.get_posts_by_names(names)¶
Get dict of posts identified by their names.
No exceptions are raised if a post or post attribute is not found; instead, “None” is returned for that post.
- Parameters
names (List[str]) – A list of overcomingbias post names to scrape data for.
- Returns
A dictionary whose keys are the inputted names and whose values are the corresponding posts.
- Return type
Dict[str, obscraper.Post]
- Raises
ValueError – If any of the input names are not valid overcomingbias post names.
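Since posts that cannot be found come back as “None” rather than raising, the returned dictionary usually needs to be split before use. A minimal sketch under stated assumptions: `split_found_and_missing` is a hypothetical helper, and the dictionary below is a hand-written stand-in for the result of get_posts_by_names (the second name is invented and the string value is only a placeholder for a real obscraper.Post).

```python
from typing import Dict, List, Optional, Tuple, TypeVar

P = TypeVar("P")


def split_found_and_missing(
    posts: Dict[str, Optional[P]],
) -> Tuple[Dict[str, P], List[str]]:
    """Separate retrieved posts from the names that came back as None."""
    found = {name: post for name, post in posts.items() if post is not None}
    missing = [name for name, post in posts.items() if post is None]
    return found, missing


# Stand-in for: obscraper.get_posts_by_names([...])
example_result = {
    "2010/09/jobs-explain-lots": "<Post placeholder>",
    "1999/01/hypothetical-missing-post": None,
}
found, missing = split_found_and_missing(example_result)
```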
get_vote_counts¶
- obscraper.get_vote_counts(numbers_dict)¶
Get vote counts for some posts.
Unlike other functions, get_vote_counts returns 0 (rather than None) when a post is not found. This is because the vote count API returns a vote count of 0 for posts that do not exist, so it is impossible to tell whether a post does not exist or simply has zero votes.
- Parameters
numbers_dict (Dict[str, int]) – Dictionary whose keys are arbitrary labels (e.g. the post URLs) and whose values are post numbers to get votes for.
- Returns
A dictionary whose keys are the inputted labels and whose values are the corresponding vote counts (int). The vote count is 0 if the post is not found.
- Return type
Dict[str, int]
- Raises
ValueError – If any of the input post numbers are not valid.
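The numbers_dict argument can be built from a dictionary of already-fetched posts. The sketch below is only illustrative: `build_numbers_dict` is a hypothetical helper, the SimpleNamespace objects stand in for obscraper.Post instances, and the post numbers are made up.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def build_numbers_dict(posts: Dict[str, Optional[Any]]) -> Dict[str, int]:
    """Map each label to its post number, skipping entries that are None."""
    return {
        label: post.number
        for label, post in posts.items()
        if post is not None
    }


# Stand-ins for obscraper.Post objects; only the `number` attribute is used.
posts = {
    "first": SimpleNamespace(number=101),
    "second": None,
    "third": SimpleNamespace(number=303),
}
numbers_dict = build_numbers_dict(posts)

# With the package installed, the result would then be passed on as:
# votes = obscraper.get_vote_counts(numbers_dict)
```

Keep in mind that a returned count of 0 is ambiguous, as described above, so it should not be read as proof that a post exists.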
get_comment_counts¶
- obscraper.get_comment_counts(disqus_ids_dict)¶
Get comment counts for some posts.
If no comment count is found, “None” is returned.
- Parameters
disqus_ids_dict (Dict[str, str | None]) – Dictionary whose keys are arbitrary labels (e.g. the post URLs) and whose values are the corresponding Disqus ID strings.
- Returns
A dictionary whose keys are the inputted labels and whose values are the corresponding comment counts. The comment count is “None” if the post is not found.
- Return type
Dict[str, int | None]
- Raises
ValueError – If any of the input Disqus IDs are not valid.
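Because a comment count can be either an integer (including 0) or “None”, code that aggregates the result should test for None explicitly rather than rely on truthiness. A minimal sketch, assuming a hand-written dictionary in place of a real get_comment_counts result; `summarize_comment_counts` is a hypothetical helper and the counts are made up.

```python
from typing import Dict, List, Optional, Tuple


def summarize_comment_counts(
    counts: Dict[str, Optional[int]],
) -> Tuple[int, List[str]]:
    """Return the total of the known counts and the labels with no count."""
    # `is not None` keeps genuine zero counts, which a truthiness test would drop.
    total = sum(count for count in counts.values() if count is not None)
    unknown = [label for label, count in counts.items() if count is None]
    return total, unknown


# Stand-in for: obscraper.get_comment_counts(disqus_ids_dict)
example_counts = {"a": 12, "b": None, "c": 0}
total, unknown = summarize_comment_counts(example_counts)
```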
get_edit_dates¶
- obscraper.get_edit_dates()¶
Get a dict of post edit dates.
- Returns
Dictionary whose keys are post names and values are the last edit dates of each post as “aware” datetime.datetime objects.
- Return type
Dict[str, datetime.datetime]
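The returned mapping of names to aware datetimes can be sorted directly, for instance to find the most recently edited posts. This sketch uses a hand-written dictionary with made-up dates in place of a real get_edit_dates result, and `most_recently_edited` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Dict, List


def most_recently_edited(edit_dates: Dict[str, datetime], limit: int) -> List[str]:
    """Return up to `limit` post names, newest edit first."""
    ranked = sorted(edit_dates, key=lambda name: edit_dates[name], reverse=True)
    return ranked[:limit]


# Stand-in for: obscraper.get_edit_dates() (the dates below are invented).
example_edit_dates = {
    "2006/11/introduction": datetime(2020, 1, 1, tzinfo=timezone.utc),
    "2010/09/jobs-explain-lots": datetime(2021, 6, 15, tzinfo=timezone.utc),
}
newest = most_recently_edited(example_edit_dates, limit=1)
```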
get_all_posts¶
- obscraper.get_all_posts()¶
Get all posts hosted on the overcomingbias site.
This includes vote and comment counts for each post, and their last edit dates.
Posts which are no longer hosted on the overcomingbias site are returned as “None”.
- Returns
A dictionary whose keys are post names and whose values are the corresponding posts.
- Return type
Dict[str, obscraper.Post]
get_posts_by_edit_date¶
- obscraper.get_posts_by_edit_date(start_date, end_date)¶
Get posts edited within a given date range.
- Parameters
start_date (datetime.datetime) – The start of the date range, as an “aware” datetime.
end_date (datetime.datetime) – The end of the date range, as an “aware” datetime.
- Returns
A dictionary whose keys are the URLs of posts edited within the date range, and whose values are the corresponding posts.
- Return type
Dict[str, obscraper.Post]
- Raises
ValueError – If start_date is after end_date.
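Both arguments must be “aware” datetimes, i.e. carry timezone information. One way to build such a range with the standard library is sketched below; `last_n_days` is a hypothetical helper, and UTC is chosen here only as an example timezone.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def last_n_days(days: int, now: datetime) -> Tuple[datetime, datetime]:
    """Return an aware (start, end) pair covering the `days` days before `now`."""
    if now.tzinfo is None or now.utcoffset() is None:
        raise ValueError("now must be an aware datetime")
    if days < 0:
        raise ValueError("days must be non-negative")
    return now - timedelta(days=days), now


# A fixed reference time keeps the example reproducible; real code would
# more likely use datetime.now(timezone.utc).
reference = datetime(2021, 6, 15, 12, 0, tzinfo=timezone.utc)
start_date, end_date = last_n_days(7, reference)

# With the package installed, the range would then be passed on as:
# posts = obscraper.get_posts_by_edit_date(start_date, end_date)
```

Constructing the start from the end guarantees start_date is never after end_date, which avoids the documented ValueError.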
get_posts_by_urls¶
- obscraper.get_posts_by_urls(urls)¶
Get dict of posts identified by their URLs.
“None” is returned if a post could not be retrieved.
- Parameters
urls (List[str]) – A list of overcomingbias post URLs to scrape data for.
- Returns
A dictionary whose keys are the inputted URLs and whose values are the corresponding posts.
- Return type
Dict[str, obscraper.Post]
- Raises
ValueError – If any of the input URLs are not valid overcomingbias post URLs.
get_post_by_name¶
- obscraper.get_post_by_name(name)¶
Get a single post by its name.
- Parameters
name (str) – An overcomingbias post name, e.g. ‘2010/09/jobs-explain-lots’.
- Returns
The Post corresponding to the input name.
- Return type
- Raises
ValueError – If the input name is not a valid overcomingbias post name.
obscraper.InvalidResponseError – If the post could not be retrieved.
get_post_by_url¶
- obscraper.get_post_by_url(url)¶
Get a single post by its URL.
- Parameters
url (str) – An overcomingbias post URL.
- Returns
The Post corresponding to the input URL.
- Return type
- Raises
ValueError – If the input URL is not a valid overcomingbias post URL.
obscraper.InvalidResponseError – If the post could not be retrieved.
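Unlike the batch functions, get_post_by_name and get_post_by_url raise when a post cannot be retrieved. A caller who prefers a “None” result can wrap the call. The sketch below is written generically so it runs offline: `fetch_or_none`, `DemoRetrievalError` and `demo_fetch` are all hypothetical stand-ins, not obscraper names.

```python
from typing import Callable, Optional, Type, TypeVar

T = TypeVar("T")


def fetch_or_none(
    fetch: Callable[[str], T],
    key: str,
    retrieval_error: Type[Exception],
) -> Optional[T]:
    """Call fetch(key); return None if it raises `retrieval_error`.

    Any other exception (e.g. ValueError for a malformed key) propagates,
    since that signals a caller bug rather than a missing post.
    """
    try:
        return fetch(key)
    except retrieval_error:
        return None


class DemoRetrievalError(Exception):
    """Offline stand-in for obscraper.InvalidResponseError."""


def demo_fetch(name: str) -> str:
    """Offline stand-in for obscraper.get_post_by_name."""
    if name == "missing":
        raise DemoRetrievalError(name)
    return f"post:{name}"


found = fetch_or_none(demo_fetch, "present", DemoRetrievalError)
absent = fetch_or_none(demo_fetch, "missing", DemoRetrievalError)

# With the package installed, the equivalent call would presumably be:
# post = fetch_or_none(obscraper.get_post_by_name,
#                      "2010/09/jobs-explain-lots",
#                      obscraper.InvalidResponseError)
```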
clear_cache¶
- obscraper.clear_cache()¶
Clear all cached data.
url_to_name¶
- obscraper.url_to_name(post_url)¶
Get the name of a post from its URL.
- Parameters
post_url (str) – The URL of the post.
- Returns
name – Name of the post, e.g. ‘2006/11/introduction’.
- Return type
str
- Raises
ValueError – If the input URL is not a valid overcomingbias post URL.
name_to_url¶
- obscraper.name_to_url(post_name)¶
Get the URL of a post from its name.
- Parameters
post_name (str) – The name of the post. E.g. ‘2010/09/jobs-explain-lots’.
- Returns
url – The URL of the post.
- Return type
str
- Raises
ValueError – If the input name is not a valid overcomingbias post name.
Serializers¶
PostEncoder¶
- obscraper.PostEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶
Encode obscraper.Post object to JSON.
All obscraper.Post attributes and properties (including post plaintext and URL) are included in the serialized object.
Inherits from json.JSONEncoder.
PostDecoder¶
- obscraper.PostDecoder()¶
Decode an obscraper.Post object from JSON.
Inherits from json.JSONDecoder, implementing a special object_hook to deserialize obscraper.Post objects.
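Since both serializers inherit from the standard json classes, they would normally be used through the `cls` argument of json.dumps and json.loads. The sketch below shows that pattern with stand-in classes so that it runs without obscraper installed: `MiniPost`, `MiniPostEncoder`, `MiniPostDecoder` and the `__minipost__` marker key are all hypothetical, and this is not a description of how obscraper implements its own encoder or decoder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MiniPost:
    """Hypothetical two-field stand-in for obscraper.Post."""
    title: str
    word_count: int


class MiniPostEncoder(json.JSONEncoder):
    """Stand-in encoder: serializes MiniPost as a tagged JSON object."""

    def default(self, o):
        if isinstance(o, MiniPost):
            return {"__minipost__": True, **asdict(o)}
        return super().default(o)


class MiniPostDecoder(json.JSONDecoder):
    """Stand-in decoder: rebuilds MiniPost via an object_hook."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, object_hook=self._hook, **kwargs)

    @staticmethod
    def _hook(obj):
        if obj.get("__minipost__"):
            return MiniPost(title=obj["title"], word_count=obj["word_count"])
        return obj


# The word count is an invented value for illustration.
original = MiniPost(title="Jobs Explain Lots", word_count=100)
encoded = json.dumps(original, cls=MiniPostEncoder)
restored = json.loads(encoded, cls=MiniPostDecoder)

# With the package installed, the analogous calls would presumably be:
# text = json.dumps(post, cls=obscraper.PostEncoder)
# post = json.loads(text, cls=obscraper.PostDecoder)
```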
Exceptions¶
InvalidResponseError¶
- exception obscraper.InvalidResponseError¶
An HTTP response returned unexpected content.
AttributeNotFoundError¶
- exception obscraper.AttributeNotFoundError¶
An attribute could not be extracted from an HTML page.
Constants¶
- obscraper.OB_POST_URL_PATTERN = '(^https?://www\\.overcomingbias\\.com/)(\\d{4}/\\d{2}/[a-z0-9-_%]+)(\\.html$)'¶
Regex pattern for “long” format overcomingbias URLs.
It consists of 3 capturing groups. The second group captures the post name.
- Type
str
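The pattern can be used with the standard re module to validate a URL and pull out the post name from the second group. In this minimal sketch the pattern is copied from the constant documented above so the code runs without obscraper installed; `extract_name` is a hypothetical helper, and the sample URL is assembled from the example post name used elsewhere on this page.

```python
import re
from typing import Optional

# Copied from obscraper.OB_POST_URL_PATTERN as documented above.
OB_POST_URL_PATTERN = (
    r"(^https?://www\.overcomingbias\.com/)"
    r"(\d{4}/\d{2}/[a-z0-9-_%]+)"
    r"(\.html$)"
)


def extract_name(url: str) -> Optional[str]:
    """Return the post name (second capturing group), or None if no match."""
    match = re.search(OB_POST_URL_PATTERN, url)
    return match.group(2) if match else None


name = extract_name("https://www.overcomingbias.com/2010/09/jobs-explain-lots.html")
not_a_post = extract_name("https://www.overcomingbias.com/about")
```

Returning None for a non-matching URL is a choice made for this sketch; the documented url_to_name function raises ValueError instead.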