# Implementing New PUC Scrapers

This guide explains how to implement a new scraper for collecting public utility commission (PUC) case data using the `AbstractScraper` base class.
## Base Class Overview

All scrapers must inherit from `AbstractScraper` and implement two required methods:

- `get_all_cases()` - Retrieves a list of all available cases
- `get_case_details(case)` - Retrieves detailed information for a specific case

Scrapers should use the data models provided in the `openpuc_scrapers.models` module to represent cases, filings, and attachments. See the Data Models documentation for more details.
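As a rough orientation, a case with one filing might be assembled like the sketch below. The field names mirror those used in the example later in this guide; the real models may define additional or differently named fields, so treat this as an assumption and check the Data Models documentation.

```python
from datetime import date

from openpuc_scrapers.models import Case, Filing

# Field names here are assumed to match the example implementation below;
# consult the Data Models documentation for the authoritative definitions.
case = Case(case_number="24-0001", description="Example rate case")
case.filings = [
    Filing(
        filed_date=date(2024, 1, 15),
        party_name="Example Utility Co.",
        filing_type="Petition",
        description="Initial petition for a rate adjustment",
    )
]
```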
## Implementation Requirements

To create a new scraper:

1. Create a new class inheriting from `AbstractScraper`:

   ```python
   from .base import AbstractScraper
   from ..models import Case, Filing, Attachment


   class NewStatePUC(AbstractScraper):
       """Interface for interacting with and parsing {State} PUC data."""
   ```

2. Implement the required abstract methods:

   ```python
   def get_all_cases(self) -> list[Case]:
       """Retrieve a list of all available cases."""
       # Implementation here

   def get_case_details(self, case: Case) -> Case:
       """Retrieve details for a specific case."""
       # Implementation here
   ```
## Example Implementation

Here's a simplified example showing the key components of a scraper implementation:
```python
from datetime import datetime

import requests
from bs4 import BeautifulSoup

from .base import AbstractScraper
from ..models import Case, Filing


class NewStatePUC(AbstractScraper):
    def get_all_cases(self) -> list[Case]:
        # Fetch the list of cases from the website
        response = requests.get(self._get_cases_url())
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        cases = []
        for case_element in soup.find_all("div", class_="case"):
            case = Case(
                case_number=case_element.find("span", class_="number").text,
                description=case_element.find("span", class_="description").text,
            )
            cases.append(case)
        return cases

    def get_case_details(self, case: Case) -> Case:
        # Fetch case details
        response = requests.get(self._get_case_url(case.case_number))
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Update case details
        case = self._parse_case_details(soup, case)

        # Get filings
        case.filings = self._parse_filings(soup, case)
        return case

    def _get_cases_url(self) -> str:
        return "https://example.com/cases"

    def _get_case_url(self, case_number: str) -> str:
        return f"https://example.com/cases/{case_number}"

    def _parse_case_details(self, soup: BeautifulSoup, case: Case) -> Case:
        # Extract and clean case details
        # WARNING: you'll probably need to do some data sanitization first!
        case.case_type = soup.find("input", id="type")["value"]
        case.description = " ".join(
            soup.find("textarea", id="description").text.split()
        )
        return case

    def _parse_filings(self, soup: BeautifulSoup, case: Case) -> list[Filing]:
        filings = []
        for filing_div in soup.find_all("div", class_="filing"):
            # WARNING: you'll probably need to do some data sanitization first!
            filing = Filing(
                filed_date=datetime.strptime(
                    filing_div.find("span", class_="date").text,
                    "%m/%d/%Y",
                ).date(),
                party_name=filing_div.find("span", class_="party").text,
                filing_type=filing_div.find("span", class_="type").text,
                description=filing_div.find("div", class_="description").text,
            )
            filings.append(filing)
        return filings
```
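Once implemented, the scraper can be exercised directly. A minimal usage sketch, assuming the class above and a reachable case index page:

```python
scraper = NewStatePUC()

# Pull the case index, then hydrate the first case with its filings
cases = scraper.get_all_cases()
if cases:
    detailed = scraper.get_case_details(cases[0])
    print(detailed.case_number, len(detailed.filings))
```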
## Testing Your Implementation

When implementing a new scraper:

- Test the `get_all_cases()` method to ensure it returns valid `Case` objects (see the sketch below)
- Test the `get_case_details()` method with various case numbers
- Verify that all required fields are properly populated
- Check data sanitization and normalization
- Verify proper error handling for invalid cases or network issues
- Test handling of edge cases (empty fields, unusual characters, etc.)
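A minimal test sketch for the first two points, assuming the `NewStatePUC` class and the field names used in the example above (the import path is hypothetical, and a real test suite would likely mock the HTTP layer rather than hit the live site):

```python
from openpuc_scrapers.models import Case

# Hypothetical import path for the scraper under test
from openpuc_scrapers.scrapers.new_state import NewStatePUC


def test_get_all_cases_returns_cases():
    scraper = NewStatePUC()
    cases = scraper.get_all_cases()

    assert isinstance(cases, list)
    for case in cases:
        assert isinstance(case, Case)
        assert case.case_number  # required fields populated
        assert case.description


def test_get_case_details_populates_filings():
    scraper = NewStatePUC()
    case = scraper.get_all_cases()[0]
    detailed = scraper.get_case_details(case)

    assert detailed.case_number == case.case_number
    for filing in detailed.filings:
        assert filing.filed_date is not None
        assert filing.party_name
```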
## Common Challenges

- **Website Structure Changes**: Design your parsing code to be resilient to minor HTML changes
- **Inconsistent Data**: Handle variations in date formats, field names, and data quality
- **Error Handling**: Implement robust error handling for network issues and invalid data
- **Data Sanitization**: Clean and normalize data to ensure consistency and accuracy
- **Pagination**: Implement pagination support if cases are spread across multiple pages (see the sketch below)
- **Rate Limiting**: Respect website rate limits and implement appropriate delays
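As a rough illustration of the last two points, a paginated fetch loop with a polite delay might look like the sketch below. The URL scheme, `page` query parameter, and one-second delay are assumptions for illustration; each commission's site will differ.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_all_case_pages(base_url: str, delay_seconds: float = 1.0) -> list[BeautifulSoup]:
    """Fetch every page of a paginated case list, pausing between requests."""
    pages = []
    page_number = 1
    while True:
        # Assumed pagination scheme: the real site may use offsets or "next" links.
        response = requests.get(base_url, params={"page": page_number})
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        if not soup.find_all("div", class_="case"):
            break  # no more results on this page
        pages.append(soup)

        page_number += 1
        time.sleep(delay_seconds)  # respect the site's rate limits
    return pages
```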