🧩 Core Components
This section breaks down each component of the pipeline.
⚙️ config.py
- Centralized configuration for all parameters:
  - PROMETHEUS_PORT
  - SCRAPER_CONFIG
  - USER_AGENTS
  - ML_CONFIG
- Keeps magic numbers out of the code by holding system settings in one place.
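
A minimal sketch of what such a module might look like. Only the four top-level names come from the description above; every key and value inside them is a hypothetical placeholder, not the project's actual settings.

```python
"""config.py sketch: one module that owns every tunable value."""
import os

# Port on which Prometheus metrics are exposed. The default of 8000 and the
# environment override are assumptions made for this example.
PROMETHEUS_PORT = int(os.environ.get("PROMETHEUS_PORT", "8000"))

# Scraper tuning knobs. All keys and values here are illustrative.
SCRAPER_CONFIG = {
    "max_concurrency": 5,
    "page_timeout_ms": 30_000,
    "max_retries": 3,
}

# Pool of user-agent strings to rotate through (placeholder values).
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
]

# Settings for the prediction model (hypothetical path).
ML_CONFIG = {"model_path": "models/example_model.joblib"}
```

Other modules would then import these names instead of hard-coding values, for example `from config import SCRAPER_CONFIG`.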
 
🚀 producer.py
- Discovers property listing URLs.
- Enqueues each URL into AWS SQS.
- Key Features:
  - Async Playwright scraping
  - Pagination handling
  - URL filtering/limiting
  - Concurrent enqueueing
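
The filtering, limiting, and enqueueing steps could be sketched roughly as follows. The function names, the substring filter, and the message layout are assumptions for illustration; the one fixed fact is that SQS accepts at most 10 messages per batch request.

```python
from typing import Iterable, Iterator

SQS_MAX_BATCH = 10  # SQS allows at most 10 entries per SendMessageBatch call


def filter_urls(urls: Iterable[str], required_substring: str, limit: int) -> list[str]:
    """Keep unique URLs containing `required_substring`, in order, up to `limit`."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        if len(kept) >= limit:
            break
        if required_substring in url and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def to_sqs_batches(urls: list[str]) -> Iterator[list[dict]]:
    """Split URLs into lists of batch entries, each entry with a unique Id."""
    for start in range(0, len(urls), SQS_MAX_BATCH):
        chunk = urls[start:start + SQS_MAX_BATCH]
        yield [
            {"Id": str(start + offset), "MessageBody": url}
            for offset, url in enumerate(chunk)
        ]
```

Each yielded batch has the shape that boto3's `send_message_batch(QueueUrl=..., Entries=batch)` expects, so batches could be sent concurrently if the producer works that way.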
 
📦 consumer.py
- Polls SQS for messages.
- Scrapes property details.
- Persists data into PostgreSQL.
- Features:
  - Long-lived Playwright browser context
  - Async scraping with concurrency limits
  - Data validation
  - Upsert persistence logic
  - Prometheus metrics
  - Graceful shutdown (SIGINT, SIGTERM)
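
One way to combine a concurrency limit with graceful shutdown is sketched here, using only the standard library. The in-memory `messages` list stands in for SQS polling, and `handle_message` stands in for scraping plus persistence; both are hypothetical simplifications, not the project's code.

```python
import asyncio
import signal


async def handle_message(body: str, limit: asyncio.Semaphore, results: list[str]) -> None:
    """Process one message while holding a slot in the concurrency limiter."""
    async with limit:
        await asyncio.sleep(0.01)  # stand-in for scraping and the database upsert
        results.append(body)


async def run_consumer(messages: list[str], max_concurrency: int = 3) -> list[str]:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    installed = []
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            loop.add_signal_handler(sig, stop.set)
            installed.append(sig)
        except (NotImplementedError, RuntimeError):
            pass  # unsupported on this platform or outside the main thread

    limit = asyncio.Semaphore(max_concurrency)
    results: list[str] = []
    tasks = []
    try:
        for body in messages:  # stand-in for an SQS polling loop
            if stop.is_set():
                break  # accept no new work once a shutdown signal arrived
            tasks.append(asyncio.create_task(handle_message(body, limit, results)))
            await asyncio.sleep(0)  # yield so signal callbacks can run
        await asyncio.gather(*tasks)  # let in-flight work finish before exiting
    finally:
        for sig in installed:
            loop.remove_signal_handler(sig)
    return results
```

The key design choice is that a signal only sets a flag: the loop stops taking new messages, while tasks that are already running are awaited rather than cancelled.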
 
🕷️ scraper.py
- Encapsulates Playwright scraping logic.
- Features:
  - Async context manager for browser lifecycle
  - User-agent rotation
  - Resilient retry logic
  - Single property page scraping with validation
  - Defensive scraping (closing pages after use)
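
Retry logic with user-agent rotation might look roughly like the sketch below. The `fetch` callable, the exponential backoff schedule, and the agent strings are assumptions for illustration; in the real scraper the fetch step would open a Playwright page and close it after use.

```python
import asyncio
import itertools
from typing import Awaitable, Callable

USER_AGENTS = ["agent-a", "agent-b"]  # placeholder strings


async def fetch_with_retry(
    fetch: Callable[[str, str], Awaitable[str]],
    url: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Call `fetch(url, user_agent)`, rotating agents and backing off on failure."""
    agents = itertools.cycle(USER_AGENTS)
    last_error = None
    for attempt in range(max_retries):
        user_agent = next(agents)  # use a different agent on every attempt
        try:
            return await fetch(url, user_agent)
        except Exception as exc:
            last_error = exc
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts") from last_error
```

Passing the fetch step in as a callable keeps the retry policy testable without launching a browser.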
 
📝 data_extractor.py
- Handles parsing and cleaning of data.
- Fault-tolerant with helper methods (safe_inner_text, safe_get_attribute).
- Extracts multi-floor plans, normalizes data.
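
The two helper names come from the list above, but their signatures are not documented here, so the versions below are guesses at the general pattern. They rely only on `query_selector`, `inner_text`, and `get_attribute`, which Playwright's async page and element handles provide, and they return a default instead of raising.

```python
from typing import Any, Optional


async def safe_inner_text(scope: Any, selector: str, default: Optional[str] = None) -> Optional[str]:
    """Return the stripped text of the first match, or `default` on any failure."""
    try:
        element = await scope.query_selector(selector)
        if element is None:
            return default  # selector matched nothing
        text = (await element.inner_text()).strip()
        return text or default  # treat empty text as missing
    except Exception:
        return default  # a broken page should not abort the whole scrape


async def safe_get_attribute(
    scope: Any, selector: str, name: str, default: Optional[str] = None
) -> Optional[str]:
    """Return attribute `name` of the first match, or `default` on any failure."""
    try:
        element = await scope.query_selector(selector)
        if element is None:
            return default
        value = await element.get_attribute(name)
        return value if value is not None else default
    except Exception:
        return default
```

Because `scope` is duck-typed, the same helpers would work on a whole page or on a single element such as one floor-plan row.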
 
💾 Database Layer
dbmodels.py
- SQLModel ORM definitions:
  - Property table (listing info, metadata)
  - Pricing_and_floor_plans table (unit-level details)
db_ops.py
- Handles database sessions, inserts, updates.
- Features:
  - Async PostgreSQL engine
  - Upsert logic with rollback on error
  - Numeric parsing & type conversion
  - Timezone-aware timestamps
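
Numeric parsing and timezone-aware timestamps can be illustrated with the standard library alone. The helper names and the rule of taking the first number in a string are assumptions for this sketch, not the module's documented behavior.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional

# Optional sign, digits with optional thousands separators, optional fraction.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_decimal(raw: Optional[str]) -> Optional[Decimal]:
    """Extract the first number from scraped text such as '$1,250.50 /mo'."""
    if not raw:
        return None
    match = _NUMBER.search(raw)
    if match is None:
        return None  # e.g. 'Call for pricing'
    return Decimal(match.group().replace(",", ""))


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)
```

`Decimal` avoids float rounding on prices, and an aware timestamp maps cleanly onto a PostgreSQL `timestamptz` column.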
 
📈 FastAPI Layer
- Routers: /properties, /analytics, /predict
- Security: token-based authentication
- Features:
  - Pydantic models for response validation
  - Analytics queries
  - Real-time ML predictions
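
The token scheme is not specified here. Assuming a single shared bearer token, the check at the heart of the authentication step could be sketched as below; both function names are hypothetical.

```python
import hmac
from typing import Optional


def extract_bearer_token(authorization_header: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' header value."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    return token.strip()


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Compare the presented token to the expected one in constant time."""
    token = extract_bearer_token(authorization_header)
    if token is None:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

In a FastAPI application a check like this would typically sit inside a dependency attached to each router, returning a 401 response when it fails.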