
Overview

The datascout command discovers external data sources — APIs, datasets, open data portals, and commercial data providers — that can fulfill your project’s data and integration requirements. It prioritizes UK Government open data (data.gov.uk, api.gov.uk) and evaluates sources using weighted scoring across requirements fit, data quality, licensing, API quality, compliance, and reliability.

Command Syntax

arckit datascout <data-need>
Examples:
arckit datascout "real-time traffic data UK"
arckit datascout "company financials"
arckit datascout "geolocation API"

Prerequisites

Mandatory

  • Requirements Document (REQ): Must contain Data Requirements (DR-xxx) or Integration Requirements (INT-xxx)
    • Command: arckit requirements <project>
    • The tool will warn if missing
  • Data Model (DATA): Identifies entities and attributes needing external data
  • Stakeholder Analysis (STKE): Identifies data ownership and governance needs
  • Architecture Principles (PRIN): Data standards, open data preferences

Workflow

1. Generate Requirements and Data Model

arckit requirements "traffic monitoring dashboard"
arckit data-model "traffic monitoring dashboard"

2. Discover Data Sources

arckit datascout "real-time traffic data UK"
The command will:
  1. Extract data needs from requirements (DR-xxx, INT-xxx, FR-xxx, NFR-xxx)
  2. Check UK Government sources FIRST (api.gov.uk, data.gov.uk)
  3. Research commercial APIs (Google, AWS, Azure)
  4. Research free/freemium APIs (OpenWeather, HERE)
  5. Research open datasets (Kaggle, data.world)
  6. Evaluate sources with weighted scoring
  7. Identify gaps and data utility opportunities
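The first step above, pulling requirement IDs out of a requirements document, can be sketched in a few lines of Python. This is only an illustration, not ArcKit's actual implementation: the ID shape (prefix, hyphen, three digits), the function name `extract_requirement_ids`, and the `NFR-001` sample line are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed ID shape: a known prefix, a hyphen, and three digits (e.g. DR-003).
REQUIREMENT_ID = re.compile(r"\b(?:DR|INT|FR|NFR)-\d{3}\b")


def extract_requirement_ids(text: str) -> dict[str, list[str]]:
    """Group requirement IDs found in `text` by prefix, in order of first appearance."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for match in REQUIREMENT_ID.finditer(text):
        req_id = match.group(0)
        prefix = req_id.split("-")[0]
        if req_id not in grouped[prefix]:  # skip repeated mentions
            grouped[prefix].append(req_id)
    return dict(grouped)


# Sample text; the NFR line is hypothetical and only shows that the
# word boundary keeps NFR-001 from also being counted as FR-001.
sample = """
DR-003: Real-time traffic flow (speed, volume, congestion)
DR-004: Historical traffic patterns (for trend analysis)
INT-001: Geolocation API (map visualization)
INT-002: Weather data (correlate traffic with weather)
NFR-001: Dashboard refreshes within 5 seconds
"""

needs = extract_requirement_ids(sample)
print(needs)
```

The grouped result gives one list of IDs per requirement type, which is a convenient starting point for mapping each need to candidate sources.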

3. Review Output

The command creates:
  • File: projects/001-traffic-monitoring/research/ARC-001-DSCT-v1.0.md
  • Summary: Categories researched, sources discovered, UK Gov sources, top recommendations, gaps

4. Next Steps (Handoffs)

  • Data Model: Add discovered sources as new entities/attributes
  • Research: Research data source pricing and vendor selection
  • ADR: Record data source selection decisions
  • DPIA: Assess third-party sources containing personal data
  • Diagram: Create data flow diagrams showing integrations
  • Traceability: Map DR-xxx requirements to discovered sources

Data Source Categories

1. UK Government Open Data (Priority)

UK Government Open Data - Check FIRST
The UK Government publishes thousands of datasets under the Open Government Licence (OGL). These are:
  • Free to use (no licensing fees)
  • High quality (official government data)
  • Compliant (GDPR, accessibility, open standards)
  • Reliable (maintained by government departments)
Always check api.gov.uk and data.gov.uk before considering commercial alternatives.
Key UK Government Data Sources:
| Source | Description | Example Use Cases |
|--------|-------------|-------------------|
| api.gov.uk | Central API catalog for government services | Companies House, DVLA, Land Registry, HMRC |
| data.gov.uk | 50,000+ open datasets from government departments | Census data, crime statistics, health data |
| Companies House API | Company information (free) | Business intelligence, compliance checks |
| Ordnance Survey | Geospatial data, maps (free tier + premium) | Location services, mapping, routing |
| NHS Digital | Healthcare data, hospital statistics (anonymized) | Health analytics, service planning |
| DfT Traffic API | Real-time traffic flow, congestion (free) | Traffic monitoring, route optimization |
| Met Office DataPoint | Weather forecasts, observations (free tier) | Weather-based planning, alerts |
| ONS Open Data | Census, population, economic statistics (free) | Demographics, market research |
| Police UK API | Crime data by location (free) | Safety analytics, risk assessment |
| Land Registry | Property ownership, prices (premium) | Property valuations, due diligence |

2. Commercial APIs

Paid APIs from vendors (Google, AWS, Azure, Stripe, Twilio).
Pros: Enterprise SLAs, global coverage, advanced features, dedicated support
Cons: Licensing costs, vendor lock-in, compliance complexity (data residency, GDPR)
Examples:
  • Google Maps API: Geocoding, routing, Places (pay-per-request)
  • Stripe API: Payment processing (transaction fees)
  • Twilio API: SMS, voice, WhatsApp (pay-per-message)
  • AWS Data Exchange: 3,500+ third-party datasets (subscription)

3. Free/Freemium APIs

Free tiers with usage limits; upgrade to a paid tier for scale.
Pros: Low cost for prototyping, easy to test
Cons: Rate limits, no SLA, limited support, free tier may be discontinued
Examples:
  • OpenWeather API: Weather data (60 calls/min free, then paid)
  • REST Countries API: Country data (unlimited free)
  • CurrencyLayer API: Exchange rates (1,000 requests/month free)
  • HERE Maps API: Geocoding, routing (250,000 transactions/month free)

4. Open Datasets

Public datasets (CSV, JSON, Parquet) for download.
Pros: Free, no API rate limits, full data ownership
Cons: No real-time updates, requires ETL pipeline, data quality varies
Examples:
  • Kaggle: 50,000+ datasets (machine learning, analytics)
  • data.world: Collaborative data catalog (government, research, business)
  • AWS Open Data: Climate, genomics, satellite imagery (S3 buckets)
  • GitHub Awesome Public Datasets: Curated list of 1,000+ datasets

Evaluation Criteria (Weighted Scoring)

DataScout evaluates sources using weighted scoring:
| Criterion | Weight | Description | Scoring |
|-----------|--------|-------------|---------|
| Requirements Fit | 30% | How well does it meet DR-xxx/INT-xxx requirements? | 0-10 (10 = perfect match) |
| Data Quality | 25% | Accuracy, completeness, timeliness, freshness | 0-10 (10 = official government data) |
| License & Cost | 20% | Open license (OGL, CC-BY), free vs paid, cost predictability | 0-10 (10 = free, open license) |
| API Quality | 15% | REST/GraphQL, documentation, rate limits, authentication | 0-10 (10 = RESTful, well-documented) |
| Compliance | 5% | GDPR, data residency (UK/EU), security certifications | 0-10 (10 = UK-hosted, GDPR-compliant) |
| Reliability | 5% | Uptime SLA, vendor reputation, longevity | 0-10 (10 = 99.9% SLA, government source) |

Total Score: Weighted sum (0-100)
Example Scoring:
### DfT Real-Time Traffic API (UK Government)

**Total Score:** 95.5/100

| Criterion | Score | Weight | Weighted | Rationale |
|-----------|-------|--------|----------|----------|
| Requirements Fit | 10/10 | 30% | 30 | Perfectly matches DR-003 (real-time traffic flow) |
| Data Quality | 9/10 | 25% | 22.5 | Official DfT data, updated every 5 minutes |
| License & Cost | 10/10 | 20% | 20 | Free, Open Government Licence (OGL) |
| API Quality | 9/10 | 15% | 13.5 | RESTful, JSON, good docs, rate limit 600/hour |
| Compliance | 10/10 | 5% | 5 | UK-hosted, GDPR-compliant, no personal data |
| Reliability | 9/10 | 5% | 4.5 | 99.5% uptime (government SLA) |

**Recommendation:** ✅ HIGHLY RECOMMENDED (official government source, free, high quality)
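The weighted sum can be reproduced with a short Python sketch. The weights come from the criteria table and the scores from the DfT example; the dictionary keys and the `total_score` function are names chosen for this illustration, not part of ArcKit. Each 0-10 score is treated as a fraction of its criterion's weight, which is how the Weighted column in the example is derived.

```python
# Weights from the evaluation criteria table (percent, summing to 100).
WEIGHTS = {
    "requirements_fit": 30,
    "data_quality": 25,
    "license_cost": 20,
    "api_quality": 15,
    "compliance": 5,
    "reliability": 5,
}


def total_score(scores: dict[str, float]) -> float:
    """Weighted sum of 0-10 criterion scores, on a 0-100 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six criteria")
    # score / 10 is the fraction achieved; times the weight gives weighted points.
    return sum(WEIGHTS[name] * scores[name] / 10 for name in WEIGHTS)


# Criterion scores from the DfT Real-Time Traffic API example.
dft_scores = {
    "requirements_fit": 10,
    "data_quality": 9,
    "license_cost": 10,
    "api_quality": 9,
    "compliance": 10,
    "reliability": 9,
}

dft_total = total_score(dft_scores)
print(dft_total)  # 30 + 22.5 + 20 + 13.5 + 5 + 4.5 = 95.5
```

Because Requirements Fit and Data Quality carry 55% of the total between them, a source that scores poorly on either is unlikely to reach the top tier, however strong its licensing or SLA.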

Gap Analysis

DataScout identifies gaps where requirements cannot be fulfilled.
Example:
## Gap Analysis

### GAP-001: Real-Time Parking Availability

**Requirement:** DR-005 - Real-time parking space availability (city-wide)

**Gap:** No centralized UK API for parking availability across councils

**Impact:** Cannot fulfill FR-012 (show nearby parking in app)

**Options:**
1. **Council-by-Council Integration:** Integrate with individual council parking APIs (e.g., Westminster, Camden)
   - **Pros:** Accurate, real-time
   - **Cons:** 30+ integrations needed for London alone, inconsistent APIs
2. **Parking Aggregator (ParkWhiz, JustPark):** Commercial API aggregating parking data
   - **Pros:** Single API, national coverage
   - **Cons:** Licensing cost (£5,000/year), limited council partnerships
3. **Build Custom Scraper:** Scrape council websites for parking data
   - **Pros:** Free
   - **Cons:** Unreliable, legal risks (ToS violations), no real-time guarantee

**Recommendation:** Option 2 (ParkWhiz API) - cost justified by national coverage and time savings

Data Utility Analysis

DataScout identifies data utility beyond primary requirements: additional value from discovered sources.
Example:
## Data Utility Analysis

### Utility Opportunity: Census Demographics

**Primary Requirement:** DR-002 - Store customer postcodes for delivery routing

**Discovered Source:** ONS Census API (free, UK Government)

**Additional Utility:**
1. **Market Segmentation:** Link postcodes to Census demographics (age, income, household size)
   - **Use Case:** Personalized product recommendations (families vs singles)
2. **Service Planning:** Identify underserved areas (low delivery coverage, high demand)
   - **Use Case:** Optimize warehouse locations, delivery routes
3. **Compliance:** GDPR-compliant (anonymized, aggregated data)
   - **Use Case:** Avoid processing individual-level demographic PII

**Impact:**
- Enhances existing data model (E-001: Customer entity)
- Supports future FR-xxx requirements (personalization, analytics)
- No additional cost (free ONS API)

**Recommendation:** ✅ ADD TO DATA MODEL (new attribute: `census_area_id`)

Document Structure

The generated document includes:
# Data Source Discovery Document

## Document Control
- Document ID: ARC-001-DSCT-v1.0
- Version: 1.0
- Status: DRAFT

## Executive Summary
- Total Sources Discovered: 12
- UK Government Sources: 5
- Commercial APIs: 4
- Free/Freemium APIs: 2
- Open Datasets: 1

## Requirements Coverage
- Data Requirements (DR-xxx): 8 requirements
- Coverage: 87.5% (7/8 fulfilled)
- Gaps: 1 (DR-005: Real-time parking)

## Discovered Sources (by Category)

### UK Government Open Data
[Source 1, Source 2, ...]

### Commercial APIs
[Source 3, Source 4, ...]

### Free/Freemium APIs
[Source 5, Source 6, ...]

### Open Datasets
[Source 7, ...]

## Evaluation & Recommendations

### Top Recommended Sources
[Sorted by total score, 90+]

### Conditional Recommendations
[Scored 70-89, suitable for specific use cases]

### Not Recommended
[Scored <70, significant drawbacks]

## Gap Analysis
[Requirements not fulfilled, options, recommendations]

## Data Utility Analysis
[Additional value beyond primary requirements]

## Data Model Impact
[New entities, attributes, relationships to add]

## Integration Considerations
[Authentication, rate limits, data formats, error handling]

## Compliance & Privacy
[GDPR, data residency, third-party processors, DPIA triggers]

## Cost Analysis
[Licensing fees, pay-per-request, free tier limits]

## Next Steps
[Actions: add to data model, research vendors, create ADR, conduct DPIA]
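The three recommendation bands in the Evaluation & Recommendations section (90+, 70-89, below 70) can be expressed as a small Python function. This is a sketch only: the function name, the sample sources, and the handling of fractional totals (anything at or above 70 but below 90 is treated as conditional) are assumptions for illustration.

```python
def recommendation_tier(total: float) -> str:
    """Map a 0-100 total score to one of the three bands in the document template."""
    if not 0 <= total <= 100:
        raise ValueError("total score must be between 0 and 100")
    if total >= 90:
        return "Top Recommended"
    if total >= 70:
        return "Conditional"
    return "Not Recommended"


# Hypothetical totals, listed highest first as in the generated document.
sample_totals = {"Source A": 92, "Source B": 78, "Source C": 64}
ranked = sorted(sample_totals.items(), key=lambda item: item[1], reverse=True)
for name, total in ranked:
    print(f"{name}: {total} -> {recommendation_tier(total)}")
```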

Real-World Example

Project: Traffic Monitoring Dashboard (Project 002)
Data Needs Extracted:
  • DR-003: Real-time traffic flow (speed, volume, congestion)
  • DR-004: Historical traffic patterns (for trend analysis)
  • INT-001: Geolocation API (map visualization)
  • INT-002: Weather data (correlate traffic with weather)
Categories Researched:
  • UK Government Open Data: 3 sources
  • Commercial APIs: 2 sources
  • Free/Freemium APIs: 1 source
  • Open Datasets: 1 source
Top Recommended Sources:
  1. DfT Real-Time Traffic API (UK Government) - Score: 92/100
    • Category: UK Gov Open Data
    • Cost: Free (Open Government Licence)
    • Coverage: Major roads, motorways (England)
    • Update Frequency: Every 5 minutes
    • API Quality: RESTful, JSON, 600 requests/hour
    • Requirements Fulfilled: DR-003 (real-time traffic)
    • Recommendation: ✅ PRIMARY SOURCE
  2. Ordnance Survey Maps API (UK Government) - Score: 88/100
    • Category: UK Gov Open Data (Premium tier)
    • Cost: Free tier (10,000 requests/month), then £500/month
    • Coverage: UK-wide maps, geolocation, routing
    • API Quality: RESTful, excellent docs, 600 requests/min
    • Requirements Fulfilled: INT-001 (geolocation)
    • Recommendation: ✅ RECOMMENDED (free tier sufficient for MVP)
  3. Met Office DataPoint API (UK Government) - Score: 85/100
    • Category: UK Gov Open Data
    • Cost: Free tier (5,000 requests/day), then £200/month
    • Coverage: UK-wide weather forecasts, observations
    • Update Frequency: Hourly
    • Requirements Fulfilled: INT-002 (weather data)
    • Recommendation: ✅ RECOMMENDED
  4. DfT Road Safety Data (UK Government Open Dataset) - Score: 78/100
    • Category: UK Gov Open Dataset (CSV download)
    • Cost: Free
    • Coverage: Accident data (2010-2023)
    • Update Frequency: Annual
    • Requirements Fulfilled: DR-004 (historical patterns)
    • Data Utility: Identify high-accident zones for route optimization
    • Recommendation: ✅ ADD TO DATA MODEL
Requirements Coverage: 100% (4/4 requirements fulfilled)
Gaps Identified: 0
Data Utility Highlights:
  • Road Safety Data: Can identify high-accident zones → recommend safer routes (beyond primary requirement)
  • Census Demographics: Link traffic patterns to demographics (commuter vs leisure traffic)
Data Model Impact:
  • Add entity: E-005: TrafficFlow (source: DfT API)
  • Add entity: E-006: WeatherCondition (source: Met Office API)
  • Add entity: E-007: AccidentHotspot (source: DfT Road Safety Dataset)
  • Add attributes: E-001: Location (census_area_id for demographics)
Next Steps:
  1. Run /arckit data-model 002 to add new entities
  2. Run /arckit adr dft-traffic-api to document API selection decision
  3. Run /arckit dpia 002 (Met Office API may contain location PII)
  4. Register for DfT API key: https://www.api.gov.uk/dft/traffic
  5. Implement API integration (rate limit: 600 requests/hour)
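The requirements coverage figure in the example above is a simple ratio of fulfilled requirements to total requirements. A minimal Python sketch, assuming each requirement ID maps to the list of sources that fulfil it (the `coverage_report` function and its return shape are hypothetical, not ArcKit output):

```python
def coverage_report(requirements: list[str], fulfilled_by: dict[str, list[str]]) -> dict:
    """Split requirements into covered and gaps, and compute percentage coverage."""
    covered = [req for req in requirements if fulfilled_by.get(req)]
    gaps = [req for req in requirements if not fulfilled_by.get(req)]
    pct = 100 * len(covered) / len(requirements) if requirements else 0.0
    return {"coverage_pct": pct, "covered": covered, "gaps": gaps}


# Requirement-to-source mapping taken from the traffic monitoring example.
sources = {
    "DR-003": ["DfT Real-Time Traffic API"],
    "DR-004": ["DfT Road Safety Data"],
    "INT-001": ["Ordnance Survey Maps API"],
    "INT-002": ["Met Office DataPoint API"],
}

report = coverage_report(list(sources), sources)
print(report["coverage_pct"], report["gaps"])  # 100.0 []

# Illustration only: adding a requirement with no source surfaces it as a gap.
with_gap = coverage_report(list(sources) + ["DR-005"], sources)
print(with_gap["coverage_pct"], with_gap["gaps"])  # 80.0 ['DR-005']
```

Keeping the mapping explicit also gives the traceability step a direct input: every requirement ID either names its sources or appears in the gap list.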

Tips & Best Practices

Prioritize UK Government Sources
UK Government open data is:
  • Free (no licensing fees, no pay-per-request)
  • Official (authoritative, high quality)
  • Compliant (GDPR, accessibility, open standards)
  • Reliable (government SLAs, long-term support)
Always check api.gov.uk and data.gov.uk before commercial APIs.
Third-Party Data = Third-Party Risk
When integrating external data sources:
  • GDPR: If source contains personal data, you are a Data Controller (liable for breaches)
  • Data Processing Agreements: Required for third-party processors (AWS, Google, Stripe)
  • DPIA: Required if high-risk processing (special category data, large scale)
  • Vendor Lock-In: Avoid proprietary formats, prefer open standards (JSON, CSV)
Free Tier Limits
Free/freemium APIs often have:
  • Rate limits: 100-1,000 requests/hour (may not scale to production)
  • No SLA: Service may be down without notice
  • Feature limitations: Premium features (geocoding, historical data) locked
  • Discontinuation risk: Free tier may be discontinued (e.g., Google Maps pricing changes)
Plan for paid tier migration before launch.
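A quick way to check whether a rate limit will hold in production is to estimate the request volume from the polling pattern. The Python sketch below assumes one request per monitored item per poll (for example, one road segment per call); that assumption, the function names, and the item counts are illustrative, while the 600 requests/hour limit and 5-minute update frequency come from the DfT example.

```python
import math


def requests_per_hour(items: int, poll_interval_minutes: float) -> float:
    """Hourly request volume, assuming one request per item per poll."""
    return items * 60 / poll_interval_minutes


def max_items_within_limit(limit_per_hour: int, poll_interval_minutes: float) -> int:
    """Largest number of items that can be polled without exceeding the limit."""
    polls_per_hour = 60 / poll_interval_minutes
    return math.floor(limit_per_hour / polls_per_hour)


# 600 requests/hour and a 5-minute refresh, as in the DfT example.
LIMIT_PER_HOUR = 600
POLL_MINUTES = 5

print(max_items_within_limit(LIMIT_PER_HOUR, POLL_MINUTES))  # 50 items
print(requests_per_hour(80, POLL_MINUTES))  # 960.0, which exceeds the limit
```

If the estimate exceeds the limit, the usual options are polling less often, batching several items per request where the API allows it, caching responses, or budgeting for a paid tier.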

Quality Checks

Before generating the document, ArcKit validates:

Requirements

Prerequisite: Run before datascout to identify DR-xxx/INT-xxx

Data Model

Next step: Add discovered sources as entities

DPIA

Next step: Assess third-party sources with PII

Research

Integration: Research vendor pricing and selection

ADR

Integration: Document data source selection decisions

Diagram

Downstream: Create data flow diagrams
