Wheat Disease Classification Media Scrape
Automated system for collecting and analyzing wheat disease-related news articles from South Asia.
Overview
This Flask-based web application automatically scrapes, validates, and processes news articles related to wheat diseases (particularly wheat rust/yellow rust) from multiple sources including Google News and NewsAPI. The system focuses on South Asian regions (Nepal, India, Bangladesh, Bhutan) to track wheat disease outbreaks and agricultural news.
The system uses Large Language Models (LLM) to extract structured information from articles and provides both web interface and API endpoints for accessing the processed data.
Features
Multi-Source News Aggregation
- Google News Integration with regional filtering
- NewsAPI integration for professional sources
- Keyword-based search for wheat disease terminology
- Regional filtering for South Asian agricultural news
Intelligent Content Processing
- LLM-powered structured data extraction
- Automatic wheat variety identification
- Geographic location extraction
- Disease type and outbreak classification
Database Management
- SQLite backend with SQLAlchemy ORM
- Duplicate detection and URL validation
- Status tracking and retry logic
- Structured content storage
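The duplicate-detection scheme can be sketched with a SQLAlchemy model whose URL column carries a unique constraint; the table and column names below are illustrative assumptions, not the project's actual schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Article(Base):
    """Hypothetical article row; column names are assumptions for illustration."""
    __tablename__ = "articles"
    id = Column(Integer, primary_key=True)
    # A unique constraint on the URL makes SQLite reject duplicates at insert time.
    url = Column(String, unique=True, nullable=False)
    status = Column(String, default="new")

# In-memory SQLite database for demonstration.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
```

Inserting the same URL twice then raises an `IntegrityError`, which the scraper can catch to skip already-collected articles.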
API Access
- RESTful API endpoints for data access
- Filtering by country, district, date, and type
- Statistical data aggregation
- JSON response format
System Architecture
```
Data Collection        Content Extraction       LLM Processing          Storage & API
      ↓                        ↓                       ↓                       ↓
┌─────────────┐       ┌─────────────────┐     ┌─────────────────┐     ┌─────────────┐
│ Google News │       │ URL Validation  │     │ OpenRouter API  │     │ SQLite      │
│ NewsAPI     │──────▶│ newspaper3k lib │────▶│ Inference Server│────▶│ Database    │
│             │       │ HTTP 200 Check  │     │ Mistral Models  │     │ REST API    │
│             │       │ Text Extraction │     │ JSON Structure  │     │ Web App     │
└─────────────┘       └─────────────────┘     └─────────────────┘     └─────────────┘
```
Detailed Process Flow
- Data Collection: Automated scraping from Google News and NewsAPI using scheduled jobs; searches for wheat disease keywords with regional filtering for South Asia.
- Content Extraction: Validates URLs with an HTTP 200 check, then extracts article text and metadata with newspaper3k.
- LLM Processing: Processes article text using inference servers to extract structured information (country, district, date, news type, keywords, varieties, confidence).
- Storage & API: Stores processed data in SQLite and serves via RESTful API endpoints.
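The status tracking and retry logic mentioned above can be sketched as a small state transition; `status`, `retry_count`, and `MAX_RETRIES` are illustrative assumptions, not the project's actual field names:

```python
MAX_RETRIES = 3  # assumed retry budget per article

def next_status(record: dict, succeeded: bool) -> dict:
    """Advance one article record through the pipeline's retry states.

    Successful processing marks the record done; a failure increments the
    retry counter and either schedules a retry or gives up permanently.
    """
    if succeeded:
        record["status"] = "processed"
    else:
        record["retry_count"] = record.get("retry_count", 0) + 1
        record["status"] = (
            "failed" if record["retry_count"] >= MAX_RETRIES else "pending_retry"
        )
    return record
```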
LLM Prompts
System Prompt
You are a wheat disease expert that analyzes news articles about wheat diseases. Extract key information and return it in JSON format with the following fields.
- Required fields: `country`, `district` (array), `date` (YYYY-MM-DD), `news_type`, `keywords`, `variety`, `confidence_score`, `explanation`.
- District rules:
  - Always return districts as an array of individual district names (never a single comma-separated string).
  - Extract administrative district names only; do not include regional descriptions.
  - If multiple districts are mentioned, list each separately (e.g., `["Ambala", "Yamunanagar", "Karnal"]`).
  - Do not return regions like "Southern Punjab" or "Potohar region" instead of district names.
- Date rules:
  - Use actual disease observation dates only (YYYY-MM-DD). Do NOT use publication or byline dates.
  - Look for phrases such as "observed in", "sighted on", "reported on", "alert issued on".
  - If a date range is given, calculate and return the median date; if multiple sighting dates exist, return the earliest one; if none, return `null`.
- news_type: choose from `sighting`, `warning`, `advisory`, or `others`.
- Keywords & variety: return arrays of disease-related keywords and mentioned wheat varieties (e.g., `WH-711`).
- confidence_score: float between 0.0 and 1.0 expressing overall extraction confidence.
- Always return valid JSON containing all required fields.
- REMEMBER: Publication/byline dates (e.g., 'February 08, 2022') are forbidden; only use actual disease occurrence dates.
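The date rules above can be expressed in code. This is a minimal sketch of the post-processing a consumer might apply (the function name is an assumption, not part of the pipeline):

```python
from datetime import date

def resolve_observation_date(dates=None, date_range=None):
    """Apply the prompt's date rules: median of a range,
    else earliest of multiple sighting dates, else None."""
    if date_range is not None:
        start, end = date_range
        # Median (midpoint) of the range, per the prompt's range rule.
        return start + (end - start) // 2
    if dates:
        return min(dates)  # earliest sighting date wins
    return None
```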
User Prompt
Please analyze the provided article text and extract the structured fields described above.
- Focus on disease observation phrases: observed in, sighted on, reported on, alert issued on, etc.
- Extract districts as an array of names — do not return regional labels or a comma-separated string.
- If dates are ranges, compute the median; if multiple dates exist, return the earliest; if none, return `null`.
Full user prompt:
Please analyze this text: {text}
IMPORTANT: Look for actual disease sighting dates, not publication dates. Use phrases like 'observed in', 'last week of January', or 'alert issued on' for date extraction.
CRITICAL: The 'district' field must be a JSON array of strings, not a single string. For example: ["District1", "District2", "District3"] NOT "District1, District2, District3" or "Region (District1, District2, District3)"
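Since LLMs sometimes violate the array requirement despite the prompt, a defensive normalizer on the consumer side can help. This sketch (function name and heuristics are assumptions) coerces the two failure modes the prompt warns about:

```python
import re

def normalize_districts(value):
    """Coerce the LLM 'district' field into a list of district names.

    Handles the failure modes the prompt forbids: a single comma-separated
    string, or a 'Region (D1, D2)' wrapper around the district list.
    """
    if isinstance(value, list):
        return [str(v).strip() for v in value]
    # Strip a 'Region (...)' wrapper if present, keeping the parenthesized names.
    m = re.search(r"\(([^)]*)\)", value)
    if m:
        value = m.group(1)
    return [part.strip() for part in value.split(",") if part.strip()]
```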
LLM Output JSON Structure
```json
{
  "country": "India",
  "district": ["Ambala"],
  "date": "2024-02-15",
  "news_type": "sighting",
  "keywords": ["yellow rust", "wheat disease", "fungus"],
  "variety": ["WH-711", "HD-2967"],
  "confidence_score": 0.9,
  "explanation": "Yellow rust outbreak reported in Hussaini and Dhanaura villages"
}
```
Field Descriptions:
- country/district: Geographic location of the disease occurrence (district is an array of district names)
- date: Actual disease sighting date (YYYY-MM-DD format)
- news_type: Classification - 'sighting', 'warning', 'advisory', or 'others'
- keywords/variety: Arrays of disease-related keywords and mentioned wheat varieties
- confidence_score: Float between 0.0 and 1.0 expressing overall extraction confidence
- explanation: Brief justification for the extracted values
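The required fields and value constraints can be checked with a small validator before a record is stored. A hedged sketch, with illustrative names only:

```python
REQUIRED_FIELDS = {"country", "district", "date", "news_type",
                   "keywords", "variety", "confidence_score", "explanation"}
VALID_NEWS_TYPES = {"sighting", "warning", "advisory", "others"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one LLM output record (empty = valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if not isinstance(record.get("district"), list):
        problems.append("district must be an array")
    if record.get("news_type") not in VALID_NEWS_TYPES:
        problems.append("invalid news_type")
    score = record.get("confidence_score")
    if not (isinstance(score, (int, float)) and 0.0 <= score <= 1.0):
        problems.append("confidence_score out of range")
    return problems
```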
Search Strategy & Keywords
Disease Detection Keywords
AI Models & Processing
API Endpoints
- `/api/wheat_disease` — Retrieve records with optional filtering.
- `/api/wheat_disease/{id}` — Get specific record by ID.
- `/api/wheat_disease/stats` — Statistical overview.

Example Usage
CURL:

```shell
curl -u username:password "https://mediascraper.saralcodes.xyz/api/wheat_disease?country=India&limit=10"
```

Python:

```python
import requests

response = requests.get(
    "https://mediascraper.saralcodes.xyz/api/wheat_disease?country=India&limit=10",
    auth=("username", "password"),
)
print(response.json())
```
Replace `username:password` with valid credentials.