Wheat Disease Classification Media Scrape

Automated system for collecting and analyzing wheat disease-related news articles from South Asia.

Overview

This Flask-based web application automatically scrapes, validates, and processes news articles related to wheat diseases (particularly wheat rust/yellow rust) from multiple sources, including Google News and NewsAPI. The system focuses on South Asian regions (Nepal, India, Bangladesh, Bhutan) to track wheat disease outbreaks and agricultural news.

The system uses Large Language Models (LLMs) to extract structured information from articles and provides both a web interface and API endpoints for accessing the processed data.

Features

Multi-Source News Aggregation
  • Google News integration with regional filtering
  • NewsAPI integration for professional sources
  • Keyword-based search for wheat disease terminology
  • Regional filtering for South Asian agricultural news
Intelligent Content Processing
  • LLM-powered structured data extraction
  • Automatic wheat variety identification
  • Geographic location extraction
  • Disease type and outbreak classification
Database Management
  • SQLite backend with SQLAlchemy ORM
  • Duplicate detection and URL validation
  • Status tracking and retry logic
  • Structured content storage
API Access
  • RESTful API endpoints for data access
  • Filtering by country, district, date, and type
  • Statistical data aggregation
  • JSON response format

System Architecture

Data Collection      Content Extraction      LLM Processing      Storage & API
       ↓                     ↓                    ↓                 ↓
┌─────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────┐
│ Google News │    │ URL Validation  │    │ OpenRouter API  │    │ SQLite      │
│   NewsAPI   │───▶│ newspaper3k lib │───▶│ Inference Server│───▶│ Database    │
│             │    │ HTTP 200 Check  │    │ Mistral Models  │    │ REST API    │
│             │    │ Text Extraction │    │ JSON Structure  │    │ Web App     │
└─────────────┘    └─────────────────┘    └─────────────────┘    └─────────────┘
Detailed Process Flow
  1. Data Collection: Automated scraping from Google News and NewsAPI using scheduled jobs; searches for wheat disease keywords with regional filtering for South Asia.
  2. Content Extraction: Validates URLs with an HTTP 200 check, then extracts article text and metadata using newspaper3k (see the extraction sketch after this list).
  3. LLM Processing: Processes article text using inference servers to extract structured information (country, district, date, news type, keywords, varieties, confidence).
  4. Storage & API: Stores processed data in SQLite and serves via RESTful API endpoints.

LLM Prompts

System Prompt

You are a wheat disease expert that analyzes news articles about wheat diseases. Extract key information and return it in JSON format with the following fields.

  • Required fields: country, district (array), date (YYYY-MM-DD), news_type, keywords, variety, confidence_score, explanation.
  • District rules:
    • Always return districts as an array of individual district names (never a single comma string).
    • Extract administrative district names only — do not include regional descriptions.
    • If multiple districts are mentioned, list each separately (e.g., ["Ambala", "Yamunanagar", "Karnal"]).
    • Do not return regions like "Southern Punjab" or "Potohar region" instead of district names.
  • Date rules:
    • Use actual disease observation dates only (YYYY-MM-DD). Do NOT use publication or byline dates.
    • Look for phrases such as observed in, sighted on, reported on, alert issued on.
    • If a date range is given, calculate and return the median date; if multiple sighting dates exist, return the earliest one; if none, return null.
  • news_type: choose from sighting, warning, advisory, or others.
  • Keywords & variety: return arrays of disease-related keywords and mentioned wheat varieties (e.g., WH-711).
  • confidence_score: float between 0.0 and 1.0 expressing overall extraction confidence.
  • Always return valid JSON containing all required fields.
  • REMEMBER: Publication/byline dates (e.g., 'February 08, 2022') are forbidden — only use actual disease occurrence dates.
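
As a concrete reading of the date-range rule, the median of a range is the midpoint between its two endpoints. The helper below is purely illustrative (not part of the scraper) and only shows the arithmetic:

from datetime import date

def median_of_range(start: date, end: date) -> date:
    # Midpoint of a date range, e.g. 2024-01-10 .. 2024-01-20 -> 2024-01-15
    # (adding the halved timedelta to a date drops any half-day remainder)
    return start + (end - start) / 2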
User Prompt

Please analyze the provided article text and extract the structured fields described above.

  • Focus on disease observation phrases: observed in, sighted on, reported on, alert issued on, etc.
  • Extract districts as an array of names — do not return regional labels or a comma-separated string.
  • If dates are ranges, compute the median; if multiple dates exist, return the earliest; if none, return null.
Full user prompt:
Please analyze this text: {text}

IMPORTANT: Look for actual disease sighting dates, not publication dates. Use phrases like 'observed in', 'last week of January', or 'alert issued on' for date extraction.

CRITICAL: The 'district' field must be a JSON array of strings, not a single string. For example: ["District1", "District2", "District3"] NOT "District1, District2, District3" or "Region (District1, District2, District3)"
LLM Output JSON Structure
{
  "country": "India",
  "district": "Ambala",
  "date": "2024-02-15",
  "news_type": "sighting",
  "keywords": ["yellow rust", "wheat disease", "fungus"],
  "variety": ["WH-711", "HD-2967"],
  "confidence_score": 0.9,
  "explanation": "Yellow rust outbreak reported in Hussaini and Dhanaura villages"
}

Field Descriptions:

  • country/district: Geographic location of the disease occurrence (district is always an array of district names)
  • date: Actual disease sighting date (YYYY-MM-DD format), or null if no sighting date is given
  • news_type: Classification - 'sighting', 'warning', 'advisory', or 'others'
  • keywords/variety: Arrays of disease-related keywords and mentioned wheat varieties
  • confidence_score: Overall extraction confidence as a float between 0.0 and 1.0
  • explanation: Brief free-text justification for the extracted values
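
Because the prompt has to insist that district is a JSON array and that the date is an actual sighting date, a consumer of this output will usually validate it before storing. The helper below is a hypothetical sketch of that kind of validation, using only the standard library; it is not the project's actual post-processing code:

import json
from datetime import datetime

def normalize_llm_output(raw: str) -> dict:
    # Hypothetical sanity checks on the LLM's JSON output
    data = json.loads(raw)

    # 'district' must be a list of strings, never a comma-separated string
    district = data.get("district")
    if isinstance(district, str):
        data["district"] = [d.strip() for d in district.split(",") if d.strip()]

    # 'confidence_score' should be a float clamped to [0.0, 1.0]
    score = float(data.get("confidence_score", 0.0))
    data["confidence_score"] = min(max(score, 0.0), 1.0)

    # 'date' must be YYYY-MM-DD or null
    if data.get("date"):
        datetime.strptime(data["date"], "%Y-%m-%d")  # raises ValueError if malformed

    return data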

Search Strategy & Keywords

The system uses targeted search strategies combining disease-specific keywords, regional focus, and advanced AI models to identify and extract wheat disease information from news sources.
Disease Detection Keywords
"wheat news", "wheat disease", "wheat rust", "yellow rust", "stripe rust"
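
A collection run might query both sources with these keywords. The snippet below is a hedged sketch: the locale codes and query parameters are assumptions for illustration, and the project's actual search code is not shown here. It uses Google News RSS search and NewsAPI's /v2/everything endpoint:

from urllib.parse import quote
import feedparser
import requests

KEYWORDS = ["wheat news", "wheat disease", "wheat rust", "yellow rust", "stripe rust"]

def google_news_search(keyword: str, country: str = "IN") -> list:
    # Google News RSS search, restricted to a country edition (locale codes are assumed)
    url = f"https://news.google.com/rss/search?q={quote(keyword)}&hl=en-{country}&gl={country}&ceid={country}:en"
    feed = feedparser.parse(url)
    return [{"title": e.title, "url": e.link, "published": e.get("published")} for e in feed.entries]

def newsapi_search(keyword: str, api_key: str) -> list:
    # NewsAPI /v2/everything endpoint; the API key comes from a NewsAPI account
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": keyword, "language": "en", "sortBy": "publishedAt", "apiKey": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("articles", [])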
AI Models & Processing
  • mistralai/mistral-small-3.2-24b-instruct:free (currently used)
  • deepseek/deepseek-chat-v3-0324:free (available)
Both models are accessed via the OpenRouter API with a temperature of 0.1.
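
The LLM call itself can be sketched as a standard OpenRouter chat-completions request. Only the model name and temperature come from this page; the request structure, variable names, and error handling below are assumptions, not the project's code:

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "mistralai/mistral-small-3.2-24b-instruct:free"

def extract_structured_info(article_text, system_prompt, user_prompt_template, api_key):
    # Chat-completions request; temperature 0.1 keeps the JSON output near-deterministic
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": MODEL,
            "temperature": 0.1,
            "messages": [
                {"role": "system", "content": system_prompt},
                # the user prompt template contains a {text} placeholder for the article body
                {"role": "user", "content": user_prompt_template.format(text=article_text)},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]  # expected to be the JSON string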

API Endpoints

GET /api/wheat_disease — Retrieve records with optional filtering.
GET /api/wheat_disease/{id} — Get specific record by ID.
GET /api/wheat_disease/stats — Statistical overview.
Example Usage

CURL:

curl -u username:password "https://mediascraper.saralcodes.xyz/api/wheat_disease?country=India&limit=10"

Python:

import requests

response = requests.get(
    "https://mediascraper.saralcodes.xyz/api/wheat_disease?country=India&limit=10",
    auth=("username", "password")
)
print(response.json())
Authentication: All API endpoints require HTTP Basic Auth. Replace username:password with valid credentials.
Built for agricultural resilience in South Asia