Wheat Disease Classification Media Scrape
Automated system for collecting and analyzing wheat disease-related news articles from South Asia.
Overview
This Flask-based web application automatically scrapes, validates, and processes news articles related to wheat diseases (particularly wheat rust/yellow rust) from multiple sources including Google News and NewsAPI. The system focuses on South Asian regions (Nepal, India, Bangladesh, Bhutan) to track wheat disease outbreaks and agricultural news.
The system uses Large Language Models (LLM) to extract structured information from articles and provides both web interface and API endpoints for accessing the processed data.
Features
Multi-Source News Aggregation
- Google News Integration with regional filtering
- NewsAPI integration for professional sources
- Keyword-based search for wheat disease terminology
- Regional filtering for South Asian agricultural news
Intelligent Content Processing
- LLM-powered structured data extraction
- Automatic wheat variety identification
- Geographic location extraction
- Disease type and outbreak classification
Database Management
- SQLite backend with SQLAlchemy ORM
- Duplicate detection and URL validation
- Status tracking and retry logic
- Structured content storage
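The duplicate-detection scheme can be sketched with a SQLAlchemy model whose URL column carries a unique constraint; the table and column names below are illustrative assumptions, not the project's actual schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Article(Base):
    """Hypothetical article row; column names are assumptions for illustration."""
    __tablename__ = "articles"
    id = Column(Integer, primary_key=True)
    # A unique constraint on the URL makes SQLite reject duplicates at insert time.
    url = Column(String, unique=True, nullable=False)
    status = Column(String, default="new")

# In-memory SQLite database for demonstration.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
```

Inserting the same URL twice then raises an `IntegrityError`, which the scraper can catch to skip already-collected articles.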
API Access
- RESTful API endpoints for data access
- Filtering by country, district, date, and type
- Statistical data aggregation
- JSON response format
System Architecture
```
Data Collection        Content Extraction       LLM Processing          Storage & API
      ↓                        ↓                       ↓                       ↓
┌─────────────┐       ┌─────────────────┐     ┌─────────────────┐     ┌─────────────┐
│ Google News │       │ URL Validation  │     │ OpenRouter API  │     │ SQLite      │
│ NewsAPI     │──────▶│ newspaper3k lib │────▶│ Inference Server│────▶│ Database    │
│             │       │ HTTP 200 Check  │     │ Mistral Models  │     │ REST API    │
│             │       │ Text Extraction │     │ JSON Structure  │     │ Web App     │
└─────────────┘       └─────────────────┘     └─────────────────┘     └─────────────┘
```
Detailed Process Flow
- Data Collection: Automated scraping from Google News and NewsAPI using scheduled jobs; searches for wheat disease keywords with regional filtering for South Asia.
- Content Extraction: Validates URLs with an HTTP 200 check, then extracts article text and metadata with newspaper3k.
- LLM Processing: Processes article text using inference servers to extract structured information (country, district, date, news type, keywords, varieties, confidence).
- Storage & API: Stores processed data in SQLite and serves via RESTful API endpoints.
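The status tracking and retry logic mentioned above can be sketched as a small state transition; `status`, `retry_count`, and `MAX_RETRIES` are illustrative assumptions, not the project's actual field names:

```python
MAX_RETRIES = 3  # assumed retry budget per article

def next_status(record: dict, succeeded: bool) -> dict:
    """Advance one article record through the pipeline's retry states.

    Successful processing marks the record done; a failure increments the
    retry counter and either schedules a retry or gives up permanently.
    """
    if succeeded:
        record["status"] = "processed"
    else:
        record["retry_count"] = record.get("retry_count", 0) + 1
        record["status"] = (
            "failed" if record["retry_count"] >= MAX_RETRIES else "pending_retry"
        )
    return record
```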
LLM Prompts
System Prompt
You are a wheat disease expert that analyzes news articles about wheat diseases. Extract key information and return it in JSON format with the following fields.
- Required fields: `country`, `district` (array), `date` (YYYY-MM-DD), `news_type`, `keywords`, `variety`, `confidence_score`, `explanation`.
- District rules:
  - Always return districts as an array of individual district names (never a single comma-separated string).
  - Extract administrative district names only; do not include regional descriptions.
  - If multiple districts are mentioned, list each separately (e.g., `["Ambala", "Yamunanagar", "Karnal"]`).
  - Do not return regions like "Southern Punjab" or "Potohar region" instead of district names.
- Date rules:
  - Use actual disease observation dates only (YYYY-MM-DD). Do NOT use publication or byline dates.
  - Look for phrases such as "observed in", "sighted on", "reported on", "alert issued on".
  - If a date range is given, calculate and return the median date; if multiple sighting dates exist, return the earliest one; if none, return `null`.
- news_type: choose from `sighting`, `warning`, `advisory`, or `others`.
- Keywords & variety: return arrays of disease-related keywords and mentioned wheat varieties (e.g., `WH-711`).
- confidence_score: float between 0.0 and 1.0 expressing overall extraction confidence.
- Always return valid JSON containing all required fields.
- REMEMBER: Publication/byline dates (e.g., 'February 08, 2022') are forbidden; only use actual disease occurrence dates.
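The date rules above can be expressed in code. This is a minimal sketch of the post-processing a consumer might apply (the function name is an assumption, not part of the pipeline):

```python
from datetime import date

def resolve_observation_date(dates=None, date_range=None):
    """Apply the prompt's date rules: median of a range,
    else earliest of multiple sighting dates, else None."""
    if date_range is not None:
        start, end = date_range
        # Median (midpoint) of the range, per the prompt's range rule.
        return start + (end - start) // 2
    if dates:
        return min(dates)  # earliest sighting date wins
    return None
```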
User Prompt
Please analyze the provided article text and extract the structured fields described above.
- Focus on disease observation phrases: observed in, sighted on, reported on, alert issued on, etc.
- Extract districts as an array of names — do not return regional labels or a comma-separated string.
- If dates are ranges, compute the median; if multiple dates exist, return the earliest; if none, return `null`.
Full user prompt:
Please analyze this text: {text}
IMPORTANT: Look for actual disease sighting dates, not publication dates. Use phrases like 'observed in', 'last week of January', or 'alert issued on' for date extraction.
CRITICAL: The 'district' field must be a JSON array of strings, not a single string. For example: ["District1", "District2", "District3"] NOT "District1, District2, District3" or "Region (District1, District2, District3)"
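Since LLMs sometimes violate the array requirement despite the prompt, a defensive normalizer on the consumer side can help. This sketch (function name and heuristics are assumptions) coerces the two failure modes the prompt warns about:

```python
import re

def normalize_districts(value):
    """Coerce the LLM 'district' field into a list of district names.

    Handles the failure modes the prompt forbids: a single comma-separated
    string, or a 'Region (D1, D2)' wrapper around the district list.
    """
    if isinstance(value, list):
        return [str(v).strip() for v in value]
    # Strip a 'Region (...)' wrapper if present, keeping the parenthesized names.
    m = re.search(r"\(([^)]*)\)", value)
    if m:
        value = m.group(1)
    return [part.strip() for part in value.split(",") if part.strip()]
```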
LLM Output JSON Structure
```json
{
  "country": "India",
  "district": ["Ambala"],
  "date": "2024-02-15",
  "news_type": "sighting",
  "keywords": ["yellow rust", "wheat disease", "fungus"],
  "variety": ["WH-711", "HD-2967"],
  "confidence_score": 0.9,
  "explanation": "Yellow rust outbreak reported in Hussaini and Dhanaura villages"
}
```
Field Descriptions:
- country/district: Geographic location of the disease occurrence (district is an array of district names)
- date: Actual disease sighting date (YYYY-MM-DD format)
- news_type: Classification - 'sighting', 'warning', 'advisory', or 'others'
- keywords/variety: Arrays of disease-related keywords and mentioned wheat varieties
- confidence_score: Float between 0.0 and 1.0 expressing overall extraction confidence
- explanation: Brief justification for the extracted values
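The required fields and value constraints can be checked with a small validator before a record is stored. A hedged sketch, with illustrative names only:

```python
REQUIRED_FIELDS = {"country", "district", "date", "news_type",
                   "keywords", "variety", "confidence_score", "explanation"}
VALID_NEWS_TYPES = {"sighting", "warning", "advisory", "others"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one LLM output record (empty = valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if not isinstance(record.get("district"), list):
        problems.append("district must be an array")
    if record.get("news_type") not in VALID_NEWS_TYPES:
        problems.append("invalid news_type")
    score = record.get("confidence_score")
    if not (isinstance(score, (int, float)) and 0.0 <= score <= 1.0):
        problems.append("confidence_score out of range")
    return problems
```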
Search Strategy & Keywords
Disease Detection Keywords
AI Models & Processing
API Endpoints
- `/api/wheat_disease` — Retrieve records with optional filtering.
- `/api/wheat_disease/{id}` — Get specific record by ID.
- `/api/wheat_disease/stats` — Statistical overview.

Example Usage
CURL:

```shell
curl -u username:password "https://mediascraper.saralcodes.xyz/api/wheat_disease?country=India&limit=10"
```

Python:

```python
import requests

response = requests.get(
    "https://mediascraper.saralcodes.xyz/api/wheat_disease?country=India&limit=10",
    auth=("username", "password"),
)
print(response.json())
```
Replace `username:password` with valid credentials.