Introduction
Data extraction and filtering are crucial for developers working with large research datasets. Whether the source is government archives, industry reports, or academic research projects, extracting meaningful insights efficiently can be challenging.
In this post, I'll walk through how we tackled one such problem: filtering and extracting research project records from structured JSON data based on species names. We started with regular expressions (regex), ran into their limitations, and ultimately built a Python automation solution that dynamically extracts project numbers and other relevant fields.
The challenge: Filtering JSON records with regex
Initially, our goal was straightforward: search for records containing a specific species name (e.g., "Mud Crab") within various fields of a large JSON dataset. Given the structured nature of the data, our first instinct was to use regex for pattern matching.
However, with a total dataset of 487,250 records, we knew that testing needed to be methodical. We began with a single record, carefully validating the output before scaling up to 20 records, then 100, and eventually 1,000. At each stage, we conducted rigorous accuracy checks, ensuring that every matched result was 100% correct before increasing the dataset size. This iterative approach allowed us to refine our filtering logic and validate performance before applying it to the full dataset.
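As a rough sketch of what that staging looked like in practice, the idea was simply to run the filter over increasing sample sizes and pause for a manual accuracy check at each step. The `records` list and `matches` predicate below are placeholders for the real dataset and filtering logic, not the final code:
SAMPLE_SIZES = [1, 20, 100, 1000]

def validate_in_stages(records, matches):
    """Run the filter over growing samples and report counts for manual review."""
    for size in SAMPLE_SIZES:
        sample = records[:size]
        hits = [record for record in sample if matches(record)]
        # Pause here and spot-check every hit before moving to the next stage.
        print(f"Sample of {size}: {len(hits)} matching records to review")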
Consider this sample dataset:
{
    "value": [
        {
            "ID": 101,
            "ProjectNumber": "2020-001",
            "Title": "Sustainable aquaculture for bluefin tuna",
            "Background": "This research focuses on the long-term sustainability of bluefin tuna farming.",
            "Need": "There is an increasing global demand for sustainable seafood.",
            "Tags": "Aquaculture;Tuna;Fisheries",
            "Program": "Sustainable Fisheries"
        },
        {
            "ID": 102,
            "ProjectNumber": "2021-045",
            "Title": "Mud Crab research and habitat restoration",
            "Background": "Mud Crab populations are declining due to environmental changes.",
            "Need": "Investigating suitable habitat restoration methods for Mud Crab sustainability.",
            "Tags": "Aquaculture;Mud Crab;Conservation",
            "Program": "Marine Research"
        }
    ]
}
The first approach was to use regex patterns in VS Code to filter records where `"Mud Crab"` appeared in specific fields, like `"Background"`, `"Need"`, or `"Tags"`.
Initial regex pattern:
(?:(Background|Need|Tags|Title|Objectives|OutputsOutcomes|Program|Buckets)":\s*"[^"]*Mud Crab[^"]*")
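Before relying on editor search alone, it's worth sanity-checking the pattern programmatically. The snippet below is just an illustration of the same expression applied with Python's `re` module over the raw file text; it isn't part of the final tool:
import re

PATTERN = r'(?:(Background|Need|Tags|Title|Objectives|OutputsOutcomes|Program|Buckets)":\s*"[^"]*Mud Crab[^"]*")'

# Read the dataset as plain text and apply the same pattern used in VS Code.
with open("projects-08.json", "r", encoding="utf-8") as file:
    raw_text = file.read()

# findall returns the captured field name for each field/value pair that matches.
print(re.findall(PATTERN, raw_text))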
This approach worked for simple queries but quickly broke down when:
- Records contained multiple matches, making it difficult to enforce conditions like "exactly one occurrence"
- Case sensitivity and variations in spacing caused some matches to be missed
- Complex JSON structures required more flexible filtering beyond what regex could easily handle
At this point, we realised that regex alone wasn't enough; we needed a more structured approach using Python.
Moving to Python for structured filtering
To overcome regex limitations, we built a Python-based filtering tool. This tool:
- Loads JSON data dynamically
- Filters records based on occurrences of a species name
- Extracts project numbers if required
- Handles variations in text and structure robustly
Building a Python class for structured filtering
The first iteration of our Python solution was a class-based program that efficiently handled filtering:
import json
import re


class SpeciesFilter:
    """Filters JSON research project records based on species name occurrences."""

    TARGET_FIELDS = ["Background", "Need", "Tags", "Title", "Objectives", "OutputsOutcomes", "Program", "Buckets"]

    def __init__(self, json_filename, species_name, min_occurrences=1):
        self.json_filename = json_filename
        self.species_name = species_name
        self.min_occurrences = min_occurrences
        self.records = self.load_json()
        self.filtered_records = []

    def load_json(self):
        """Loads JSON data and verifies its structure."""
        with open(self.json_filename, "r", encoding="utf-8") as file:
            data = json.load(file)
        return data.get("value", [])

    def count_occurrences_per_record(self, record):
        """Counts occurrences of the species name in a record."""
        count = sum(
            len(re.findall(fr"\b{re.escape(self.species_name)}\b", record.get(field, ""), re.IGNORECASE))
            for field in self.TARGET_FIELDS
        )
        return count

    def filter_records(self):
        """Filters records based on species occurrences."""
        self.filtered_records = [
            record for record in self.records if self.count_occurrences_per_record(record) >= self.min_occurrences
        ]
        print(f"Matching records: {len(self.filtered_records)}")

    def display_project_numbers(self):
        """Displays project numbers if available."""
        print("Project Numbers:")
        for record in self.filtered_records:
            print(record.get("ProjectNumber", "N/A"))

    def run(self, project_number_flag=False):
        """Runs the filtering process and displays results."""
        self.filter_records()
        if project_number_flag:
            self.display_project_numbers()
        else:
            print(json.dumps(self.filtered_records, indent=4))


# Example usage:
# species_filter = SpeciesFilter("projects-08.json", "Mud Crab", min_occurrences=1)
# species_filter.run(project_number_flag=True)
Enhancing the script with command-line functionality
To make the script more dynamic, we enabled command-line arguments so users could specify:
- The JSON file to process
- The species name to search for
- The minimum occurrences required for a match
- Whether to only return project numbers
import sys

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python main.py <json_file> <species_name> [min_occurrences] [project_number=true]")
        sys.exit(1)

    json_filename = sys.argv[1]
    species_name = sys.argv[2]
    min_occurrences = int(sys.argv[3]) if len(sys.argv) > 3 and sys.argv[3].isdigit() else 1
    project_number_flag = "project_number=true" in sys.argv

    species_filter = SpeciesFilter(json_filename, species_name, min_occurrences)
    species_filter.run(project_number_flag)
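As a side note on the design, the same interface could also be built with Python's standard argparse module, which provides help text and type conversion for free. The sketch below is one possible equivalent (with a `--project-number` flag standing in for the `project_number=true` argument), not the version described above:
import argparse

parser = argparse.ArgumentParser(description="Filter research project records by species name.")
parser.add_argument("json_file", help="Path to the JSON dataset")
parser.add_argument("species_name", help="Species name to search for")
parser.add_argument("min_occurrences", nargs="?", type=int, default=1,
                    help="Minimum number of occurrences required for a match")
parser.add_argument("--project-number", action="store_true",
                    help="Only print project numbers for matching records")

args = parser.parse_args()
species_filter = SpeciesFilter(args.json_file, args.species_name, args.min_occurrences)
species_filter.run(project_number_flag=args.project_number)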
Final outcome: A dynamic and robust solution
This approach provided a scalable and maintainable solution for filtering JSON research records:
- Regex helped prototype the problem but was too rigid
- Python offered flexibility and accurate filtering
- The command-line tool allowed users to dynamically modify filters
- The script was optimised for real-world research datasets
Example commands and results:
# Find all projects mentioning "Mud Crab" and display project numbers
python3 main.py projects-08.json "Mud Crab" project_number=true
# Find all projects where "Mud Crab" appears at least twice
python3 main.py projects-08.json "Mud Crab" 2
Example output (project numbers only)
Matching records: 3
Project Numbers:
2021-045
2018-322
2022-654