
Introduction

Data extraction and filtering are crucial skills for developers working with large research datasets. Whether you're dealing with government archives, industry reports, or academic research projects, extracting meaningful insights efficiently can be challenging.

In this article, I'll explore how we tackled a specific problem: filtering and extracting research project records based on species names from structured JSON data. We started with regular expressions (regex), ran into their limitations, and ultimately built a Python automation solution that dynamically extracts project numbers and other relevant fields.

 

The challenge: Filtering JSON records with regex

Initially, our goal was straightforward: search for records containing a specific species name (e.g., "Mud Crab") within various fields of a large JSON dataset. Given the structured nature of the data, our first instinct was to use regex for pattern matching.

However, with a total dataset of 487,250 records, we knew that testing needed to be methodical. We began with a single record, carefully validating the output before scaling up to 20 records, then 100, and eventually 1,000. At each stage, we conducted rigorous accuracy checks, ensuring that every matched result was 100% correct before increasing the dataset size. This iterative approach allowed us to refine our filtering logic and validate performance before applying it to the full dataset.
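That staged ramp-up is easy to script. A minimal sketch of the idea (the slice sizes mirror the stages described above; `records` here is stand-in data, not the real dataset):

```python
# Sketch of the staged validation approach: filter a small slice first,
# verify every match by hand, then scale up to larger samples.
SAMPLE_SIZES = [1, 20, 100, 1000]

def validation_batches(records, sizes=SAMPLE_SIZES):
    """Yield progressively larger slices of the dataset for manual accuracy checks."""
    for size in sizes:
        yield records[:size]

# Stand-in data; the real dataset held 487,250 records.
records = [{"ID": i} for i in range(1000)]
for batch in validation_batches(records):
    print(f"Validating {len(batch)} records")
```

Only once a batch passed a 100% accuracy check did we move to the next size.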

Consider this sample dataset:

{
   "value": [
       {
           "ID": 101,
           "ProjectNumber": "2020-001",
           "Title": "Sustainable aquaculture for bluefin tuna",
           "Background": "This research focuses on the long-term sustainability of bluefin tuna farming.",
           "Need": "There is an increasing global demand for sustainable seafood.",
           "Tags": "Aquaculture;Tuna;Fisheries",
           "Program": "Sustainable Fisheries"
       },
       {
           "ID": 102,
           "ProjectNumber": "2021-045",
           "Title": "Mud Crab research and habitat restoration",
           "Background": "Mud Crab populations are declining due to environmental changes.",
           "Need": "Investigating suitable habitat restoration methods for Mud Crab sustainability.",
           "Tags": "Aquaculture;Mud Crab;Conservation",
           "Program": "Marine Research"
       }
   ]
} 

 

The first approach was to use regex patterns in VS Code to filter records where `"Mud Crab"` appeared in specific fields, like `"Background"`, `"Need"`, or `"Tags"`. 

 

Initial regex pattern:

"(Background|Need|Tags|Title|Objectives|OutputsOutcomes|Program|Buckets)":\s*"[^"]*Mud Crab[^"]*"

This approach worked for simple queries but quickly broke down when:

  1. Records contained multiple matches, making it difficult to enforce conditions like "exactly one occurrence"
  2. Case sensitivity and variations in spacing caused some matches to be missed
  3. Complex JSON structures required more flexible filtering beyond what regex could easily handle
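The case-sensitivity failure in particular is easy to reproduce. A small sketch, using an abridged field list and a made-up record line:

```python
import re

# A made-up record line with the species name in lowercase.
line = '"Background": "mud crab populations are declining."'

# Abridged version of the field-matching pattern from above.
pattern = r'"(Background|Need|Tags)":\s*"[^"]*Mud Crab[^"]*"'

# A case-sensitive search misses the lowercase variant...
print(bool(re.search(pattern, line)))                 # False
# ...while re.IGNORECASE finds it.
print(bool(re.search(pattern, line, re.IGNORECASE)))  # True
```

Flags like `re.IGNORECASE` help, but enforcing occurrence counts and cross-field conditions still pushed us beyond what a single pattern could express.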

At this point, we realised that regex alone wasn't enough; we needed a more structured approach using Python.

 

Moving to Python for structured filtering

To overcome regex limitations, we built a Python-based filtering tool. This tool:

  • Loads JSON data dynamically
  • Filters records based on occurrences of a species name
  • Extracts project numbers if required
  • Handles variations in text and structure robustly

 

Building a Python class for structured filtering

The first iteration of our Python solution was a class-based program that efficiently handled filtering:

import json
import re


class SpeciesFilter:
    """Filters JSON research project records based on species name occurrences."""

    TARGET_FIELDS = ["Background", "Need", "Tags", "Title", "Objectives", "OutputsOutcomes", "Program", "Buckets"]

    def __init__(self, json_filename, species_name, min_occurrences=1):
        self.json_filename = json_filename
        self.species_name = species_name
        self.min_occurrences = min_occurrences
        self.records = self.load_json()
        self.filtered_records = []

    def load_json(self):
        """Loads JSON data and verifies its structure."""
        with open(self.json_filename, "r", encoding="utf-8") as file:
            data = json.load(file)
        return data.get("value", [])

    def count_occurrences_per_record(self, record):
        """Counts whole-word occurrences of the species name in a record."""
        # `or ""` guards against fields that exist but hold null values.
        return sum(
            len(re.findall(fr"\b{re.escape(self.species_name)}\b", record.get(field) or "", re.IGNORECASE))
            for field in self.TARGET_FIELDS
        )

    def filter_records(self):
        """Filters records based on species occurrences."""
        self.filtered_records = [
            record for record in self.records
            if self.count_occurrences_per_record(record) >= self.min_occurrences
        ]
        print(f"Matching records: {len(self.filtered_records)}")

    def display_project_numbers(self):
        """Displays project numbers if available."""
        print("Project Numbers:")
        for record in self.filtered_records:
            print(record.get("ProjectNumber", "N/A"))

    def run(self, project_number_flag=False):
        """Runs the filtering process and displays results."""
        self.filter_records()
        if project_number_flag:
            self.display_project_numbers()
        else:
            print(json.dumps(self.filtered_records, indent=4))


# Example usage:
# species_filter = SpeciesFilter("projects-08.json", "Mud Crab", min_occurrences=1)
# species_filter.run(project_number_flag=True)
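The whole-word counting at the heart of the class can be exercised in isolation. A minimal sketch with a made-up record and an abridged field list:

```python
import re

# Abridged version of the class's TARGET_FIELDS.
TARGET_FIELDS = ["Background", "Need", "Tags", "Title"]

def count_occurrences(record, species_name):
    """Count whole-word, case-insensitive occurrences across the target fields."""
    pattern = fr"\b{re.escape(species_name)}\b"
    # `or ""` guards against fields that exist but hold null values.
    return sum(
        len(re.findall(pattern, record.get(field) or "", re.IGNORECASE))
        for field in TARGET_FIELDS
    )

record = {
    "Title": "Mud Crab research and habitat restoration",
    "Background": "Mud Crab populations are declining.",
    "Tags": "Aquaculture;Mud Crab;Conservation",
}
print(count_occurrences(record, "Mud Crab"))  # 3
```

The `\b` word boundaries stop "Mud Crab" from matching inside longer strings, and `re.escape` keeps species names with special characters from being interpreted as regex syntax.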

 

Enhancing the script with command-line functionality

To make the script more dynamic, we enabled command-line arguments so users could specify:

  • The JSON file to process
  • The species name to search for
  • The minimum occurrences required for a match
  • Whether to only return project numbers

import sys

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python main.py <json_file> <species_name> [min_occurrences] [project_number=true]")
        sys.exit(1)

    json_filename = sys.argv[1]
    species_name = sys.argv[2]
    min_occurrences = int(sys.argv[3]) if len(sys.argv) > 3 and sys.argv[3].isdigit() else 1
    project_number_flag = "project_number=true" in sys.argv

    species_filter = SpeciesFilter(json_filename, species_name, min_occurrences)
    species_filter.run(project_number_flag)
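The manual `sys.argv` parsing above works, but the standard library's `argparse` would make the options self-documenting and give a free `--help` screen. A possible sketch (not the script the article ships; the `--project-numbers` flag name is an assumption):

```python
import argparse

# Hypothetical argparse equivalent of the manual sys.argv parsing.
parser = argparse.ArgumentParser(description="Filter research records by species name.")
parser.add_argument("json_file", help="Path to the JSON dataset")
parser.add_argument("species_name", help="Species name to search for")
parser.add_argument("min_occurrences", nargs="?", type=int, default=1,
                    help="Minimum number of occurrences required (default: 1)")
parser.add_argument("--project-numbers", action="store_true",
                    help="Print only the matching project numbers")

# Parse an example argument list instead of the live command line.
args = parser.parse_args(["projects-08.json", "Mud Crab", "2", "--project-numbers"])
print(args.species_name, args.min_occurrences, args.project_numbers)  # Mud Crab 2 True
```

With `argparse`, mistyped flags fail loudly instead of silently falling back to defaults.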

 

Final outcome: A dynamic and robust solution

This approach provided a scalable and maintainable solution to filtering JSON research records:

  • Regex helped prototype the problem but was too rigid
  • Python offered flexibility and accurate filtering
  • The command-line tool allowed users to dynamically modify filters
  • The script was optimised for real-world research datasets

Example commands and results:

# Find all projects mentioning "Mud Crab" and display project numbers

python3 main.py projects-08.json "Mud Crab" project_number=true

# Find all projects where "Mud Crab" appears at least twice

python3 main.py projects-08.json "Mud Crab" 2

 

Example output (Project Numbers Only)

Matching records: 3
Project Numbers:
2021-045
2018-322
2022-654

 
