
We are working on a project where the JSON dataset contains over 460,000 named records, and we are preparing to upsert these records into Pinecone. For validation and testing purposes, it's essential to cross-check how many records contain a specific term. To ensure data integrity, each record should be counted once, regardless of whether the term appears once or multiple times within it.

This is where regular expressions, commonly known as regex, come into play. Regex is a powerful tool for pattern matching and text manipulation. Whether you're a seasoned developer or just starting out, mastering regex can significantly enhance your ability to handle and analyse data efficiently.

In this article, we'll explore how to use regex in Visual Studio Code (VS Code) to count the number of records in which a specific term appears, particularly when the data is organised into arrays. We'll use the term 'SBT' as our primary example.

 

Understanding regular expressions

Regular expressions are sequences of characters that define a search pattern. They are widely used for string searching and manipulation, allowing users to perform complex searches with concise syntax. Regex can be applied in various programming languages and tools, including VS Code, making it an essential skill for developers, data analysts, and anyone working with text data.

 

Key components of regex

Literals: Exact characters to match (e.g., SBT).
Metacharacters: Special characters and escapes that control how the regex operates (e.g., . and |, or shorthand classes such as \w and \s).
Quantifiers: Specify how many instances of a character or group are required (e.g., *, +, ?).
Anchors: Define the position in the text (e.g., ^ for the start, $ for the end).
Groups and ranges: Allow for complex matching patterns (e.g., (abc), [A-Z]).

Understanding these components is crucial for crafting effective regex patterns tailored to your specific needs.
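As a rough illustration of these components, here is a minimal sketch using Python's re module. The sample string is made up for the example, and Python is used only for convenience: VS Code uses a different regex engine, but these simple constructs behave the same way in both.

```python
import re

# Made-up sample text for illustration only.
text = "ID 261: Weaning of SBT larvae"

# Literal: matches the exact characters "SBT".
literal = re.findall(r"SBT", text)

# Shorthand class plus quantifier: \d+ matches one or more digits.
digits = re.findall(r"\d+", text)

# Anchor: ^ only matches at the start of the text.
starts_with_id = re.search(r"^ID", text) is not None

# Group and range: ([A-Z]{3}) captures exactly three uppercase letters,
# and \b on each side keeps them from being part of a longer word.
acronym = re.search(r"\b([A-Z]{3})\b", text).group(1)

print(literal, digits, starts_with_id, acronym)
```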

 

Setting up VS Code for regex searches

Visual Studio Code is a versatile and widely-used code editor that supports regex searches out of the box. To perform a regex search in VS Code:

  1. Open the Find Dialog:
    • Press Ctrl + F (Windows/Linux) or Cmd + F (Mac) to open the search bar.
  2. Enable Regex Mode:
    • Click on the .* icon in the search bar to activate regex mode. When enabled, VS Code will interpret your search query as a regex pattern.

With regex mode active, you can now input complex search patterns to locate specific terms or patterns within your files.

 

Crafting a regex to count term occurrences

In this section, we create a regex pattern that counts how many objects within a JSON array contain the term SBT. We'll break down the process step by step, using a sample dataset for illustration.

 

Sample JSON data

Consider the following JSON structure, where the key "value" holds an array of objects:

{
   "value": [
       {
           "ID": 1,
           "Title": "Guidance on Adaptation of Commonwealth Fisheries management to climate change",
           "Tags": "Climate Adaptation;Climate Change;Ecosystem;Fisheries Management;RAC CMWTH"
       },
       {
           "ID": 261,
           "Title": "SCRC: Visiting Expert: Weaning of SBT larvae; training of CST Hatchery staff (Nick King)",
           "Tags": "Aquaculture;Education"
       },
       {
           "ID": 328,
           "Title": "Next-generation Close-kin Mark Recapture: using SNPs to identify half-sibling pairs in Southern Bluefin Tuna and estimate abundance, mortality and selectivity",
           "Tags": "Biology;Biomass;Data;Economic;Fisheries Management;Genomics;Modelling;Mortality;Processing & packaging;Publication;RAC CMWTH;Reproduction;Stock Assessment"
       }
   ]
}

In this dataset:

  • The first object does not contain the term SBT;
  • The second object contains SBT once;
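  • The third object contains SBT multiple times in the full record (the trimmed excerpt above only shows the spelled-out form, Southern Bluefin Tuna).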

Our goal is to count how many objects contain SBT, which in this case should be 2.

 

Constructing the regex pattern

To achieve this, we'll create a regex pattern that matches entire objects containing SBT as an isolated term. Here's how to construct and apply the regex:

 

Step 1: basic object matching

Start by crafting a regex that matches an entire JSON object:

\{[\s\S]*?SBT[\s\S]*?\}

Explanation of the regex pattern:

\{: Matches the opening curly brace { of an object

[\s\S]*?: Non-greedy match of any characters (including newlines) before SBT

SBT: The exact term we're searching for

[\s\S]*?: Non-greedy match of any characters after SBT

\}: Matches the closing curly brace } of an object

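This pattern finds every object that contains the term SBT. Note that a match always begins at the first available { and ends at the first } after SBT, so a highlighted match can start in an earlier object (or at the outer brace) and run through the object that contains the term. For flat, non-nested objects, the number of matches still equals the number of objects containing SBT.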
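To see how this pattern behaves, the following sketch runs it with Python's re module against a small, hypothetical dataset (not the article's real records). It also tries a stricter variant, \{[^{}]*SBT[^{}]*\}, which is an assumption added here rather than part of the original pattern: it refuses to cross a brace, so each match is confined to a single flat object.

```python
import re

# Hypothetical, trimmed-down records used only for this illustration.
sample = '''{
   "value": [
       {"ID": 1, "Title": "Fisheries management and climate change"},
       {"ID": 2, "Title": "Weaning of SBT larvae"},
       {"ID": 3, "Title": "SBT abundance", "Tags": "SBT;Stock Assessment"}
   ]
}'''

# The pattern from step 1: lazy matching from a "{" to the first "}" after SBT.
lazy = re.compile(r"\{[\s\S]*?SBT[\s\S]*?\}")
lazy_matches = lazy.findall(sample)

# Stricter variant (an assumption, not from the article): never cross a brace.
strict = re.compile(r"\{[^{}]*SBT[^{}]*\}")
strict_matches = strict.findall(sample)

print(len(lazy_matches), len(strict_matches))
# The first lazy match starts at the outer brace, not at record 2.
print(lazy_matches[0].startswith('{\n   "value"'))
# The first strict match is exactly record 2.
print(strict_matches[0])
```

Both patterns report the same count for flat objects; the stricter one simply makes each highlighted match easier to read.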

 

Step 2: ensuring term isolation

However, we want to ensure that SBT is matched as an isolated term, not as part of another word (e.g., "SBTCorp"). To enforce this, we can enhance our regex with isolation logic:

\{[\s\S]*?(?<!\w)\s*SBT\s*(?!\w)[\s\S]*?\}

Breakdown:

(?<!\w): Negative lookbehind to ensure SBT isn't preceded by a word character

\s*: Allows for optional whitespace before SBT

SBT: The exact term to match

\s*: Allows for optional whitespace after SBT

(?!\w): Negative lookahead to ensure SBT isn't followed by a word character

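This refined pattern ensures that SBT stands alone: it may be bordered by spaces, punctuation, quotes, or the start or end of the text, but not by a letter, digit, or underscore. Because the lookarounds already allow adjacent whitespace, the two \s* parts are optional, and the pattern finds the same objects with or without them.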
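The isolation logic can be checked on its own, without the surrounding object pattern. The sketch below uses Python's re module and a handful of invented titles to compare the lookarounds with and without the optional \s* parts.

```python
import re

# Isolation logic only, with and without the optional whitespace parts.
isolated = re.compile(r"(?<!\w)SBT(?!\w)")
with_spaces = re.compile(r"(?<!\w)\s*SBT\s*(?!\w)")

# Invented titles for illustration.
titles = [
    "Weaning of SBT larvae",     # isolated term
    "Tuna (SBT) stock report",   # wrapped in parentheses
    "SBTCorp annual report",     # followed by a word character
    "The WSBT radio interview",  # preceded by a word character
]

results = [
    (bool(isolated.search(t)), bool(with_spaces.search(t))) for t in titles
]

for title, (plain, spaced) in zip(titles, results):
    print(f"{title!r}: {plain} {spaced}")
```

For these inputs both variants agree on every title, which suggests the shorter form is enough when only the isolation check matters.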

 

Applying the regex in VS Code

Follow these steps to apply the regex in VS Code:

  1. Open the Find Dialog
    • Press Ctrl + F (Windows/Linux) or Cmd + F (Mac).
  2. Enable Regex Mode
    • Click on the .* icon to activate regex search.
  3. Enter the Combined Regex Pattern
    • \{[\s\S]*?(?<!\w)\s*SBT\s*(?!\w)[\s\S]*?\}
  4. Review the Matches
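    • VS Code will highlight each match, and each match corresponds to one object containing SBT as an isolated term. If multi-line matches are not found in your version of VS Code, adding an explicit newline to the character class, as in [\s\S\n]*?, is a commonly used workaround.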
  5. Count the Matches
    • The Find widget shows the total number of matches (for example, "1 of 2"). This number represents the number of objects containing SBT. In our sample data, this should be 2.

 

Alternative: using word boundaries

If your isolation requirements are simpler, you can use word boundaries (\b) to ensure SBT isn't part of another word:

\{[\s\S]*?\bSBT\b[\s\S]*?\}

Explanation of the regex pattern:

\b: Word boundary anchors that ensure SBT isn't part of a longer word.

This approach is more straightforward. Because SBT begins and ends with a word character, \bSBT\b matches in the same places as the lookaround version, including when the term is surrounded by spaces or special characters like parentheses.
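One detail worth checking is what counts as a word character: letters, digits, and the underscore all do, while hyphens and parentheses do not. The following sketch, using Python's re module and invented strings, shows how \bSBT\b treats each case.

```python
import re

word_bounded = re.compile(r"\bSBT\b")

# Invented strings mapped to the result we expect from the pattern.
cases = {
    "SBT larvae": True,
    "(SBT)": True,
    "SBT-2024": True,    # a hyphen is not a word character
    "SBT_2024": False,   # an underscore counts as a word character
    "SBT2": False,       # digits count as word characters
}

actual = {text: bool(word_bounded.search(text)) for text in cases}

for text, found in actual.items():
    print(f"{text!r}: {found}")
```

If identifiers such as SBT_2024 should also count as occurrences in your data, the boundary logic would need to be adjusted accordingly.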

 

Practical example: counting SBT occurrences

Let's apply our regex to the sample JSON data.

First Object

Title: "Guidance on Adaptation of Commonwealth Fisheries management to climate change"
Tags: "Climate Adaptation;Climate Change;Ecosystem;Fisheries Management;RAC CMWTH"
Result: No match (does not contain SBT)

Second Object

Title: "SCRC: Visiting Expert: Weaning of SBT larvae; training of CST Hatchery staff (Nick King)"
Tags: "Aquaculture;Education"
Result: Match (contains SBT once)

Third Object

Title: "Next-generation Close-kin Mark Recapture: using SNPs to identify half-sibling pairs in Southern Bluefin Tuna and estimate abundance, mortality and selectivity"
Tags: "Biology;Biomass;Data;Economic;Fisheries Management;Genomics;Modelling;Mortality;Processing & packaging;Publication;RAC CMWTH;Reproduction;Stock Assessment"
Result: Match in the full record (contains SBT multiple times; the trimmed excerpt shown here only spells out Southern Bluefin Tuna)

Applying our regex correctly identifies 2 objects containing SBT, aligning with our expectations.

 

Additional tips for effective regex usage in VS Code

Test your regex

Before applying it in VS Code, use online tools like regex101.com to test and visualise your regex patterns.

 

Handle case sensitivity

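Searches in VS Code, including regex searches, ignore case unless Match Case is enabled. To match only the uppercase SBT, turn on the Aa (Match Case) icon in the search bar; leave it off if you also want to match "sbt" or "Sbt". The search box takes the bare pattern rather than the /pattern/flags form. In JavaScript, or on regex101.com, the case-insensitive equivalent uses the i flag: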

/\{[\s\S]*?(?<!\w)\s*SBT\s*(?!\w)[\s\S]*?\}/gi
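For comparison, here is a minimal sketch of the same idea in Python, where the flag is passed as re.IGNORECASE instead of a trailing i. The sample text is invented for the example.

```python
import re

# Invented sample text containing three spellings of the term.
text = "sbt, Sbt and SBT"

# Case-sensitive: only the uppercase form matches.
case_sensitive = re.findall(r"\bSBT\b", text)

# Case-insensitive: all three spellings match.
case_insensitive = re.findall(r"\bSBT\b", text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```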

 

Manage nested objects

The provided regex assumes objects are not nested within each other. For JSON with nested objects, you may need more advanced patterns or consider using a JSON parser.
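A parser-based count might look like the sketch below. It assumes the same top-level "value" array as the sample data, but the records themselves, the nested "Meta" field, and the helper name contains_term are hypothetical. Each record is counted once, however many times the term appears inside it.

```python
import json
import re

# Hypothetical records; the nested "Meta" objects would confuse a brace-based regex.
raw = '''{
  "value": [
    {"ID": 1, "Title": "Fisheries management", "Meta": {"Note": "no match here"}},
    {"ID": 2, "Title": "Weaning of SBT larvae", "Meta": {"Note": "SBT hatchery"}},
    {"ID": 3, "Title": "Tuna abundance", "Meta": {"Note": "SBT; see also SBT stock"}}
  ]
}'''

term = re.compile(r"\bSBT\b")


def contains_term(node):
    """Return True if any string value anywhere inside node contains the term."""
    if isinstance(node, str):
        return term.search(node) is not None
    if isinstance(node, dict):
        return any(contains_term(value) for value in node.values())
    if isinstance(node, list):
        return any(contains_term(item) for item in node)
    return False  # numbers, booleans, and null cannot contain the term


records = json.loads(raw)["value"]
matching_ids = [record["ID"] for record in records if contains_term(record)]

print(len(matching_ids), matching_ids)
```

This approach checks values only, not key names, and it does not depend on how the objects are nested or formatted.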

 

Use capturing groups

If you need to extract specific parts of the matched objects, incorporate capturing groups () into your regex.
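As one possible illustration, the sketch below captures the numeric ID of each flat object that mentions SBT. It uses Python's re module, invented records, and a brace-excluding pattern that is an assumption of this example; it also assumes the "ID" field appears before the term within each object.

```python
import re

# Invented flat records for illustration.
sample = '''[
  {"ID": 261, "Title": "Weaning of SBT larvae"},
  {"ID": 300, "Title": "Fisheries management"},
  {"ID": 328, "Title": "SBT abundance estimates"}
]'''

# Group 1 captures the digits after "ID": in any object that also contains SBT.
# [^{}]* keeps the match inside a single flat object.
pattern = re.compile(r'\{[^{}]*"ID":\s*(\d+)[^{}]*\bSBT\b[^{}]*\}')

ids = [int(match.group(1)) for match in pattern.finditer(sample)]

print(ids)
```

In VS Code's Replace field, the same captured group can be referenced as $1.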

 

The wrap

Regular expressions are invaluable for efficiently searching and analysing text data, including structured formats like JSON. By leveraging regex within VS Code, you can swiftly count occurrences of specific terms within datasets, ensuring accurate and targeted data analysis. Using SBT as an example, we've demonstrated how to construct and apply regex patterns to identify and count term occurrences within JSON arrays.

Whether you're managing large datasets, performing data validation, or automating text processing tasks, mastering regex in VS Code can significantly enhance your productivity and accuracy.
