Using a meta-meta-prompt to create a self-documenting Python script for data cleaning

Author: S. Keshav

Affiliation: University of Cambridge

Published: October 21, 2025

This document (written by AI after the point specified below) shows how one can use a meta-meta-prompt to generate a meta-prompt for a task such that the task is self-documenting, that is, the AI documents both the prompt used to generate the task code and the code itself. This may sound confusing, so let’s break it down.

  1. The task is to generate Python code to clean a data file called data.csv. The Python code is inserted into a Quarto markdown document as an executable Python block, so that when the document is rendered, the code is run automatically and its results are placed in the document (a minimal example of such a block appears after this list).
  2. We therefore need a prompt that specifies the task and also asks the AI to document both the prompt and the output of the task in a Quarto markdown document.
  3. The AI is first asked how to create such a prompt.
  4. Since we are asking the AI how to create a prompt, we are discussing a meta-prompt; that is, this first prompt is a meta-meta-prompt.
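
For concreteness, here is a minimal sketch of what such a self-documenting, executable block looks like in a Quarto markdown (.qmd) source file. The chunk options are the standard Quarto ones requested in the meta-prompt below; the print statement is just a placeholder:

```{python}
#| code-fold: true
#| code-summary: "Show code"
# Quarto runs this block when the document is rendered and places
# the printed output below the collapsed ("folded") code listing.
print("hello world")
```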

Meta-Meta-Prompt

Note: Meta-Meta-Prompt

I would like your help with a complicated bit of thinking. What prompt can I give you so that you produce not only some output, but also a quarto markdown document that uses a callout to state the prompt and a ‘folded’ code block to state your response. For example, if I gave you the prompt ‘write a python script to print “hello world”’, then your output should be 1) the python script and 2) a quarto markdown document that gives this prompt and your response.

Response to Meta-Meta-Prompt

I can help you create a meta-prompt for this! Here’s what you could use:

Prompt to give me:

“Please complete the following task AND create a Quarto markdown document that records both the prompt and your response.

Task: [YOUR TASK HERE]

The Quarto document should include:

  1. A callout block containing this prompt
  2. A folded code block containing your complete response

Please provide both the direct answer and the Quarto markdown document as separate artifacts.”

Would you like me to demonstrate this with a specific example, or would you like to refine the meta-prompt structure further?

Meta-Prompt

Note

Please complete the following task AND create a Quarto markdown document with executable code blocks.

I would like you to write python code to clean data in a CSV file called data.csv. The fields in the file are in the order scientific name, GPS location (lat, long), Age, trunk circumference, height.

Please remove:

• Any trees lying outside the bounding box 17.186736, 78.208403, 17.168470, 78.233911

• Any trees without a GPS location

• Any trees whose age was marked with an implausible number (negative or more than 99 years)

• Any trees with a trunk circumference that was missing, or implausibly small (< 5cm) or implausibly big (> 300 cm)

• Duplicate trees: If two trees are closer than 25 cm to each other, then they should be marked as duplicate. If they both share the scientific name, then one is kept (randomly chosen), if not, they are both discarded.

To find duplicates, find the Haversine distance between trees. Make the algorithm efficient by placing trees into a 1m grid and checking only nearby grid locations when checking for duplicates.

The Quarto document should include:

  1. A callout block (callout-note) containing this prompt
  2. An executable code block (using {python} or appropriate language) with:
     • #| code-fold: true to make it collapsible
     • #| code-summary: "Show code" for the fold label
     • Your complete code solution

When I render this document with quarto render, Quarto will execute the code and display the output automatically.

Everything after this line was generated by AI

Task Description

Note: Data Cleaning Task

I would like you to write python code to clean data in a CSV file called data.csv. The fields in the file are in the order scientific name, GPS location (lat, long), Age, trunk circumference, height.

Please remove:

• Any trees lying outside the bounding box 17.186736, 78.208403, 17.168470, 78.233911

• Any trees without a GPS location

• Any trees whose age was marked with an implausible number (negative or more than 99 years)

• Any trees with a trunk circumference that was missing, or implausibly small (< 5cm) or implausibly big (> 300 cm)

• Duplicate trees: If two trees are closer than 25 cm to each other, then they should be marked as duplicate. If they both share the scientific name, then one is kept (randomly chosen), if not, they are both discarded.

To find duplicates, find the Haversine distance between trees. Make the algorithm efficient by placing trees into a 1m grid and checking only nearby grid locations when checking for duplicates.

Data Cleaning Implementation

Show code
import pandas as pd
import numpy as np
from math import radians, sin, cos, sqrt, atan2
from collections import defaultdict

def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the Haversine distance between two points in meters.
    
    Parameters:
    -----------
    lat1, lon1 : float
        Latitude and longitude of first point in degrees
    lat2, lon2 : float
        Latitude and longitude of second point in degrees
    
    Returns:
    --------
    float
        Distance between points in meters
    """
    R = 6371000  # Earth's radius in meters
    
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    
    return R * c

def lat_lon_to_grid(lat, lon, grid_size=1.0):
    """
    Convert lat/lon to grid coordinates based on grid_size in meters.
    
    Parameters:
    -----------
    lat, lon : float
        Latitude and longitude in degrees
    grid_size : float
        Size of grid cell in meters (default: 1.0)
    
    Returns:
    --------
    tuple
        (grid_lat, grid_lon) grid coordinates
    """
    # Approximate conversion at given latitude
    meters_per_degree_lat = 111320
    meters_per_degree_lon = 111320 * cos(radians(lat))
    
    grid_lat = int(lat * meters_per_degree_lat / grid_size)
    grid_lon = int(lon * meters_per_degree_lon / grid_size)
    
    return (grid_lat, grid_lon)

def get_neighboring_cells(grid_cell):
    """
    Get the 9 neighboring cells (including the cell itself).
    
    Parameters:
    -----------
    grid_cell : tuple
        (grid_lat, grid_lon) coordinates
    
    Returns:
    --------
    list
        List of neighboring cell coordinates
    """
    grid_lat, grid_lon = grid_cell
    neighbors = []
    for dlat in [-1, 0, 1]:
        for dlon in [-1, 0, 1]:
            neighbors.append((grid_lat + dlat, grid_lon + dlon))
    return neighbors

def parse_gps(gps_string):
    """
    Parse GPS location string into lat, lon.
    
    Parameters:
    -----------
    gps_string : str
        GPS location in format "lat, lon"
    
    Returns:
    --------
    tuple
        (lat, lon) or (None, None) if parsing fails
    """
    if pd.isna(gps_string):
        return None, None
    try:
        parts = str(gps_string).split(',')
        if len(parts) == 2:
            lat = float(parts[0].strip())
            lon = float(parts[1].strip())
            return lat, lon
    except (ValueError, TypeError):
        pass
    return None, None

# Read the CSV file
df = pd.read_csv('data.csv', header=None, 
                 names=['scientific_name', 'gps_location', 'age', 'trunk_circumference', 'height'])

initial_count = len(df)
print(f"Initial number of trees: {initial_count}")

# Parse GPS location into separate lat and lon columns
df[['lat', 'lon']] = df['gps_location'].apply(lambda x: pd.Series(parse_gps(x)))

# Convert numeric columns to numeric types, coercing errors to NaN
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
df['lon'] = pd.to_numeric(df['lon'], errors='coerce')
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['trunk_circumference'] = pd.to_numeric(df['trunk_circumference'], errors='coerce')
df['height'] = pd.to_numeric(df['height'], errors='coerce')

# Step 1: Remove trees without GPS location
df = df.dropna(subset=['lat', 'lon'])
print(f"After removing trees without GPS: {len(df)}")

# Step 2: Remove trees outside bounding box
# Bounding box: the prompt gives two opposite corners (lat, lon); take the min/max per axis
lat_min, lat_max = 17.168470, 17.186736
lon_min, lon_max = 78.208403, 78.233911

df = df[(df['lat'] >= lat_min) & (df['lat'] <= lat_max) & 
        (df['lon'] >= lon_min) & (df['lon'] <= lon_max)]
print(f"After removing trees outside bounding box: {len(df)}")

# Step 3: Remove trees with implausible age
df = df[(df['age'] >= 0) & (df['age'] <= 99)]
print(f"After removing trees with implausible age: {len(df)}")

# Step 4: Remove trees with missing or implausible trunk circumference
df = df[df['trunk_circumference'].notna()]
df = df[(df['trunk_circumference'] >= 5) & (df['trunk_circumference'] <= 300)]
print(f"After removing trees with implausible trunk circumference: {len(df)}")

# Step 5: Remove duplicate trees using grid-based approach
# Reset index for easier manipulation
df = df.reset_index(drop=True)

# Create spatial grid (1m cells)
grid = defaultdict(list)
for idx, row in df.iterrows():
    cell = lat_lon_to_grid(row['lat'], row['lon'])
    grid[cell].append(idx)

# Track which trees to remove
to_remove = set()

# Check for duplicates within 25 cm
for cell, indices in grid.items():
    # Get neighboring cells
    neighbors = get_neighboring_cells(cell)
    
    # Collect all indices in neighboring cells
    candidate_indices = []
    for neighbor_cell in neighbors:
        if neighbor_cell in grid:
            candidate_indices.extend(grid[neighbor_cell])
    
    # Remove duplicates from candidate list
    candidate_indices = list(set(candidate_indices))
    
    # Check each pair
    for idx1 in indices:
        if idx1 in to_remove:
            continue
            
        for idx2 in candidate_indices:
            if idx2 <= idx1 or idx2 in to_remove:
                continue
            
            # Calculate distance
            row1 = df.loc[idx1]
            row2 = df.loc[idx2]
            
            dist = haversine_distance(row1['lat'], row1['lon'], 
                                     row2['lat'], row2['lon'])
            
            # If within 25 cm (0.25 meters)
            if dist <= 0.25:
                # Check if same scientific name
                if row1['scientific_name'] == row2['scientific_name']:
                    # Keep one randomly
                    to_remove.add(np.random.choice([idx1, idx2]))
                else:
                    # Discard both
                    to_remove.add(idx1)
                    to_remove.add(idx2)

# Remove duplicates
df_clean = df.drop(index=list(to_remove))
print(f"After removing duplicates: {len(df_clean)}")

# Save cleaned data
df_clean.to_csv('data_cleaned.csv', index=False)
print(f"\nCleaned data saved to 'data_cleaned.csv'")
print(f"Final number of trees: {len(df_clean)}")
print(f"Total trees removed: {initial_count - len(df_clean)}")
Initial number of trees: 19164
After removing trees without GPS: 18354
After removing trees outside bounding box: 18059
After removing trees with implausible age: 18057
After removing trees with implausible trunk circumference: 11111
After removing duplicates: 10722

Cleaned data saved to 'data_cleaned.csv'
Final number of trees: 10722
Total trees removed: 8442

Summary

The script first parses the GPS location field (which contains “lat, lon” as a single string) into separate latitude and longitude columns.
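
For example, assuming a row whose GPS field contains the (hypothetical) string "17.1801, 78.2205", the parse_gps helper defined in the code above behaves as follows:

```python
lat, lon = parse_gps("17.1801, 78.2205")   # -> (17.1801, 78.2205)
# Missing or malformed fields return (None, None), e.g.:
parse_gps(None)    # -> (None, None)
parse_gps("n/a")   # -> (None, None); such rows are later removed by dropna()
```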

The cleaning process then removes trees based on the following criteria:

  1. Missing GPS coordinates - Trees without location data or with unparseable GPS strings
  2. Geographic bounds - Trees outside the specified bounding box
  3. Invalid age - Trees with age < 0 or > 99 years
  4. Invalid trunk circumference - Trees with missing values or circumference < 5cm or > 300cm
  5. Duplicates - Trees within 25cm of each other, handled based on species match

The duplicate detection uses an efficient spatial grid algorithm (1m × 1m cells) to avoid checking all possible pairs, making the algorithm scale well with large datasets.
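
To see why checking only the 3×3 block of neighboring cells is sufficient: with 1 m cells and a 0.25 m duplicate threshold, two points closer than 25 cm can differ by at most 0.25 m along each axis, so their grid indices differ by at most 1. The short sketch below (reusing the helper functions defined in the code above, with made-up coordinates) illustrates this:

```python
# Two hypothetical points roughly 20 cm apart in latitude
p1 = (17.1800000, 78.2200000)
p2 = (17.1800018, 78.2200000)

print(round(haversine_distance(*p1, *p2), 2))   # ~0.2 m, below the 25 cm threshold

c1 = lat_lon_to_grid(*p1)
c2 = lat_lon_to_grid(*p2)
# Because the points are within 25 cm, their cell indices differ by at most 1,
# so c2 is always among the 9 cells returned by get_neighboring_cells(c1).
print(c2 in get_neighboring_cells(c1))          # True
```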

The cleaned data is saved to data_cleaned.csv.