---
title: "Using a meta-meta-prompt to create a self-documenting Python script for data cleaning"
author:
  - name: "S. Keshav"
    affiliation: "University of Cambridge"
    email: "sk818@cam.ac.uk"
date: "October 21, 2025"
format:
  html:
    embed-resources: true
    code-fold: true
    code-tools: true
  pdf:
    toc: true
    number-sections: true
engine: jupyter
---
This document (written by AI after the point indicated below) shows how one can use a meta-meta-prompt
to generate a meta-prompt for a task such that the task is self-documenting; that is,
the AI documents both the prompt used to generate the task code and the code itself.
This may sound confusing, so let's break it down.
1. The task is to generate Python code to clean a data file called `data.csv`. The Python code is inserted into a Quarto markdown document as executable Python, so that when the document is rendered, the code is automatically run and the results are placed in the document.
2. We therefore need a prompt that specifies the task and also asks the AI to document both the prompt and the output of the task in a Quarto markdown document.
3. The AI is asked how to create such a prompt.
4. Since the prompt we want is itself a meta-prompt, the prompt asking the AI to create it is a meta-meta-prompt.
## Meta-Meta-Prompt
::: {.callout-note}
## Meta-Meta-Prompt
I would like your help with a complicated bit of thinking. What prompt can I give you so that you produce not only some output, but also a Quarto markdown document that uses a callout to state the prompt and a 'folded' code block to state your response? For example, if I gave you the prompt 'write a python script to print "hello world"', then your output should be 1) the python script and 2) a Quarto markdown document that gives this prompt and your response.
:::
### Response to Meta-Meta-Prompt
I can help you create a meta-prompt for this! Here's what you could use:
**Prompt to give me:**
"Please complete the following task AND create a Quarto markdown document that records both the prompt and your response.
Task: [YOUR TASK HERE]
The Quarto document should include:
1. A callout block containing this prompt
2. A folded code block containing your complete response
Please provide both the direct answer and the Quarto markdown document as separate artifacts."
Would you like me to demonstrate this with a specific example, or would you like to refine the meta-prompt structure further?
## Meta-Prompt
::: {.callout-note}
Please complete the following task AND create a Quarto markdown document with executable code blocks.
I would like you to write python code to clean data in a CSV file called data.csv. The fields in the file are in the order scientific name, GPS location (lat, long), Age, trunk circumference, height.
Please remove:

- Any trees lying outside the bounding box 17.186736, 78.208403, 17.168470, 78.233911
- Any trees without a GPS location
- Any trees whose age was marked with an implausible number (negative or more than 99 years)
- Any trees with a trunk circumference that was missing, implausibly small (< 5 cm), or implausibly big (> 300 cm)
- Duplicate trees: if two trees are closer than 25 cm to each other, they should be marked as duplicates. If they share the same scientific name, one is kept (chosen at random); if not, both are discarded.

To find duplicates, compute the Haversine distance between trees. Make the algorithm efficient by placing trees into a 1 m grid and checking only nearby grid cells when looking for duplicates.
The Quarto document should include:
1. A callout block (callout-note) containing this prompt
2. An executable code block (using `{python}` or appropriate language) with:
* `#| code-fold: true` to make it collapsible
* `#| code-summary: "Show code"` for the fold label
* Your complete code solution
When I render this document with `quarto render`, Quarto will execute the code and display the output automatically.
:::
***Everything after this line was generated by AI***
## Task Description
::: {.callout-note}
## Data Cleaning Task
I would like you to write python code to clean data in a CSV file called data.csv. The fields in the file are in the order scientific name, GPS location (lat, long), Age, trunk circumference, height.
Please remove:

- Any trees lying outside the bounding box 17.186736, 78.208403, 17.168470, 78.233911
- Any trees without a GPS location
- Any trees whose age was marked with an implausible number (negative or more than 99 years)
- Any trees with a trunk circumference that was missing, implausibly small (< 5 cm), or implausibly big (> 300 cm)
- Duplicate trees: if two trees are closer than 25 cm to each other, they should be marked as duplicates. If they share the same scientific name, one is kept (chosen at random); if not, both are discarded.

To find duplicates, compute the Haversine distance between trees. Make the algorithm efficient by placing trees into a 1 m grid and checking only nearby grid cells when looking for duplicates.
:::
## Data Cleaning Implementation
```{python}
#| code-fold: true
#| code-summary: "Show code"
import pandas as pd
import numpy as np
from math import radians, sin, cos, sqrt, atan2
from collections import defaultdict
def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the Haversine distance between two points in meters.

    Parameters
    ----------
    lat1, lon1 : float
        Latitude and longitude of the first point in degrees
    lat2, lon2 : float
        Latitude and longitude of the second point in degrees

    Returns
    -------
    float
        Distance between the points in meters
    """
    R = 6371000  # Earth's radius in meters
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c
def lat_lon_to_grid(lat, lon, grid_size=1.0):
    """
    Convert lat/lon to grid coordinates based on grid_size in meters.

    Parameters
    ----------
    lat, lon : float
        Latitude and longitude in degrees
    grid_size : float
        Size of a grid cell in meters (default: 1.0)

    Returns
    -------
    tuple
        (grid_lat, grid_lon) grid coordinates
    """
    # Approximate meters per degree at the given latitude
    meters_per_degree_lat = 111320
    meters_per_degree_lon = 111320 * cos(radians(lat))
    grid_lat = int(lat * meters_per_degree_lat / grid_size)
    grid_lon = int(lon * meters_per_degree_lon / grid_size)
    return (grid_lat, grid_lon)
def get_neighboring_cells(grid_cell):
    """
    Get the 9 neighboring cells (including the cell itself).

    Parameters
    ----------
    grid_cell : tuple
        (grid_lat, grid_lon) coordinates

    Returns
    -------
    list
        List of neighboring cell coordinates
    """
    grid_lat, grid_lon = grid_cell
    neighbors = []
    for dlat in [-1, 0, 1]:
        for dlon in [-1, 0, 1]:
            neighbors.append((grid_lat + dlat, grid_lon + dlon))
    return neighbors
def parse_gps(gps_string):
    """
    Parse a GPS location string into (lat, lon).

    Parameters
    ----------
    gps_string : str
        GPS location in the format "lat, lon"

    Returns
    -------
    tuple
        (lat, lon), or (None, None) if parsing fails
    """
    if pd.isna(gps_string):
        return None, None
    try:
        parts = str(gps_string).split(',')
        if len(parts) == 2:
            lat = float(parts[0].strip())
            lon = float(parts[1].strip())
            return lat, lon
    except ValueError:
        pass
    return None, None
# Read the CSV file
df = pd.read_csv('data.csv', header=None,
                 names=['scientific_name', 'gps_location', 'age',
                        'trunk_circumference', 'height'])
initial_count = len(df)
print(f"Initial number of trees: {initial_count}")
# Parse GPS location into separate lat and lon columns
df[['lat', 'lon']] = df['gps_location'].apply(lambda x: pd.Series(parse_gps(x)))
# Convert numeric columns to numeric types, coercing errors to NaN
df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
df['lon'] = pd.to_numeric(df['lon'], errors='coerce')
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['trunk_circumference'] = pd.to_numeric(df['trunk_circumference'], errors='coerce')
df['height'] = pd.to_numeric(df['height'], errors='coerce')
# Step 1: Remove trees without GPS location
df = df.dropna(subset=['lat', 'lon'])
print(f"After removing trees without GPS: {len(df)}")
# Step 2: Remove trees outside bounding box
# Bounding box coordinates: lat_min, lon_min, lat_max, lon_max
lat_min, lat_max = 17.168470, 17.186736
lon_min, lon_max = 78.208403, 78.233911
df = df[(df['lat'] >= lat_min) & (df['lat'] <= lat_max) &
        (df['lon'] >= lon_min) & (df['lon'] <= lon_max)]
print(f"After removing trees outside bounding box: {len(df)}")
# Step 3: Remove trees with implausible (or missing) age;
# a missing age fails both range checks and is dropped as well
df = df[(df['age'] >= 0) & (df['age'] <= 99)]
print(f"After removing trees with implausible age: {len(df)}")
# Step 4: Remove trees with missing or implausible trunk circumference
df = df[df['trunk_circumference'].notna()]
df = df[(df['trunk_circumference'] >= 5) & (df['trunk_circumference'] <= 300)]
print(f"After removing trees with implausible trunk circumference: {len(df)}")
# Step 5: Remove duplicate trees using grid-based approach
# Reset index for easier manipulation
df = df.reset_index(drop=True)
# Create spatial grid (1m cells)
grid = defaultdict(list)
for idx, row in df.iterrows():
    cell = lat_lon_to_grid(row['lat'], row['lon'])
    grid[cell].append(idx)
# Track which trees to remove
to_remove = set()
# Check for duplicates within 25 cm
for cell, indices in grid.items():
    # Collect all indices in the 3x3 block of neighboring cells
    candidate_indices = []
    for neighbor_cell in get_neighboring_cells(cell):
        if neighbor_cell in grid:
            candidate_indices.extend(grid[neighbor_cell])
    # Remove duplicates from the candidate list
    candidate_indices = list(set(candidate_indices))
    # Check each pair exactly once (only idx2 > idx1)
    for idx1 in indices:
        if idx1 in to_remove:
            continue
        for idx2 in candidate_indices:
            if idx2 <= idx1 or idx2 in to_remove:
                continue
            row1 = df.loc[idx1]
            row2 = df.loc[idx2]
            dist = haversine_distance(row1['lat'], row1['lon'],
                                      row2['lat'], row2['lon'])
            # If within 25 cm (0.25 meters), treat the pair as duplicates
            if dist <= 0.25:
                if row1['scientific_name'] == row2['scientific_name']:
                    # Same species: keep one, chosen at random
                    to_remove.add(np.random.choice([idx1, idx2]))
                else:
                    # Different species: discard both
                    to_remove.add(idx1)
                    to_remove.add(idx2)
# Remove duplicates
df_clean = df.drop(index=list(to_remove))
print(f"After removing duplicates: {len(df_clean)}")
# Save cleaned data
df_clean.to_csv('data_cleaned.csv', index=False)
print(f"\nCleaned data saved to 'data_cleaned.csv'")
print(f"Final number of trees: {len(df_clean)}")
print(f"Total trees removed: {initial_count - len(df_clean)}")
```
## Summary
The script first parses the GPS location field (which contains "lat, lon" as a single string) into separate latitude and longitude columns.
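As an aside, the same split can also be done with vectorised pandas string methods; here is a minimal sketch on a hypothetical two-row frame (not the real `data.csv`):

```python
import pandas as pd

# Hypothetical sample in the same "lat, lon" string format as data.csv
df = pd.DataFrame({"gps_location": ["17.1750, 78.2200", None]})

# Split on the comma and coerce each half to float; missing or
# unparseable entries become NaN instead of raising an error
parts = df["gps_location"].str.split(",", expand=True)
df["lat"] = pd.to_numeric(parts[0], errors="coerce")
df["lon"] = pd.to_numeric(parts[1], errors="coerce")
```

Rows whose `lat` or `lon` ends up as NaN can then be dropped with `dropna`, exactly as the script above does.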
The cleaning process then removes trees based on the following criteria:
1. **Missing GPS coordinates** - Trees without location data or with unparseable GPS strings
2. **Geographic bounds** - Trees outside the specified bounding box
3. **Invalid age** - Trees with age < 0, age > 99 years, or a missing age (a missing value fails both range checks)
4. **Invalid trunk circumference** - Trees with missing values or circumference < 5cm or > 300cm
5. **Duplicates** - Trees within 25cm of each other, handled based on species match
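To make the 25 cm threshold concrete, a small self-contained check (with made-up coordinates and a local copy of the Haversine helper) shows that points about 10 cm apart fall under the duplicate threshold while points 1 m apart do not:

```python
from math import radians, sin, cos, sqrt, atan2

def haversine_m(lat1, lon1, lat2, lon2):
    """Haversine distance between two points, in meters."""
    R = 6371000
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

# One degree of latitude is ~111,195 m, so 0.10 / 111195 degrees is ~10 cm
base = (17.1750, 78.2200)
near = (17.1750 + 0.10 / 111195, 78.2200)  # ~10 cm to the north
far  = (17.1750 + 1.00 / 111195, 78.2200)  # ~1 m to the north

assert haversine_m(*base, *near) <= 0.25  # flagged as a duplicate pair
assert haversine_m(*base, *far) > 0.25    # kept as distinct trees
```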
The duplicate detection uses an efficient spatial grid algorithm (1m × 1m cells) to avoid checking all possible pairs, making the algorithm scale well with large datasets.
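The grid trick can be sketched in isolation. With synthetic points and a hypothetical `cell_of` helper (mirroring `lat_lon_to_grid` above), each tree is hashed into an integer 1 m cell, and only the 3 × 3 block of neighboring cells needs to be scanned for candidate duplicates:

```python
from collections import defaultdict
from math import cos, radians

def cell_of(lat, lon, size_m=1.0):
    """Hash a lat/lon pair into an integer grid cell of side size_m meters."""
    m_per_deg_lat = 111320
    m_per_deg_lon = 111320 * cos(radians(lat))
    return (int(lat * m_per_deg_lat / size_m), int(lon * m_per_deg_lon / size_m))

# Synthetic points: two roughly 1 cm apart, one ~111 m away
points = [(17.1750000, 78.2200), (17.1750001, 78.2200), (17.1760000, 78.2200)]
grid = defaultdict(list)
for i, (lat, lon) in enumerate(points):
    grid[cell_of(lat, lon)].append(i)

# Candidates near point 0: only indices found in the 3x3 neighborhood
r, c = cell_of(*points[0])
candidates = {j for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              for j in grid.get((r + dr, c + dc), [])}
assert candidates == {0, 1}  # the distant point never needs a distance check
```

Only the nearby pair is ever passed to the Haversine function, which is what makes the approach scale roughly linearly in the number of trees.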
The cleaned data is saved to `data_cleaned.csv`.