
About Sleuth’s Data Validation

Some Principles for Collecting and Classifying Observation

  1. We should move toward a neutral scale (intensity) and away from a negative scale (severity) to capture parents’ observations.
  2. We should augment traditional human labeling with automated LLM-based labels for data organization.

How do we identify related topics in data?

In order to make Sleuth useful, we need to identify when parents might be referring to the same topic with slightly different wording. This classification helps us to:

  • Group parents’ existing comments by theme for reading and analysis
  • Provide search tools inside our applications
  • Offer a non-clinical but very useful automatic labeling system to Sleuth’s users.

We use two methods for doing this, one old school and one high tech, which we’ll describe here:

Method #1: Human labeling

Our first technique, qualitative coding, was developed for grounded theory in sociology and is also used in design thinking and machine learning. We have people read each distinct behavior, symptom, skill, or event that a parent has written about and assign that text snippet to a group of observationally similar items that other parents have written. This is a labor-intensive process. As we read each item, we either assign it to an existing grouping or create a new grouping.

To make this work, it is important to classify every single distinct item that someone enters. Skipping an item introduces biases. Sleuth uses specialized techniques to encourage parents to provide distinct items that we can classify without too much noise.

The goal of manual labeling is to create groups of text that are as small and precise as possible, where all elements are identical in meaning. Then larger groups are created by combining two smaller groups. This classic method is described by Van Maanen, Miles and Huberman, and many others. We want the groupings to be “mutually exclusive and collectively exhaustive” (Minto). There should be little overlap in meaning between two different groupings, and the set of all groupings should efficiently cover all observations parents have made.

It is important to note that this process does not use clinical judgment. We are not interested in whether two symptoms are related to some larger condition based on clinical knowledge, or whether one symptom might be caused by another symptom. We are just noting whether two observations parents have made in non-technical language could be referring to the exact same experience. It is actually important to avoid using external knowledge at this stage, so that we can make sure we can identify patterns best reflected in parents’ statements.

Once we have created groups of text, we use these as examples (i.e. “training data”) to train standard machine learning methods to label new text.
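As a sketch of that last step, a minimal nearest-centroid classifier can assign a new snippet to an existing grouping. The snippets, grouping labels, and helper names below are invented for illustration; Sleuth’s production pipeline presumably uses more sophisticated models than this bag-of-words sketch.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Lowercase and split a snippet into a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical hand-labeled groupings (the "training data").
labeled = {
    "trouble sleeping": ["wakes up at night", "trouble falling asleep"],
    "high energy": ["very high energy", "difficulty sitting still"],
}

# Collapse each grouping into a single centroid-like count vector.
centroids = {
    label: sum((bag_of_words(t) for t in texts), Counter())
    for label, texts in labeled.items()
}

def predict(text):
    """Assign a new snippet to the closest existing grouping."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))

print(predict("difficulty sitting still in class"))  # → high energy
```

A real pipeline would use richer features than raw word counts, but the structure is the same: labeled groups in, a label for each new snippet out.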

Method #2: Clustering based on LLMs

Until 2022, there was no great alternative to qualitative coding. For instance, in 2021, the best LLM-based models had trouble recognizing that “difficulty sitting still” and “very high energy” were related descriptions of kids’ behavior. “Difficulty sitting” was a distinctive phrase that could reflect other circumstances such as back pain or an uncomfortable chair. The contextual understanding of these phrases was missing.

However, the emergence of better LLMs in the last 12 months has offered a second approach, one that we find works better when paired with human oversight. OpenAI’s language models allow us to relate and measure the association of “difficulty sitting still” with “very high energy” within Sleuth.

At the time of our analysis, we used OpenAI to represent any word or phrase by a vector of 1,536 numbers (see more info here), such as “[-0.01697835698723793, -0.005800504703074694, -0.001661704620346427,..., -0.023514503613114357]”. Each number in the vector takes on a value that represents the strength of an element of the original word or phrase. The distance between two different phrases in this 1,536-dimensional space is a representation of the actual difference in meaning between those two phrases based on qualities of the text (specific emotions, gender, quantity, intensity, temperature, etc.).
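The standard way to measure the distance between two embedding vectors is cosine similarity. The sketch below uses invented 4-number vectors in place of the real 1,536-number embeddings (which would come from OpenAI’s embedding endpoint); the variable names and values are illustrative only:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 means very similar meaning, near 0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 4-number stand-ins for the real 1,536-number embeddings.
sitting_still = [0.91, 0.10, -0.30, 0.05]
high_energy   = [0.88, 0.15, -0.25, 0.02]
back_pain     = [-0.20, 0.80, 0.40, 0.30]

print(cosine_similarity(sitting_still, high_energy))  # close to 1.0
print(cosine_similarity(sitting_still, back_pain))    # much lower
```

With real embeddings, “difficulty sitting still” and “very high energy” would score much closer to each other than either does to “back pain”, which is exactly the contextual understanding the older models lacked.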

A famous example of this mathematical representation is the statement “king - man + woman = queen” or “paris - france + poland = warsaw”. The numerical representation of text allows a mathematical interpretation of meaning.

Once we represent parents’ observations through this method, we can use clustering techniques in machine learning to group symptoms. Phrases with similar meaning are located closer together.

The clustering technique Sleuth uses, agglomerative clustering, works in a manner very similar to the qualitative method: it evaluates the distances between each of the phrases parents wrote and iteratively assigns them to larger and larger groups. These clusters can also be used to automatically assign a label to all of the items in a group, for instance by picking the label closest to the center of the group.
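To illustrate the mechanics, here is a minimal single-linkage agglomerative clusterer over 1-D numbers standing in for embedding vectors. This is a sketch of the merge-until-too-far loop only; Sleuth’s actual pipeline presumably relies on a library implementation, and the points and threshold below are invented:

```python
def agglomerative_cluster(points, stop_distance):
    """Greedy single-linkage clustering: repeatedly merge the two closest
    clusters until the closest remaining pair is farther apart than
    stop_distance."""
    clusters = [[p] for p in points]  # start with every phrase on its own

    def gap(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        if gap(clusters[i], clusters[j]) > stop_distance:
            break  # everything left is too far apart to merge
        clusters[i] += clusters.pop(j)
    return clusters

# 1-D stand-ins for embeddings: two tight groups plus one outlier.
print(agglomerative_cluster([0.1, 0.2, 0.9, 1.0, 5.0], stop_distance=0.5))
# → [[0.1, 0.2], [0.9, 1.0], [5.0]]
```

Real embeddings live in 1,536 dimensions rather than one, but the loop is identical: find the closest pair, merge, repeat.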

Using these two techniques, we built a giant data table with content like this:

Table shows two different techniques: OpenAI clustering and manual clustering

The usefulness of a cluster still comes down to human judgment. We cleaned and merged human-produced and OpenAI-produced clusters, but we found some groups that only OpenAI identified, or identified best.

While this process works using purely non-technical language, once we had our groupings, we still asked a pediatric adviser to check that we didn’t make any clear errors by equating two observations a clinical expert would recognize as actually being different. This led to some interesting discussions. How much overlap is there between a “stomach ache” and “stomach pain” in the ways they are used by parents?

Capturing developmental milestones

For milestones, we have started with a pre-existing list. Popular resources to help parents and experts assess development include Ages and Stages, the Survey of Well-Being of Young Children, a checklist from the Centers for Disease Control and Prevention, PEDS, Bright Futures, Zero to Three and many others.

Sleuth derived its milestones from the Survey of Well-Being of Young Children (SWYC). See here for a public overview of SWYC and here for the research [1]. References to the research on SWYC include:

  • Sheldrick RC, Marakovitz S, Garfinkel D, Carter AS, Perrin EC. Comparative Accuracy of Developmental Screening Questionnaires. JAMA Pediatr. 2020 Feb 17. PMID: 32065615.
  • Sheldrick RC, Schlichting LE, Berger B, Clyne A, Ni P, Perrin EC, Vivier PM. Establishing New Norms for Developmental Milestones. Pediatrics. 2019 12; 144(6). PMID: 31727860.
  • Sheldrick RC, Garfinkel D. Is a Positive Developmental-Behavioral Screening Score Sufficient to Justify Referral? A Review of Evidence and Theory. Acad Pediatr. 2017 Jul; 17(5):464-470. PMID: 28286136.

We do not use SWYC’s surveys, Likert scales, or scoring, and we strongly encourage parents to refer to SWYC’s peer-reviewed resources and seek professional advice if they have concerns about a child’s development.

How do we design questions in Sleuth? (item response design)

The fastest way for parents to capture data is with a single click on a multiple choice question. This reduces the time required to save an observation from twenty seconds of writing to one second of (thoughtful!) clicking. This also increases our ability to systematically relate parents’ observations, despite their subjectivity.

We also want to provide parents space to create thorough notes, but part of Sleuth’s project is to test the usefulness of easy-to-answer multiple choice questions. Then, we can direct parents’ attention to topics they might want to give more in-depth consideration.

To make Sleuth fast and easy to use, we standardize our survey-style questions. This is actually an unusual project. In psychology and medicine, it is typical for every survey to have a slightly different format. (See here for extensive links to screening tools.) Sometimes, one scale uses different measures for each question, as in this scale to measure drooling:

The drooling impact scale

REID, S.M., JOHNSON, H.M. and REDDIHOUGH, D.S. (2010), The Drooling Impact Scale: a measure of the impact of drooling in children with developmental disabilities. Developmental Medicine & Child Neurology, 52: e23-e28. https://doi.org/10.1111/j.1469-8749.2009.03519.x

Having distinctive measures (Constantly? Profusely? Offensively? in the scale above) and more items in the scale (10 options) can improve the responses’ accuracy, at least when people are paying close attention. But these complex scales also require extra effort to answer. When the effort becomes too exhausting, the quality of people’s answers drops, or they stop answering the questions altogether.

Sleuth’s goal is to create an environment where frequent and ongoing responses to survey questions are fun and easy. As a result, we prefer to have standard scales that would work for all of the 285 topics that we currently cover and any ways they might be mixed and matched.

We could capture a variety of data…

Describing behavior and symptoms

Sleuth’s tracker covers many of these, including location:

Sleuth App

(Symptom location in the Sleuth app)

But for our surveys, we need a simplified scale. In our case, severity is a bad measure for many topics parents mention. It requires every topic to be viewed in a negative light. We prefer to ask, “How strong is your child’s alertness?” not “How severe is your child’s lack of alertness?” Most kids are alert, so asking about the strength of a child’s alertness is a more natural question and results in more accurate answers.

We compromise and use severity to measure traditionally negative topics such as acne (“How severe is your child’s acne?”), and we use strength as our measure of most neutral or positive behaviors and milestones.

This lets us cover 285 topics with just two variations on the same scale.

Sleuth app

The option Not At All is a little awkward. Sometimes, None or some other null value is more meaningful. However, we prefer to standardize, and we find that parents quickly grasp what Not At All means.

Similarly, we use a five-item scale. This is very standard. Research shows a small improvement with seven items over five, but also recognizes the difficulty people have answering seven-item questions.

For frequency, we steer clear of subjective measures of frequency (“a lot” vs. “a little”) and favor the precision of absolute numbers: how many times per minute, hour, day, week, month, or year?

Sleuth App

How reliable is Sleuth?

We check that Sleuth’s data is reliable through observations and testing. Here are some of these steps:

Sleuth’s data on developmental milestones is similar to published research

Unlike our behavioral data, which is unique, we can compare Sleuth’s data on milestones against academic research. The Survey of Well-Being of Young Children uses a 3-point scale (“Not Yet”, “Somewhat”, “Very Much”) while Sleuth uses a 5-point scale (“Not At All”, “Mild”, “Moderate”, “Strong”, “Very Strong”). However, we both have models that predict the age when 50% of kids will “somewhat” perform (SWYC) or have “moderate” skill at (Sleuth) the same milestones. Here are those results, with Sleuth’s slightly different variant of the milestone wording on the left:

| Milestone | SWYC: child age (in years) at which 50% of parents are expected to report that children at least "somewhat" pass the milestone | Sleuth: child age (in years) at which 50% of parents are expected to report "moderate" skill at the milestone | Difference (in months) between expected age in SWYC and Sleuth measures |
|---|---|---|---|
| Following a moving toy with the eyes | 0.1 | 0 | 1 |
| Holding head steady without support | 0.1 | 0.1 | 0 |
| Keeping head steady when held upright | 0.1 | 0.15 | 1 |
| Bringing hands together | 0.1 | 0.1 | 0 |
| Laughing | 0.1 | 0.15 | 0 |
| Looking when called by name | 0.2 | 0.15 | 0 |
| Looking for you when upset | 0.2 | 0 | 2 |
| Moving things from one hand to the other | 0.2 | 0.5 | 3 |
| Rolling over from tummy to back | 0.3 | 0.3 | 0 |
| Saying: "mamama" or "babababa" | 0.3 | 0.5 | 3 |
| Copying sounds you make | 0.3 | 0.4 | 1 |
| Banging two objects together | 0.4 | 0.5 | 1 |
| Holding up arms to be picked up | 0.5 | 0.55 | 1 |
| Playing games like "peek-a-boo" | 0.5 | 0.35 | 2 |
| Picking up food and eating it | 0.5 | 0.6 | 1 |
| Saying: "mama" or "dada" for parent | 0.5 | 0.55 | 0 |
| Sitting up independently | 0.6 | 0.5 | 1 |
| Pulling up to stand | 0.6 | 0.6 | 1 |
| Following directions like "give me..." | 0.7 | 0.95 | 3 |
| Kicking a ball | 1.1 | 1.25 | 2 |
| Running | 1.1 | 0.95 | 2 |
| Climbing up a ladder at playground | 1.2 | 1.75 | 7 |
| Naming at least 5 familiar objects | 1.2 | 1.05 | 2 |
| Jumping off the ground with two feet | 1.3 | 1.75 | 5 |
| Naming at least 5 body parts | 1.4 | 1.75 | 5 |
| Putting two words together like "more water" | 1.5 | 1.05 | 5 |
| Asking "Why?" or "How?" questions | 1.6 | 2.25 | 8 |
| Naming at least one color | 1.6 | 1.75 | 2 |
| Saying own name when asked | 1.8 | 1.8 | 1 |
| Explaining the reasons for things | 1.9 | 1.8 | 2 |
| Comparing things ("bigger"/"shorter") | 1.9 | 1.75 | 2 |
| Answering "When...?" or "Why...?" questions | 2.0 | 1.8 | 2 |
| Drawing simple shapes like a circle or square | 2.0 | 1.8 | 3 |
| Following simple rules for games | 2.2 | 1.75 | 6 |
| Drawing pictures you recognize | 2.7 | 2.9 | 2 |
| Naming the days of week in order | 3.2 | 3.55 | 4 |
| Writing own first name | 3.2 | 4.05 | 10 |

(These SWYC norms are in Supplemental Material in: Sheldrick RC, Schlichting LE, Berger B, Clyne A, Ni P, Perrin EC, Vivier PM. Establishing New Norms for Developmental Milestones. Pediatrics. 2019 12; 144(6). PMID: 31727860.)

These results are close, especially considering that SWYC and Sleuth use different scales and different sample populations. The data used to evaluate SWYC comes from more than 40,000 screens in pediatric practices in Minnesota, Rhode Island, and Massachusetts. Sleuth’s data comes from the stratified sample described in our Methodology section.

In the areas of greatest disagreement, we cautiously note that Sleuth’s data about timelines seems closer to widespread observations. We think 1.6 years old is slightly early to expect kids to ask “Why?” or “How?” questions, for example. Similarly, 3.2 years old seems early for many kids to write their own name.

The similarity of Sleuth and SWYC results is reassuring for any concerns about how the COVID pandemic may have influenced these measures: the milestone norms for SWYC in the table above were collected between 2014 and 2017, while Sleuth’s matching data was collected in May of 2023. The similarity in our results shows that for these foundational topics (and most topics in Sleuth), the big picture may not have been influenced that dramatically by COVID. That said, we think ongoing data collection should capture important population-level changes in kids’ health and development reflecting the pandemic and many other important trends. Some of Sleuth’s data will certainly reflect effects from COVID.

Sleuth’s probabilistic models closely fit observed data

According to Sleuth’s statistics, we see a 46% chance that parents will say that a child 3 years and 18 days old has moderate to very severe tantrums or meltdowns. But how accurate is this estimate?

The main challenge for assessing error (and noise) is that most of Sleuth’s models don’t make a single prediction at a specific age. We always describe the likelihood of five cases: “Not At All”, “Mild”, “Moderate”, “Severe” (or “Strong”), and “Very Severe” (or “Very Strong”) or specific frequencies of events. We have to evaluate our performance at predicting the distribution of outcomes rather than one binary (yes/no) event.

To handle this, we look for the average difference between our model’s predictions and the sample data we have with parents’ actual observations for kids around each year of age. (The decision to group by year rather than month or some other bucket is a trade-off we can discuss elsewhere.)

This approach is similar to the method of comparing cumulative probability distributions in a Kolmogorov–Smirnov test, except Sleuth is interested in the mean absolute distance between the red and blue lines (rather than the largest distance as shown below, and with “child age” on the X axis from 1 to 17 years old):

Kolmogorov-Smirnov test

From Wikipedia, Kolmogorov-Smirnov test
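The average-gap measure described above can be sketched as follows, with invented “predicted” and “observed” samples and a grid of child ages from 1 to 17 (the function names are illustrative, not Sleuth’s actual code):

```python
def empirical_cdf(sample, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def mean_cdf_gap(predicted, observed, grid):
    """Average absolute gap between two cumulative distributions evaluated
    on a shared grid (e.g., child ages 1..17). A Kolmogorov-Smirnov test
    would instead take the maximum gap."""
    gaps = [abs(empirical_cdf(predicted, x) - empirical_cdf(observed, x))
            for x in grid]
    return sum(gaps) / len(gaps)

# Invented example: model-predicted vs. observed ages (in years).
predicted = [2, 3, 3, 4, 5, 6, 8, 10]
observed  = [2, 3, 4, 4, 5, 7, 9, 10]
print(round(mean_cdf_gap(predicted, observed, grid=range(1, 18)), 3))  # → 0.022
```

Averaging rather than taking the maximum rewards a model that tracks the whole distribution closely, instead of penalizing it for a single worst point.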

When we look at this gap for every prediction at, approximately, every year of age, we see this average difference between observed and predicted percentiles:

Average difference in % between observed data and predictions

This chart is a histogram with the frequency of the average “error” for each topic at all ages, from data that looks like this:

Table of topics and errors

(results for the first 7 topics)

The average of these averages is reasonably low: 4.11% average error across all topics and ages. If our model were just making random guesses about Sleuth’s data, the average error would be 28.6%.

When we have more data, we will also do an “out-of-sample” test of the same result: building our models with part of the data and then measuring error by comparing the models against “new” unobserved data. This extra test would require about twice as much data in Sleuth, but we expect similar average error from an out-of-sample test.

Sleuth’s data on symptom clusters matches published research

We have systematically collected data from children who have been diagnosed by medical professionals with 5 conditions: (1) attention-deficit hyperactivity disorder (ADHD), (2) autism spectrum disorder, (3) delayed speech development, (4) anxiety disorders, and (5) depression.

Although Sleuth’s data collection was not performed with reference to existing research, our data should correspond closely to findings in peer-reviewed articles. This is what we see.

For comparison to the research, we use clusters to see which symptoms commonly go together in parents’ observations and to observe larger patterns in parents’ reports.

Attention-deficit hyperactivity disorder

Here are the clusters of the most common symptoms parents mention for kids diagnosed with ADHD:

Hierarchical Clustering Dendrogram

In these charts, symptoms connected more closely in the branches of the sideways tree (closer to the left) are more commonly observed to be “severe” at the same time.

In blue, we see clusters that fit the traditional diagnostic criteria of “hyperactivity” and “difficulty paying attention” in kids with ADHD. Each of the other symptoms and groupings, however, also matches the symptomatology described in the research. (See also here, here, and here.)

In general, we notice that parents place more emphasis on secondary effects (“sequelae”) of the clinical presentation: anxiety and forgetfulness, for example, in kids with ADHD. It is interesting that “forgetfulness” is more closely linked to “anxiety” and “insomnia” than it is to “difficulty paying attention,” suggesting a role of ADHD-driven stress in the more casual observation of forgetfulness.

If we compare the average “intensity” of symptoms, we also see a clear pattern of differences between kids with and without a diagnosis of ADHD in Sleuth’s data:

Data Comparison

This is based on 776 kids with an ADHD diagnosis and 512 kids without an ADHD diagnosis, but whose parents answered the same questions.

Autism spectrum disorder
Hierarchical Clustering Dendrogram

Parents’ observations about kids diagnosed with autism, again, fit the diagnostic criteria: limited eye contact and responsiveness, delayed speech development, sensory sensitivity, and stylized, repetitive movement (see here, here, and here).

Anxiety disorders
Hierarchical Clustering Dendrogram

Because of the variety of forms anxiety takes, from social anxiety to phobias to stress presenting as panic attacks, the symptoms of kids diagnosed with anxiety disorders seem more varied. For kids with anxiety, comparison to the diagnosed (and general) populations may help parents observe indications of specialized disorders in addition to, or distinct from, a generalized anxiety disorder.

As with other conditions, parents of kids with anxiety also mention these secondary symptoms and behaviors: nail biting, headaches, difficulty concentrating, and irritability.

Delayed speech and language
Hierarchical Clustering Dendrogram

Delayed speech is an instance where Sleuth diverges from the diagnostic criteria, which might also use early vocabulary assessment tools like the MacArthur-Bates Communicative Development Inventories for screening. (One of the CDI’s authors has, in fact, encouraged Sleuth to provide the short-form MacArthur-Bates CDI. We would love to do this in the future.)

However, this is another case where parents’ observations might be useful to capture at-home signals such as out-of-ordinary levels of frustration with difficulty communicating.

Among these symptoms, stuttering was not differentially associated with speech delays when kids with a diagnosis of delays were compared to the general population of kids on Sleuth. This irrelevance of stuttering for identifying delays is also supported by research.

Depression
Hierarchical Clustering Dendrogram

Once again, the symptoms of depression that parents report to Sleuth very closely match the peer-reviewed research. We see low energy and malaise, sadness and anxiety, feelings of hopelessness, and more extreme self-harm and suicidal thoughts.

The prominence of, and sharp increase in, depression among teenagers in the U.S. make these symptoms an important topic for coverage in Sleuth. We also see in parents’ observations the patterns widely described in the research: increased isolation (“keeping alone (keeping to self)”) seeming to lead directly to sadness and anxiety in the same cluster at the top of this image.

Would models based on Sleuth’s data accurately identify conditions?

The clean fit of Sleuth’s data with previous research raises the question: could Sleuth’s data accurately identify conditions? Our goal for Sleuth is not to perform this function of clinical screening, assessment, or detection. However, it is important to us to understand the precision of the data.

If our data are consistent and reliable, then we should be able to guess what condition a child has been diagnosed with based on the symptoms a parent has written into Sleuth. This is a limited test for two reasons:

  1. Some kids may have a condition but may not have been diagnosed with it yet. This will appear as a mistake in Sleuth’s testing even though our prediction is actually correct: we could accurately predict that a child has ADHD when the child does, even though our data shows that the child has not received that diagnosis.
  2. Because we are working with kids who have already received a diagnosis, the symptoms parents have reported to Sleuth may be influenced by the language doctors and other specialists use to identify that condition in general.

When we do these tests, as with all of this analysis, we are working with data that is de-identified. The only information we have about a child is age, gender, and the parents’ responses to the 23 ADHD-related topic questions. For a more thorough model in the future, we will include other demographics.

Sleuth’s accuracy at predicting whether or not kids have a diagnosis of attention-deficit hyperactivity disorder with a random forest classifier is 76% (using the sample of 776 with and 512 without). This is an out-of-sample test, evaluating our ability to predict the diagnosis of kids whose data was not used to build the predictive model. The sensitivity of the model is 82% and the specificity is 70%.
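For readers unfamiliar with the terms, sensitivity and specificity come straight from the confusion matrix: sensitivity is the share of diagnosed kids the model flags, specificity the share of undiagnosed kids it correctly clears. The counts below are invented to match the reported rates and are not Sleuth’s actual confusion matrix:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """tp: diagnosed kids flagged by the model, fn: diagnosed kids missed,
    tn: undiagnosed kids correctly cleared, fp: undiagnosed kids flagged."""
    sensitivity = tp / (tp + fn)  # share of diagnosed kids caught
    specificity = tn / (tn + fp)  # share of undiagnosed kids cleared
    return sensitivity, specificity

# Invented counts for illustration (per 100 kids in each group).
sens, spec = sensitivity_specificity(tp=82, fn=18, tn=70, fp=30)
print(sens, spec)  # → 0.82 0.7
```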

Interestingly, if we optimize precision, we can produce a model with 95%+ accuracy in an out-of-sample test for the highest-confidence subset (25%) of the test data. We look forward to sharing more of these investigations as Sleuth’s work continues.
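One way to get such a high-confidence subset is to rank test cases by how far the model’s predicted probability sits from 0.5 and keep only the top 25%. The probabilities and labels below are invented, and this is a sketch of the selection logic rather than Sleuth’s actual model output:

```python
def high_confidence_accuracy(probs, labels, keep_fraction=0.25):
    """Accuracy on the subset of cases where the model is most confident.
    Confidence here is the distance of the predicted probability from 0.5."""
    ranked = sorted(zip(probs, labels), key=lambda pl: abs(pl[0] - 0.5),
                    reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    correct = sum(1 for p, y in kept if (p >= 0.5) == bool(y))
    return correct / len(kept)

# Invented predicted probabilities and true diagnoses (1 = diagnosed).
probs  = [0.97, 0.95, 0.60, 0.55, 0.45, 0.40, 0.05, 0.03]
labels = [1,    1,    1,    0,    1,    0,    0,    0]
print(high_confidence_accuracy(probs, labels))  # → 1.0
```

The trade-off is coverage: the model answers confidently for a quarter of kids and abstains on the rest, which is often the right behavior for a tool that is explicitly not a clinical screen.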

References

  1. Sheldrick RC, Schlichting LE, Berger B, Clyne A, Ni P, Perrin EC, Vivier PM. Establishing New Norms for Developmental Milestones. Pediatrics. 2019 12; 144(6). PMID: 31727860.