When is deviation from the baseline 'significant'?


itaja
10-12-2009, 07:51 PM
Here's a sample problem:

Let's say I have a collection of 10,000 documents on a particular theme. Each document is associated with one or more of 200 authors. Some authors are more prolific than others (i.e. they have more documents attributed to them in the corpus). We are able to calculate the number of records associated with each author. The numbers (and percentage of total documents) for the most-prolific authors are:

Author A: 2,700 records (27.0%)
Author B: 1,300 records (13.0%)
Author C: 1,100 records (11.0%)
Author D: 900 records (9.0%)
Author E: 800 records (8.0%)
Author F: 700 records (7.0%)
Author G: 580 records (5.8%)
Author H: 400 records (4.0%)
Author I: 320 records (3.2%)
Author J: 230 records (2.3%)

Now, let's say that we have a fancy document clustering widget that takes the 10,000-record corpus and organizes it into 200 clusters ranging in size from 15 to 200 records, where the documents in each cluster are semantically related to each other (i.e. each document cluster represents a distinct "sub-theme" or topic). We're going to assume for the purpose of this exercise that the widget works perfectly and that each document is assigned to exactly one cluster.

Now, let's say that we are interested in identifying topics (i.e. document clusters) that are anomalously associated with particular authors, where "anomalous" simply means deviating significantly from the global baseline distributions.

So, for instance, let's say that Author G (who is associated with 5.8% of the 10,000 overall records) has documents in 160 of the 200 document clusters, and a representation within those clusters ranging from 0.5% to 60% of total cluster records.

So here are my questions:

1. How big does a document cluster need to be, and how far does a particular author's share of that cluster need to deviate from his baseline share, before the anomalous association between author and document cluster can be considered "significant" in this context? In the "Author G" example given above, for instance, let's say that his most over-represented clusters are as follows:

Cluster records | Author G records (share of cluster)
15  | 9 (60%)
24  | 9 (37.5%)
18  | 4 (22.2%)
62  | 11 (17.7%)
16  | 2 (12.5%)
170 | 19 (11.2%)
30  | 3 (10%)
200 | 19 (9.5%)

Which of these show a "significant" over-representation, and how does one characterize/quantify the degree of confidence?

2. How can we reduce this type of calculation down to an equation and/or rule of thumb?

3. Are we missing any required data points to do this type of calculation?

Thanks for your thoughts and feedback...

(Needless to say, I am not a statistician... but I am grappling with a problem very much like this one and don't know how to proceed)

itaja
10-13-2009, 06:49 PM
OK, well, I had to forge ahead with this, so I came up with my own little equation to sort by deviation from the norm after correcting for sample size (i.e. cluster size). Here's what I did:

2*((Sr-(Ss*Pp))/SQRT(Ss))

Where:

Sr = 'sample records' (number of records for a particular author in a given sample, i.e. document cluster)

Ss = 'sample size' (total number of records in the document cluster in question)

Pp = 'population proportion' (calculated representation for the author in the full corpus)

So in descriptive language, Pp*Ss calculates an "expected" value for each author for each cluster; Sr-(Pp*Ss) calculates deviation from the expected value for each cluster; division by the square root of the sample size corrects for variation in sample size.
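To make the three steps concrete, here is a minimal Python sketch of the score exactly as described above. The function name `deviation_score` and its argument names are just illustrative choices, not anything from a library.

```python
from math import sqrt

def deviation_score(sr, ss, pp):
    """Score from the post: 2 * (Sr - Ss*Pp) / sqrt(Ss)."""
    expected = ss * pp           # records expected if the cluster matched the corpus baseline
    deviation = sr - expected    # observed minus expected
    return 2 * deviation / sqrt(ss)  # scale by the square root of the cluster size

# Author G (Pp = 0.058) with 9 records in a 15-record cluster
print(round(deviation_score(9, 15, 0.058), 1))
```

Note that the leading factor of 2 only rescales the scores; it does not change how clusters rank against each other.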

So I'll now give an easier question. What's wrong with the above?





itaja
10-13-2009, 07:27 PM
Oh, and to make it more concrete, for the examples I gave for "Author G" in my initial post, here are the scores I came up with:

(Author G's proportion in the population: Pp = 0.058)

"Scores"

Cluster 1: 4.2 [Ss=15; Sr=9]
Cluster 2: 3.1 [Ss=24; Sr=9]
Cluster 3: 1.4 [Ss=18; Sr=4]
Cluster 4: 1.9 [Ss=62; Sr=11]
Cluster 5: 0.5 [Ss=16; Sr=2]
Cluster 6: 1.4 [Ss=170; Sr=19]
Cluster 7: 0.5 [Ss=30; Sr=3]
Cluster 8: 1.0 [Ss=200; Sr=19]

So Cluster 1, with 9/15 records for Author G, deviates most substantially from the global baseline; Cluster 2, with 9/24 records, is next in line, followed by Clusters 4, 6, 3, 8, 5, and 7.
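For anyone who wants to check the list and the ordering, here is a short Python sketch that recomputes the eight scores from the Ss and Sr values given above and sorts the clusters by score. The dictionary layout is only an illustrative choice.

```python
from math import sqrt

PP = 0.058  # Author G's share of the full corpus

# cluster number -> (Ss, Sr), copied from the list above
clusters = {1: (15, 9), 2: (24, 9), 3: (18, 4), 4: (62, 11),
            5: (16, 2), 6: (170, 19), 7: (30, 3), 8: (200, 19)}

# score = 2 * (Sr - Ss*Pp) / sqrt(Ss)
scores = {k: 2 * (sr - ss * PP) / sqrt(ss) for k, (ss, sr) in clusters.items()}

# clusters ordered from most to least over-represented
ranking = sorted(scores, key=scores.get, reverse=True)

for k in ranking:
    print(f"Cluster {k}: {scores[k]:.1f}")
```

Clusters 6 and 3 both round to 1.4; before rounding, Cluster 6 comes out slightly higher, which is why it sits ahead of Cluster 3 in the ordering.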

A score of 0 would indicate zero deviation from the baseline. If I have figured this right, scores above ~2 would show significant deviation from the baseline at the 95% confidence level. Scores of less than 1 would be within one standard deviation of the global mean.

Yes? No?
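
One way to sanity-check the "scores above ~2" reading is to compare the score with a textbook binomial calculation. The sketch below is not the method used in this thread. It assumes each record in a cluster independently belongs to Author G with probability Pp, which is an assumption added here for illustration. Under that model the standard deviation of the count is SQRT(Ss*Pp*(1-Pp)), whereas the score above divides by SQRT(Ss)/2, which matches the binomial standard deviation only when Pp = 0.5. For Pp = 0.058 the binomial z-score works out to roughly 2.1 times the score. The helper names `binomial_z` and `upper_tail` are hypothetical.

```python
from math import comb, sqrt

PP = 0.058  # Author G's share of the full corpus

def binomial_z(sr, ss, pp):
    """z-score of the observed count under a Binomial(ss, pp) model (assumed here)."""
    return (sr - ss * pp) / sqrt(ss * pp * (1 - pp))

def upper_tail(sr, ss, pp):
    """Exact P(X >= sr) for X ~ Binomial(ss, pp), summed term by term."""
    return sum(comb(ss, k) * pp**k * (1 - pp)**(ss - k) for k in range(sr, ss + 1))

# (Ss, Sr) for Clusters 1..8, in the order listed in the thread
clusters = [(15, 9), (24, 9), (18, 4), (62, 11), (16, 2), (170, 19), (30, 3), (200, 19)]

for i, (ss, sr) in enumerate(clusters, start=1):
    score = 2 * (sr - ss * PP) / sqrt(ss)
    print(f"Cluster {i}: score={score:.2f}  z={binomial_z(sr, ss, PP):.2f}  "
          f"P(X>={sr})={upper_tail(sr, ss, PP):.2e}")
```

When the expected count Ss*Pp is small (it is below 1 for the 15- and 16-record clusters), the normal approximation behind any z-score is rough, so the exact tail probability is the safer number to look at under this model.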