itaja
10-12-2009, 07:51 PM
Here's a sample problem:
Let's say I have a collection of 10,000 documents on a particular theme. Each document is associated with one or more of 200 authors. Some authors are more prolific than others (i.e. they have more documents attributed to them in the corpus). We are able to calculate the number of records associated with each author. The numbers (and percentage of total documents) for the most-prolific authors are:
Author A: 2,700 records (27.0%)
Author B: 1,300 records (13.0%)
Author C: 1,100 records (11.0%)
Author D: 900 records (9.0%)
Author E: 800 records (8.0%)
Author F: 700 records (7.0%)
Author G: 580 records (5.8%)
Author H: 400 records (4.0%)
Author I: 320 records (3.2%)
Author J: 230 records (2.3%)
Now, let's say that we have a fancy document clustering widget that takes the 10,000-record corpus and organizes it into 200 clusters ranging in size from 15 to 200 records, where the documents in each cluster are semantically related to each other (i.e. each document cluster represents a distinct "sub-theme" or topic). We're going to assume for the purpose of this exercise that the widget works perfectly and that each document is assigned to exactly one cluster.
Now, let's say that we are interested in identifying topics (i.e. document clusters) that are anomalously associated with particular authors, where "anomalous" simply means deviating significantly from the global baseline distributions.
So, for instance, let's say that Author G (who is associated with 5.8% of the 10,000 overall records) has documents in 160 of the 200 document clusters, and a representation within those clusters ranging from 0.5% to 60% of total cluster records.
So here are my questions:
1. How big does a document cluster need to be, and how far does an author's share of that cluster need to deviate from his overall share of the corpus, before the association between author and document cluster can be considered "significant" in this context? In the "Author G" example given above, for instance, let's say that his most over-represented clusters are as follows:
Cluster records \\ Author G records (% of cluster)
15 \\ 9 (60%)
24 \\ 9 (37.5%)
18 \\ 4 (22.2%)
62 \\ 11 (17.7%)
16 \\ 2 (12.5%)
170 \\ 19 (11.2%)
30 \\ 3 (10%)
200 \\ 19 (9.5%)
Which of these show a "significant" over-representation, and how does one characterize/quantify the degree of confidence?
2. How can we reduce this type of calculation down to an equation and/or rule of thumb?
3. Are we missing any required data points to do this type of calculation?
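One possible way to frame question 1, offered as a rough sketch rather than a vetted method: treat each cluster as a set of independent draws from the corpus, with Author G's overall share (580 / 10,000 = 5.8%) as the chance that any one record is his, and ask how likely it would be to see at least the observed number of his records by chance (a one-sided binomial tail). The binomial model, the 0.05 threshold, and the Bonferroni correction for 200 clusters are all assumptions on my part, not something the problem statement dictates. Because the corpus is finite, a hypergeometric model may be the more exact choice.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Author G's share of the whole corpus, taken from the figures above.
BASELINE = 580 / 10_000

# (cluster records, Author G records), copied from the table in question 1.
clusters = [(15, 9), (24, 9), (18, 4), (62, 11),
            (16, 2), (170, 19), (30, 3), (200, 19)]

# Assumptions: a conventional 0.05 threshold, tightened with a Bonferroni
# correction because one author is being checked against all 200 clusters.
ALPHA = 0.05
N_TESTS = 200
THRESHOLD = ALPHA / N_TESTS

results = []
for n, k in clusters:
    p_value = binom_upper_tail(k, n, BASELINE)
    flagged = p_value < THRESHOLD
    results.append((n, k, p_value, flagged))
    print(f"{n:>4} \\\\ {k:>2} ({k / n:6.1%})  p = {p_value:.2e}  flagged: {flagged}")
```

If this framing holds up, it would also speak to question 3: the calculation needs only the cluster size, the author's count in that cluster, and the author's overall share, all of which are given above. Checking every author against every cluster would mean far more than 200 tests, so the correction would have to be stricter still.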
Thanks for your thoughts and feedback...
(Needless to say, I am not a statistician... but I am grappling with a problem very much like this one and don't know how to proceed)