Stylometry is the statistical analysis of linguistic style. Writing can be a unique, personal process. The vocabulary selected, the syntax, and the grammatical decisions all leave behind a signature. The same is true of digital data. Information Management Today brought this interesting information to our attention in their article, “Even Anonymous Coders Leave Fingerprints.”
Automated tools can now accurately identify the author of a forum post, for example, as long as they have adequate training data to work with. Stylometry also applies to samples of artificial languages, such as code. Software developers leave fingerprints behind, too.
Researchers ran an experiment on compiled binaries of code samples from Google’s annual Code Jam competition. Using eight code samples from each programmer, the machine learning algorithm correctly identified the author from a group of 100 individual programmers 96 percent of the time. Even when the pool was widened to 600 programmers, the algorithm still made an accurate identification 83 percent of the time.
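To make the idea concrete, here is a minimal sketch of how code attribution can work in principle. It is not the researchers' method, whose features and classifier are not described here. As stand-ins, it assumes character trigram counts as the style features and cosine similarity to a per-author profile as the classifier. The author names and code snippets are made-up toy data.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, a crude proxy for coding style."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(samples_by_author, n=3):
    """Merge each author's known samples into one n-gram profile."""
    profiles = {}
    for author, samples in samples_by_author.items():
        profile = Counter()
        for sample in samples:
            profile.update(char_ngrams(sample, n))
        profiles[author] = profile
    return profiles


def attribute(snippet, profiles, n=3):
    """Return the author whose profile is most similar to the snippet."""
    grams = char_ngrams(snippet, n)
    return max(profiles, key=lambda author: cosine(grams, profiles[author]))


# Hypothetical training data: two invented authors with different
# spacing, brace, and naming habits.
training = {
    "alice": [
        "for (int i = 0; i < n; i++) {\n    total += values[i];\n}\n",
        "if (count == 0) {\n    return -1;\n}\n",
    ],
    "bob": [
        "for(int idx=0;idx<n;++idx)\n{\n\tsum+=arr[idx];\n}\n",
        "if(cnt==0)\n{\n\treturn -1;\n}\n",
    ],
}

profiles = build_profiles(training)

# An unseen snippet written with tight spacing and braces on their own lines.
unknown = "while(k<n)\n{\n\tk+=2;\n}\n"
guess = attribute(unknown, profiles)
print(guess)  # prints "bob" for this toy data
```

Even these crude features pick up on habits such as brace placement and spacing. With only two short samples per author the result is purely illustrative. It says nothing about the accuracy figures reported above.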
Melody K. Smith
Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.