In this whitepaper we consider the problem of outbound-filtering of emails to prevent accidental leakage of confidential information, We examine how to do this with the GPLed open-source spam filter CRM114 and test the accuracy of this filter against a 10,000+ document corpus of hand-classified emails (both confidential and non-confidential) in Japanese. We look into what moving parts are involved in these filters, and how they can be set up. The results show that a hybrid of multiple CRM114 filters outperforms a human-crafted regular-expression filter by nearly 100x in recall, by detecting > 99.9% of confidential documents, and with a simultaneous false alarm rate of less than 5.3%. As the programmers creating the machine-learning programs don't know how to read or write Japanese, this problem is an almost ideal case of the Searle “Chinese Room” problem.
Secdocs is a project aimed to index high-quality IT security and hacking documents. These are fetched from multiple data sources: events, conferences and generally from interwebs.
Serving 8166 documents and 531.0 GB of hacking knowledge, indexed from 2419 authors from 163 security conferences.