(Deprecated) Lisa's Tech Blog: spam

Friday, December 16, 2016

[Security Paper Reading Notes] Survey on Web Spam Detection: Principles and Algorithms

Date: 2016-12-16

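This paper was published in SIGKDD Explorations 2013. The authors are Nikita Spirin and Jiawei Han from UIUC.

The paper summarizes the main categories of web spam detection algorithms. It focuses on search engine spam rather than social media spam.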

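Spam categories and techniques
1. Content Spam
Search engines rank page content with the TF-IDF model, so these spammers add popular words to the page content to raise its rank.
2. Link Spam
Search engines use PageRank to evaluate page ranking, so these spammers raise the rank of a target page by increasing the number and quality of its incoming links. They also buy abandoned domains to obtain domains that already have some reputation.
3. Cloaking and Redirection
For the same page, spammers serve different content to different clients. To search engine crawlers they serve content that helps the rank, while to ordinary users they serve advertising content.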
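To make the content spam case concrete, here is a minimal sketch of how keyword stuffing inflates a TF-IDF score. The three toy pages, the query, and the smoothed IDF formula are assumptions chosen for illustration; real search engines use far more elaborate ranking functions.

```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum the TF-IDF weights of the query terms in one document."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        # Term frequency: share of the document taken up by the term.
        tf = counts[term] / len(doc_tokens)
        # Document frequency: number of documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed inverse document frequency (one common variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        score += tf * idf
    return score


# Hypothetical pages: one honest, one stuffed with the query terms.
honest = "cheap flights to paris with travel tips and hotel reviews".split()
stuffed = "cheap flights cheap flights cheap flights buy pills now".split()
other = "weather forecast for the weekend".split()
corpus = [honest, stuffed, other]

query = ["cheap", "flights"]
honest_score = tf_idf_score(query, honest, corpus)
stuffed_score = tf_idf_score(query, stuffed, corpus)
print(honest_score, stuffed_score)
```

Because the stuffed page repeats the query terms, its term frequency is higher and it outscores the honest page even though its real content is unrelated to the query.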

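Existing detection methods fall roughly into three categories
1. content-based methods
These methods mainly analyze word counts, language models, the structure of the HTML page, and cloaking scores.
2. link-based methods
These methods mainly analyze properties of the graph formed by links: label propagation, link pruning and reweighting, and graph regularization. (Readers who intend to do detection based on link structure are advised to read this part of the paper in detail.)
3. data-based methods, e.g., user behavior, clicks, HTTP sessions.
These methods use Markov models to analyze user behavior and similar data.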
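As an illustration of the link-based family, the sketch below propagates trust from a hand-picked seed set along outgoing links, in the style of a seed-biased PageRank. It is a simplified assumption of what label propagation over a link graph can look like, not the specific algorithm from the survey; the graph, the seed set, and the damping value are all hypothetical.

```python
def propagate_trust(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Spread trust from seed pages to the pages they link to.

    out_links maps every page to the list of pages it links to.
    Every link target must also appear as a key of out_links.
    """
    pages = list(out_links)
    seed = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in pages
    }
    trust = dict(seed)
    for _ in range(iterations):
        # Each round, part of the trust is re-injected at the seeds.
        nxt = {p: (1.0 - damping) * seed[p] for p in pages}
        for p in pages:
            targets = out_links[p]
            if not targets:
                continue
            # A page splits its damped trust evenly among its outlinks.
            share = damping * trust[p] / len(targets)
            for q in targets:
                nxt[q] += share
        trust = nxt
    return trust


# Hypothetical graph: a reputable cluster and an isolated link farm.
graph = {
    "news": ["blog", "shop"],
    "blog": ["news"],
    "shop": ["news"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "spam": ["farm1", "farm2"],
}
trust = propagate_trust(graph, trusted_seeds={"news"})
print(trust)
```

The link farm has many internal links but none coming from the trusted cluster, so its pages receive no trust no matter how densely they link to each other.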

Saturday, January 30, 2016

Spam Filter Challenge

Adaptation of Adversaries [1]

The adversaries are motivated to transform the test data to reduce the learner's effectiveness.
Spam filter designers

Attempt to learn good filters by training their algorithms on Spam (and legitimate) email messages received in the recent past.

Spammers

Are motivated to reverse-engineer existing Spam filters and use this knowledge to generate messages which are different enough from the (inferred) training data to circumvent the filters.

Solutions

Increase the robustness of the learning algorithm to generic training/test data differences via standard methods such as regularization or minimization of worst-case loss [1]

However, these techniques do not account for the adversarial nature of the training/test set discrepancies and may be overly conservative.
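The regularization idea can be sketched with a tiny L2-regularized logistic regression trained by batch gradient descent. This is a generic illustration, not code from [1]; the two features (say, a count of spam-like tokens and a count of ordinary tokens), the data, and the hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, l2=0.0, lr=0.1, epochs=200):
    """Batch gradient descent on mean log-loss plus (l2 / 2) * ||w||^2."""
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        # Start from the gradient of the L2 penalty term.
        grad = [l2 * wj for wj in w]
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


# Hypothetical training set: label 1 is spam, label 0 is legitimate.
rows = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

w_plain = train_logistic(rows, labels, l2=0.0)
w_reg = train_logistic(rows, labels, l2=1.0)
print(norm(w_plain), norm(w_reg))
```

The penalty keeps the weights small, so the filter leans less heavily on any single feature. As the note above says, this protects against generic drift and does not model what an adversary will actually do.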

Predictive analytics to anticipate and counter the adversaries [1]

For example, predictions can be made using extrapolation or game-theoretic considerations, and can be employed to transform training instances so that they become similar to (future) test data and therefore provide a more appropriate basis for learning.
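A minimal sketch of the extrapolation variant: estimate how the mean spam feature vector moved between two past periods, then shift the most recent spam instances by that drift before training. This is one naive reading of the idea, not the procedure from [1]; the function names, the two features, and the numbers are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extrapolate_spam(older_spam, recent_spam, step=1.0):
    """Shift recent spam along the observed drift of the spam mean."""
    old_mean = mean_vector(older_spam)
    new_mean = mean_vector(recent_spam)
    # Linear extrapolation: assume the next period repeats the last shift.
    drift = [step * (b - a) for a, b in zip(old_mean, new_mean)]
    return [[x + d for x, d in zip(row, drift)] for row in recent_spam]


# Hypothetical features: [share of known spam keywords, share of obfuscated tokens]
older_spam = [[0.9, 0.1], [0.8, 0.2]]
recent_spam = [[0.7, 0.3], [0.6, 0.4]]

predicted_spam = extrapolate_spam(older_spam, recent_spam)
print(predicted_spam)
```

Training on `predicted_spam` instead of `recent_spam` bets that spammers keep moving in the same direction. The `step` parameter controls how far ahead the extrapolation reaches.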

Time-varying posture to increase uncertainty [1]

Pros

This approach is flexible, scalable, easy to implement, and hard to reverse-engineer.
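One way to picture a time-varying posture is a filter that picks at random among several differently configured classifiers for each message, so that probing it reveals less about any single decision rule. The sketch below is an assumption about what such a scheme could look like, not the design from [1]; the keyword sets, threshold, and class name are hypothetical.

```python
import random


def make_keyword_filter(keywords, threshold):
    """Build a filter that flags a message with enough keyword hits."""
    def is_spam(message):
        words = message.lower().split()
        hits = sum(1 for w in words if w in keywords)
        return hits >= threshold
    return is_spam


class MovingTargetFilter:
    """Apply a randomly chosen member filter to each message."""

    def __init__(self, filters, seed=None):
        self._filters = list(filters)
        self._rng = random.Random(seed)

    def is_spam(self, message):
        chosen = self._rng.choice(self._filters)
        return chosen(message)


# Hypothetical member filters built on overlapping keyword sets.
filters = [
    make_keyword_filter({"free", "winner", "prize"}, threshold=2),
    make_keyword_filter({"click", "offer", "free"}, threshold=2),
    make_keyword_filter({"urgent", "prize", "offer"}, threshold=2),
]
mtd = MovingTargetFilter(filters, seed=0)

spam_verdicts = [mtd.is_spam("free prize offer click now") for _ in range(5)]
ham_verdicts = [mtd.is_spam("meeting notes attached") for _ in range(5)]
print(spam_verdicts, ham_verdicts)
```

A spammer who removes the keywords of one member filter can still be caught by another, and cannot tell from a single verdict which filter produced it.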

Reference

[1] Moving Target Defense for Adaptive Adversaries, by Richard Colbaugh and Kristin Glass, in ISI 2013.