
Saturday December 16, 2006
Search Engine Comparison
Comparison of search engine traffic and trends.
The overall picture looks stagnant to me. Three changes appear in early 2006: Google's traffic peaks, Yahoo begins testing previous lows, and the real surprise is that Baidu.com peaks. MSN and Yahoo topped out in 2002 and 2003, respectively. The recent shakeup at Yahoo shouldn't be a surprise; the graphs show Yahoo in decline.
The appearance of three different events in the same period (early 2006) could be a warning sign.
Items to note - The ranking in Google Trends matches the ranking in Alexa... Yahoo, then MSN, followed by Google and Baidu last. It's a likely confirmation that search trends tend to mirror network traffic. Notice how Google and MSN are near a crossover point in each graph.



( Dec 16 2006, 08:54:33 PM EST )
Click Fraud Addendum
Another collection of unusual traffic: a series of logins that all share the same browser/OS stamp, and whose page and hit counts always equal each other.
Is this another click fraud engine?
What the heck is this?
Why is it coordinated?
Why not just let each system fire off at random?
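The pattern described above can be checked mechanically. A minimal sketch, assuming each session has already been reduced to a record with the hypothetical fields `agent` (the browser/OS stamp), `pages` and `hits`; the field names, the sample data and the threshold of three sessions are illustrative assumptions, not taken from the actual logs:

```python
from collections import defaultdict

def coordinated_groups(sessions, min_sessions=3):
    """Group sessions by browser/OS stamp and keep the groups in which
    every single session has pages == hits."""
    by_agent = defaultdict(list)
    for s in sessions:
        by_agent[s["agent"]].append(s)
    suspicious = {}
    for agent, group in by_agent.items():
        if len(group) >= min_sessions and all(s["pages"] == s["hits"] for s in group):
            suspicious[agent] = len(group)
    return suspicious

# Made-up sample: three sessions share one stamp and always have pages == hits.
sessions = [
    {"agent": "MSIE 6.0 / Windows XP", "pages": 2, "hits": 2},
    {"agent": "MSIE 6.0 / Windows XP", "pages": 4, "hits": 4},
    {"agent": "MSIE 6.0 / Windows XP", "pages": 1, "hits": 1},
    {"agent": "Firefox 2.0 / Linux", "pages": 1, "hits": 7},
]
result = coordinated_groups(sessions)
```

A match here only says the group is worth a closer look; a popular browser with text-only pages could produce the same numbers honestly.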

( Dec 16 2006, 05:45:38 AM EST )

Friday December 15, 2006
Vector & Clickage
Vector & Frequency
Vector & Checkmate
Vector & Sequence
Information always leaves a fingerprint. Sometimes it is smudged, sometimes it is misleading but it is always there.
Suppose Casper had a serious financial problem. Suppose he borrowed a lot of money to buy houses that he can't resell. Then suppose he created a website about his problem.
If his website was intimately revealing, he might draw a crowd. Its traffic stream would grow and its meme propagate. Casper might then entertain thoughts about website advertising to solve his problem.
Enter Guido wearing a pin-stripe suit. But now Guido carries a laptop instead of a submachine gun. He might hire a programmer. That programmer might write a program, a Generator that mimics internet traffic. I can imagine that program, the ease of coding it. Guido could hire another programmer for data-mining, monitoring for sites with rapidly increasing traffic, knowing that most sites never attain commercial viability.
If Casper's site appeared on Team Guido's radar, they may ponder his financial situation and watch and wait for its traffic to sag. Then they could loan him money and stoke his hopes for higher clickstream traffic. And the Generator would make it so, slowly over time.
And Casper might not even know that his banker's name is Guido.
And... it might not be.
How do we know?
We look for fingerprints.
( Dec 15 2006, 09:59:21 PM EST )

Wednesday December 13, 2006
The Cultural Diffusion Resurrected, Part II
Part I introduces the "Cultural Diffusion" theory and some circumstantial evidence.
I wanted definitive, exhaustive proof of the Diffusion and that it is an Internet-related effect. That effort produced the "Document Diversity" entries...
Hypothesis #1 - We can't measure memes directly, but keywords are a rough proxy. If the # of keywords in general use increases, then the # of memes is probably increasing. This would be strong evidence of the Cultural Diffusion because it's based on the entire population of keywords, not a few selected examples.
Document Diversity outlines the methodology and proves that keyword count is constant over time. It disproves Hypothesis #1.
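A rough sketch of the kind of count Hypothesis #1 calls for, assuming the archive has been reduced to hypothetical `(year, text)` pairs and that a "keyword" is simply a lowercase alphabetic token. Both are simplifying assumptions for illustration, not the methodology of the Document Diversity entries:

```python
from collections import defaultdict

def distinct_keywords_per_year(posts):
    """posts: iterable of (year, text). Returns {year: number of distinct keywords}."""
    vocab = defaultdict(set)
    for year, text in posts:
        # Treat every purely alphabetic token as a keyword (an assumption).
        vocab[year].update(w for w in text.lower().split() if w.isalpha())
    return {year: len(words) for year, words in sorted(vocab.items())}

# Made-up sample data.
posts = [
    (1990, "unix kernel modem"),
    (1990, "modem bbs unix"),
    (1991, "linux kernel modem bbs"),
]
counts = distinct_keywords_per_year(posts)
```

Vocabulary counts grow with the amount of text sampled, so a comparison like this only means something if every year contributes a sample of the same size.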
Hypothesis #2 - Now we know that # of keywords in circulation is roughly constant. But does the makeup of those words change over time? Does "memetic drift" exist, i.e. a turnover of memes in the Ideosphere? An accelerating rate of turnover could also create the "Cultural Diffusion".
The Memetic Drift entry proves that there is memetic drift. But it shows a constant rate of change with no acceleration, ergo it disproves Hypothesis #2.
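One way to put a number on turnover between years is the Jaccard distance between consecutive yearly vocabularies. This is a swapped-in measure chosen for illustration, not necessarily what the Memetic Drift entry used, and the yearly keyword sets below are made up:

```python
def turnover(prev, curr):
    """Fraction of the combined vocabulary that is not shared (Jaccard distance)."""
    union = prev | curr
    return len(prev ^ curr) / len(union) if union else 0.0

def drift_by_year(vocab_by_year):
    """Turnover between each year and the year before it."""
    years = sorted(vocab_by_year)
    return {b: turnover(vocab_by_year[a], vocab_by_year[b])
            for a, b in zip(years, years[1:])}

# Made-up yearly keyword sets: one word is swapped out each year.
vocab = {
    1990: {"unix", "modem", "bbs", "kernel"},
    1991: {"unix", "modem", "bbs", "linux"},
    1992: {"unix", "modem", "web", "linux"},
}
drift = drift_by_year(vocab)
```

A flat series of values is drift without acceleration; Hypothesis #2 would need the values to climb year over year.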
Hypothesis #3 - Now we know that keyword count and memetic drift remain constant over time. But the frequency distribution of keywords may be changing. Assume the frequency of keyword usage is a normal curve. If that curve is flattening over time, then it's a proxy for shrinkage in "mainstream culture" and for growth in fringe (marginal) sub-cultures. Can we measure a change in the frequency of keyword usage over time?
Document Diversity Result measured the rate of change in keyword frequency. But my dataset isn't big enough for valid results. It seems that there are roughly 10,000 keywords in use and the sample size must accurately measure small changes over a fifteen-year period. I estimate that 2 million queries are needed and my research is stalled here.
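For what it's worth, the "flattening" in Hypothesis #3 can be expressed as a single number per year without assuming a normal curve. The sketch below uses normalized Shannon entropy, which is a substitute technique and not the measurement from the Document Diversity Result entry; the keyword lists are invented:

```python
import math
from collections import Counter

def normalized_entropy(keywords):
    """Shannon entropy of the keyword frequency distribution, scaled to [0, 1].
    1.0 means every keyword is used equally often (a completely flat curve)."""
    counts = Counter(keywords)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return h / math.log2(len(counts))

# Made-up samples: one dominated by a single keyword, one spread more evenly.
concentrated = ["love"] * 8 + ["war"] + ["unix"]
flat = ["love"] * 4 + ["war"] * 3 + ["unix"] * 3
```

If the value rises steadily from year to year on same-sized samples, usage is spreading away from the dominant keywords, which is the flattening the hypothesis predicts.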
Other supporting entries -
My original ideas to prove that the Cultural Diffusion exists.
Document Diversity - The Prequel is my first realization of how many problems are cropping up while building the original data set. The Usenet hierarchy went through a major restructuring in 1989, so I eliminate 1988 and 1989 from consideration.
Document Diversity - The Rollback is a rethinking after Hypothesis #1 failed from bad data-selection methodology.
Document Diversity - The Regrouping rethinks the original theory and attempts to find a new mechanism for the "Cultural Diffusion".
( Dec 13 2006, 03:45:18 PM EST )
The Cultural Diffusion Resurrected, Part I
A rework of "The Cultural Diffusion" entries in expectation of new research that proves or disproves the theory.
Total information grows faster than human population and it's easily duplicated, so the finite space of each human skull has a greater diversity of information than ever in history. If information is a prime driver of culture, then "cultural diffusion" should occur; mainstream culture should shrink as fringe cultures spawn and grow -

The result should be escalating costs expressed as information problems - miscommunication, cultural conflicts, etc. Eventually, the net cost of diversity should exceed its societal benefits and produce a big drop in the price of information as demand falls off.
An operational view as more memes enter our finite mental bandwidth -
Circumstantial Evidence
"Love" is my baseline (it was the strongest foundation meme I found) and proxy for total meme bandwidth. I charted it against the exponential growth of several "foundation" memes to measure changes in relative bandwidth.
In the cultural diffusion theory, most memes should gradually lose bandwidth to new memes. The labels marked "divergence" indicate a meme in decline, losing its percentage share of bandwidth -






( Dec 13 2006, 03:43:51 PM EST )

Tuesday December 12, 2006
Today's Mystery Guest

( Dec 12 2006, 09:03:55 PM EST )
Webmaster Radio
I did an interview for Webmaster Radio about datamining and click fraud detection. The broadcast date is sometime this week as part of the "Click Fraud" series with Jim Hedger and the projected audience numbers are somewhere in the 100K to 300K range.
That's about 10X more than my best website traffic, which itself was 10X greater than any other day (my site was mentioned in a Slashdot article).
Then off to San Francisco on Thursday and Friday to meet at two companies.
And apparently I got added to a South Korean search engine just now.
( Dec 12 2006, 03:31:06 PM EST )

Monday December 11, 2006
The Traffic Generator
Traffic Generator & Click Fraud Detection Strategy
I thought I was the test subject of a new search engine. But now I believe that this is fraud. The traffic generator which hit my site had clear variations from my normal traffic but it could easily be modified to be less detectable -
- Too many simultaneous operating systems per IP. My traffic rarely has two systems from a single IP, much less simultaneously.
- The traffic is too dense and the changeover too abrupt. When I get traffic like this, it always references the same blog entry. The engine may have tried to mimic my apparent traffic (which is far higher than my real traffic).
- Too many 2-page hits. 90% of my traffic is a 1-page hit, 5% is 3+.
- The page hits don't follow a human click flow. The pages have no relationship to each other.
- All hits are bookmarks, no entry points into the site.
- Most bookmarks are older and rarely referenced by real traffic.
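Two of the checks in the list above are easy to sketch in code. This assumes the log has already been parsed into hypothetical `(ip, os_signature)` pairs and a list of pages-per-visit counts; the names and sample values are made up, and the "simultaneous" part (a time window per IP) is left out for brevity:

```python
from collections import Counter, defaultdict

def multi_os_ips(hits):
    """hits: iterable of (ip, os_signature). Returns IPs seen with 2+ operating systems."""
    seen = defaultdict(set)
    for ip, os_sig in hits:
        seen[ip].add(os_sig)
    return {ip: sorted(sigs) for ip, sigs in seen.items() if len(sigs) > 1}

def page_count_shares(pages_per_visit):
    """Share of visits that were 1-page, 2-page and 3+-page."""
    total = len(pages_per_visit)
    if total == 0:
        return {"1": 0.0, "2": 0.0, "3+": 0.0}
    buckets = Counter("3+" if n >= 3 else str(n) for n in pages_per_visit)
    return {k: buckets.get(k, 0) / total for k in ("1", "2", "3+")}

# Made-up sample data using documentation-range IP addresses.
hits = [
    ("192.0.2.1", "Windows XP"),
    ("192.0.2.1", "Mac OS X"),
    ("192.0.2.2", "Linux"),
]
flagged = multi_os_ips(hits)
shares = page_count_shares([1, 2, 2, 2, 3])
```

Against the baseline quoted above (90% one-page, 5% three-plus, which leaves roughly 5% two-page), a burst where the two-page share dominates stands out immediately.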
It might not be illegal.
Immoral, yes.
I believe the engine is generating the browser and OS signatures.
The traffic generator looks remarkably like a search engine spider except for:
- many IP addresses which read a few pages each (versus few IP addresses which read many pages)
- multiple OS and browser signatures per IP
- doesn't identify itself as a search engine spider
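The three differences above can be turned into a rough classifier for a burst of traffic. This is a sketch under stated assumptions: the input is a hypothetical list of `(ip, user_agent)` pairs (one per page read), the `BOT_MARKERS` substrings are an informal guess rather than an official list, and the thresholds are arbitrary:

```python
BOT_MARKERS = ("bot", "spider", "crawl", "slurp")  # assumed substrings, not an official list

def classify_burst(requests, few_ips=5, many_pages=20):
    """requests: list of (ip, user_agent), one entry per page read in the burst."""
    pages_by_ip = {}
    agents_by_ip = {}
    for ip, agent in requests:
        pages_by_ip[ip] = pages_by_ip.get(ip, 0) + 1
        agents_by_ip.setdefault(ip, set()).add(agent)
    if not pages_by_ip:
        return "empty"
    avg_pages = len(requests) / len(pages_by_ip)
    identifies = any(m in agent.lower() for _, agent in requests for m in BOT_MARKERS)
    mixed_agents = any(len(agents) > 1 for agents in agents_by_ip.values())
    # Spider: few IPs, many pages each, announces itself.
    if len(pages_by_ip) <= few_ips and avg_pages >= many_pages and identifies:
        return "spider"
    # Generator-like: many IPs, few pages each, anonymous, several signatures per IP.
    if len(pages_by_ip) > few_ips and avg_pages < many_pages and not identifies and mixed_agents:
        return "generator-like"
    return "unclear"

# Made-up bursts for illustration.
spider_burst = [("192.0.2.10", "Googlebot/2.1")] * 40
generator_burst = [
    (f"198.51.100.{i}", agent)
    for i in range(10)
    for agent in ("MSIE 6.0; Windows XP", "Safari; Mac OS X")
]
```

Anything that lands in "unclear" still needs a human look; the point is only to separate the two obvious shapes.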
Detailed documentation -



The strangest part is that the suspected site gave me new insights into Internet evolution and structure. The rework of my mining methodology succeeded in identifying anomalous traffic. My postings about their website traffic anomalies may have spooked a few people.
I've been troubleshooting stuff since 1978: first electro-mechanical stuff, then electronics, then software for the past fifteen years. Still, I'm surprised the fraud was so transparent. My percentage bet is that it's about the advertising money. I suspect that the engine is generating income in the millions of dollars.
( Dec 11 2006, 07:15:57 PM EST )
Traffic Generator Detection Strategy
MemeMiner uses Dejanews.com time-series keyword searches to predict technical trends. But sometimes it gives false predictions, so I developed a more sophisticated model by adding graphs from Google Trends, Nielsen's BlogPulse and Alexa.com to confirm trends. As a consequence, I detected interesting anomalies in web traffic which led me to the traffic generator.
Miner Theory - Normal meme propagation usually looks like an S-curve over time.
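For reference, the textbook S-curve is the logistic function. The sketch below assumes that form; the theory only says propagation "usually looks like" an S-curve, and the parameter names and values here are illustrative:

```python
import math

def logistic(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """Logistic S-curve: slow start, rapid middle, saturation at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

# Eleven samples of a meme that saturates at 100 mentions per period
# and is growing fastest at t = 5 (all values are made up).
curve = [round(logistic(t, ceiling=100, rate=1.0, midpoint=5), 1) for t in range(0, 11)]
```

Traffic that jumps straight to a plateau, with no slow start and no tapering, is the kind of departure from this shape that is worth a second look.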

As memes propagate over the Internet, they seep into domains at different rates. In this new model, I sample three points (see the MySpace Meme as an example): a memetic entry point (communication avenue), a measurement avenue, a reference avenue and then look at the propagation differences -
1) Communication avenue - the true vector of meme propagation like Dejanews.com, BlogPulse.com.
2) Measurement avenues - Alexa.com or Netcraft.com.
3) Reference avenues - Google.com/trends, other search engines.

Three different points provide insight into how and why a meme is propagating, a trouble-shooting guide.
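A minimal sketch of comparing the three avenues, assuming each one has been reduced to a hypothetical time series of counts over the same period. The flag rule (traffic growing several times faster than both discussion and searches) and the `ratio` threshold are assumptions made for this example, not the actual MemeMiner logic:

```python
def growth(series):
    """Relative change from the first to the last sample of a time series."""
    first, last = series[0], series[-1]
    return (last - first) / first if first else float("inf")

def compare_avenues(communication, measurement, reference, ratio=3.0):
    """Return the growth of each avenue plus any flags for suspicious mismatches."""
    g = {
        "communication": growth(communication),
        "measurement": growth(measurement),
        "reference": growth(reference),
    }
    flags = []
    # Assumed rule: measured traffic should not outrun both other avenues by `ratio`.
    if g["measurement"] > ratio * max(g["communication"], g["reference"], 0.01):
        flags.append("traffic grew without matching discussion or searches")
    return g, flags

# Made-up series: the first site's traffic explodes while talk and searches barely move.
suspect_growth, suspect_flags = compare_avenues([100, 105, 110], [100, 300, 600], [100, 110, 120])
organic_growth, organic_flags = compare_avenues([100, 200, 400], [100, 180, 350], [100, 190, 380])
```

In an organic case all three avenues rise together, at different rates; the interesting cases are the ones where one avenue moves and the others show "nothing".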

Note: "nothing" means "less than expected". Meme propagation promotes seepage (bleedover) at different rates throughout the Ideosphere.
( Dec 11 2006, 05:59:36 PM EST )

Saturday December 09, 2006
State Of Affairs
I lost my car last night.
The police found it for me.
My first thought was that they'd towed it.
I did an interview yesterday, too.
Those two events sparked a realization.
My faith in authority and management figures has declined.
It started with the deceit at Boise State University, my good Mormon friends engaged in their conspiracies to promote and protect their peers at everyone else's expense. Then I worked for Third Wave on a project which was defrauding the City of Las Vegas, exposed by the LV Review Journal in this article and this article. At Saleslogix, we hired the CFO of iNBC, itself notorious for fraud, and our balance sheets started looking strange and there were dubious statements to the Board of Directors (remember "Tsunami"?!). At Avnet I was caught between the self-serving old guard and self-serving ex-Motorola newcomers, both engaged in short-sighted political trickery.
The last two employers promised but didn't deliver. Then I learned my lesson and discarded this Portland job offer as soon as it turned strange. That job reopened a few days ago so I probably made a good decision.
And now I've found myself some search engine fraud.
I have seen so much fraud, deceit and spite over the past ten years. Yokels, paranoid kids, promotions for liars, disquieting silences from leadership.
When will the trend reverse?
Why doesn't anybody else notice it?
Or talk about it?
( Dec 09 2006, 11:54:11 AM EST )

Friday December 08, 2006
The Traffic Generator Vivisected
The Traffic Generator Identified
The Traffic Generator Detection Strategy
"Why did the [traffic generator] hit my site twice?"
My theory? Because it is automated and recursive.
It has to generate bleedover traffic in "innocent" crosslinked sites to maintain the illusion that the fraud site is experiencing a surge of traffic.
I originally thought that the traffic generator used crosslinks directly from the primary fraud site. No. It's more sophisticated than that. It indexes the website of the crosslink and stores the references in a database. It recurses through all links on the fraud site and then generates randomized IP hits against the secondary crosslinked sites. It probably only recurses one level deep. High traffic sites wouldn't notice the discrepancies of the engine and low traffic sites may not check or care.
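From the detection side, the useful consequence of this theory is that the set of sites likely to see bleedover traffic can be listed by reading one level of links off the suspect site. A sketch using only the standard library, with a made-up page and `example` hostnames; it parses HTML that has already been saved and makes no network requests:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href=...> on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(urljoin(self.base_url, href))

def crosslinked_hosts(base_url, html):
    """Hosts linked from one page, excluding the page's own host (one level deep only)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    return sorted({urlparse(u).netloc for u in collector.links} - {own_host, ""})

# Made-up page: one internal link and two crosslinks to other sites.
page = (
    '<a href="/about">About</a> '
    '<a href="http://blog.example.org/entry1">A blog</a> '
    '<a href="http://news.example.net/">News</a>'
)
hosts = crosslinked_hosts("http://suspect.example.com/", page)
```

If the owners of the listed sites compared logs for the same dates, a shared burst of anonymous, few-pages-per-IP traffic would support the theory; its absence would count against it.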
I understand the scheme.
I'll update this later with visual diagrams.
Update: My traffic suggests that Texas (Houston & Dallas) is a likely geographical location for the Owners of the Traffic Generator. 
( Dec 08 2006, 08:53:27 PM EST )
Defcon 2007 Submission
I have my Defcon 2007 submission figured out!!!!
Detecting Search Engine Advertising Fraud
Tactical Fraud Analysis
Strategic Analysis
It's perfect. It uses the Meme theory from my previous presentations and evolves it into a real-world security application. I'll have to rework it a little. The odds are good that third-party documentation and confirmations will be available by July, too.
I'm close to breaking out.
I've been in this sucky trading range for 3 1/2 fricking years.
But I have a goal now.
August, 2007.
Defcon.
Thank you, Casey!
---
You are magical to me.
( Dec 08 2006, 05:55:46 PM EST )

Tuesday December 05, 2006
Valentine's Day, 2007
Perhaps Casey Serin is right. Maybe I need a "retreat day" to review the recent past and contemplate the near future. Valentine's Day is a good choice. It's far out enough for good planning, but not so far it may devolve into a "possible" plan.
I like that. It's an appropriate day for me, considering my luck with women. I might take that day off and sit out at a lake somewhere with my laptop, do a rambling stream of consciousness weblog.
That's a Wednesday.
I'll have to find an area that won't be crowded on a Wednesday, hopefully out here by a secluded lake. I might take pictures.
Yup.
We definitely need a plan that we can stick to.
( Dec 05 2006, 03:11:14 AM EST )

Monday December 04, 2006
Online Dating Paranoia Addendum
Previous update
I did cancel all personal ads in August but I succumbed to weakness and tried Yahoo Personals again several days ago. The results were even suckier, seeing as I'm in Idaho and my possible matches are 10% the size of Seattle's.
Cancelled it again.
There's an ItsJustLunch.com in Boise.
I'll reactivate my account if I decide that I'm staying here.
Bars - Last night I met Kristin, a twenty-two year old coed from North Carolina, dirty blonde, 5'3", under 100 lbs, probably. I beat her two of three games. Then I played Cat, a goth about my size, she beat me but (oh, I wish!) ... it was close. And lost to Jenny and her boyfriend. But I did win 5 of 6 against the club staff!
So. Technically I met three women, and Kristin seemed oddly oblique but she circumvented the regular playing rotation to play me three times. Maybe it's her standard operating procedure for anyone, though.
And... twenty-two... Oh, man.
I'm not sure I could do it.
( Dec 04 2006, 02:48:09 PM EST )

Sunday December 03, 2006
New Hit Count High! Ho!
Well, okay, it's mostly Chinese search engines indexing the site (I assume) but it's still kewl to people that don't know better. And I picked up two new search engines, one Swedish, the other Spanish.

Take me wandering through these streets
where bright lights and angels meet
Stone to stone they take me on
I'm walking till the break of dawn.
Off to play pool or something.
( Dec 03 2006, 10:45:50 PM EST )