I haven’t posted here lately… I hope many of you still remember
my name…
I am working in my day job on a very interesting and challenging problem
(yes, mostly in Ruby).
Since I have known many Rubyists who were creative and imaginative, I
thought I would seek opinions here.
If you are familiar with the term “cross-device matching,” that is what
this is all about.
If you’re not familiar – here is a rough synopsis of the classic
problem.
Ad networks (and such) use cookies and pixels and whatever techniques
they can in order to better target their advertising.
There are strict privacy constraints, of course. No one is supposed to
store information like, “This is Dr. Chandra from Urbana, Illinois” – but
it’s perfectly OK to store information like “this is user 123, who
searched for a new car today, and is the same guy who bought a toaster
last week.”
The big problem is that “user 123” on a laptop may be user 456 on a
tablet and user 789 on a phone. Being able to match or associate these
users with a good level of probability is sort of a Holy Grail in the
industry.
Of course, if you’re Facebook or Google or something, you can do
“deterministic” matching with a very high degree of certainty. Otherwise,
you have to take the “probabilistic” approach, as I am here.
So I am making some progress here, but I am really reaching out for new
and interesting ideas.
In essence, I am examining a data stream of millions of anonymized users
and trying to group them together based on pure data analysis. We have
quite a bit of information including URL clicked, IP address, user agent,
time of day, DMA, device type, and so on.
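To make the idea concrete, here is a minimal Ruby sketch of scoring two
anonymized users by how much their observed signals overlap. Everything in
it is an assumption for illustration: the Profile struct, the choice of
signals (IPs, DMA, hour of day), the weights, and the sample values are all
made up, and Jaccard overlap is just one simple similarity measure, not the
method actually in use.

```ruby
require "set"

# Hypothetical per-user profile: the set of values seen for each signal.
Profile = Struct.new(:ips, :dmas, :hours, keyword_init: true)

# Illustrative weights only; not tuned against any real data.
WEIGHTS = { ips: 0.6, dmas: 0.1, hours: 0.3 }.freeze

# Jaccard similarity of two sets: |A & B| / |A | B|, or 0.0 if both are empty.
def jaccard(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Weighted sum of per-signal Jaccard similarities, in the range 0.0..1.0.
def match_score(p1, p2)
  WEIGHTS.sum { |signal, weight| weight * jaccard(p1[signal], p2[signal]) }
end

# Made-up sample data (documentation IP ranges, arbitrary DMA numbers).
laptop = Profile.new(ips: Set["192.0.2.1", "198.51.100.7"],
                     dmas: Set[1], hours: Set[8, 20, 21])
tablet = Profile.new(ips: Set["192.0.2.1"],
                     dmas: Set[1], hours: Set[20, 21, 22])
other  = Profile.new(ips: Set["203.0.113.9"],
                     dmas: Set[2], hours: Set[3])

puts match_score(laptop, tablet)
puts match_score(laptop, other)
```

User agent and device type are left out of the score on purpose here, since
the same person on two devices would be expected to differ on exactly those
signals.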
For an app-related event, we can get the Apple IDFA or the Android ID. We
cannot find those IDs for a browser-related event, even on the phone. We
can access (our) cookies if there are any (browser but not app), etc. etc.
If I had near-infinite storage and processing power, I would build a
matrix of several quadrillion entries and update it over time, finding
essentially a probability vector for each user with respect to every other
user. Then I could apply some heuristics and weight them appropriately.
However, to do this in “reasonable” time with limited RAM and disk is
another problem entirely.
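One common way to avoid the full user-by-user matrix is “blocking”: only
keep an entry for pairs of users who share at least one signal, so the
matrix stays sparse. The Ruby sketch below is a rough illustration under
stated assumptions, not a description of the actual system. The event
layout ([user_id, ip] pairs), the sample data, and the MAX_USERS_PER_IP
cutoff are all hypothetical, and shared IP is used only because it is one
of the signals listed above.

```ruby
require "set"

# Hypothetical events: [user_id, ip] pairs from the anonymized stream.
EVENTS = [
  [123, "192.0.2.1"], [456, "192.0.2.1"], [789, "192.0.2.1"],
  [123, "198.51.100.7"], [456, "198.51.100.7"],
  [555, "203.0.113.9"]
].freeze

# Skip IPs shared by too many users (e.g. a large NAT); illustrative cutoff.
MAX_USERS_PER_IP = 50

# Returns { [user_a, user_b] => number of distinct IPs they share },
# storing only pairs that co-occur on at least one IP.
def candidate_pairs(events, max_users: MAX_USERS_PER_IP)
  # Inverted index: IP address -> set of users seen on it.
  users_by_ip = Hash.new { |hash, ip| hash[ip] = Set.new }
  events.each { |user, ip| users_by_ip[ip] << user }

  counts = Hash.new(0)
  users_by_ip.each_value do |users|
    next if users.size < 2 || users.size > max_users
    # Sort so each pair has one canonical key, e.g. [123, 456].
    users.to_a.sort.combination(2) { |pair| counts[pair] += 1 }
  end
  counts
end

pairs = candidate_pairs(EVENTS)
pairs.each { |(a, b), shared| puts "#{a} <-> #{b}: #{shared} shared IP(s)" }
```

Only the candidate pairs that survive this step would then need the more
expensive weighted scoring, and the counts can be updated incrementally as
new events arrive. Storage grows with the number of co-occurring pairs
rather than with the square of the number of users.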
I’m having acceptable success so far, but I am definitely interested in
hearing others’ thoughts on this.
Thanks,
Hal F.