Hello!
We are thinking of different ways to handle hot data in our system. We
get a lot of information that is only relevant for limited periods of
time. So, for example, we have data that we query on the basis of:

- counts over the last 30 minutes
- data collected in the last 30 minutes
- items collected today
- items collected within the past week

…and so on.
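Outside the database, each of those windows is just "now minus an interval". A minimal pure-Ruby sketch of turning a named window into a SQL predicate on our recorded_on column (the window names and hash below are illustrative, not our actual schema):

```ruby
# Named windows matching the list above; keys and durations are
# illustrative. "today" is simplified to a rolling 24 hours here.
WINDOWS = {
  last_30_minutes: 30 * 60,
  last_24_hours:   24 * 60 * 60,
  last_week:       7 * 24 * 60 * 60,
}

# Build a WHERE fragment against recorded_on for a named window.
def window_predicate(name, now = Time.now)
  cutoff = (now - WINDOWS.fetch(name)).utc.strftime("%Y-%m-%d %H:%M:%S")
  "recorded_on >= '#{cutoff}'"
end
```

In Rails these would naturally become scopes instead, e.g. `where("recorded_on >= ?", 30.minutes.ago)`.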
We are looking at different strategies to keep this data updated and to
manage scaling the database. We are considering horizontal sharding
[splitting data into multiple child tables that inherit from a master
table and carry CHECK constraints, so the planner can use constraint
exclusion], and we are also considering tables that hold the data at
different granularities:

- All the data
- Data for the last 24 hours
- Data for the last 2 hours
That way, when we want something recent, we would just query the most
recent table, but on the few occasions that we need more history, we
would go to the larger, sharded tables.
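If we do keep tiered "recent" tables, reads need a small routing step: pick the narrowest table that fully covers the requested window. A sketch of that routing (the table names besides data_store_v2 are made up for illustration):

```ruby
# Tables ordered narrowest first; each entry is the maximum window
# (in seconds) the table can answer. Names other than data_store_v2
# are hypothetical.
RECENT_TABLES = [
  [2 * 3600,        "data_store_recent_2h"],
  [24 * 3600,       "data_store_recent_24h"],
  [Float::INFINITY, "data_store_v2"],
].freeze

# Return the smallest table that fully covers a window of `seconds`.
def table_for(seconds)
  RECENT_TABLES.find { |max, _| seconds <= max }.last
end
```

So a 30-minute count hits the 2-hour table, while a 7-day query falls through to the full, sharded store.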
Just for reference, we are doing something like this:
select count(*), count(distinct group_id)
from (select group_id, recorded_on
      from data_store_v2
      order by id desc
      limit 900000) td
where recorded_on >= (now() - '24 hours'::interval);
We are getting 800,000 data items a day right now, and the above query
takes around 14 seconds on a single, unsharded table with around
268 million records. The table gets slightly slower every week, since we
add close to 6 million records weekly.
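Those two growth figures are consistent with each other, which is a useful sanity check when projecting how fast the table ages:

```ruby
per_day  = 800_000
per_week = per_day * 7                      # 5_600_000 -- "close to 6 million"
weeks_at_current_rate = 268_000_000 / per_week  # about 47 weeks of data at today's rate
```

(The table is older than 47 weeks in practice, since earlier weeks accumulated at a lower rate.)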
I’ve read this:
and am looking at ways to manage everything under Rails, if possible.
So, the questions (and thanks for reading this far) are:

1. What is a good way to do this while still working within
   ActiveRecord?
2. This is a capability that we'd like to attach to any model that
   might need it. What would be a good way to approach building a gem
   for it?
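For the second question, one shape this could take (a sketch only; the name HotData and its class-level DSL are invented here) is a mixin any model opts into. In Rails the natural vehicle is an ActiveSupport::Concern that installs the scopes and table routing; the bare attachment mechanics in plain Ruby look like:

```ruby
# Hypothetical mixin; HotData and hot_data_window are invented names
# for illustration, not an existing gem's API.
module HotData
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declare how much history this model keeps "hot".
    def hot_data_window(seconds)
      @hot_data_window = seconds
    end

    def hot_data_window_seconds
      @hot_data_window
    end
  end
end

class DataStore
  include HotData
  hot_data_window 24 * 3600   # keep one day hot
end
```

Packaged as a gem, the concern plus a generator for the partition/rollup migrations would be the two main pieces.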
I’m sure there will be a few more questions as we progress on this.
Thanks,
Mohit.