Stack Overflow and various forums are rife with questions about how to
efficiently update or insert ActiveRecords into a table. (In fact, one
of my earliest questions to this forum was on this very topic.)
My recent work regularly calls for an efficient “update or insert”
method for tens of thousands of records. After a fair amount of
googling, I don’t see a generalized solution, so I’ve decided to design
and implement the function for real. In this note, I’m proposing a
design specifically for critique and feedback.
(First question: are there other forums where I should be posing this
question?)
A typical use case might look like this:
=====
require 'csv'

class MyModel < ActiveRecord::Base
  with_update_or_insert

  def self.bulk_load(csv_file)
    # generate a lot of records
    records = CSV.table(csv_file).map { |row| MyModel.new(row.to_hash) }
    # :name and :group form a compound key for determining uniqueness
    update_or_insert(records, :key => [:name, :group])
  end
end
=====
The method itself is defined as follows:
=====
def self.update_or_insert(array_of_records, options = {})
  …
end
=====
The options:

:key names the column or columns that determine the uniqueness of each
record. It may be a symbol naming a single column or an array of symbols
naming a compound key. If omitted or blank, records are inserted without
checking whether they are already present in the db.

:on_duplicate determines what happens when a duplicate is already
present in the db. Recognized values are :update (the default), :ignore,
and :error. For :update, the existing record is updated. For :ignore,
the existing record is left untouched. For :error, the method raises an
error.

:batch_size sets the maximum number of records to be processed in one
transaction. If omitted, it defaults to a value specific to the current
database adapter. Regardless of the value given, the adapter may impose
a smaller limit.
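Putting the options together, the shell of the method might look
something like this (a sketch only; upsert_batch and default_batch_size
are placeholders for the adapter-specific pieces):
=====
def self.update_or_insert(array_of_records, options = {})
  keys         = Array(options[:key])
  on_duplicate = options.fetch(:on_duplicate, :update)
  batch_size   = options[:batch_size] || default_batch_size

  array_of_records.each_slice(batch_size) do |batch|
    transaction do
      # the adapter-specific UPDATE/INSERT work happens here
      upsert_batch(batch, keys, on_duplicate)
    end
  end
end
=====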
Comments and open issues:
C1: I’m assuming that some parts of update_or_insert will dispatch to
database-specific code. I’ve got a pretty good sense of what
PostgreSQL, MySQL, and SQLite need to implement this. I have no clue
what the other database adapters (e.g. MongoDB) will require.
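For the dispatch itself, I imagine keying off connection.adapter_name,
roughly like this (the per-adapter module names are hypothetical):
=====
def self.upsert_backend
  case connection.adapter_name
  when /postgresql/i then UpdateOrInsert::Postgres
  when /mysql/i      then UpdateOrInsert::Mysql
  when /sqlite/i     then UpdateOrInsert::Sqlite
  else
    raise NotImplementedError,
          "update_or_insert does not support #{connection.adapter_name}"
  end
end
=====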
C2: Type conversion might be problematic. Different backends require
different type-specific quoting. (But ActiveRecord appears to know how
to do this – I should be able to use its code.)
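For example, something along these lines leans on the connection’s own
quoting rules rather than reinventing them (a sketch; it assumes the
attribute values are already type-cast):
=====
# quote one record's values using the adapter's own quoting rules
def self.quoted_row(record)
  record.attributes.values.map { |value| connection.quote(value) }
end
=====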
C3: I’m assuming that array_of_records is an array of unsaved
ActiveRecords. But perhaps it should accept hash-like objects to make
it independent of ActiveRecord. If that’s the case, we lose the
per-column type information (see C2 above).
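If hash-like objects were accepted, a small normalization step could
smooth over the difference (attributes_for is a hypothetical helper):
=====
# accept either an ActiveRecord instance or a hash-like object
def self.attributes_for(record)
  record.respond_to?(:attributes) ? record.attributes : record.to_hash
end
=====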
C4: I’m assuming that for efficiency, ActiveRecord validations will NOT
be called, nor will the in-memory ActiveRecords be updated. In effect,
the ActiveRecords are simply in-memory structures to hold the data.
C5: Duplicate records in array_of_records may be problematic for some
dbs. The most general ANSI-compliant implementations of UPSERT
essentially make two db calls: an UPDATE for records that are already
present and an INSERT for records that are new. But if the
array_of_records itself contains duplicate records, the INSERT operation
will attempt to insert them twice. If there is a unique index on the
table, the db will raise an error, otherwise you’ll end up with
duplicate copies, which could be unexpected. Can you tolerate this as a
restriction? Or should update_or_insert do an extra operation on the
array_of_records to detect duplicates?
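If that extra pass turns out to be worthwhile, it could be as simple as
this (a sketch; it keeps the last occurrence of each duplicated key, on
the theory that last write wins):
=====
def self.dedupe_by_key(records, keys)
  return records if keys.empty?
  # r[k] reads the attribute whether r is a model or a hash
  records.reverse.uniq { |r| keys.map { |k| r[k] } }.reverse
end
=====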
C6: How should this be packaged? (Hint: I’ve never created a gem
before.) I’d like to start with support for one DB (likely PostgreSQL),
but provide a framework that makes it easy to support other adapters.
C7: What value, if any, should update_or_insert return?
Have your eyes glazed over yet? Comments are warmly welcomed…
- ff