Mysql and a billion rows using innodb

I'm working on recreating and reloading our 'reverse index', which is used for site search. It indexes 19 million forum posts and a few other smaller collections of text, generating one (weighted) record per unique "word" per item. That is how one reaches a billion rows: an average of 50 unique "words" per "item" x 20 million "items" = 1 billion records.

I have some random timings for your amusement:

1. Creating it in the first place.

Surprisingly, this is the least dramatic part of the job. Reading all the plain text, stripping formatting, splitting it into words, creating all the location objects in another (mere 20 million row) data table, and then dumping the result as a 19gb text file takes "only" a couple of hours. It could be much faster if I ran the job 4 times in parallel (and pushed our mysql server to deliver the old posts more quickly).

2. Simply reading the entire file.

Since the file is 19gb and the disk array can do ~150mb/sec of sequential reads, even reading it once from end to end takes a couple of minutes (19gb ÷ 150mb/sec ≈ 130 seconds). Modifying it or removing items is really tricky, so it had better be right from the word go!

3. Loading this data into mysql.

The fastest way to load data into a mysql table is to use batch inserts that make large single transactions (megabytes each). The syntax is something like:

insert ignore into table (fields) values (....),(....),(...)

The 'ignore' is so a single duplicate row doesn't blow out the entire insert.

Inserting un-ordered into mysql/myisam like this, with perl creating the insert strings, can be done at high speed if the table does not have an index. Testing shows it running in less than 1 hour - way over 300,000 rows per second inserted. A sketch of such a loader follows.
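As a rough illustration, here is a minimal sketch of that kind of perl loader. The table and column names are copied from the insert statement visible in the innodb status dump further down; the DSN, credentials, and tab-separated input format are assumptions:

use strict;
use warnings;
use DBI;

# Hypothetical connection details - adjust for the real server.
my $dbh = DBI->connect('DBI:mysql:database=search;host=localhost',
                       'loader', 'secret', { RaiseError => 1 });

my $batch = 25_000;   # rows per statement; the tuned innodb run below used 25k
my @values;

sub flush_batch {
    return unless @values;
    $dbh->do('insert ignore into ReverseIndex '
           . '(ReverseIndex_Word, ReverseIndex_Score, Location_ID, y) values '
           . join(',', @values));
    @values = ();
}

while (my $line = <STDIN>) {
    chomp $line;
    # Assumes one tab-separated record per line: word, score, location id, year.
    my ($word, $score, $loc, $y) = split /\t/, $line;
    push @values, sprintf('(%s,%s,%d,%d)', $dbh->quote($word), $score, $loc, $y);
    flush_batch() if @values >= $batch;
}
flush_batch();
$dbh->disconnect;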

4. Sorting in mysql, space usage in myisam and InnoDB

Doing a full table sort in mysql is a killer. I don't actually know the runtime; I gave up after a couple of hours, so I can't even say whether it grows linearly with the size of the table or much worse. Interestingly, sorting and then selecting just the 'top' items with a limit is very quick, on the order of 10 minutes or so. This high speed for "limit N" sorts is also reported by the excellent mysql performance blog: how fast can you sort data with mysql?

Space is another story. As myisam, the table takes little more space than the text file, but it balloons to more than 40gb of data under innodb, or 65gb if it is fragmented - and innodb optimize/defragment is famously slow! I'm afraid of running it. So I should sort the data before inserting it: I want maximum clustering for these 7 years worth of old data, even if new data will slowly degrade the clustering.
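Why sorted input helps: innodb stores the rows themselves in primary-key order (the clustered index), so loading data pre-sorted by primary key appends pages at the end of the tree rather than splitting pages all over it. A hypothetical sketch of the table - the column names are from the insert statement shown later, but the types and the choice of primary key are my guesses:

create table ReverseIndex (
  ReverseIndex_Word  varchar(64)  not null,  -- assumed type
  ReverseIndex_Score float        not null,
  Location_ID        int unsigned not null,
  y                  smallint     not null,  -- looks like a year bucket
  primary key (ReverseIndex_Word, Location_ID)  -- assumed clustering key
) engine=innodb;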

5. Sorting the data using gnu unix sort

Sorting a billion rows is handled by gnu unix sort with sort/merge passes: it sorts chunks in memory, writes each chunk out as a temporary file, and merges the results, so the more memory you give it, the larger (and fewer) the temporary files. If you do:

sort -S 2g

(which instructs sort to use a 2gb memory buffer - note the capital S; lowercase -s means a stable sort) then sort creates 700mb temporary files at a rate of about one file every 2 minutes. The overall sort time is around one hour. If sort were multi-threaded it could be as many times faster as you have cores available.
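Putting it together, a hypothetical full invocation (assuming bash, a tab-separated dump with the word in column 1 and the location id in column 3, and a temp directory with room for ~19gb of runs):

LC_ALL=C sort -S 2g -T /data/sorttmp -t$'\t' -k1,1 -k3,3n \
    reverse_index.txt > reverse_index.sorted.txt

Forcing LC_ALL=C makes the comparisons plain byte-wise, which is considerably faster than locale-aware collation.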

6. Loading it all into mysql InnoDB

On the first attempt, with two indexes, innodb quickly slows down to approx 5,000 to 7,000 rows per second. At 6,000 rows a second, fully loading this table would take about 46 hours (1 billion ÷ 6,000 rows/sec)! And if the speed kept declining, it could stretch even longer. An unacceptable result, with a lot of uncertainty during execution (will it ever finish?).

Several changes together turned this around:

- removing a tiny secondary index (which nevertheless internally duplicates the PK),
- setting innodb_flush_log_at_trx_commit=0 instead of 2 (not recommended for production),
- setting innodb_flush_method=O_DIRECT,
- increasing innodb_buffer_pool_size from 6g to 10g,
- further increasing the insert batch size to 25k rows per insert, and
- setting a write lock on the table.

With all of these, the insertion rate starts at 100,000 per second but quickly slows to ~50,000 per second, where (hooray) it appears to be stable. If it can maintain this average it will be done in about 5.5 hours (1 billion ÷ 50,000 rows/sec) - roughly 10x faster. I suppose only one change is responsible for most of this improvement; I don't know which, however.
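For reference, the server-side settings above would look something like this in my.cnf (the values are straight from the text; the flush settings trade crash durability for load speed, hence not recommended for production):

[mysqld]
innodb_buffer_pool_size        = 10G        # up from 6G
innodb_flush_method            = O_DIRECT   # bypass the OS page cache
innodb_flush_log_at_trx_commit = 0          # was 2; no log flush per commit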

With 4 cores, approx 2 are occupied during this procedure, and disk i/o averages about 15mb/sec of writes. (The perl loader that reads the flat file and builds the insert transactions takes only 10% of one core.)

Since there is still spare disk subsystem bandwidth (only a small percentage of time is spent in disk-wait), plus roughly 50% idle cpu capacity, bulk inserts like this are clearly unable to take advantage of more than about two cores.

After "tuning" and during the bulk insert:

Per second averages calculated from the last 32 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 119377, signal count 118789
Mutex spin waits 0, rounds 24466870, OS waits 79493
RW-shared spins 30137, OS waits 14091; RW-excl spins 74491, OS waits 17320
------------
TRANSACTIONS
------------
Trx id counter 0 505351913
Purge done for trx's n:o < 0 505325357 undo n:o < 0 0
History list length 18
Total number of lock structs in row lock hash table 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 0, not started, process no 23767, OS thread id 1153526080
MySQL thread id 27375, query id 83255267 localhost root
show innodb status
---TRANSACTION 0 0, not started, process no 23767, OS thread id 1149110592
MySQL thread id 27384, query id 83255200 orange.dslreports.com 192.168.1.200 dslreports
---TRANSACTION 0 505351912, ACTIVE 0 sec, process no 23767, OS thread id 1150515520 inserting, thread declared inside InnoDB 163
mysql tables in use 3, locked 3
3 lock struct(s), heap size 368, 0 row lock(s), undo log entries 1841
MySQL thread id 25258, query id 83255261 orange.dslreports.com 192.168.1.200 dslreports update
insert ignore into ReverseIndex (ReverseIndex_Word, ReverseIndex_Score, Location_ID, y) values ('everything',0.5,38147190,2000),('everything',0.5,3814720,2006),('everything',0.5,38147203,2000),('everything',0.5,38147324,2000),('everything',0.5,38147344,2000),('everything',0.5,38147368,2000),('everything',0.5,38147432,2000),('everything',0.5,38147447,2000),('everything',0.5,3814753,2006),('everything',0.5,38147570,2000),('everything',0.5,38147600,2000),('everything',0.5,38147624,2000),('everything',0.5,38147672,2000),('everything',0.5,38147684,2000),('everything',0.5,38147760,2000),('everything
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 1
44374 OS file reads, 78902654 OS file writes, 383233 OS fsyncs
14.84 reads/s, 17280 avg bytes/read, 187.46 writes/s, 22.91 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 16778, free list len 8408, seg size 25187,
5103138 inserts, 2021625 merged recs, 38211 merges
Hash table size 12750011, used cells 7987928, node heap has 18521 buffer(s)
8803.85 hash searches/s, 5457.74 non-hash searches/s
---
LOG
---
Log sequence number 36 2142170188
Log flushed up to   36 2141719998
Last checkpoint at  36 1564120222
0 pending log writes, 0 pending chkp writes
77784071 log i/o's done, 3.16 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 6921245678; in additional pool allocated 10765312
Dictionary memory allocated 56840
Buffer pool size   393216
Free buffers       32
Database pages     374663
Modified db pages  270643
Pending reads 0
Pending writes: LRU 0, flush list 1, single page 0
Pages read 92038, created 2499921, written 13243868
15.66 reads/s, 32.69 creates/s, 815.47 writes/s
Buffer pool hit rate 1000 / 1000
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
Main thread process no. 23767, id 1140881728, state: sleeping
Number of rows inserted 400273553, updated 0, deleted 154959, read 131020195
6169.28 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
 

The server is mysql 5.1.21-beta-log running on x86_64 Suse 10 SP1, with 16gb of memory and Intel 5140 cpus (2 x dual core), connected to the afore-benchmarked MD3000 disk enclosure. The filesystem is XFS over two striped LUNs using LVM.



PhoenixDown (Premium Member)

Sphinx Search

I've been hearing really good things about Sphinx Search, could that be a consideration?