A large table is a pain for many reasons as long as it exists in a system. And as if that's not enough, getting rid of it is also a difficult task. In this post, we will understand why this operation is painful and what we can do about it. It is like asking the table, "Tell me what happened, and I will ease the eviction."
So what happened? When a table is dropped (or truncated), InnoDB has to scan the pages throughout the buffer pool and remove all those belonging to that table. For a large buffer pool, crawling through the buffer pool pages and evicting them is slow. When we say "scan the buffer pool", it mainly looks at the LRU list, the flush list (dirty pages), and AHI entries.
LRU: Buffer pool pages are stored in a linked list, ordered by usage. As data reaches the end of the list, it is evicted to make space for new data. When room is needed to add a new page to the buffer pool, the least recently used page is evicted and the new page is added to the middle of the list.
AHI: This is a hash index containing index pages that are frequently accessed. InnoDB has a mechanism that monitors index searches. If InnoDB notices that queries could benefit from building a hash index, it does so automatically.
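Since AHI entries for the table also have to be purged, one hedged mitigation on affected versions is to check whether AHI is enabled and, if your workload tolerates it, disable it before the drop. This is a sketch; weigh the read-performance impact on hash-lookup-heavy queries first:

```sql
-- Check whether the adaptive hash index is enabled
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';

-- Disabling AHI empties it, so the DROP TABLE has no AHI cleanup to do;
-- reads that benefit from hash lookups may slow down while it is off.
SET GLOBAL innodb_adaptive_hash_index = OFF;
```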
The issue we're going to discuss has been resolved in MySQL 8.0.23 and onwards, though it is still relevant to the many MySQL deployments on lower versions. From MySQL 8.0.23, the drop table process has been improved and MySQL no longer waits for the eviction: InnoDB implements a lazy eviction of the respective pages from the buffer pool.
Check a sample SHOW ENGINE INNODB STATUS output below:
```
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 3576902451200
Dictionary memory allocated 5414868
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 419777277888 (54392761408 + 365384516480)
    Page hash           53118808 (buffer pool 0 only)
    Dictionary cache    13603605220 (13598190352 + 5414868)
    File system         1374872 (812272 + 562600)
    Lock system         9470361768 (9461541368 + 8820400)
    Recovery system     0 (0 + 0)
Buffer pool size   209689600
Buffer pool size, bytes 3435554406400
Free buffers       16263
Database pages     187372028
Old database pages 69165341
Modified db pages  2323769
Pending reads      0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 1831432982, not young 28737547535
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1655886985, created 633099435, written 15361171213
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 1 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 187372028, unzip_LRU len: 0
I/O sum:cur, unzip sum:cur
```
Note the "Database pages" and "LRU len" values, representing the number of pages that need to be scanned.
Next, the ball is passed to the underlying OS to delete the file. Depending on how large the file is and how busy the system is, this IO operation is going to take time.

Broadly speaking, the cost of a drop table operation drills down to the following two tasks:
- Traverse through the buffer pool and evict the pages found
- Delete the file from the OS disk
Task one is MySQL-specific, while task two is OS (disk) dependent. We will look for improvements in these two aspects of the process.
1. Traverse through the buffer pool and evict the pages found
Let's see if we can find ways to improve the performance of this task.

Idea 1: Shrink the buffer pool before the drop

The steps are:

Save the current buffer pool state > reduce the buffer pool size > drop the table (scanning a small list) > reset the buffer pool size > restore the buffer pool contents
```sql
SET GLOBAL innodb_buffer_pool_dump_now=ON;
SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';
SET GLOBAL innodb_buffer_pool_size=134217728;     -- 128M
DROP TABLE LARGE_TABLE;
SET GLOBAL innodb_buffer_pool_size=1099511627776; -- 1T
SET GLOBAL innodb_buffer_pool_load_now=ON;
SHOW STATUS LIKE 'Innodb_buffer_pool_load_status';
```
Sounds really easy, but it doesn't work; in fact, it makes the situation worse, for the following reasons:

- Buffer pool resize is a blocking operation; it still has to scan the list, defragment, and resize the buffer pool.
- Running production traffic on a reduced buffer pool is a performance penalty.
- The same blocking operation repeats when the buffer pool is extended again.
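If you do experiment with a resize, you can watch the blocking work it performs through a status variable:

```sql
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';
```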
Idea 2: Stop using the table (no selects, no writes on the table that needs to be removed)
- It is possible to stop table access by revoking the user's grants on the table or by controlling reads/writes at the application level.
- The buffer pool will then slowly evict the table's pages, at a rate depending on how busy MySQL is.
- The eviction work during the drop is eased: as InnoDB traverses the linked list, it finds few or no pages belonging to the table.
- That said, it will still scan the whole (large) linked list.
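Cutting off access at the database level might look like the following (the user, host, schema, and table names are illustrative, and this assumes the privileges were granted at the table level):

```sql
-- stop reads/writes so the pages go cold and age out of the LRU
REVOKE SELECT, INSERT, UPDATE, DELETE ON mydb.large_table FROM 'app_user'@'%';
```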
In summary, at the MySQL level you cannot avoid traversing the buffer pool's linked list, but you can save some time on the eviction process by waiting and letting the buffer pool evict the pages eventually.
2. Delete the file from OS disk
Now, this is the task that depends on how busy and fast your disks are. The slower the file deletion operation, the longer it will take for MySQL to return "delete successful". So what can we do to improve this deletion operation? Let's see.
Idea 1: The smaller the file is on disk, the faster it will be to remove
We can make the data smaller on disk by purging rows from the table, which is something we can do using pt-archiver. That said, the table will become fragmented, and its size on disk will remain the same until an OPTIMIZE TABLE or a dummy alter (ALTER TABLE … ENGINE=InnoDB) is performed.
Use pt-archiver to slowly DELETE the table records > bring down the table size > give the LRU algorithm some time to evict those pages from the buffer pool > DROP

This approach involves multiple activities and needs more time: the archive process followed by a cool-off period could take long, depending on the table size. Note that for larger tables, pt-archiver is going to take a long time, and we will still have a fragmented table occupying the same space on disk until we run a dummy alter.
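For reference, a purge with pt-archiver might be invoked roughly like this (the connection details, WHERE condition, and batch sizes are placeholders; rehearse on a replica first):

```shell
pt-archiver \
  --source h=localhost,D=mydb,t=large_table \
  --purge \
  --where "created_at < NOW() - INTERVAL 90 DAY" \
  --limit 1000 --commit-each \
  --sleep 1 --statistics
```

The `--limit 1000 --commit-each` pair keeps transactions small, and `--sleep` paces the deletes so the purge does not overwhelm a busy server.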
Idea 2: Don’t delete the underlying tablespace file (ibd)
What if we skip the file delete operation entirely? MySQL would return from the command as soon as the buffer pool scan is done! But can we?

Yes, partially. We can use a hard link to "fool MySQL" into believing that the data file for the table was removed even though it is still there!
(You can skip this section if you’re already aware of this concept)
A hard link is a directory entry that associates a name with a file. Every filename that you see is a hard link. Let’s explore a bit:
```
MacBook-Pro:~ kedarvaijanapurkar$ touch percona.rocks
MacBook-Pro:~ kedarvaijanapurkar$ echo "Test" > percona.rocks
MacBook-Pro:~ kedarvaijanapurkar$ ls -li percona.rocks
20594511 -rw-r--r-- 1 kedarvaijanapurkar staff 5 Aug 26 16:22 percona.rocks
```
Here "percona.rocks" is a hard link. Note that the first value, "20594511", is the inode number that the "percona.rocks" filename refers to. Now let's create another hard link.
```
MacBook-Pro:~ kedarvaijanapurkar$ ln percona.rocks new_hardlink
MacBook-Pro:~ kedarvaijanapurkar$ ls -li *hardlink*
20594511 -rw-r--r-- 2 kedarvaijanapurkar staff 5 Aug 26 16:22 percona.rocks
20594511 -rw-r--r-- 2 kedarvaijanapurkar staff 5 Aug 26 16:22 new_hardlink
MacBook-Pro:~ kedarvaijanapurkar$
```
Note that we see two different file names, but the inode is still the same, "20594511", pointing to the same physical location on disk.

Without going into more OS details, understand that file names are hard links to the actual data on the disk, and thus each file must have at least one hard link. A file will not be deleted from the disk as long as at least one hard link to it remains.

Here we have two hard links, percona.rocks and new_hardlink, with the same content (as they point to the same physical data).
```
MacBook-Pro:~ kedarvaijanapurkar$ cat percona.rocks
Percona is great and hardlink is a fun concept.
MacBook-Pro:~ kedarvaijanapurkar$ cat new_hardlink
Percona is great and hardlink is a fun concept.
MacBook-Pro:~ kedarvaijanapurkar$
```
Using this knowledge, if we create a hard link to the data file (ibd), the DROP TABLE will not actually delete the file from the OS but only remove the hard link, which is very light on the disk IO.
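The payoff is visible in a tiny experiment: removing one of two links is just a metadata update, and the data survives via the remaining link. The file names below are made up, and `stat -c` is the GNU/Linux form of the command:

```shell
# create a file with two hard links to the same inode
echo "buffer pool data" > demo.ibd
ln demo.ibd demo.ibd.hardlink

# removing one name is cheap; the inode and its data remain reachable
rm demo.ibd
cat demo.ibd.hardlink          # still prints: buffer pool data
stat -c %h demo.ibd.hardlink   # link count is back to 1

rm demo.ibd.hardlink           # cleanup
```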
We can use the following steps to speed up the drop table.
Steps for dropping a large table in MySQL
Let’s say we have to drop large_table.
- Create a hard link on the server (elevated privileges required)

```shell
ln large_table.ibd large_table.ibd.hardlink
```
- Issue drop table

```sql
set SQL_LOG_BIN=0;
drop table if exists large_table;
```
Because of the hard link we just created, dropping the table will only remove the large_table.ibd link. The original data still exists, but MySQL is unaware of this leftover file.
- Truncate the file slowly until it is small enough to remove quickly

```shell
truncate -s -1GB large_table.ibd.hardlink
# Repeat, increasing the chunk size from 1G to 10G to 100G depending on
# how fast the file size comes down.
```
- Execute the rm command
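Put together, the truncate-then-remove steps can be scripted as a small loop. This is a sketch: the file name, step size, floor, and pacing are illustrative, and the sparse 3G file created at the top stands in for the real leftover link so the sketch is self-contained.

```shell
# Stand-in for the leftover hard link; in production the file already
# exists, so skip this line.
truncate -s 3G large_table.ibd.hardlink

FILE="large_table.ibd.hardlink"
STEP="1G"                        # how much to chop off per iteration
FLOOR=$((1024 * 1024 * 1024))    # stop shrinking at ~1GB, then remove

# shrink the file in place, pacing the IO between steps
while [ "$(stat -c %s "$FILE")" -gt "$FLOOR" ]; do
    truncate -s -"$STEP" "$FILE"
    sleep 1                      # tune for how busy your disks are
done

rm -f "$FILE"                    # the final unlink is now small and fast
```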
It is better to ensure that the table is no longer queried beforehand; that way the "eviction" work is minimized and MySQL just traverses the linked list.
Considering you have a replicated environment, I'd recommend performing this on a (passive) replica server first, followed by production, to get an idea of the impact beforehand. You may also combine this task with a failover: perform on the replica > failover > perform on the old primary.
Risk involved: This approach to speeding up a drop table on older MySQL versions involves handling database files at the OS level, and a human error (as always) could be catastrophic.
Let me know how you go about dropping those massive tables and if this approach interests you.