And day 2 at this year’s PASS Summit starts with a sweet surprise from Aaron Bertrand ( b | t ), Kinder eggs. It promises to be a good day.
Today is Dr. DeWitt‘s keynote (did I mention that he’s a University of Michigan alum? Go Blue!), and here we go…
8:15 AM
Douglas McDowell, outgoing Vice-President of Finance starts with information about the PASS budget. Summit is the largest source of revenue for PASS, the Business Analytics Conference provided a nice contribution to the budget this year (over $100,000), and PASS has a one million dollars in reserve.
Last year PASS spent 7.6 million dollars on the SQL Server community, with the largest amount spent on the Summit. The second largest cost was the BA Conference. Per Douglas, Headquarters (HQ) is a critical investment for PASS. Right now the IT department has 3 individuals maintaining 520 websites. (And you thought you were a busy DBA!) One initiative for PASS this year, and going forward, is an international expansion, which took about 30% of the budget this past year. Overall, PASS is a very good financial place – and thanks to Douglas for all his work as a Board member.
8:31 AM
Bill Graziano takes the stage to thank Douglas for his time on the Board, and also Rob Farley who moves off the Board this year. Bill asked Rushabh to come on stage…Rushabh has been on the Board of Directors for 8 years. He’s held the positions of VP of Marketing, Executive VP of Finance, and President.
8:34 AM
Incoming PASS President Tom LaRock takes the stage, and starts with an omage to Justin Timberlake and Jimmy Fallon’s hashtag video. Awesome. Tom introduces the incoming PASS BoD:
- Thomas LaRock (President)
- Adam Jorgensen (Executive VP of Finance)
- Denise McInerney (VP of Marketing)
- Bill Graziano (Immediate Past President)
- Jen Stirrup (EMEA seat)
- Tim Ford (US seat)
- Amy Lewis (open seat)
Tom has announced the PASS BA Conference – it will be May 7-9, 2014 in CA. Next year’s Summit will be November 4-7, 2014 in Seattle, WA.
The WIT Lunch is today – and I’m on the panel so I hope to see you there!
8:41
Dr. DeWitt takes the stage, and the topic is Hekaton: Why, What, and How.
I was able to meet the co-author of this session, Rimma Nehme, before today’s keynote – she’s a Senior Researcher in his lab (which is apparently in an old Kroger grocery store building on the Madison campus).
DeWitt says that Hekaton is an OLTP rocket ship. The marketing team has renamed Hekaton to In-Memory OLTP, and DeWitt wants people to vote on Twitter. I am Team #Hekaton…it just sounds cooler (and it’s much easier to type).
Covering three things: What, the Why and How of Hekaton.
Hekaton is memory optimized, but durable. It’s a very high performance OLTP engine, but can be used for more than that. It’s fully integrated into SQL Server 2014, not a bolt-on. Architected for modern CPUs. (Slide deck will be posted later, I’ll post the link when I have it.)
Why Hekaton? Many OLTP databases now fit in memory. There are certain kinds of workloads that SQL Server can no longer meet. Historically, OLTP performance has been improved by better software (driven by TPC benchmarks), CPU performance doubling every 2 years, existing DBMS software maturing. DeWitt says we’ve done as much as we can with mainline products. CPUs are not getting faster – that well is drive.
Hekaton was picked because the goal was 100X improvements. Not quite there yet. Customers have seen 15-20X. If you’re burning 1 million instructions/sec and only yields 100 TPS. If you want to get to 10,000 TPS, but reduce number of instructions/sec to a value that’s just not possible.
Getting to 100X with the flood of new products available (e.g. Oracle-TimesTen, IBM-SolidDB, Volt-VoltDB), including Hekaton. Why a new engine? Why not just pin all the tables in memory? That won’t do the trick. Performance would still be limited by the use of:
- latches for shared data structures such as the buffer pool and lock table
- locking as the concurrency control mechanism
- interpretation of query plans
Implications of a shared buffer pool is a consideration. Assume the pool is empty. Query 1 comes along and needs page 7. Is page 7 in the pool? No, a frame is allocated, the query has to wait while the IO occurs. The IO completes and Query 1 can continue. Remember that the data structure is a shared structure. If Query 2 checks for page 7, the buffer manager will report where it is, but Query 2 will be blocked by the latch on page 7 until Query 1 is finished.
(sidebar: a transaction or query only holds 2 latches at a time)
There can be significant contention for latches on “hot” pages in the buffer pool. This can be a big performance hit. All “shared” data must be protected with latches.
The need for concurrency control… Query 1: A = A + 100. Database actions: Read A, update the value, write A. Query 2: A = A + 500. Database actions: Read A, update the value, write A. If A was originally 1000, after both queries, it will be 1600. This represents a serial schedule.
Two phase locking developed by Dr. Jim Gray – which is the standard. Two simple rules:
- Before access, query must acquire “appropriate” lock type from Lock Manager
- Once a query releases a lock, no further locks can be acquired
If these rules are followed, resulting schedule of action is equivalent to some serial (good) schedule. Dr. Gray got Turin Award for proof of this, one of two given to database scientists.
(sidebar: can I get a fellowship in Dr. DeWitt’s lab? Seriously…)
Still need a deadlock detection/resolution mechanism also needed (wanted to get rid of this for Hekaton…which is why it’s been a 5 year effort).
After a query is parsed and optimized, get an execution plan which is given to a query interpreter that walks the tree of operators and executes them in a particular order. When the database is on disk, the cost of interpreting tree is insignificant.
All these three things (concurrency, query optimization and latches) are why you can’t get to 100X with current implementations.
Currently in SQL Server, shared data structures use latches. Concurency control is done via locking, and query execution is via interpretation.
With Hekaton, shared data structures are lock-free. For concurrency control, versions with timestamps plus optimistic concurrency control is used. For query execution, compile into DLL that loads when queries are executed. This is what will get us to 100X.
SQL Server has 3 query engines – relational, apollo (column store), and Hekaton.
To use Hekaton, create a memory-optimized table. Two kinds of durability: schema-only and schema-and-data. (Every Hekaton table must have a primary key index – can be hash or range. Also have new b-tree in Hekaton, b-w tree, which gives high performance on range queries.) Some schema limitations for V1. Once you’ve created the table, then populate the table. Run a SELECT INTO statement, or do a BULK LOAD from a file. Just need to make sure it’s going to fit into memory. Then, use the table. Via standard ad-hoc T-SQL query interface (termed “interop”), up to 3X performance boost. Adapt, recompile and execute T-SQL SPs, get 5X-30X improvement.
Query optimization is the hardest part of relational databases, per Dr. DeWitt. Lock-free data structures truly are rocket science – they make query optimization look simple.
Lock-free data structures invented by Maurice Herlihy at Brown University – got him elected to the National Academy of Engineering (which is a big deal).
When you think lock-free, think about latch free – it allows multiple processes with threads to access the same data structure without blocking. Dr. DeWitt has a great slide showing performance differences with multiple threads for latch vs. lock-free. He mentioned that it was a tough slide to animate (and if you see it, you’ll understand why…I was actually wondering how he did it). With lock-free (aka latch-free?) – an update does not block reader threads – there is no performance hits. Every shared data structure in Hekaton was built around this functionality.
In Hekaton, now have a different concurrency control. It’s optimistic:
- Conflicts are assumed to be rare
- Transactions run to “completion” without blocking or setting locks
- Conflicts detected during a Validation phase
Second component of concurrency control is multiversion – updating a row creates a NEW version of the row. It works really well when you do this in memory. The third component is timestamps – every row version has a timestamp:
- Each row version has an associated time range
- transactions use their begin timestamp to select correct version
- timestamps also used to create a total order for transaction to obtain equivalent of serial order
This approach drastically reduces number of threads – dozens not hundreds.
Transaction phases in Hekaton:
- Read committed versions of rows
- Updates create new tentative versions
- Track read set, write set, and scan set
When the transaction is done, goes through second phase which is validation, this is where the concurrency control mechanism decides whether transaction can commit. Reaches commit point…
When transaction begins, current clock value is used as Begin_TS for transaction. At the start of the validation phase, transaction is given unique End_TS. It is used during validation to determine whether it is safe to commit the transaction. Begin_TS are NOT unique, End_TS are ALWAYS unique.
Hekaton tables have either hash or range index on unique key. Rows allocated space from SQL’s heap storage. Additional indices (hash or range) on other attributes.
Hekaton row format – all rows tagged with Begin_TS and End_TS. Latest version has infinity on the End_TS ( most recent version of the row). The Begin_TS is the End_TS of the inserting transaction. The End_TS is the logical time when the row was deleted and/or replaced with a new version.
Multiversioning Example – Have transaction to increase value by 10,000. A new version of the row is created. Pointers are used to link the rows together in memory. Don’t think about them being contiguous in memory. The transaction puts its signature (transaction ID) on each row (End_TS of initial row, Begin_TS of new row). When the transaction is later validated and committed, for all rows it updated/created/deleted, it will re-access each row with that transaction ID and replace it with the End_TS. NO LATCHES ARE USED! NO LOCKS ARE SET! NO BLOCKING OF ANY TRANSACTIONS! (I’m not not yelling, neither is Dr. DeWitt.) This is timestamps and versioning – used on rows AND transactions. Always create new versions of rows when doing updates. Per Dr. DeWitt, competitors are not going to have the same level of performance.
9:27 AM
Optimistic multi-version – this is the lock/latch-free mechanism in Hekaton (Dr. DeWitt says it so fast it’s hard to catch it 🙂
When is it safe to discard “old” versions of a row? When the begin timestamp of the oldest query in the system is ahead of the last End_TS…older rows no longer needed. Hekaton garbage collection is non-blocking, cooperative, incremental, parallel, and self-throttling. It has minimal impact on performance. It happens completely under the covers.
Steps:
- Updates create new version of each updated row
- Transaction use combination of time stamps and versions for concurrency control
- A transaction is allowed to read only versions of rows whose “valid” time overlaps the Begin_TS of the Xi.
- Transactoins essentially never block (WAIT, there’s a caveat here that Dr. DeWitt is glossing over…hm…)
Validation Phase
- Transaction obtains a unique End_TS
- Determine if transaction can be safely committed.
Validation steps depend on the isolation level of the transaction – “new” isolation levels for Hekaton.
Read Stability key idea: check that each version read is still “visible” at the end of the transaction using End_TS.
Phantom Avoidance requires a repeat each scan checking whether new versions have become visible since the transaction started. And if any scan returns additional rows, validation fails. This sounds expensive, but, keep in mind all rows are in memory. It is only performed for transaction running at a serializable level, and it is still a LOT cheaper than acquiring and releasing locks.
There is a also a post-processing phase with 3 sub-phases (which I couldn’t type fast enough…oy).
Checkpoints & Recovery – the data is not lost, have a normal checkpoint process, use logs to generate checkpoints (holds data during shutdown). Restart/recovery – starts by loading a known checkpoint and scans log to recover all work since then, fully integrated with HA (giving readable secondaries of memory optimized tables).
Standard method for query execution on a relational system is complicated…and slow-ish. Regular T-SQL access is Interop. Queries can access and update both Hekaton and disk-resident tables. Interpreted execution limits performance. When you put this all into a DLL, get much faster execution.
Native plan generation – query through parse tree, produces logical plan and then feeds it into optimizer which produces physical plan – but these are likely to be different for Hekaton (different algorithms and cost model). Take the physical plan and then translate it into C code (the ugliest C code you’ve ever seen, no function calls, per DeWitt), goes into C compiler and produces DLL – which is very slim, only what’s needed to run the SP. The DLL is then invoked and loaded – it’s stored in the catalog.
9:50
Dr. DeWitt just gave a bunch of quantitative data showing performance improvement in terms of transaction/sec, instructions/sec and CPU…I couldn’t copy it fast enough. 🙂
For more details, there is a session at 1:30 PM in Room 208 A-B there is a DBA Focused Session, tomorrow at 10:15 AM there is a dev-focused session.
Dr. DeWitt takes some time to thank his team. This is something I really appreciate about him. He brings Rimma Nehme on stage (who surprised him by showing up today) and gives nothing but praise…ending with a slide that has a picture of Rimma and a bubble: “Let the woman drive.” Love it!
You can download DeWitt’s deck and supporting files here.
Thank you Dr. DeWitt!