pnathan on software: notes on the software world (by pnathan)

Optimizations (2015-05-31)

An amusing anecdote from the Clink development trenches.<br /><br />Intersection is, roughly, quadratic: a naive pairwise intersection is O(n * m), where n and m are the lengths of the input vectors, and its result is at most min(n, m) elements long.<br /><br />This turns out to be hugely important when implementing JOIN in a relational database, because you wind up intersecting n ways for n tables.<br /><br />Some empirical analysis of a 3-way intersection:<br /><br />Intersect from largest to smallest:<br /><br /><pre>
CLINK> (time (ref *foo* :row
             (multi-column-query *foo*
              (list
               `(1 ,#'(lambda (s) (find #\1 s)))
               `(0 ,(lambda (x) (> x 300)))
               `(3 ,(lambda (x) (> x 900)))))))
Evaluation took:
  4.015 seconds of real time
  4.149000 seconds of total run time (4.149000 user, 0.000000 system)
  103.34% CPU
  8,676,250,063 processor cycles
  1,252,768 bytes consed
</pre><br /><br /><br />And from smallest to largest:<br /><br /><pre>
CLINK> (time (ref *foo* :row
             (multi-column-query *foo*
              (list
               `(1 ,#'(lambda (s) (find #\1 s)))
               `(0 ,(lambda (x) (> x 300)))
               `(3 ,(lambda (x) (> x 900)))))))
Evaluation took:
  0.766 seconds of real time
  0.879000 seconds of total run time (0.879000 user, 0.000000 system)
  114.75% CPU
  1,655,372,433 processor cycles
  1,074,592 bytes consed
</pre><br /><br /><br />We can clearly see that real time and cycle counts each dropped by roughly 5x (4.015 s to 0.766 s; 8.68 billion to 1.66 billion cycles), while allocations dropped by about 14% (1,252,768 to 1,074,592 bytes consed). This makes sense: the result of an intersection is at most as large as its smaller input, so intersecting smallest-first keeps every intermediate result small.<br /><br />One of the most fun things about the Clink project is that it directly applies concepts from undergraduate computer science courses.<br /><br />

cl-linq information (2013-12-26)

The Lisp REPL is a particularly awesome tool, especially when paired with SLIME or another customized evaluation system for live programming.<br /><br />This insight has led to R, IPython, Macsyma, MySQL, Postgres, and other systems having their own REPLs.<br /><br />However, a serious problem with the Common Lisp REPL is the inability to sling large amounts of data around easily, perform queries on it, and so on. The system has no built-in support for holding multimillion-row datasets, querying them, and feeding the results into particular functions. Lists are too slow; vectors are too primitive; hash tables are too restrictive. Further, queries start looking really hairy as lambdas, reduces, and mapcars chain together; SQL has shown a clearly superior succinctness of syntax. Worse, these queries are ridiculously non-optimized out of the gate. I've had to deal with this situation in multiple industry positions, and it is *not* acceptable for getting work done. It is too slow, too incoherent, and too inelegant.<br /><br />Hence, I am working on a solution. It started out as CL-LINQ, or Common Lisp Language INtegrated Queries, a derivative of the C# approach. The initial cut can be found on my GitHub for interested parties. It suffers from basic design flaws: 100% in-memory storage and lists as the internal representation.
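<br /><br />As a concrete illustration of the "hairy chains of lambdas and mapcars" problem, here is a made-up query against <code>*rows*</code>, a hypothetical list of row vectors (none of these names come from CL-LINQ itself):<br /><br /><pre>
;; Hypothetical example: select column 2 of every row where
;; column 0 > 300 and column 3 > 900, from *rows*, a list of
;; simple vectors.
(mapcar (lambda (row) (aref row 2))
        (remove-if-not (lambda (row)
                         (and (> (aref row 0) 300)
                              (> (aref row 3) 900)))
                       *rows*))

;; The SQL equivalent is far more succinct:
;;   SELECT col2 FROM rows WHERE col0 > 300 AND col3 > 900;
</pre><br /><br />The Lisp version also fixes an evaluation strategy (filter the whole list, then project), whereas a query engine is free to reorder and index.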
<br /><br />I am proud to note that I've been able to begin work on an entirely improved and redesigned system, built from several key pieces. The first and most important is the data storage system, which is what I've been working on recently.<br /><br />Data is stored in data frames; each data frame carries information about its headers. The data itself lives in a 2D Common Lisp array, ensuring near-constant access time to a known cell. Data frames are loaded via pages; each page contains a reference to the data table, as well as a reference to the backing store, and records information about the data in its data frame. Each page has a 1:1 mapping to a data frame. Pages are routed through a caching layer with a configurable caching strategy, so that only data of interest is held in memory at a given point in time. Finally, a table contains a number of pages, along with methods to access the headers, particular rows in the table, and so on.<br /><br />After this system is done (it is perhaps 80% of the way there now), the index system can be built. By building the indexes separately from the raw storage system, I can tune both for optimal behavior: indexes can be built as a tree over the data, while the data itself can be stored in an efficiently accessible form.<br /><br />Finally, as the motivating factor, the query engine will be designed with both prior systems in mind. The query engine's main complexity will be its interaction with the index system, to ensure high-speed JOINs. A carefully developed query macro system could actually precompile desired queries for optimal layout and speed, for instance.<br /><br />Features that will be considered for this project include: integration with Postgres as the storage engine; compiled optimization of queries; and a pluggable conversion system for arbitrary objects and their analysis.
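<br /><br />The storage layers described above might be sketched roughly like this; every name here is illustrative only, not Clink's actual API:<br /><br /><pre>
;; Rough sketch of the storage layers; all names are mine,
;; not the project's.
(defstruct data-frame
  headers   ; vector of column names
  (data nil :type (or null (array t (* *)))))  ; 2D array of cells

(defstruct page
  table-ref       ; reference back to the owning table
  frame           ; the data frame this page backs (1:1)
  backing-store   ; where the frame is persisted
  stats)          ; summary information about the frame's data

(defstruct table
  name
  pages)          ; pages are fetched through a caching layer

(defun cell (frame row col)
  "Near-constant-time access into a loaded data frame."
  (aref (data-frame-data frame) row col))
</pre><br /><br />The key design point is that <code>cell</code> is a plain <code>aref</code> into a rank-2 array, while paging and caching decide which frames are resident at all.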
<br /><br />At the completion of this project, a library will be available for loading large amounts of data into data tables, running queries and computations over them, and then storing the transformed data out to external sources.
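<br /><br />To make the precompiled-query idea concrete, here is one entirely hypothetical shape such a query macro could take; neither <code>query</code> nor <code>table-rows</code> exists in CL-LINQ or Clink today:<br /><br /><pre>
;; Entirely hypothetical sketch of a precompiling query macro.
(defmacro query (table &key where select)
  "Expand at compile time into a direct scan over TABLE's rows."
  `(loop for row across (table-rows ,table)
         when (funcall ,where row)
           collect (funcall ,select row)))

;; Usage: the filter and projection are fixed at macroexpansion
;; time, so the compiler sees a plain loop it can optimize.
(query *foo*
       :where  (lambda (row) (> (aref row 0) 300))
       :select (lambda (row) (aref row 1)))
</pre><br /><br />A real implementation would instead consult the index system during expansion to pick a join and scan order, rather than always scanning linearly.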