cl-linq information
Dec. 25th, 2013 06:56 pm
The Lisp REPL is a particularly awesome tool, especially when paired with SLIME or another customized evaluation system for live programming.
This insight has led R, IPython, Macsyma, MySQL, Postgres, and other systems to ship their own REPLs.
However, a serious problem with the Common Lisp REPL is the inability to sling large amounts of data around easily, perform queries on it, and so on. The system simply isn't built to hold millions of rows of data, run queries over them, and feed the results into particular functions. Lists are too slow; vectors are too primitive; hash tables are too restrictive. Further, queries start looking really hairy as lambdas, reduces, and mapcars chain together; SQL has shown a clearly superior succinctness of syntax. Worse, these queries are ridiculously non-optimized out of the gate. I've had to deal with this situation in multiple industry positions, and it is *not* acceptable for getting work done. It is too slow, too incoherent, and too inelegant.
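To make the contrast concrete, here is an illustrative snippet. The data shape is entirely hypothetical (a list of plists), but the chained style is exactly the kind of thing I mean:

```lisp
;; Illustrative only: summing one column under a filter, in the chained
;; style the REPL pushes you toward. ROWS is a hypothetical list of
;; plists such as (:region "west" :amount 120).
(reduce #'+
        (mapcar (lambda (row) (getf row :amount))
                (remove-if-not (lambda (row)
                                 (string= (getf row :region) "west"))
                               rows)))

;; The SQL equivalent is one readable line:
;;   SELECT SUM(amount) FROM sales WHERE region = 'west';
```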
Hence, I am working on a solution. It started out as CL-LINQ, or Common Lisp Language INtegrated Queries, a derivative of the C# approach; the initial cut can be found on my GitHub for interested parties. It suffers from a basic design flaw: it keeps 100% of its data in memory and uses lists for its internal representation.
I am proud to note that I've begun work on an entirely redesigned and improved system. It is built from several key pieces; the first and most important is the data storage system, which is what I've been working on recently.
Data is stored in data frames; each data frame has information about its headers, and the data itself lives in a 2D Common Lisp array, ensuring near-constant access time to a known cell. Data frames are loaded via pages: each page has a 1:1 mapping to a data frame and holds a reference to the data table as well as to the backing store, along with information about the data in its frame. Pages are routed through a caching layer with a configurable caching strategy, so that only the data of interest is loaded in memory at a given point in time. Finally, a table contains a number of pages, along with methods to access the headers, particular rows in the table, and so on.
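A minimal sketch of how these pieces might look as CLOS classes; every class and slot name here is my own illustration, not the project's actual code:

```lisp
;; A minimal sketch of the storage layout described above. All names
;; and slots are illustrative assumptions, not the real API.
(defclass data-frame ()
  ((headers :initarg :headers :reader frame-headers)   ; column names
   (cells   :initarg :cells   :reader frame-cells)))   ; 2D array of data

(defclass page ()
  ((frame :initarg :frame :accessor page-frame)  ; 1:1 mapping to a data frame
   (table :initarg :table :reader page-table)    ; owning data table
   (store :initarg :store :reader page-store)))  ; backing store reference

(defclass table ()
  ((pages   :initform nil :accessor table-pages)
   (headers :initarg :headers :reader table-headers)))

;; Near-constant-time access to a known cell via the underlying 2D array.
(defun frame-cell (frame row col)
  (aref (frame-cells frame) row col))
```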
After this system is done (it is perhaps 80% of the way there now), the index system can be built. By building the indexes separately from the raw storage system, I can tune both for optimal behavior: indexes can be built as a tree over the data, while the data itself stays in an efficiently accessible form.
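As a rough sketch of what an index over the data could mean, here is a simplified version (a sorted vector with binary search rather than a real tree, reusing the hypothetical `data-frame` above and assuming numeric keys):

```lisp
;; Sketch of an index kept separate from storage: a sorted vector of
;; (key . row) pairs over one column, searched by binary search. A real
;; implementation would more likely use a B-tree; keys are assumed numeric.
(defun make-index (frame column)
  (let* ((n (array-dimension (frame-cells frame) 0))
         (entries (make-array n)))
    (dotimes (row n)
      (setf (aref entries row)
            (cons (frame-cell frame row column) row)))
    (sort entries #'< :key #'car)))

(defun index-lookup (index key)
  "Return the row number whose key matches KEY, or NIL."
  (loop with lo = 0 and hi = (length index)
        while (< lo hi)
        do (let* ((mid (floor (+ lo hi) 2))
                  (entry (aref index mid)))
             (cond ((= (car entry) key) (return (cdr entry)))
                   ((< (car entry) key) (setf lo (1+ mid)))
                   (t (setf hi mid))))))
```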
Finally, as the motivating factor, the query engine will be designed with both prior systems in mind. Most of its complexity will lie in interacting with the index system to ensure high-speed JOINs. A carefully developed query macro system could, for instance, precompile desired queries for optimal layout and speed.
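A toy example of what such precompilation could look like; the macro syntax is entirely hypothetical:

```lisp
;; A toy sketch of query precompilation (syntax is hypothetical). The
;; macro expands the query into an open-coded loop, so the WHERE and
;; SELECT forms are compiled with the surrounding code instead of being
;; interpreted per row at run time.
(defmacro compile-query ((row-var table) &key where select)
  `(loop for ,row-var in (coerce ,table 'list)
         when ,where
           collect ,select))

;; Hypothetical usage, with rows as plists:
;;   (compile-query (row *sales*)
;;                  :where  (string= (getf row :region) "west")
;;                  :select (getf row :amount))
```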
Features that will be considered for this project include:
- integration with Postgres as the storage engine
- compiled optimization of queries
- a pluggable conversion system for arbitrary objects and their analysis
At the completion of this project, a library will be available for loading large amounts of data into data tables, running queries and other processing over them, and then storing the transformed data in external sources.