Accelerated Programming for Data Analysis and Processing
- Wednesday, 22. January 2025, 10:00 - 12:00
- INF 205, Room 2.414
- Thi Kim Tuyen Le
- Organizer: Dean
- Event Type: Doctoral Examination
In the past decade, data science has made remarkable progress, evidenced by the proliferation of data-driven strategies, the rapid growth of data science-related jobs, and the expansion of university curricula in this field. Consequently, data has solidified its role as a paramount asset for organizations. Nonetheless, these great benefits come with nontrivial challenges.
In particular, (1) many domain experts, who are proficient in their respective fields but lack programming skills, have difficulty learning and using the numerous data science toolkits. In addition, (2) data science practitioners invest substantial effort in adapting implementations when switching between platforms or programming languages. Furthermore, (3) transitioning from "small" to "big" datasets often requires additional work, including the deployment of complex data structures, the adoption of new libraries, and potential re-implementation.
This dissertation addresses these three issues while expediting scripting for developers. We utilized low-code techniques and Machine Learning (ML)-based approaches to accelerate programming tasks. Additionally, we deployed multiple libraries with domain-specific operations to simplify implementation when transitioning across platforms. Moreover, within these libraries, we standardized the Application Programming Interfaces (APIs) for both sequential and parallel processing, enabling users to switch seamlessly between the two. Accordingly, this doctoral project emphasizes both research contributions and practical applications.
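To make the unified sequential/parallel API idea concrete, the sketch below is a hypothetical illustration rather than the dissertation's actual interface: the function names (`load_csv`, `select_columns`) and the `backend` parameter are invented for this example, assuming pandas as the sequential backend and PySpark as the parallel one.

```python
# Hypothetical sketch of a unified sequential/parallel API (not the dissertation's
# actual interface): the same call dispatches to pandas or PySpark depending on
# the chosen backend, so user code does not change when scaling up.
import pandas as pd


def load_csv(path, backend="pandas", spark=None):
    """Read a CSV file with either a sequential or a parallel backend."""
    if backend == "pandas":
        return pd.read_csv(path)  # sequential, in-memory
    elif backend == "spark":
        # `spark` is an existing pyspark.sql.SparkSession created by the caller
        return spark.read.csv(path, header=True, inferSchema=True)  # distributed
    raise ValueError(f"unknown backend: {backend}")


def select_columns(df, columns):
    """Column projection exposed through the same call on both backends."""
    return df[columns] if isinstance(df, pd.DataFrame) else df.select(*columns)
```

Under such a design, switching from a "small" to a "big" dataset would only require changing the backend argument, not rewriting the analysis script.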
In practical terms, we developed a Visual Studio Code extension called NLDSL to support the development and use of Domain-Specific Languages (DSLs), particularly for data analysis and processing. This extension simplifies scripting for end-users and developers by harnessing the benefits of natural language-like DSLs. Users can readily reuse customized DSLs through shared DSL templates. Notably, these DSLs employ unified grammars for both sequential and parallel operations to address scalability concerns. The extension has received positive feedback from the community, underscoring the need for this type of extension.
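To give a flavor of what a natural language-like DSL statement might look like, the sketch below is purely illustrative: the pipeline statement and its expansion are invented for this example and do not reproduce NLDSL's actual grammar or generated code.

```python
# Hypothetical illustration of a natural language-like DSL statement being expanded
# into executable code; the statement and the expansion below are invented for this
# example and do not reproduce NLDSL's actual grammar or output.
#
#   on sales | drop duplicates | group by region | aggregate sum of revenue
#
# might expand to pandas code such as:
import pandas as pd

sales = pd.read_csv("sales.csv")  # hypothetical input file
result = (
    sales.drop_duplicates()
         .groupby("region")["revenue"]
         .sum()
         .reset_index()
)
print(result)
```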
Our research contributions primarily focus on accelerating programming with ML-based code generation techniques and on enriching the libraries above, with source code and data published for reproducibility. We conceived and evaluated an ensemble of code recommenders, named Extended Network, to demonstrate the accuracy gains achieved by its ensemble-like architecture. In addition, we deployed a refined evaluation method, CT3, which reveals insights that classical aggregated evaluation obscures when comparing code completion approaches. Finally, we proposed One-shot Correction, a procedure to integrate user feedback into generative Artificial Intelligence models without explicit re-training, facilitating in-depth analysis of unexpected outcomes. The effectiveness of these methods was demonstrated through our empirical studies.
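The following is a generic score-averaging sketch of how several code recommenders can be combined into an ensemble; it is not the Extended Network architecture itself, and the `ensemble_recommend` function and the dummy recommenders are invented for illustration.

```python
# Generic sketch of combining several code recommenders by score averaging;
# this illustrates the ensemble idea only and is not the Extended Network
# architecture from the dissertation.
from collections import defaultdict


def ensemble_recommend(context, recommenders, top_k=3):
    """Each recommender maps a code context to {candidate: score}; the ensemble
    averages the scores and returns the top-k candidates."""
    totals = defaultdict(float)
    for rec in recommenders:
        for candidate, score in rec(context).items():
            totals[candidate] += score / len(recommenders)
    return sorted(totals, key=totals.get, reverse=True)[:top_k]


# Toy usage with two dummy recommenders:
rec_a = lambda ctx: {"df.head()": 0.7, "df.describe()": 0.2}
rec_b = lambda ctx: {"df.head()": 0.5, "df.tail()": 0.4}
print(ensemble_recommend("df.", [rec_a, rec_b]))  # ['df.head()', 'df.tail()', 'df.describe()']
```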