October 15, 2017
My name is George Traskas and I am from Greece. I went to Aristotle University of Thessaloniki and graduated with a BSc in Chemistry and an MSc in Chemical Technology. During my MSc studies and for a couple of years, I was involved in several water treatment research projects. I am an experienced chemist with a strong background in analytical chemistry, and I have worked at CPERI/CERTH for the last 10 years.
Statistics and data analysis are a growing concern in chemistry these days. For that reason, last year I started learning data analysis at Udacity and successfully completed the Data Analyst Nanodegree Program. This website focuses mainly on data science and contains a collection of my work in machine learning, data wrangling, exploratory data analysis, and statistics.
- Machine Learning - Regression, SVMs, Decision Trees, Naive Bayes, Clustering, Neural Networks
- Cross Validation and Model Evaluation
- Learning Curves and Error Analysis
- Feature Scaling and Normalization
- Feature Selection and Dimensionality Reduction (PCA)
- Statistics - Descriptive, Inferential, Hypothesis Testing, Q-tests, t-tests
- Data Mining/Wrangling - xls, csv, html, json, xml, SQL
- Data Analysis - Univariate, Multivariate, Correlations, Visualizations
- Jupyter Notebooks
Describe a time you experienced a challenge while building a product/project and how you overcame it.
Recently, I had to create a set of quality control charts (QCC) required by an external ISO 9001 audit. I had several hundred observations for 57 features. I quickly realized that doing this manually in Excel or similar software would be a lot of work, so I searched for tools that could do it automatically. I found that R has a library for quality control charting called “qcc”. Installing R and the required packages on my work machine took only a few minutes. Then, I read my exported “csv” data into a data frame. Finally, by iterating over my list of features with a “for-loop”, I created a comprehensive “html” report with statistical summaries and informative control charts in a few seconds.
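The per-feature loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original R script and not the “qcc” API. It assumes an individuals-type control chart whose limits are the mean plus or minus three sigma, with sigma estimated from the average moving range. The function names and the column names `ph` and `turbidity` are hypothetical and used only for the example.

```python
import csv
import io
from statistics import mean

# Standard constant d2 for moving ranges of two consecutive observations.
D2_N2 = 1.128


def control_limits(values):
    """Return (lcl, center, ucl) for an individuals control chart.

    Sigma is estimated as the average moving range divided by d2,
    which is an assumption of this sketch.
    """
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma = mean(moving_ranges) / D2_N2
    return center - 3 * sigma, center, center + 3 * sigma


def out_of_control(values):
    """Return the indices of observations outside the control limits."""
    lcl, _, ucl = control_limits(values)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


def summarize(csv_text):
    """Build a per-feature summary from CSV text with one column per feature."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            columns[name].append(float(cell))

    report = {}
    # One pass per feature, mirroring the "for-loop" over features.
    for name, values in columns.items():
        lcl, center, ucl = control_limits(values)
        report[name] = {
            "lcl": lcl,
            "center": center,
            "ucl": ucl,
            "flagged": out_of_control(values),
        }
    return report


if __name__ == "__main__":
    # Hypothetical data: two made-up features, four observations each.
    sample = "ph,turbidity\n7.0,1.0\n7.2,1.1\n7.1,0.9\n7.0,1.2\n"
    for feature, stats in summarize(sample).items():
        print(feature, stats)
```

The summary dictionary could then be rendered into an “html” report or passed to a plotting library. That step is left out here because it depends on the tooling chosen.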
What big-data problem would you solve that can benefit society at a large scale?
I constantly follow the rapidly increasing volume of analytical and biomedical data in chemistry and the life sciences, which requires new methods and approaches for handling it. One of the biggest challenges is analyzing chemical and biological data from millions of compounds with machine learning methods for clinical prediction purposes. After finishing my Data Analyst Nanodegree at Udacity, I started working on a new project about breast cancer prediction using a free dataset from Kaggle. You can find my completed project in the Posts section.
How do you see data science and machine learning affecting the way we design software?
I believe that one of the benefits of data science and machine learning is the efficient analysis of massive amounts of data and the accurate predictions made from them. I think it is now easier for software engineers to create applications tailored to our personal needs, since software can handle more data inputs and outputs and can learn our preferences in real time, providing a better experience for the user. For example, a “smart” home application could provide conveniences such as energy efficiency and comfort with minimal supervision.
What is the most interesting fact or trend you’ve learned from analyzing data?
One thing that I’ve found really interesting is the “magic” of revealing hidden structures and patterns in unlabelled data with unsupervised machine learning. Conventional data analysis means manually implementing every step, from preprocessing to visualising and analysing. With new advanced analytics and machine learning, this process is now more automated and helps you reveal relationships that would otherwise be impossible to find. For example, in medicinal chemistry, a computational approach called Quantitative Structure-Activity Relationship (QSAR) uses machine learning algorithms and validation methods to find mathematical relationships between the observed physicochemical properties and the chemical structure of a compound.
I am always keen to discuss data science and related topics. Feel free to drop me an email or contact me via LinkedIn if you are interested in working on the same topics.