Posts

An Introduction to Statistical Learning - in Python

The day we have all been waiting for is here! The authors of An Introduction to Statistical Learning with Applications in R have finally launched the Python version of the book. I am a big fan of this material, as well as of the FREE online course made available by the authors (you can find it here). Honestly, there is no better introduction to Machine Learning with such a solid statistical footing. The book contained exercises and examples in R, and now a Python version has been released! Chapter 10, on Deep Learning, was slightly changed to use PyTorch instead of TensorFlow (as it was done in the previous R version). When I was studying with this book, I implemented a TensorFlow Python version of the labs and exercises they made available in R. If you are curious and want to check the TensorFlow Python version of the Deep Learning chapter, you can find it on my GitHub.

Linux/POSIX commands that every Data Scientist should know

Sometimes we face the challenge of working on legacy projects or systems that have very little documentation, if any. I see a lot of data scientists struggling to find their way around these projects, so I decided to write up a few very useful and basic Linux/POSIX-compliant commands that every data scientist/engineer/programmer should know (imho). First, remember that you can always type $ man command to get more information on a command. This should tell you what the command is and how you can use it. For example, the following should give you the manual of the awk command. $ man awk Let's say you have a File/Library not found error. One thing you can try is the locate command. $ locate pattern
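As a quick sketch of how these commands fit together when triaging a missing-file error (the "libssl" pattern below is just a placeholder, not from the original post):

```shell
#!/bin/sh
# Sketch: triaging a "File/Library not found" error with POSIX tools.

# 1. Read the manual of any command you are unsure about:
#    man awk        (press q to quit the pager)

# 2. Try locate first; it searches a prebuilt index, so it is fast.
#    If results look stale, the index can be refreshed with updatedb.
if command -v locate >/dev/null 2>&1; then
    locate libssl
fi

# 3. Fall back to find, which walks the filesystem directly (slower):
find /usr -name 'libssl*' 2>/dev/null
```

Note that locate only sees what is in its index, while find inspects the live filesystem, so find is the safer (if slower) fallback on an unfamiliar machine.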

Top 5 Best Data Science and Machine Learning Courses

New Data Science enthusiasts usually wonder what the best resources are to master this area. I am a huge fan of online courses (especially if they are free 😆) and decided to share my top 5 favorite ones. All courses below should have their main content available for free, so you can learn Machine Learning without investing too much! Statistical Learning This course from Stanford University, taught by Trevor Hastie and Robert Tibshirani, is an absolutely amazing introduction to Machine Learning. You might have heard of Prof. Tibshirani as being responsible for developing the Lasso method. The classes are a great mix of practical intuition and theoretical concepts. Besides, the professors are funny and adorable (if you don't mind me saying). Applied Machine Learning in Python Here we have a much more practical introduction to Machine Learning and Data Science, with amazing examples in Python and details about arguments to be used in

Apache Hadoop Admin Tricks and Tips

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and doing many, many workshops and courses. The information here considers Apache Hadoop around version 2.9, but it can probably be extended to other similar versions. These are considerations for when building or using a Hadoop cluster. Some are considerations about the Cloudera distribution. Anyway, hope it helps! Don't use Hadoop for millions of small files. It overloads the namenode and makes it slower, and it is not difficult to overload the namenode. Always check capacity vs. number of files. Files on Hadoop usually should be larger than 100 MB. You need around 1 GB of namenode memory for every 1 million files. Nodes usually fail after 5 years; node failures are one of the most frequent problems in Hadoop. Big companies like Facebook and Google likely see node failures by the minute. The MySQL on Cloudera Manager does not have redundancy. This could
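The namenode sizing rule of thumb above (roughly 1 GB of heap per 1 million files) can be turned into a quick back-of-the-envelope check; the file count below is a made-up example:

```shell
#!/bin/sh
# Rough namenode heap estimate: ~1 GB per 1 million files (rule of thumb).
files=3500000   # hypothetical number of files in the cluster

# Integer ceiling of files / 1,000,000
heap_gb=$(( (files + 999999) / 1000000 ))
echo "Estimated namenode heap: ${heap_gb} GB"
```

On a live cluster, the file count could come from something like hdfs dfs -count /, which reports directory, file, and byte totals for a path.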

BigData White Papers

I don't know about you, but I always like to read the white papers that originated OpenSource projects (when available, of course :) ). I have been working with BigData quite a lot lately, and this area is mostly dominated by Apache OpenSource projects. So, naturally (given the nerd that I am), I tried to investigate their history. I created a list of articles and companies that originated most BigData Apache projects. Here it is! Hope you guys find it interesting too. :) Apache Hadoop Based on: Google MapReduce and GFS Papers: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf Apache Spark Created by: University of California, Berkeley Papers: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf http://people.csail.mit.edu/matei/p

Deep Learning, TensorFlow and Tensor Core

I was lucky enough to get a ticket to Google I/O 2017 through a Google Code Jam for Women (for those who don't know, Google runs programming contests for women, and the best classified win tickets to the conference). One of the main topics of the conference was certainly its Deep Learning library, TensorFlow. TensorFlow is Google's OpenSource Machine Learning library that runs on both CPU and GPU. Two very cool things were presented at Google I/O: TPU (Tensor Processing Unit) - a processor optimized specifically for TensorFlow that can be used on the Google Cloud Engine; TensorFlow Lite - a lightweight version of TensorFlow to run on Android and make developers' lives easier. Last week, at a BigData meetup in Chicago, I discovered that Nvidia has also created specialized GPU hardware for Deep Learning, the Tensor Core. With all this infrastructure and these APIs being made available, Deep Learning can be done considerably more easily and quickly. At Google I/

Errors when using the neuralnet package in R

Ok, so you read a bunch of stuff on how to do Neural Networks, how many layers or nodes you should add, etc... But when you start to implement the actual Neural Networks, you face a ton of dummy errors that stop your beautiful inspirational programming. This post talks about some errors you might face when using the neuralnet package in R. First, remember, to use the package you should install it: install.packages("neuralnet") Then library("neuralnet") to load the package. Error 1 One error that might happen when training your neural network is this: nn <- neuralnet(formula1, data=new_data, hidden=c(5,3)) Error in terms.formula(formula) : invalid model formula in ExtractVars This happens when the names of the variables in the formula "formula1" are in an undesired format. For example, if you named your columns (or variables) as numbers, you would get this error. So change your column names and re-run the model! Example: label ~ 1