Just a girl in Tech

Posts

Linux/POSIX commands that every Data Scientist should know

May 22, 2022

Sometimes we face the challenge of working on legacy projects or systems that have very little documentation, if any. I see a lot of data scientists struggling to find their way around these projects, so I decided to write down a few very useful, basic Linux/POSIX-compliant commands that every data scientist/engineer/programmer should know (imho). First, remember that you can always type $ man command ...

Top 5 Best Data Science and Machine Learning Courses

May 21, 2022

New Data Science enthusiasts usually wonder what the best resources are to master this area. I am a huge fan of online courses (especially if they are free 😆) and decided to share my top 5 favorites. All courses below should have their main content available for free, so you can learn Machine Learning without investing too much! Statistical Learning This course from Stanford University, taught by Trevor Hastie and Robert Tibshirani, is an absolutely amazing introduction to Machine Learning. You might have heard of Prof. Tibshirani as the developer of the Lasso method. The classes are a great mix of practical intuition and theoretical concepts. Besides, the professors are funny and adorable (if you don't mind me saying). Applied Machine Learning in Python Here we have a much more practical introduction to Machine Learning an...

Apache Hadoop Admin Tricks and Tips

May 24, 2018

In this post I will share some tips I learned after using the Apache Hadoop environment for some years and doing many workshops and courses. The information here is based on Apache Hadoop around version 2.9, but it probably applies to other similar versions as well. These are considerations for building or using a Hadoop cluster; some are specific to the Cloudera distribution. Anyway, hope it helps! Don't use Hadoop for millions of small files. It overloads the namenode and makes it slower. It is not difficult to overload the namenode. Always check namenode capacity against the number of files. Files on Hadoop should usually be larger than 100 MB. You need about 1 GB of namenode memory for every 1 million files. Nodes usually fail after 5 years. Node failures are one of the most frequent problems in Hadoop. Big companies like Facebook and Google probably see node failures every minute. The MySQL on Cloudera Manager does not have redunda...

BigData White Papers

November 10, 2017

I don't know about you, but I always like to read the white papers behind OpenSource projects (when available, of course :) ). I have been working with BigData quite a lot lately, and this area is mostly dominated by Apache OpenSource projects. So, naturally (given the nerd that I am), I tried to investigate their history. I created a list of the articles and companies that gave rise to most BigData Apache projects. Here it is! Hope you guys find it interesting too. :) Apache Hadoop Based on: Google MapReduce and GFS Papers: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf Apache Spark Created by: University of California, Berkeley Papers: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf http://peo...

Deep Learning, TensorFlow and Tensor Core

August 11, 2017

I was lucky enough to get a ticket to Google I/O 2017 through Google Code Jam for Women (for girls who don't know, Google runs a programming contest for women, and the top-ranked contestants win tickets to the conference). One of the main topics of the conference was, for sure, its Deep Learning library TensorFlow. TensorFlow is Google's OpenSource Machine Learning library that runs on both CPU and GPU. Two very cool things were presented at Google I/O: TPU (Tensor Processing Unit) - a custom chip (not a GPU) optimized specifically for TensorFlow that can be used on Google Cloud; and TensorFlow Lite - a lightweight version of TensorFlow that runs on Android and makes developers' lives easier. Last week, at a BigData meetup in Chicago, I discovered that Nvidia also created dedicated GPU hardware for Deep Learning, the Tensor Core. With all this infrastructure and these APIs being made available, Deep Learning can be done considerably more easily and quickly. At Go...

Errors when using the neuralnet package in R

August 02, 2017

Ok, so you read a bunch of stuff on how to do Neural Networks and how many layers or nodes you should add, etc. But when you start to implement the actual Neural Networks, you face a ton of silly errors that interrupt your beautiful, inspired programming. This post talks about some errors you might face when using the neuralnet package in R. First, remember that to use the package you should install it: install.packages("neuralnet") Then run library("neuralnet") to load the package. Error 1 One error that might happen when training your neural network is this: nn <- neuralnet(formula1,data=new_data, hidden=c(5,3)) Error in terms.formula(formula) : invalid model formula in ExtractVars This happens when the names of the variables in the formula "formula1" are in an invalid format. For example, if you named your columns (or variables) as numbers, you would get this error. So change your column names and re-run the model! Example: label ~ 1 ...
