Linux/POSIX commands that every Data Scientist should know

Sometimes we have to work on legacy projects or systems that have very little documentation, if any.
I see a lot of data scientists struggling to find their way around these projects, so I decided to write down a few basic but very useful Linux/POSIX commands that every data scientist/engineer/programmer should know (imho).


 First remember that you can always type

$ man command                                                                                                  

to get more information on the command. This should tell you what the command does and how you can use it. For example, the following shows the manual page for the awk command.

$ man awk                                                                                                                     

Let's say you have a File/Library not found error. One thing you can try is the locate command.

$ locate pattern                                                                                                     

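locate prints the path of every file whose name matches the pattern. With this, you can check whether a file is on your computer, and where it is. Keep in mind that locate searches a prebuilt index (usually refreshed by updatedb), so very recent files may not show up yet.
whereis is also a good tool for finding programs, but with whereis you have to give the exact name of the program you are looking for. For example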

$ whereis python                                               

will show you where the python binary, its sources and its manual page are located. To see exactly which python runs when you type "python" in your terminal (the first one found in your PATH), use command -v python or which python.

Let's say you realize that you do have the file you were looking for, but you still get an error. In this case, you might not have the right permissions to access it. You can change its permissions with:

$ chmod  755  file                                                                                                   

or

$ chmod  u+x  file                                             
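
To make the two forms above concrete: 755 sets read/write/execute for the owner and read/execute for group and others, while u+x only adds execute permission for the owner and leaves everything else alone. The following is a minimal sketch on a scratch file created with mktemp (the file and the variable names are just for illustration):

```shell
# Scratch file used only for this illustration.
demo_file="$(mktemp)"

chmod 644 "$demo_file"    # rw-r--r-- : readable, not executable
chmod u+x "$demo_file"    # add execute for the owner only -> rwxr--r--

# [ -x ] is true when the current user may execute the file.
owner_exec=no
[ -x "$demo_file" ] && owner_exec=yes

chmod 755 "$demo_file"    # rwxr-xr-x : owner full access, others read/execute

# The first 10 characters of "ls -l" are the file type and permission bits.
perms="$(ls -l "$demo_file" | cut -c1-10)"
echo "$perms"

rm -f "$demo_file"
```

Running ls -l on a file before and after chmod is a quick way to confirm that the change did what you expected.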

Let's say the program you want is not installed at all on your system. If you are on Ubuntu, you can first search for it with:

$ apt-cache search pattern                                                                                           

This will print a list of packages matching the pattern. Find the program you want in that list, then install it with:

$ sudo apt-get install program                                                                   
 
On macOS, we usually use Homebrew:

$ brew install program

If what you need is a Python package, you can run:

$ pip install package                                            
                                                
BTW, if you ever want to check the list of Python packages installed in your current Python environment, you can run:

$ pip freeze                                                     

Let's say you are compiling a program and getting "Error 1" as output, but you have no idea what Error 1 is or where in the code it comes from. You can type:

$ grep -r "Error 1"  .                                                                                                 

This will search recursively for the string, starting from your current directory, and print every matching line together with the name of the file it was found in.
If there are too many results, you can type instead

$ grep -r "Error 1"  . | less                                                                               

This lets you scroll up and down through the results (press q to quit).
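
Two extra grep flags are often handy here: -n prints the line number of each match, and -l prints only the names of the files that contain a match. Below is a small sketch on a throwaway directory; the file names and their contents are made up for the example:

```shell
# Throwaway directory with two hypothetical log files.
demo_dir="$(mktemp -d)"
printf 'all good\n' > "$demo_dir/ok.log"
printf 'make: *** [all] Error 1\n' > "$demo_dir/build.log"

# -r: recurse into directories, -l: list only the names of matching files.
matches="$(grep -rl "Error 1" "$demo_dir")"

# -n: prefix each matching line with its line number.
lines="$(grep -rn "Error 1" "$demo_dir")"

echo "$matches"
echo "$lines"

rm -rf "$demo_dir"
```

The second form tells you the file and the line number at once, which is usually what you want before opening an editor.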

Ok, so you ran your program, but it is still not working properly. Let's say some application is getting stuck. If the program you want to kill is running in the foreground of your terminal, you can stop it by pressing Ctrl + C. If not, or if it is running in the background, you can look up its process ID with

$ ps -e                                                                                                                                        

Look for your application's PID (process ID, the number in the first column, on the same line as your program's name) and type

$ kill number                                                                                                                      
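
Here is a small end-to-end sketch of that workflow, using sleep as a stand-in for the stuck application. In a real session you would read the PID from the ps output; in a script, the shell variable $! holds the PID of the last background job:

```shell
# Start a long-running stand-in process in the background.
sleep 300 &
pid=$!

# Show only that process (ps -e would list every process on the system).
ps -p "$pid"

# Ask the process to terminate (kill sends SIGTERM by default).
kill "$pid"

# wait returns 128 + the signal number when the job was ended by a signal.
status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"
```

If a process ignores the default signal, kill -9 forces it to stop, but it gives the program no chance to clean up, so try plain kill first.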

Another great resource is the find command. You can find files by name or size! For example:

$ find . -name "*.jar"                                            

will find all files with the .jar extension in any directory under your current directory. You can also use it to find large files, like:

$ find / -size +100M                                              
 
The above command finds all files larger than 100 MB anywhere on your computer! (The M suffix works with GNU and BSD find, but it is not part of POSIX.)
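
The two kinds of search can be tried safely on a throwaway directory before pointing find at a real project. In this sketch all file names are invented for the example, and the size threshold is lowered to 1 MB so that the test file stays small:

```shell
# Throwaway directory tree with hypothetical files.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/lib/sub"
: > "$demo_dir/lib/sub/app.jar"     # empty file with a .jar extension
: > "$demo_dir/notes.txt"           # empty file that should not match

# Search by name: quoting the pattern stops the shell from expanding it.
jars="$(find "$demo_dir" -name "*.jar")"

# Create a 2 MB file, then search by size (+1M means "larger than 1 MB").
dd if=/dev/zero of="$demo_dir/big.bin" bs=1024 count=2048 2>/dev/null
big="$(find "$demo_dir" -type f -size +1M)"

echo "$jars"
echo "$big"

rm -rf "$demo_dir"
```

When searching from /, appending 2>/dev/null hides the "Permission denied" messages for directories you are not allowed to read.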

Last, my all-time favorite: nohup. nohup is a great tool for letting a script or program keep running on a remote system even if you get disconnected from it! So let's say you have connected over ssh to some system and need to execute a program that takes hours to finish. With nohup, you can log out and the program continues to run!

$ nohup python potato.py &                                       

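will leave potato.py running while you go and finish your business elsewhere. Unless you redirect it, the program's output is typically saved to a file named nohup.out in the current directory.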
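
A common variation is to send the output to a log file of your choice and to note the PID, so that you can check on the job later. In this sketch a short sh command stands in for the long-running Python script, and the log file is a temporary file created just for the example:

```shell
# Log file for the job's output (a temporary file, for illustration only).
log_file="$(mktemp)"

# Stand-in for a long job: redirect stdout and stderr to the log file,
# then put the job in the background with &.
nohup sh -c 'sleep 1; echo finished' > "$log_file" 2>&1 &
job_pid=$!
echo "started job with PID $job_pid"

# In a real session you would log out here; the sketch just waits instead.
wait "$job_pid"
cat "$log_file"
```

After logging back in, ps -p with the saved PID tells you whether the job is still running, and the log file shows how far it got.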

Of course, you can still be an absolutely amazing data scientist without knowing any of these, but they can definitely be lifesavers and are worth taking the time to learn!

:D
