Shell commands are very useful for initial exploratory analysis of data files.
A simple head
lets you peek at the content of a huge file, and it is much faster than firing up an editor. With so many different tools around for data analysis, I believe the command-line one-liner is one of the most underrated.
Head / tail
view the first / last lines of a file.
# Print the first 10 lines
head -n10 test.csv
# Print the last 10 lines
tail -n10 test.csv
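Tail can also print a file starting from a given line, which is handy for skipping a csv header row:
# Print everything from the second line onward (skip the header)
tail -n +2 test.csv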
More / less
allow one to inspect a file quickly without loading it into a text editor.
# More allows scrolling forward only.
more test.csv
# Less allows scrolling in either direction and has a lot more features.
less test.csv
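Two of those features that I reach for all the time:
# Chop long lines instead of wrapping them - useful for wide csv files
less -S test.csv
# Inside less, type /pattern to search forward and n to jump to the next match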
A combination of these tools can be quite powerful in navigating large files.
# Explore the last 1000 lines of test.csv interactively.
tail -n1000 test.csv | less
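In the same spirit, head and tail compose to extract an arbitrary range of lines:
# Print lines 100 to 110 of test.csv
head -n110 test.csv | tail -n11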
Very often we would like to look at a log file while another program is continuously writing messages to it.
tail -f everIncreasing.log
With the -f
flag, tail keeps monitoring the file and shows you the latest changes.
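Combined with grep, this makes a handy ad-hoc log monitor. A small sketch, assuming the interesting lines contain the literal word ERROR:
# Follow the log and show only the lines containing ERROR
tail -f everIncreasing.log | grep ERROR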
Grep
searches for a certain regular expression pattern and prints the line(s) matching the expression.
For example, searching for Apple within a text file would be:
grep Apple input.csv
One can then easily pipe the output to wc
to count the number of matching lines.
grep Apple input.csv | wc -l
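Grep can also do the counting itself with the -c flag, saving a pipe:
# Count the lines matching Apple
grep -c Apple input.csv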
To obtain the unique entries in a column of a csv file, first extract the column with cut. Then sort
its output and count the unique entries with uniq.
<input.csv cut -d, -f1 | sort | uniq -c
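Sorting the counts turns this into a quick frequency table. A sketch, assuming the values of interest live in the first column of input.csv:
# Ten most frequent values in the first column
<input.csv cut -d, -f1 | sort | uniq -c | sort -rn | head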
I once handled a csv file that had an extra comma at the end of every line.
Because of this extra comma, it could not be read by pandas/R.
This seems to be a task perfectly handled by sed.
sed 's/,$//' wrongcomma.csv > cleanFile.csv
However, the regular expression fails to match any of the expected strings.
It turns out that the line endings must be Unix-style for sed
to work: with Windows (CRLF) endings, a hidden carriage return sits between the comma and the newline, so the comma is never the last character of the line.
The file can be easily converted by the command dos2unix.
<wrongcomma.csv dos2unix | sed 's/,$//' > cleanFile.csv
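Alternatively, the carriage return can be handled in the regular expression itself. A sketch relying on GNU sed (the \r and \? escapes are GNU extensions), which strips the trailing comma and the carriage return in one pass:
# Remove a trailing comma together with any carriage return before the newline
sed 's/,\r\?$//' wrongcomma.csv > cleanFile.csv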
Another common problem with non-English files is encoding. I have handled a lot of data with simplified Chinese characters.
# Convert from gb2312 to utf8 encoding
iconv -f gb2312 -t utf8 myfile.csv > correct.csv
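If you are unsure what the source encoding is, the file utility can usually guess it (the lowercase -i flag is GNU file; on macOS the equivalent is -I):
# Guess the encoding of a file
file -i myfile.csv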
Greg Reda shared his experience with unix commands for data science. http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
Jeroen Janssens introduced a few more tools for data science in the following post. I personally find jq and csvkit, introduced there, very useful for command-line data exploration. Also check out his book “Data Science at the Command Line”. http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html