Manipulating Text Files

If you handle a lot of data from various programs, instruments and sources, sooner or later you will encounter a large text file from which you need to extract some data. Of course there are numerous tools and libraries for importing space or comma separated ASCII data into everything from C++ and Python all the way to Excel. Generally these require you to load the entire file, which may be overkill if you only need one small part of a large data set. The following outlines a couple of neat tricks that can be used to extract only the start, the end or a single column of data from these sorts of files.

Note: all of the following methods rely on Unix commands, which are available on almost all Linux distributions and on Mac OS X.

Consider a case where you have a large file that contains a series of comma separated variables, where each row corresponds to a record taken at a different time.

The file boundary_layer.dat contains:

 
#time, Temp1, Temp2, Temp3
0.0, 21.1, 25.4, 110.3
...
1.0, 24.1, 21.4, 114.3

If you quickly wanted to see only the first record in the file, to check the initial temperatures at time zero, this can be done using the head command, e.g.

head -n 2 boundary_layer.dat

The option -n 2 tells head to display the first two lines of the file (the header line and the first record), i.e.

#time, Temp1, Temp2, Temp3
0.0, 21.1, 25.4, 110.3

Alternatively, if you were only interested in the last time record, you could use tail with -n 1, which prints just the final line of the file, e.g.

tail -n 1 boundary_layer.dat

1.0, 24.1, 21.4, 114.3
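The two commands can also be combined. The following is a minimal sketch, assuming a made-up four-line stand-in for boundary_layer.dat (the values are illustrative, not real data): tail -n +2 prints everything from line 2 onwards, which skips the header, and head -n 3 piped into tail -n 1 picks out a single line by its number.

```shell
# Build a small stand-in for boundary_layer.dat (illustrative values only).
workdir=$(mktemp -d)
cat > "$workdir/boundary_layer.dat" <<'EOF'
#time, Temp1, Temp2, Temp3
0.0, 21.1, 25.4, 110.3
0.5, 22.6, 23.0, 112.0
1.0, 24.1, 21.4, 114.3
EOF

# Everything except the header: start the output at line 2.
no_header=$(tail -n +2 "$workdir/boundary_layer.dat")

# A single line by number (here line 3): take the first 3 lines, keep the last.
third_line=$(head -n 3 "$workdir/boundary_layer.dat" | tail -n 1)

echo "$third_line"   # prints: 0.5, 22.6, 23.0, 112.0
```

Neither command needs to parse the columns, so this stays quick even when the file is large.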

The above cases are very simple and are something that any Linux user should be familiar with, but what if I wanted to quickly extract all the values in a single column, such as Temp1? Clearly a head or a tail call would not work in this case. However, this can be achieved with the cut command.

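Let's say I want to extract the first column (the time values) from the above file. I can do this by using the cat command and piping its output into cut, where -d',' sets the delimiter to a comma and -f1 selects the first field (Temp1 would be -f2):

cat boundary_layer.dat | cut -d',' -f1

the output of which is: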

#time
0.0
...
1.0
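Since Temp1 is the second comma separated field, the same approach with -f2 pulls it out. The sketch below is one possible way to do it, assuming a made-up four-line sample file (illustrative values only). It passes the file name straight to cut instead of going through cat, drops the header with tail, and uses tr -d to delete the space that follows each comma, which cut would otherwise keep.

```shell
# Build a small stand-in for boundary_layer.dat (illustrative values only).
workdir=$(mktemp -d)
cat > "$workdir/boundary_layer.dat" <<'EOF'
#time, Temp1, Temp2, Temp3
0.0, 21.1, 25.4, 110.3
0.5, 22.6, 23.0, 112.0
1.0, 24.1, 21.4, 114.3
EOF

# Field 2 is Temp1; skip the header line and strip the leading space.
temp1=$(cut -d',' -f2 "$workdir/boundary_layer.dat" | tail -n +2 | tr -d ' ')

# Several fields at once: time and Temp3 (fields 1 and 4) for the first two lines.
time_temp3=$(cut -d',' -f1,4 "$workdir/boundary_layer.dat" | head -n 2)

echo "$temp1"
```

When several fields are selected, cut joins them with the same delimiter it split on, so the result is still comma separated.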

Or, for a more complex case, let's say we had an ASCII file which uses tabs or multiple spaces to separate 20 columns of data. Imagine we wanted to extract only columns 2,3,7,8,11,15,16,18, remove the extra white space and send this data to a new ASCII file. This can be done using a combination of tr and cut. Here tr maps tabs to spaces and, with -s, squeezes each run of spaces down to a single space, so that cut can split on a single space delimiter:

cat boundary_layer_profile.x0030.dat | tr -s '\t ' '  ' \
| cut -d' ' -f2,3,7,8,11,15,16,18 > profile.x0030.txt
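To see what each stage of such a pipeline does, here is a small sketch on a made-up five-column file (the name profile.dat and its values are hypothetical). It assumes tabs as well as spaces need handling, so tr maps tabs to spaces before squeezing. Note that a line beginning with white space would gain an empty first field, which shifts the column numbering by one.

```shell
workdir=$(mktemp -d)

# Two rows whose columns are separated by an inconsistent mix of spaces and tabs.
printf '1.0   2.0\t3.0  \t 4.0 5.0\n6.0\t7.0    8.0 9.0\t\t10.0\n' > "$workdir/profile.dat"

# Map tabs to spaces, squeeze each run to one space, then keep columns 2 and 4.
tr -s '\t ' '  ' < "$workdir/profile.dat" | cut -d' ' -f2,4 > "$workdir/profile.txt"

cat "$workdir/profile.txt"
# prints:
# 2.0 4.0
# 7.0 9.0
```

Without the tr step, cut would treat every individual space as a field boundary and the column numbers would no longer line up.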

Of course these are just some examples that I have found useful from time to time. Feel free to share any similar methods you might have had to use.
