I have a txt file consisting of tab-separated data with type double. The data file is over 10 GB, so I just wish to read the data line-by-line and then do some processing. Particularly, the data is layout as an matrix with, say 1001 columns, and millions of rows. Below is just a fake sample to show the layout.
10.2 30.4 42.9 ... 3232.000 23232.45
...
...
7.234 824.23232 ... 4009.23 230.01
...
For each line I'd like to store the first 1000 values in an array, and the last value in a separate variable. I am new to C, so it would be nice if you could kindly point out major steps.
Update:
Thanks for all valuable suggestions and solutions. I just figured out one simple example where I just read a 3-by-4 matrix row by row from a txt file. For each row, the first 3 elements are stored in x, and the last element is stored in vector y. So x is a n-by-p matrix with n=p=3, y is a 1-by-3 vector.
Below is my data file and my code.
Data file:
1.112272 -0.345324 0.608056 0.641006
-0.358203 0.300349 -1.113812 -0.321359
0.155588 2.081781 0.038588 -0.562489
My code:
#include<math.h>
#include <stdlib.h>
#include<stdio.h>
#include <string.h>
#define n 3
#define p 3
void main() {
FILE *fpt;
fpt = fopen("./data_temp.txt", "r");
char line[n*(p+1)*sizeof(double)];
char *token;
double *x;
x = malloc(n*p*sizeof(double));
double y[n];
int index = 0;
int xind = 0;
int yind = 0;
while(fgets(line, sizeof(line), fpt)) {
//printf("%d\n", sizeof(line));
//printf("%s\n", line);
token = strtok(line, "\t");
while(token != NULL) {
printf("%s\n", token);
if((index+1) % (p+1) == 0) { // the last element in each line;
yind = (index + 1) / (p+1) - 1; // get index for y vector;
sscanf(token, "%lf", &(y[yind]));
} else {
sscanf(token, "%lf", &(x[xind]));
xind++;
}
//sscanf(token, "%lf", &(x[index]));
index++;
token = strtok(NULL, "\t");
}
}
int i = 0;
int j = 0;
puts("Print x matrix:");
for(i = 0; i < n*p; i++) {
printf("%f\n", x[i]);
}
printf("\n");
puts("Print y vector:");
for(j = 0; j < n; j++) {
printf("%f\t", y[j]);
}
printf("\n");
free(x);
fclose(fpt);
}
With above, hopefully things will work if I replace data_temp.txt with my raw 10 GB data file (of course change values of n,p, and some other code wherever necessary.)
I have additional questions that I wish if you could help me.
- I first initialized
char line[]aschar line[(p+1)*sizeof(double)](note not multiplyingn). But the line cannot be read completely. How could I assign memory JUST for one single line? What's the lenght? I assume it's(p+1)*sizeof(double)since there are(p+1)doubles in each line. Should I also assign memory for\tand\n? If so, how? - Does the code look reasonable to you? How could I make it more efficient since this code will be executed over millions of rows?
- If I don't know the number of columns or rows in the raw 10 GB file, how could I quickly count rows and columns?
Again I am new to C, any comments are very appreciated. Thanks a lot!
via Chebli Mohamed
Aucun commentaire:
Enregistrer un commentaire