1/*- 2 * See the file LICENSE for redistribution information. 3 * 4 * Copyright (c) 2005,2008 Oracle. All rights reserved. 5 * 6 * $Id: README,v 1.22 2008/01/08 20:58:23 bostic Exp $ 7 */ 8 9The "comma-separated value" (csv) directory is a suite of three programs: 10 11 csv_code: write "helper" code on which to build applications, 12 csv_load: import csv files into a Berkeley DB database, 13 csv_query: query databases created by csv_load. 14 15The goal is to allow programmers to easily build applications for using 16csv databases. 17 18You can build the three programs, and run a sample application in this 19directory. 20 21First, there's the sample.csv file: 22 23 Adams,Bob,01/02/03,green,apple,37 24 Carter,Denise Ann,04/05/06,blue,banana,38 25 Eidel,Frank,07/08/09,red,cherry,38 26 Grabel,Harriet,10/11/12,purple,date,40 27 Indals,Jason,01/03/05,pink,orange,32 28 Kilt,Laura,07/09/11,yellow,grape,38 29 Moreno,Nancy,02/04/06,black,strawberry,38 30 Octon,Patrick,08/10/12,magenta,kiwi,15 31 32The fields are: 33 Last name, 34 First name, 35 Birthdate, 36 Favorite color, 37 Favorite fruit, 38 Age 39 40Second, there's a "description" of that csv file in sample.desc: 41 42 version 1 { 43 LastName string 44 FirstName string 45 BirthDate 46 Color string index 47 Fruit string index 48 Age unsigned_long index 49 } 50 51The DESCRIPTION file maps one-to-one to the fields in the csv file, and 52provides a data type for any field the application wants to use. (If 53the application doesn't care about a field, don't specify a data type 54and the csv code will ignore it.) The string "index" specifies there 55should be a secondary index based on the field. 56 57The "field" names in the DESCRIPTION file don't have to be the same as 58the ones in the csv file (and, as they may not have embedded spaces, 59probably won't be). 60 61To build in the sample directory, on POSIX-like systems, type "make". 62This first builds the program csv_code, which it then run, with the file 63DESCRIPTION as an input. Running csv_code creates two additional files: 64csv_local.c and csv_local.h. Those two files are then used as part of 65the build process for two more programs: csv_load and csv_query. 66 67You can load now load the csv file into a Berkeley DB database with the 68following command: 69 70 % ./csv_load -h TESTDIR < sample.csv 71 72The csv_load command will create a directory and four databases: 73 74 primary primary database 75 Age secondary index on Age field 76 Color secondary index on Color field 77 Fruit secondary index on Fruit field 78 79You can then query the database: 80 81 % ./csv_query -h TESTDIR 82 Query: id=2 83 Record: 2: 84 LastName: Carter 85 FirstName: Denise 86 Color: blue 87 Fruit: banana 88 Age: 38 89 Query: color==green 90 Record: 1: 91 LastName: Adams 92 FirstName: Bob 93 Color: green 94 Fruit: apple 95 Age: 37 96 97and so on. 98 99The csv_code process also creates source code modules that support 100building your own applications based on this database. First, there 101is the local csv_local.h include file: 102 103 /* 104 * DO NOT EDIT: automatically built by csv_code. 105 * 106 * Record structure. 107 */ 108 typedef struct __DbRecord { 109 u_int32_t recno; /* Record number */ 110 111 /* 112 * Management fields 113 */ 114 void *raw; /* Memory returned by DB */ 115 char *record; /* Raw record */ 116 size_t record_len; /* Raw record length */ 117 118 u_int32_t field_count; /* Field count */ 119 u_int32_t version; /* Record version */ 120 121 u_int32_t *offset; /* Offset table */ 122 123 /* 124 * Indexed fields 125 */ 126 #define CSV_INDX_LASTNAME 1 127 char *LastName; 128 129 #define CSV_INDX_FIRSTNAME 2 130 char *FirstName; 131 132 #define CSV_INDX_COLOR 4 133 char *Color; 134 135 #define CSV_INDX_FRUIT 5 136 char *Fruit; 137 138 #define CSV_INDX_AGE 6 139 u_long Age; 140 } DbRecord; 141 142This defines the DbRecord structure that is the primary object for this 143csv file. As you can see, the intersting fields in the csv file have 144mappings in this structure. 145 146Also, there are routines in the Dbrecord.c file your application can use 147to handle DbRecord structures. When you retrieve a record from the 148database the DbRecord structure will be filled in based on that record. 149 150Here are the helper routines: 151 152 int 153 DbRecord_print(DbRecord *recordp, FILE *fp) 154 Display the contents of a DbRecord structure to the specified 155 output stream. 156 157 int 158 DbRecord_init(const DBT *key, DBT *data, DbRecord *recordp) 159 Fill in a DbRecord from a returned database key/data pair. 160 161 int 162 DbRecord_read(u_long key, DbRecord *recordp) 163 Read the specified record (DbRecord_init will be called 164 to fill in the DbRecord). 165 166 int 167 DbRecord_discard(DbRecord *recordp) 168 Discard the DbRecord structure (must be called after the 169 DbRecord_read function), when the application no longer 170 needs the returned DbRecord. 171 172 int 173 DbRecord_search_field_name(char *field, char *value, OPERATOR op) 174 Display the DbRecords where the field (named by field) has 175 the specified relationship to the value. For example: 176 177 DbRecord_search_field_name("Age", "35", GT) 178 179 would search for records with a "Age" field greater than 180 35. 181 182 int 183 DbRecord_search_field_number( 184 u_int32_t fieldno, char *value, OPERATOR op) 185 Display the DbRecords where the field (named by field) 186 has the specified relationship to the value. The field 187 number used as an argument comes from the csv_local.h 188 file, for example, CSV_INDX_AGE is the field index for 189 the "Age" field in this csv file. For example: 190 191 DbRecord_search_field_number(CSV_INDX_AGE, 35, GT) 192 193 would search for records with a "Age" field greater than 194 35. 195 196 Currently, the csv code only supports three types of data: 197 strings, unsigned longs and doubles. Others can easily be 198 added. 199 200The usage of the csv_code program is as follows: 201 202 usage: csv_code [-v] [-c source-file] [-f input] [-h header-file] 203 -c output C source code file 204 -h output C header file 205 -f input file 206 -v verbose (defaults to off) 207 208 -c A file to which to write the C language code. By default, 209 the file "csv_local.c" is used. 210 211 -f A file to read for a description of the fields in the 212 csv file. By default, csv_code reads from stdin. 213 214 -h A file to which to write the C language header structures. 215 By default, the file "csv_local.h" is used. 216 217 -v The -v verbose flag outputs potentially useful debugging 218 information. 219 220There are two applications built on top of the code produced by 221csv_code, csv_load and csv_query. 222 223The usage of the csv_load program is as follows: 224 225 usage: csv_load [-v] [-F format] [-f csv-file] [-h home] [-V version] 226 -F format (currently supports "excel") 227 -f input file 228 -h database environment home directory 229 -v verbose (defaults to off) 230 231 -F See "Input format" below. 232 233 -f If an input file is specified using the -f flag, the file 234 is read and the records in the file are stored into the 235 database. By default, csv_load reads from stdin. 236 237 -h If a database environment home directory is specified 238 using the -h flag, that directory is used as the 239 Berkeley DB directory. The default for -h is the 240 current working directory or the value of the DB_HOME 241 environment variable. 242 243 -V Specify a version number for the input (the default is 1). 244 245 -v The -v verbose flag outputs potentially useful debugging 246 information. It can be specified twice for additional 247 information. 248 249The usage of csv_query program is as follows: 250 251 usage: csv_query [-v] [-c cmd] [-h home] 252 253 -c A command to run, otherwise csv_query will enter 254 interactive mode and prompt for user input. 255 256 -h If a database environment home directory is specified 257 using the -h flag, that directory is used as the 258 Berkeley DB directory. The default for -h is the 259 current working directory or the value of the DB_HOME 260 environment variable. 261 262 -v The -v verbose flag outputs potentially useful debugging 263 information. It can be specified twice for additional 264 information. 265 266The query program currently supports the following commands: 267 268 ? Display help screen 269 exit Exit program 270 fields Display list of field names 271 help Display help screen 272 quit Exit program 273 version Display database format version 274 field[op]value Display fields by value (=, !=, <, <=, >, >=, ~, !~) 275 276The "field[op]value" command allows you to specify a field and a 277relationship to a value. For example, you could run the query: 278 279 csv_query -c "price < 5" 280 281to list all of the records with a "price" field less than "5". 282 283Field names and all string comparisons are case-insensitive. 284 285The operators ~ and !~ do match/no-match based on the IEEE Std 1003.2 286(POSIX.2) Basic Regular Expression standard. 287 288As a special case, every database has the field "Id", which matches the 289record number of the primary key. 290 291Input format: 292 The input to the csv_load utility is a text file, containing 293 lines of comma-separated fields. 294 295 Blank lines are ignored. All non-blank lines must be comma-separated 296 lists of fields. 297 298 By default: 299 <nul> (\000) bytes and unprintable characters are stripped, 300 input lines are <nl> (\012) separated, 301 commas cannot be escaped. 302 303 If "-F excel" is specified: 304 <nul> (\000) bytes and unprintable characters are stripped, 305 input lines are <cr> (\015) separated, 306 <nl> bytes (\012) characters are stripped from the input, 307 commas surrounded by double-quote character (") are not 308 treated as field separators. 309 310Storage format: 311 Records in the primary database are stored with a 32-bit unsigned 312 record number as the key. 313 314 Key/Data pair 0 is of the format: 315 [version] 32-bit unsigned int 316 [field count] 32-bit unsigned int 317 [raw record] byte array 318 319 For example: 320 [1] 321 [5] 322 [field1,field2,field3,field4,field5] 323 324 All other Key/Data pairs are of the format: 325 [version] 32-bit unsigned int 326 [offset to field 1] 32-bit unsigned int 327 [offset to field 2] 32-bit unsigned int 328 [offset to field 3] 32-bit unsigned int 329 ... 32-bit unsigned int 330 [offset to field N] 32-bit unsigned int 331 [offset past field N] 32-bit unsigned int 332 [raw record] byte array 333 334 For example: 335 [1] 336 [0] 337 [2] 338 [5] 339 [9] 340 [14] 341 [19] 342 [a,ab,abc,abcd,abcde] 343 012345678901234567890 << byte offsets 344 0 1 2 345 346 So, field 3 of the data can be directly accessed by using 347 the "offset to field 3", and the length of the field is 348 the "((offset to field 4) - (offset to field 3)) - 1". 349 350Limits: 351 The csv program stores the primary key in a 32-bit unsigned 352 value, limiting the number of records in the database. New 353 records are inserted after the last existing record, that is, 354 new records are not inserted into gaps left by any deleted 355 records. This will limit the total number of records stored in 356 any database. 357 358Versioning: 359 Versioning is when a database supports multiple versions of the 360 records. This is likely to be necessary when dealing with large 361 applications and databases, as record fields change over time. 362 363 The csv application suite does not currently support versions, 364 although all of the necessary hooks are there. 365 366 The way versioning will work is as follows: 367 368 The XXX.desc file needs to support multiple version layouts. 369 370 The generated C language structure defined should be a superset 371 of all of the interesting fields from all of the version 372 layouts, regardless of which versions of the csv records those 373 fields exist in. 374 375 When the csv layer is asked for a record, the record's version 376 will provide a lookup into a separate database of field lists. 377 That is, there will be another database which has key/data pairs 378 where the key is a version number, and the data is the field 379 list. At that point, it's relatively easy to map the fields 380 to the structure as is currently done, except that some of the 381 fields may not be filled in. 382 383 To determine if a field is filled in, in the structure, the 384 application has to have an out-of-band value to put in that 385 field during DbRecord initialization. If that's a problem, the 386 alternative would be to add an additional field for each listed 387 field -- if the additional field is set to 1, the listed field 388 has been filled in, otherwise it hasn't. The csv code will 389 support the notion of required fields, so in most cases the 390 application won't need to check before simply using the field, 391 it's only if a field isn't required and may be filled in that 392 the check will be necessary. 393 394TODO: 395 Csv databases are not portable between machines of different 396 byte orders. To make them portable, all of the 32-bit unsigned 397 int fields currently written into the database should be 398 converted to a standard byte order. This would include the 399 version number and field count in the column-map record, and the 400 version and field offsets in the other records. 401 402 Add Extended RE string matches. 403 404 Add APIs to replace the reading of a schema file, allow users to 405 fill in a DbRecord structure and do a put on it. (Hard problem: 406 how to flag fields that aren't filled in.) 407 408 Add a second sample file, and write the actual versioning code. 409