• Home
  • History
  • Annotate
  • only in this directory
NameDateSize

..27-May-201519

code.cH A D27-Apr-20098.8 KiB

csv.hH A D27-Apr-20092.1 KiB

csv_extern.hH A D27-Apr-20091.4 KiB

db.cH A D27-Apr-20095.3 KiB

DbRecord.cH A D27-Apr-200911 KiB

load.cH A D27-Apr-20097.6 KiB

load_main.cH A D27-Apr-20092.3 KiB

MakefileH A D27-Apr-20091.9 KiB

query.cH A D27-Apr-20094.2 KiB

query_main.cH A D27-Apr-20091.9 KiB

READMEH A D27-Apr-200912.6 KiB

sample.csvH A D27-Apr-2009304

sample.descH A D27-Apr-2009187

util.cH A D27-Apr-20095.6 KiB

README

1/*-
2 * See the file LICENSE for redistribution information.
3 *
4 * Copyright (c) 2005,2008 Oracle.  All rights reserved.
5 *
6 * $Id: README,v 1.22 2008/01/08 20:58:23 bostic Exp $
7 */
8
9The "comma-separated value" (csv) directory is a suite of three programs:
10
11	csv_code:  write "helper" code on which to build applications,
12	csv_load:  import csv files into a Berkeley DB database,
13	csv_query: query databases created by csv_load.
14
15The goal is to allow programmers to easily build applications for using
16csv databases.
17
18You can build the three programs, and run a sample application in this
19directory.
20
21First, there's the sample.csv file:
22
23	Adams,Bob,01/02/03,green,apple,37
24	Carter,Denise Ann,04/05/06,blue,banana,38
25	Eidel,Frank,07/08/09,red,cherry,38
26	Grabel,Harriet,10/11/12,purple,date,40
27	Indals,Jason,01/03/05,pink,orange,32
28	Kilt,Laura,07/09/11,yellow,grape,38
29	Moreno,Nancy,02/04/06,black,strawberry,38
30	Octon,Patrick,08/10/12,magenta,kiwi,15
31
32The fields are:
33	Last name,
34	First name,
35	Birthdate,
36	Favorite color,
37	Favorite fruit,
38	Age
39
40Second, there's a "description" of that csv file in sample.desc:
41
42	version 1 {
43		LastName	string
44		FirstName	string
45		BirthDate
46		Color		string index
47		Fruit		string index
48		Age		unsigned_long index
49	}
50
51The DESCRIPTION file maps one-to-one to the fields in the csv file, and
52provides a data type for any field the application wants to use.  (If
53the application doesn't care about a field, don't specify a data type
54and the csv code will ignore it.)  The string "index" specifies there
55should be a secondary index based on the field.
56
57The "field" names in the DESCRIPTION file don't have to be the same as
58the ones in the csv file (and, as they may not have embedded spaces,
59probably won't be).
60
61To build in the sample directory, on POSIX-like systems, type "make".
62This first builds the program csv_code, which it then run, with the file
63DESCRIPTION as an input.  Running csv_code creates two additional files:
64csv_local.c and csv_local.h.  Those two files are then used as part of
65the build process for two more programs: csv_load and csv_query.
66
67You can load now load the csv file into a Berkeley DB database with the
68following command:
69
70	% ./csv_load -h TESTDIR < sample.csv
71
72The csv_load command will create a directory and four databases:
73
74	primary		primary database
75	Age		secondary index on Age field
76	Color		secondary index on Color field
77	Fruit		secondary index on Fruit field
78
79You can then query the database:
80
81	% ./csv_query -h TESTDIR
82	Query: id=2
83	Record: 2:
84		LastName: Carter
85		FirstName: Denise
86		Color: blue
87		Fruit: banana
88		Age: 38
89	Query: color==green
90	Record: 1:
91		LastName: Adams
92		FirstName: Bob
93		Color: green
94		Fruit: apple
95		Age: 37
96
97and so on.
98
99The csv_code process also creates source code modules that support
100building your own applications based on this database.  First, there
101is the local csv_local.h include file:
102
103	/*
104	 *  DO NOT EDIT: automatically built by csv_code.
105	 *
106	 * Record structure.
107	 */
108	typedef struct __DbRecord {
109		u_int32_t	 recno;		/* Record number */
110
111		/*
112		 * Management fields
113		 */
114		void		*raw;		/* Memory returned by DB */
115		char		*record;	/* Raw record */
116		size_t		 record_len;	/* Raw record length */
117
118		u_int32_t	 field_count;	/* Field count */
119		u_int32_t	 version;	/* Record version */
120
121		u_int32_t	*offset;	/* Offset table */
122
123		/*
124		 * Indexed fields
125		 */
126	#define	CSV_INDX_LASTNAME	1
127		char		*LastName;
128
129	#define	CSV_INDX_FIRSTNAME	2
130		char		*FirstName;
131
132	#define	CSV_INDX_COLOR	4
133		char		*Color;
134
135	#define	CSV_INDX_FRUIT	5
136		char		*Fruit;
137
138	#define	CSV_INDX_AGE	6
139		u_long		 Age;
140	} DbRecord;
141
142This defines the DbRecord structure that is the primary object for this
143csv file.  As you can see, the intersting fields in the csv file have
144mappings in this structure.
145
146Also, there are routines in the Dbrecord.c file your application can use
147to handle DbRecord structures.  When you retrieve a record from the
148database the DbRecord structure will be filled in based on that record.
149
150Here are the helper routines:
151
152	int
153	DbRecord_print(DbRecord *recordp, FILE *fp)
154		Display the contents of a DbRecord structure to the specified
155		output stream.
156
157	int
158	DbRecord_init(const DBT *key, DBT *data, DbRecord *recordp)
159		Fill in a DbRecord from a returned database key/data pair.
160
161	int
162	DbRecord_read(u_long key, DbRecord *recordp)
163		Read the specified record (DbRecord_init will be called
164		to fill in the DbRecord).
165
166	int
167	DbRecord_discard(DbRecord *recordp)
168		Discard the DbRecord structure (must be called after the
169		DbRecord_read function), when the application no longer
170		needs the returned DbRecord.
171
172	int
173	DbRecord_search_field_name(char *field, char *value, OPERATOR op)
174		Display the DbRecords where the field (named by field) has
175		the specified relationship to the value.  For example:
176
177		DbRecord_search_field_name("Age", "35", GT)
178
179		would search for records with a "Age" field greater than
180		35.
181
182	int
183	DbRecord_search_field_number(
184	    u_int32_t fieldno, char *value, OPERATOR op)
185		Display the DbRecords where the field (named by field)
186		has the specified relationship to the value.  The field
187		number used as an argument comes from the csv_local.h
188		file, for example, CSV_INDX_AGE is the field index for
189		the "Age" field in this csv file.  For example:
190
191		DbRecord_search_field_number(CSV_INDX_AGE, 35, GT)
192
193		would search for records with a "Age" field greater than
194		35.
195
196	Currently, the csv code only supports three types of data:
197	strings, unsigned longs and doubles.  Others can easily be
198	added.
199
200The usage of the csv_code program is as follows:
201
202	usage: csv_code [-v] [-c source-file] [-f input] [-h header-file]
203		-c	output C source code file
204		-h	output C header file
205		-f	input file
206		-v	verbose (defaults to off)
207
208	-c      A file to which to write the C language code.  By default,
209		the file "csv_local.c" is used.
210
211	-f      A file to read for a description of the fields in the
212		csv file.  By default, csv_code reads from stdin.
213
214	-h	A file to which to write the C language header structures.
215		By default, the file "csv_local.h" is used.
216
217	-v      The -v verbose flag outputs potentially useful debugging
218		information.
219
220There are two applications built on top of the code produced by
221csv_code, csv_load and csv_query.
222
223The usage of the csv_load program is as follows:
224
225	usage: csv_load [-v] [-F format] [-f csv-file] [-h home] [-V version]
226		-F	format (currently supports "excel")
227		-f      input file
228		-h      database environment home directory
229		-v      verbose (defaults to off)
230
231	-F	See "Input format" below.
232
233	-f      If an input file is specified using the -f flag, the file
234		is read and the records in the file are stored into the
235		database.  By default, csv_load reads from stdin.
236
237	-h      If a database environment home directory is specified
238		using the -h flag, that directory is used as the
239		Berkeley DB directory.  The default for -h is the
240		current working directory or the value of the DB_HOME
241		environment variable.
242
243	-V	Specify a version number for the input (the default is 1).
244
245	-v      The -v verbose flag outputs potentially useful debugging
246		information.  It can be specified twice for additional
247		information.
248
249The usage of csv_query program is as follows:
250
251	usage: csv_query [-v] [-c cmd] [-h home]
252
253	-c      A command to run, otherwise csv_query will enter
254		interactive mode and prompt for user input.
255
256	-h      If a database environment home directory is specified
257		using the -h flag, that directory is used as the
258		Berkeley DB directory.  The default for -h is the
259		current working directory or the value of the DB_HOME
260		environment variable.
261
262	-v      The -v verbose flag outputs potentially useful debugging
263		information.  It can be specified twice for additional
264		information.
265
266The query program currently supports the following commands:
267
268	?               Display help screen
269	exit            Exit program
270	fields          Display list of field names
271	help            Display help screen
272	quit            Exit program
273	version         Display database format version
274	field[op]value  Display fields by value (=, !=, <, <=, >, >=, ~, !~)
275
276The "field[op]value" command allows you to specify a field and a
277relationship to a value.  For example, you could run the query:
278
279	csv_query -c "price < 5"
280
281to list all of the records with a "price" field less than "5".
282
283Field names and all string comparisons are case-insensitive.
284
285The operators ~ and !~ do match/no-match based on the IEEE Std 1003.2
286(POSIX.2) Basic Regular Expression standard.
287
288As a special case, every database has the field "Id", which matches the
289record number of the primary key.
290
291Input format:
292	The input to the csv_load utility is a text file, containing
293	lines of comma-separated fields.
294
295	Blank lines are ignored.  All non-blank lines must be comma-separated
296	lists of fields.
297
298	By default:
299		<nul> (\000) bytes and unprintable characters are stripped,
300		input lines are <nl> (\012) separated,
301		commas cannot be escaped.
302
303	If "-F excel" is specified:
304		<nul> (\000) bytes and unprintable characters are stripped,
305		input lines are <cr> (\015) separated,
306		<nl> bytes (\012) characters are stripped from the input,
307		commas surrounded by double-quote character (") are not
308		treated as field separators.
309
310Storage format:
311	Records in the primary database are stored with a 32-bit unsigned
312	record number as the key.
313
314	Key/Data pair 0 is of the format:
315		[version]		32-bit unsigned int
316		[field count]		32-bit unsigned int
317		[raw record]		byte array
318
319	For example:
320		[1]
321		[5]
322		[field1,field2,field3,field4,field5]
323
324	All other Key/Data pairs are of the format:
325		[version]		32-bit unsigned int
326		[offset to field 1]	32-bit unsigned int
327		[offset to field 2]	32-bit unsigned int
328		[offset to field 3]	32-bit unsigned int
329		...			32-bit unsigned int
330		[offset to field N]	32-bit unsigned int
331		[offset past field N]	32-bit unsigned int
332		[raw record]		byte array
333
334	For example:
335		[1]
336		[0]
337		[2]
338		[5]
339		[9]
340		[14]
341		[19]
342		[a,ab,abc,abcd,abcde]
343		 012345678901234567890		<< byte offsets
344		 0	   1	     2
345
346	So, field 3 of the data can be directly accessed by using
347	the "offset to field 3", and the length of the field is
348	the "((offset to field 4) - (offset to field 3)) - 1".
349
350Limits:
351	The csv program stores the primary key in a 32-bit unsigned
352	value, limiting the number of records in the database.  New
353	records are inserted after the last existing record, that is,
354	new records are not inserted into gaps left by any deleted
355	records.  This will limit the total number of records stored in
356	any database.
357
358Versioning:
359	Versioning is when a database supports multiple versions of the
360	records.  This is likely to be necessary when dealing with large
361	applications and databases, as record fields change over time.
362
363	The csv application suite does not currently support versions,
364	although all of the necessary hooks are there.
365
366	The way versioning will work is as follows:
367
368	The XXX.desc file needs to support multiple version layouts.
369
370	The generated C language structure defined should be a superset
371	of all of the interesting fields from all of the version
372	layouts, regardless of which versions of the csv records those
373	fields exist in.
374
375	When the csv layer is asked for a record, the record's version
376	will provide a lookup into a separate database of field lists.
377	That is, there will be another database which has key/data pairs
378	where the key is a version number, and the data is the field
379	list.  At that point, it's relatively easy to map the fields
380	to the structure as is currently done, except that some of the
381	fields may not be filled in.
382
383	To determine if a field is filled in, in the structure, the
384	application has to have an out-of-band value to put in that
385	field during DbRecord initialization.  If that's a problem, the
386	alternative would be to add an additional field for each listed
387	field -- if the additional field is set to 1, the listed field
388	has been filled in, otherwise it hasn't.  The csv code will
389	support the notion of required fields, so in most cases the
390	application won't need to check before simply using the field,
391	it's only if a field isn't required and may be filled in that
392	the check will be necessary.
393
394TODO:
395	Csv databases are not portable between machines of different
396	byte orders.  To make them portable, all of the 32-bit unsigned
397	int fields currently written into the database should be
398	converted to a standard byte order.  This would include the
399	version number and field count in the column-map record, and the
400	version and field offsets in the other records.
401
402	Add Extended RE string matches.
403
404	Add APIs to replace the reading of a schema file, allow users to
405	fill in a DbRecord structure and do a put on it.  (Hard problem:
406	how to flag fields that aren't filled in.)
407
408	Add a second sample file, and write the actual versioning code.
409