Convert CSV to TXT
Identifying the problem rows
You can look at the data with a text editor or use powershell. We load all the raw text into $ge1 - we plan to fix the data stage by stage.
$ge = get-content "$pwd\ge2015.csv" $ge[104]
Which shows
S14000006,S92000003,"Ayr, Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
(The error was reported on 105 but because we skipped row 1 the actual first problem is on 104) Observe that the constituency name has a comma in it - BULK INSERT is interpreting this as a field delimiter.
Fixing the problem rows
The best way to fix this is to use someone else's code to deal with CSV - but if we are careful you can do this yourself with regular expressions. Later versions of SQL Server allow text delimiters to be defined, python has a csv module.
You can use replace - we can replace the , inside a double quote using a regular expression:
$ge[104] -replace ',"(.*),(.*)"', ',"$1|$2"'
shows this
S14000006,S92000003,"Ayr| Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
You can apply this to the whole file using
$ge = $ge -replace ',"(.[^"]),(.[^"])"', ',"$1|$2"'
Then write the file out to ge2.csv
$ge > "$pwd\ge2.csv"
You can try loading this into the table ge again - but take care to delete any rows you created last time.
sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge2.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)"
More problems
You will probably find further problems:
- there is a constituency with two commas in it
- your data still includes double quote marks that should be replaced
- you have | marks in your data that should be replaced with commas
You can deal with these problems either in powershell, a capable text editor, excel or in SQL
Here is a solution in powershell:
$ge1 = get-content "$pwd\ge2015.csv" $ge1 = $ge1 -replace ',"([^"]*),([^"]*)"', ',"$1|$2"' $ge1 = $ge1 -replace ',"([^"]*),([^"]*)"', ',"$1|$2"' $ge1 = $ge1 -replace '"', "" $ge1 = $ge1 -replace ",", "`t" $ge1 > "$pwd\ge2.csv"