Extract HTML tags out of a csv file

atelier1 · March 20, 2013, 10:54am

I know there are some genius here present that can do miracles… And I do hope for a hint in the right direction for a little problem I have.

I migrate data from one database to another. Thanks to some tools that EE provide I can do that with XML files, if well translated from CSV files.

I do have access to the MySQL server, and I can export in several formats.

Now I extracted all the tables from the old server into a CSV file. Problem is only that the old website ( not made by me) used TinyMc so certain fields are cluttered with HTML tags. And it is a mess in there.

How can I extract the HTML without disturbing the structure of the CSV? I tried several things my way with TextWrangler, but after that did not work out well.

BTW translating a CSV to XML is not easy as well, I discovered. A lot of trial and error…

Thanks for looking at this.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

waltd · March 20, 2013, 2:29pm

Do you have Sequel Pro? You can use that (on your Mac) to export the database as XML or CSV, assuming you can get it to connect to that database server. If all you have is the CSV, you could try importing the CSV into a new MySQL database on your Mac, and then export that as XML, again in the same application.

As far as converting the CSV in a text editor, you’re going to need to escape the HTML bits as CDATA blocks, not sure what the exact syntax is for that, but that’s where I would start hunting.

Walter

On Mar 20, 2013, at 6:54 AM, atelier wrote:

I know there are some genius here present that can do miracles… And I do hope for a hint in the right direction for a little problem I have.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 20, 2013, 3:07pm

Thank you. I’d never heard of Sequel Pro, but it sound to me as the right tool. I have MAMP installed, so I can test this locally in all safety first.

I do not understand the part with CDATA blocks, or how to handle that. http://www.w3schools.com/xml/xml_cdata.asp I am not at my Mac right now, shall post some of the HTML mark-up later, to show what I want to get rid of in some of the fields.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

waltd · March 20, 2013, 3:15pm

Here’s a very concise description: http://www.quackit.com/xml/tutorial/xml_cdata.cfm

Walter

On Mar 20, 2013, at 11:07 AM, atelier wrote:

Thank you. I’d never heard of Sequel Pro, but it sound to me as the right tool. I have MAMP installed, so I can test this locally in all safety first.

I do not understand the part with CDATA blocks, or how to handle that. http://www.w3schools.com/xml/xml_cdata.asp I am not at my Mac right now, shall post some of the HTML mark-up later, to show what I want to get rid of in some of the fields.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 20, 2013, 3:49pm

OK, I understand the concept, thanks.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 20, 2013, 8:07pm

Problem is that the ‘blocks’ are all different.
I see in a glance in some fields

<p align=""left""><span style=""font-size: 12pt;""><span style=""font-family: Arial,Helvetica,Sans-Serif; font-size: small;"">

and

<p align=""left""><span style=""font-size: 12pt;""><span style=""font-family: Arial,Helvetica,Sans-Serif; font-size: small;"">

<h2 style=""page-break-after: avoid;""><span style=""font-size: 10pt;"">

Sigh, what am I doing…

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 20, 2013, 8:10pm

Actually everything between < and > has to go. In the whole file.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

waltd · March 20, 2013, 9:06pm

Are you trying to strip all tags out, or are you trying to encode the data so you can upload it to the new server? If you strip the tags, you will be left with nothing but plain text (this may be your goal). Can you elaborate on your desired end-state?

Here’s how to strip all tags from a text file (note – this will strip XML and HTML tags, so do this on your CSV version).

In BBEdit or TextWrangler, open the text file and open the Search dialog. Check the “Grep” option on.

In the Search field, enter just this:

<[^>]+?>

In the Replace field, make sure there is nothing at all entered.

Click Replace All and marvel at the power of regular expressions.

Walter

On Mar 20, 2013, at 4:10 PM, atelier wrote:

Actually everything between < and > has to go. In the whole file.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 20, 2013, 9:37pm

Walter, sir, you are indeed a star. My goal is to strip all HTML entities. The formatting takes place at the frontend.

Thinking loud, one small problem remains, the double quotes which have to go as well. I guess I add them in the regulair expression search and replace?

I am afraid that I loose the structure of the CSV, not being able to transform it into XML ( my goal)

I must fiddle a bit with this, think of a stepped solution.

Thanks for the advise!

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

waltd · March 21, 2013, 12:15am

Be careful that you don’t remove those quotes too aggressively. Remember, they may be needed to preserve the CSV structure (if the content contains commas, it cannot be delimited by them unless fenced off by quotes).

I recommend that you fool around with Sequel Pro a bit. Make a dummy database table and put some varied content in it, then export as CSV. See what happens to it – some fields get quoted, others do not.

Walter

On Mar 20, 2013, at 5:37 PM, atelier wrote:

Walter, sir, you are indeed a star. My goal is to strip all HTML entities. The formatting takes place at the frontend.

Thinking loud, one small problem remains, the double quotes which have to go as well. I guess I add them in the regulair expression search and replace?

I am afraid that I loose the structure of the CSV, not being able to transform it into XML ( my goal)

I must fiddle a bit with this, think of a stepped solution.

Thanks for the advise!

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 21, 2013, 7:10am

Good that you reminded me of Sequel Pro, to test the export/ import.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 21, 2013, 8:51am

Walter, it is amazing.
After having set up Sequel Pro and my Mamp, I imported the 2 sql dumps. No problems, two tables present.
Then I just (…) opened the sql dump right into TextWrangler, did the GREP thing, saved and imported again.
The formatting gone.

Flabbergasted. Though I did not do the big test, setting up a local EE build to see if the migration had succeeded, I do have hope I made a big first step!

Thanks a lot.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

david.ledger · March 21, 2013, 9:14am

At 17:37 -0400 20/3/13, atelier wrote:

Walter, sir, you are indeed a star. My goal is to strip all HTML
entities. The formatting takes place at the frontend.

Thinking loud, one small problem remains, the double quotes which
have to go as well. I guess I add them in the regulair expression
search and replace?

I am afraid that I loose the structure of the CSV, not being able to
transform it into XML ( my goal)

Remember that csv isn’t really a valid format. It’s not difficult to
make data that can’t be saved in it such that it can be reliably
extracted again. A data ‘cell’ that contains text containing double
quotes and commas in an unfurtunate arrangement can blow it.

David

–
David Ledger - Freelance Unix Sysadmin in the UK.
email@hidden
www.ivdcs.co.uk

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 21, 2013, 11:56am

I think I managed deleting all of the HTML tags by working directly in the sql file. And I could import this back into the database. I’ll test it out, then see how the export as XML works.

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

Todd1 · March 21, 2013, 2:55pm

atelier,

I think you mentioned recently that you use Coda (version 2?), if so, then Coda’s built-in MySQL tool is a stripped-down version of Sequel Pro. I guess Panic struck a deal with Sequel Pro devs.

Todd
http://xiiro.com

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

atelier1 · March 21, 2013, 5:11pm

Hi Todd

Yes, I mentioned Coda, as part of my plans to ‘change horses’. I am still in that process, so Coda has to be evaluated yet…Good you mentioned their MySQL tool, thanks!

offtopic mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options